Methods to enhance the validity of
precision guidelines emerging from big data
Chirag J Patel

Lorenzini Foundation; Venice, Italy

06/16/16
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
Data streams in public health are getting large!
Capacity to measure and compute becoming high-
throughput and cheaper.
N=500,000
1M genetic variants
1000s of phenotypes
Data streams in public health are getting large!
…and alternative data sets (e.g., EMR) are
omnipresent!
image: Stan Shaw (MGH)
And new concepts for discovery:

The exposome, an analog of the genome!
what to measure? how to measure?
[Figure: the exposome.
External environment: radiation, diet, pollution, infections, drugs, life-style, stress.
Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora.
Classes of exposome chemicals: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.]
Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.
how to analyze in health?
Wild, 2005

Rappaport and Smith, 2010

Buck-Louis and Sundaram 2012

Miller and Jones, 2014

Patel CJ and Ioannidis JPA, 2014
Can we use these big data sources for discovery?
Or, for building guidelines?
Many challenges exist in the use of big data for
discovery, guideline development, and causal
research
Big data meets public health
Human well-being could benefit from large-scale data if large-scale noise is minimized
By Muin J. Khoury and John P. A. Ioannidis
Science, 2014

In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. “Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours” (1). That is the potential impact of “Big Data” on the public’s health. But the promise of Big Data is also accompanied by claims that “the scientific method itself is becoming obsolete” (2), as next-generation computers, such as IBM’s Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being.

The term “Big Data” refers to volumes of large, complex, linkable information (4). […]

For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data’s strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step.

Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow “did the experiment.” He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention.

How can we improve the potential for Big Data to improve health and prevent disease? One priority is that a stronger epidemiological foundation is needed. Big Data analysis is currently largely based on convenient samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., adminis[…]
Many challenges exist in the use of big data for
discovery, guideline development, and causal research
Thousands of hypotheses are possible.

Multiplicity of hypotheses.
Big data are observational.

Multiplicity of biases:
confounding, selection bias, and reverse causality
Millions of analytic scenarios are possible.

Multiplicity of analytic methods.
Big data offers a multiplicity of possible hypotheses!

A few examples from cohort studies
JECH, 2014
Big Data offers a multiplicity of possible hypotheses!
Example: cohort database of E exposures and P phenotypes
Hum Genet 2012
JECH 2014
Curr Env Health Rep 2016
[Figure: grid of E exposure factors (columns 1, 2, 3, …, e) by P phenotypic factors (rows 1, 2, 3, …, p); each cell is one candidate exposure-phenotype association]
which ones to test?

all?

the ones in blue?

E times P possibilities!

how to detect signal from noise?
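The size of the E times P search space can be made explicit before any test is run. The following is a minimal sketch, assuming hypothetical counts of exposures and phenotypes and a conventional family-wise alpha of 0.05 (none of these numbers come from the slides). It shows how quickly the per-test threshold shrinks under a Bonferroni correction.

```python
def bonferroni_threshold(n_exposures: int, n_phenotypes: int, alpha: float = 0.05):
    """Return (number of tests, per-test p-value threshold) for an E x P scan."""
    n_tests = n_exposures * n_phenotypes
    return n_tests, alpha / n_tests


# Hypothetical database size, chosen only for illustration.
n_tests, threshold = bonferroni_threshold(n_exposures=500, n_phenotypes=100)
print(f"{n_tests} tests, per-test threshold {threshold:.1e}")
```

Bonferroni is only one way to control error across many tests; it is conservative when the exposures are correlated with one another.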
Big Data offers a multiplicity of possible hypotheses!

… that depends on the domain (type of measure)!
JECH, 2014
National Health and Nutrition Examination
Survey (NHANES)
Big Data = Big Bias:

Confounding, reverse causality, and what
causes what
Interdependencies of the variables:
Correlation globes paint a complex view of exposure and
behavior
Red: positive ρ

Blue: negative ρ

thickness: |ρ|
for each pair of E:

Spearman ρ

(575 factors: 81,937 correlations)
permuted data to produce

“null ρ”

sought replication in > 1
cohort
JAMA 2014

Pac Symp Biocomput. 2015

JECH. 2015
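The permutation step above can be sketched as follows. This is an illustrative pure-Python version, not the code used in the cited papers; the function names, the toy data, and the number of permutations are assumptions. It computes Spearman ρ for one pair of exposures, shuffles one of them to build a "null ρ" distribution, and reports how often the null is at least as extreme as the observed value.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_test(x, y, n_perm=1000, seed=0):
    """Return (observed rho, null rhos, two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    y_perm = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any real link between x and y
        null.append(spearman(x, y_perm))
    exceed = sum(abs(r) >= abs(observed) for r in null)
    return observed, null, (exceed + 1) / (n_perm + 1)


# Toy data: exposure_b depends on exposure_a plus noise (hypothetical values).
rng = random.Random(1)
exposure_a = [rng.gauss(0, 1) for _ in range(200)]
exposure_b = [a + rng.gauss(0, 1) for a in exposure_a]
rho, null_rhos, p_value = permutation_test(exposure_a, exposure_b)
print(round(rho, 2), p_value)
```

In a full scan the same procedure would be repeated for every pair of factors, and the surviving correlations checked in a second cohort.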
National Health and Nutrition Examination
Survey (NHANES)
How to enhance the validity of precision guidelines
emerging from big data?
1.) Test systematically, address multiplicity, and
replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy
Test systematically and replicate.
Examples: “environment-wide” or “nutrient-wide”
association studies
A search engine for robust, reproducible genotype-
phenotype associations…
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10−5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10−4).
Voight et al, Nature Genetics 2012

N=8K T2D, 39K Controls
GWAS in Type 2 Diabetes
A prime example of systematic associations:
Genome-wide association studies (GWASs)
The same can be achieved with non-genetic factors.

Example: Exposures and behaviors in mortality.
Searching for 246 exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem. 2013
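The two-stage design above (false discovery rate below 5% in the first cohort, then p < 0.05 in the second) can be sketched as follows. This is a minimal illustration using the Benjamini-Hochberg procedure; the slides do not say which FDR method was used, so that choice is an assumption. The factor names and p-values are hypothetical. A real replication check would also require the effect to point in the same direction in both cohorts, which is omitted here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def discover_and_replicate(discovery_p, replication_p, fdr=0.05, alpha=0.05):
    """Keep factors with q < fdr in discovery and p < alpha in replication."""
    names = list(discovery_p)
    q_values = benjamini_hochberg([discovery_p[n] for n in names])
    discovered = [n for n, q in zip(names, q_values) if q < fdr]
    return [n for n in discovered if replication_p.get(n, 1.0) < alpha]


# Hypothetical p-values for four factors in each cohort.
discovery = {"factor_a": 0.0001, "factor_b": 0.004, "factor_c": 0.03, "factor_d": 0.6}
replication = {"factor_a": 0.01, "factor_b": 0.20, "factor_c": 0.04, "factor_d": 0.5}
replicated = discover_and_replicate(discovery, replication)
print(replicated)
```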
[Figure: volcano plot of adjusted hazard ratio (x-axis) versus -log10(p-value) (y-axis).
Numbered replicated factors: 1 physical activity; 2 does anyone smoke in home?; 3 cadmium; 4 cadmium, urine; 5 past smoker; 6 current smoker; 7 trans-lycopene.
Numbered sociodemographic factors: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic.]
Searching >250 environmental and behavioral factors in
all-cause mortality
FDR < 5%
sociodemographics
replicated factor
Int J Epidem. 2013
[Figure: volcano plot of adjusted hazard ratio versus -log10(p-value), with points annotated: age (10 years); income (quintiles 1, 2, and 3); male; black; anyone smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*]
*derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%
Searching >250 environmental and behavioral factors in
all-cause mortality
Searching 82 dietary factors in blood pressure:
INTERMAP and NHANES
Tzoulaki et al, A Nutrient-Wide Association Study
Circulation. 2012
association size
FDR < 5%
R2 ~ 7%
Testing all associations systematically:

Consideration of multiplicity of hypotheses and correlational web!
Explicit in number of hypotheses
tested
False discovery rate; 

family-wise error rate;

Report database size!
Does my correlation matter?
How does my new correlation
compare to the family of correlations?
0.17 (e.g., carotene and diabetes)

is average ρ much less than 0.17? greater?
ρ
JAMA 2014
JECH 2015
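One way to answer "does my correlation matter?" is to place the new correlation within the distribution of all correlations in the same database. The following is a minimal sketch; the family of ρ values is hypothetical, and only the 0.17 example comes from the slide.

```python
def fraction_at_or_below(observed_rho, family_rhos):
    """Fraction of the family whose |rho| is at or below |observed_rho|."""
    magnitude = abs(observed_rho)
    at_or_below = sum(abs(r) <= magnitude for r in family_rhos)
    return at_or_below / len(family_rhos)


# Hypothetical family of correlations measured in the same database.
family = [0.02, -0.05, 0.08, 0.11, -0.15, 0.21, 0.30, -0.01, 0.04, 0.09]
position = fraction_at_or_below(0.17, family)
print(position)
```

A correlation that sits near the middle of its family is less remarkable than its p-value alone might suggest.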
Consideration of a multitude of modeling scenarios.
Example: Vibration of Effects, the empirical
distribution of effect sizes due to model choice
“With ten covariates, there are over 1000 models.”
Figure 3. The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value. © ktsdesign – Fotolia
A maze of associations is one way to a fragmented
literature and Vibration of Effects
Young, 2011
univariate
sex
sex & age
sex & race
sex & race & age
JCE, 2015
Distribution of associations and p-values due to model choice:
Estimating the Vibration of Effects (or Risk)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
2^13 + 1 = 8,193 models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N≧1000 (≧100 deaths)
effect sizes
p-values
[Figure: vibration-of-effects plot for Vitamin D (1 SD of log), showing -log10(p-value) for each model, the median p-value/HR for each number k of adjusting variables (k = 0 to 11 shown), and 1st, 50th, and 99th percentile indicators. RHR = 1.14; RP = 4.68]
JCE, 2015
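The all-subsets procedure can be sketched as follows. This is an illustration under stated assumptions, not the published implementation. `fit` is a caller-supplied function standing in for a Cox regression, and `toy_fit` and the covariate names are hypothetical. RHR is taken to be the ratio of the 99th to the 1st percentile hazard ratio, and RP the difference between the 99th and 1st percentile of -log10(p-value); the slides do not define either quantity, so treat both definitions as assumptions. Enumerating subsets of 13 adjusters yields 2^13 = 8,192 models; the one additional model counted on the slide is not specified in this transcript and is not reproduced here.

```python
import math
from itertools import combinations


def all_adjustment_sets(adjusters):
    """Yield every subset of the adjusting variables, from empty to full."""
    for k in range(len(adjusters) + 1):
        for subset in combinations(adjusters, k):
            yield subset


def percentile(values, pct):
    """Nearest-rank percentile of a list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def vibration_of_effects(fit, adjusters):
    """Summarize how the hazard ratio and p-value move across model choices.

    `fit(subset)` must return (hazard_ratio, p_value) for one adjustment set.
    """
    results = [fit(subset) for subset in all_adjustment_sets(adjusters)]
    hazard_ratios = [hr for hr, _ in results]
    neg_log_p = [-math.log10(p) for _, p in results]
    return {
        "n_models": len(results),
        "RHR": percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1),
        "RP": percentile(neg_log_p, 99) - percentile(neg_log_p, 1),
    }


def toy_fit(subset):
    """Hypothetical stand-in for a Cox model: each adjuster attenuates the effect."""
    hazard_ratio = 0.70 + 0.01 * len(subset)
    p_value = 10 ** -(6 - 0.3 * len(subset))
    return hazard_ratio, p_value


adjusters = [f"covariate_{i}" for i in range(13)]  # hypothetical names
summary = vibration_of_effects(toy_fit, adjusters)
print(summary)
```

In practice `fit` would wrap a survival model fitted to the cohort data, with the variable of interest always included and only the adjustment set changing.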
[Figure: vibration-of-effects plots of hazard ratio (x-axis) versus -log10(p-value) (y-axis) across models with 0 to 13 adjusting variables, with 1st, 50th, and 99th percentile indicators.
Vitamin D (1 SD of log): RHR = 1.14; RP = 4.68.
Thyroxine (1 SD of log): RHR = 1.15; RP = 2.90.]
The Vibration of Effects:
Vitamin D and Thyroxine and attenuated risk in mortality
JCE, 2015
[Figure: vibration-of-effects plots of hazard ratio (x-axis) versus -log10(p-value) (y-axis) for Cadmium (1 SD of log), including a panel labeled adjustment=current_past_smoking. RHR = 1.29; RP = 8.29]
The Vibration of Effects: shifts in the effect size distribution
due to select adjustments (e.g., adjusting cadmium levels for
smoking status)
JCE, 2015
[Figure: grid of vibration-of-effects panels, each plotting hazard ratio (x-axis) versus -log10(p-value) (y-axis) for 1 SD of the log-transformed factor. Panel titles with their RHR and RP values:
Vitamin E as alpha-tocopherol: RHR = 1.15; RP = 3.17
Beta-carotene: RHR = 1.15; RP = 2.34
Caffeine: RHR = 1.10; RP = 1.99
Calcium: RHR = 1.13; RP = 1.15
Carbohydrate: RHR = 1.12; RP = 1.57
Carotene: RHR = 1.14; RP = 1.53
Cholesterol: RHR = 1.08; RP = 0.64
Copper: RHR = 1.17; RP = 2.86
Beta-cryptoxanthin: RHR = 1.15; RP = 1.39
Folic acid: RHR = 1.09; RP = 0.41
Folate, DFE: RHR = 1.14; RP = 2.39
Food folate: RHR = 1.14; RP = 4.64
Dietary fiber: RHR = 1.15; RP = 2.79
Total Folate: RHR = 1.15; RP = 2.11
Iron: RHR = 1.12; RP = 1.91
Potassium: RHR = 1.14; RP = 2.28
Protein: RHR = 1.11; RP = 1.42
Retinol: RHR = 1.13; RP = 0.67
SFA 4:0: RHR = 1.11; RP = 1.29
SFA 6:0: RHR = 1.11; RP = 1.71
SFA 8:0: RHR = 1.10; RP = 2.55
SFA 10:0: RHR = 1.11; RP = 1.87
SFA 12:0: RHR = 1.08; RP = 1.79
SFA 14:0: RHR = 1.11; RP = 1.61
SFA 16:0: RHR = 1.11; RP = 0.84
SFA 18:0: RHR = 1.10; RP = 0.73
Selenium: RHR = 1.09; RP = 1.24
Total saturated fatty acids: RHR = 1.11; RP = 0.93
Sodium: RHR = 1.12; RP = 3.74
Total sugars: RHR = 1.13; RP = 2.51
Total fat: RHR = 1.11; RP = 0.54
Theobromine: RHR = 1.08; RP = 1.19
Vitamin A: RHR = 1.13; RP = 1.09
Vitamin A, RAE: RHR = 1.16; RP = 1.31
Retinol: RHR = 1.15; RP = 0.74
Panels called out on the slide: β-carotene, caffeine, cholesterol, food folate, sodium, sugars, SFA 6:0, SFA 8:0, SFA 10:0.]
JCE, 2015
Janus (two-faced) risk profile
Risk and significance depend on the modeling scenario!
The Vibration of Effects: beware of the Janus effect

(both risk and protection?!)
“risk” “protection”
“significant”
Britannica.com
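A Janus profile can be flagged mechanically once hazard ratios are available for every modeling scenario. The following is a minimal sketch, assuming the criterion is that the 1st percentile hazard ratio falls below 1 while the 99th percentile falls above 1; the slides do not state the exact rule, and the example values are hypothetical.

```python
import math


def has_janus_effect(hazard_ratios, low_pct=1, high_pct=99):
    """True when model choice alone can flip the estimate from protection to risk."""
    ordered = sorted(hazard_ratios)
    n = len(ordered)
    # Nearest-rank percentiles of the hazard ratio distribution.
    low = ordered[max(0, math.ceil(low_pct / 100 * n) - 1)]
    high = ordered[max(0, math.ceil(high_pct / 100 * n) - 1)]
    return low < 1.0 < high


# Hypothetical hazard ratios from four modeling scenarios each.
print(has_janus_effect([0.90, 0.95, 1.00, 1.05]))  # straddles 1
print(has_janus_effect([0.64, 0.68, 0.72, 0.76]))  # consistently below 1
```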
Need to document analysis approach(es).
Accessible analytics tools and computer
infrastructure exist to enable reproducible research
“Ability to recompute data analytic
results given an observed dataset and
knowledge of the pipeline…”
Leek and Peng, PNAS 2015
(1) Raw data available

(2) Analytics code and documentation are available

(3) Correct analysis methodology

(4) Trained data analysts to execute research
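The first two requirements can be made checkable by recording, next to each result, exactly which data and settings produced it. The following is a minimal sketch; the `run_analysis` step, the manifest fields, and the data values are all hypothetical and only illustrate the idea of recomputing a result from a known dataset and pipeline.

```python
import hashlib
import json
import random
import sys


def run_analysis(rows, seed):
    """Hypothetical analysis step: the mean of a seeded bootstrap resample."""
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]
    return sum(sample) / len(sample)


def build_manifest(rows, seed):
    """Record what is needed to recompute and verify the result later."""
    raw = json.dumps(rows).encode("utf-8")
    return {
        "data_sha256": hashlib.sha256(raw).hexdigest(),  # fingerprint of the raw data
        "seed": seed,                                    # fixes the random resampling
        "python": sys.version.split()[0],                # interpreter version used
        "result": run_analysis(rows, seed),
    }


rows = [1.2, 0.7, 3.4, 2.2, 1.9]  # hypothetical measurements
first_run = build_manifest(rows, seed=42)
second_run = build_manifest(rows, seed=42)
print(first_run == second_run)
```

Depositing a manifest like this alongside the code in a repository lets a second analyst confirm that the same data and the same pipeline reproduce the same number.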
http://github.com

repository to deposit and control code
“Markdown” files to document analytic process
code
output
annotation
annotation
http://chiragjpgroup.org/exposome-analytics-course
In conclusion: Big data promises a multitude of ways
to discover precision guidelines
Thousands of hypotheses are possible.

Multiplicity of hypotheses.
Big Data are observational.

Multiplicity of biases: confounding, selection bias, and reverse causality
Millions of analytic scenarios are possible.

Multiplicity of analytic methods.
To enhance the validity of big data results, we must:
1.) Test systematically, address the multiplicity of hypothesis tests,
and replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy.
Harvard DBMI
Isaac Kohane

Susanne Churchill

Stan Shaw

Jenn Grandfield

Sunny Alvear

Michal Preminger

Harvard Chan
Hugues Aschard

Francesca Dominici

Chirag J Patel

chirag@hms.harvard.edu

@chiragjp

www.chiragjpgroup.org
NIH Common Fund

Big Data to Knowledge
Acknowledgements
Stanford
John Ioannidis

Atul Butte (UCSF)

U Queensland
Jian Yang

Peter Visscher

Cochrane
Belinda Burford
RagGroup
Chirag Lakhani
Adam Brown
Danielle Rasooly

Arjun Manrai

Erik Corona

Nam Pho
Dennis Bier
Emanuela Folco
Elena Colombo
Lorenzini Foundation

More Related Content

What's hot

Informatics and data analytics to support for exposome-based discovery
Informatics and data analytics to support for exposome-based discoveryInformatics and data analytics to support for exposome-based discovery
Informatics and data analytics to support for exposome-based discoveryChirag Patel
 
Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616Big data and the exposome, Oregon State 040616
Big data and the exposome, Oregon State 040616Chirag Patel
 
Studying the elusive in larger scale
Studying the elusive in larger scaleStudying the elusive in larger scale
Studying the elusive in larger scaleChirag Patel
 
EWAS and the exposome: Mt Sinai in Brescia 052119
EWAS and the exposome: Mt Sinai in Brescia 052119EWAS and the exposome: Mt Sinai in Brescia 052119
EWAS and the exposome: Mt Sinai in Brescia 052119Chirag Patel
 
AACR 041616 digital exposomes
AACR 041616 digital exposomesAACR 041616 digital exposomes
AACR 041616 digital exposomesChirag Patel
 
NCI systems epidemiology 03012019
NCI systems epidemiology 03012019NCI systems epidemiology 03012019
NCI systems epidemiology 03012019Chirag Patel
 
Building a search engine for exposures in disease
Building a search engine for exposures in disease Building a search engine for exposures in disease
Building a search engine for exposures in disease Chirag Patel
 
NSF Northeast Hub Big Data Workshop
NSF Northeast Hub Big Data WorkshopNSF Northeast Hub Big Data Workshop
NSF Northeast Hub Big Data WorkshopChirag Patel
 
Mel Reichman on Pool Shark’s Cues for More Efficient Drug Discovery
Mel Reichman on Pool Shark’s Cues for More Efficient Drug DiscoveryMel Reichman on Pool Shark’s Cues for More Efficient Drug Discovery
Mel Reichman on Pool Shark’s Cues for More Efficient Drug DiscoveryJean-Claude Bradley
 
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020dkNET
 
Soergel oa week-2014-lightning
Soergel oa week-2014-lightningSoergel oa week-2014-lightning
Soergel oa week-2014-lightningDavid Soergel
 
Japanese Environmental Children's Study and Data-driven E
Japanese Environmental Children's Study and Data-driven EJapanese Environmental Children's Study and Data-driven E
Japanese Environmental Children's Study and Data-driven EChirag Patel
 
human_mutation_article
human_mutation_articlehuman_mutation_article
human_mutation_articleNeha Gupta
 
Montgomery expression
Montgomery expressionMontgomery expression
Montgomery expressionmorenorossi
 
Колкер Е. An introduction to MOPED: Multi-Omics Profiling Expression Database
Колкер Е. An introduction to MOPED: Multi-Omics Profiling Expression DatabaseКолкер Е. An introduction to MOPED: Multi-Omics Profiling Expression Database
Колкер Е. An introduction to MOPED: Multi-Omics Profiling Expression Databasebigdatabm
 
GMI proficiency testing- Progress report 2016
GMI proficiency testing- Progress report 2016GMI proficiency testing- Progress report 2016
GMI proficiency testing- Progress report 2016ExternalEvents
 
NetBioSIG2014-Talk by David Amar
NetBioSIG2014-Talk by David AmarNetBioSIG2014-Talk by David Amar
NetBioSIG2014-Talk by David AmarAlexander Pico
 
Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)Ransbotyn et al PUBLISHED (1)
Ransbotyn et al PUBLISHED (1)Tania Acuna
 
Making the most of phenotypes in ontology-based biomedical knowledge discovery
Making the most of phenotypes in ontology-based biomedical knowledge discoveryMaking the most of phenotypes in ontology-based biomedical knowledge discovery
Making the most of phenotypes in ontology-based biomedical knowledge discoveryMichel Dumontier
 
Identification of PFOA linked metabolic diseases by crossing databases
Identification of PFOA linked metabolic diseases by crossing databasesIdentification of PFOA linked metabolic diseases by crossing databases
Identification of PFOA linked metabolic diseases by crossing databasesYoann Pageaud
 

Methods to enhance the validity of precision guidelines emerging from big data

  • 1. Methods to enhance the validity of precision guidelines emerging from big data Chirag J Patel Lorenzini Foundation; Venice, Italy 06/16/16 chirag@hms.harvard.edu @chiragjp www.chiragjpgroup.org
  • 2. Data streams in public health are getting large! Capacity to measure and compute becoming high-throughput and cheaper.
  • 3. Data streams in public health are getting large! Capacity to measure and compute becoming high-throughput and cheaper. N=500,000; 1M genetic variants; 1000s of phenotypes
  • 4. Data streams in public health are getting large! …and alternative data sets (e.g., EMR) are omnipresent! image: Stan Shaw (MGH)
  • 5. And new concepts for discovery: The exposome, an analog of the genome! What to measure? How to measure? How to analyze in health? [Figure: “Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.” External environment: radiation, diet, pollution, infections, drugs, life-style, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Chemical classes: reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins.] Wild, 2005; Rappaport and Smith, 2010; Buck-Louis and Sundaram, 2012; Miller and Jones, 2014; Patel CJ and Ioannidis JPA, 2014
  • 6. Can we use these big data sources for discovery? Or, for building guidelines?
  • 7. Many challenges exist in the use of big data for discovery, guideline development, and causal research. [Article excerpt shown on slide: Muin J. Khoury and John P. A. Ioannidis, “Big data meets public health: Human well-being could benefit from large-scale data if large-scale noise is minimized,” Science, 2014.] “In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. ‘Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours’ (1). That is the potential impact of ‘Big Data’ on the public’s health. But the promise of Big Data is also accompanied by claims that ‘the scientific method itself is becoming obsolete’ (2), as next-generation computers, such as IBM’s Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being. The term ‘Big Data’ refers to volumes of large, complex, linkable information (4). […] For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data’s strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step. Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow ‘did the experiment.’ He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention. How can we improve the potential for Big Data to improve health and prevent disease? One priority is that a stronger epidemiological foundation is needed. Big Data analysis is currently largely based on convenient samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., adminis- […]”
  • 8. Many challenges exist in the use of big data for discovery, guideline development, and causal research. Thousands of hypotheses are possible: multiplicity of hypotheses. Big data are observational: multiplicity of biases (confounding, selection, reverse causality). Millions of analytic scenarios are possible: multiplicity of analytic methods.
  • 9. Big data offers a multiplicity of possible hypotheses! A few examples from cohort studies JECH, 2014
  • 10. Big Data offers a multiplicity of possible hypotheses! Example: cohort database of E exposures and P phenotypes. [Figure: grid of E exposure factors (1 … e) by P phenotypic factors (1 … p), with some cells highlighted in blue.] Which ones to test? All? The ones in blue? E times P possibilities! How to detect signal from noise? Hum Genet 2012; JECH 2014; Curr Env Health Rep 2016
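The "E times P possibilities" point on slide 10 can be made concrete with a little arithmetic. The sketch below is illustrative only: the cohort sizes (250 exposures, 100 phenotypes) and the function names are assumptions, not values from the talk. It counts the hypotheses when every exposure is tested against every phenotype, the false positives expected at an uncorrected threshold if no association were real, and the Bonferroni-corrected per-test threshold.

```python
def count_hypotheses(n_exposures: int, n_phenotypes: int) -> int:
    """Number of exposure-phenotype pairs when every pair is tested once."""
    return n_exposures * n_phenotypes


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of p < alpha results if every null hypothesis is true."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical cohort dimensions, chosen only for illustration.
    E, P = 250, 100
    m = count_hypotheses(E, P)
    print(f"{E} exposures x {P} phenotypes = {m} hypotheses")
    print(f"expected false positives at p < 0.05: {expected_false_positives(0.05, m):.0f}")
    print(f"Bonferroni per-test threshold: {bonferroni_threshold(0.05, m):.1e}")
```

Bonferroni is used here only because it is the simplest correction to state; the mortality example later in the deck controls the false discovery rate instead.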
  • 11. Big Data offers a multiplicity of possible hypotheses! … that depends on the domain (type of measure)! JECH, 2014 National Health and Nutrition Examination Survey (NHANES)
  • 12. Big Data = Big Bias: Confounding, reverse causality, and what causes what
  • 13. Interdependencies of the variables: Correlation globes paint a complex view of exposure and behavior. For each pair of E: Spearman ρ (575 factors: 81,937 correlations). Red: positive ρ; blue: negative ρ; thickness: |ρ|. Permuted data to produce “null ρ”; sought replication in > 1 cohort. National Health and Nutrition Examination Survey (NHANES). JAMA 2014; Pac Symp Biocomput. 2015; JECH. 2015
  • 14. Interdependencies of the variables: Correlation globes paint a complex view of exposure and behavior. For each pair of E: Spearman ρ (575 factors: 81,937 correlations). Red: positive ρ; blue: negative ρ; thickness: |ρ|. Permuted data to produce “null ρ”; sought replication in > 1 cohort. National Health and Nutrition Examination Survey (NHANES). JAMA 2014; Pac Symp Biocomput. 2015; JECH. 2015
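The recipe on slides 13 and 14 (Spearman ρ for a pair of exposures, compared against a "null ρ" from permuted data) can be sketched in plain Python. This is a minimal illustration under assumptions, not the pipeline used in the cited papers: the function names, the number of permutations, and the two-sided p-value rule are all choices made here for clarity.

```python
import random
from statistics import mean


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_null(x, y, n_perm=1000, seed=0):
    """Null distribution of rho: shuffle y to break any link with x."""
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(spearman(x, shuffled))
    return null


def permutation_p_value(observed, null):
    """Two-sided p-value: share of null |rho| at least as large as observed."""
    extreme = sum(1 for r in null if abs(r) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)
```

For one pair of measured exposures `a` and `b`, `spearman(a, b)` gives the observed ρ and `permutation_p_value(spearman(a, b), permutation_null(a, b))` compares it with the permuted null. Repeating this over every pair produces the kind of correlation table that the globes visualize.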
  • 15. How to enhance the validity of precision guidelines emerging from big data? 1.) Test systematically, address multiplicity, and replicate. 2.) Consider modeling scenarios explicitly. 3.) Practice reproducible research and increase data literacy
  • 16. Test systematically and replicate. Examples: “environment-wide” or “nutrient-wide” association studies
  • 17. A search engine for robust, reproducible genotype-phenotype associations… A prime example of systematic associations: Genome-wide association studies (GWASs). GWAS in Type 2 Diabetes: N=8K T2D, 39K controls. [Figure 1: Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis (x-axis: chromosome; y-axis: -log10(P)). Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10^-5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10^-4).] Voight et al, Nature Genetics 2012
  • 18. The same can be achieved with non-genetic factors. Example: Exposures and behaviors in mortality.
  • 19. Searching for 246 exposures and behaviors associated with all-cause mortality. Data: NHANES 1999-2004 with National Death Index linked mortality; 246 behaviors and exposures (serum/urine/self-report). Discovery: NHANES 1999-2001, N=330 to 6008 (26 to 655 deaths), ~5.5 years of followup; Cox proportional hazards on baseline exposure and time to death; false discovery rate < 5%. Replication: NHANES 2003-2004, N=177 to 3258 (20 to 202 deaths), ~2.8 years of followup; p < 0.05. Int J Epidem. 2013
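The "false discovery rate < 5%" screen on this slide can be illustrated with a minimal sketch of the Benjamini-Hochberg step-up procedure. This is an assumption for illustration: the slide does not say which FDR method the study used, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR <= alpha.

    Assumption: Benjamini-Hochberg step-up; the slide only states "FDR < 5%".
    """
    m = len(pvalues)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per exposure tested (not data from the study).
example_p = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.6]
discoveries = benjamini_hochberg(example_p, alpha=0.05)
```

In a two-stage design like the one described, only the exposures flagged here would be carried forward and checked again in the independent replication cohort.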
  • 20. Searching >250 environmental and behavioral factors in all-cause mortality. [Figure: adjusted hazard ratio (x-axis, 0.4 to 2.8) versus -log10(pvalue) (y-axis, 0 to 8); legend: FDR < 5%, sociodemographics, replicated factor. Labeled replicated factors: 1 Physical Activity; 2 Does anyone smoke in home?; 3 Cadmium; 4 Cadmium, urine; 5 Past smoker; 6 Current smoker; 7 trans-lycopene. Labeled sociodemographics: 1 age (10 year increment); 2 SES_1; 3 male; 4 SES_0; 5 black; 6 SES_2; 7 SES_3; 8 education_hs; 9 other_eth; 10 mexican; 11 occupation_blue_semi; 12 education_less_hs; 13 occupation_never; 14 occupation_blue_high; 15 occupation_white_semi; 16 other_hispanic.] Int J Epidem. 2013
  • 21. Searching >250 environmental and behavioral factors in all-cause mortality. [Figure: adjusted hazard ratio versus -log10(pvalue), with annotated factors: age (10 years); income (quintile 1); income (quintile 2); income (quintile 3); male; black; anyone smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*.] *derived from METs per activity and categorized by Health.gov guidelines. R2 ~ 2%
  • 22. Searching 82 dietary factors in blood pressure: INTERMAP and NHANES. Tzoulaki et al, A Nutrient-Wide Association Study, Circulation. 2012. [Figure: association size, with FDR < 5% threshold.] R2 ~ 7%
  • 23. Testing all associations systematically: Consideration of multiplicity of hypotheses and correlational web! Be explicit about the number of hypotheses tested: false discovery rate; family-wise error rate. Report database size! Does my correlation matter? How does my new correlation compare to the family of correlations? For ρ = 0.17 (e.g., carotene and diabetes): is the average ρ much less than 0.17? Greater? JAMA 2014; JECH 2015
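One way to ask "how does my new correlation compare to the family of correlations?" is to locate it within the distribution of all correlations computed in the same database. The sketch below is only an illustration of that idea; the helper name `fraction_weaker` and the family of correlations are hypothetical, not values from the cited papers.

```python
def fraction_weaker(new_rho, family_rhos):
    """Fraction of correlations in the family that are weaker (in absolute value) than new_rho."""
    weaker = sum(1 for r in family_rhos if abs(r) < abs(new_rho))
    return weaker / len(family_rhos)


# Hypothetical family of pairwise correlations from the same database.
family = [0.02, -0.05, 0.10, 0.31, -0.22, 0.08, 0.01, 0.45]
share = fraction_weaker(0.17, family)  # where does rho = 0.17 sit in this family?
```

A value near 0.5 would suggest that the new correlation is unremarkable relative to its background; a value near 1 would suggest that it stands out.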
  • 24. Consideration of the multitude of modeling scenarios. Example: Vibration of Effects, the empirical distribution of effect sizes due to model choice
  • 25. A maze of associations is one way to a fragmented literature and Vibration of Effects. [Image: cropped article excerpt on multiple modelling and publication bias, Young, 2011. Legible passages: "With ten covariates, there are over 1000 models. Consider a maze as a metaphor…" Figure 3 caption: "The path through a complex process can appear quite simple once the path is defined. Which terms are included in a multiple linear regression model? Each turn in a maze is analogous to including or not a specific term in the evolving linear model. By keeping an eye on the p-value on the term selected to be at issue, one can work towards a suitably small p-value." Maze image © ktsdesign – Fotolia.] Example model choices: univariate; sex; sex & age; sex & race; sex & race & age. JCE, 2015
  • 26. Distribution of associations and p-values due to model choice: Estimating the Vibration of Effects (or Risk). Data source: NHANES 1999-2004; 417 variables of interest; time to death; N≧1000 (≧100 deaths). Variable of interest: e.g., 1 SD of log(serum Vitamin D). Adjusting variable set (n=13): SES [3rd tertile]; education [>HS]; race [white]; body mass index [normal]; total cholesterol; any heart disease; family heart disease; any hypertension; any diabetes; any cancer; current/past smoker [no smoking]; drink 5/day; physical activity. All-subsets Cox regression: 2^13 + 1 = 8,193 models, giving a distribution of effect sizes and p-values. [Figure: hazard ratio (x-axis) versus −log10(pvalue) (y-axis); points labeled 0 to 13 give the median p-value/HR for k; lines mark the 1st, 50th, and 99th percentiles. Panels: Vitamin D (1SD(log)), RHR = 1.14, RP = 4.68; Thyroxine (1SD(log)), RHR = 1.15, RP = 2.90.] JCE, 2015
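The model count on this slide follows from the fact that each of the 13 adjusting variables is either in or out of a model. A minimal sketch that enumerates every subset is given below; the variable names are shorthand labels invented here, and the slide's additional "+ 1" model is not enumerated because the transcript does not say what it is.

```python
from itertools import combinations

# Shorthand labels (hypothetical) for the 13 adjusting variables listed on the slide.
ADJUSTERS = [
    "ses", "education", "race", "bmi", "total_cholesterol",
    "heart_disease", "family_heart_disease", "hypertension", "diabetes",
    "cancer", "smoking", "drink_5_per_day", "physical_activity",
]


def all_subsets(variables):
    """Yield every subset of the adjusting variables, from the empty set to the full set."""
    for k in range(len(variables) + 1):
        for subset in combinations(variables, k):
            yield subset


n_subsets = sum(1 for _ in all_subsets(ADJUSTERS))  # 2**13 adjustment sets
```

Each yielded subset would define one regression of the outcome on the variable of interest plus that adjustment set; fitting them all produces the empirical distribution of effect sizes and p-values.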
  • 27. The Vibration of Effects: Vitamin D and Thyroxine and attenuated risk in mortality. [Figure: hazard ratio versus −log10(pvalue) across models, with 1st, 50th, and 99th percentile indicators. Vitamin D (1SD(log)): RHR = 1.14, RP = 4.68. Thyroxine (1SD(log)): RHR = 1.15, RP = 2.90.] JCE, 2015
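The RHR and RP summaries on these plots can be sketched as follows, under an assumption that the slides do not state explicitly: RHR is taken as the ratio of the 99th to the 1st percentile hazard ratio across models, and RP as the difference between the 99th and 1st percentile of −log10(p-value). The nearest-rank percentile helper and the input numbers are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def vibration_summary(hazard_ratios, pvalues):
    """Return (RHR, RP) under the assumed percentile-based definitions."""
    neg_log_p = [-math.log10(p) for p in pvalues]
    rhr = percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1)
    rp = percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
    return rhr, rp


# Hypothetical results, one hazard ratio and p-value per model specification.
hrs = [0.64, 0.66, 0.68, 0.70, 0.72, 0.73]
ps = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
rhr, rp = vibration_summary(hrs, ps)
```

Under these definitions, an RHR close to 1 and a small RP would mean that the estimate barely moves with the choice of adjustment set.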
  • 28. The Vibration of Effects: shifts in the effect size distribution due to select adjustments (e.g., adjusting cadmium levels with smoking status). [Figure: hazard ratio versus −log10(pvalue) for Cadmium (1SD(log)), RHR = 1.29, RP = 8.29; second panel highlights models with adjustment=current_past_smoking.] JCE, 2015
  • 29. [Figure: grid of vibration-of-effects panels, hazard ratio versus −log10(pvalue), each variable per 1SD(log). Vitamin E as alpha-tocopherol: RHR = 1.15, RP = 3.17. Beta-carotene: RHR = 1.15, RP = 2.34. Caffeine: RHR = 1.10, RP = 1.99. Calcium: RHR = 1.13, RP = 1.15. Carbohydrate: RHR = 1.12, RP = 1.57. Carotene: RHR = 1.14, RP = 1.53. Cholesterol: RHR = 1.08, RP = 0.64. Copper: RHR = 1.17, RP = 2.86. Beta-cryptoxanthin: RHR = 1.15, RP = 1.39. Folic acid: RHR = 1.09, RP = 0.41. Folate, DFE: RHR = 1.14, RP = 2.39. Food folate: RHR = 1.14, RP = 4.64. Dietary fiber: RHR = 1.15, RP = 2.79. Total Folate: RHR = 1.15, RP = 2.11. Iron: RHR = 1.12, RP = 1.91.] Highlighted: β-carotene, caffeine, cholesterol, food folate
  • 30. [Figure: grid of vibration-of-effects panels, hazard ratio versus −log10(pvalue), each variable per 1SD(log). Potassium: RHR = 1.14, RP = 2.28. Protein: RHR = 1.11, RP = 1.42. Retinol: RHR = 1.13, RP = 0.67. SFA 4:0: RHR = 1.11, RP = 1.29. SFA 6:0: RHR = 1.11, RP = 1.71. SFA 8:0: RHR = 1.10, RP = 2.55. SFA 10:0: RHR = 1.11, RP = 1.87. SFA 12:0: RHR = 1.08, RP = 1.79. SFA 14:0: RHR = 1.11, RP = 1.61. SFA 16:0: RHR = 1.11, RP = 0.84. SFA 18:0: RHR = 1.10, RP = 0.73. Selenium: RHR = 1.09, RP = 1.24. Total saturated fatty acids: RHR = 1.11, RP = 0.93. Sodium: RHR = 1.12, RP = 3.74. Total sugars: RHR = 1.13, RP = 2.51. Total fat: RHR = 1.11, RP = 0.54. Theobromine: RHR = 1.08, RP = 1.19. Vitamin A: RHR = 1.13, RP = 1.09. Vitamin A, RAE: RHR = 1.16, RP = 1.31. Retinol: RHR = 1.15, RP = 0.74.] Highlighted: sodium, sugars, SFA 6:0, SFA 8:0, SFA 10:0
  • 31. The Vibration of Effects: beware of the Janus effect (both risk and protection?!). Janus (two-faced) risk profile: risk and significance depend on the modeling scenario! [Figure regions: “protection”, “risk”, “significant”. Janus image: Britannica.com.] JCE, 2015
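A Janus profile can be flagged by checking whether the same variable looks protective under some model specifications and harmful under others. The sketch below is a simplified illustration: the function name `janus_profile`, the use of a plain 0.05 threshold, and the example results are assumptions made here, not the criterion used in the cited paper.

```python
def janus_profile(results, alpha=0.05):
    """Summarize (hazard_ratio, pvalue) pairs, one per model specification.

    Assumption: "Janus" here simply means hazard ratios fall on both sides of 1.
    """
    protective = sum(1 for hr, p in results if hr < 1 and p < alpha)
    harmful = sum(1 for hr, p in results if hr > 1 and p < alpha)
    both_sides = any(hr < 1 for hr, _ in results) and any(hr > 1 for hr, _ in results)
    return {
        "significant_protective": protective,
        "significant_harmful": harmful,
        "janus": both_sides,
    }


# Hypothetical results for one variable across four model specifications.
profile = janus_profile([(0.92, 0.01), (0.97, 0.30), (1.03, 0.40), (1.08, 0.02)])
```

Reporting both counts alongside the flag makes it visible when opposite conclusions would each pass a significance threshold, depending only on the adjustment set chosen.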
  • 32. Need to document analysis approach(es).
  • 33. Accessible analytics tools and computer infrastructure exist to enable reproducible research: “Ability to recompute data analytic results given an observed dataset and knowledge of the pipeline…” (Leek and Peng, PNAS 2015). (1) Raw data available; (2) analytics code and documentation are available; (3) correct analysis methodology; (4) trained data analysts to execute research
  • 35. “Markdown” files to document analytic process. [Figure labels: code; output; annotation.]
  • 37. In conclusion: Big data promises a multitude of ways to discover precision guidelines. Thousands of hypotheses are possible: multiplicity of hypotheses. Big data are observational: multiplicity of biases (confounding, selection, reverse causality). Millions of analytic scenarios are possible: multiplicity of analytic methods.
  • 38. To enhance the validity of big data results, we must: 1.) Test systematically, address hypothesis tests, and replicate. 2.) Consider modeling scenarios explicitly. 3.) Practice reproducible research and increase data literacy.
  • 39. Acknowledgements. Harvard DBMI: Isaac Kohane, Susanne Churchill, Stan Shaw, Jenn Grandfield, Sunny Alvear, Michal Preminger. Harvard Chan: Hugues Aschard, Francesca Dominici. Stanford: John Ioannidis, Atul Butte (UCSF). U Queensland: Jian Yang, Peter Visscher. Cochrane: Belinda Burford. RagGroup: Chirag Lakhani, Adam Brown, Danielle Rasooly, Arjun Manrai, Erik Corona, Nam Pho. Dennis Bier; Emanuela Folco; Elena Colombo; Lorenzini Foundation. NIH Common Fund; Big Data to Knowledge. Chirag J Patel, chirag@hms.harvard.edu, @chiragjp, www.chiragjpgroup.org