Methods to enhance the validity of precision guidelines emerging from big data
1. Methods to enhance the validity of
precision guidelines emerging from big data
Chirag J Patel
Lorenzini Foundation; Venice, Italy
06/16/16
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
2. Data streams in public health are getting large!
Capacity to measure and compute becoming high-throughput and cheaper.
3. Data streams in public health are getting large!
Capacity to measure and compute becoming high-throughput and cheaper.
N=500,000
1M genetic variants
1000s of phenotypes
4. Data streams in public health are getting large!
…and alternative data sets (e.g., EMR) are
omnipresent!
image: Stan Shaw (MGH)
5. And new concepts for discovery:
The exposome, an analog of the genome!
what to measure? how to measure?
[Figure (Rappaport and Smith, Science 2010) — Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. External environment: radiation, diet, pollution, infections, drugs, lifestyle, stress. Internal chemical environment: xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora. Toxicologically important classes of exposome chemicals include reactive electrophiles, metals, endocrine disrupters, immune modulators, and receptor-binding proteins. Signatures and biomarkers can detect these agents in blood or serum.]
how to analyze in health?
Wild, 2005
Rappaport and Smith, 2010
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014
6. Can we use these big data sources for discovery?
Or, for building guidelines?
7. Many challenges exist in the use of big data for
discovery, guideline development, and causal
research
From Khoury MJ and Ioannidis JPA, "Big data meets public health: Human well-being could benefit from large-scale data if large-scale noise is minimized," Science, 2014:

In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. "Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours" (1). That is the potential impact of "Big Data" on the public's health. But the promise of Big Data is also accompanied by claims that "the scientific method itself is becoming obsolete" (2), as next-generation computers, such as IBM's Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being.

The term "Big Data" refers to volumes of large, complex, linkable information (4). […] For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data's strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step.

Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow "did the experiment." He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention.

How can we improve the potential for Big Data to improve health and prevent disease? One priority is that a stronger epidemiological foundation is needed. Big Data analysis is currently largely based on convenient samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., administrative […]
8. Many challenges exist in the use of big data for
discovery, guideline development, and causal research
Thousands of hypotheses are possible.
Multiplicity of hypotheses.
Big data are observational.
Multiplicity of biases:
confounding, selection, and reverse causality.
Millions of analytic scenarios are possible.
Multiplicity of analytic methods.
9. Big data offers a multiplicity of possible hypotheses!
A few examples from cohort studies
JECH, 2014
10. Big Data offers a multiplicity of possible hypotheses!
Example: cohort database of E exposures and P phenotypes
Hum Genet 2012
JECH 2014
Curr Env Health Rep 2016
[Schematic: a participants × factors data matrix, with E exposure factor columns (e-1, e-2, …, e) and P phenotypic factor columns (p-1, p-2, …, p); a handful of exposure-phenotype cells are highlighted in blue]
E exposure factors
P phenotypic factors
which ones to test?
all?
the ones in blue?
E times P possibilities!
how to detect signal from noise?
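One systematic answer to "which ones to test?" is: all of them, with the multiplicity accounted for explicitly. Below is a minimal Python sketch of such an E × P scan, assuming a pandas DataFrame with one row per participant; the column lists are hypothetical placeholders, and Spearman correlation stands in for whatever association model suits the data.

```python
# Minimal sketch: test every exposure-phenotype pair (E x P tests) and
# control the false discovery rate across the whole grid of hypotheses.
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

def exposome_scan(df, exposure_cols, phenotype_cols, fdr=0.05):
    records = []
    for p in phenotype_cols:          # P phenotypic factors
        for e in exposure_cols:       # E exposure factors
            sub = df[[e, p]].dropna()
            rho, pval = stats.spearmanr(sub[e], sub[p])
            records.append((e, p, rho, pval, len(sub)))
    res = pd.DataFrame(records,
                       columns=["exposure", "phenotype", "rho", "pval", "n"])
    # Benjamini-Hochberg across ALL E x P tests -- report the database size!
    reject, qvals, _, _ = multipletests(res["pval"], alpha=fdr, method="fdr_bh")
    res["qval"], res["fdr_significant"] = qvals, reject
    return res.sort_values("qval")
```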
11. Big Data offers a multiplicity of possible hypotheses!
… that depends on the domain (type of measure)!
JECH, 2014
National Health and Nutrition Examination
Survey (NHANES)
12. Big Data = Big Bias:
Confounding, reverse causality, and what
causes what
13. Interdependencies of the variables:
Correlation globes paint a complex view of exposure and
behavior
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1
cohort
JAMA 2014
Pac Symp Biocomput. 2015
JECH. 2015
National Health and Nutrition Examination
Survey (NHANES)
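A minimal sketch of the correlation-globe computation, assuming a pandas DataFrame of exposure measurements (one row per participant; all names are placeholders): Spearman ρ for every pair of exposures, with column-wise permutation to produce an empirical "null ρ" distribution.

```python
# Minimal sketch: pairwise Spearman correlations among exposures, plus a
# permutation null obtained by shuffling each column independently.
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_rho(df):
    rows = []
    for a, b in combinations(df.columns, 2):
        sub = df[[a, b]].dropna()
        rho, _ = stats.spearmanr(sub[a], sub[b])
        rows.append((a, b, rho))
    return pd.DataFrame(rows, columns=["e1", "e2", "rho"])

def null_rho(df, n_perm=100, seed=0):
    rng = np.random.default_rng(seed)
    nulls = []
    for _ in range(n_perm):
        # Independent shuffles break all real inter-exposure structure.
        shuffled = pd.DataFrame({c: rng.permutation(df[c].values)
                                 for c in df.columns})
        nulls.append(pairwise_rho(shuffled)["rho"].values)
    return np.concatenate(nulls)  # empirical distribution of "null rho"
```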
15. How to enhance the validity of precision guidelines
emerging from big data?
1.) Test systematically, address multiplicity, and
replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy
16. Test systematically and replicate.
Examples: “environment-wide” or “nutrient-wide”
association studies
17. A search engine for robust, reproducible genotype-
phenotype associations…
[Figure 1: Genome-wide Manhattan plots (-log10(P) by chromosome) for the DIAGRAM+ stage 1 meta-analysis. Labeled loci include BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, and PPARG. The top panel summarizes the unconditional meta-analysis: previously established loci are denoted in red, loci identified by the current study in green, and the ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals were chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs. Newly discovered conditional signals (outside established loci) are denoted with an orange dot where they show suggestive significance (P < 1 × 10−5); secondary signals close to already confirmed T2D loci are shown in purple (P < 1 × 10−4).]
Voight et al, Nature Genetics 2010
N=8K T2D, 39K Controls
GWAS in Type 2 Diabetes
A prime example of systematic associations:
Genome-wide association studies (GWASs)
18. The same can be achieved with non-genetic factors.
Example: Exposures and behaviors in mortality.
19. Searching for 246 exposures and behaviors associated with all-cause mortality.
NHANES 1999-2004, with National Death Index-linked mortality
246 behaviors and exposures (serum/urine/self-report)
Discovery, NHANES 1999-2001: N = 330 to 6,008 (26 to 655 deaths); ~5.5 years of follow-up; Cox proportional hazards (baseline exposure and time to death); false discovery rate < 5%
Replication, NHANES 2003-2004: N = 177 to 3,258 (20 to 202 deaths); ~2.8 years of follow-up; p < 0.05
Int J Epidem. 2013
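A minimal sketch of this survival scan, assuming the lifelines library, hypothetical column names (followup_yrs, died), and numeric or dummy-coded adjusters: one Cox proportional hazards model per exposure, the same adjustment set throughout, and FDR control across all exposures in the discovery set.

```python
# Minimal sketch: one Cox model per exposure (baseline exposure -> time to
# death), identical adjusters in each model, FDR < 5% across the scan.
import pandas as pd
from lifelines import CoxPHFitter
from statsmodels.stats.multitest import multipletests

ADJUSTERS = ["age", "sex", "ses", "education"]  # hypothetical adjustment set

def mortality_scan(df, exposure_cols, fdr=0.05):
    rows = []
    for e in exposure_cols:
        sub = df[[e, "followup_yrs", "died"] + ADJUSTERS].dropna()
        cph = CoxPHFitter()
        cph.fit(sub, duration_col="followup_yrs", event_col="died")
        term = cph.summary.loc[e]  # fitted coefficient row for the exposure
        rows.append((e, term["exp(coef)"], term["p"], len(sub)))
    res = pd.DataFrame(rows, columns=["exposure", "HR", "pval", "n"])
    reject, qvals, _, _ = multipletests(res["pval"], alpha=fdr, method="fdr_bh")
    res["qval"], res["fdr_significant"] = qvals, reject
    return res.sort_values("pval")
```

Replication would rerun the same loop in the independent survey cycle and require p < 0.05 for the factors that passed the discovery FDR threshold.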
20. Searching >250 environmental and behavioral factors in all-cause mortality
[Volcano plot: adjusted hazard ratio (x-axis) vs. -log10(p-value) (y-axis), with the FDR < 5% threshold marked and replicated factors highlighted]
Replicated factors: physical activity; does anyone smoke in home?; cadmium (serum); cadmium (urine); past smoker; current smoker; trans-lycopene
Sociodemographics: age (10-year increment), SES quintiles, male, black, Mexican, other Hispanic, other ethnicity, education (high school, less than high school), occupation categories
Int J Epidem. 2013
21. Searching >250 environmental and behavioral factors in all-cause mortality
[Same volcano plot, annotated: age (10 years); income quintiles 1-3; male; black; anyone smoke in home?; serum and urine cadmium [1 SD]; past smoker?; current smoker?; serum lycopene [1 SD]; physical activity [low, moderate, high activity]*]
*Derived from METs per activity and categorized by Health.gov guidelines
R2 ~ 2%
22. Searching 82 dietary factors in blood pressure:
INTERMAP and NHANES
[Plot of association sizes, FDR < 5%]
R2 ~ 7%
Tzoulaki et al., A Nutrient-Wide Association Study. Circulation, 2012
23. Testing all associations systematically:
Consideration of multiplicity of hypotheses and correlational web!
Explicit in number of hypotheses
tested
False discovery rate;
family-wise error rate;
Report database size!
Does my correlation matter? How does my new
correlation compare to the family of correlations?
e.g., ρ = 0.17 (carotene and diabetes):
is the average ρ much less than 0.17? Greater?
JAMA 2014
JECH 2015
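To make "does my correlation matter?" concrete, one can ask where a new ρ falls within the empirical family of correlations already computed from the database (e.g., the 81,937 pairwise ρ values above). A minimal sketch, with a hypothetical function name:

```python
# Minimal sketch: percentile of a new correlation within the family of
# |rho| values from the database (a higher percentile = more notable).
import numpy as np

def family_percentile(new_rho, family_rhos):
    fam = np.abs(np.asarray(family_rhos, dtype=float))
    fam = fam[~np.isnan(fam)]
    return float(np.mean(fam <= abs(new_rho)))

# e.g., is rho = 0.17 (carotene and diabetes) large relative to the family?
# family_percentile(0.17, globe["rho"])  # `globe` from the pairwise scan above
```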
24. Consideration of a multitude of modeling scenarios.
Example: Vibration of Effects, the empirical
distribution of effect sizes due to model choice
31. The Vibration of Effects: beware of the Janus effect
(both risk and protection?!)
Janus (two-faced) risk profile:
risk and significance depend on the modeling scenario!
[Plot: effect estimates spanning "protection" to "risk," some crossing the "significant" threshold; Janus image: Brittanica.com]
JCE, 2015
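A minimal sketch of a vibration-of-effects computation in the spirit of the JCE 2015 paper: refit the same exposure-outcome Cox model under every subset of candidate adjusters and record how the hazard ratio and p-value move. Column names and the adjuster list are hypothetical placeholders; with k adjusters this fits 2^k models, so keep k modest.

```python
# Minimal sketch: enumerate all adjustment-variable subsets, refit the Cox
# model each time, and summarize the spread ("vibration") of the estimates.
from itertools import combinations
import pandas as pd
from lifelines import CoxPHFitter

def vibration_of_effects(df, exposure, adjusters,
                         duration_col="followup_yrs", event_col="died"):
    results = []
    for k in range(len(adjusters) + 1):
        for combo in combinations(adjusters, k):
            sub = df[[exposure, duration_col, event_col] + list(combo)].dropna()
            cph = CoxPHFitter()
            cph.fit(sub, duration_col=duration_col, event_col=event_col)
            term = cph.summary.loc[exposure]
            results.append({"adjusters": combo,
                            "HR": term["exp(coef)"],
                            "p": term["p"]})
    voe = pd.DataFrame(results)
    # Janus profile: the same exposure looks protective (HR < 1) under some
    # models and harmful (HR > 1) under others.
    janus = bool(voe["HR"].min() < 1.0 < voe["HR"].max())
    return voe, janus
```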
33. Accessible analytics tools and computer
infrastructure exist to enable reproducible research
“Ability to recompute data analytic
results given an observed dataset and
knowledge of the pipeline…”
Leek and Peng, PNAS 2015
(1) Raw data available
(2) Analytics code and documentation are available
(3) Correct analysis methodology
(4) Trained data analysts to execute research
37. In conclusion: Big data promises a multitude of ways
to discover precision guidelines
Thousands of hypotheses are possible.
Multiplicity of hypotheses.
Big Data are observational.
Multiplicity of biases: confounding, selection, and reverse causality.
Millions of analytic scenarios are possible.
Multiplicity of analytic methods.
38. To enhance the validity of big data results, we must:
1.) Test systematically, address multiplicity, and
replicate.
2.) Consider modeling scenarios explicitly.
3.) Practice reproducible research and increase
data literacy.
39. Harvard DBMI
Isaac Kohane
Susanne Churchill
Stan Shaw
Jenn Grandfield
Sunny Alvear
Michal Preminger
Harvard Chan
Hugues Aschard
Francesca Dominici
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
NIH Common Fund
Big Data to Knowledge
Acknowledgements
Stanford
John Ioannidis
Atul Butte (UCSF)
U Queensland
Jian Yang
Peter Visscher
Cochrane
Belinda Burford
RagGroup
Chirag Lakhani
Adam Brown
Danielle Rasooly
Arjun Manrai
Erik Corona
Nam Pho
Dennis Bier
Emanuela Folco
Elena Colombo
Lorenzini Foundation