Search engine for E NEU network science 080817
1. Building a search engine to find
environmental factors associated with
disease and health
Chirag J Patel
Center for Complex Network Research
Northeastern University
8/8/17
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
2. P = G + E
Phenotype: Type 2 Diabetes, Cancer, Alzheimer’s, Gene expression
Genome: Variants
Environment: Infectious agents, Diet + Nutrients, Pollutants, Drugs
3. We are great at G investigation!
2,940 (as of 6/1/17)
36,066 G-P associations
Genome-wide Association Studies (GWAS)
https://www.ebi.ac.uk/gwas/
G
4. Nothing comparable to elucidate E influence!
E: ???
We lack high-throughput methods
and data to discover new E in P…
7. $H^2 = \frac{\sigma^2_G}{\sigma^2_P}$
Heritability ($H^2$) is the proportion of phenotypic
variance attributed to genetic variance in a
population
An indicator of the proportion of phenotypic
differences attributed to G.
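A minimal sketch of the ratio above, assuming the simple additive model P = G + E with G and E uncorrelated and no interaction, so that Var(P) = Var(G) + Var(E). The function names and the variance values are hypothetical and chosen only for illustration.

```python
def heritability(var_g: float, var_p: float) -> float:
    """Return H^2 = Var(G) / Var(P)."""
    if var_p <= 0:
        raise ValueError("phenotypic variance must be positive")
    if not 0 <= var_g <= var_p:
        raise ValueError("genetic variance must lie between 0 and Var(P)")
    return var_g / var_p


def non_genetic_share(var_g: float, var_p: float) -> float:
    """Return 1 - H^2, the share of variance left for E.

    Assumption: P = G + E with G and E uncorrelated and no interaction.
    """
    return 1.0 - heritability(var_g, var_p)


# Hypothetical trait: a quarter of the phenotypic variance is genetic.
h2 = heritability(var_g=0.25, var_p=1.0)
remaining = non_genetic_share(var_g=0.25, var_p=1.0)
print(f"H^2 = {h2:.2f}, share left for E = {remaining:.2f}")
```

Under that assumption, a low $H^2$ leaves most of the phenotypic variance to be explained by E.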
8. Eye color
Hair curliness
Type-1 diabetes
Height
Schizophrenia
Epilepsy
Graves' disease
Celiac disease
Polycystic ovary syndrome
Attention deficit hyperactivity disorder
Bipolar disorder
Obesity
Alzheimer's disease
Anorexia nervosa
Psoriasis
Bone mineral density
Menarche, age at
Nicotine dependence
Sexual orientation
Alcoholism
Lupus
Rheumatoid arthritis
Crohn's disease
Migraine
Thyroid cancer
Autism
Blood pressure, diastolic
Body mass index
Depression
Coronary artery disease
Insomnia
Menopause, age at
Heart disease
Prostate cancer
QT interval
Breast cancer
Ovarian cancer
Hangover
Stroke
Asthma
Blood pressure, systolic
Hypertension
Osteoarthritis
Parkinson's disease
Longevity
Type-2 diabetes
Gallstone disease
Testicular cancer
Cervical cancer
Sciatica
Bladder cancer
Colon cancer
Lung cancer
Leukemia
Stomach cancer
0 25 50 75 100
Heritability: Var(G)/Var(Phenotype)
Source: SNPedia.com
G estimates for burdensome diseases are low and variable:
massive opportunity for high-throughput E discovery
Type 2 Diabetes
Heart Disease
Autism (50%???)
9. [Figure repeated from slide 8: heritability, Var(G)/Var(Phenotype), by trait, on a 0 to 100 scale. Source: SNPedia.com]
G estimates for complex traits are low and variable:
massive opportunity for high-throughput E discovery
$\sigma^2_E$: Exposome!
10. It took a new paradigm of GWAS for discovery:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000
cases of seven common diseases and
3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the
identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip
500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major
diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at
P < 5 × 10⁻⁷: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn’s disease, 3 in rheumatoid arthritis, 7 in type 1
diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these
signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found
compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a
Vol 447|7 June 2007|doi:10.1038/nature05911
WTCCC, Nature, 2007.
Comprehensive, high-throughput analyses
GWAS
11. Explaining the other 50%:
A data-driven paradigm for robust discovery of E in disease via
EWAS and the exposome
what to measure? how to measure?
PERSPECTIVES
[Figure: Characterizing the exposome. Labels: external environment (radiation, diet, pollution, infections, drugs, life-style, stress); internal chemical environment (xenobiotics, inflammation, preexisting disease, lipid peroxidation, oxidative stress, gut flora); exposome chemical classes (reactive electrophiles, metals, endocrine disrupters, immune modulators, receptor-binding proteins).]
[…] critical entity for disease etiology (7). Recent discussion has focused on whether and how to implement this vision (8). Although fully characterizing human exposomes is daunting, strategies can be developed for getting “snapshots” of critical portions of a person’s exposome during different stages of life. At one extreme is a “bottom-up” strategy in which all chemicals in each external source of a subject’s exposome are measured at each time point. Although this approach would have the advantage of relating important exposures to the air, water, or diet, it would require enormous effort and would miss essential components of the internal chemical environment due to such factors as gender, obesity, inflammation, and stress. By contrast, a “top-down” strategy would measure all chemicals (or products of their downstream processing or effects, so-called read-outs or signatures) in a subject’s blood. This would require only a single blood specimen at each time point and would relate directly […]
[…]ruptors and can be measured through serum […]
[…]some (telomere) length in peripheral blood mononuclear cells responded to chronic psychological stress, possibly mediated by the production of reactive oxygen species (15).
Characterizing the exposome represents a technological challenge like that of the human genome project, which began when DNA sequencing was in its infancy (16). Analytical systems are needed to process small amounts of blood from thousands of subjects. Assays should be multiplexed for measuring many chemicals in each class of interest. Tandem mass spectrometry, gene and protein chips, and microfluidic systems offer the means to do this. Platforms for high-throughput assays should lead to economies of scale, again like those experienced by the human genome project. And because exposome technologies would provide feedback for therapeutic interventions and personalized medicine, they should motivate the development of commercial devices for screening important environmental exposures in blood samples.
With successful characterization of both […]
Characterizing the exposome. The exposome represents the combined exposures from all sources that reach the internal chemical environment. Toxicologically important classes of exposome chemicals are shown. Signatures and biomarkers can detect these agents in blood or serum.
“A more comprehensive view of
environmental exposure is
needed ... to discover major
causes of diseases...”
how to analyze in relation to health?
Wild, 2005, 2012
Rappaport and Smith, 2010, 2011
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPA, 2014
12. Promises and Challenges in creating a search engine for E
in P
High-throughput E = discovery!
systematic; reproducible
multiple hypothesis control
prioritization
[Figure repeated from slide 8: heritability, Var(G)/Var(Phenotype), by trait, on a 0 to 100 scale]
Arjun Manrai
(Yuxia Cui, David Balshaw)
ARPH 2016
JAMA 2014
JECH 2014
$\sigma^2_E$: Exposome!
14. Gold standard for breadth of human exposure information:
National Health and Nutrition Examination Survey (NHANES) [1]
since the 1960s
now biennial: 1999 onwards
10,000 participants per survey
The sample for the survey is selected to represent the U.S. population of all ages. To produce reliable statistics, NHANES over-samples persons 60 and older, African Americans, and Hispanics.
Since the United States has experienced dramatic growth in the number of older people during this century, the aging population has major implications for health care needs, public policy, and research priorities. NCHS is working with public health agencies to increase the knowledge of the health status of older Americans. NHANES has a primary role in this endeavor.
All participants visit the physician. Dietary interviews and body measurements are included for everyone. All but the very young have a blood sample taken and will have a dental screening. Depending upon the age of the participant, the rest of the examination includes tests and procedures to assess the various aspects of health listed above. In general, the older the individual, the more extensive the examination.
Survey Operations
Health interviews are conducted in respondents’ homes. Health measurements are performed in specially-designed and equipped mobile centers, which travel to locations throughout the country. The study team consists of a physician, medical and health technicians, as well as dietary and health interviewers. Many of the study staff are bilingual (English/Spanish).
An advanced computer system using high-end servers, desktop PCs, and wide-area networking collect and process all of the NHANES data, nearly eliminating the need for paper forms and manual coding operations. This system allows interviewers to use notebook computers with electronic pens. The staff at the mobile center can automatically transmit data into data bases through such devices as digital scales and stadiometers. Touch-sensitive computer screens let respondents enter their own responses to certain sensitive questions in complete privacy. Survey information is available to NCHS staff within 24 hours of collection, which enhances the capability of collecting quality data and increases the speed with which results are released to the public.
In each location, local health and government officials are notified of the upcoming survey. Households in the study area receive a letter from the NCHS Director to introduce the survey. Local media may feature stories about the survey.
NHANES is designed to facilitate and encourage participation. Transportation is provided to and from the mobile center if necessary. Participants receive compensation and a report of medical findings is given to each participant. All information collected in the survey is kept strictly confidential. Privacy is protected by public laws.
Uses of the Data
Information from NHANES is made available through an extensive series of publications and articles in scientific and technical journals. For data users and researchers throughout the world, survey data are available on the internet and on easy-to-use CD-ROMs.
Research organizations, universities, health care providers, and educators benefit from survey information. Primary data users are federal agencies that collaborated in the design and development of the survey. The National Institutes of Health, the Food and Drug Administration, and CDC are among the agencies that rely upon NHANES to provide data essential for the implementation and evaluation of program activities. The U.S. Department of Agriculture and NCHS cooperate in planning and reporting dietary and nutrition information from the survey. NHANES’ partnership with the U.S. Environmental Protection Agency allows continued study of the many important environmental influences on our health.
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision
1 http://www.cdc.gov/nchs/nhanes.htm
>250 exposures (serum + urine)
GWAS chip
>85 quantitative clinical traits
(e.g., serum glucose, lipids, body
mass index)
Death index linkage (cause of
death)
15. Gold standard for breadth of exposure & behavior data:
National Health and Nutrition Examination Survey
Nutrients and Vitamins
vitamin D, carotenes
Infectious Agents
hepatitis, HIV, Staph. aureus
Plastics and consumables
phthalates, bisphenol A
Physical Activity
e.g., steps
Pesticides and pollutants
atrazine; cadmium; hydrocarbons
Drugs
statins; aspirin
16. What E are associated with aging:
all-cause mortality and
telomere length?
Int J Epidem 2013
Int J Epidem 2016
17. How does it work?:
Searching for exposures and behaviors associated with all-
cause mortality.
NHANES: 1999-2004
National Death Index linked mortality
246 behaviors and exposures (serum/urine/self-report)
NHANES: 1999-2001
N=330 to 6008 (26 to 655 deaths)
~5.5 years of followup
Cox proportional hazards
baseline exposure and time to death
False discovery rate < 5%
NHANES: 2003-2004
N=177 to 3258 (20-202 deaths)
~2.8 years of followup
p < 0.05
Int J Epidem 2013
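The two-stage screen on this slide (FDR < 5% in the discovery surveys, then p < 0.05 in the replication survey) can be sketched as follows. This is a hedged illustration only: the slide does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment is an assumption swapped in here, the exposure names and p-values are made up, and the per-exposure Cox models that would produce the p-values are not shown.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH is used as a stand-in; the slide only states "FDR < 5%".
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def discover_and_replicate(discovery_p, replication_p, fdr=0.05, alpha=0.05):
    """Keep exposures with q < fdr in discovery and p < alpha in replication."""
    names = list(discovery_p)
    qvalues = benjamini_hochberg([discovery_p[n] for n in names])
    discovered = [n for n, q in zip(names, qvalues) if q < fdr]
    # An exposure missing from the replication survey cannot replicate.
    return [n for n in discovered if replication_p.get(n, 1.0) < alpha]


# Hypothetical p-values from per-exposure survival models (not real results).
discovery = {"exposure_a": 0.0001, "exposure_b": 0.004,
             "exposure_c": 0.03, "exposure_d": 0.5}
replication = {"exposure_a": 0.01, "exposure_b": 0.2, "exposure_c": 0.04}
replicated = discover_and_replicate(discovery, replication)
print(replicated)
```

Separating discovery from replication means each reported exposure has passed a multiplicity-controlled screen in one sample and an independent check in another.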
21. Promises and Challenges in creating a search engine for E
in P
High-throughput assays of E!
scalable and standard technologies
ARPH 2016
JAMA 2014
JECH 2014
Big data = big bias!
Confounding; reverse causality
Dense correlational web of E and P
Fragmented and small E-P associations
Influence of time and life-course
22. Challenge to scale absolute E due to heterogeneity
and large dynamic range.
Rappaport et al, EHP 2015
Untargeted
Targeted
23. • Getting cheaper, but still not “at scale”
• Measurements are relative, not absolute
• Identification of chemical analytes is an art
• Detection limits are not low enough for E
25. Example of fragmentation:
Is everything we eat associated with cancer?
Schoenfeld and Ioannidis, AJCN 2012
50 random ingredients from
Boston Cooking School
Cookbook
Any associated with cancer?
FIGURE 1. Effect estimates reported in the literature by malignancy type (top) or ingredient (bottom). Only ingredients with ≥10 studies […]
outliers are not shown (effect estimates >10).
Of 50, 40 studied in cancer risk
Weak statistical evidence:
non-replicated
inconsistent effects
non-standardized
26. Are all the drugs we take associated with cancer?
Sci Reports 2016
Tested the association of each of ~500 drugs prescribed in the
entire population of Sweden
(N=9M) with time to cancer
Assessed 2 modeling techniques (Cox and case-crossover)
27. What drugs are associated with time to cancer?
Too many to be plausible (up to 26%!)
any cancer: 141 (26%)
prostate: 56 (10%)
breast: 41 (7%)
colon: 14 (3%)
Modest concordance between Cox and case-crossover: 12 out of 141!
Most correlations small (HR < 1.1); residual confounding?
Sci Reports 2016
28. Distribution of associations and p-values due to model choice:
Estimating the Vibration of Effects (or Risk)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
$2^{13} + 1 = 8{,}193$ models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N≧1000 (≧100 deaths)
Outputs: distribution of effect sizes and p-values across models
[Figure: −log10(p-value) plotted against hazard ratio over all models; points mark the median p-value/HR for k adjusting variables (k = 0 to 13); percentile indicators at 1, 50, and 99.
Vitamin D (1 SD of log): RHR = 1.14, RP = 4.68
Thyroxine (1 SD of log): RHR = 1.15, RP = 2.90]
JCE, 2015
http://bit.ly/effectvibration
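A rough sketch of the bookkeeping behind the vibration of effects, under stated assumptions. It enumerates every subset of the 13 adjusting variables (the 2^13 = 8,192 subsets; the slide's count of 8,193 includes one further model not enumerated here) and summarizes a list of per-model results. The summary definitions are an assumption based on the 1/50/99 percentile indicators in the figure: RHR is taken as the ratio of the 99th to the 1st percentile hazard ratio, and RP as the difference between the 99th and 1st percentile of −log10(p-value). Fitting the Cox models is not shown, and the example results are invented.

```python
from itertools import combinations
from math import log10

# Adjusting variables as listed on the slide (labels shortened).
ADJUSTERS = [
    "SES", "education", "race", "body mass index", "total cholesterol",
    "any heart disease", "family heart disease", "any hypertension",
    "any diabetes", "any cancer", "smoking", "drinking", "physical activity",
]


def all_adjustment_sets(variables):
    """Yield every subset of the adjusting variables, including the empty set."""
    for k in range(len(variables) + 1):
        for subset in combinations(variables, k):
            yield subset


def percentile(values, pct):
    """Percentile by linear interpolation, for pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def vibration_of_effects(results):
    """Summarize (hazard_ratio, p_value) pairs, one pair per fitted model.

    Assumed definitions: RHR = HR_99 / HR_1 and
    RP = (-log10 p)_99 - (-log10 p)_1.
    """
    hrs = [hr for hr, _ in results]
    neglogp = [-log10(p) for _, p in results]
    rhr = percentile(hrs, 99) / percentile(hrs, 1)
    rp = percentile(neglogp, 99) - percentile(neglogp, 1)
    return rhr, rp


n_models = sum(1 for _ in all_adjustment_sets(ADJUSTERS))
print(n_models)

# Invented per-model results standing in for fitted Cox models.
example = [(0.66, 1e-7), (0.70, 1e-5), (0.72, 1e-4), (0.75, 1e-3)]
rhr, rp = vibration_of_effects(example)
print(round(rhr, 3), round(rp, 3))
```

An RHR near 1 and an RP near 0 would indicate an association whose size and significance barely depend on the choice of adjusting variables.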
30. Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
permuted data to produce
“null ρ”
sought replication in > 1
cohort
Pac Symp Biocomput. 2015
JECH. 2015
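The pairwise step described here (Spearman ρ for each pair of exposures, compared against a "null ρ" from permuted data) might look like the sketch below. It is an illustration under assumptions, not the authors' pipeline: the function names, the two-sided permutation p-value, the number of permutations, and the toy data are all hypothetical, and inputs are assumed to be non-constant.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided p-value from a null built by shuffling y against x."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy exposure levels for ten participants (made-up numbers).
exposure_1 = [0.2, 0.5, 0.9, 1.4, 1.8, 2.3, 2.9, 3.1, 3.6, 4.0]
exposure_2 = [1.1, 1.0, 1.9, 2.2, 2.1, 3.0, 3.3, 3.9, 3.8, 4.5]
print(round(spearman(exposure_1, exposure_2), 3))
print(permutation_pvalue(exposure_1, exposure_2))
```

Repeating this for every pair and keeping only pairs that replicate in more than one cohort yields the edges drawn in the correlation globe.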
31. Interdependencies of the exposome:
Correlation globes paint a complex view of exposure
Effective number of
variables:
500 (10% decrease)
33. Does my association between E and P matter in the
entire possible space of associations?
ARPH 2017
Hum Genet 2012
JECH 2014
Curr Epidemiol Rep 2017
Curr Env Health Rep 2016
[Figure: E × P matrix; columns are E exposure factors (1 … e), rows are P phenotypic factors (1 … p)]
which ones to test?
all?
the ones in blue?
E times P possibilities!
how to detect signal from noise?
34. P
Scaling up the search in multiple (m=157) phenotypes:
does my single association between E and P matter?
Body Measures
Body Mass Index
Height
Blood pressure & fitness
Systolic BP
Diastolic BP
Pulse rate
VO2 Max
Metabolic
Glucose
LDL-Cholesterol
Triglycerides
Inflammation
C-reactive protein
white blood cell count
Kidney function
Creatinine
Sodium
Uric Acid
Liver function
Aspartate aminotransferase
Gamma glutamyltransferase
Aging
Telomere length
Time to death
Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier
35. Creation of a phenotype-exposure association map:
A 2-D view of 209 phenotype by 514 exposure associations
> 0
< 0
Association Size:
504 E exposure and diet indicators × 209 clinical trait phenotypes
NHANES 1999-2000, 2001-2002, 2005-2006, …, 2011-2012 (8)
Median N: 150-5000 per survey
~83,092 E-P associations!
significant associations (FDR < 5%)
adjusted by age, age², sex, race, income
Raj Manrai, Hugues Aschard, JPA Ioannidis, Dennis Bier
209 phenotypes
514 exposures
37. 83,092 total associations between E and P
12,237 significant associations (6%, in yellow):
Average association size: 0.6% for 1SD change in E
[Figure axis: percent change for 1 SD increase, from −6% to 7%]
42. High-throughput data analytics to mitigate analytical challenges of
exposome-based research:
Consider multiplicity of hypotheses and correlational web!
Does my correlation matter?
How does my new correlation
compare to the family of correlations?
What is the total variance
explained ($\sigma^2_E$)?
saturated fatty acids and HbA1c: 0.5%
does it matter? (i.e., 1.2% is average!)
ρ
ARPH 2016
JAMA 2014
JECH 2015
Be explicit about the number of hypotheses
tested
False discovery rate;
family-wise error rate;
Report database size!
[Figure: E × P matrix; columns are E exposure factors (1 … e), rows are P phenotypic factors (1 … p)]
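Two of the questions on this slide lend themselves to a short sketch: where a single association sits within the family of all tested associations, and how strict a family-wise threshold becomes once the number of hypotheses is stated. The helper names are hypothetical, the family of effect sizes is invented, and the Bonferroni correction is used only as a familiar example of family-wise error control; the count of 83,092 associations is the figure reported earlier in the deck.

```python
def percentile_rank(effect, family):
    """Fraction of the family whose absolute effect is <= |effect|."""
    magnitudes = [abs(e) for e in family]
    return sum(m <= abs(effect) for m in magnitudes) / len(magnitudes)


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that controls the family-wise error rate."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Invented family of effect sizes (percent change per 1 SD of exposure).
family = [0.1, -0.3, 0.5, 1.2, -2.0, 4.0]
rank = percentile_rank(0.5, family)
print(f"An effect of 0.5% sits at the {rank:.0%} mark of this family")

# Threshold when all 83,092 E-P associations are counted as tests.
threshold = bonferroni_threshold(83_092)
print(f"{threshold:.2e}")
```

Reporting the size of the family alongside the result lets a reader judge whether a 0.5% effect is unusual or simply typical of the database.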
44. [Figure repeated from slide 8: heritability, Var(G)/Var(Phenotype), by trait, on a 0 to 100 scale. Source: SNPedia.com]
Using high-throughput tools and data (e.g., the exposome) will
enhance discovery of the role of E (and G) in P.
46. Harvard DBMI
Susanne Churchill
Nathan Palmer
Sophia Mamousette
Sunny Alvear
Michal Preminger
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
NIH Common Fund
Big Data to Knowledge
Acknowledgements
RagGroup
Nam Pho
Jake Chung
Kajal Claypool
Arjun Manrai
Chirag Lakhani
Adam Brown
Danielle Rasooly
Alan LeGoallec
Sivateja Tangirala
Amar Dhand
Center for Complex Networks
Mentioned Collaborators
Isaac Kohane
John Ioannidis
Dennis Bier
Hugues Aschard