2. Integration of Epidemiology and
Epigenetics
Requires:
• Subject matter knowledge
• Pressing biologic questions
• Determinants of epigenetic variation
• Possible confounders and effect modifiers
• Appropriate Study Design and Analysis
• Focus today: study design
• Can increase efficiency with study design
• Can also avoid or reduce inherent bias
3. Epigenome-Wide Association
Studies (EWAS)
• Burgeoning field, exponential increase in publications
• Vast majority study DNA methylation
• Relatively stable mark
• Methods to multiplex samples
• Established cohorts often do not have the correct samples for
other epigenetic modifications
• Focus of this workshop will be DNA methylation
• Will discuss other modifications today
• Primarily working with microarray data
• Limited time and so much to cover!
6. Funding of EWAS
Percentage of Institute/Center Total Costs for FY 2012 Used for Epigenetic
Studies Overall, and by Mechanism
Burris and Baccarelli (2014) J Appl Toxicol
National Institutes of Health (NIH) spent over $700 million (2.8% of
their total costs) on epigenetics in 2012
7. Variation across the Life-Course
Mill and Heijmans (2013) Nature Reviews Genetics
9. Epigenetics
• The study of stable, heritable
changes in gene function that
are not due to changes in the
primary sequence of DNA
• DNA methylation, histone
modification, and non-coding
RNAs (such as microRNAs)
• Cells can exhibit different
phenotypes and possess the
same genotype
10. DNA Methylation
DNA methylation: enzymatic methylation of cytosine
bases
Alcohol Research: Current Reviews, Volume 35, Issue Number 1
11. DNA Methylation
Lister et al. (2009) Nature
• DNA methylation typically
occurs at CpG sites in
differentiated cells
• Non-CpG methylation is
prevalent in embryonic stem
cells
• H1 human embryonic stem
cells and IMR90 fetal lung
fibroblasts
• 25% of methylation in stem
cells was in non-CpG
contexts (mCHG and mCHH,
where H = A, C or T)
12. Distribution of CpG Loci
CpGs are not randomly distributed across the genome
• Strongly enriched in repetitive element sequences
• e.g. LINE-1
• CpG frequency is ~5 times less than expected
• About 70% of all human gene promoters have CpG
sequences
• 5-methylcytosine (5mC) accounts for <1% of nucleotides
13. Distribution of CpG Loci
Saxonov et al. (2005)
PNAS
See CpG enrichment
closer to the
transcription start
site (TSS)
14. Mutagenic CpG Loci
• Most common base substitution in the human genome
is C>T (65% of all single nucleotide polymorphisms in
dbSNP)
• Occur most frequently in CpG contexts
• Spontaneous deamination of 5-methyl cytosine to thymine
15. CpG Islands
CpG islands: regions of high CpG content that often occur near
the transcription start site
• Almost all housekeeping genes are associated with at least one CpG
island
Definitions:
• Gardiner-Garden & Frommer (1987)
• Definition used in Genome Browser
• At least 200 bases long
• G+C content: > 50%
• observed CpG/expected CpG ratio: >= 0.6
• Takai & Jones (2002)
• Longer than 500 bp
• G+C content: > 55%
• observed CpG/expected CpG ratio: >= 0.65
• With this definition, CpG islands are more likely to be associated with
the 5’ regions of genes, and most Alu repeats are excluded
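The thresholds above can be checked directly on a sequence. The following is a minimal sketch, assuming a single candidate region passed as a plain string; the function names are illustrative and not taken from any published tool. The expected CpG count is taken as (number of C × number of G) / length, the usual form of the Gardiner-Garden & Frommer ratio.

```python
def cpg_stats(seq):
    """Return (length, G+C fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides; "CG" cannot overlap itself
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G occurred independently: (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def meets_gardiner_garden(seq):
    # At least 200 bases, G+C > 50%, observed/expected CpG >= 0.6
    n, gc, oe = cpg_stats(seq)
    return n >= 200 and gc > 0.50 and oe >= 0.6


def meets_takai_jones(seq):
    # Longer than 500 bp, G+C > 55%, observed/expected CpG >= 0.65
    n, gc, oe = cpg_stats(seq)
    return n > 500 and gc > 0.55 and oe >= 0.65


# Hypothetical toy sequences, for illustration only
cpg_rich = "CG" * 150   # 300 bp made entirely of CpG dinucleotides
at_rich = "AT" * 150    # 300 bp with no C or G
```

Here `cpg_rich` passes the Gardiner-Garden & Frommer criteria but is too short for Takai & Jones, which illustrates how the stricter definition removes short GC-rich stretches.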
17. CpG Island “Resorts”
• CpG island surrounded by shores and shelves
• Shores: up to 2 kb out from a CpG island
• Shelves: up to 2 kb out from a CpG shore
• Methylation at CpG island shores has been
suggested to be more tissue-specific
[Figure: gene regions TSS1000, TSS100, 5’ UTR, 1st exon, gene body and 3’ UTR, shown with a CpG island and shore]
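The island, shore and shelf definitions can be written as a small lookup. This is a sketch assuming one island with known inclusive start and end coordinates; the label "open sea" for everything farther away is a common convention and an assumption here, not something stated on the slide.

```python
def classify_cpg(position, island_start, island_end, flank=2000):
    """Label a CpG position relative to one CpG island (coordinates in bp, inclusive)."""
    if island_start <= position <= island_end:
        return "island"
    # Distance from the nearest island edge
    if position < island_start:
        distance = island_start - position
    else:
        distance = position - island_end
    if distance <= flank:
        return "shore"       # within 2 kb of the island
    if distance <= 2 * flank:
        return "shelf"       # within 2 kb of the shore
    return "open sea"        # assumed label for the remainder of the genome


# Hypothetical island spanning positions 10,000 to 11,000
labels = [classify_cpg(p, 10_000, 11_000) for p in (10_500, 9_000, 14_000, 20_000)]
```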
19. Tissue-Specificity of Shores
Doi et al. (2009) Nature Genetics
• Differentially methylated regions between fibroblasts
and induced pluripotent stem cells are enriched for shores
22. Chromatin
Role: packing the DNA into the cell
and controlling transcription and
replication
Basic unit: Nucleosome
• DNA wrapped around 8 histones
• Euchromatin:
• Partially decondensed
• Transcribed genes
• Heterochromatin:
• Hypercondensed in interphase
• Transcriptionally inert
• Formation of chromosomal
structures
23. Histone Code
• Histone Code Hypothesis: transcription of genetic
information encoded in DNA is in part regulated by
chemical modifications to histone proteins
• Post-translational modifications:
• Acetylation - Lys
• Methylation (mono-, di- and tri-) - Lys and Arg
• Phosphorylation - Ser and Thr
• Ubiquitination (mono- and poly-) - Lys
• Sumoylation (Lys); ADP-ribosylation; glycosylation;
biotinylation; carbonylation
24. Histone Code
• How modifications may impact
transcription
• Structural role for modifications
• Based on charge density of histone
tails
• Possible steric inhibition
• Example: Acetylation (generally
associated with transcriptionally
active genes)
• Modifications as recognition sites
• Marks read by other proteins to
control expression and these
correlations can be used to aid
identification of regulatory elements
25. Histone Code
• Ernst and Kellis (2010) Nature Biotechnology
• Discovering ‘chromatin states’ in a systematic de novo way
across a complete genome based on a multivariate Hidden
Markov Model
• Distinguished six broad classes of chromatin states: promoter,
enhancer, insulator, transcribed, repressed and inactive states
26. Tissue-specificity of Histone Code
• Ernst et al. (2011) Nature - Mapping nine chromatin
marks across nine cell types
• Functional enrichment among sites associated with promoter
& enhancer states
[Figure panels: promoter states; enhancer states]
• Promoter clusters
showed activity in
multiple cell types
• Enhancer clusters
are more cell type
specific
27. MicroRNAs
• MicroRNAs (miRNAs) are
noncoding RNAs that are
about 22 nucleotides in
length
• Originate from capped &
polyadenylated full length
precursors (pri-miRNA)
• Important in development
and post-transcriptional
gene regulation
• In animals, miRNAs can
regulate gene expression
post-transcriptionally by
imperfect complementarity
with a target mRNA, thereby
inhibiting protein synthesis
Nature Reviews Genetics 5, 522-531 (July 2004)
28. Why Very Few Histone
Modification or miRNA EWAS?
• Samples must be collected and stored correctly
• Stored in RNAlater, need high quality RNA
• Not treated with proteases (histones are proteins)
• May be sensitive to freeze thaw cycles or the storage
temperature
• Can be very expensive
• To estimate chromatin states, may need to sequence multiple
marks per sample
• Harder to integrate studies across labs
• Difficult to find validation cohort
• Trend towards meta-analyses
29. Cross Talk between Epigenetic
Pathways
[Diagram: gene expression, DNA methylation, histone modifications and miRNA]
31. Study Designs for EWAS
Adapted from: Michels (ed) Epigenetic Epidemiology
32. Types of Studies: Experimental
Studies/Randomized Controlled Trials
• Design: exposure is randomized
• Strengths:
• There is no confounding by design, so even an unadjusted analysis
should estimate the causal effect (potentially the intention-to-treat
effect)
• Weaknesses:
• Very expensive
• Time consuming
• Must consider equipoise
33. Types of Studies: Cohort Studies
• Design: enroll a group of people who do not have an
outcome yet, collect exposure information, follow
individuals over time
• Strengths:
• Efficient design when the exposure of interest is rare
• Information collected on multiple exposures and multiple outcomes
• If exposure ascertained before outcome, eliminate recall bias
• Weaknesses:
• Very time consuming
• Very expensive
• Difficult if outcome is rare
34. Types of Studies: Case-Control
Studies
• Design: enroll cases (individuals with incident or prevalent
disease) and controls (individuals who have never had the
disease who are at risk for disease) from the population
from which the cases arose
• Strengths:
• Likely faster and at lower cost if outcome has happened already
• Efficient design when the outcome is rare
• Weaknesses:
• Only one disease/outcome studied (can use data again but must
account for the selection)
• May not be possible if the exposure is rare
• Selection of controls may be difficult
• Complete exposure information may be difficult to ascertain
retrospectively
35. Types of Studies: Twin Studies
Mill and Heijmans (2013) Nature Reviews Genetics
Monozygotic (MZ) twin studies
help to discern the impact of
genetic sequence on epigenetic
variation and disease risk
MZ twins share their DNA
sequence, parents, birth date
and sex, and experienced a
very similar prenatal
environment
36. Types of Studies: Family-Based
• Can use great-grandparent, grandparent, maternal
and paternal data to investigate possible
transgenerational inheritance
• If interest is intrauterine exposure, can use paternal
exposures as a negative control to disentangle
impact of home environment
• Associations may reflect shared familial confounding
factors or parental genotypes transmitted to the
offspring
• Impact of only maternal exposure would suggest
intrauterine effect
37. Transgenerational
Inheritance
In the case of an exposed
female mouse, if she is
pregnant, the fetus can be
affected in utero (F1), as can
the germline of the fetus
(the future F2)
• considered to be parental
effects, leading to
intergenerational
epigenetic inheritance
• Only F3 individuals can be
considered as true
transgenerational
inheritance
Does it exist in humans?
Heard and Martienssen (2014)
Cell
38. Example of Negative Controls
• The effect of maternal smoking during pregnancy on offspring birthweight is
considerably greater than that of paternal smoking during
pregnancy
• Adjustment for maternal smoking attenuates the paternal effect to zero
• In line with evidence that maternal smoking has a causal effect on
offspring birthweight
Smith (2012) Epidemiology
39. Types of Studies: Family-Based
Odds ratios in meta-analyses of association between maternal smoking during
pregnancy (vs no maternal smoking during pregnancy) and paternal smoking any time
(vs no paternal smoking any time) and overweight or obesity in childhood
Riedel et al. (2014) International Journal of Epidemiology
40. Defining Exposure
• Possible definitions of exposure
• Maximum intensity of exposure experienced
• Average intensity over a period of time
• Cumulative amount of exposure
• Other important variables that are not accounted
for by duration and intensity:
• Age that exposure started
• Age at cessation of exposure
• Timing of exposure relative to disease onset (lag or
induction period)
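The exposure definitions above can give quite different values for the same person. A minimal sketch, assuming one intensity record per year of age; the smoking history is hypothetical and the function name is illustrative.

```python
def exposure_summary(records):
    """Summarise exposure from (age_in_years, intensity) pairs, one per year."""
    exposed = [(age, x) for age, x in records if x > 0]
    if not exposed:
        return None
    intensities = [x for _, x in exposed]
    return {
        "max_intensity": max(intensities),
        "mean_intensity": sum(intensities) / len(intensities),  # over exposed years only
        "cumulative": sum(intensities),                          # intensity-years
        "age_started": min(age for age, _ in exposed),
        "last_exposed_age": max(age for age, _ in exposed),
    }


# Hypothetical smoker: packs per day, recorded once per year of age
history = [(18, 0.0), (19, 0.5), (20, 1.0), (21, 1.0), (22, 2.0), (23, 0.0)]
summary = exposure_summary(history)
```

For this history the maximum intensity is 2 packs per day while the average over exposed years is 1.125, so the choice of definition changes how this person ranks against others with the same 4.5 pack-years.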
41. Validity of Results
• Internal validity – the extent to which the analysis
captures the true causal association
• Threats to internal validity
• Confounding
• Selection Bias
• Information Bias
• Can be addressed in both study design and analysis
• External validity – generalizability of the results
Which do we care about more?
42. Confounding
• Standard definition: a common cause of the
outcome and exposure
• Can result in the detection of an association between
exposure and outcome even if there is no direct effect
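A small worked example may help here. The counts below are hypothetical and chosen so that exposure has no effect within either level of a confounder C, yet C is more common among the exposed and raises the risk of the outcome.

```python
# Hypothetical counts per stratum of the confounder C:
# (exposed_cases, exposed_total, unexposed_cases, unexposed_total)
strata = {
    "C=1": (240, 800, 60, 200),   # risk 0.30 in both exposure groups
    "C=0": (20, 200, 80, 800),    # risk 0.10 in both exposure groups
}


def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# Stratum-specific risk ratios: exposure has no effect within levels of C
stratum_rr = {name: risk_ratio(*counts) for name, counts in strata.items()}

# Crude risk ratio: collapse over C by summing each column
crude_counts = [sum(column) for column in zip(*strata.values())]
crude_rr = risk_ratio(*crude_counts)
```

The stratum-specific risk ratios are both 1, but the crude risk ratio is about 1.86, an association produced entirely by the common cause.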
43. Confounding
• Surrogate confounders: variables correlated with a
confounder that have no direct association with
exposure or outcome
• Adjustment for these variables may reduce but not
completely remove confounding
• Useful when true confounder cannot be measured
44. Selection Bias
• Selection bias: biases that arise from conditioning
on a common effect of two variables, one of which
is either the exposure or a cause of the exposure,
and the other is the outcome or a cause of the
outcome
• Often we think about selection bias in case-control
studies, but can arise in cohort studies as well
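The structure can be illustrated with exact probabilities rather than a simulation. In this sketch the exposure A and outcome Y are independent, but both increase the chance of being selected into the study (C = 1). All probabilities are hypothetical.

```python
P_Y = 0.2  # hypothetical risk of the outcome Y, the same for A = 0 and A = 1


def p_selected(a, y):
    # Hypothetical selection probabilities: both A and Y make selection more likely
    return 0.1 + 0.4 * a + 0.4 * y


def risk(a, selected_only):
    """P(Y = 1 | A = a), optionally restricted to selected individuals (C = 1)."""
    numerator = denominator = 0.0
    for y in (0, 1):
        weight = P_Y if y else 1 - P_Y
        if selected_only:
            weight *= p_selected(a, y)
        denominator += weight
        numerator += weight * y
    return numerator / denominator


rr_full = risk(1, False) / risk(0, False)       # source population
rr_selected = risk(1, True) / risk(0, True)     # after conditioning on C = 1
```

In the source population the risk ratio is 1, but among the selected it is about 0.56, so conditioning on the common effect creates an inverse association where none exists.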
45. Selection Bias in Case-Control
Studies
Occurs due to inappropriate selection of controls
• Cases in the cohort are more
likely to be selected than non-
cases
• Investigators selected controls
preferentially among women with
hip fracture and estrogen is
protective against hip fracture
Hernán MA, Robins JM (2016). Causal Inference
46. Selection Bias in Case-Control
Studies: Berkson’s Bias
Occurs due to inappropriate selection of controls
• Disease 1 and Disease 2 are unassociated,
but both affect the probability of
hospital admission
• Hospital-based controls: cases had
Disease 1 and controls had Disease 2,
which is affected by the exposure A
• Risk factor A for Disease 2 would
appear to also be a risk factor for
Disease 1 even if A does not cause
Disease 1
Hernán MA, Robins JM (2016). Causal Inference
47. Selection Bias in Cohort Studies
Loss to follow-up
• Exposure has side effects that increase the
probability of dropping out, and certain symptoms
of disease also increase the probability of dropping out
Hernán MA, Robins JM
(2016). Causal Inference
48. Selection Bias in Cohort Studies
Healthy worker bias
• The unmeasured health status U is a determinant of
both death Y and of being at work C
• L may be the result of some blood test or physical exam
Hernán MA, Robins JM
(2016). Causal Inference
49. Selection Bias in Cohort Studies
Volunteer Bias
• Bias may be present if the study is restricted to those
who volunteered – may be related to lifestyle
• Cannot occur in a randomized study – exposure
randomization happens after they elect to participate
• Can impact generalizability
Hernán MA, Robins JM
(2016). Causal Inference
50. Information Bias
• Two important properties of measurement error
• Independence
• Non-differentiality
• Y* and A* are the measured outcome and exposure, Y
and A are the true values
[Figure panels: independent; dependent]
Hernán MA, Robins JM
(2016). Causal Inference
51. Information Bias: Recall Bias
• Recall Bias: outcome affects the measurement of
the exposure
• Independent but differential measurement error
• Example: the outcome is birth defects Y, and the mother is asked
after delivery to recall alcohol use during pregnancy A
• Recall may be affected by the outcome of the pregnancy
Hernán MA, Robins JM
(2016). Causal Inference
52. Information Bias: Detection Bias
• Detection Bias: exposure affects the measurement of
the outcome
• Independent but differential measurement error
• Smokers concerned about health impacts of smoking
may seek medical attention more than nonsmokers
• Leads to emphysema being diagnosed more frequently among
smokers than among nonsmokers
Hernán MA, Robins JM
(2016). Causal Inference
53. Increasing Efficiency by Matching
• Due to limited resources (money, number of case
samples) studies are limited in size
• Can increase our power to detect a change by
matching on confounders
• We must adjust for confounders in the analysis; by
matching we are trying to ensure that we do not have
certain sparse strata
• The impact of matching on internal validity
depends on the type of study that is being
conducted
• Case-control vs cohort
54. Matching in Cohort Studies
• If oversampling exposure groups, can match on
confounders to increase efficiency and control for
bias
• Effect of matching in analysis
• Matching prevents confounding even in crude analysis
• Assuming no other confounding
• Don’t have to adjust for the matching factor
• To improve precision, matching should be accounted for
in the analysis
• Effect modification of matching factors can be evaluated
in follow-up studies
55. Matching in Case-Control Studies
• Matching in case-control studies helps to increase precision
but it does not remove confounding in the crude analysis
• Bias introduced by matching
• If the matching factor(s) are associated with the exposure of
interest, matching will cause the exposure distribution among the
control group to be more similar to the cases than the true
distribution of exposure in the study base
• Bias towards the null
• Must adjust for the matching factors in the analysis
• If you match in a case-control study:
• You cannot study the main effects of matching factors
• You can evaluate if the matching factors are effect modifiers
56. Inappropriate Matching in Case-
Control Studies
• Appropriate matching: matching on a confounder
• Inappropriate matching:
• Unnecessary matching: match on variable that is not
associated with the exposure
• Do not need to adjust for matching factors in analysis
• Over-matching: match on variable that is only
associated with the exposure
• Results are biased if you do not adjust for the matching factors
• Have altered the exposure distribution among the controls
• Matching on an intermediate
• Have removed any possible association between exposure and
outcome
• Impossible to rectify this error in the analysis
57. Mediation of effects of exposures
on disease outcomes
• Utility of Mediation analysis
• If blood pressure partially mediates the influence of BMI on CHD,
could therapeutically modifying blood pressure help break the link
between BMI and CHD?
[Diagram: BMI → blood pressure → CHD, with blood pressure as the mediator]
May be interested in looking at the impact of an exposure
on an outcome through methylation
58. Mediation of effects of exposures
on disease outcomes
• Identifying direct and indirect effects requires additional modeling
assumptions:
• $Y_{am} \perp\!\!\!\perp A \mid C$: Y is independent of A adjusting for C
[DAG nodes: A, M, Y, C1]
59. Mediation of effects of exposures
on disease outcomes
• Identifying direct and indirect effects requires additional modeling
assumptions:
• $Y_{am} \perp\!\!\!\perp M \mid A, C$: Y is independent of M adjusting for C & A
[DAG nodes: A, M, Y, C1, C2]
60. Mediation of effects of exposures
on disease outcomes
• Identifying direct and indirect effects requires additional modeling
assumptions:
• $M_{a} \perp\!\!\!\perp A \mid C$: M is independent of A adjusting for C
[DAG nodes: A, M, Y, C1, C2, C3]
61. Mediation of effects of exposures
on disease outcomes
• Identifying direct and indirect effects requires additional modeling
assumptions:
• $Y_{am} \perp\!\!\!\perp M_{a^*} \mid C$: No effect of exposure that confounds the
mediator-outcome relationship
[DAG nodes: A, M, Y, C1, C2, C3]
62. An Example of When Mediation
Analysis Can Introduce Bias
• Birthweight Paradox: among low birth weight
(LBW) infants, infant mortality is lower among
infants born to smokers
Is maternal smoking
beneficial to low
birth weight infants?
Lines cross around
2kg
Hernández-Díaz et al. (2006) Am J
Epidemiology
63. An Example of When Mediation
Analysis Can Introduce Bias
• Possible explanation: there
is a common cause of LBW
and mortality that has a
greater impact on mortality
than smoking
• Therefore if infant is LBW and
mother is not a smoker, more
likely to have other condition
which is associated with
higher mortality rate
• Results in an apparent
decreased risk of mortality
among LBW infants from
smokers
Hernández-Díaz et al. (2006) Am J
Epidemiology
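This explanation can be reproduced with a few lines of arithmetic. The sketch below assumes a single other condition U (for example a birth defect) that is independent of smoking, strongly raises both LBW and mortality, and that smoking is harmful at every level of U. All probabilities are hypothetical, and for simplicity mortality depends on smoking and U only.

```python
P_U = 0.05  # hypothetical prevalence of the other condition U

# Hypothetical probabilities, indexed by (smoker, u)
P_LBW = {(0, 0): 0.02, (1, 0): 0.10, (0, 1): 0.50, (1, 1): 0.55}
P_DEATH = {(0, 0): 0.010, (1, 0): 0.015, (0, 1): 0.20, (1, 1): 0.30}


def mortality(smoker, lbw_only):
    """Infant mortality by maternal smoking, overall or restricted to LBW infants."""
    numerator = denominator = 0.0
    for u in (0, 1):
        weight = P_U if u else 1 - P_U
        if lbw_only:
            weight *= P_LBW[(smoker, u)]  # probability of being LBW in this stratum
        denominator += weight
        numerator += weight * P_DEATH[(smoker, u)]
    return numerator / denominator


overall = {s: mortality(s, lbw_only=False) for s in (0, 1)}
among_lbw = {s: mortality(s, lbw_only=True) for s in (0, 1)}
```

Overall mortality is higher for infants of smokers (about 2.9% vs 2.0%), yet among LBW infants it is lower (about 7.9% vs 11.8%), because LBW infants of nonsmokers are much more likely to have U.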
66. Impact of Batch Effects
• One of the largest
determinants of variation
tends to be batch
• Even after data preprocessing
(normalizing the signal
intensities across the array),
can still see the impact of
batch at the gene-level
Batch Effect. Nature Reviews Genetics 2010
67. Batch Effects
• Can try to reduce the impact of batch effects in the:
• Design of the study (discuss today)
• In the analysis (discuss tomorrow)
• One way to assess the impact of batch is through the
judicious use of technical replicates
• Possible to completely confound the
association between epigenetic mark and exposure
with batch
• No way to fix this in the analysis
• Examples?
68. Addressing Batch Effects in Design
• The appropriate distribution of unique
biospecimens across batches depends on:
• The study design
• The question of interest
• Our ability to estimate the batch effects
• Need to think about potential sources of batch
effects
• Storage
• When samples were processed
• Chip or sequencing lane on which samples were processed
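One way to avoid confounding the comparison of interest with batch is to balance the groups across batches when assigning samples. The following is a minimal sketch, assuming each sample carries a single group label (such as case or control); the function name and sample IDs are hypothetical, and real studies may also need to balance on storage time or processing date.

```python
import random
from collections import Counter


def assign_batches(samples, n_batches, seed=0):
    """Assign (sample_id, group) pairs to batches, balancing groups across batches.

    Samples are shuffled within each group, then dealt round-robin so that
    every batch receives a near-equal share of each group.
    """
    rng = random.Random(seed)
    by_group = {}
    for sample_id, group in samples:
        by_group.setdefault(group, []).append(sample_id)
    assignment = {}
    offset = 0
    for group in sorted(by_group):
        ids = by_group[group]
        rng.shuffle(ids)  # random order within the group
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (i + offset) % n_batches
        offset += len(ids)  # continue dealing where the previous group stopped
    return assignment


# Hypothetical study: 12 cases and 12 controls spread over 4 chips
samples = [(f"case{i}", "case") for i in range(12)]
samples += [(f"ctrl{i}", "control") for i in range(12)]
batches = assign_batches(samples, n_batches=4)
per_batch = Counter((batches[sample_id], group) for sample_id, group in samples)
```

With this allocation every chip holds three cases and three controls, so a chip effect cannot be mistaken for a case-control difference.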
69. Addressing Batch Effects in Design
Study design | Question of interest | Samples to be assayed in same batch | Groups to be balanced within batches
--- | --- | --- | ---
Randomized trial | Within-person changes over time | Samples from same participant | Intervention groups
Crossover intervention trial | Within-person differences between interventions | Samples from same participant | Order of interventions
Cohort study | Comparison of exposed and non-exposed persons | NA | Exposure categories if categorical exposure
Case–cohort study | Comparison of diseased and disease-free persons | NA | Proportion of cases and subcohort members [a]
Matched case–control study | Comparison of cases and controls | Cases and their matched control(s) | NA [b]
Frequency-matched case–control study | Comparison of cases and controls | NA | Cases and controls [b]; frequency matching characteristics if categorical
Case-series | Comparison of different case groups | NA | Case groups of interest
Cross-sectional study | Comparison of exposed and non-exposed | NA | Exposure categories if categorical exposure
Reviewed by: Tworoger and Hankinson (2006) Cancer Causes Control
71. Choice of Tissue
• Tissue of interest may not be readily accessible (e.g. brain)
• Can use reference epigenome project to inform choice of surrogate
tissue (discuss later today)
• Use available samples or establish a new study population?
• Using established study
• Nested case-control or cohort study
• Often only blood available – may not be tissue of interest
• Extensive covariate data available
• Long term outcomes
• Starting a new study
• Identify samples necessary, use correct storage
• Time consuming and more expensive
• Not possible to assess long term outcomes
73. Bias Due to Cell Composition
When we have a
heterogeneous
tissue, the cell mixture in
that tissue may
impact the results
Possible impacts of
cell composition:
• Confounding
• Mediation
• Reverse causation
Houseman (2015) Current
Environmental Health Reports
74. Estimating Cellular Composition
Houseman (2015) Current
Environmental Health Reports
Can estimate cellular composition in heterogeneous tissue
using a reference data set
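The idea behind reference-based estimation can be sketched for the simplest case of two cell types. This is a deliberately simplified illustration, not the published method: it assumes the mixed-tissue methylation at each CpG is a weighted average of two known reference profiles and solves for the weight by least squares, clipped to a valid proportion. The reference values are hypothetical.

```python
def estimate_proportion(mixed, ref_a, ref_b):
    """Least-squares estimate of the fraction of cell type A in a two-type mixture.

    Model for each CpG j: mixed[j] = p * ref_a[j] + (1 - p) * ref_b[j].
    """
    numerator = sum((m - b) * (a - b) for m, a, b in zip(mixed, ref_a, ref_b))
    denominator = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    p = numerator / denominator
    return min(1.0, max(0.0, p))  # constrain to a valid proportion


# Hypothetical reference methylation (beta values) at four CpGs that differ by cell type
ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.7, 0.3, 0.6]

# Mixed sample generated as 30% cell type A and 70% cell type B
mixed = [0.3 * a + 0.7 * b for a, b in zip(ref_a, ref_b)]
p_hat = estimate_proportion(mixed, ref_a, ref_b)
```

The estimate recovers the 30% proportion used to build the mixture. With more than two cell types the same idea requires a constrained regression over all reference profiles at once.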
76. Verification
• Definition: replicate the findings in the same
cohort using a different technology
• Original approach may not be capturing change in
percent methylation
• Verify that the findings were not technical error
• Some previously unrecognized batch effects – e.g. find out that
cases and controls were processed by different technicians
• Might be some inherent bias associated with the platform
• If results cannot be verified with a different technology
(at least as precise as the original technology) it suggests
the original results were a false positive
77. Validation
• Definition: replication of results in an independent
cohort
• Ideally using a different technology, to ensure the result is not
due to some inherent bias of the platform
• Similar to the purpose of validation for other types of
epidemiologic studies
• Identify potentially important effect modification
• Possible important residual unmeasured confounding