Data basics

Data Basics: From Raw
to Normalized Data
Day 2

Briefly: Bisulfite Conversion
Krueger et al. (2012) Nature Methods

450K Array
Design and Processing

Oligos
(~800,000/bead)
Bead
Type I: two bead types per
CpG site (unmethylated and
methylated)
Type II: one bead type per
CpG site
DNA preparation,
hybridisation, staining.
Detection of red and green
fluorescent signals by iScan
450K Array
Beads self-
assemble into
the pits on the
array
450k chip: 12
samples per chip
Multiple beads
per bead type
on each chip
Combined to
bead pools (all
bead types)

● Analyzes >480,000 CpG loci
● Covers 99% of all RefSeq genes,
average of 17 probes per gene
● Distributed over various
functional elements such as:
○ CpG islands, shores, and
shelves
○ 3´- and 5´-UTRs, gene
bodies
○ DNase hypersensitive sites
○ miRNA promoters
Dedeurwaerder et al. Epigenomics (2011)
450K Array

450K Array: QC Probes
Control probe: purpose
● Staining: measure efficiency and sensitivity of the staining step (independent of the hybridisation/extension step)
● Extension: test efficiency of extension of A, T, C and G nucleotides from a hairpin probe (sample-independent). The perfect-match hairpin controls should result in high signal, and the mismatch probes in low signal
● Hybridization: test the overall performance of the Infinium assay using synthetic targets (not DNA) at 3 concentrations
● Target removal: test efficiency of the stripping step after extension
● Bisulphite conversion: test efficiency of bisulphite conversion by query of a C/T polymorphism
● Specificity: check for non-specific detection of methylation signal over unmethylated background. Specificity controls are designed against non-polymorphic T sites (G/T mismatch)
● Non-polymorphic: query a non-polymorphic base (A, T, C and G) to test overall performance of the assay from amplification to detection
● Negative: randomly permuted bisulphite-converted sequences containing no CpGs. They should not hybridise to DNA. The mean of these probes determines the system background

Type I
● Same chemistry as 27k array
● 28% of probes on array
● Designed for regions with more CG dinucleotides - 57% of Type I probes lie in CpG islands
● Suggested to be more stable and reproducible than the signals provided by Type II probes
Type II
● Chemistry not seen on 27k array
● 72% of probes on array
● 21%, 26% and 11% of Type II probes lie in CpG islands, shores and shelves respectively
● Have a decreased quantitative dynamic range compared to Type I probes
For either probe design, intensities are used to estimate Beta-value or M-value
450k Array: Type I vs Type II
Probes
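
A minimal sketch of how the two summaries can be computed from methylated and unmethylated intensities. The offset of 100 in the Beta-value and the offset of 1 in the M-value are commonly used defaults and are assumptions here, not values taken from these slides; the function names are illustrative and do not come from any package.

```r
# Illustrative helpers (hypothetical names)
beta_value <- function(meth, unmeth, offset = 100) {
  # Proportion of signal coming from the methylated channel
  meth / (meth + unmeth + offset)
}
m_value <- function(meth, unmeth, alpha = 1) {
  # log2 ratio of methylated to unmethylated signal
  log2((meth + alpha) / (unmeth + alpha))
}
# Logit relationship between the two scales (ignores the offsets)
beta_to_m <- function(beta) log2(beta / (1 - beta))

meth   <- c(9000, 500, 4000)   # example methylated intensities
unmeth <- c(1000, 9500, 4000)  # example unmethylated intensities
betas  <- beta_value(meth, unmeth)
mvals  <- m_value(meth, unmeth)
```

Beta-values are bounded between 0 and 1 and are easy to interpret, whereas M-values are unbounded and are often preferred as input to linear models.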

Uses fluorescence from two different probes, unmethylated (converted) and
methylated (unconverted), to assess the level of methylation of a target CpG
● Binding at either probe is followed by single base extension that results in the
addition of a fluorescently labeled nucleotide
450K Array: Type I Probes

Methylation state is detected by single base extension and detection of a fluorescently
labelled nucleotide at the position of the 'C' of the target CpG
● Type II probes include a 'degenerate' R-base at any underlying CpG sites in the
probe body
450K Array: Type II Probes

Quality Control
Filtering and Normalization

● There are a number of R packages that incorporate QC,
filtering and normalisation into their pipelines or offer
specific functions
● Most allow you to define at least some thresholds yourself
● Option to pick and choose
Packages and pipelines: minfi, ChAMP, RnBeads, lumi, Touleimat & Tost, wateRmelon
Outline

AIM: identify unusual samples and technical artifacts
The array contains 65 single nucleotide polymorphisms (SNPs)
● We can use this information to identify any unintentionally duplicated samples
● If we have multiple samples per individual, samples from the same individual should cluster together
Raw Data: Initial QC
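
The SNP check can be sketched on simulated data as below. Everything here is an assumption for illustration: the genotypes are random, the 0.9 correlation cut-off is arbitrary, and in practice the SNP probe Beta-values would come from the array itself (minfi, for example, provides `getSnpBeta()` for this).

```r
set.seed(1)
n_snps <- 65
# Five unrelated individuals: genotypes coded as 0, 1 or 2 copies of an allele
geno <- matrix(rbinom(n_snps * 5, size = 2, prob = 0.5), nrow = n_snps)
geno <- cbind(geno, geno[, 2])   # sample 6 is a hidden duplicate of sample 2
colnames(geno) <- paste0("S", 1:6)

# SNP probe Beta-values sit near 0, 0.5 or 1 depending on genotype
snp_beta <- geno / 2 + rnorm(length(geno), sd = 0.03)
snp_beta[snp_beta < 0] <- 0
snp_beta[snp_beta > 1] <- 1

# Pairs of samples with near-identical SNP profiles are candidate duplicates
r <- cor(snp_beta)
pairs_idx <- which(r > 0.9 & upper.tri(r), arr.ind = TRUE)
dup_pairs <- data.frame(a = colnames(r)[pairs_idx[, "row"]],
                        b = colnames(r)[pairs_idx[, "col"]])

# Hierarchical clustering on 1 - correlation groups samples from one individual
hc <- hclust(as.dist(1 - r))
```

Calling `plot(hc)` draws the dendrogram; samples from the same individual should join at a height close to zero.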

The array uses red/green fluorescence intensities to estimate methylation level -
the two colour channels have different background intensities
● Type I probes use the same colour to evaluate methylated and unmethylated
probes - should have less of an impact
● Type II uses green to measure methylated state and red for unmethylated -
colour bias can contribute to decreased dynamic range

As the fluorescence intensities are read across the chip - there
appears to be a tiering effect
● This technical artifact could impact the results
○ e.g. cases unintentionally clustered on the array

Plot Distributions of the samples
● Unusual distributions may reflect:
○ Real biological effects (global methylation changes)
○ Poor methylation data
Different colour/line combination
for each sample
Red indicates primary tumor, blue is
adjacent normal tissue

Plot Distributions of the samples
● Boxplots or violin plots (below) can be used to the same effect

Can use multidimensional scaling to look for unusual clustering of samples
(Two figure panels: colour corresponds to tumor status; colour corresponds to TCGA batch)
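
A small sketch of the multidimensional scaling step using base R only. The data are simulated M-values with an artificial tumor/normal shift, so the sample labels and effect sizes are assumptions for illustration.

```r
set.seed(2)
n_probes <- 500
# Simulated M-values: 5 normal and 5 tumor samples
mvals <- matrix(rnorm(n_probes * 10), nrow = n_probes,
                dimnames = list(NULL, paste0("S", 1:10)))
status <- rep(c("normal", "tumor"), each = 5)
# Shift the first 100 probes in the tumor samples
mvals[1:100, status == "tumor"] <- mvals[1:100, status == "tumor"] + 3

# Classical MDS on Euclidean distances between samples
d   <- dist(t(mvals))
mds <- cmdscale(d, k = 2)
plot(mds, col = ifelse(status == "tumor", "red", "blue"), pch = 16,
     xlab = "Dimension 1", ylab = "Dimension 2")
```

Recolouring the same plot by batch, chip or array position is a quick way to see whether a technical variable drives the main axes of variation.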

By plotting the distribution of Type I and Type II probes separately we can
observe the difference in distribution - example four samples below
● Reflects: difference in chemistry and enrichment for different elements
(e.g. CpG islands)

● Each data point has an associated detection p-value
○ Tests whether the target signal is distinguishable from background noise (a small p-value indicates a signal clearly above background)
● Scanner can encounter difficulties reading signal
○ Low staining intensities
○ Spatial artefacts
Common approaches:
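● Drop probes that failed in more than n% of samples
○ Common thresholds are 20%, 10% or 5% of samples, with failure defined as detection p > 0.05 or > 0.01
● Drop samples that failed in more than n% of probes
○ Common thresholds are 50% or 20% of probes, with failure defined as detection p > 0.05 or > 0.01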
Filtering: Detection P-value
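
The two filters can be sketched as follows on a simulated matrix of detection p-values (probes in rows, samples in columns). The thresholds are examples picked from the ranges listed above, and the matrix is simulated; with real data a matrix of this shape can be obtained from, for example, minfi's `detectionP()`.

```r
set.seed(3)
# Simulated detection p-values: 1000 probes x 8 samples, mostly well detected
detp <- matrix(runif(8000, 0, 0.005), nrow = 1000,
               dimnames = list(paste0("cg", 1:1000), paste0("S", 1:8)))
detp[1:20, 1:4] <- 0.5   # 20 probes fail in half of the samples
detp[, 8] <- 0.5         # one sample fails everywhere

p_cut  <- 0.01
failed <- detp > p_cut

# Drop samples that failed in more than 20% of probes
keep_samples <- colMeans(failed) <= 0.20
# Among the remaining samples, drop probes that failed in more than 10% of samples
keep_probes <- rowMeans(failed[, keep_samples, drop = FALSE]) <= 0.10
detp_filtered <- detp[keep_probes, keep_samples]
```

Samples are filtered first here so that one failed array does not inflate the per-probe failure rates; the opposite order is also used in practice.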

Drop those with known SNPs
residing in the probe sequence
Most common SNPs in dbSNP
are C>T transitions
● C>T transitions will be read as
an unmethylated cytosine
Observe grouping of methylation
values by genotype
Filtering: Common Practices
Related to Technical Issues

● ~4.3% of the probes are reported
to contain a known
polymorphism specifically at the
targeted C or G
○ 43% of these SNPs have a
heterozygosity of >0.1
○ Price et al. Epigenetics
Chromatin (2013)
● SNP filtering depends on the
study population / reference
genome (eg CEU)
Drop those for which the CpG site contains a SNP

Drop those in which probes anneal to multiple genomic locations
● Bisulfite conversion reduces the complexity of the genome
○ All unmethylated Cs are converted to T
● ~10-20% of the Infinium HumanMethylation450 probes have been
identified as non-specific depending on the criteria
● Repetitive elements - may be a real signal, but uncertain meaning and
validation difficult
● Probes cross-reactive to X chromosome
○ pick up X inactivation, leading to spurious association if
outcome/exposure associated with sex
Naeem et al. BMC Genomics (2014)

Common practices related to analysis issues
● drop those on X and Y chromosomes
● drop those with lowest variation
● drop those with extreme methylation levels (eg median = 0% or 100%)
● only consider those in regions of interest (eg CpG island, shore, other)

Colour bias adjustment and Background correction
● Can adjust for colour bias in the lumi package using either:
○ Smooth quantile or shift-and-scaling normalisation
● Most methods employ simple background subtraction
○ No significant improvement in data quality
● New method in methylumi package outperforms previous methods
○ Uses “out-of-band” signal from type I probes to estimate background
rather than background control probes
■ Out-of-band - colour channel opposite their designed base extension
■ Only a few background control probes (n=614), but many Type I probes
Triche et al. Nucleic Acids Res. (2013)
Normalization: Within-Array

Colour bias adjustment and Background correction
Lumi approach, requires starting with a MethyLumiM object
● How to perform the quantile normalisation method
data.bgcorrect<-adjColorBias.quantile(lumidata)
Methylumi approach from Triche et al. Nucleic Acids Res. (2013), requires starting with a
MethyLumiSet object
● Using ‘noob’ method (noob=normal-exponential using out-of-band probes), a
convolution that assumes
Signal Intensity + Background = Observed foreground intensity
data.bgcorrect<-methylumi.bgcorr(methylumidata, method="noob")
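
The convolution mentioned on this slide can be written out as below. This is a sketch of the normal-exponential model as it is usually stated; the symbols X, S, B, μ, σ and α are notation chosen here and do not appear in the slides.

```latex
X = S + B, \qquad B \sim \mathcal{N}(\mu, \sigma^{2}), \qquad S \sim \mathrm{Exp}(\alpha)
```

Here X is the observed foreground intensity, S the true signal and B the background. The background parameters μ and σ are estimated from the out-of-band Type I intensities, and the corrected intensity is the conditional expectation E[S | X = x].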

Probe type correction
● The identified shift between type I and type II β-values may induce a bias
in the analysis if the methylation signals corresponding to the two types
of assays are analyzed together
○ type I probes have greater stability, increased power
● Can’t just perform full quantile normalisation
○ The population to ‘correct’ (type II) is the larger group: may bias
distribution of type I probes
○ Each probe type covers different CpG and gene-sequence regions

Probe type correction options
Subset quantile normalisation Touleimat & Tost, Epigenomics (2012)
● For each probe category, use type I signals as the anchors to estimate a
reference distribution of quantiles
● use this reference to estimate a target distribution of quantiles for type II
probes
○ two different annotations for subsetting:
■ ‘relation to CpG’
■ ‘relation to gene sequence’

Subset-quantile Within Array Normalisation (SWAN) Maksimovic et al.
Genome Biology (2012)
● Assume that the overall intensity
distribution should be the same
when the underlying CpG contents
of the probes are the same
○ in other words, assume the
CpG content of the probes
reflects the biology by being a
surrogate for the CpG density
of the region

Beta-mixture quantile (BMIQ) normalisation method. Teschendorff et al.
Bioinformatics (2013)
● Major benefit over subset normalisation methods is that it is assumption free
○ State membership of individual probes is determined by maximum
probability
● Approach:
○ Fits a three-state (unmethylated, hemimethylated, fully methylated) beta
mixture model to the Type I and Type II probes separately
○ For each state, transforms probabilities of belonging to the state to
quantiles using the inverse of the cumulative beta distribution with beta
parameters estimated from the Type I probes
● Model-based method helps to avoid having gaps emerge in normalized
distribution

Red/green intensities before and after normalisation (figure panels: BEFORE, AFTER)

Distribution of M-values and distribution of Beta-values (for one sample), before and after normalisation (figure)

● Aim is to remove other technical artifacts eg position tiering of intensities
○ reflects quantile normalisation approaches for gene expression
● Normalisation of intensities (not betas)
● Assumes the same global distributions between the samples
○ this may not be true
(Figure panels: BEFORE, AFTER)
Normalization: Between-Array
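
A minimal sketch of generic between-array quantile normalisation, to show what "assumes the same global distributions between the samples" means in practice. This is the textbook algorithm written in base R with a hypothetical function name; the implementations in lumi, minfi or wateRmelon differ in details such as handling of probe types and missing values.

```r
quantile_normalise <- function(x) {
  # x: probes (rows) x samples (columns) matrix of intensities, no missing values
  ranks  <- apply(x, 2, rank, ties.method = "first")
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)   # reference distribution: mean of each quantile
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(4)
# Two simulated arrays with different overall intensity distributions
raw  <- cbind(A = rexp(200, 1 / 1000), B = rexp(200, 1 / 1500) + 200)
norm <- quantile_normalise(raw)
```

After normalisation every array has exactly the same distribution of values and only the ordering of probes within each array is preserved, which is why the approach can remove real global methylation differences.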

Functional Normalization of 450k Methylation Array. Fortin et al. Genome
Biology (2014)
● Created to offer a means of removing between-array unwanted technical variation
even in the case of global methylation changes
● Possible to apply to cancer data or tissue differences
● Uses control probes to act as surrogates for unwanted variation
● Control probes: uses 848 control probes
● None are used to measure biological signal
● PCA of control probes, removes variation associated with first two PCs by default
● Functional normalization extends quantile normalization by adjusting for known
covariates measuring unwanted variation
● The normalization procedure is applied to the Meth and Unmeth intensities
separately, and to type I and type II signals separately

Functional Normalization of 450k Methylation Array. Fortin et al. Genome
Biology (2014)
● Like an unsupervised batch correction method
● Suggested to outperform supervised batch correction methods (e.g.
ComBat, SVA and RUV)

Normalization: Data-driven
Approaches
A data-driven approach to preprocessing Illumina 450K methylation array
data. Pidsley et al. BMC Bioinformatics (2013)
● Use three independent metrics, based on known methylation patterns, to
test the performance of different normalization and background correction
schemes
● Assess patterns associated with:
○ Genomic imprinting
■ ‘DMRSE’ (i.e. Differentially Methylated Regions Standard Error)
○ X-chromosome inactivation (XCI)
■ ‘Seabird’ (named after the auk and also the mythical bird roc, a play on AUC and ROC)
○ SNP genotyping assays present on the array
■ ‘GCOSE’ (Genotype Combined Standard Error)

Approaches
A data-driven approach to preprocessing Illumina 450K methylation array
data. Pidsley et al. BMC Bioinformatics (2013)
Method | Background adjustment | Between-array normalization | Dye bias correction
naten | no | typ1 and typ2 together | no
nanet | no | no | typ1 and typ2 together
nanes | no | no | typ1 and typ2 separately
danes | yes | no | typ1 and typ2 separately
danet | yes | no | typ1 and typ2 together
danen | yes | no | no
daten1 | yes | typ1 and typ2 together | no
daten2 | yes | typ1 and typ2 together | no
nasen | no | typ1 and typ2 separately | no
dasen | yes | typ1 and typ2 separately | no

Approaches
method TypeI TypeII Average
raw 6.5 11 8.75
betaqn 14 13 13.5
naten 12 9 10.5
nanet 11 3 7
nanes 9.5 7.5 8.5
danes 2.5 7.5 5
danet 1 6 3.5
danen 5 12 8.5
daten1 4 4 4
daten2 8 5 6.5
nasen 9.5 1.5 5.5
dasen 2.5 1.5 2
fuks 6.5 15 10.75
tost 13 14 13.5
swan 15 10 12.5
Tested 15 pre-processing methods across 11 methylation datasets using the three performance metrics
• For each dataset, rank the methods on each of the three metrics and take the mean of the three ranks
• Then take the mean of these ranks across the datasets
• “dasen” appears to do the best across probe types

● MANY ways to perform initial quality control and pre-processing
○ Consider the samples used
■ e.g. between-array normalisation may not be appropriate for
cancer samples
GOAL: identify failed samples and reduce impact of technical artifacts without
removing meaningful biological variation
Marabita et al. Epigenetics (2013)
Summary

Summary
A few of the within-array
normalization procedures
improved the concordance
between the 450k data and
the pyrosequencing data
• Marked improvement
using SQN, noob, and
BMIQ
• Blue, orange and red
indicate Infinium typeI/II
bias correction methods,
color bias adjustment and
background correction
methods, respectively.
Dedeurwaerder et al. Briefings in Bioinformatics (2013)

Summary
Dedeurwaerder et al. Briefings in Bioinformatics (2013)
• HCT116 data: more global differences, performed worse with between-
array normalization
• Roessler’s data: no improvement with between-array normalization

● Additional considerations: Filter a priori?
○ If remove loci with little inter-sample variability, may miss loci with
small, but very significant effect sizes
○ May be SNP in probe, but SNP has a minor allele frequency too low
to impact associations with methylation
○ But removing these sites reduces the number of comparisons we
need to account for when adjusting for multiple testing
GOAL: identify failed samples and reduce impact of technical artifacts without
removing meaningful biological variation
Summary

Data Issues
Considerations for High-Level Analysis

Identifying Batch Effects
You may identify batch effects in your data that have not been removed by
normalisation
● Batch effects are systematic differences between subgroups of samples that
are not related to biological or other variables in the study
○ chips that were run on separate days
○ bisulphite modifications that were performed in different batches
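
One common way to look for residual batch effects is to test whether the top principal components are associated with known technical variables. The sketch below uses simulated M-values with an artificial processing-day effect; the variable names, effect size and number of components are assumptions for illustration.

```r
set.seed(5)
n_probes <- 1000
batch <- factor(rep(c("day1", "day2"), each = 6))     # processing day
group <- factor(rep(c("case", "control"), times = 6)) # variable of interest
mvals <- matrix(rnorm(n_probes * 12), nrow = n_probes)
# Add a batch shift to 200 probes
mvals[1:200, batch == "day2"] <- mvals[1:200, batch == "day2"] + 2

# Principal components of the samples
pcs <- prcomp(t(mvals), center = TRUE, scale. = FALSE)$x[, 1:3]

# P-value for association of one PC with a variable (one-way ANOVA)
pc_assoc <- function(pc, variable) anova(lm(pc ~ variable))[["Pr(>F)"]][1]
p_batch <- apply(pcs, 2, pc_assoc, variable = batch)
p_group <- apply(pcs, 2, pc_assoc, variable = group)
```

A very small value in `p_batch` for a leading component suggests that a technical variable, rather than biology, explains a large share of the variation.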

Approaches to Remove Batch
Effects
● Can adjust for batch in downstream analysis (eg in regression)
○ Has been done in some published articles
○ ...but may not deal with the batch issue effectively
Options to address batch effect:
○ ComBat
○ SVA
○ ISVA
○ RUV2

ComBat
Johnson et al. Biostatistics (2007)
● Linear model for batch effects; uses an Empirical Bayes method to
estimate the batch effects
○ Instead of a full Bayesian approach, uses Empirical Bayes
methods to estimate the hyper-parameters from the data
■ Helps in small sample sizes by borrowing information
across genes
(Figure panels: Before, After)

ComBat
Johnson et al. Biostatistics (2007)
● Works best when:
○ Small sample size
○ Known batch effects
○ Linear batch effects
● Disadvantages:
○ Computationally intensive
○ Only corrects for batch effects from known sources
○ Assumption of linear effects and normality may be violated
In other words, not great for large studies with complicated batch effects
ComBat function in SVA package

Input
● matrix containing methylation data (Mvalues)
mdata
● vector indicating the batch variable to adjust for
batch <- samplepheno$TCGABATCH
● model matrix containing the full model
mod <- model.matrix(~as.factor(TUMOR), data=samplepheno)
● null model (in this case no other covars so only the intercept)
mod0 = model.matrix(~1,data=samplepheno)
Run ComBat
combat_mdata<- ComBat(dat=mdata, batch=batch, mod=mod, par.prior=TRUE)
ComBat

Output
● Returns a corrected matrix with the same dimensions as your original
dataset with batch effects removed
● Run significance analysis on the adjusted data
pValuesComBat = f.pvalue(combat_mdata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
OR
● run analysis model using the adjusted data
result<- cpg.assoc(combat_mdata, as.factor(samplepheno$TUMOR))
ComBat

ComBat: Simulation
group = rep(c(-1,1),each=20)
coinflip = rbinom(40,size=1,prob=0.8)
batch = group*coinflip + -group*(1-coinflip)
gcoeffs = rep(0,10000)
bcoeffs = rnorm(10000,sd=2)
coeffs = cbind(bcoeffs,gcoeffs)
mod = model.matrix(~-1 + batch + group)
modelprojected<-coeffs%*%t(mod)
dat0<-t(apply(modelprojected,1,function(x)
x+rnorm(ncol(modelprojected),sd=1)))
par(mfrow=c(2,1))
plot(group,main=expression(bold("Group")),pch=16)
plot(batch,main=expression(bold("Batch")),pch=16)
Batch is strongly associated with Group, but group
does not have a direct impact on outcome

ComBat: Simulation
library(limma) # lmFit, eBayes
library(sva)   # ComBat
## Set null and alternative models (ignore batch)
mod1 = model.matrix(~group)
mod0 = cbind(mod1[,1])
par(mfrow=c(2,1))
fit <- lmFit(dat0, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"],
xlab="p-value",
main=expression(bold("Unadjusted")),
col="orange")
combatresults<-ComBat(dat0, batch=batch,
mod=mod1,par.prior = TRUE)
fit <- lmFit(combatresults, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"],
xlab="p-value",
main=expression(bold("ComBat")),
col="purple")
Still some residual confounding, but much better
than unadjusted analysis

● Leek and Storey PLoS Genetics (2007)
● Used to identify and estimate surrogate variables for unknown,
unmodeled or latent sources of noise
○ Appropriate when there are many known or unknown
confounders
○ May not be appropriate if the biological groups of interest are
heterogeneous
■ eg. Comparing cancer cases and controls where there are
different cancer subgroups as do not want to lose this
variation
Surrogate Variable Analysis (SVA)

Step 1
Obtain residual matrix (remove variation associated with variables of
interest), calculate the singular value decomposition (SVD) of the
residual matrix, perform test to assess whether singular vectors
represent more variation than expected due to chance
Step 2
Identify the subset of genes driving each orthogonal signature of the
variation
● Test association between each probe and each singular vector of
the SVD
Step 3
For each of these subsets, build an SV based on the full signature of that
subset in the original data
● Allows the SVs to be correlated with the primary variables
Step 4
Include all significant SVs as covariates in subsequent regression
analyses
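
Step 1 can be sketched in base R as below, on simulated data with one hidden factor. This is only an illustration of the residual-plus-SVD idea: the permutation test for significance and Steps 2 to 4 are omitted, the data and effect sizes are made up, and the full procedure is what the `sva()` function in the sva package implements.

```r
set.seed(6)
n_probes <- 500
n <- 20
group  <- rep(c(0, 1), each = 10)    # variable of interest
hidden <- rep(c(0, 1), times = 10)   # unmodelled factor, e.g. an unrecorded batch
dat <- matrix(rnorm(n_probes * n), nrow = n_probes)
# The hidden factor shifts the first 150 probes
dat[1:150, ] <- dat[1:150, ] + outer(rep(2, 150), hidden)

mod <- model.matrix(~ group)
# Residual matrix: remove variation explained by the variable of interest
hat <- mod %*% solve(crossprod(mod)) %*% t(mod)
res <- dat %*% (diag(n) - hat)

# Singular value decomposition of the residual matrix
sv <- svd(res)
prop_var <- sv$d^2 / sum(sv$d^2)   # share of residual variation per singular vector
candidate_sv <- sv$v[, 1]          # right singular vector: one value per array
```

In this simulation the first right singular vector tracks the hidden factor closely, which is the signal that the later steps turn into a surrogate variable.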

Leek and Storey PLoS Genetics (2007)
Example of Expression Heterogeneity (figure; axes: arrays and genes; panels: primary variable value, unmodeled factor value)

Input
● Matrix containing methylation data
betas(methylumidata)
● Model matrix containing the full model
mod <- model.matrix(~as.factor(tumor), data=pData(methylumidata))
● Null model (in this case no other covars, so only the intercept)
mod0 = model.matrix(~1,data=pData(methylumidata))
Run sva
sva_output<- sva(dat=na.omit(betas(methylumidata)),mod=mod, mod0=mod0)
Main Output
● Can adjust for sva_output$sv in model - a matrix of surrogate variables

● Teschendorff et al. Bioinformatics (2011)
● Developed due to potential issues with SVA
○ Surrogate variables may capture heterogeneous phenotypes and/or model
misspecification
■ i.e. the residual variation may contain biologically relevant variation
● If potential confounders are known (either exactly or subject to
error/uncertainty) ISVA will select only those independent components that
correlate with the confounders
○ Otherwise similar approach as SVA - removes the variation in the data
matrix not associated with the phenotype of interest, and performs
Independent Component Analysis (ICA) on this residual variation matrix
○ BUT only keeps ISVs that are associated with putative confounders
Independent Surrogate Variable
Analysis (ISVA)

Input
● Matrix containing methylation data
methyldata<-na.omit(betas(methylumidata))
● Vector of the phenotype of interest (only takes numeric data)
binarytumor<-rep(0,ncol(methylumidata))
binarytumor[pData(methylumidata)$tumor=="yes"]<-1
● Matrix of potential confounding factors (may be numeric or categorical)
factors.m<-pData(methylumidata)[,c("age_at_initial_pathologic_diagnosis",
"TCGAbatch","tissue_source_site","anatomic_neoplasm_subdivision",
"ajcc_pathologic_tumor_stage")]
Run ISVA
isva.o <- DoISVA(methyldata, pheno.v=binarytumor, factors.m, factor.log=
c(FALSE,TRUE,TRUE,TRUE,TRUE), pvthCF=0.1, th=0.001) # one factor.log entry per column of factors.m
Analysis (ISVA)

Significant surrogate variable | Phenotype of interest | Age at diagnosis | TCGA batch | Tissue source site | Anatomic neoplasm subdivision | AJCC pathologic tumor stage
1 | 0.00054 | 0.70767 | 0.22121 | 0.56843 | 0.57395 | 0.83765
2 | 0.00288 | 0.45119 | 0.35796 | 0.44187 | 0.06801 | 0.5264
3 | 0.13759 | 0.7693 | 0.49232 | 0.474 | 0.3279 | 0.79098
4 | 0.53001 | 0.54648 | 0.05478 | 0.11074 | 0.55215 | 0.50216
5 | 0.01475 | 0.93185 | 0.18196 | 0.2447 | 0.84541 | 0.38136
6 | 0.74174 | 0.0332 | 0.62394 | 0.39353 | 0.9453 | 0.78771
7 | 0.0492 | 0.58565 | 0.91446 | 0.81124 | 0.71186 | 0.86681
8 | 0.04651 | 0.95817 | 0.91115 | 0.79168 | 0.02937 | 0.464
Looking at potential confounders and indicators of batch (p-value for association with
each candidate ISV)
● 8 candidate ISVs, 3 associated with at least one variable (p<0.1)
● Would only include the 3 significant, selected ISVs
Analysis (ISVA)

● Gagnon-Bartsch and Speed Biostatistics (2012)
● Like SVA, an analysis that estimates and adjusts for unknown
surrogate variables
● Tries to address problem that ISVA tries to tackle, discerning the
unwanted variation from the biological variation that is of interest to
the researcher
● Restricts variation decomposition to negative control genes
● Requires negative control genes: genes whose expression levels
are known a priori to be truly unassociated with the biological factor
of interest
Remove Unwanted Variation-2
(RUV2)

Summary
Approach | Pros | Cons
ComBat | Appropriate when groups are heterogeneous and batch effect is known | Batch effect may be a complicated mixture of factors
SVA | Do not need to know the unmeasured confounders, and may capture impact of cell mixture | Surrogate variables may capture heterogeneous phenotypes and/or model misspecification
ISVA | Avoids capturing meaningful biological variation | Need to have surrogates for potential confounders
RUV2 | Avoids capturing meaningful biological variation | Need to define a subgroup of probes that are not influenced by exposure of interest

● Distributions of methylation data are not always normal
○ use of transformations
○ option to remove probes with more than one mode
● Batch effects can be large and may be due to known or unknown
factors
○ use careful study design to minimise the impact of batch effects
○ use of post analysis methods to reduce batch effects
Summary

Cellular Heterogeneity
And why we care

● This section focusses on cellular heterogeneity in blood
○ most cohort studies are currently analysing data from blood samples
● Also relevant for data analysis in other tissues where cellular
heterogeneity is also present
○ currently less defined methods
● Why does cellular heterogeneity matter?
● What can we do about it?
Overview

● An issue for many population based studies
● DNA extracted from blood containing many cell types
○ Large source of variation in methylation data from blood
Cellular Heterogeneity in Blood

Heatmap of cell sorted 450k
data
Jaffe & Irizarry Genome
Biology (2014)

Why is it an issue?
● Bias if outcome of interest correlates with cell composition
○ Confounding by immunological profile
● Uninteresting variation: mediation by immunological profile
○ Reflects a previously known mechanism: real goal is to find differences in
methylation beyond the cell composition associations
○ Usually seen with environmental exposures
● Temporality is an issue/difficulty for cross-sectional data
○ e.g inflammation preceding or following cancer?
(Diagram 1 nodes: cell distribution, methylation, infertility)
(Diagram 2 nodes: PM2.5, cell distribution, methylation)

Flow cytometry
● Adjust for cell proportion or restrict analysis to one subtype
○ At discovery or validation/replication
HOWEVER:
○ expensive
○ time consuming
○ requires fresh samples (often impossible in cohort studies)
Gold standard
Correcting for Cellular
Heterogeneity in Blood

Houseman et al. BMC Bioinformatics (2012)
● General goal: use purified cells (“gold standard”) to build a model to
predict the distribution of leukocytes for the analysis of population data
○ Can also be used to predict the distribution of leukocytes in a single
sample given its DNA methylation profile
● Resembles regression calibration approach in measurement error
literature
○ Assumption of transportability - e.g. is the mechanism giving rise to
the measurement error the same in cord blood as it is in adult blood?
● Sorted WBCs in S0 from Houseman paper were run on the 27K array
○ Choose m sites with strongest association between methylation and
cell types based on F statistic to estimate cell proportions

(Diagram labels: Phenotype; Covariates X; Direct Epigenetic Response, $B X^{T}$; Immune Response, $M \Omega^{T}$; Measured DNA methylation, the Y matrix)
Ω is the cell type proportions for each individual
$\Omega = X\Gamma + \Xi$
Γ, cell composition effects

• If a reference set is available, we can estimate the cell
composition effect M
• We can estimate the individual cell proportions Ω from the
methylation profile Y and the cell composition effects M
• Can then adjust for estimated cellular composition in
the subsequent analysis
$Y = B X^{T} + M \Omega^{T} + E$
($B X^{T}$: direct effect; $M \Omega^{T}$: cell composition effect)
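
A simplified sketch of the second bullet: estimating the cell proportions of one sample from its methylation profile and a reference matrix M. The reference profiles, cell type labels and noise level are simulated assumptions. The optimisation below is a plain non-negative least squares fit via `optim()`; it stands in for, and is not the same as, the constrained quadratic programming used in the Houseman method.

```r
set.seed(7)
n_cpg <- 200
cell_types <- c("CD4T", "Mono", "Gran")   # illustrative labels
# M: reference methylation profile (Beta-values) of each purified cell type
M <- matrix(runif(n_cpg * 3), nrow = n_cpg, dimnames = list(NULL, cell_types))
omega_true <- c(CD4T = 0.6, Mono = 0.3, Gran = 0.1)
# y: methylation profile of one mixed (whole blood) sample
y <- as.vector(M %*% omega_true) + rnorm(n_cpg, sd = 0.01)

# Residual sum of squares and its gradient for candidate proportions w
rss      <- function(w) sum((y - M %*% w)^2)
rss_grad <- function(w) as.vector(-2 * crossprod(M, y - M %*% w))

# Minimise the residual sum of squares subject to w >= 0
fit <- optim(par = rep(1 / 3, 3), fn = rss, gr = rss_grad,
             method = "L-BFGS-B", lower = 0)
omega_hat <- setNames(fit$par, cell_types)
```

The estimated proportions in `omega_hat` can then be included as covariates in the association model, which is the adjustment described in the third bullet.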

Compared to gold standards, this method has relatively high precision
Accomando et al. Genome Biology (2014)
Correlation between cell proportions estimated by DNA methylation and
proportions quantified by established methods among whole blood samples
from disease-free human donors

Houseman et al. BMC Bioinformatics (2012)
● Coefficients estimated using 27k data do not seem to work well for 450k data
○ Can rebuild Houseman approach using cell sorted 450k dataset
● Can be implemented using estimateCellCounts() from minfi package
Reinius LE et al. PLoS ONE (2012)
Cell populations isolated by magnetic-activated cell sorting, purified using specific antibodies

Cell proportion predicted for Reinius (2012) samples (figure; axis: predicted cell proportion)

Jaffe and Irizarry. Genome Biology
(2014)
● Proportions of cell types estimated
using DNA methylation
● Can see shifts in composition
associated with aging
○ Not (usually) the variation that
we are interested in describing
● Age is confounder in many
epidemiology studies
○ must consider impact on
cellular heterogeneity
Cellular Heterogeneity and Aging

Other tissues also contain a mixture of cells
● To correct for this must either: microdissect all samples, or create a
reference microdissected data set to rebuild Houseman approach
The creation of reference-free methods
● Houseman et al. Bioinformatics (2014)
○ Uses modified ISVA to identify latent components of the observed
methylation variation, assumed to capture differences in cell
distributions
● Zou et al. Nature Methods (2014)
○ Finds simplest combination of principal components with linear mixed
model that controls for test inflation
Cellular Heterogeneity in other
Tissues

Houseman et al. Bioinformatics (2014)
● Similar to ISVA and SVA, except makes an additional biological mixture assumption
○ Dependence of the latent structure of the error on the unknown, cell-specific
methylation matrix
library(RefFreeEWAS)
test<-RefFreeEwasModel(betas(methylumidata.bgcorr)[c(1:1000),], mod, 8) # example 1000 loci,8 ISVs
testBoot <- BootRefFreeEwasModel(test,500) #500 bootstrap datasets
Significance of unadjusted estimates
BstarSE<-apply(testBoot[,2, "B*",],1,function (x) sd(x))
BstarT<-test$Bstar[,2]/BstarSE
BstarP<-2*pnorm(-abs(BstarT))
Significance of adjusted estimates
BetaSE<-apply(testBoot[,2, "B",],1,function (x) sd(x))
BetaT<-test$Beta[,2]/BetaSE
BetaP<-2*pnorm(-abs(BetaT))
Tissues

Houseman et al. Bioinformatics (2014)
Find that the reference
free approach in TCGA
samples reduces the
range of effect sizes,
and attenuates
significance of 1000
sample loci
● what variation is
this capturing?
Tissues

Zou et al. Nature methods (2014)
● Factored spectrally transformed linear mixed model 'EWASher' (FaST-
LMM-EWASher)
● Reference free approach (does not estimate cell type composition)
● Computes the methylome similarity between each pair of samples in
the data set to get covariance. This is used in the linear mixed model
as an implicit proxy for cell-type composition in conjunction with
principal components.
● No issue of method portability as no reference set but if number of
true associations is large there is reduction in power
Tissues

Zou et al. Nature methods (2014)
Important note from co-author Martin Aryee on PubMed Commons:
“EWASher (Zou J, 2014) is intended to be used in EWAS settings where the
primary interest is in identifying localized differentially methylated regions (i.e.
DMRs that affect only a small fraction of methylation sites).* The results of
EWASher should be interpreted with caution in settings where large-scale
methylation changes are expected and/or of interest. The method assumes that
large-scale changes are caused by cell type composition effects and will
effectively remove these changes from consideration. This is useful in many
EWAS settings, but the assumption may not hold when studying cancer or
differences between tissues. In the cancer dataset used in our paper, for
example, we specifically identify site-specific changes that are above and beyond
global hypomethylation changes”
*my bolding
Tissues

● 354 rheumatoid arthritis cases and 312 controls across 103,638 loci
○ a. shows QQ plot from unadjusted model
○ b. shows QQ plot where cell-type composition covariates were included in
the model (Houseman)
○ c. shows QQ plot using EWASher
Tissues

● Cellular heterogeneity is a particular issue for cohort studies where
cell counts are unknown
● Heterogeneity could cause bias in results
● Cell count related hits may not be of interest
● ‘Best practice’ not yet established
● Consideration for reference-free approaches: they assume the major
determinant of variation is cell composition
○ may not be true, or may not be true for all tissues
Summary
