Data Basics: From Raw
to Normalized Data
Day 2
Briefly: Bisulfite Conversion
Krueger et al. (2012) Nature Methods
450K Array
Design and Processing
Oligos
(~800,000/bead)
Bead
Type I: two bead types per
CpG site (unmethylated and
methylated)
Type II: one bead type per
CpG site
DNA preparation,
hybridisation, staining.
Detection of red and green
fluorescent signals by iScan
450K Array
Beads self-
assemble into
the pits on the
array
450k chip: 12
samples per chip
Multiple beads
per bead type
on each chip
Combined to
bead pools (all
bead types)
● Analyzes >480,000 CpG loci
● Covers 99% of all RefSeq genes,
average of 17 probes per gene
● Distributed over various
functional elements such as:
○ CpG islands, shores, and
shelves
○ 3´- and 5´-UTRs, gene
bodies
○ DNAse hypersensitive sites
○ miRNA promoters
Dedeurwaerder et al. Epigenomics (2011)
450K Array
450K Array: QC Probes
Control Probe | Purpose
Staining | Measure efficiency and sensitivity of the staining step (independent of the hybridisation/extension step)
Extension | Test efficiency of extension of A, T, C and G nucleotides from a hairpin probe (sample-independent). The perfect match hairpin controls should result in high signal, and the mismatch probes in low signal
Hybridization | Test the overall performance of the Infinium assay using synthetic targets (not DNA) at 3 concentrations
Target removal | Test efficiency of the stripping step after extension
Bisulphite conversion | Test efficiency of bisulphite conversion by query of a C/T polymorphism
Specificity | Check for non-specific detection of methylation signal over unmethylated background. Specificity controls are designed against non-polymorphic T sites (G/T mismatch)
Non-polymorphic | Query a non-polymorphic base (A, T, C and G) to test overall performance of the assay from amplification to detection
Negative | Randomly permuted bisulphite-converted sequences containing no CpGs. They should not hybridise to DNA. The mean of these probes determines the system background
Type I | Type II
Same chemistry as 27k array | Chemistry not seen on 27k array
28% of probes on array | 72% of probes on array
Designed for regions with more CG dinucleotides - 57% of Type I probes lie in CpG islands | 21%, 26% and 11% of Type II probes lie in CpG islands, shores and shelves respectively
Suggested to be more stable and reproducible than the signals provided by Type II probes | Have a decreased quantitative dynamic range compared to Type I probes
For either probe design, intensities are used to estimate Beta-value or M-value
450k Array: Type I vs Type II
Probes
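As a rough illustration of how intensities become Beta-values and M-values, here is a minimal base R sketch. The intensities are made up, and the offsets (100 for Beta, 1 for M) are commonly used defaults that are assumed here rather than taken from these slides.

```r
# Hypothetical methylated and unmethylated intensities for three probes
meth   <- c(12000, 500, 4000)
unmeth <- c(800, 9000, 4000)

# Beta-value: proportion of signal that is methylated
# (offset of 100 is an assumed default that stabilises low-intensity probes)
beta_value <- function(meth, unmeth, offset = 100) {
  meth / (meth + unmeth + offset)
}

# M-value: log2 ratio of methylated to unmethylated intensity
# (offset of 1 is an assumed default that avoids log of zero)
m_value <- function(meth, unmeth, offset = 1) {
  log2((meth + offset) / (unmeth + offset))
}

beta <- beta_value(meth, unmeth)
mval <- m_value(meth, unmeth)

# With both offsets set to zero, the two scales are linked by a logit (base 2)
beta_to_m <- function(beta) log2(beta / (1 - beta))
```

Beta-values are bounded between 0 and 1 and are easier to interpret, while M-values are unbounded, which is why the ComBat input later in the deck is given as M-values.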
Uses fluorescence from two different probes, unmethylated (converted) and
methylated (unconverted), to assess the level of methylation of a target CpG
● Binding at either probe is followed by single base extension that results in the
addition of a fluorescently labeled nucleotide
Dedeurwaerder et al. Epigenomics (2011)
450K Array: Type I Probes
Methylation state is detected by single base extension and detection of a fluorescently
labelled nucleotide at the position of the 'C' of the target CpG
● Type II probes include a 'degenerate' R-base at any underlying CpG sites in the
probe body
Dedeurwaerder et al. Epigenomics (2011)
450K Array: Type II Probes
Quality Control
Filtering and Normalization
● There are a number of R packages that incorporate QC,
filtering and normalisation into their pipelines or offer
specific functions
● Most allow you to define at least some thresholds yourself
● Option to pick and choose
minfi ChAMP
RnBeads lumi
Touleimat &
Tost
wateRmelon
Outline
AIM: identify unusual samples and technical artifacts
The array contains 65 single nucleotide polymorphisms (SNPs)
● We can use this info to identify any unintentional duplicated samples
● If we have multiple samples per individual, their samples should cluster
Raw Data: Initial QC
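One way the SNP probes could be used to spot duplicated samples is sketched below in base R. The data are simulated, the sample names are hypothetical, and the 0.9 correlation cut-off is an arbitrary assumption for illustration, not a recommended threshold.

```r
set.seed(1)
n_snp <- 65

# Simulated genotype-like beta values (0, 0.5, 1) for three unrelated individuals
geno <- matrix(sample(c(0, 0.5, 1), n_snp * 3, replace = TRUE), nrow = n_snp)

# Fourth sample is a duplicate of the first; add a little measurement noise
snp_beta <- cbind(geno, geno[, 1])
snp_beta <- snp_beta + rnorm(length(snp_beta), sd = 0.03)
colnames(snp_beta) <- paste0("S", 1:4)

# Pairwise correlation of SNP profiles between samples
cors <- cor(snp_beta)

# Report sample pairs whose SNP profiles are nearly identical
hits <- which(cors > 0.9 & upper.tri(cors), arr.ind = TRUE)
dup_pairs <- data.frame(a = colnames(cors)[hits[, "row"]],
                        b = colnames(cors)[hits[, "col"]],
                        stringsAsFactors = FALSE)
dup_pairs
```

The same correlation matrix can be passed to a clustering function to check that multiple samples from one individual group together.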
AIM: identify unusual samples and technical artifacts
The array uses red/green fluorescence intensities to estimate methylation level -
the two colour channels have different background intensities
● Type I probes use the same colour to evaluate methylated and unmethylated
probes - should have less of an impact
● Type II uses green to measure methylated state and red for unmethylated -
colour bias can contribute to decreased dynamic range
Raw Data: Initial QC
AIM: identify unusual samples and technical artifacts
As the fluorescence intensities are read across the chip - there
appears to be a tiering effect
● This technical artifact could impact the results
○ e.g. cases unintentionally clustered on the array
Raw Data: Initial QC
Plot Distributions of the samples
● Unusual distributions may reflect:
○ Real biological effects (global methylation changes)
○ Poor methylation data
AIM: identify unusual samples and technical artifacts
Different colour/line combination
for each sample
Red indicates primary tumor, blue is
adjacent normal tissue
Raw Data: Initial QC
Plot Distributions of the samples
● Boxplots or violin plots (below) can be used to the same effect
AIM: identify unusual samples and technical artifacts
Raw Data: Initial QC
AIM: identify unusual samples and technical artifacts
[Figure: two MDS plots of the samples; in one, colour corresponds to tumor status; in the other, colour corresponds to TCGA batch]
Can use multidimensional scaling to look for unusual clustering of samples
Raw Data: Initial QC
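A minimal base R sketch of multidimensional scaling on a beta matrix follows. The matrix is simulated and the group labels are hypothetical; on real data the columns would be the samples from the array.

```r
set.seed(2)

# Simulated beta matrix: 200 probes x 6 samples, last three samples shifted
beta <- matrix(runif(200 * 6, 0.2, 0.8), nrow = 200)
beta[1:50, 4:6] <- beta[1:50, 4:6] + 0.15
colnames(beta) <- paste0("S", 1:6)
group <- rep(c("normal", "tumor"), each = 3)

# Euclidean distances between samples (columns), then classical MDS in 2 dimensions
d <- dist(t(beta))
mds <- cmdscale(d, k = 2)

# Colour by group; recolouring by batch shows whether batch drives the clustering
plot(mds, col = ifelse(group == "tumor", "red", "blue"), pch = 16,
     xlab = "MDS 1", ylab = "MDS 2")
```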
By plotting the distribution of Type I and Type II probes separately we can
observe the difference in distribution - example four samples below
● Reflects: difference in chemistry and enrichment for different elements
(e.g. CpG islands)
AIM: identify unusual samples and technical artifacts
Raw Data: Initial QC
● Each data point has an associated detection p-value
○ Represents the probability of seeing a signal this strong from background
noise alone; a small value means the target signal is distinguishable from background
● Scanner can encounter difficulties reading signal
○ Low staining intensities
○ Spatial artefacts
Common approaches:
● Drop probes that failed in n% of samples
○ Common thresholds are 20%, 10% or 5% of samples, with failure defined as p > 0.05 or p > 0.01
● Drop samples that failed in n% of probes
○ Common thresholds are 50% or 20% of probes, with failure defined as p > 0.05 or p > 0.01
Filtering: Detection P-value
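The two filtering rules can be sketched in base R on a matrix of detection p-values. The matrix is simulated, and the cut-offs (p > 0.01, 20% of probes, 10% of samples) are picked from the ranges quoted on the slide purely as an example.

```r
set.seed(3)

# Simulated detection p-value matrix: 1000 probes x 8 samples
detp <- matrix(runif(1000 * 8, 0, 0.005), nrow = 1000,
               dimnames = list(paste0("cg", 1:1000), paste0("S", 1:8)))
detp[1:20, 1:4] <- 0.5   # 20 probes fail in half of the samples
detp[, 8] <- 0.5         # one sample fails everywhere

p_cut <- 0.01
failed <- detp > p_cut

# Drop samples that failed in more than 20% of probes
keep_samples <- colMeans(failed) <= 0.20

# Then drop probes that failed in more than 10% of the remaining samples
keep_probes <- rowMeans(failed[, keep_samples, drop = FALSE]) <= 0.10

detp_clean <- detp[keep_probes, keep_samples]
dim(detp_clean)
```

Filtering samples first is a design choice in this sketch: a single failed sample would otherwise inflate the per-probe failure rate.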
Drop those with known SNPs
residing in the probe sequence
Most common SNPs in dbSNP
are C>T transitions
● C>T transitions will be read as
an unmethylated cytosine
Observe grouping of methylation
values by genotype
Filtering: Common Practices
Related to Technical Issues
● ~4.3% of the probes are reported
to contain a known
polymorphism specifically at the
targeted C or G
○ 43% of these SNPs have a
heterozygosity of >0.1
○ Price et al. Epigenetics
Chromatin (2013)
● SNP filtering depends on the
study population / reference
genome (eg CEU)
Drop those for which the CpG site contains a SNP
Filtering: Common Practices
Related to Technical Issues
Drop those in which probes anneal to multiple genomic locations
● Bisulfite conversion reduces the complexity of the genome
○ All unmethylated Cs converted to T
● ~10-20% of the Infinium HumanMethylation450 probes have been
identified as non-specific depending on the criteria
● Repetitive elements - may be a real signal, but uncertain meaning and
validation difficult
● Probes cross-reactive to X chromosome
○ pick up X inactivation, leading to spurious association if
outcome/exposure associated with sex
Naeem et al. BMC Genomics (2014)
Filtering: Common Practices
Related to Technical Issues
Common practices related to analysis issues
● drop those on X and Y chromosomes
● drop those with lowest variation
● drop those with extreme methylation levels (eg median = 0% or 100%)
● only consider those in regions of interest (eg CpG island, shore, other)
Filtering: Common Practices
Related to Technical Issues
Colour bias adjustment and Background correction
● Can adjust for colour bias in the lumi package using either:
○ Smooth quantile or shift-and-scaling normalisation
● Most methods employ simple background subtraction
○ No significant improvement in data quality
● New method in methylumi package outperforms previous methods
○ Uses “out-of-band” signal from type I probes to estimate background
rather than background control probes
■ Out-of-band - colour channel opposite their designed base extension
■ Only a few background control probes (n=614), but many Type I probes
Triche et al. Nucleic Acids Res. (2013)
Normalization: Within-Array
Colour bias adjustment and Background correction
Lumi approach, requires starting with a MethyLumiM object
● How to perform the quantile normalisation method
data.bgcorrect<-adjColorBias.quantile(lumidata)
Methylumi approach from Triche et al. Nucleic Acids Res. (2013), requires starting with a
MethyLumiSet object
● Using ‘noob’ method (noob = normal-exponential using out-of-band probes), a
convolution that assumes
Signal Intensity + Background = Observed foreground intensity
methylumidata.bgcorr<-methylumi.bgcorr(methylumidata, method="noob")
Normalization: Within-Array
Probe type correction
● The identified shift between type I and type II β-values may induce a bias
in the analysis if the methylation signals corresponding to the two types
of assays are analyzed together
○ type I probes have greater stability, increased power
● Can’t just perform full quantile normalisation
○ The population to ‘correct’ (type II) is the larger group: may bias
distribution of type I probes
○ Each probe type covers different CpG and gene-sequence regions
Normalization: Within-Array
Probe type correction options
Subset quantile normalisation Touleimat & Tost, Epigenomics (2012)
● For each probe category, use type I signals as the anchors to estimate a
reference distribution of quantiles
● use this reference to estimate a target distribution of quantiles for type II
probes
○ two different annotations for subsetting:
■ ‘relation to CpG’
■ ‘relation to gene sequence’
Normalization: Within-Array
Probe type correction options
Subset-quantile Within Array Normalisation (SWAN) Maksimovic et al.
Genome Biology (2012)
● Assume that the overall intensity
distribution should be the same
when the underlying CpG contents
of the probes are the same
○ in other words, assume the
CpG content of the probes
reflects the biology by being a
surrogate for the CpG density
of the region
Normalization: Within-Array
Probe type correction options
Beta-mixture quantile (BMIQ) normalisation method. Teschendorff et al.
Bioinformatics (2013)
● Major benefit over subset normalisation methods is that it is assumption free
○ State membership of individual probes is determined by maximum
probability
● Approach:
○ Fits a three-state (unmethylated, hemimethylated, fully methylated) beta
mixture model to the Type I and Type II probes separately
○ For each state, transforms probabilities of belonging to the state to
quantiles using the inverse of the cumulative beta distribution with beta
parameters estimated from the Type I probes
● Model-based method helps to avoid having gaps emerge in normalized
distribution
Normalization: Within-Array
[Figure: red/green intensities before and after normalisation]
Normalization: Within-Array
[Figure: distributions of M-values and Beta-values for one sample, before and after normalisation]
Normalization: Within-Array
● Aim is to remove other technical artifacts eg position tiering of intensities
○ reflects quantile normalisation approaches for gene expression
● Normalisation of intensities (not betas)
● Assumes the same global distributions between the samples
○ this may not be true
[Figure: sample intensity distributions before and after between-array normalisation]
Normalization: Between-Array
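The quantile normalisation idea borrowed from gene expression can be sketched in a few lines of base R. This is a generic illustration on simulated intensities, not the implementation used by any of the packages named in this deck; the function name is hypothetical.

```r
quantile_normalise <- function(x) {
  # Rank each column, average the sorted columns to get a shared target
  # distribution, then map each value's rank back onto that target
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(4)
# Two simulated arrays with different overall intensity distributions
intens <- cbind(A = rexp(500, 1 / 2000), B = rexp(500, 1 / 3000) + 500)
normed <- quantile_normalise(intens)

# After normalisation both arrays share exactly the same set of values
summary(normed)
```

Because every array is forced onto one distribution, a real global methylation difference between samples would be removed along with the technical variation, which is the caveat raised on the slide.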
Normalization: Between-Array
Functional Normalization of 450k Methylation Array. Fortin et al. Genome
Biology (2014)
● Created to offer a means of removing between-array unwanted technical variation
even in the case of global methylation changes
● Possible to apply to cancer data or tissue differences
● Uses control probes to act as surrogates for unwanted variation
● Control probes: uses 848 control probes
● None are used to measure biological signal
● PCA of control probes, removes variation associated with first two PCs by default
● Functional normalization extends quantile normalization by adjusting for known
covariates measuring unwanted variation
● The normalization procedure is applied to the Meth and Unmeth intensities
separately, and to type I and type II signals separately
Normalization: Between-Array
Functional Normalization of 450k Methylation Array. Fortin et al. Genome
Biology (2014)
● Like an unsupervised batch correction method
● Suggested to outperform supervised batch correction methods (e.g.
ComBat, SVA and RUV)
Normalization: Data-driven
Approaches
A data-driven approach to preprocessing Illumina 450K methylation array
data. Pidsley et al. BMC Bioinformatics (2013)
● Uses three independent metrics, based on known methylation patterns, to
test the performance of different normalization and background correction
schemes
● Assesses patterns associated with:
○ Genomic imprinting
■ ‘DMRSE’ (Differentially Methylated Regions Standard Error)
○ X-chromosome inactivation (XCI)
■ ‘Seabird’ (named after the auk and also the mythical bird roc)
○ SNP genotyping assays present on the array
■ ‘GCOSE’ (Genotype Combined Standard Error)
Normalization: Data-driven
Approaches
A data-driven approach to preprocessing Illumina 450K methylation array
data. Pidsley et al. BMC Bioinformatics (2013)
Method | Background adjustment | Between-array normalization | Dye bias correction
naten | no | typ1 and typ2 together | no
nanet | no | no | typ1 and typ2 together
nanes | no | no | typ1 and typ2 separately
danes | yes | no | typ1 and typ2 separately
danet | yes | no | typ1 and typ2 together
danen | yes | no | no
daten1 | yes | typ1 and typ2 together | no
daten2 | yes | typ1 and typ2 together | no
nasen | no | typ1 and typ2 separately | no
dasen | yes | typ1 and typ2 separately | no
Normalization: Data-driven
Approaches
method TypeI TypeII Average
raw 6.5 11 8.75
betaqn 14 13 13.5
naten 12 9 10.5
nanet 11 3 7
nanes 9.5 7.5 8.5
danes 2.5 7.5 5
danet 1 6 3.5
danen 5 12 8.5
daten1 4 4 4
daten2 8 5 6.5
nasen 9.5 1.5 5.5
dasen 2.5 1.5 2
fuks 6.5 15 10.75
tost 13 14 13.5
swan 15 10 12.5
Tested 15 pre-processing
methods across 11
methylation datasets using
the three performance
metrics
• For each dataset get mean
of three ranks across
methods
• Then get the mean of ranks
across the datasets
• “dasen” appears to do the
best across probe types
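The final averaging step can be reproduced from the table above with a short base R sketch. The Type I and Type II mean ranks are copied from the slide; only the data frame and variable names are introduced here.

```r
# Mean ranks by probe type, as listed in the table on this slide
ranks <- data.frame(
  method = c("raw", "betaqn", "naten", "nanet", "nanes", "danes", "danet",
             "danen", "daten1", "daten2", "nasen", "dasen", "fuks", "tost", "swan"),
  TypeI  = c(6.5, 14, 12, 11, 9.5, 2.5, 1, 5, 4, 8, 9.5, 2.5, 6.5, 13, 15),
  TypeII = c(11, 13, 9, 3, 7.5, 7.5, 6, 12, 4, 5, 1.5, 1.5, 15, 14, 10),
  stringsAsFactors = FALSE
)

# Average the two probe-type ranks; a lower rank means better performance
ranks$Average <- rowMeans(ranks[, c("TypeI", "TypeII")])

# Methods ordered from best to worst average rank
ranks[order(ranks$Average), ]
best <- ranks$method[which.min(ranks$Average)]
```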
● MANY ways to perform initial quality control and pre-processing
○ Consider the samples used
■ e.g. between-array normalisation may not be appropriate for
cancer samples
GOAL: identify failed samples and reduce impact of technical artifacts without
removing meaningful biological variation
Marabita et al. Epigenetics (2013)
Summary
Summary
A few of the within-array
normalization procedures
improved the concordance
between the 450k data and
the pyrosequencing data
• Marked improvement
using SQN, noob, and
BMIQ
• Blue, orange and red
indicate Infinium typeI/II
bias correction methods,
color bias adjustment and
background correction
methods, respectively.
Dedeurwaerder et al. Briefings in Bioinformatics (2013)
Summary
Dedeurwaerder et al. Briefings in Bioinformatics (2013)
• HCT116 data: more global differences, performed worse with between-
array normalization
• Roessler’s data: no improvement with between-array normalization
● Additional considerations: Filter a priori?
○ If remove loci with little inter-sample variability, may miss loci with
small, but very significant effect sizes
○ May be SNP in probe, but SNP has a minor allele frequency too low
to impact associations with methylation
○ But removing these sites reduces the number of comparisons we
need to account for when adjusting for multiple testing
GOAL: identify failed samples and reduce impact of technical artifacts without
removing meaningful biological variation
Summary
Data Issues
Considerations for High-Level Analysis
Identifying Batch Effects
You may identify batch effects in your data that have not been removed by
normalisation
● Batch effects are subgroups of data that are not related to biological or
other variables in the study
○ chips that were run on separate days
○ bisulphite modifications that were performed in different batches
Approaches to Remove Batch
Effects
● Can adjust for batch in downstream analysis (eg in regression)
○ Has been done in some published articles
..but may not effectively deal with batch issue
Options to address batch effect:
○ ComBat
○ SVA
○ ISVA
○ RUV2
ComBat
Johnson et al. Biostatistics (2007)
● Linear model for batch effects and
uses Empirical Bayes method to
estimate the batch effects
○ Instead of full Bayesian
approach, uses Empirical Bayes
methods to estimate the
hyper-parameters from the
data
■ Helps in small sample sizes
by borrowing information
across genes
Before After
ComBat
Johnson et al. Biostatistics (2007)
● Works best when:
○ Small sample size
○ Known batch effects
○ Linear batch effects
● Disadvantages:
○ Computationally intensive
○ Only corrects for batch effects from known sources
○ Assumption of linear effects and normality may be violated
In other words, not great for large studies with complicated batch effects
ComBat function in SVA package
Input
● matrix containing methylation data (Mvalues)
mdata
● vector indicating the batch variable to adjust for
batch <- samplepheno$TCGABATCH
● model matrix containing the full model
mod <- model.matrix(~as.factor(TUMOR), data=samplepheno)
● null model (in this case no other covars so only the intercept)
mod0 = model.matrix(~1,data=samplepheno)
Run ComBat
combat_mdata<- ComBat(dat=mdata, batch=batch, mod=mod,
par.prior=TRUE) # numCovs is not an argument in current versions of sva
ComBat
Output
● Returns a corrected matrix with the same dimensions as your original
dataset with batch effects removed
● Run significance analysis on the adjusted data
pValuesComBat = f.pvalue(combat_mdata,mod,mod0)
qValuesComBat = p.adjust(pValuesComBat,method="BH")
OR
● run analysis model using the adjusted data
result<- cpg.assoc(combat_mdata, as.factor(samplepheno$TUMOR)) # cpg.assoc takes the data matrix and the variable of interest as separate arguments
ComBat
ComBat: Simulation
group = rep(c(-1,1),each=20)
coinflip = rbinom(40,size=1,prob=0.8)
batch = group*coinflip + -group*(1-coinflip)
gcoeffs = rep(0,10000)
bcoeffs = rnorm(10000,sd=2)
coeffs = cbind(bcoeffs,gcoeffs)
mod = model.matrix(~-1 + batch + group)
modelprojected<-coeffs%*%t(mod)
dat0<-t(apply(modelprojected,1,function(x)
x+rnorm(ncol(modelprojected),sd=1)))
par(mfrow=c(2,1))
plot(group,main=expression(bold("Group")),pch=16)
plot(batch,main=expression(bold("Batch")),pch=16)
Batch is strongly associated with Group, but group
does not have a direct impact on outcome
ComBat: Simulation
library(limma) # provides lmFit and eBayes
library(sva) # provides ComBat
## Set null and alternative models (ignore batch)
mod1 = model.matrix(~group)
mod0 = cbind(mod1[,1])
par(mfrow=c(2,1))
fit <- lmFit(dat0, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"],
xlab="p-value",
main=expression(bold("Unadjusted")),
col="orange")
combatresults<-ComBat(dat0, batch=batch,
mod=mod1,par.prior = TRUE)
fit <- lmFit(combatresults, mod1)
fit <- eBayes(fit)
hist(fit$p.value[,"group"],
xlab="p-value",
main=expression(bold("ComBat")),
col="purple")
Still some residual confounding, but much better
than unadjusted analysis
● Leek and Storey PLoS Genetics (2007)
● Used to identify and estimate surrogate variables for unknown,
unmodeled or latent sources of noise
○ Appropriate when there are many known or unknown
confounders
○ May not be appropriate if the biological groups of interest are
heterogeneous
■ eg. Comparing cancer cases and controls where there are
different cancer subgroups as do not want to lose this
variation
Surrogate Variable Analysis (SVA)
Step 1
Obtain residual matrix (remove variation associated with variables of
interest), calculate the singular value decomposition (SVD) of the
residual matrix, perform test to assess whether singular vectors
represent more variation than expected due to chance
Step 2
Identify the subset of genes driving each orthogonal signature of the
variation
● Test association between each probe and each singular vector of
the SVD
Step 3
For each of these subsets, build a SV based on the full signature of that
subset in the original data
● Allows the SVs to be correlated with the primary variables
Step 4
Include all significant SVs as covariates in subsequent regression
analyses
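Step 1 above can be illustrated with a simplified base R sketch: remove the primary variable, then decompose the residuals. This is only the residual-plus-SVD idea on simulated data; it omits the permutation test and Steps 2 to 4, and it is not the sva package's algorithm. All variable names are hypothetical.

```r
set.seed(5)
n <- 20
p <- 300
group  <- rep(c(0, 1), each = n / 2)    # primary variable of interest
hidden <- rep(c(0, 1), times = n / 2)   # unmodeled factor (simulated)

# Simulated data: noise plus a hidden-factor effect on the first 100 probes
dat <- matrix(rnorm(p * n, sd = 0.5), nrow = p)
dat[1:100, ] <- dat[1:100, ] + outer(rep(2, 100), hidden)

# Residual matrix: remove variation explained by the primary variable
mod <- model.matrix(~ group)
hat <- mod %*% solve(crossprod(mod)) %*% t(mod)
resid <- dat - dat %*% hat

# SVD of the residuals; right singular vectors have one value per array
sv <- svd(resid)
var_explained <- sv$d^2 / sum(sv$d^2)
candidate_sv <- sv$v[, 1]

# In this simulation the first singular vector tracks the hidden factor
abs(cor(candidate_sv, hidden))
```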
Surrogate Variable Analysis (SVA)
Leek and Storey PLoS Genetics (2007)
[Figure: example of expression heterogeneity; heatmap of genes by arrays, with panels showing the primary variable value and the unmodeled factor value across arrays]
Surrogate Variable Analysis (SVA)
Input
● Matrix containing methylation data
betas(methylumidata)
● Model matrix containing the full model
mod <- model.matrix(~as.factor(tumor), data=pData(methylumidata))
● Null model (in this case no other covars, so only the intercept)
mod0 = model.matrix(~1,data=pData(methylumidata))
Run sva
sva_output<- sva(dat=na.omit(betas(methylumidata)),mod=mod, mod0=mod0)
Main Output
● Can adjust for sva_output$sv in model - a matrix of surrogate variables
Surrogate Variable Analysis (SVA)
● Teschendorff et al. Bioinformatics (2011)
● Developed due to potential issues with SVA
○ Surrogate variables may capture heterogeneous phenotypes and/or model
misspecification
■ i.e. the residual variation may contain biologically relevant variation
● If potential confounders are known (either exactly or subject to
error/uncertainty) ISVA will select only those independent components that
correlate with the confounders
○ Otherwise similar approach as SVA - removes the variation in the data
matrix not associated with the phenotype of interest, and performs
Independent Component Analysis (ICA) on this residual variation matrix
○ BUT only keeps ISVs that are associated with putative confounders
Independent Surrogate Variable
Analysis (ISVA)
Input
● Matrix containing methylation data
methyldata<-na.omit(betas(methylumidata))
● Vector of the phenotype of interest (only takes numeric data)
binarytumor<-rep(0,ncol(methylumidata))
binarytumor[pData(methylumidata)$tumor=="yes"]<-1
● Matrix of potential confounding factors (may be numeric or categorical)
factors.m<-pData(methylumidata)[,c("age_at_initial_pathologic_diagnosis",
"TCGAbatch","tissue_source_site","anatomic_neoplasm_subdivision",
"ajcc_pathologic_tumor_stage")]
Run ISVA
isva.o <- DoISVA(methyldata, pheno.v=binarytumor, factors.m, factor.log=
c(FALSE,TRUE,TRUE,TRUE,TRUE), pvthCF=0.1, th=0.001) # one factor.log entry per column of factors.m
Independent Surrogate Variable
Analysis (ISVA)
Significant Surrogate Variables | Phenotype of interest | age at diagnosis | TCGA batch | tissue source site | anatomic neoplasm subdivision | ajcc pathologic tumor stage
1 | 0.00054 | 0.70767 | 0.22121 | 0.56843 | 0.57395 | 0.83765
2 | 0.00288 | 0.45119 | 0.35796 | 0.44187 | 0.06801 | 0.5264
3 | 0.13759 | 0.7693 | 0.49232 | 0.474 | 0.3279 | 0.79098
4 | 0.53001 | 0.54648 | 0.05478 | 0.11074 | 0.55215 | 0.50216
5 | 0.01475 | 0.93185 | 0.18196 | 0.2447 | 0.84541 | 0.38136
6 | 0.74174 | 0.0332 | 0.62394 | 0.39353 | 0.9453 | 0.78771
7 | 0.0492 | 0.58565 | 0.91446 | 0.81124 | 0.71186 | 0.86681
8 | 0.04651 | 0.95817 | 0.91115 | 0.79168 | 0.02937 | 0.464
Looking at potential confounders and indicators of batch (p-value for association with
each candidate ISV)
● 8 candidate ISVs, 3 associated with at least one variable (p<0.1)
● Would only include the 3 significant, selected ISVs
Independent Surrogate Variable
Analysis (ISVA)
● Gagnon-Bartsch and Speed Biostatistics (2012)
● Like SVA, an analysis that estimates and adjusts for unknown
surrogate variables
● Tries to address problem that ISVA tries to tackle, discerning the
unwanted variation from the biological variation that is of interest to
the researcher
● Restricts variation decomposition to negative control genes
● Requires negative control genes: genes whose expression levels
are known a priori to be truly unassociated with the biological factor
of interest
Remove Unwanted Variation-2
(RUV2)
Summary
Approach | Pros | Cons
ComBat | Appropriate when groups are heterogeneous and batch effect is known | Batch effect may be a complicated mixture of factors
SVA | Do not need to know the unmeasured confounders and may capture impact of cell mixture | Surrogate variables may capture heterogeneous phenotypes and/or model misspecification
ISVA | Avoids capturing meaningful biological variation | Need to have surrogates for potential confounders
RUV2 | Avoids capturing meaningful biological variation | Need to define a subgroup of probes that are not influenced by exposure of interest
● Distributions of methylation data are not always normal
○ use of transformations
○ option to remove probes with more than one mode
● Batch effects can be large and may be due to known or unknown
factors
○ use careful study design to minimise the impact of batch effects
○ use of post analysis methods to reduce batch effects
Summary
Cellular Heterogeneity
And why we care
● This section focusses on cellular heterogeneity in blood
○ most cohort studies are currently analysing data from blood samples
● Also relevant for data analysis in other tissues where cellular
heterogeneity is also present
○ currently less defined methods
● Why does cellular heterogeneity matter?
● What can we do about it?
Overview
● An issue for many population based studies
● DNA extracted from blood containing many cell types
○ Large source of variation in methylation data from blood
Cellular Heterogeneity in Blood
Heatmap of cell sorted 450k
data
Jaffe & Irizarry Genome
Biology (2014)
Cellular Heterogeneity in Blood
Why is it an issue?
● Bias if outcome of interest correlates with cell composition
○ Confounding by immunological profile
● Uninteresting variation: mediation by immunological profile
○ Reflects a previously known mechanism: real goal is to find differences in
methylation beyond the cell composition associations
○ Usually seen with environmental exposures
● Temporality is an issue/difficulty for cross-sectional data
○ e.g inflammation preceding or following cancer?
[Figure: two causal diagrams; the first relates cell distribution, methylation and infertility; the second relates PM2.5, cell distribution and methylation]
Cellular Heterogeneity in Blood
Flow cytometry
● Adjust for cell proportion or restrict analysis to one subtype
○ At discovery or validation/replication
HOWEVER:
○ expensive
○ time consuming
○ requires fresh samples (often impossible in cohort studies)
Gold standard
Correcting for Cellular
Heterogeneity in Blood
Houseman et al. BMC Bioinformatics (2012)
● General goal: use purified cells (“gold standard”) to build a model to
predict the distribution of leukocytes for the analysis of population data
○ Can also be used to predict the distribution of leukocytes in a single
sample given its DNA methylation profile
● Resembles regression calibration approach in measurement error
literature
○ Assumption of transportability - e.g. is the mechanism giving rise to
the measurement error the same in cord blood as it is in adult blood?
● Sorted WBCs in S0 from Houseman paper were run on the 27K array
○ Choose m sites with strongest association between methylation and
cell types based on F statistic to estimate cell proportions
Correcting for Cellular
Heterogeneity in Blood
[Figure: path diagram linking phenotype and covariates X to the measured DNA methylation (Y matrix) through a direct epigenetic response, $BX^{T}$, and an immune response, $M\Omega^{T}$]
Ω is the cell type proportions for each individual
$\Omega = X\Gamma + \Xi$
Γ: cell composition effects
Correcting for Cellular
Heterogeneity in Blood
• If a reference set is available, we can estimate the cell
composition effect M
• We can estimate the individual cell proportions Ω from the
methylation profile Y and the cell composition effects M
• Can then adjust for estimated cellular composition in
the subsequent analysis
$Y = BX^{T} + M\Omega^{T} + E$
where $BX^{T}$ is the direct effect and $M\Omega^{T}$ is the cell composition effect
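To make the estimation of Ω from Y and M concrete, here is a simplified base R sketch for a single sample. The reference profiles and cell type names are simulated, and the constraint handling (clip at zero, rescale to sum to one) is an assumed simplification standing in for a properly constrained fit; it is not the published procedure.

```r
set.seed(6)
n_cpg <- 100
n_types <- 3

# Simulated reference profiles M: mean beta per CpG (rows) and cell type (columns)
M <- matrix(runif(n_cpg * n_types), nrow = n_cpg,
            dimnames = list(NULL, c("CD4T", "Mono", "Gran")))

# One simulated whole-blood sample: a mixture of the references plus noise
true_omega <- c(CD4T = 0.2, Mono = 0.1, Gran = 0.7)
y <- as.vector(M %*% true_omega) + rnorm(n_cpg, sd = 0.01)

estimate_props <- function(y, M) {
  omega <- coef(lm(y ~ 0 + M))   # unconstrained least-squares projection onto M
  names(omega) <- colnames(M)
  omega <- pmax(omega, 0)        # crude stand-in for a non-negativity constraint
  omega / sum(omega)             # rescale to sum to one (assumed simplification)
}

omega_hat <- estimate_props(y, M)
round(omega_hat, 3)
```

The estimated proportions can then be included as covariates in the downstream model for each CpG.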
Compared to gold standards, this method has relatively high precision
Accomando et al. Genome Biology (2014)
Correlation between cell proportions estimated by DNA methylation and
proportions quantified by established methods among whole blood samples
from disease-free human donors
Correcting for Cellular
Heterogeneity in Blood
Houseman et al. BMC Bioinformatics (2012)
● Coefficients estimated using 27k data do not seem to work well for 450k data
○ Can rebuild Houseman approach using cell sorted 450k dataset
● Can be implemented using estimateCellCounts() from minfi package
Reinius et al. PLoS ONE (2012)
Cell populations isolated by magnetic-activated cell sorting, purified using
specific antibodies
Correcting for Cellular
Heterogeneity in Blood
[Figure: cell proportions predicted for Reinius (2012) samples; y-axis: predicted cell proportion]
Correcting for Cellular
Heterogeneity in Blood
Jaffe and Irizarry. Genome Biology
(2014)
● Proportions of cell types estimated
using DNA methylation
● Can see shifts in composition
associated with aging
○ Not (usually) the variation that
we are interested in describing
● Age is confounder in many
epidemiology studies
○ must consider impact on
cellular heterogeneity
Cellular Heterogeneity and Aging
Other tissues also contain a mixture of cells
● To correct for this must either: microdissect all samples, or create a
reference microdissected data set to rebuild Houseman approach
The creation of reference-free methods
● Houseman et al. Bioinformatics (2014)
○ Uses modified ISVA to identify latent components of the observed
methylation variation, assumed to capture differences in cell
distributions
● Zou et al. Nature Methods (2014)
○ Finds simplest combination of principal components with linear mixed
model that controls for test inflation
Cellular Heterogeneity in other
Tissues
Houseman et al. Bioinformatics (2014)
● Similar to ISVA and SVA, except makes an additional biological mixture assumption
○ Dependence of the latent structure of the error on the unknown, cell-specific
methylation matrix
library(RefFreeEWAS)
test<-RefFreeEwasModel(betas(methylumidata.bgcorr)[c(1:1000),], mod, 8) # example 1000 loci, 8 ISVs
testBoot <- BootRefFreeEwasModel(test,500) # 500 bootstrap datasets
# Significance of unadjusted estimates (Bstar)
BstarSE<-apply(testBoot[,2, "B*",],1,sd) # bootstrap standard error per locus
BstarT<-test$Bstar[,2]/BstarSE
BstarP<-2*pnorm(-abs(BstarT))
# Significance of adjusted estimates (Beta)
BetaSE<-apply(testBoot[,2, "B",],1,sd) # bootstrap standard error per locus
BetaT<-test$Beta[,2]/BetaSE
BetaP<-2*pnorm(-abs(BetaT))
Cellular Heterogeneity in other
Tissues
Houseman et al. Bioinformatics (2014)
Find that the reference
free approach in TCGA
samples reduces the
range of effect sizes,
and attenuates
significance of 1000
sample loci
● what variation is
this capturing?
Cellular Heterogeneity in other
Tissues
Zou et al. Nature Methods (2014)
● Factored spectrally transformed linear mixed model 'EWASher' (FaST-
LMM-EWASher)
● Reference free approach (does not estimate cell type composition)
● Computes the methylome similarity between each pair of samples in
the data set to get covariance. This is used in the linear mixed model
as an implicit proxy for cell-type composition in conjunction with
principal components.
● No issue of method portability as no reference set but if number of
true associations is large there is reduction in power
Cellular Heterogeneity in other
Tissues
Zou et al. Nature Methods (2014)
Important note from co-author Martin Aryee on PubMed Commons:
“EWASher (Zou J, 2014) is intended to be used in EWAS settings where the
primary interest is in identifying localized differentially methylated regions (i.e.
DMRs that affect only a small fraction of methylation sites).* The results of
EWASher should be interpreted with caution in settings where large-scale
methylation changes are expected and/or of interest. The method assumes that
large-scale changes are caused by cell type composition effects and will
effectively remove these changes from consideration. This is useful in many
EWAS settings, but the assumption may not hold when studying cancer or
differences between tissues. In the cancer dataset used in our paper, for
example, we specifically identify site-specific changes that are above and beyond
global hypomethylation changes”
*my bolding
Cellular Heterogeneity in other
Tissues
● 354 rheumatoid arthritis cases and 312 controls across 103,638 loci
○ a. QQ plot from the unadjusted model
○ b. QQ plot with cell-type composition covariates included in
the model (Houseman)
○ c. QQ plot using EWASher
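QQ plots like these compare observed p-values with those expected under no association. A minimal base-R sketch follows; the p-values are simulated placeholders (an assumption), and the genomic inflation factor `lambda` is added as one common summary of the plot rather than something taken from the slide.

```r
# Placeholder p-values; in practice, one p-value per tested locus.
set.seed(42)
pvals <- runif(1000)

# Observed versus expected -log10(p) under the null of no association.
n <- length(pvals)
observed <- -log10(sort(pvals))
expected <- -log10(ppoints(n))

# Genomic inflation factor: median observed chi-square statistic (1 df)
# divided by its expected median under the null.
chisq <- qchisq(pvals, df = 1, lower.tail = FALSE)
lambda <- median(chisq) / qchisq(0.5, df = 1)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(expected, observed,
       xlab = "Expected -log10(p)", ylab = "Observed -log10(p)",
       main = sprintf("QQ plot (lambda = %.2f)", lambda))
  abline(0, 1, col = "red")
}
```

Points rising above the diagonal across most of the range, or a `lambda` well above 1, suggest systematic inflation such as unadjusted cell-type composition.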
Cellular Heterogeneity in other
Tissues
● Cellular heterogeneity is a particular issue for cohort studies where
cell counts are unknown
● Heterogeneity could cause bias in results
● Cell count related hits may not be of interest
● ‘Best practice’ not yet established
● Consideration for reference-free approaches: they assume the major
determinant of variation is cell composition
○ this may not be true, or may not be true for all tissues
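How cell composition can bias results may be easier to see in a small simulation. The sketch below is purely illustrative: the sample size, effect sizes and variable names (`case`, `cell_prop`, `meth`) are all assumptions, and methylation is made to depend on cell composition only.

```r
set.seed(7)
n <- 200
case <- rep(c(0, 1), each = n / 2)

# Hypothetical proportion of one cell type, shifted upward in cases.
cell_prop <- 0.4 + 0.2 * case + rnorm(n, sd = 0.05)

# Methylation at this locus depends on cell composition, not case status.
meth <- 0.3 + 0.5 * cell_prop + rnorm(n, sd = 0.02)

# Model without and with the cell-type covariate.
unadjusted <- lm(meth ~ case)
adjusted   <- lm(meth ~ case + cell_prop)

# Estimate, standard error, t value and p-value for the case term.
coef_unadj <- coef(summary(unadjusted))["case", ]
coef_adj   <- coef(summary(adjusted))["case", ]
```

Comparing `coef_unadj` with `coef_adj` shows the apparent case effect shrinking once cell composition is in the model, which is the kind of 'cell count related hit' that may not be of interest.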
Summary

Data basics

  • 1. Data Basics: From Raw to Normalized Data Day 2
  • 2. Briefly: Bisulfite Conversion Krueger et al. (2012) Nature Methods
  • 4. Oligos (~800,000/bead) Bead Type I: two bead types per CpG site (unmethylated and methylated) Type II: one bead type per CpG site DNA preparation, hybridisation, staining. Detection of red and green fluorescent signals by iScan 450K Array Beads self- assemble into the pits on the array 450k chip: 12 samples per chip Multiple beads per bead type on each chip Combined to bead pools (all bead types)
  • 5. ● Analyzes >480,000 CpG loci ● Covers 99% of all RefSeq genes, average of 17 probes per gene ● Distributed over various functional elements such as: ○ CpG islands, shores, and shelves ○ 3´- and 5´-UTRs, gene bodies ○ DNAse hypersensitive sites ○ miRNA promoters Dedeurwaerder et al. Epigenomics (2011) 450K Array
  • 6. 450K Array: QC Probes Control Probe Purpose Staining measure efficiency and sensitivity of staining step (independent of hybridisation/extension step) Extension test efficiency of extension of A, T, C and G nucleotides from a hairpin probe ( sample-independent). The perfect match hairpin controls should result in high signal, and the mismatch probes in low signal Hybridization test the overall performance of Infinium assay using synthetic targets (not DNA) at 3 concentrations Target removal test efficiency of stripping step after extension Bisulphite conversion test efficiency of bisulphite conversion by query of C/T polymorphism Specificity controls check for non-specific detection of methylation signal over unmethylated background. Specificity controls are designed against non-polymorphic T sites (G/T mismatch) Non-polymorphic query a non polymorphic base A, T, C and G to test overall performance of the assay from amplification to detection Negative randomly permutated bisulphite-converted sequences containing no CpGs. They should not hybridise to DNA. The mean of these probes determines the system background
  • 7. Type I Type II Same chemistry as 27k array Chemistry not seen on 27k array 28% of probes on array 72% of probes on array Designed for regions with more CG dinucleotides - 57% of Type I probes lie in CpG islands 21%, 26% and 11% of type II probes lie in CpG islands, shores and shelves respectively Suggested to be more stable and reproducible than the signals provided by Type II probes Have a decreased quantitative dynamic range compared to Type I probes For either probe design, intensities are used to estimate Beta-value or M-value 450k Array: Type I vs Type II Probes
  • 8. Uses fluorescence from two different probes, unmethylated (converted) and methylated (unconverted), to assess the level of methylation of a target CpG ● Binding at either probe is followed by single base extension that results in the addition of a fluorescently labeled nucleotide Dedeurwaerder et al. Epigenomics (2011) 450K Array: Type I Probes
  • 9. Methylation state is detected by single base extension and detection of a fluorescently labelled nucleotide at the position of the 'C' of the target CpG ● Type II probes include a 'degenerate' R-base at any underlying CpG sites in the probe body Dedeurwaerder et al. Epigenomics (2011) 450K Array: Type II Probes
  • 11. ● There are a number of R packages that incorporate QC, filtering and normalisation into their pipelines or offer specific functions ● Most allow you to define at least some thresholds yourself ● Option to pick and choose minfi ChAMP RnBeads lumi Touleimat & Tost wateRmelon Outline
  • 12. AIM: identify unusual samples and technical artifacts The array contains 65 single nucleotide polymorphisms (SNPs) ● We can use this info to identify any unintentional duplicated samples ● If we have multiple samples per individual, their samples should cluster Raw Data: Initial QC
  • 13. AIM: identify unusual samples and technical artifacts The array uses red/green fluorescence intensities to estimate methylation level - the two colour channels have different background intensities ● Type I probes use the same colour to evaluate methylated and unmethylated probes - should have less of an impact ● Type II uses green to measure methylated state and red for unmethylated - colour bias can contribute to decreased dynamic range Raw Data: Initial QC
  • 14. AIM: identify unusual samples and technical artifacts As the fluorescence intensities are read across the chip - there appears to be a tiering effect ● This technical artifact could impact the results ○ e.g. cases unintentionally clustered on the array Raw Data: Initial QC
  • 15. Plot Distributions of the samples ● Unusual distributions may reflect: ○ Real biological effects (global methylation changes) ○ Poor methylation data AIM: identify unusual samples and technical artifacts Different colour/line combination for each sample Red indicates primary tumor, blue is adjacent normal tissue Raw Data: Initial QC
  • 16. Plot Distributions of the samples ● Boxplots or violin plots (below) can be used to the same effect AIM: identify unusual samples and technical artifacts Raw Data: Initial QC
  • 17. AIM: identify unusual samples and technical artifacts Colour corresponds to tumor status Colour corresponds to TCGA batch Can use multidimensional scaling to look for unusual clustering of samples Raw Data: Initial QC
  • 18. By plotting the distribution of Type I and Type II probes separately we can observe the difference in distribution - example four samples below ● Reflects: difference in chemistry and enrichment for different elements (e.g. CpG islands) AIM: identify unusual samples and technical artifacts Raw Data: Initial QC
  • 19. ● Each data point has an associated detection p-value ○ Represents the probability the target signal was distinguishable against background noise ● Scanner can encounter difficulties reading signal ○ Low staining intensities ○ Spatial artefacts Common approaches: ● Drop probes that failed in nth% of samples ○ Common thresholds are 20%, 10%, 5% of probes at >0.05, >0.01. ● Drop samples that failed in nth% of probes ○ Common thresholds are 50%, 20% at >0.05, 0.01. Filtering: Detection P-value
  • 20. Drop those with known SNPs residing in the probe sequence Most common SNPs in dbSNP are C>T transitions ● C>T transitions will be read as an unmethylated cytosine Observe grouping of methylation values by genotype Filtering: Common Practices Related to Technical Issues
  • 21. ● ~4.3% of the probes are reported to contain a known polymorphism specifically at the targeted C or G ○ 43% of these SNPs have a heterozygosity of >0.1 ○ Price et al. Epigenetics Chromatin (2013) ● SNP filtering depends on the study population / reference genome (eg CEU) Drop those for which the CpG site contains a SNP Filtering: Common Practices Related to Technical Issues
  • 22. Drop those in which probes anneal to multiple genomic locations ● Bisulfite conversion reduces the complexity of the genome All unmethylated Cs converted to T ● ~10-20% of the Infinium HumanMethylation450 probes have been identified as non-specific depending on the criteria ● Repetitive elements - may be a real signal, but uncertain meaning and validation difficult ● Probes cross-reactive to X chromosome ○ pick up X inactivation, leading to spurious association if outcome/exposure associated with sex Naeem et al. BMC Genomics (2014) Filtering: Common Practices Related to Technical Issues
  • 23. Common practices related to analysis issues ● drop those on X and Y chromosomes ● drop those with lowest variation ● drop those with extreme methylation levels (eg median = 0% or 100%) ● only consider those in regions of interest (eg CpG island, shore, other) Filtering: Common Practices Related to Technical Issues
  • 24. Colour bias adjustment and Background correction ● Can adjust for colour bias in the lumi package using either: ○ Smooth quantile or shift-and-scaling normalisation ● Most methods employ simple background subtraction ○ No significant improvement in data quality ● New method in methylumi package outperforms previous methods ○ Uses “out-of-band” signal from type I probes to estimate background rather than background control probes ■ Out-of-band - colour channel opposite their designed base extension ■ Only a few background control probes (n=614), but many Type I probes Triche et al. Nucleic Acids Res. (2013) Normalization: Within-Array
  • 25. Colour bias adjustment and Background correction Lumi approach, requires starting with a MethyLumiM object ● How to perform the quantile normalisation method Methylumi approach from Triche et al. Nucleic Acids Res. (2013), requires starting with a MethylumiSet object ● Using ‘noob’ method (noob=normal-exponential using out-of-band probes), a convolution that assumes data.bgcorrect<-adjColorBias.quantile(lumidata) Signal Intensity + Background = Observed foreground intensity data.bgcorrect<-methylumidata(methylumidata) Normalization: Within-Array
  • 26. Probe type correction ● The identified shift between type I and type II β-values may induce a bias in the analysis if the methylation signals corresponding to the two types of assays are analyzed together ○ type I probes have greater stability, increased power ● Can’t just perform full quantile normalisation ○ The population to ‘correct’ (type II) is the larger group: may bias distribution of type I probes ○ Each probe type covers different CpG and gene-sequence regions Normalization: Within-Array
  • 27. Probe type correction options Subset quantile normalisation Touleimat & Tost, Epigenomics (2012) ● For each probe category, use type I signals as the anchors to estimate a reference distribution of quantiles ● use this reference to estimate a target distribution of quantiles for type II probes ○ two different annotations for subsetting: ■ ‘relation to CpG’ ■ ‘relation to gene sequence’ Normalization: Within-Array
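The anchor-and-map idea can be illustrated with a small base R sketch. It is a simplification under stated assumptions: a single probe category, simulated beta values, and plain empirical quantiles; the helper `map_to_reference` is hypothetical and is not the Touleimat & Tost code.

```r
# Map Type II values onto the quantiles of a Type I reference distribution.
map_to_reference <- function(type2, type1_ref) {
  # Position of each Type II value within its own distribution (0..1)
  p <- (rank(type2, ties.method = "average") - 0.5) / length(type2)
  # Read off the Type I reference quantile at the same position
  quantile(type1_ref, probs = p, names = FALSE)
}

set.seed(2)
type1 <- rbeta(500, 0.5, 0.5)                # simulated Type I betas (full range)
type2 <- 0.1 + 0.8 * rbeta(2000, 0.5, 0.5)   # simulated, compressed Type II betas
type2_adj <- map_to_reference(type2, type1)
```

The mapping preserves the rank order of the Type II probes while giving them the range of the Type I anchors; in the published method this is done separately within each annotation category.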
  • 28. Probe type correction options Subset-quantile Within Array Normalisation (SWAN) Maksimovic et al. Genome Biology (2012) ● Assume that the overall intensity distribution should be the same when the underlying CpG contents of the probes are the same ○ in other words, assume the CpG content of the probes reflects the biology by being a surrogate for the CpG density of the region Normalization: Within-Array
  • 29. Probe type correction options Beta-mixture quantile (BMIQ) normalisation method. Teschendorff et al. Bioinformatics (2013) ● Major benefit over subset normalisation methods is that it is assumption free ○ State membership of individual probes is determined by maximum probability ● Approach: ○ Fits a three-state (unmethylated, hemimethylated, fully methylated) beta mixture model to the Type I and Type II probes separately ○ For each state, transforms probabilities of belonging to the state to quantiles using the inverse of the cumulative beta distribution with beta parameters estimated from the Type I probes ● Model-based method helps to avoid having gaps emerge in normalized distribution Normalization: Within-Array
  • 30. Red/green intensities following normalisation [Figure: red and green intensities, before and after normalisation] Normalization: Within-Array
  • 31. [Figure: distribution of M-values and distribution of Beta-values for one sample, before and after normalisation] Normalization: Within-Array
  • 32. ● Aim is to remove other technical artifacts, e.g. position tiering of intensities ○ reflects quantile normalisation approaches for gene expression ● Normalisation of intensities (not betas) ● Assumes the same global distributions between the samples ○ this may not be true [Figure: intensities before and after normalisation] Normalization: Between-Array
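For readers comparing the two scales, the usual relationship between Beta-values and M-values can be written in a few lines of R. The slide does not spell out the formula; the logit (base 2) definition below is the standard one and the function names are illustrative.

```r
# Beta-value: proportion methylated (0..1). M-value: log2 ratio of
# methylated to unmethylated signal (unbounded).
beta_to_m <- function(beta) log2(beta / (1 - beta))
m_to_beta <- function(m) 2^m / (2^m + 1)

beta <- c(0.1, 0.5, 0.9)
m    <- beta_to_m(beta)
back <- m_to_beta(m)   # round trip returns the original betas
```

A Beta-value of 0.5 maps to an M-value of 0, and values near 0 or 1 are stretched out on the M scale.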
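A minimal base R sketch of between-array quantile normalisation follows, using simulated intensities. It is meant only to show the mechanics (every sample is forced onto the same distribution) and is not the code of any particular package; `quantile_normalise` is a hypothetical helper.

```r
# Quantile normalisation: replace each value by the mean, across samples,
# of the values holding the same rank.
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))   # mean of each order statistic
  out    <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(3)
intensities <- cbind(a = rexp(100, 1 / 1000),
                     b = rexp(100, 1 / 1500),
                     c = rexp(100, 1 / 800))
normalised <- quantile_normalise(intensities)
```

After this step all columns share identical sorted values, which is exactly why the approach is questionable when samples are expected to differ globally.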
  • 33. Normalization: Between-Array Functional Normalization of 450k Methylation Array. Fortin et al. Genome Biology (2014) ● Created to offer a means of removing between-array unwanted technical variation even in the case of global methylation changes ● Possible to apply to cancer data or tissue differences ● Uses control probes to act as surrogates for unwanted variation ● Control probes: uses 848 control probes ● None are used to measure biological signal ● PCA of control probes, removes variation associated with first two PCs by default ● Functional normalization extends quantile normalization by adjusting for known covariates measuring unwanted variation ● The normalization procedure is applied to the Meth and Unmeth intensities separately, and to type I and type II signals separately
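The core idea (regress per-sample quantiles on control-probe principal components and remove the fitted variation) can be sketched in base R. This is a heavily simplified illustration on simulated data, not the minfi implementation: it ignores the separate handling of Meth/Unmeth and Type I/II signals, and all object names are hypothetical.

```r
set.seed(4)
n_samples <- 30
unwanted  <- rnorm(n_samples)   # hidden technical factor
# 50 simulated control probes that track the technical factor
controls  <- outer(unwanted, runif(50, 0.5, 1.5)) +
  matrix(rnorm(n_samples * 50, sd = 0.2), n_samples)

# Per-sample quantiles of simulated intensities, shifted by the technical factor
probs  <- seq(0.05, 0.95, by = 0.05)
intens <- t(sapply(unwanted, function(u)
  quantile(rnorm(2000, mean = 10 + 0.5 * u), probs)))

pcs <- prcomp(controls, center = TRUE)$x[, 1:2]   # first two control-probe PCs
fit <- lm(intens ~ pcs)                           # one regression per quantile
# Keep the residual variation and add back the overall mean of each quantile
adj <- residuals(fit) +
  matrix(colMeans(intens), n_samples, length(probs), byrow = TRUE)
```

Because no biological signal enters the control probes, the variation removed here is, by construction, only what the control-probe PCs can explain.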
  • 34. Normalization: Between-Array Functional Normalization of 450k Methylation Array. Fortin et al. Genome Biology (2014) ● Like an unsupervised batch correction method ● Suggested to outperform supervised batch correction methods (e.g. ComBat, SVA and RUV)
  • 35. Normalization: Data-driven Approaches A data-driven approach to preprocessing Illumina 450K methylation array data. Pidsley et al. BMC Bioinformatics (2013) ● Use three independent metrics, based on known methylation patterns, to test the performance of different normalization and background correction schemes ● Assess patterns associated with: ○ Genomic imprinting: ‘DMRSE’ (i.e. Differentially Methylated Regions Standard Error) ○ X-chromosome inactivation (XCI): ‘Seabird’ (named after the auk and also the mythical bird roc, a play on AUC and ROC) ○ SNP genotyping assays present on the array: ‘GCOSE’ (Genotype Combined Standard Error)
  • 36. Normalization: Data-driven Approaches A data-driven approach to preprocessing Illumina 450K methylation array data. Pidsley et al. BMC Bioinformatics (2013)
      Method | Background adjustment | Between-array normalization | Dye bias correction
      naten  | no  | typ1 and typ2 together   | no
      nanet  | no  | no                       | typ1 and typ2 together
      nanes  | no  | no                       | typ1 and typ2 separately
      danes  | yes | no                       | typ1 and typ2 separately
      danet  | yes | no                       | typ1 and typ2 together
      danen  | yes | no                       | no
      daten1 | yes | typ1 and typ2 together   | no
      daten2 | yes | typ1 and typ2 together   | no
      nasen  | no  | typ1 and typ2 separately | no
      dasen  | yes | typ1 and typ2 separately | no
  • 37. Normalization: Data-driven Approaches
      method | TypeI | TypeII | Average
      raw    | 6.5   | 11     | 8.75
      betaqn | 14    | 13     | 13.5
      naten  | 12    | 9      | 10.5
      nanet  | 11    | 3      | 7
      nanes  | 9.5   | 7.5    | 8.5
      danes  | 2.5   | 7.5    | 5
      danet  | 1     | 6      | 3.5
      danen  | 5     | 12     | 8.5
      daten1 | 4     | 4      | 4
      daten2 | 8     | 5      | 6.5
      nasen  | 9.5   | 1.5    | 5.5
      dasen  | 2.5   | 1.5    | 2
      fuks   | 6.5   | 15     | 10.75
      tost   | 13    | 14     | 13.5
      swan   | 15    | 10     | 12.5
      Tested 15 pre-processing methods across 11 methylation datasets using the three performance metrics
      • For each dataset get mean of three ranks across methods
      • Then get the mean of ranks across the datasets
      • “dasen” appears to do the best across probe types
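The final averaging step can be reproduced directly from the rank table on this slide. The snippet below only re-derives the Average column and the best-ranked method from the TypeI and TypeII ranks shown above; it does not recompute the underlying metrics.

```r
# Mean ranks per probe type, copied from the slide
ranks <- data.frame(
  method = c("raw", "betaqn", "naten", "nanet", "nanes", "danes", "danet",
             "danen", "daten1", "daten2", "nasen", "dasen", "fuks", "tost", "swan"),
  TypeI  = c(6.5, 14, 12, 11, 9.5, 2.5, 1, 5, 4, 8, 9.5, 2.5, 6.5, 13, 15),
  TypeII = c(11, 13, 9, 3, 7.5, 7.5, 6, 12, 4, 5, 1.5, 1.5, 15, 14, 10),
  stringsAsFactors = FALSE
)
# Average rank across the two probe types (lower is better)
ranks$Average <- rowMeans(ranks[, c("TypeI", "TypeII")])
best <- ranks$method[which.min(ranks$Average)]
```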
  • 38. ● MANY ways to perform initial quality control and pre-processing ○ Consider the samples used ■ e.g. between-array normalisation may not be appropriate for cancer samples GOAL: identify failed samples and reduce impact of technical artifacts without removing meaningful biological variation Marabita et al. Epigenetics (2013) Summary
  • 39. Summary A few of the within-array normalization procedures improved the concordance between the 450k data and the pyrosequencing data • Marked improvement using SQN, noob, and BMIQ • Blue, orange and red indicate Infinium typeI/II bias correction methods, color bias adjustment and background correction methods, respectively. Dedeurwaerder et al. Briefings in Bioinformatics (2013)
  • 40. Summary Dedeurwaerder et al. Briefings in Bioinformatics (2013) • HCT116 data: more global differences, performed worse with between-array normalization • Roessler’s data: no improvement with between-array normalization
  • 41. ● Additional considerations: Filter a priori? ○ If remove loci with little inter-sample variability, may miss loci with small, but very significant effect sizes ○ May be SNP in probe, but SNP has a minor allele frequency too low to impact associations with methylation ○ But removing these sites reduces the number of comparisons we need to account for when adjusting for multiple testing GOAL: identify failed samples and reduce impact of technical artifacts without removing meaningful biological variation Summary
  • 42. Data Issues Considerations for High-Level Analysis
  • 43. Identifying Batch Effects You may identify batch effects in your data that have not been removed by normalisation ● Batch effects are systematic differences between subgroups of samples that are not related to biological or other variables in the study, for example: ○ chips that were run on separate days ○ bisulphite modifications that were performed in different batches
  • 44. Approaches to Remove Batch Effects ● Can adjust for batch in downstream analysis (e.g. in regression) ○ Has been done in some published articles, but may not effectively deal with the batch issue ● Options to address batch effect: ○ ComBat ○ SVA ○ ISVA ○ RUV2
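Adjusting for batch inside the regression model can be illustrated with a small simulation in base R. The set-up below is an assumption for illustration only: methylation at one site depends on batch but not on group, and batch agrees with group about 80% of the time.

```r
set.seed(5)
n     <- 200
group <- rep(c(-1, 1), each = n / 2)
agree <- rbinom(n, size = 1, prob = 0.8)
batch <- group * agree - group * (1 - agree)   # batch matches group ~80% of the time
y     <- 5 * batch + rnorm(n)                  # methylation driven by batch only

# Group effect without and with batch as a covariate
unadjusted <- coef(lm(y ~ group))["group"]
adjusted   <- coef(lm(y ~ group + factor(batch)))["group"]
```

In this idealised case the spurious group effect disappears once batch is in the model; with real data the batch variable may be unknown, mismeasured or non-linear, which is where the dedicated methods listed above come in.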
  • 45. ComBat Johnson et al. Biostatistics (2007) ● Linear model for batch effects that uses an Empirical Bayes method to estimate the batch effects ○ Instead of a full Bayesian approach, uses Empirical Bayes methods to estimate the hyper-parameters from the data ■ Helps in small sample sizes by borrowing information across genes [Figure: data before and after ComBat]
  • 46. ComBat Johnson et al. Biostatistics (2007) ● Works best when: ○ Small sample size ○ Known batch effects ○ Linear batch effects ● Disadvantages: ○ Computationally intensive ○ Only corrects for batch effects from known sources ○ Assumption of linear effects and normality may be violated In other words, not great for large studies with complicated batch effects ComBat function in SVA package
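The "borrowing information across genes" step can be sketched as a normal-normal shrinkage of per-gene batch shifts. This is a simplified sketch under stated assumptions (one batch, standardised data with residual variance 1, additive effects only); the real ComBat also shrinks multiplicative batch effects and iterates between the two. All names are illustrative.

```r
set.seed(6)
n_genes <- 1000
n_batch <- 4                                   # only a few samples in this batch
true_shift <- rnorm(n_genes, mean = 0.5, sd = 0.3)
# Standardised data for one batch: rows are genes, columns are samples
batch_data <- matrix(rnorm(n_genes * n_batch, mean = true_shift), n_genes)

gamma_hat <- rowMeans(batch_data)   # per-gene estimate of the batch shift
gamma_bar <- mean(gamma_hat)        # prior mean, estimated across all genes
tau2      <- var(gamma_hat)         # prior variance (method of moments)

# Posterior mean: weighted average of the gene's own estimate and the prior mean
gamma_star <- (n_batch * tau2 * gamma_hat + gamma_bar) / (n_batch * tau2 + 1)

mse_raw    <- mean((gamma_hat - true_shift)^2)
mse_shrunk <- mean((gamma_star - true_shift)^2)
```

With only four samples per batch, the shrunken estimates sit closer to the simulated true shifts than the raw per-gene means, which is the small-sample benefit the slide refers to.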
  • 47. ComBat
      Input
      ● matrix containing methylation data (M-values)
          mdata
      ● vector indicating the batch variable to adjust for
          batch <- samplepheno$TCGABATCH
      ● model matrix containing the full model
          mod <- model.matrix(~as.factor(TUMOR), data=samplepheno)
      ● null model (in this case no other covars so only the intercept)
          mod0 <- model.matrix(~1, data=samplepheno)
      Run ComBat
          library(sva)
          # numCovs is omitted: it is not an argument of recent sva releases
          combat_mdata <- ComBat(dat=mdata, batch=batch, mod=mod, par.prior=TRUE)
  • 48. ComBat
      Output
      ● Returns a corrected matrix with the same dimensions as your original dataset with batch effects removed
      ● Run significance analysis on the adjusted data
          pValuesComBat <- f.pvalue(combat_mdata, mod, mod0)
          qValuesComBat <- p.adjust(pValuesComBat, method="BH")
      OR
      ● run analysis model using the adjusted data
          # cpg.assoc (CpGassoc package) takes the data matrix and the variable of interest as separate arguments
          result <- cpg.assoc(combat_mdata, as.factor(samplepheno$TUMOR))
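The Benjamini-Hochberg step used above is available in base R and is easy to check on a toy vector. The four p-values below are made up for illustration.

```r
# Benjamini-Hochberg adjustment: each sorted p-value is multiplied by
# (number of tests / rank), then made monotone from the largest downwards.
p_values <- c(0.001, 0.01, 0.03, 0.5)
q_values <- p.adjust(p_values, method = "BH")
significant <- q_values < 0.05
```

On a 450K array the same call is applied to hundreds of thousands of p-values, so the multiplier for the smallest p-value is correspondingly large.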
  • 49. ComBat: Simulation
          # Group of interest and a batch variable that agrees with it ~80% of the time
          group = rep(c(-1,1),each=20)
          coinflip = rbinom(40,size=1,prob=0.8)
          batch = group*coinflip + -group*(1-coinflip)
          # Batch affects every probe; group affects none
          gcoeffs = rep(0,10000)
          bcoeffs = rnorm(10000,sd=2)
          coeffs = cbind(bcoeffs,gcoeffs)
          mod = model.matrix(~-1 + batch + group)
          modelprojected <- coeffs%*%t(mod)
          # Add noise to each probe
          dat0 <- t(apply(modelprojected,1,function(x) x+rnorm(ncol(modelprojected),sd=1)))
          par(mfrow=c(2,1))
          plot(group,main=expression(bold("Group")),pch=16)
          plot(batch,main=expression(bold("Batch")),pch=16)
      Batch is strongly associated with Group, but group does not have a direct impact on outcome
  • 50. ComBat: Simulation
          library(limma)   # lmFit, eBayes
          library(sva)     # ComBat
          ## Set null and alternative models (ignore batch)
          mod1 = model.matrix(~group)
          mod0 = cbind(mod1[,1])
          par(mfrow=c(2,1))
          # Unadjusted analysis
          fit <- lmFit(dat0, mod1)
          fit <- eBayes(fit)
          hist(fit$p.value[,"group"], xlab="p-value", main=expression(bold("Unadjusted")), col="orange")
          # Analysis after ComBat adjustment
          combatresults <- ComBat(dat0, batch=batch, mod=mod1, par.prior=TRUE)
          fit <- lmFit(combatresults, mod1)
          fit <- eBayes(fit)
          hist(fit$p.value[,"group"], xlab="p-value", main=expression(bold("ComBat")), col="purple")
      Still some residual confounding, but much better than unadjusted analysis
  • 51. ● Leek and Storey PLoS Genetics (2007) ● Used to identify and estimate surrogate variables for unknown, unmodeled or latent sources of noise ○ Appropriate when there are many known or unknown confounders ○ May not be appropriate if the biological groups of interest are heterogeneous ■ e.g. comparing cancer cases and controls where there are different cancer subgroups, as we do not want to lose this variation Surrogate Variable Analysis (SVA)
  • 52. Surrogate Variable Analysis (SVA)
      Step 1: Obtain the residual matrix (remove variation associated with variables of interest), calculate the singular value decomposition (SVD) of the residual matrix, and test whether the singular vectors represent more variation than expected by chance
      Step 2: Identify the subset of genes driving each orthogonal signature of the variation
        ● Test association between each probe and each singular vector of the SVD
      Step 3: For each of these subsets, build a surrogate variable (SV) based on the full signature of that subset in the original data
        ● Allows the SVs to be correlated with the primary variables
      Step 4: Include all significant SVs as covariates in subsequent regression analyses
  • 53. Leek and Storey PLoS Genetics (2007) [Figure: example of expression heterogeneity; genes by arrays heatmap with the primary variable value and the unmodeled factor value plotted across arrays] Surrogate Variable Analysis (SVA)
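Step 1 can be sketched in base R on simulated data. This covers only the residual matrix and its SVD; the permutation test, probe subsetting and surrogate-variable construction of Steps 1 to 3 are not reproduced, and every object name and simulated effect size is an assumption for illustration.

```r
set.seed(7)
n <- 40     # samples
p <- 500    # probes
group  <- rep(c(0, 1), each = n / 2)   # variable of interest
hidden <- rnorm(n)                     # unmodelled factor
# probes x samples: hidden factor + small group effect + noise
dat <- outer(rnorm(p), hidden) + outer(rnorm(p, sd = 0.2), group) +
  matrix(rnorm(p * n), p, n)

mod <- model.matrix(~ group)
# Residual matrix: project out the variable of interest from every probe
hat   <- mod %*% solve(crossprod(mod)) %*% t(mod)
resid <- dat %*% (diag(n) - hat)

sv <- svd(resid)
candidate_sv  <- sv$v[, 1]                  # first right singular vector (one value per sample)
var_explained <- sv$d^2 / sum(sv$d^2)       # share of residual variation per singular vector
```

In this simulation the first singular vector of the residuals closely tracks the hidden factor, which is what makes it a useful starting point for a surrogate variable.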
  • 54. Surrogate Variable Analysis (SVA)
      Input
      ● Matrix containing methylation data
          betas(methylumidata)
      ● Model matrix containing the full model
          mod <- model.matrix(~as.factor(tumor), data=pData(methylumidata))
      ● Null model (in this case no other covars, so only the intercept)
          mod0 <- model.matrix(~1, data=pData(methylumidata))
      Run sva
          library(sva)
          # na.omit drops probes with missing values in any sample
          sva_output <- sva(dat=na.omit(betas(methylumidata)), mod=mod, mod0=mod0)
      Main Output
      ● Can adjust for sva_output$sv in model - a matrix of surrogate variables
  • 55. ● Teschendorff et al. Bioinformatics (2011) ● Developed due to potential issues with SVA ○ Surrogate variables may capture heterogeneous phenotypes and/or model misspecification ■ i.e. the residual variation may contain biologically relevant variation ● If potential confounders are known (either exactly or subject to error/uncertainty) ISVA will select only those independent components that correlate with the confounders ○ Otherwise similar approach as SVA - removes the variation in the data matrix not associated with the phenotype of interest, and performs Independent Component Analysis (ICA) on this residual variation matrix ○ BUT only keeps ISVs that are associated with putative confounders Independent Surrogate Variable Analysis (ISVA)
  • 56. Independent Surrogate Variable Analysis (ISVA)
      Input
      ● Matrix containing methylation data
          methyldata <- na.omit(betas(methylumidata))
      ● Vector of the phenotype of interest (only takes numeric data)
          binarytumor <- rep(0, ncol(methylumidata))
          binarytumor[pData(methylumidata)$tumor=="yes"] <- 1
      ● Matrix of potential confounding factors (may be numeric or categorical)
          factors.m <- pData(methylumidata)[,c("age_at_initial_pathologic_diagnosis", "TCGAbatch", "tissue_source_site", "anatomic_neoplasm_subdivision", "ajcc_pathologic_tumor_stage")]
      Run ISVA
          # factor.log needs one entry per column of factors.m (TRUE = categorical);
          # the original call listed six entries for five columns
          isva.o <- DoISVA(methyldata, pheno.v=binarytumor, factors.m, factor.log=c(FALSE,TRUE,TRUE,TRUE,TRUE), pvthCF=0.1, th=0.001)
  • 57. Independent Surrogate Variable Analysis (ISVA)
      Significant Surrogate Variables: looking at potential confounders and indicators of batch (p-value for association with each candidate ISV)
      ISV | Phenotype of interest | age at diagnosis | TCGA batch | tissue source site | anatomic neoplasm subdivision | ajcc pathologic tumor stage
      1   | 0.00054 | 0.70767 | 0.22121 | 0.56843 | 0.57395 | 0.83765
      2   | 0.00288 | 0.45119 | 0.35796 | 0.44187 | 0.06801 | 0.5264
      3   | 0.13759 | 0.7693  | 0.49232 | 0.474   | 0.3279  | 0.79098
      4   | 0.53001 | 0.54648 | 0.05478 | 0.11074 | 0.55215 | 0.50216
      5   | 0.01475 | 0.93185 | 0.18196 | 0.2447  | 0.84541 | 0.38136
      6   | 0.74174 | 0.0332  | 0.62394 | 0.39353 | 0.9453  | 0.78771
      7   | 0.0492  | 0.58565 | 0.91446 | 0.81124 | 0.71186 | 0.86681
      8   | 0.04651 | 0.95817 | 0.91115 | 0.79168 | 0.02937 | 0.464
      ● 8 candidate ISVs, 3 associated with at least one variable (p<0.1)
      ● Would only include the 3 significant, selected ISVs
  • 58. ● Gagnon-Bartsch and Speed Biostatistics (2012) ● Like SVA, an analysis that estimates and adjusts for unknown surrogate variables ● Tries to address the same problem that ISVA tackles: discerning the unwanted variation from the biological variation that is of interest to the researcher ● Restricts variation decomposition to negative control genes ● Requires negative control genes, i.e. genes whose expression levels are known a priori to be truly unassociated with the biological factor of interest Remove Unwanted Variation-2 (RUV2)
  • 59. Summary
      Approach | Pros | Cons
      ComBat | Appropriate when groups are heterogeneous and batch effect is known | Batch effect may be complicated mixture of factors
      SVA | Do not need to know the unmeasured confounders and may capture impact of cell mixture | Surrogate variables may capture heterogeneous phenotypes and/or model misspecification
      ISVA | Avoid capturing meaningful biological variation | Need to have surrogates for potential confounders
      RUV2 | Avoid capturing meaningful biological variation | Need to define a subgroup of probes that are not influenced by exposure of interest
  • 60. ● Distributions of methylation data are not always normal ○ use of transformations ○ option to remove probes with more than one mode ● Batch effects can be large and may be due to known or unknown factors ○ use careful study design to minimise the impact of batch effects ○ use of post analysis methods to reduce batch effects Summary
  • 62. ● This section focusses on cellular heterogeneity in blood ○ most cohort studies are currently analysing data from blood samples ● Also relevant for data analysis in other tissues where cellular heterogeneity is also present ○ currently less defined methods ● Why does cellular heterogeneity matter? ● What can we do about it? Overview
  • 63. ● An issue for many population based studies ● DNA extracted from blood containing many cell types ○ Large source of variation in methylation data from blood Cellular Heterogeneity in Blood
  • 64. Heatmap of cell sorted 450k data Jaffe & Irizarry Genome Biology (2014) Cellular Heterogeneity in Blood
  • 65. Why is it an issue? ● Bias if outcome of interest correlates with cell composition ○ Confounding by immunological profile ● Uninteresting variation: mediation by immunological profile ○ Reflects a previously known mechanism: real goal is to find differences in methylation beyond the cell composition associations ○ Usually seen with environmental exposures ● Temporality is an issue/difficulty for cross-sectional data ○ e.g. inflammation preceding or following cancer? [Diagrams: relationships among cell distribution, methylation and infertility; and among PM2.5, cell distribution and methylation] Cellular Heterogeneity in Blood
  • 66. Flow cytometry ● Adjust for cell proportion or restrict analysis to one subtype ○ At discovery or validation/replication HOWEVER: ○ expensive ○ time consuming ○ requires fresh samples (often impossible in cohort studies) Gold standard Correcting for Cellular Heterogeneity in Blood
  • 67. Houseman et al. BMC Bioinformatics (2012) ● General goal: use purified cells (“gold standard”) to build a model to predict the distribution of leukocytes for the analysis of population data ○ Can also be used to predict the distribution of leukocytes in a single sample given its DNA methylation profile ● Resembles regression calibration approach in measurement error literature ○ Assumption of transportability - e.g. is the mechanism giving rise to the measurement error the same in cord blood as it is in adult blood? ● Sorted WBCs in S0 from Houseman paper were run on the 27K array ○ Choose m sites with strongest association between methylation and cell types based on F statistic to estimate cell proportions Correcting for Cellular Heterogeneity in Blood
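Choosing the m most informative sites by F statistic can be sketched in base R. The data below are simulated, the cell type labels are placeholders, and the per-probe loop is written for clarity rather than speed; it is not the code used in the Houseman paper.

```r
set.seed(8)
# Simulated purified-cell reference: 3 cell types x 6 replicates each
cell_type <- factor(rep(c("CD4T", "Bcell", "Mono"), each = 6))
n_probes  <- 50
betas <- matrix(rnorm(n_probes * length(cell_type), mean = 0.5, sd = 0.02),
                nrow = n_probes)
# Make the first five probes cell-type specific
betas[1:5, cell_type == "Bcell"] <- betas[1:5, cell_type == "Bcell"] + 0.3

# One-way ANOVA F statistic for each probe (methylation ~ cell type)
f_stat <- apply(betas, 1, function(y) anova(lm(y ~ cell_type))[1, "F value"])

m <- 5
selected <- order(f_stat, decreasing = TRUE)[1:m]   # m sites with the largest F
```

Only the selected sites are then carried forward to estimate cell proportions in the population samples.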
  • 68. Correcting for Cellular Heterogeneity in Blood [Diagram: phenotype and covariates $X$ act on the measured DNA methylation matrix $Y$ through a direct epigenetic response, $BX^T$, and through an immune response, $M\Omega^T$ (cell composition effects); $X$ acts on $\Omega$ through $\Gamma$] $\Omega$ is the cell type proportions for each individual: $\Omega = X\Gamma + \Xi$
  • 69. Correcting for Cellular Heterogeneity in Blood • If a reference set is available, we can estimate the cell composition effect $M$ • We can estimate the individual cell proportions $\Omega$ from the methylation profile $Y$ and the cell composition effects $M$ • Can then adjust for estimated cellular composition in the subsequent analysis $Y = BX^T + M\Omega^T + E$, where $BX^T$ is the direct effect and $M\Omega^T$ is the cell composition effect
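Estimating one individual's cell proportions from a methylation profile and a reference matrix can be sketched as a projection in base R. This is a deliberately crude stand-in, assuming no covariate effects and using ordinary least squares followed by truncation at zero, whereas the published method solves a constrained problem; the reference matrix and proportions are simulated.

```r
set.seed(9)
n_sites <- 100
n_types <- 3
# Simulated reference profiles: mean methylation per site and cell type
M <- matrix(runif(n_sites * n_types), n_sites, n_types)
omega_true <- c(0.6, 0.3, 0.1)                   # true mixing proportions
# Whole-blood profile = mixture of the reference profiles + noise
y <- as.numeric(M %*% omega_true) + rnorm(n_sites, sd = 0.01)

# Least-squares projection of the profile onto the reference profiles
omega_ls  <- solve(crossprod(M), crossprod(M, y))
omega_hat <- pmax(as.numeric(omega_ls), 0)       # crude non-negativity constraint
```

The estimated proportions can then be entered as covariates in the downstream association model.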
  • 70. Compared to gold standards, this method has relatively high precision Accomando et al. Genome Biology (2014) Correlation between cell proportions estimated by DNA methylation and proportions quantified by established methods among whole blood samples from disease-free human donors Correcting for Cellular Heterogeneity in Blood
  • 71. Houseman et al. BMC Bioinformatics (2012) ● Coefficients estimated using 27k data do not seem to work well for 450k data ○ Can rebuild Houseman approach using cell sorted 450k dataset ● Can be implemented using estimateCellCounts() from minfi package Reinius LE et al. PLoS ONE (2012) Cell populations isolated by magnetic-activated cell sorting, purified using specific antibodies Correcting for Cellular Heterogeneity in Blood
  • 72. Cell proportion predicted for Reinius (2012) samples [Figure: predicted cell proportion] Correcting for Cellular Heterogeneity in Blood
  • 73. Jaffe and Irizarry. Genome Biology (2014) ● Proportions of cell types estimated using DNA methylation ● Can see shifts in composition associated with aging ○ Not (usually) the variation that we are interested in describing ● Age is confounder in many epidemiology studies ○ must consider impact on cellular heterogeneity Cellular Heterogeneity and Aging
  • 74. Other tissues also contain a mixture of cells ● To correct for this must either: microdissect all samples, or create a reference microdissected data set to rebuild Houseman approach The creation of reference-free methods ● Houseman et al. Bioinformatics (2014) ○ Uses modified ISVA to identify latent components of the observed methylation variation, assumed to capture differences in cell distributions ● Zou et al. Nature Methods (2014) ○ Finds simplest combination of principal components with linear mixed model that controls for test inflation Cellular Heterogeneity in other Tissues
  • 75. Cellular Heterogeneity in other Tissues
      Houseman et al. Bioinformatics (2014)
      ● Similar to ISVA and SVA, except makes an additional biological mixture assumption
        ○ Dependence of the latent structure of the error on the unknown, cell-specific methylation matrix
          library(RefFreeEWAS)
          # example: first 1000 loci, 8 latent variables
          test <- RefFreeEwasModel(betas(methylumidata.bgcorr)[c(1:1000),], mod, 8)
          testBoot <- BootRefFreeEwasModel(test, 500)   # 500 bootstrap datasets
      Significance of unadjusted estimates (B*)
          BstarSE <- apply(testBoot[,2,"B*",], 1, function(x) sd(x))
          BstarT <- test$Bstar[,2]/BstarSE
          BstarP <- 2*pnorm(-abs(BstarT))
      Significance of adjusted estimates (B)
          BetaSE <- apply(testBoot[,2,"B",], 1, function(x) sd(x))
          BetaT <- test$Beta[,2]/BetaSE
          BetaP <- 2*pnorm(-abs(BetaT))
  • 76. Houseman et al. Bioinformatics (2014) Find that the reference free approach in TCGA samples reduces the range of effect sizes, and attenuates significance of 1000 sample loci ● what variation is this capturing? Cellular Heterogeneity in other Tissues
  • 77. Zou et al. Nature Methods (2014) ● Factored spectrally transformed linear mixed model 'EWASher' (FaST-LMM-EWASher) ● Reference free approach (does not estimate cell type composition) ● Computes the methylome similarity between each pair of samples in the data set to get covariance. This is used in the linear mixed model as an implicit proxy for cell-type composition in conjunction with principal components. ● No issue of method portability as no reference set, but if the number of true associations is large there is a reduction in power Cellular Heterogeneity in other Tissues
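A sample-by-sample methylome similarity matrix of the kind described can be sketched in base R. The exact standardisation used by FaST-LMM-EWASher is not given on the slide, so the per-probe standardisation below is an assumption, and the beta values are simulated.

```r
set.seed(10)
betas <- matrix(runif(200 * 8), nrow = 200, ncol = 8)   # probes x samples

# Standardise each probe across samples, then average the cross-products
z <- t(scale(t(betas)))
similarity <- crossprod(z) / nrow(z)                    # samples x samples

# A valid covariance matrix is symmetric with non-negative eigenvalues
eigenvalues <- eigen(similarity, symmetric = TRUE, only.values = TRUE)$values
```

In a linear mixed model this matrix defines the covariance of a random effect, so that samples with similar methylomes are allowed to have correlated residuals.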
  • 78. Zou et al. Nature methods (2014) Important note from co-author Martin Aryee on PubMed Commons: “EWASher (Zou J, 2014) is intended to be used in EWAS settings where the primary interest is in identifying localized differentially methylated regions (i.e. DMRs that affect only a small fraction of methylation sites).* The results of EWASher should be interpreted with caution in settings where large-scale methylation changes are expected and/or of interest. The method assumes that large-scale changes are caused by cell type composition effects and will effectively remove these changes from consideration. This is useful in many EWAS settings, but the assumption may not hold when studying cancer or differences between tissues. In the cancer dataset used in our paper, for example, we specifically identify site-specific changes that are above and beyond global hypomethylation changes” *my bolding Cellular Heterogeneity in other Tissues
  • 79. ● 354 rheumatoid arthritis cases and 312 controls across 103,638 loci ○ a. shows QQ plot from unadjusted model ○ b. shows QQ plot where cell-type composition covariates were included in the model (Houseman) ○ c. shows QQ plot using EWASher Cellular Heterogeneity in other Tissues
  • 80. ● Cellular heterogeneity is a particular issue for cohort studies where cell counts are unknown ● Heterogeneity could cause bias in results ● Cell count related hits may not be of interest ● ‘Best practice’ not yet established ● Consideration for reference-free approaches: they assume the major determinant of variation is cell composition ○ may not be true, or may not be true for all tissues Summary