High Through-Put DNA Methylation Analysis of Lung Cancer: Plasma cfDNA for Bi... — Kate Barlow
• Technology pipeline for methylation biomarker development
• High throughput DNA methylation-qPCR workflows
• Liquid biopsy – cfDNA methylation testing
Introduction
History of Protein Sequencing
Determining Amino Acid Composition
N-terminal amino acid analysis
C-terminal amino acid analysis
The Edman degradation reaction
Limitations of the Edman degradation
Mass spectrometry
Importance
Conclusion
References
This document summarizes key concepts about nucleic acids and their interactions with proteins. It discusses how DNA can be denatured and reanneal through hybridization. It then describes the polymerase chain reaction (PCR) process which involves denaturation, annealing of primers, and extension to amplify specific DNA sequences. Other topics covered include the genetic code, messenger RNA, transfer RNA, ribosomal peptidyl transferase activity, non-canonical nucleic acid structures, and the different forces (electrostatic, hydrogen bonding, hydrophobic) involved in protein-nucleic acid interactions.
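The PCR steps summarized above can be sketched in silico: a forward primer anneals to the top strand, a reverse primer to the bottom strand, and the region between them is amplified. A minimal toy sketch with invented sequences (real primer design would also weigh Tm, GC content, and mismatches):

```python
# Toy in-silico PCR: locate hypothetical primer sites on a template and
# return the amplicon they would bound. All sequences are invented examples.

def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def amplicon(template: str, fwd: str, rev: str) -> str:
    """Return the region amplified by a forward primer and a reverse primer.

    Both primers are given 5'->3'; the reverse primer anneals to the
    bottom strand, so its site on the top strand is revcomp(rev).
    """
    start = template.find(fwd)                        # forward primer site
    end = template.find(revcomp(rev)) + len(rev)      # end of reverse primer site
    if start == -1 or end < len(rev):
        raise ValueError("a primer does not anneal to this template")
    return template[start:end]

template = "GGATCCATGGCTAGCTTTAAACCCGGGTACGATCGATTTGCAGAATTC"
fwd = "ATGGCTAGC"                # invented forward primer
rev = revcomp("GGGTACGATCG")     # invented reverse primer, 5'->3'
print(amplicon(template, fwd, rev))  # -> ATGGCTAGCTTTAAACCCGGGTACGATCG
```

Each thermal cycle then doubles the number of copies of this amplicon, which is where PCR's exponential amplification comes from.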
This presentation accompanies a webinar at: https://www1.gotomeeting.com/register/367952841
===
Hitachi Solutions has partnered with OpGen to offer MapIt® Optical Mapping Services to our customers. Trevor Wagner, Senior Applications Scientist Manager from OpGen will be our guest presenter. Trevor was part of the team that developed, tested, and released OpGen’s first major product, the Argus Optical Mapping System in 2010.
This webinar will describe:
1. How Optical Mapping technology will benefit you in the following application areas:
-Strain Typing
-Comparative Genomics
-Whole-genome Sequence Assembly
2. How the MapIt Service works.
This presentation covers riboswitches and riboswitch-mediated regulation. Riboswitches are small mRNA elements with tertiary structure that regulate downstream genes on the same mRNA by binding small metabolites and metal ions. The presentation describes the regulatory mechanisms, structures, and ligand binding of several important riboswitches, such as the TPP, purine, and FMN riboswitches, as well as the roles of tandem and cooperative riboswitches. Applications of riboswitches, for example as drug targets, and some future challenges are also discussed.
1. What is post-transcriptional modification of RNA?
2. How does post-transcriptional modification of RNA differ between prokaryotes and eukaryotes?
3. What are the various types of post-transcriptional modification of RNA?
4. What is the mechanism of 5' capping of RNA?
5. What is the mechanism of 3' polyadenylation of RNA?
6. What is the function of 5' capping of RNA?
7. What is the function of 3' polyadenylation of RNA?
8. What is splicing?
9. What is the mechanism of splicing?
10. What are spliceosomes?
11. What is snRNA (small nuclear RNA)?
12. What is the snRNP complex (SNURPs)?
13. How does faulty splicing cause beta-thalassemia?
14. What is methylation as a post-transcriptional modification?
15. What is alternative splicing?
16. What is selective splicing?
17. What is alternative polyadenylation?
18. What is alternative 5' donor splicing?
19. What is alternative 3' acceptor splicing?
20. What is the role of alternative splicing?
21. What is RNA editing?
22. How is RNA editing an exception to the central dogma?
23. Example of the apolipoprotein B gene for RNA editing
24. Other examples of RNA editing
Brief introduction of post-translational modifications (PTMs) — Creative Proteomics
PTMs are chemical alterations to protein structure, typically catalyzed by highly substrate-specific enzymes that are themselves under strict control by PTMs. Because many types of PTMs can be covalently attached to the amino-acid residues of a protein, they generate a large diversity of gene products. For protein post-translational modification analysis at Creative Proteomics, please visit https://www.creative-proteomics.com/services/protein-post-translational-modification-analysis.htm
DNA Methylation Analysis in a Single Day - Download the Slides — QIAGEN
This webinar introduces the new PyroMark Q48 Autoprep system. Combined with the latest EpiTect Fast bisulfite conversion technology, the new PyroMark Q48 Autoprep can now provide highly automated methylation analysis in a single day.
The information for the proteins found in a cell is encoded in the genes of the cell's genome. A protein-coding gene is expressed by transcription to produce an mRNA, followed by translation of the mRNA. Translation converts the base sequence of the mRNA into the amino acid sequence of a polypeptide.
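The mRNA-to-polypeptide conversion described above amounts to reading codons against the genetic code. A toy sketch with a deliberately tiny codon table (only the codons used in the example below; a full table has 64 entries):

```python
# Minimal sketch of translation: read an mRNA 5'->3' in codons from the
# first AUG and map each codon to an amino acid until a stop codon.
# Only the codons used below are included in this toy table.

CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "AAA": "K",
    "UGG": "W", "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (assumes one exists)."""
    start = mrna.find("AUG")
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon: release the polypeptide
            break
        protein.append(aa)
    return "".join(protein)

print(translate("GGAUGUUUGGCAAAUGGUAA"))  # -> MFGKW
```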
This document discusses the stability of nucleic acids and how differential scanning calorimetry (DSC) can be used to characterize it. DSC directly measures the stability and unfolding of biomolecules like DNA and RNA as they are heated. It determines values like the transition midpoint temperature (Tm), enthalpy (ΔH), and heat capacity change (ΔCp) associated with unfolding. DSC data provides information on factors influencing nucleic acid stability, including sequence effects, environmental conditions, and structure formation.
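The Tm and ΔH values a DSC experiment yields define a two-state melting curve via the van't Hoff relation. A hedged sketch with purely illustrative parameter values (not taken from the deck):

```python
# Two-state van't Hoff model behind a DSC melting curve: given the
# midpoint Tm and unfolding enthalpy dH, the fraction unfolded at
# temperature T. Parameter values below are illustrative only.
import math

R = 8.314  # gas constant, J/(mol*K)

def fraction_unfolded(T: float, Tm: float, dH: float) -> float:
    """Fraction unfolded at temperature T (kelvin); dH in J/mol.
    Equals 0.5 at T == Tm by construction."""
    K = math.exp(-(dH / R) * (1.0 / T - 1.0 / Tm))  # unfolding equilibrium constant
    return K / (1.0 + K)

Tm = 340.0   # ~67 C, an illustrative duplex midpoint
dH = 300e3   # 300 kJ/mol, illustrative
for T in (330.0, 340.0, 350.0):
    print(f"{T:.0f} K: fraction unfolded = {fraction_unfolded(T, Tm, dH):.3f}")
```

The curve is mostly folded below Tm, exactly half unfolded at Tm, and mostly unfolded above it; a larger ΔH makes the transition sharper.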
MicroRNAs and their role in gene regulation — Ibad Khan
MicroRNAs are small non-coding RNAs that regulate gene expression post-transcriptionally. They were first discovered in 1993 and their biogenesis involves two key steps - processing in the nucleus by the Drosha-DGCR8 complex into pre-miRNAs, followed by export to the cytoplasm and further processing by the Dicer enzyme into mature miRNA. The miRNA is then loaded into the RISC complex containing Argonaute proteins and guides it to target mRNAs to repress translation or promote degradation. MicroRNAs play important roles in various cellular functions and diseases by mediating gene silencing through nine different mechanisms.
The document summarizes the process of translation. It describes:
1) The machinery involved including mRNA, tRNA, ribosomes and other proteins.
2) The three main steps - initiation, elongation, and termination. Initiation involves binding of the ribosome and first tRNA. Elongation is the repetitive addition of amino acids by tRNA and peptide bond formation. Termination occurs when a stop codon is reached and release factors trigger the release of the complete protein.
3) Key processes within each step like activation of amino acids, charging of tRNA, translocation during elongation, and hydrolysis of the peptide bond during termination.
DNA is constantly damaged by radiation, chemicals, and other agents. There are multiple pathways for repairing DNA damage, including direct reversal, base excision repair, nucleotide excision repair, and mismatch repair. Base excision repair removes individual damaged bases. Nucleotide excision repair removes short fragments of 24-32 bases to repair more substantial damage like thymine dimers. Mismatch repair recognizes and fixes incorrect incorporations during DNA replication. Together, these pathways help maintain the integrity of DNA.
This document discusses the bioinformatics analysis of ChIP-seq data. It begins with an overview of ChIP-seq experiments and the major steps in processing and analyzing the sequencing data, including quality control, alignment, peak calling, and downstream analyses. Pipelines for automated analysis are described, such as Cluster Flow and Nextflow. The talk emphasizes that there is no single correct approach and the analysis depends on the biological question and experimental design.
This presentation explains DNA transcription and RNA processing.
It details both prokaryotic and eukaryotic DNA transcription, and explains post-transcriptional modification in prokaryotes and eukaryotes.
This document discusses long non-coding RNAs (lncRNAs). It begins by describing the discovery of lncRNAs in the 1980s-2000s through cDNA sequencing. It then states that lncRNAs are the largest class of transcripts in mouse and human genomes. The document discusses that lncRNAs were once thought to be useless but are now known to have regulatory functions. It provides details on the characteristics, locations in the genome, functions, mechanisms of action, roles in human disease, and implications in human carcinomas of lncRNAs.
mRNA stability and localization. RNA is critical at many stages of gene expression: how frequently an mRNA is translated, how long it is likely to survive, and where in the cell it is translated. RNA cis-elements and their associated proteins govern these properties.
This document provides an overview of RNA-seq and its applications. It discusses key aspects of RNA-seq including transcriptome profiling, alignment, quantification, differential expression analysis, clustering and visualization. It also covers experimental design considerations and highlights some commonly used tools and software. The document is a comprehensive guide that describes the RNA-seq workflow and analysis from start to finish.
This document discusses different DNA binding motifs that allow proteins to interact with DNA without disrupting the hydrogen bonds between the DNA bases. It describes several conserved structural motifs common to many DNA binding proteins, including the helix-turn-helix motif, zinc finger domains, and leucine zipper domains. The helix-turn-helix motif contains two short alpha helices separated by a beta turn. Zinc finger domains use cysteine or histidine residues to coordinate a zinc ion, stabilizing their structure. Leucine zipper domains contain repeated leucine residues that allow dimerization of regulatory proteins.
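The leucine-zipper spacing described above, a leucine every seventh residue along the dimerization helix, is straightforward to scan for. A toy sketch with invented sequences (a real motif scan would use profile models, not an exact-character test):

```python
# Scan a protein sequence for the leucine-zipper pattern: `repeats`
# leucines spaced exactly 7 residues apart. Sequences are invented.

def has_leucine_zipper(protein: str, repeats: int = 4) -> bool:
    """True if some start position has `repeats` leucines 7 apart."""
    span = 7 * (repeats - 1)
    for i in range(len(protein) - span):
        if all(protein[i + 7 * k] == "L" for k in range(repeats)):
            return True
    return False

print(has_leucine_zipper("LAAAAAALAAAAAALAAAAAAL"))  # L at 0,7,14,21 -> True
print(has_leucine_zipper("LAAALAAALAAAL"))            # L every 4 -> False
```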
Comparative genomic hybridization (CGH) is a molecular cytogenetic technique that compares the DNA of a test sample to a reference sample to detect copy number variations without cell culturing. It involves labelling the tumor and normal DNA with different fluorescent dyes, mixing them, and hybridizing them to normal chromosomes to detect losses or gains of genetic material in the tumor DNA through fluorescence ratios. While CGH can detect events over 10-20 Mb, array CGH uses genomic fragments as targets and can detect changes as small as 5-10 kb, making it a faster and more sensitive technique. Array CGH is commonly used in cancer research and diagnosis of genetic disorders.
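The fluorescence-ratio readout described above is conventionally summarized as log2(test/reference) per probe or region. A minimal sketch with invented intensities and illustrative gain/loss cutoffs (real pipelines calibrate thresholds and segment along the chromosome):

```python
# Classify copy-number status from CGH-style fluorescence ratios:
# log2(test/reference) per region, with simple cutoffs. Intensities
# and region names below are invented examples.
import math

def call_copy_number(test: float, ref: float,
                     gain: float = 0.3, loss: float = -0.3) -> str:
    """Classify a probe/region from its log2(test/ref) ratio."""
    log_ratio = math.log2(test / ref)
    if log_ratio >= gain:
        return "gain"
    if log_ratio <= loss:
        return "loss"
    return "normal"

# Invented (test, reference) intensities for three regions:
for name, t, r in [("8q24", 980.0, 500.0),    # log2 ~ +0.97 -> gain
                   ("17p13", 260.0, 510.0),   # log2 ~ -0.97 -> loss
                   ("2q31", 505.0, 500.0)]:   # log2 ~  0.01 -> normal
    print(name, call_copy_number(t, r))
```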
Non-coding RNAs (ncRNAs) are functional RNA molecules that are not translated into proteins. There are several types of ncRNAs that play important roles in biological processes. tRNAs help translate nucleotides into amino acids during protein synthesis. rRNA and snoRNAs are involved in ribosome and RNA structure/modification. MiRNAs regulate gene expression by binding to mRNA. LncRNAs regulate processes like chromatin structure and transcription. Mt-tRNAs specific to mitochondria are essential for oxidative phosphorylation. Mutations can cause diseases like myopathies.
Mitochondrial DNA (mtDNA) is located in mitochondria and contains genes that code for proteins in mitochondria. In humans, mtDNA contains 37 genes and is 16,600 base pairs. It is inherited solely from the mother in most species, including humans. The sequencing of mtDNA has helped scientists study evolutionary relationships between species and trace maternal lineages far back in time. MtDNA mutates more rapidly than nuclear DNA, making it useful for evolutionary studies.
Post-translational modification in proteins — Kaushal Sahu
This document discusses post-translational modifications in proteins. It begins with an introduction explaining that proteins undergo folding and modifications after translation to become functional. It then covers various types of post-translational modifications like the role of chaperones in protein folding, enzymes that catalyze folding like protein disulfide isomerase, and protein cleavage involved in maturation. Other modifications discussed are glycosylation, the addition of carbohydrates, and attachment of lipids. The document concludes that post-translational modifications are important for protein maturation and function.
TaqMan® MicroRNA Assays quantitate miRNAs with the specificity and sensitivity of TaqMan® Assay chemistry. A simple, two-step protocol requires only reverse transcription with a miRNA-specific primer, followed by real-time PCR with TaqMan® probes.
For more information visit:
http://www.invitrogen.com/site/us/en/home/Products-and-Services/Applications/epigenetics-noncoding-rna-research/miRNA-Profiling-/miRNA_qRT_PCR/TaqMan-MicroRNA-Assays-and-Arrays.html?CID=TaqmanMicroRNA-SS-12312
RNA transport
Multiple classes of RNA are exported from the nucleus
Transport through the nuclear pore complex
Ribosomal subunits are assembled in the nucleolus and exported by exportin 1
tRNAs are exported by a dedicated exportin
Messenger RNAs are exported from the nucleus as RNA-protein complexes
hnRNPs move from sites of processing to NPCs
Precursors to microRNAs are exported from the nucleus and processed in the cytoplasm
Gene expression is the process by which the information from a gene is used in the synthesis of a functional gene product. It involves two main stages - transcription of DNA to mRNA and translation of mRNA to protein. In eukaryotes, gene expression requires several processing steps between transcription and translation including 5' capping, splicing, and 3' polyadenylation of mRNA. Protein synthesis occurs via three phases - initiation, elongation, and termination on ribosomes in the cytoplasm. Gene expression is regulated at multiple levels including transcription, RNA processing, translation and post-translation.
The document discusses quality control, filtering, and normalization procedures for Illumina 450k methylation array data. It describes initial quality control checks to identify failed samples and technical artifacts, such as color biases. A variety of normalization approaches are presented, including within-array normalization to correct for color bias and background noise, between-array normalization to remove technical variation across arrays, and data-driven approaches to evaluate different preprocessing methods. The goal of preprocessing is to improve concordance with independent validation data while retaining meaningful biological variation.
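The 450k preprocessing described above operates on beta- and M-values derived from each probe's methylated (M) and unmethylated (U) intensities. A minimal sketch of both summaries (the offsets mirror commonly used defaults; the intensities are invented):

```python
# Beta- and M-value summaries for a methylation array probe, computed
# from methylated (meth) and unmethylated (unmeth) intensities.
# Offsets follow common convention; example intensities are invented.
import math

def beta_value(meth: float, unmeth: float, offset: float = 100.0) -> float:
    """Beta = M / (M + U + offset); interpretable as percent methylation."""
    return meth / (meth + unmeth + offset)

def m_value(meth: float, unmeth: float, alpha: float = 1.0) -> float:
    """M-value = log2((M + alpha) / (U + alpha)); has better variance
    properties than beta near 0 and 1, so it is often preferred for tests."""
    return math.log2((meth + alpha) / (unmeth + alpha))

# A mostly methylated probe (invented intensities):
print(beta_value(9000.0, 900.0))  # -> 0.9
print(m_value(9000.0, 900.0))
```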
This document provides an overview of DNA methylation analysis. It begins with background on DNA methylation functions and diseases. It then discusses methods for measuring DNA methylation status, including bisulfite sequencing. The document reviews steps for DNA methylation data analysis using tools like methylKit in R. It presents a case study example of analyzing DNA methylation data from human stem cells and fibroblasts. Alignment, quality control, differential methylation analysis and visualization are discussed.
This document introduces analyzing methylation data from Reduced Representation Bisulfite Sequencing (RRBS) experiments using the R package methylKit. It begins with an overview of basic R operations and data structures. Next, it discusses relevant genomics packages in Bioconductor like GenomicRanges and IRanges that are useful for working with genomic intervals. Finally, it demonstrates how to use methylKit to analyze RRBS methylation data, including working with annotated methylation events.
DNA methylation is an epigenetic mechanism that involves the addition of a methyl group to cytosine residues in DNA. It is catalyzed by DNA methyltransferase enzymes and plays a key role in gene expression and cellular differentiation. Aberrant DNA methylation, including both hypermethylation and hypomethylation, has been associated with cancer development by disrupting gene expression. Detection of DNA methylation patterns can provide insights into cancer biology and may have applications as a diagnostic tool.
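Bisulfite-based detection, as in the decks above, exploits the fact that unmethylated cytosines read as T after conversion while methylated cytosines stay C, so per-CpG methylation reduces to a count ratio over aligned reads. A toy sketch with invented read counts (packages such as methylKit perform this in R at scale):

```python
# Estimate per-CpG methylation from bisulfite-sequencing read counts:
# C reads = methylated (protected from conversion), T reads = converted,
# i.e. unmethylated. Positions and counts below are invented.

def methylation_percent(c_count: int, t_count: int) -> float:
    """Percent methylation at a CpG from C and T read counts."""
    total = c_count + t_count
    if total == 0:
        raise ValueError("no coverage at this position")
    return 100.0 * c_count / total

# Invented counts at three CpG positions:
for pos, c, t in [(1001, 18, 2), (1025, 3, 17), (1040, 10, 10)]:
    print(pos, f"{methylation_percent(c, t):.1f}%")
```

Real pipelines additionally filter by coverage and check the bisulfite conversion rate before trusting these ratios.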
The document introduces analyzing methylation data from Reduced Representation Bisulfite Sequencing (RRBS) experiments using the R package methylKit. It begins with an introduction and outline, then covers downloading example RRBS data, basics of R including vectors, matrices, and data frames, genomics and R packages for working with genomic intervals, and using methylKit to analyze RRBS methylation data.
Ibica2014 p(8): Visualizing and identifying the DNA methylation — Aboul Ella Hassanien
DNA methylation is an epigenetic mechanism that cells use to control gene expression. It has become one of the hottest topics in cancer research, especially in studies of abnormally hypermethylated tumor suppressor genes and hypomethylated oncogenes. Analysis of DNA methylation data identifies the differentially hypermethylated or hypomethylated genes that are candidate cancer biomarkers. Visualizing DNA methylation status may reveal new relationships between hypomethylated and hypermethylated genes; this paper therefore applies a mathematical modelling theory called formal concept analysis to visualize DNA methylation status.
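Formal concept analysis, as applied in the paper above, extracts all closed (object set, attribute set) pairs from a binary incidence table. A brute-force toy sketch over an invented gene-by-methylation-status context (real FCA implementations use faster closure algorithms):

```python
# Brute-force formal concept analysis on a tiny binary context:
# objects are genes, attributes are methylation-related labels.
# The gene/attribute labels and incidence pairs are invented.
from itertools import combinations

genes = ["BRCA1", "MLH1", "MYC", "RAS"]
attrs = ["hypermethylated", "hypomethylated", "tumor_suppressor"]
I = {  # which gene has which attribute (invented illustration)
    ("BRCA1", "hypermethylated"), ("BRCA1", "tumor_suppressor"),
    ("MLH1", "hypermethylated"), ("MLH1", "tumor_suppressor"),
    ("MYC", "hypomethylated"), ("RAS", "hypomethylated"),
}

def intent(gs):   # attributes shared by every gene in gs
    return frozenset(a for a in attrs if all((g, a) in I for g in gs))

def extent(ats):  # genes having every attribute in ats
    return frozenset(g for g in genes if all((g, a) in I for a in ats))

# A concept is a pair (extent, intent) closed under both derivations;
# enumerating intents of all gene subsets finds every concept.
concepts = set()
for r in range(len(genes) + 1):
    for gs in combinations(genes, r):
        ats = intent(gs)
        concepts.add((extent(ats), ats))

for ext, inten in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(ext), "<->", sorted(inten))
```

In this context the concept lattice has four concepts, including one grouping the two hypermethylated tumor suppressors and one grouping the two hypomethylated genes, which is exactly the kind of grouping the paper visualizes.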
Methylation and expression data integration — sahirbhatnagar
1) The document describes analysis of methylation and expression data from cord blood and placenta tissue samples and their relationship to gestational diabetes and childhood obesity.
2) Significant differentially methylated sites were identified between samples exposed and unexposed to gestational diabetes after adjusting for cell mixtures.
3) Methylation sites were also identified that correlated with various body fat measures in childhood after adjusting for covariates.
4) Some gene expression was also found to correlate with body fat measures, though fewer significant associations were identified compared to methylation.
Analysis of DNA methylation and Gene expression to predict childhood obesitysahirbhatnagar
Recent advances in genomic technologies have made it feasible to measure, on the same individual, multiple types of genomic activity such as genotypes, gene expression, DNA copy number, methylation and microRNA expression. However, in order to benefit from the increasing amounts of heterogeneous data and to obtain a more complete view of genomic functions, there is a great need for statistical and computationally efficient methods that allow us to combine this information in an intelligent way. Challenges with prediction models in this setting arise from the high-dimensional non-linear nature of the data, the large number of measurements compared to the few samples for whom they are collected, and the presence of complex interactions between the different types of data. Methods such as sparse regression, hierarchical clustering and principal component analysis can address any one of these challenges, but can not do so simultaneously. Kernel methods, which use matrices measuring the similarity between two individuals, offer a powerful way of simultaneously addressing these challenges without significantly increasing the computational burden. In this work, we investigate the benefits and challenges that arise from using kernel methods in the context of integrating DNA methylation, gene expression and phenotypic data in a sample of mother-child pairs from a prospective birth cohort. The goal of this study is to identify epigenetic marks observed at birth that help predict childhood obesity.
2. Related papers
• Teschendorff, A.E., Marabita, F., Lechner, M., Bartlett, T., Tegner, J., Gomez-Cabrero, D. and Beck, S. (2012) A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450k DNA methylation data.
• Dedeurwaerder, S. et al. (2011) Evaluation of the Infinium Methylation 450K technology. Epigenomics, 3, 771–784.
• Touleimat, N. and Tost, J. (2012) Complete pipeline for Infinium Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics, 4, 325–341.
• Du, P. et al. (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587.
• Ji, Y. et al. (2005) Applications of beta-mixture models in bioinformatics.
• Maksimovic, J. et al. (2012) SWAN: Subset-quantile Within Array Normalization for Illumina Infinium HumanMethylation450 BeadChips.
• Hansen et al. (2011) The minfi User's Guide: Analyzing Illumina 450k Methylation Arrays.
3. Background information
• DNA methylation – the addition of a methyl group to cytosine, which affects gene expression.
• Beta value (β): β = M/(M + U + α)
• A measure of methylation for each CpG, where M = methylated intensity, U = unmethylated intensity, and α is a small constant offset.
• 27k array design (old)
• Infinium I assays only: M and U measured in the same color, on different beads.
• 450k array design (new)
• A hybrid of the Infinium I and Infinium II assays: two different assay designs on the same array.
• Infinium II assays: M and U measured in different colors, on the same bead; a single probe pair for each CpG site.
• Allows assessment (for 12 samples in parallel) of the methylation status of more than 480,000 cytosines distributed over the whole genome.
• Covers 99% of all RefSeq genes, with an average of 17 probes per gene.
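The β-value definition above is a one-liner. Illumina's software commonly uses α = 100 as the stabilizing offset; that constant is an assumption here rather than something stated on the slide:

```python
def beta_value(meth, unmeth, alpha=100.0):
    """Beta-value for one CpG: fraction of methylated signal.
    The offset alpha keeps the ratio stable when total intensity is low."""
    return meth / (meth + unmeth + alpha)
```

For example, a heavily methylated probe with intensities M = 900 and U = 100 gives β = 900/1100 ≈ 0.82 rather than 0.9, showing the mild shrinkage the offset introduces at moderate intensities.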
4. Illumina Infinium 450k DNA methylation BeadChip
• A useful tool in EWAS (epigenome-wide association) studies.
• Can provide more insight than the 27k DNA methylation BeadChip.
• Problem: the two different probe designs cause the methylation values derived from them to exhibit different distributions.
• β-values obtained from Infinium II probes are less accurate and reproducible than those obtained from Infinium I probes. This has been confirmed in at least two papers:
• "Evaluation of the Infinium Methylation 450K technology" (Dedeurwaerder et al., 2011)
• "Complete pipeline for Infinium Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation" (Touleimat and Tost, 2012)
• Infinium I probes report over a wider range of β-values, reflecting all possible methylation states, even after adjustment for differences in biological characteristics such as CpG density.
• Because of this, Infinium II probes may not report with the same sensitivity as Infinium I probes, as shown in the following graphs.
5. Infinium I vs Infinium II β values (Dedeurwaerder et al., 2011)
6. Infinium I vs Infinium II β values (Touleimat and Tost, 2012)
7. How to account for variation?
• The extra source of variation between type I and type II probes should be accounted for by normalizing each.
• Normalization means adjusting values measured on different scales to a notionally common scale. In more complicated cases, it refers to more sophisticated adjustments intended to bring the entire probability distributions of the adjusted values into alignment.
• Several methods have been developed to normalize between type I and type II probe data:
• Peak-based correction (PBC) – adjusts type II probes based on the type I probe peak values.
• Subset Quantile Normalization (SQN) – adjusts each type II probe's quantile rank based on the quantile rank of similar type I probes.
• Beta-MIxture Quantile dilation (BMIQ) – adjusts the type II probe distribution based on the type I distribution.
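The shared idea behind these methods, bringing one distribution into alignment with another, can be illustrated with plain quantile normalization: map each value in the target set onto the reference value at the same quantile rank. This sketch shows the general principle only, not the published SQN algorithm (which matches probes within annotation-based subsets); the function name is mine.

```python
def quantile_normalize(target, reference):
    """Map each target value to the reference value at the same quantile rank.
    Illustrative sketch: simple rank interpolation, no handling of ties."""
    ref_sorted = sorted(reference)
    n_ref = len(ref_sorted)
    # indices of target values in ascending order of value
    order = sorted(range(len(target)), key=lambda i: target[i])
    normalized = [0.0] * len(target)
    for rank, i in enumerate(order):
        # relative rank of this target value in [0, 1]
        q = rank / (len(target) - 1) if len(target) > 1 else 0.0
        # interpolate into the reference quantiles at the same position
        pos = q * (n_ref - 1)
        lo, hi = int(pos), min(int(pos) + 1, n_ref - 1)
        frac = pos - lo
        normalized[i] = ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac
    return normalized
```

After the mapping, the sorted target values coincide with quantiles of the reference distribution while each probe keeps its original rank.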
8. Normalization technique PBC
• Peak-based correction (PBC), proposed by Dedeurwaerder et al., 2011, rescales the Infinium II data on the basis of the Infinium I density distribution modes. There are 4 steps to PBC:
1) Convert β-values to M-values: M-value = log2(β-value/(1 − β-value))
2) Determine the peaks for Infinium I and II independently, using kernel density estimation with a Gaussian smoothing function and a bandwidth of 0.5. The unmethylated peak summit is computed as SU = argmax(density of M-values) over the negative M-values, for both Infinium I and II. Similarly, the methylated peak summit is computed as SM = argmax(density of M-values) over the positive M-values.
9. Normalization technique PBC
3) Rescale the raw M-values, using the peak summits as reference, to obtain corrected M-values.
• The corrected M-values are obtained by independently rescaling the negative and positive M-values, using the distance between each peak summit and zero.
• For negative M-values: corrected M-value = M-value/σU, where σU is the distance between the unmethylated peak summit and zero (σU = 0 − SU).
• For positive M-values: corrected M-value = M-value/σM, with σM = SM − 0.
10. Normalization technique PBC
4) Rescale the corrected M-values to match the Infinium I range, then convert back to β-values.
• Negative M-values are rescaled by the Infinium I σU (rescaled M-value = corrected M-value × σU) and positive M-values by the Infinium I σM (rescaled M-value = corrected M-value × σM). Finally, the rescaled M-values are converted back to β-values via the relation β-value = 2^M-value/(2^M-value + 1).
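Steps 1 to 4 can be sketched end to end. This is a toy illustration under stated simplifications (coarse grid search for the KDE mode, positive branch also catches M = 0), not the IMA package's implementation; all function names are mine.

```python
import math

def m_value(beta):
    """Step 1: convert a beta-value to an M-value."""
    return math.log2(beta / (1.0 - beta))

def beta_from_m(m):
    """End of step 4: convert an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

def kde_peak(values, bandwidth=0.5):
    """Step 2: peak summit as the argmax of a Gaussian kernel density
    estimate, evaluated on a coarse grid (a sketch, not an optimizer)."""
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * k / 200 for k in range(201)]
    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return max(grid, key=density)

def peak_based_correction(m_type2, m_type1, bandwidth=0.5):
    """Steps 2-4: rescale Infinium II M-values so their unmethylated and
    methylated peak summits coincide with the Infinium I summits."""
    su1 = kde_peak([m for m in m_type1 if m < 0], bandwidth)  # Infinium I S_U
    sm1 = kde_peak([m for m in m_type1 if m > 0], bandwidth)  # Infinium I S_M
    su2 = kde_peak([m for m in m_type2 if m < 0], bandwidth)  # Infinium II S_U
    sm2 = kde_peak([m for m in m_type2 if m > 0], bandwidth)  # Infinium II S_M
    sigma_u1, sigma_m1 = -su1, sm1   # Infinium I summit distances from zero
    sigma_u2, sigma_m2 = -su2, sm2   # Infinium II summit distances from zero
    # step 3 (divide by own sigma) fused with step 4 (multiply by Infinium I sigma)
    return [m / sigma_u2 * sigma_u1 if m < 0 else m / sigma_m2 * sigma_m1
            for m in m_type2]
```

With Infinium II peaks at M = −2 and +2 and Infinium I peaks at −4 and +4, the correction stretches the type II values by a factor of 2 on each side of zero.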
11. (A) Bar plots indicating the range of β-values generated for the HCT116 wild-type (WT) sample (r3) with the Infinium I and Infinium II assays. (B)
Density plots of the β-values for the two Infinium assay types for the HCT116 WT sample (r3). (C) Box plots of probe-wise variance
between the three replicates of HCT116 WT (r1, r2 and r3) (outliers not drawn). On the left of the figure, β-values have undergone
no correction (raw data); on the right, they have been subjected to the peak-based correction.
Data: eight tumor samples and eight normal breast tissue samples
12. Normalization technique PBC
• PBC efficiently corrects for the InfI/InfII shift and improves results.
• PBC is implemented in the R package Illumina Methylation Analyzer (IMA)
• However, two recent studies have exposed potential problems with PBC
• PBC depends on the bimodal shape of methylation density profiles. It
breaks down when the methylation density distribution does not exhibit
well-defined peaks/modes (Maksimovic et al., 2012; Touleimat and
Tost, 2012)
• One proposed solution is Subset Quantile Normalization (SQN)
(Touleimat and Tost, 2012)
• Another is Beta-MIxture Quantile dilation (BMIQ) (Teschendorff et al., 2012)
13. Normalization technique SQN
• In general, β-value distributions should be normalized using standard
approaches, such as quantile normalization for inter-sample normalization.
However, three constraints prohibit such a straightforward approach for the
two different assays on the 450k BeadChip:
1) The numbers of InfI (28%) and InfII (72%) probes differ, preventing
computation of a common set of reference quantiles
2) The population to 'correct' (InfII) is the larger one and may therefore
bias the distribution of the other population (InfI)
3) There is a large imbalance in the proportions of InfI and InfII probes
covering the different CpG and gene-sequence regions
• A global standardization of methylation value distributions may lead to a
dramatic loss of information, because the variation of methylation status
may be specific to probes covering different subcategories of CpGs
• SQN (an approach originally developed for normalizing gene-expression
signal) was proposed to solve the first two issues by splitting probes into
type1 and type2 and 'anchoring' the type2 probes to the more stable and
accurate type1 probes.
14. How does SQN work?
• Reference quantiles of a target set of features are estimated from a
smaller set of features used as 'anchors', which are considered more
reliable and stable.
• Modifies the values of the target distribution based on rank equivalence
• Corrects the data so that non-anchor and anchor probes at the same
percentile will have the same value.
• Uses the InfI signals as the anchors to estimate a reference distribution
of quantiles, and uses this reference to estimate a target distribution of
quantiles for the InfII probes
• This should provide an accurate normalization of InfI/InfII probes and
correct for the shift
• Two versions of the SQN approach were implemented using the provided
Illumina annotations:
1) Based on the 'relation to CpG'
2) Based on the 'relation to gene sequence'
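The rank-based anchoring at the heart of SQN can be sketched as follows. This is a simplified, hypothetical implementation (function name ours); the published method additionally splits probes by annotation category before applying the mapping.

```python
import numpy as np

def subset_quantile_normalize(anchor, target):
    """Map each target (InfII) value onto the anchor (InfI) value that sits
    at the same quantile rank, using the anchors as the reference
    distribution."""
    ranks = np.argsort(np.argsort(target))    # 0..n-1 rank of each target value
    quantiles = (ranks + 0.5) / len(target)   # mid-rank percentile per probe
    return np.quantile(anchor, quantiles)     # anchor value at that percentile
```

Because the mapping is monotone in rank, the relative ordering of the target (InfII) probes is preserved while their values are pulled into the anchor (InfI) distribution. In the full pipeline this is applied separately within each 'relation to CpG' (or 'relation to gene sequence') category, so type2 probes are anchored only to biologically comparable type1 probes.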
17. Verification of SQN
• Touleimat, N. and Tost, J. (2012) verified their results using pyrosequencing
• A technique which, according to their paper, provides high quantitative
precision and data with single-nucleotide resolution.
• They chose 13 probes for comparison, which had to meet the following criteria:
• Stable methylation values between samples of the same phenotype
(β SD < 0.1)
• Differentially methylated (differential methylation > 20%) between
samples of different phenotypes
• Most importantly, a large difference between the median β-values
obtained with each variant of their preprocessing pipeline
• Their results (Table 1 on the next slide) show that SQN using the
relation-to-CpG annotations to identify category-related anchors provided
the greatest number of methylation values closest (n = 7) to those obtained
by pyrosequencing for the very same CpGs.
• Note: With the exception of normalization method F, most methods
performed fairly well, with G being best
19. Verification of SQN Cont.
• Their results, Table 2 on next slide, also show the SQN approach, together
with the peak-based correction approach, provided the smallest absolute
differences in the methylation values when compared with pyrosequencing-
based methylation values.
• Note: Most performed fairly well with G and E tied for best results
21. Subset Quantile Normalization (SQN) Results
• In general, SQN works well and avoids the sensitivity to variations in the
shape of the methylation density curves seen with PBC.
• However, SQN requires a separate normalization to be performed on
selected subsets of probes that are matched for biological characteristics
(e.g. CpG density).
• SQN depends on a priori choices of which biological characteristics to use
when matching the type1 and type2 distributions
• Another model, BMIQ, avoids these assumptions, as it does not require a
separate normalization per probe subset
22. Beta-MIxture Quantile dilation (BMIQ)
• A newly proposed technique that adjusts the beta-values of type2 design
probes to a statistical distribution characteristic of type1 probes, in order to
make their statistical distributions comparable. 3 steps:
1. Assign probes to methylation states
2. Transform probabilities into quantiles
3. Perform a methylation-dependent dilation transformation to preserve the
monotonicity and continuity of the data
23. Beta-MIxture Quantile dilation (BMIQ)
• The authors verified the method by comparing results from tumor tissue
samples to other known methods. After assessment, BMIQ improves on 'no
normalization' and compares favorably to other methods of normalization,
with:
• Improved robustness of the normalization procedure
• Reduced technical variation and bias of type2 probe values
• Elimination of the type1 enrichment bias caused by the lower dynamic
range of type2 probes
• Code available at http://code.google.com/p/bmiq/downloads/list
24. BMIQ INPUT
• ### beta.v: vector consisting of beta-values for a given sample. NAs are not allowed. Beta-values
that are exactly 0 or 1 will be replaced by the minimum positive value or the maximum value below 1, respectively.
• ### design.v: corresponding vector specifying probe design type (1=type1,2=type2). This must be
of the same length as beta.v and in the same order.
• ### doH: perform normalization for hemimethylated type2 probes. By default TRUE.
• ### nfit: number of probes of a given design to use for the fitting. Default is 50000. Smaller values
(~10000) will make BMIQ run faster at the expense of a small loss in accuracy. For most
applications, 10000 is ok.
• ### nL: number of states in beta mixture model. 3 by default. At present BMIQ only works for nL=3.
• ### th1.v: thresholds used for the initialization of the EM-algorithm, they should represent best
guesses for calling type1 probes hemi-methylated and methylated, and will be refined by the EM
algorithm. Default values work well in most cases.
• ### th2.v: thresholds used for the initialization of the EM-algorithm, they should represent best
guesses for calling type2 probes hemi-methylated and methylated, and will be refined by the EM
algorithm. By default this is null, and the thresholds are estimated based on th1.v and a modified
PBC correction method.
• ### niter: maximum number of EM iterations to do. By default 5.
• ### tol: tolerance threshold for EM algorithm. By default 0.001.
• ### plots: logical specifying whether to plot the fits and normalized profiles out. By default TRUE.
• ### sampleID: the ID of the sample being normalized.
25. Beta-MIxture Quantile dilation (BMIQ)
• ### OUTPUT
• ### A list with the following elements:
• ### nbeta: the normalized beta-profile for the sample
• ### class1: the assigned methylation state of type1 probes
• ### class2: the assigned methylation state of type2 probes
• ### av1: mean beta-values for the nL classes for type1 probes.
• ### av2: mean beta-values for the nL classes for type2 probes.
• ### hf: the "Hubble" dilation factor
• ### th1: estimated thresholds used for type1 probes
• ### th2: estimated thresholds used for type2 probes
26. The BMIQ paper used ten 450k datasets
• Datasets 1 and 2: (BT) and (CL), subsets of the dataset considered in
Dedeurwaerder et al. (2011): eight fresh-frozen (FF) breast tumors and eight
normal breast tissue specimens [hereafter referred to as (BT)], as well as the
three replicates from the HCT116 WT cell line [hereafter referred to as (CL)].
For these cell lines, matched bisulphite pyrosequencing (BPS) data were
available for nine type2 probes.
• Datasets 3 and 4: (FFPE) and (FF) consist of 32 formalin-fixed paraffin-
embedded (FFPE) head and neck cancers (HNCs), of which 18 were HPV+
and 14 HPV-, as well as five fresh-frozen HNCs (FF), of which 2 were HPV+
and 3 HPV-. Available from GEO under accession number GSE38271.
• Dataset 5: (GBM) consists of 81 glioblastoma multiformes (GBMs) (Turcan
et al., 2012), 49 of which were categorized as CpG island methylator positive
(CIMP+) and 32 as CIMP-.
• Datasets 6–10: TCGA, LIV, LC, BLDC, HCC samples are all from the
TCGA: Dataset6 (TCGA) consists of 10 samples as provided in the
Bioconductor data package TCGAmethylation 450k, Dataset7 (LIV) consists
of nine normal liver tissue samples from Batch203 in the TCGA data portal,
Dataset8 (LC) consists of 22 lung cancer samples from Batch196, Dataset9
(BLDC) consists of 12 bladder cancer samples from Batch86 and Dataset10
(HCC) consists of 10 hepatocellular carcinoma samples from Batch153.
27. BMIQ normalization criteria
i. Must allow for the different biological characteristics of type1 and type2
probes
• Type1 probes are significantly more likely to map to CpG islands than
type2 probes, and hence the relative proportion of methylated and
unmethylated probes will vary between the two designs. In the case of
the type2 probes, this means that these proportions must be invariant
under the normalization transformation.
ii. The transformation of the type2 probe values should reduce the bias
• which amounts to matching the density distributions of the two
design types, especially at the unmethylated and methylated extremes.
iii. The transformation must be monotonic
• Relative ranking of beta values of the type2 probes must be invariant
under the transformation.
28. BMIQ normalization strategy
• Fit a three-state beta mixture model (unmethylated-U, hemimethylated-H,
fully methylated-M) to type1 and type2 probes separately, using three
steps
• Note: Let {(a_U^I, b_U^I), (a_H^I, b_H^I), (a_M^I, b_M^I)} denote the parameters of the
three beta distributions for the type1 probes, and similarly let
{(a_U^II, b_U^II), (a_H^II, b_H^II), (a_M^II, b_M^II)} denote the estimated parameters of the
three beta components for the type2 probes. State membership of
individual probes is determined by the maximum-probability criterion.
29. Beta Distribution
• Family of continuous probability distributions defined on the interval [0, 1]
parametrized by two positive shape parameters, denoted by α and β, that
appear as exponents of the random variable and control the shape of the
distribution.
http://en.wikipedia.org/wiki/Beta_distribution
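As a quick illustration with SciPy (parameter values here are just examples, not fitted to any array data):

```python
from scipy.stats import beta

# alpha < 1 and beta < 1 give the U-shape typical of genome-wide beta-values
u_shaped = beta(a=0.5, b=0.5)
# alpha >> beta concentrates mass near 1 (a "methylated" component)
methylated = beta(a=10, b=2)

print(round(u_shaped.mean(), 3))    # mean = a/(a+b) = 0.5
print(round(methylated.mean(), 3))  # 10/12 ≈ 0.833
```

The mean a/(a+b) and the concentration a+b make the beta family a natural fit for methylation beta-values, which also live on [0, 1].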
30. BMIQ normalization strategy 3 steps cont.
1. For those type2 probes assigned to the U-state, transform their probabilities
of belonging to the U-state to quantiles using the inverse of the cumulative
beta distribution with parameters (a_U^I, b_U^I) estimated from the type1 U
component. Let n_U^II denote the normalized values of the type2 U-probes.
2. For those type2 probes assigned to the M-state, transform their probabilities
of belonging to the M-state to quantiles using the inverse of the cumulative
beta distribution with parameters (a_M^I, b_M^I) estimated from the type1 M
component. Let n_M^II denote the normalized values of the type2 M-probes.
3. For the type2 probes assigned to the H-state, perform a dilation (scale)
transformation to 'fit' the data into the 'gap' with endpoints defined by
max{n_U^II} and min{n_M^II}.
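Steps 1 and 2 amount to a probability-integral transform between two beta components; a minimal sketch (function name and the parameter values in the example are our own, not fitted values from the paper):

```python
import numpy as np
from scipy.stats import beta

def map_state(values, params_type2, params_type1):
    """Quantile-map type2 beta-values from their fitted beta component onto
    the corresponding type1 beta component (BMIQ steps 1-2)."""
    a2, b2 = params_type2
    a1, b1 = params_type1
    p = beta.cdf(values, a2, b2)   # cumulative probability under the type2 fit
    return beta.ppf(p, a1, b1)     # value at the same quantile of the type1 fit

# e.g. map a compressed type2 U-component onto a sharper type1 U-component
vals = np.array([0.05, 0.10, 0.20, 0.30])
mapped = map_state(vals, params_type2=(3, 10), params_type1=(2, 20))
```

Because both the cdf and its inverse are increasing functions, the mapping is monotonic, satisfying criterion iii of the normalization strategy.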
32. Aside: Expectation Maximization using Beta Mixture
• EM using a beta-mixture model, from Ji et al. 2005
• The beta-mixture model deals with a vector of correlation coefficients of
gene-expression levels. Correlation coefficients are assumed to come from
multiple underlying probability distributions, in this case beta distributions. To
fit the beta distribution, each correlation coefficient xi is first linearly
transformed as yi = (xi + 1)/2, so that the range of the transformed values is
between 0 and 1. The index i represents the gene with respect to which the
correlation coefficient is calculated. Let {yi}, i = 1, . . . , n, denote the
transformed correlation coefficients (where n is the total number of
observations and L is the number of components in the mixture) under a
mixture of beta distributions, where the density of the beta distribution is
f(y; a, b) = y^(a−1) (1 − y)^(b−1) / B(a, b), with B(a, b) the beta function.
34. Aside: Expectation Maximization using Beta Mixture
Use the expectation-maximization algorithm (Dempster et al., 1977) to iteratively
maximize the log-likelihood and update the conditional probability that yi comes
from the l-th component. The algorithm repeats the M-step and E-step until the
change in the value of the log-likelihood in Equation (1) is negligible.
Ji et al. 2005
The EM algorithm yields the final estimated posterior probability z*_il, the value
of which represents the posterior probability that correlation coefficient yi
comes from component l.
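A compact EM sketch for a beta mixture is given below. It is deliberately simplified: the M-step uses a method-of-moments update rather than the maximum-likelihood update of Ji et al., and the initialization by quantile bins is our own symmetry-breaking choice.

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(y, n_components=2, n_iter=50):
    """EM for a mixture of beta distributions on data y in (0, 1).
    M-step uses moment matching in place of the full MLE update."""
    # break symmetry: initial soft assignment by quantile bins
    edges = np.quantile(y, np.linspace(0, 1, n_components + 1))
    labels = np.clip(np.searchsorted(edges[1:-1], y), 0, n_components - 1)
    z = np.eye(n_components)[labels] * 0.9 + 0.1 / n_components
    for _ in range(n_iter):
        # M-step: mixture weights and moment-matched (a, b) per component
        w = z.mean(axis=0)
        params = []
        for l in range(n_components):
            m = np.average(y, weights=z[:, l])
            v = np.average((y - m) ** 2, weights=z[:, l])
            common = m * (1 - m) / v - 1
            params.append((m * common, (1 - m) * common))
        # E-step: posterior responsibility z_il of component l for point i
        dens = np.column_stack([w[l] * beta.pdf(y, *params[l])
                                for l in range(n_components)])
        z = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)  # underflow guard
    return w, params
```

On well-separated simulated data the recovered component means a/(a+b) land near the true unmethylated and methylated modes.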
36. BMIQ normalization procedure
• The fitted beta mixture is two-tailed, so beta-values are subdivided into
those falling to the left or right of the mean: unmethylated to the left,
methylated to the right.
• These are used to normalize the U and M beta-values
• The H beta-values then still need to be normalized
• The normalized beta-values for the H-probes are given by the conformal
(shift + dilation) transformation based on the max{U} and min{M} values
• This conformal transformation involves a non-uniform rescaling of the
H-probe beta-values, since it depends on the beta-value of the probe. This
is absolutely key in order to avoid gaps or holes emerging in the
normalized distribution
• It is important to normalize with respect to the tail in which the beta-value
falls, because the left tail of the methylated type2 distribution is generally
not well described by a beta distribution, presumably as a result of dye
bias; similarly for the unmethylated distribution and its right tail.
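The shift + dilation idea can be sketched with a simple uniform version (the actual BMIQ transformation is non-uniform, depending on each probe's beta-value; the endpoint values in the example are hypothetical):

```python
def dilate_hemi(h_values, new_lo, new_hi):
    """Conformal (shift + dilation) map of hemimethylated beta-values onto
    the gap [new_lo, new_hi] between the normalized U and M probes."""
    lo, hi = min(h_values), max(h_values)
    scale = (new_hi - new_lo) / (hi - lo)
    return [new_lo + (v - lo) * scale for v in h_values]

# squeeze H-probes into the gap left between max{U} = 0.3 and min{M} = 0.7
gap_fitted = dilate_hemi([0.40, 0.50, 0.65], 0.3, 0.7)
```

The endpoints of the H distribution land exactly on the gap boundaries, so no hole opens up between the normalized U, H and M ranges.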
39. BMIQ normalization procedure
• The resulting thresholds would normally fall within the ranges 0.2–0.3 and
0.6–0.8, respectively. Having thus identified reasonable initial estimates for
the weights {π_U^II, π_H^II, π_M^II}, the algorithm then automatically determines the
unmethylated, hemi-methylated and methylated fractions for each sample
individually.
40. Improved robustness of BMIQ
• BMIQ does not use the type1 modes to adjust the type2 data, and hence
BMIQ normalization of the type2 probes generated a much smoother density
distribution, suggestive of an improved normalization framework (Fig. 1B)
41. BMIQ reduces technical variation
• BMIQ not only led to a significant improvement, but was also marginally
better than PBC (Fig. 2B)
Manhattan distance – distance between two points in a grid based on a strictly horizontal and/or vertical path
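For instance, the Manhattan distance between two points is just the sum of absolute coordinate differences:

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences (grid/taxicab distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # |1-4| + |2-6| = 3 + 4 = 7
```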
42. BMIQ reduces bias of type2 methylation values
• BMIQ significantly reduced the bias of type2 values (Fig. 3), although
there was no improvement over PBC itself
43. BMIQ eliminates the type1 enrichment bias
• To assess any potential bias towards type1 probes, the authors computed,
for a given number of top-ranked probes, the odds ratio (OR) of relative
enrichment of type1 over type2 probes. BMIQ successfully avoided any
type1/type2 enrichment bias in all three datasets, indicative of an improved
normalization of type2 values
44. Reduced technical variability within probe clusters
• Defined probe clusters as contiguous regions containing at least seven
probes with no two adjacent probes separated by >300bp.
• Within these probe clusters, the paper posited that pairs of adjacent
probes, one from each design and within 200 bp of each other, should have
similar methylation values.
• To compare the normalization algorithms, the authors evaluated which one
minimizes the absolute difference in methylation between such closely
adjacent type1-type2 pairs
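The cluster definition above (at least seven probes, no adjacent gap over 300 bp) can be sketched as follows; the thresholds come from the text, but the implementation is our own:

```python
def probe_clusters(positions, max_gap=300, min_probes=7):
    """Group sorted probe positions into contiguous clusters in which no two
    adjacent probes are separated by more than max_gap bp, keeping only
    clusters with at least min_probes probes."""
    clusters, current = [], [positions[0]]
    for prev, pos in zip(positions, positions[1:]):
        if pos - prev <= max_gap:
            current.append(pos)      # still within the same cluster
        else:
            if len(current) >= min_probes:
                clusters.append(current)
            current = [pos]          # a large gap starts a new cluster
    if len(current) >= min_probes:
        clusters.append(current)
    return clusters
```

Within each resulting cluster, type1-type2 probe pairs closer than 200 bp can then be paired up and their absolute methylation differences compared across normalization methods.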
47. BMIQ robustly identifies features associated with HPV status
• The paper attempted to verify that the reduction in technical variation
obtained with BMIQ is not at the expense of reduced biological signal.
• A training/test set strategy was used: features identified in a training set
were called true positives if validated in a test set.
• This allows a comparison of sensitivity and positive predictive value (PPV)
between the different normalization methods.
• BMIQ identified more differentially methylated features than PBC or
SWAN, and not at the expense of a smaller PPV, so, overall, BMIQ
identified more true positives
48. Results
• Because of the different nature of type1 and type2 probes on the Illumina
450k Methylation BeadChip, a different kind of normalization is necessary
than what was used on 27k data
• There are several methods to do this; each is better than performing
quantile normalization without discriminating between probe types.
• Normalization with regard to probe type improved robustness, reduced
technical variation, reduced the bias of type2 methylation values, and
eliminated the type1 enrichment bias