This document discusses various techniques for normalizing gene expression data from microarray experiments, including total intensity normalization, normalization using regression techniques, and normalization using ratio statistics. It also covers calculating distances between gene expression profiles using measures like Euclidean distance and Pearson correlation coefficient. Finally, it examines clustering analysis methods like hierarchical and non-hierarchical clustering that can group genes or samples based on similar expression patterns.
A micro-array is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern.
This was made by me while I was doing my Masters. I have made a few animations; I hope they make understanding better.
The content was compiled by searching the internet and referencing books. I do not claim any content in this presentation except the animations made on the subject.
Microarray and its application
1. Made by - SHRUTI DABRAL
Measuring dissimilarity of expression patterns
8. NORMALIZATION
There are three widely used techniques that can be used to normalize gene-expression data from a single array hybridization. All of these assume that all (or most) of the genes in the array, some subset of genes, or a set of exogenous controls that have been 'spiked' into the RNA before labelling, should have an average expression ratio equal to one. The normalization factor is then used to adjust the data to compensate for experimental variability and to 'balance' the fluorescence signals from the two samples being compared.
Total intensity normalization
Total intensity normalization relies on the assumption that the quantity of initial mRNA is the same for both labelled samples. Furthermore, one assumes that some genes are upregulated in the query sample relative to the control and that others are downregulated. For the hundreds or thousands of genes in the array, these changes should balance out, so that the total quantity of RNA hybridizing to the array from each sample is the same. Under this assumption, a normalization factor can be calculated and used to re-scale the intensity for each gene in the array.
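A minimal sketch of this idea, using made-up intensities for six array elements: the normalization factor is simply the ratio of the summed intensities of the two channels, and one channel is re-scaled by that factor.

```python
import numpy as np

# Hypothetical two-channel intensities for 6 array elements (illustrative values).
red = np.array([150.0, 80.0, 1200.0, 40.0, 300.0, 95.0])    # query sample
green = np.array([100.0, 90.0, 1000.0, 60.0, 250.0, 70.0])  # control sample

# Normalization factor: ratio of total intensities across all genes.
k = red.sum() / green.sum()

# Re-scale the control channel so the total hybridized signal is balanced
# between the two samples.
green_norm = green * k

print(red.sum(), green_norm.sum())  # the two totals now match
```

After re-scaling, per-gene ratios red/green_norm are no longer biased by a global difference in labelled RNA between the two channels.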
9. Normalization using regression techniques
For mRNA derived from closely related samples, a significant fraction of the assayed genes would be expected to be expressed at similar levels. Normalization of these data is equivalent to calculating the best-fit slope using regression techniques and adjusting the intensities so that the calculated slope is one. In many experiments the intensities are nonlinear, and local regression techniques, such as LOWESS (LOcally WEighted Scatterplot Smoothing) regression, are more suitable.
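A sketch of the linear case, on simulated log-intensities (the slope of 1.15 and the noise level are made-up values): fit the best-fit line, then adjust one channel so that the recalculated slope is one. For the nonlinear case one would fit LOWESS instead (e.g. `statsmodels.nonparametric.smoothers_lowess.lowess`) and subtract the fitted curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated log2 intensities for two closely related samples: most genes are
# expressed at similar levels, plus a channel-specific bias (slope != 1).
log_green = rng.uniform(4, 14, size=500)
log_red = 1.15 * log_green + 0.3 + rng.normal(0, 0.2, size=500)

# Best-fit slope and intercept by least-squares regression.
slope, intercept = np.polyfit(log_green, log_red, 1)

# Adjust the red channel so the regression of adjusted red on green has slope ~1.
log_red_adj = (log_red - intercept) / slope

adj_slope, adj_intercept = np.polyfit(log_green, log_red_adj, 1)
print(round(adj_slope, 3))  # close to 1 after normalization
```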
10. Normalization using ratio statistics
A third normalization option is a method based on the ratio statistics described by Chen et al. They assume that, although individual genes might be up- or downregulated, in closely related cells the total quantity of RNA produced is approximately the same for essential genes, such as 'housekeeping' genes. Using this assumption, they develop an approximate probability density for the ratio Tk = Rk/Gk (where Rk and Gk are, respectively, the measured red and green intensities for the kth array element). They then describe how this can be used in an iterative process that normalizes the mean expression ratio to one and calculates confidence limits that can be used to identify differentially expressed genes.
12. A: Gene expression matrix that contains rows representing genes and columns representing particular conditions. Each cell contains a value, given in arbitrary units, that reflects the expression level of a gene under the corresponding condition. B: Condition C4 is used as a reference, and all other conditions are normalized with respect to C4 to obtain expression ratios. C: In this table all expression ratios were converted into log2 (expression ratio) values. This representation has the advantage of treating up-regulation and down-regulation on comparable scales. D: Discrete values for the elements in Table 1.C. Genes with log2 (expression ratio) values greater than 1 were changed to 1, genes with values less than –1 were changed to –1, and any value between –1 and 1 was changed to 0.
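The B–D transformations above can be sketched on a toy matrix (the expression values are made up for illustration):

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = conditions C1..C4.
expr = np.array([
    [8.0, 4.0, 2.0, 2.0],   # gene 1: upregulated in C1
    [1.0, 2.0, 4.0, 4.0],   # gene 2: downregulated in C1
    [3.0, 3.0, 3.0, 3.0],   # gene 3: unchanged
])

# B: use condition C4 (last column) as the reference to obtain expression ratios.
ratios = expr / expr[:, [3]]

# C: log2 ratios put up- and down-regulation on comparable scales
# (2-fold up -> +1, 2-fold down -> -1).
log_ratios = np.log2(ratios)

# D: discretize: > 1 -> 1, < -1 -> -1, otherwise 0.
discrete = np.where(log_ratios > 1, 1, np.where(log_ratios < -1, -1, 0))
print(discrete)
```

For gene 1 the log2 ratios against C4 are [2, 1, 0, 0], so only the 4-fold change in C1 survives discretization.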
18. Distance measures
Analysis of gene expression data is primarily based on comparison of gene expression profiles or sample expression profiles. In order to compare expression profiles, we need a measure to quantify how similar or dissimilar the objects being considered are. A variety of distance measures can be used to calculate similarity in expression profiles; these are discussed below.
19. Euclidean distance
Euclidean distance is one of the common distance measures used to calculate similarity between expression profiles. The Euclidean distance between two vectors of dimension 2, say A = [a1, a2] and B = [b1, b2], can be calculated as:
D(A,B) = √((a1 − b1)² + (a2 − b2)²)
In other words, the Euclidean distance between two genes is the square root of the sum of the squares of the distances between the values in each condition (dimension).
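The same formula, written out for profiles of any number of conditions (the two profiles below are made-up examples):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length:
    the square root of the sum of squared per-condition differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical 2-condition profiles.
A = [1.0, 2.0]
B = [4.0, 6.0]
print(euclidean(A, B))  # sqrt(3^2 + 4^2) = 5.0
```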
20. Pearson correlation coefficient
One of the most commonly used metrics to measure similarity between expression profiles is the Pearson correlation coefficient (PCC) (Eisen et al. 1998). Given the expression ratios for two genes under three conditions, A = [a1, a2, a3] and B = [b1, b2, b3], PCC can be computed as follows:
Step 1: Compute the mean of A and B
Step 2: 'Mean centre' the expression profiles
Step 3: Calculate PCC as the cosine of the angle between the mean-centred profiles
23. The reason why we 'mean centre' the expression profiles is to make sure that we compare the shapes of the expression profiles and not their magnitudes. Mean centring maintains the shape of a profile but changes its magnitude, as shown in the figure. A PCC value of 1 essentially means that the two genes have similar expression profiles, and a value of –1 means that the two genes have exactly opposite expression profiles. A value of 0 means that no relationship can be inferred between the expression profiles of the genes. In practice, PCC values range from –1 to +1. A PCC value ≥ 0.7 suggests that the genes behave similarly, and a PCC value ≤ –0.7 suggests that the genes have opposite behaviour. The value of 0.7 is an arbitrary cut-off, and in real cases this value can be chosen depending on the dataset used.
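The three steps above can be sketched directly (the example profiles are made up): mean-centre both profiles, then take the cosine of the angle between them. Note how profiles with the same shape but different magnitudes still give a PCC of 1.

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation as the cosine of the angle between mean-centred profiles."""
    a = np.asarray(a, float) - np.mean(a)   # steps 1-2: mean-centre
    b = np.asarray(b, float) - np.mean(b)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # step 3: cosine

# Same shape, 10x different magnitude: PCC = 1.
print(round(pcc([1, 2, 3], [10, 20, 30]), 6))   # 1.0
# Exactly opposite shapes: PCC = -1.
print(round(pcc([1, 2, 3], [3, 2, 1]), 6))      # -1.0
```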
29. CLUSTER ANALYSIS
The term 'cluster analysis' actually encompasses several different classification algorithms that can be used to develop taxonomies (typically as part of exploratory data analysis). Note that in this classification, the higher the level of aggregation, the less similar the members of the respective class are.
31. Clustering methods can be hierarchical (grouping objects into clusters and specifying relationships among objects in a cluster, resembling a phylogenetic tree) or non-hierarchical (grouping into clusters without specifying relationships between objects in a cluster), as schematically represented in Figure 5. Remember, an object may refer to a gene or a sample, and a cluster refers to a set of objects that behave in a similar manner.
32. Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
• The tree can be built in two distinct ways:
– bottom-up: agglomerative clustering;
– top-down: divisive clustering.
34. • Single-linkage clustering
The distance between two clusters, i and j, is calculated as the minimum distance between a member of cluster i and a member of cluster j. Consequently, this technique is also referred to as the minimum, or nearest-neighbour, method. This method tends to produce clusters that are 'loose', because clusters can be joined if any two members are close together. In particular, this method often results in 'chaining', or the sequential addition of single samples to an existing cluster. This produces trees with many long, single-addition branches representing clusters that have grown by accretion.
35. • Complete-linkage clustering
Complete-linkage clustering is also known as the maximum or furthest-neighbour method. The distance between two clusters is calculated as the greatest distance between members of the relevant clusters. Not surprisingly, this method tends to produce very compact clusters of elements, and the clusters are often very similar in size.
• Average-linkage clustering
The distance between clusters is calculated using average values. There are, in fact, various methods for calculating averages. The most common is the unweighted pair-group method average (UPGMA). The average distance is calculated from the distances between each point in a cluster and all points in another cluster. The two clusters with the lowest average distance are joined together to form a new cluster.
36. Centroid-linkage clustering
In centroid-linkage clustering, an average expression profile (called a centroid) is calculated in two steps. First, the mean in each dimension of the expression profiles is calculated for all objects in a cluster. Then, the distance between two clusters is measured as the distance between the average expression profiles (centroids) of the two clusters.
37. Distances between clusters used for hierarchical clustering
• Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters:
– Mean-link: average of pairwise dissimilarities.
– Single-link: minimum of pairwise dissimilarities.
– Complete-link: maximum of pairwise dissimilarities.
– Distance between centroids.
• Complete linkage gives preference to compact/spherical clusters.
• Single linkage can produce long, stretched clusters.
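The linkage choices above can be tried side by side on toy data, assuming SciPy is available; the gene profiles below are made-up log-ratios for three well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy log-ratio profiles for 6 genes under 3 conditions (illustrative values):
# two upregulated, two downregulated, two unchanged.
profiles = np.array([
    [ 2.0,  2.1,  1.9],
    [ 2.2,  2.0,  2.1],
    [-1.0, -1.2, -0.9],
    [-1.1, -1.0, -1.1],
    [ 0.1,  0.0, -0.1],
    [ 0.0,  0.2,  0.1],
])

# Agglomerative clustering; method can be 'single' (minimum), 'complete'
# (maximum), 'average' (UPGMA) or 'centroid'.
Z = linkage(profiles, method="average", metric="euclidean")

# Cut the dendrogram to obtain k = 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Because the three groups are well separated, all four linkage methods recover the same partition here; they diverge on data with elongated or overlapping clusters.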
38. Hierarchical divisive clustering
Hierarchical divisive clustering is the opposite of the agglomerative method: the entire set of objects is considered as a single cluster and is broken down into two or more clusters whose members have similar expression profiles. After this is done, each cluster is considered separately and the divisive process is repeated iteratively until all objects have been separated into single objects. The division of objects into clusters at each iterative step may be decided by principal component analysis, which determines a vector that separates the given objects. This method is less popular than agglomerative clustering, but has successfully been used in the analysis of gene expression data by Alon et al. (1999).
Non-hierarchical clustering
One of the major criticisms of hierarchical clustering is that there is no compelling evidence that a hierarchical structure best suits grouping of the expression profiles. An alternative is non-hierarchical clustering, which requires predetermination of the number of clusters. Non-hierarchical clustering then groups existing objects into these predefined clusters rather than organizing them into a hierarchical structure.
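k-means is the classic example of such a method: the number of clusters k is fixed in advance, and objects are assigned to clusters without any hierarchy. A minimal sketch on made-up two-condition profiles (this toy implementation omits the convergence checks and multiple restarts a production version would use):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: group objects into k predefined clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen objects.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean profile of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Four made-up gene profiles forming two obvious groups.
X = np.array([[2.0, 2.0], [2.1, 1.9], [-1.0, -1.0], [-1.1, -0.9]])
labels = kmeans(X, k=2)
print(labels)
```

Unlike a dendrogram, the output is only the final partition; nothing is said about how the clusters relate to each other.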
47. RELATING EXPRESSION DATA TO OTHER BIOLOGICAL INFORMATION
• Predicting binding sites
• Predicting protein interactions and protein functions
• Predicting functionally conserved modules
• 'Reverse-engineering' of gene regulatory networks