A comparative study of Clustering for Gene expression data in Bioinformatics

Welcome to my presentation
on
A Comparative Study of Clustering for Gene
Expression Data in Bioinformatics
Roll: 08054746
Reg: 1484
Department of Statistics
Rajshahi University
Rajshahi-6205
Md. Bipul Hossen, Dept. of Statistics, University of Rajshahi


Outline
1. Why choose a clustering technique?
2. Some Objectives
3. Methods and materials
4. Results and Discussions
5. Conclusion



1. Why Choose a Clustering Technique?
Cluster analysis is routinely run as a first step in
microarray data analysis, to summarize the data and
group genes.
Gene expression data are very noisy and contain a
mixture of expression patterns, with both
down-regulated and up-regulated genes.
We therefore present a comparative study of four
clustering algorithms and two proximity measures,
applied to the commonly used iris data, simulated
data and six real cancer gene expression data sets.



2. Some Objectives
- Find significant clusters according to the
similarities, intensities and regulation of their
objects.
- Compare several hierarchical clustering (HC) methods
with K-means, based on two proximity measures.
- Assess the quality and reliability of the clustering with
the Calinski-Harabasz (CH) and Davies-Bouldin (DB)
indices.
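
The two proximity measures compared in the later tables are Euclidean and Pearson. A minimal sketch of both follows. It assumes the Pearson measure is used as the distance 1 - r, which the slides do not state; the profiles g1 and g2 are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed form of the Pearson measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Hypothetical profiles: g2 has the same pattern as g1 at a higher intensity.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [11.0, 12.0, 13.0, 14.0]

d_euclid = euclidean_distance(g1, g2)   # large, because the intensities differ
d_pearson = pearson_distance(g1, g2)    # about zero, because the patterns match
```

Euclidean distance separates profiles by intensity, while the Pearson-based distance groups profiles with the same up/down pattern regardless of level.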

Bioinformatics Lab, Dept. of Statistics, University of Rajshahi


Methods
1. Single Linkage or Nearest Neighbor Method
2. Complete Linkage or Furthest Neighbor Method
3. Average Linkage Method
4. K-means clustering
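
K-means, the last method in the list, can be sketched as an alternation of two steps: assign each point to its nearest center, then move each center to the mean of its points. This is an illustrative sketch only. The function name, the fixed initial centers and the toy points are assumptions, not the setup used in the study.

```python
import math

def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm with Euclidean distance and caller-supplied start centers."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, init_centers=[points[0], points[3]])
```

In practice the result depends on the initial centers, so K-means is usually restarted from several random starts.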


Davies–Bouldin (DB) Index
The Davies–Bouldin index is a metric for evaluating
clustering algorithms (Davies and Bouldin, 1979). It is
an internal evaluation scheme that measures cluster
separation; lower values indicate compact,
well-separated clusters.
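
As a rough illustration, the DB index can be computed as the average, over clusters, of the worst-case ratio (S_i + S_j) / M_ij, where S_i is the mean distance of cluster i's points to its centroid and M_ij is the distance between centroids i and j. The sketch below assumes Euclidean distance and centroid-based scatter; the toy clusters are hypothetical.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in one cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition given as a list of point lists."""
    centers = [centroid(c) for c in clusters]
    # S_i: mean distance from the members of cluster i to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst (largest) similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
    return total / k

# Hypothetical partitions: same within-cluster spread, different separation.
far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_by = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]

db_far = davies_bouldin(far_apart)
db_close = davies_bouldin(close_by)
```

The better-separated partition receives the smaller index, which is why a lower DB value is read as a better clustering.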



Calinski-Harabasz (CH) Index

\[ \mathrm{CH} = \frac{\mathrm{SSB} / (k - 1)}{\mathrm{SSW} / (N - k)} \]

where SSB is the overall between-cluster variance, SSW is the overall within-cluster
variance, k is the number of clusters, and N is the number of observations.
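
A minimal sketch of the CH computation follows, taking SSB and SSW as sums of squared Euclidean distances (between centroids and the overall mean, and between points and their own centroid). That reading of "variance" is an assumption, and the toy clusters are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(clusters):
    """CH = [SSB / (k - 1)] / [SSW / (N - k)] for a list of point lists."""
    all_points = [p for c in clusters for p in c]
    n_obs = len(all_points)
    k = len(clusters)
    overall = centroid(all_points)
    centers = [centroid(c) for c in clusters]
    # Between-cluster sum of squares, weighted by cluster size.
    ssb = sum(len(c) * sq_dist(ctr, overall) for c, ctr in zip(clusters, centers))
    # Within-cluster sum of squares.
    ssw = sum(sq_dist(p, ctr) for c, ctr in zip(clusters, centers) for p in c)
    return (ssb / (k - 1)) / (ssw / (n_obs - k))

# Hypothetical partition: two tight clusters far apart.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
ch = calinski_harabasz(clusters)
```

Here SSB = 100 and SSW = 4 with k = 2 and N = 4, so CH = (100 / 1) / (4 / 2) = 50. Larger values indicate better-separated clusters.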



Data sets
Dataset | Chip | Tissue | n | #C | Dist. Classes | m | d
----------------------------------------------------------------------
Armstrong-V2 [2] | Affy | Blood | 72 | 3 | 24,20,28 | 12582 | 2194
Bhattacharjee [3] | Affy | Lung | 203 | 5 | 139,17,6,21,20 | 12600 | 1543
Nutt-V1 [6] | Affy | Brain | 50 | 4 | 14,7,14,15 | 12625 | 1377
Alizadeh-V2 [1] | cDNA | Blood | 62 | 3 | 42,9,11 | 4022 | 2093
Garber [4] | cDNA | Lung | 66 | 4 | 17,40,4,5 | 24192 | 4553
Liang [5] | cDNA | Brain | 37 | 3 | 28,6,3 | 24192 | 1411



Fig: Dendrogram cut into 2 clusters and into 3 clusters

In this example, the objects g1, g2, g3, g4, g5, g6, g7, g8, g9 and g10 have been clustered. The
places at the bottom of the tree, where the object names are written, are called leaves. The
junctions are called nodes. A hierarchical clustering algorithm can be used to find groups
in the data by cutting the tree at a certain height. For instance, the example can be read as
two groups, (g2, g3, g1, g8) and (g6, g10, g5, g7, g4, g9); as three groups,
(g2, g3, g1, g8), (g6, g10) and (g5, g7, g4, g9); or as ten groups, each containing only one leaf.
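
Cutting the tree into a given number of groups is equivalent to stopping the agglomeration once that many clusters remain. The sketch below illustrates this with a naive agglomerative procedure and the three linkage rules. The function names and the 1-D points labelled a to f are hypothetical, not the g1 to g10 data of the figure.

```python
import math

def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under one linkage rule."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # nearest neighbor
        return min(pairwise)
    if method == "complete":    # furthest neighbor
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")

def agglomerate(points, n_groups, method="complete"):
    """Merge the two closest clusters repeatedly until n_groups remain."""
    clusters = [[name] for name in points]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(
                    [points[n] for n in clusters[i]],
                    [points[n] for n in clusters[j]],
                    method,
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical objects on a line.
points = {"a": (0.0,), "b": (1.0,), "c": (2.5,), "d": (10.0,), "e": (11.0,), "f": (20.0,)}

three_groups = agglomerate(points, 3, "complete")
two_complete = agglomerate(points, 2, "complete")
two_single = agglomerate(points, 2, "single")
```

On this toy data both linkages agree on three groups, but the two-group cut differs: complete linkage joins f with (d, e), while single linkage leaves f on its own.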


Hierarchical Clustering of Simulated Data

Fig: Heat map

The green dendrogram shows the best result, so the heat map is drawn with
that method: complete-linkage HC with Euclidean distance gives a better
result than the other methods.


K-means of Simulated Data

Table: Davies-Bouldin index
No. of Clusters | Cluster Sizes | DB index
--------------------------------------------
K=2 | 20,40 | 0.897
K=3 | 20,20,20 | 0.321
K=4 | 12,20,8,20 | 0.797
K=5 | 4,4,12,20,20 | 0.825

The table shows that the DB index takes its lowest value at k=3.
We may therefore conclude that three clusters are present in
this data set.
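
The selection rule used here, picking the k with the smallest DB index, can be written out directly. The values are those of the table for the simulated data; the variable names are illustrative.

```python
# DB index for each number of clusters, as reported for the simulated data.
db_index = {2: 0.897, 3: 0.321, 4: 0.797, 5: 0.825}

# A lower DB index means more compact, better-separated clusters.
best_k = min(db_index, key=db_index.get)
```

The minimum is reached at k = 3, matching the conclusion drawn from the table.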


HC of Armstrong-V2 Data(d)



Several HC Nutt-V1 Data (c)



K-means of Alizadeh-V2 Data

No. of Clusters | Cluster Sizes | DB index
--------------------------------------------
K=2 | … | …
K=3 | … | …
K=4 | … | …
K=5 | 22, 9, 3, 10, 18 | 2.…

The table shows that the DB index takes its lowest value at k=3. The cluster
sizes are …, … and …, while the actual class sizes are 42, 9 and 11.
We may therefore conclude that three clusters are present in the Alizadeh-V2 data.


K-means of Liang Data

No. of Clusters | Cluster Sizes | DB index
--------------------------------------------
K=2 | 29, 8 | 1.23
K=3 | … | …
K=4 | 6, 9, 3, 19 | 2.09
K=5 | 1, 2, 19, 14, 1 | 1.215

Table 4.3 shows that the DB index takes its lowest value at k=3. The cluster
sizes are …, … and 3, while the actual class sizes are 28, 6 and 3.
We may therefore conclude that three clusters are present in the Liang data.


Several HC Liang Data(c,d,e,f)
28,6,3



Heat map of Liang Data



Compare HC with K-means for Affymetrix data sets
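Dataset | Distance Method | Cluster Method | Calinski-Harabasz (CH)
----------------------------------------------------------------------
Armstrong-V2 | Euclidean | Single | 1.889
Armstrong-V2 | Euclidean | Complete | 11.803
Armstrong-V2 | Euclidean | Average | 6.674
Armstrong-V2 | Pearson | Single | 0.914
Armstrong-V2 | Pearson | Complete | 12.559
Armstrong-V2 | Pearson | Average | 10.393
Armstrong-V2 | - | K-means | 11.943
Bhattacharjee | Euclidean | Single | 1.786
Bhattacharjee | Euclidean | Complete | 34.702
Bhattacharjee | Euclidean | Average | 26.850
Bhattacharjee | Pearson | Single | 1.700
Bhattacharjee | Pearson | Complete | 26.512
Bhattacharjee | Pearson | Average | 12.902
Bhattacharjee | - | K-means | 22.924
Nutt-V1 | Euclidean | Single | 3.167
Nutt-V1 | Euclidean | Complete | 7.938
Nutt-V1 | Euclidean | Average | 5.269
Nutt-V1 | Pearson | Single | 0.941
Nutt-V1 | Pearson | Complete | 4.273
Nutt-V1 | Pearson | Average | 2.987
Nutt-V1 | - | K-means | 6.051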



Compare HC with K-means for Affymetrix data sets by visualization technique
Fig: Mean of the CH index for Affy Chip (y-axis: CH index; x-axis: Single, Average, Complete, K-Means; series: Pearson, Euclidean)

The graph shows that complete linkage with Euclidean distance achieves a mean CH index
of 18.14, larger than that of single linkage, average linkage and K-means under either
proximity measure. We may therefore conclude that complete linkage
gives the best result for the Affymetrix data sets.
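
The means plotted in the graph can be recomputed from the Affymetrix CH table. The sketch below averages each distance/method combination over the three data sets; the dictionary layout is illustrative, and the values are copied from the table in the order they appear there.

```python
from statistics import mean

# CH values per (distance, method), one entry per Affymetrix data set,
# listed in the order the values appear in the table.
ch = {
    ("Euclidean", "Single"): [1.889, 1.786, 3.167],
    ("Euclidean", "Complete"): [11.803, 34.702, 7.938],
    ("Euclidean", "Average"): [6.674, 26.850, 5.269],
    ("Pearson", "Single"): [0.914, 1.700, 0.941],
    ("Pearson", "Complete"): [12.559, 26.512, 4.273],
    ("Pearson", "Average"): [10.393, 12.902, 2.987],
    (None, "K-means"): [11.943, 22.924, 6.051],
}

# Mean CH index of each combination across the data sets.
mean_ch = {combo: mean(values) for combo, values in ch.items()}

# Larger CH is better, so the best combination has the highest mean.
best = max(mean_ch, key=mean_ch.get)
```

The highest mean belongs to complete linkage with Euclidean distance, at roughly 18.15, which agrees with the 18.14 quoted above up to rounding.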


Compare HC with K-means for cDNA data sets
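Dataset | Distance Method | Cluster Method | Calinski-Harabasz (CH)
----------------------------------------------------------------------
Alizadeh-V2 | Euclidean | Single | 2.047
Alizadeh-V2 | Euclidean | Complete | 11.161
Alizadeh-V2 | Euclidean | Average | 11.068
Alizadeh-V2 | Pearson | Single | 0.980
Alizadeh-V2 | Pearson | Complete | 11.229
Alizadeh-V2 | Pearson | Average | 10.319
Alizadeh-V2 | - | K-means | 13.003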

Garber and Liang: CH values for Single, Complete and Average linkage with Euclidean
and Pearson distance, and for K-means (the pairing of values with rows is not recoverable):
2.772, 19.097, 5.166, 0.855, 7.693, 18.912, 9.269, 9.057, 19.665, 10.279, 19.665, 19.665, 19.665, 23.781



Compare HC with K-means for cDNA data sets by visualization technique

Fig: Mean of the CH index for cDNA Chip (y-axis: CH index; x-axis: Single, Complete, Average, K-Means; series: Euclidean, Pearson)

The graph shows that K-means achieves a mean CH index of 17.01, larger than
that of single, complete and average linkage under either proximity
measure. We may therefore conclude that K-means gives the best
result for the cDNA data sets.


Conclusions
Our results show that complete linkage with Euclidean
distance performed best on the Affymetrix data sets, while
K-means clustering performed best on the cDNA data sets
in terms of recovering the true structure of the data.
To the best of our knowledge, comparative studies of
several HC methods and K-means using the CH and DB
validity indices are poorly documented in the literature.



Future Research Interest
1. Compare hierarchical clustering methods with
Self-Organizing Maps and other recent
clustering methods.
2. Investigate the performance of the different hierarchical
clustering methods against other existing methods
in terms of false discovery rate (FDR), misclassification
error rate (MER), receiver operating characteristic (ROC)
and area under the ROC curve, using resampling techniques.
3. Compare supervised and unsupervised methods
for gene expression data.


References

[1] Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM (2000); Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling; Nature. 403:503-511.
[2] Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ (2002); MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia; Nat Genet. 30:41-47.
[3] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M (2001); Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses; Proc Natl Acad Sci USA. 98(24):13790-13795.
[4] Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I (2001); Diversity of gene expression in adenocarcinoma of the lung; Proc Natl Acad Sci USA. 98(24):13784-13789.
[5] Liang Y, Diehn M, Watson N, Bollen AW, Aldape KD, Nicholas MK, Lamborn KR, Berger MS, Botstein D, Brown PO, Israel MA (2005); Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme; Proc Natl Acad Sci USA. 102(16):5814-5819.
[6] Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN (2003); Gene expression-based classification of malignant gliomas correlates better with survival than histological classification; Cancer Res. 63(7):1602-1607.
