3. Introduction
Spectral clustering
– Spectral clustering was originally motivated by the concept of graph partitioning.
– Spectral clustering is a more advanced algorithm than traditional clustering algorithms.
• Results obtained by spectral clustering often exceed the performance of traditional clustering algorithms such as k-means.
– It is simple to implement and can be solved efficiently by standard linear algebra software.
Spectral clustering problem
– Spectral clustering can cause problems in both memory use and computation time when the eigenvector matrix is large.
4. Introduction
Approaches
• Parallel Spectral Clustering
• Data reduction
Our Approach
• Focuses on reducing the data size before clustering.
• Three stages:
  • First, data reduction: a circle technique in which circles cover the data set; each point must appear in exactly one circle.
  • Second, spectral clustering of the reduced data: spectral clustering using the centers of the circles.
  • Third, assignment of each point of the data set: the cluster label of a circle's center is assigned to each point of that circle.
5. Spectral Clustering
Graph theory
– $G = (V, E)$ is an undirected graph.
• Adjacency matrix: $W = (w_{ij})_{i,j=1,\dots,n}$
• Degree matrix: $D$, with degrees $d_i = \sum_{j=1}^{n} w_{ij}$,
where $d_i$ is the sum of the weights of all edges adjacent to $v_i$. The degree matrix $D$ is defined as the diagonal matrix with the degrees on its diagonal.
Similarity graphs (a construction sketch follows the list):
• ε-neighborhood graph
• t-nearest neighbor graph
• Fully connected graph
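As an illustration, a minimal Python sketch of the three graph constructions, assuming Gaussian edge weights and Euclidean distance; the function name `similarity_graphs` and the width parameter `delta` are our own choices, not from the slides:

```python
import numpy as np
from scipy.spatial.distance import cdist

def similarity_graphs(X, eps, t, delta):
    """Build the three similarity graphs from pairwise Euclidean distances."""
    dist = cdist(X, X)                          # n x n distance matrix
    W_full = np.exp(-dist**2 / (2 * delta**2))  # fully connected Gaussian graph
    np.fill_diagonal(W_full, 0)

    W_eps = (dist <= eps).astype(float)         # epsilon-neighborhood graph
    np.fill_diagonal(W_eps, 0)

    # t-nearest-neighbor graph: keep the Gaussian weight of the t nearest
    # neighbors of each point, then symmetrize so the graph stays undirected
    nn = np.argsort(dist, axis=1)[:, 1:t + 1]
    W_knn = np.zeros_like(dist)
    rows = np.repeat(np.arange(len(X)), t)
    W_knn[rows, nn.ravel()] = W_full[rows, nn.ravel()]
    W_knn = np.maximum(W_knn, W_knn.T)
    return W_full, W_eps, W_knn
```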
6. Spectral Clustering
• Graph Laplacian (a computation sketch follows the list)
• Unnormalized: $L = D - W$
• Normalized:
  • Symmetric: $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
  • Random walk: $L_{rw} = D^{-1} L = I - D^{-1} W$
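A small numpy sketch of the three Laplacians, assuming a weight matrix `W` with strictly positive degrees (our assumption; isolated vertices would need special handling):

```python
import numpy as np

def graph_laplacians(W):
    """Return the unnormalized, symmetric, and random-walk Laplacians of W."""
    d = W.sum(axis=1)                 # degrees d_i = sum_j w_ij (assumed positive)
    L = np.diag(d) - W                # unnormalized: L = D - W
    inv_sqrt = 1.0 / np.sqrt(d)
    I = np.eye(len(W))
    L_sym = I - (W * inv_sqrt[:, None]) * inv_sqrt[None, :]  # I - D^{-1/2} W D^{-1/2}
    L_rw = I - W / d[:, None]                                # I - D^{-1} W
    return L, L_sym, L_rw
```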
• Spectral Clustering Algorithm
Given a set of points $X = \{x_1, \dots, x_n\}$ in $\mathbb{R}^d$ and $k$ clusters:
1. Form the similarity matrix, using the Gaussian similarity function
$$w_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\delta^2}\right)$$
7. Spectral Clustering
2. Construct the Laplacian matrix: the symmetrically normalized Laplacian matrix
$$A = [w_{ij}], \qquad L = D^{-1/2} A D^{-1/2}$$
3. Find the first $k$ eigenvectors of $L$ and form the matrix $U$ by stacking the eigenvectors in columns:
$$U = [u_1 \cdots u_k] \in \mathbb{R}^{n \times k}$$
4. Form the matrix $Y$ from $U$ by normalizing each of $U$'s rows to unit length:
$$Y_{ij} = \frac{U_{ij}}{\left(\sum_{j=1}^{k} U_{ij}^2\right)^{1/2}}, \quad i = 1, \dots, n$$
5. Treat each row of $Y$ as a point in $\mathbb{R}^k$ and classify the rows into $k$ classes via the k-means algorithm.
6. Assign the original point $x_i$ to cluster $j$ if and only if row $i$ of the matrix $Y$ was assigned to cluster $j$.
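Putting steps 1–6 together, a hedged Python sketch of the algorithm as stated above. It keeps the slides' $L = D^{-1/2} A D^{-1/2}$, whose largest eigenvectors correspond to the smallest eigenvectors of the symmetric Laplacian; scikit-learn's `KMeans` is our choice for step 5:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, delta=1.0):
    """Steps 1-6: Gaussian affinity, normalization, k eigenvectors, k-means."""
    # Step 1: Gaussian similarity matrix
    A = np.exp(-cdist(X, X)**2 / (2 * delta**2))
    np.fill_diagonal(A, 0)
    # Step 2: L = D^{-1/2} A D^{-1/2}
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = (A * inv_sqrt[:, None]) * inv_sqrt[None, :]
    # Step 3: first k eigenvectors of L, stacked in columns (eigh sorts eigenvalues
    # in ascending order, so the k largest belong to the last k columns)
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]
    # Step 4: normalize each row of U to unit length
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Steps 5-6: k-means on the rows of Y; point x_i gets row i's label
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```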
8. Our Approach
Data Reduction:
– We use the circle technique to reduce the size of a large data set. In the circle technique, circles cover the data set, and each point must appear in exactly one circle.
– We create the circles by the following steps:
  • Find the first point of the data; it becomes the center of the first circle.
  • Calculate the radius of the circle.
  • Find the data points that lie inside the radius of the circle.
  • Find the center of the next circle.
  • Repeat steps 2, 3, and 4 until all data points are covered.
First step: find the center of the first circle.
• The average of all data points becomes the center of the first circle:
$$x_0 = \frac{1}{n} \sum_{i=1}^{n} x_i$$
9. Our Approach
Second step: calculate the radius of the circle.
We calculate the distances from the current center point to its t nearest neighbors; the average of these distances becomes the radius of the current circle:
$$\text{Radius} = \frac{1}{t} \sum_{j=1}^{t} \lVert x_j - x_i \rVert$$
Third step: find the data points that lie inside the radius of the circle.
The circle covers every uncovered point whose distance from the center is less than or equal to the radius. Only uncovered points are taken, because every point must appear in exactly one circle.
Fourth step: find the center of the next circle.
The center of the next circle is an uncovered point whose distance from the center of the current circle is at least Radius * 1.5 and at most Radius * 2.
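A sketch of the four steps in Python, assuming Euclidean distance. The fallback to the nearest uncovered point when the [1.5·R, 2·R] ring contains no uncovered point is our assumption, since the slides do not specify that case:

```python
import numpy as np

def circle_reduce(X, t):
    """Cover X with circles; return the centers and each point's circle index."""
    covered = np.zeros(len(X), dtype=bool)
    membership = np.full(len(X), -1)
    centers = []
    center = X.mean(axis=0)                      # step 1: mean of all points
    while not covered.all():
        dist = np.linalg.norm(X - center, axis=1)
        # step 2: radius = average distance to the center's t nearest neighbors
        radius = np.sort(dist[dist > 0])[:t].mean()
        centers.append(center)
        inside = ~covered & (dist <= radius)     # step 3: cover uncovered points
        membership[inside] = len(centers) - 1
        covered |= inside
        if covered.all():
            break
        # step 4: next center is an uncovered point with 1.5*R <= d <= 2*R;
        # if no such point exists we fall back to the nearest uncovered point
        ring = np.where(~covered & (dist >= 1.5 * radius) & (dist <= 2 * radius))[0]
        uncovered = np.where(~covered)[0]
        nxt = ring[0] if len(ring) else uncovered[np.argmin(dist[uncovered])]
        center = X[nxt]
    return np.array(centers), membership
```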
10. Our Approach
[Figure 1: Principle of our approach — circles around the current centers, with the ring between Radius*1.5 and Radius*2 marking where the next center is chosen.]
• Spectral clustering of the reduced data.
Using the circles' centers, we perform spectral clustering: we construct the similarity matrix from the circles' centers, compute the eigenvectors of the Laplacian matrix, and run k-means clustering on the first k eigenvectors of the Laplacian matrix.
• Assignment of each point of the data set.
The cluster label of each circle's center is assigned to every point of that circle.
11. Experiments – Toy sample
– Data set: the Toy sample has 2000 data points with 2 dimensions. The data points were divided into 5 clusters.
– Data reduction: three values of the t-nearest-neighbor parameter.

Dataset    | # of data points | t=20 (1%)         | t=30 (1.5%)       | t=40 (2%)
Toy sample | 2000             | 327 circles (16%) | 239 circles (12%) | 149 circles (7%)

– Clustering & Assignment: data reduction → spectral clustering → assignment to all points of the data set.
12. Experiments – Cluster Quality
Several measures were used to evaluate cluster quality.
Rand Index: the Rand measure is a measure of the similarity between two data clusterings. Given a set of n objects S = {O1, ..., On}, suppose U and V represent two different partitions of the objects in S.
$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$
• TP is the number of pairs of objects that are placed in the same class in U and in the same cluster in V,
• FN is the number of pairs of objects in the same class in U but not in the same cluster in V,
• FP is the number of pairs of objects in the same cluster in V but not in the same class in U,
• TN is the number of pairs of objects placed in different classes in U and in different clusters in V.
– The Rand index lies between 0 and 1.
13. Cluster Quality
Adjusted Rand Index: the Adjusted Rand Index is the corrected-for-chance version of the Rand index; the expected value of the Rand Index of two random partitions does not take a constant value.
$$AR = \frac{2(TP \cdot TN - FP \cdot FN)}{(TP + TN)(TN + FP) + (TP + FN)(FN + FP)}$$
Jaccard Coefficient: it is very similar to the Rand Index, but it disregards the pairs of objects that are in different clusters in both clusterings. It is defined by:
$$Jac = \frac{TP}{TP + FP + FN}$$
Normalized Mutual Information:
$$NMI = \frac{I(C, C')}{\sqrt{H(C) H(C')}}$$
where $I(C, C')$ is the mutual information between $C$ and $C'$, and $H(C)$ and $H(C')$ are their entropies.
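A sketch of the four measures in Python. The O(n²) pair loop is fine at this scale, and scikit-learn's `normalized_mutual_info_score` with `average_method='geometric'` matches the sqrt-normalized NMI above:

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def pair_counts(U, V):
    """TP, FN, FP, TN over all pairs of objects for two labelings U and V."""
    TP = FN = FP = TN = 0
    for i, j in combinations(range(len(U)), 2):
        same_u, same_v = U[i] == U[j], V[i] == V[j]
        TP += same_u and same_v        # same class in U, same cluster in V
        FN += same_u and not same_v    # same class in U, split in V
        FP += not same_u and same_v    # same cluster in V, split in U
        TN += not same_u and not same_v
    return TP, FN, FP, TN

def cluster_quality(U, V):
    TP, FN, FP, TN = pair_counts(U, V)
    ri = (TP + TN) / (TP + FP + FN + TN)
    ar = 2 * (TP * TN - FP * FN) / ((TP + TN) * (TN + FP) + (TP + FN) * (FN + FP))
    jac = TP / (TP + FP + FN)
    nmi = normalized_mutual_info_score(U, V, average_method='geometric')
    return ri, ar, jac, nmi
```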
14. Cluster Quality – Toy sample
[Figure 1: Rand Index score for the Toy sample. Figure 2: Adjusted Rand Index score for the Toy sample. Figure 3: Jaccard Coefficient score for the Toy sample. Figure 4: Normalized Mutual Information score for the Toy sample. Each chart plots the score against the t-nearest-neighbor parameter (5, 10, 20, 50, 100, full) for the reduced data sizes 327, 239, and 149.]
15. Experiments – Corel Data set
– Data set: the Corel image data set has 2074 images with 144 features. The data points were divided into 18 clusters.
– Data reduction: three values of the t-nearest-neighbor parameter.

Dataset | # of data points | t=20 (1%)         | t=30 (1.5%)       | t=40 (2%)
Corel   | 2074             | 630 circles (30%) | 513 circles (24%) | 434 circles (20%)

– Cluster quality: to evaluate our approach, we compare our results with the quality measurements of the Parallel Spectral Clustering experiments on the Corel data set. Normalized Mutual Information was the cluster-quality measure used in Parallel Spectral Clustering.
16. Cluster Quality – Corel Data set
[Figure 5: NMI score. Figure 6: Adjusted Rand Index score. Figure 7: Rand Index score. Figure 8: Jaccard Coefficient score. Each chart compares PSC and SCLD on the Corel data set against the t-nearest-neighbor parameter.]
17. Cluster Quality – Corel Data set
[Figure 9: Adjusted Rand Index scores for the reduced data sizes 630, 513, and 434, plotted against the t-nearest-neighbor parameter (5, 10, 15, 20, 50, 100, 150, 200).]
Figure 9 shows the Adjusted Rand Index cluster-quality scores of the data reduced from the Corel data set; the score was computed for every reduced size and every t-nearest-neighbor value. The cluster quality of the different reduced data sets is similar.
Results of the experiment: the figures above show that our approach performs better than Parallel Spectral Clustering.
18. Conclusion
We investigated spectral clustering on large data sets. Our proposed approach reduces the size of the data set before spectral clustering. It consists of three stages:
• Data reduction
• Spectral clustering of the reduced data
• Assignment of each point of the data set.
Our experiments were run on two data sets:
– the Toy sample and the Corel image data set. The Toy sample has 2000 data points with 2 dimensions; the Corel data set has 2074 images with 144 features.
On the Corel data set, our approach gave better performance than Parallel Spectral Clustering.
The experimental results illustrate that our approach can successfully cluster the samples in these data sets.
19. References
[1] U. von Luxburg, "A Tutorial on Spectral Clustering", Max Planck Institute for Biological Cybernetics, Technical Report No. TR-149, August 2006.
[2] D. Hamad, P. Biela, "Introduction to Spectral Clustering", Information and Communication Technologies: from Theory to Application (ICTTA 2008), 3rd International Conference, April 2008.
[3] A. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, vol. 31(3), pp. 264-323, 1999.
[4] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E.Y. Chang, "PSC: Parallel Spectral Clustering", 2008.
[5] A. Strehl, J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions", Journal of Machine Learning Research, vol. 3, pp. 583-617, March 2003.
[6] R. Xu, D. Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005.
[7] "An Introduction to Cluster Analysis for Data Mining", 2000, http://www-users.cs.umn.edu/~han/dmclass/cluster_survey_10_02_00.pdf
[8] G. Fung, "A Comprehensive Overview of Basic Clustering Algorithms", June 2, 2001, http://pages.cs.wisc.edu/~gfung/clustering.pdf
[9] R. Ali, U. Ghani, A. Saeed, "Data Clustering and Its Applications", 1998, http://members.tripod.com/asim_saeed/paper.htm
[10] K.Y. Yeung, W.L. Ruzzo, "Details of the Adjusted Rand Index and Clustering Algorithms, Supplement to the Paper 'An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data'" (to appear in Bioinformatics), May 3, 2001.
20. References
[11] S. Wagner, D. Wagner, "Comparing Clusterings - An Overview", January 12, 2007, http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/812079
[12] M. Meila, "Comparing Clusterings", ACM International Conference Proceeding Series, vol. 119, Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, pp. 577-584, 2005, http://www.stat.washington.edu/www/research/report
[13] P.N. Tan, M. Steinbach, V. Kumar, "Introduction to Data Mining", Chapter 8: Cluster Analysis: Basic Concepts and Algorithms, Addison-Wesley, March 25, 2006.
[14] B. Mathias, S. Juri, "Spectral Clustering", 23.10.2008, http://www.inf.unibz.it/dis/teaching/DWDM/reports/1/spectral.pdf
[15] B. Everitt, S. Landau, M. Leese, "Cluster Analysis", London: Arnold, 2001.
[16] J. Hartigan, "Clustering Algorithms", New York: Wiley, 1975.
[17] A. Jain, R. Dubes, "Algorithms for Clustering Data", Englewood Cliffs, NJ: Prentice-Hall, 1988.
[18] T. Graepel, "Statistical Physics of Clustering Algorithms", Diplomarbeit, Technische Universität Berlin, FB Physik, Institut für Theoretische Physik, Technical Report 171822, http://stat.cs.tu-berlin.de/~guru/papers/diplom.ps http://www.research.microsoft.com/~thoreg/papers/d
[19] A. McCallum, K. Nigam, L.H. Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", Knowledge Discovery and Data Mining, pp. 169-178, 2000.
[20] B. Auffarth, "Spectral Graph Clustering", Universitat de Barcelona course report for Técnicas Avanzadas de Aprendizaje at Universitat Politècnica de Catalunya, January 15, 2007.
21. Related Works
Parallel Spectral Clustering
Large-scale spectral clustering
– Two types of approaches:
• Sparsifying the similarity matrix
• Nyström approximation
– PSC distributes the n data points onto p machine nodes. On each node, PSC computes the similarities for its local data.
– PSC stores the eigenvector matrix on the distributed nodes to reduce per-node memory use.
22. Related Works
Canopy Clustering
– Clusters large, high-dimensional data sets.
– A very simple, fast, and reasonably accurate method for grouping objects into clusters (see the sketch after this list).
– Uses two distance thresholds T1 > T2 for processing.
– The basic algorithm begins with a set of points and removes one at random to serve as a canopy center.
• For each remaining point:
  – if its distance from the chosen point is < T1, add the point to the canopy;
  – if the distance is < T2, also remove the point from the set.
– The algorithm loops until the initial set is empty.
– A given point may occur in more than one canopy.
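A minimal sketch of the canopy procedure under the description above, with Euclidean distance and a fixed random seed as our assumptions:

```python
import numpy as np

def canopy_clustering(X, T1, T2, seed=0):
    """Group points into (possibly overlapping) canopies with thresholds T1 > T2."""
    assert T1 > T2
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))
    canopies = []
    while remaining:
        center = remaining.pop(rng.integers(len(remaining)))  # random canopy center
        dist = np.linalg.norm(X[remaining] - X[center], axis=1)
        # points within T1 join the canopy; the center always belongs to it
        canopies.append([center] + [p for p, d in zip(remaining, dist) if d < T1])
        # points within T2 are removed and cannot seed or join later canopies
        remaining = [p for p, d in zip(remaining, dist) if d >= T2]
    return canopies
```

Points whose distance falls between T2 and T1 stay in the set, which is why a point may end up in more than one canopy.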