
# Lecture 13


1. Master's Thesis Final Defense
  Spectral Clustering for Large Data Set Using Data Reduction
  Data Mining Lab., master's course, 칭저리그 온다르마
  June 22, 2009
2. Contents
  • Introduction
  • Spectral Clustering
  • Our Approach
  • Experiments
  • Conclusion
  • References
3. Introduction
  • Spectral clustering
    – Spectral clustering was originally motivated by the concept of graph partitioning.
    – It is a more advanced algorithm than traditional methods: its results often exceed the performance of traditional clustering algorithms such as k-means.
    – It is simple to implement and can be solved efficiently with standard linear-algebra software.
  • The spectral clustering problem
    – Spectral clustering can cause problems in both memory use and computation time when the eigenvector computation is large.
4. Introduction
  • Existing approaches
    – Parallel Spectral Clustering
    – Data reduction
  • Our approach focuses on reducing the data size before clustering, in three stages:
    – First, data reduction: a circle technique in which circles cover the dataset and each point belongs to exactly one circle.
    – Second, spectral clustering of the reduced data, using the centers of the circles.
    – Third, assignment: the cluster label of each circle's center is assigned to every point in that circle.
5. Spectral Clustering
  • Graph theory: G = (V, E) is an undirected graph.
    – Adjacency matrix: W = (w_ij), i, j = 1, ..., n
    – Degree: d_i = Σ_{j=1}^{n} w_ij, the sum of the weights of all vertices adjacent to v_i. The degree matrix D is the diagonal matrix with the degrees d_1, ..., d_n on its diagonal.
  • Similarity graphs:
    – ε-neighborhood graph
    – t-nearest-neighbor graph
    – Fully connected graph
6. Spectral Clustering
  • Graph Laplacians
    – Unnormalized: L = D − W
    – Normalized, symmetric: L_sym = D^{−1/2} L D^{−1/2} = I − D^{−1/2} W D^{−1/2}
    – Normalized, random walk: L_rw = D^{−1} L = I − D^{−1} W
  • Spectral clustering algorithm. Given a set of points X = {x_1, ..., x_n} in R^d and k clusters:
    1. Form the similarity matrix using the Gaussian similarity function:
       w_ij = exp(−‖x_i − x_j‖² / (2σ²))
7. Spectral Clustering (continued)
    2. Construct the symmetric normalized Laplacian matrix L = D^{−1/2} A D^{−1/2}, where A = [w_ij].
    3. Find the first k eigenvectors of L and form the matrix U = [u_1 ... u_k] ∈ R^{n×k} by stacking the eigenvectors in columns.
    4. Form the matrix Y from U by normalizing each of U's rows to unit length: Y_ij = U_ij / (Σ_m U_im²)^{1/2}, i = 1, ..., n.
    5. Treat each row of Y as a point in R^k and classify the rows into k classes via the k-means algorithm.
    6. Assign the original point x_i to cluster j if and only if row i of the matrix Y was assigned to cluster j.
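Steps 1–6 above can be sketched with NumPy. This is a minimal illustration, not the thesis implementation: the Gaussian width `sigma` and the deterministic farthest-first k-means initialization are assumptions added here.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Sketch of the slide's algorithm: Gaussian similarity ->
    symmetric normalized Laplacian -> top-k eigenvectors ->
    row normalization -> k-means on the rows."""
    n = len(X)
    # 1. Gaussian similarity  w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # 2. Symmetric normalization  D^{-1/2} A D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # 3. Top-k eigenvectors (eigh returns eigenvalues in ascending order)
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -k:]
    # 4. Normalize each row of U to unit length
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    # 5. k-means on the rows of Y (farthest-first init, Lloyd iterations;
    #    the initialization scheme is an assumption, chosen for determinism)
    centers = [Y[0]]
    for _ in range(k - 1):
        dists = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[dists.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(axis=0)
    # 6. labels[i] is the cluster of the original point x_i
    return labels
```

On two well-separated point clouds, the rows of Y collapse to nearly identical vectors within each group, so a single k-means pass recovers the grouping.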
8. Our Approach
  • Data reduction: we use the circle technique to reduce the size of a large data set. Circles cover the dataset, and each point belongs to exactly one circle.
  • We create circles in the following steps:
    1. Find the center of the first circle.
    2. Calculate the radius of the circle.
    3. Find the data points that lie inside the radius of the circle.
    4. Find the center of the next circle.
    5. Repeat steps 2, 3, and 4 until all data points are covered.
  • First step: find the center of the first circle. The average of all data points becomes the center of the first circle:
    x_0 = (Σ_{i=1}^{n} x_i) / n
9. Our Approach (continued)
  • Second step: calculate the radius of the circle. We compute the distances from the current center x_i to its t nearest neighbors; the average of these distances becomes the radius of the current circle:
    Radius = (Σ_{j=1}^{t} ‖x_j − x_i‖) / t
  • Third step: find the data points inside the radius of the circle: all uncovered points whose distance from the center is less than or equal to the radius, since every point appears in exactly one circle.
  • Fourth step: find the center of the next circle: an uncovered point whose distance from the current center is at least Radius × 1.5 and at most Radius × 2.
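The four reduction steps above can be sketched as follows. The slides do not say what to do when no uncovered point lies in the [1.5R, 2R] ring, so the fallbacks here (jump to the nearest uncovered point, and cover at least one point per iteration) are assumptions added to guarantee termination.

```python
import numpy as np

def circle_reduce(X, t):
    """Sketch of the circle-technique data reduction.
    Returns the circle centers and, for each point, its circle index."""
    n = len(X)
    covered = np.zeros(n, dtype=bool)
    assign = np.full(n, -1)
    centers = []
    center = X.mean(axis=0)                     # step 1: mean of all data
    while not covered.all():
        d = np.linalg.norm(X - center, axis=1)
        # step 2: radius = mean distance to the t nearest neighbours
        radius = np.sort(d)[:t].mean()
        # step 3: cover all uncovered points within the radius
        inside = (~covered) & (d <= radius)
        if not inside.any():                    # ensure progress (assumption)
            inside = (~covered) & (d <= d[~covered].min())
        assign[inside] = len(centers)
        covered |= inside
        centers.append(np.array(center, copy=True))
        if covered.all():
            break
        # step 4: next center from the ring [1.5R, 2R], uncovered points only
        ring = (~covered) & (d >= 1.5 * radius) & (d <= 2.0 * radius)
        cand = np.where(ring)[0] if ring.any() else np.where(~covered)[0]
        center = X[cand[d[cand].argmin()]]      # closest candidate (assumption)
    return np.array(centers), assign
```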
10. Our Approach (continued)
  Figure 1: Principle of our approach (circle centers with the Radius × 1.5 and Radius × 2 rings).
  • Spectral clustering of the reduced data: using the circles' centers, we construct the similarity matrix, compute the eigenvectors of the Laplacian matrix, and run k-means clustering on the first k eigenvectors of the Laplacian matrix.
  • Assignment of each point of the data set: the cluster label of each circle's center is assigned to every point in that circle.
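The assignment stage reduces to a single indexing operation once the reduction and the clustering of the centers are done; the toy arrays below are made up for illustration.

```python
import numpy as np

# Stage 3 of the approach: each point inherits the cluster label that
# spectral clustering gave to the centre of its circle.
circle_of_point = np.array([0, 0, 1, 1, 2])   # circle index per point (assumed)
label_of_circle = np.array([1, 0, 1])         # cluster label per centre (assumed)
point_labels = label_of_circle[circle_of_point]
print(point_labels)   # -> [1 1 0 0 1]
```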
11. Experiments: Toy Sample
  • Data set: the toy sample has 2000 data points in 2 dimensions, divided into 5 clusters.
  • Data reduction with three values of the t-nearest-neighbor parameter:

    | Dataset | # data points | t=20 (1%): circles / % | t=30 (1.5%): circles / % | t=40 (2%): circles / % |
    |---|---|---|---|---|
    | Toy sample | 2000 | 327 / 16% | 239 / 12% | 149 / 7% |

  • Clustering and assignment: data reduction → spectral clustering → assignment of all points of the data set.
12. Experiments: Cluster Quality
  • Several measurements were used to assess cluster quality.
  • Rand Index: a measure of the similarity between two data clusterings. Given a set of n objects S = {O_1, ..., O_n}, suppose U = {u_1, ..., u_n} and V = {v_1, ..., v_n} represent two different partitions of the objects in S:
    RI = (TP + TN) / (TP + FP + FN + TN)
    – TP is the number of pairs of objects placed in the same class in U and in the same cluster in V,
    – TN is the number of pairs in different classes in U and in different clusters in V,
    – FP is the number of pairs in different classes in U but in the same cluster in V,
    – FN is the number of pairs in the same class in U but in different clusters in V.
  • The Rand Index lies between 0 and 1.
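A direct pair-counting implementation of the Rand Index, for illustration:

```python
from itertools import combinations

def rand_index(u, v):
    """Rand index: the fraction of object pairs on which the two
    partitions agree (together in both, or apart in both)."""
    agree = total = 0
    for i, j in combinations(range(len(u)), 2):
        total += 1
        if (u[i] == u[j]) == (v[i] == v[j]):
            agree += 1
    return agree / total
```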
13. Cluster Quality (continued)
  • Adjusted Rand Index: the corrected-for-chance version of the Rand Index; the expected Rand Index of two random partitions does not take a constant value. In pair counts:
    AR = 2(TP·TN − FN·FP) / [(TP + FN)(FN + TN) + (TP + FP)(FP + TN)]
  • Jaccard Coefficient: very similar to the Rand Index, but it disregards the pairs of objects that are in different clusters in both clusterings. It is defined by:
    Jac = TP / (TP + FP + FN)
  • Normalized Mutual Information:
    NMI = I(C, C′) / sqrt(H(C) · H(C′))
    where I(C, C′) is the mutual information between C and C′, and H(C) and H(C′) are their entropies.
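The three measures above can be computed from the same pair counts (NMI from label counts instead); a small sketch:

```python
from itertools import combinations
from math import log

def pair_counts(u, v):
    """Count TP/FN/FP/TN over all object pairs (together/apart in U vs V)."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:       tp += 1
        elif same_u and not same_v: fn += 1
        elif not same_u and same_v: fp += 1
        else:                       tn += 1
    return tp, fn, fp, tn

def jaccard(u, v):
    tp, fn, fp, tn = pair_counts(u, v)
    return tp / (tp + fp + fn)          # TN pairs are disregarded

def adjusted_rand(u, v):
    tp, fn, fp, tn = pair_counts(u, v)
    return 2 * (tp * tn - fn * fp) / (
        (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))

def nmi(u, v):
    """I(U;V) / sqrt(H(U) H(V)); assumes neither partition is a single class."""
    n = len(u)
    labels_u, labels_v = list(u), list(v)
    def h(labels):
        return -sum((c / n) * log(c / n)
                    for c in (labels.count(x) for x in set(labels)))
    i_uv = 0.0
    for a in set(u):
        for b in set(v):
            n_ab = sum(1 for x, y in zip(u, v) if x == a and y == b)
            if n_ab:
                n_a, n_b = labels_u.count(a), labels_v.count(b)
                i_uv += (n_ab / n) * log(n * n_ab / (n_a * n_b))
    return i_uv / (h(labels_u) * h(labels_v)) ** 0.5
```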
14. Cluster Quality: Toy Sample
  [Four charts comparing cluster-quality scores of the reduced data (sizes 327, 239, 149) against the full data set, over t-nearest-neighbor values 5, 10, 20, 50, 100:]
  Figure 1: Rand Index score for the toy sample.
  Figure 2: Adjusted Rand Index score for the toy sample.
  Figure 3: Jaccard Coefficient score for the toy sample.
  Figure 4: Normalized Mutual Information score for the toy sample.
15. Experiments: Corel Data Set
  • Data set: the Corel image data set has 2074 images with 144 features, divided into 18 clusters.
  • Data reduction with three values of the t-nearest-neighbor parameter:

    | Dataset | # data points | t=20 (1%): circles / % | t=30 (1.5%): circles / % | t=40 (2%): circles / % |
    |---|---|---|---|---|
    | Corel | 2074 | 630 / 30% | 513 / 24% | 434 / 20% |

  • Cluster quality: to evaluate our approach, we compare our results with the quality measurements of the Parallel Spectral Clustering experiments on the Corel dataset, which used Normalized Mutual Information as the cluster-quality measure.
16. Cluster Quality: Corel Data Set
  [Four charts comparing PSC against our approach (SCLD) over t-nearest-neighbor values:]
  Figure 5: NMI score.
  Figure 6: Adjusted Rand Index score.
  Figure 7: Rand Index score.
  Figure 8: Jaccard Coefficient score.
17. Cluster Quality: Corel Data Set (continued)
  Figure 9: Adjusted Rand Index cluster-quality scores of the reduced data from the Corel data set (sizes 630, 513, and 434, over t-nearest-neighbor values 5 to 200). The Adjusted Rand Index was computed for every reduced-data size and t value; the cluster quality is similar across the reduced data sets.
  • Result of the experiment: the figures above show that our approach performs better than Parallel Spectral Clustering.
18. Conclusion
  • We investigated spectral clustering on large data sets. Our proposed approach reduces the size of the data set before spectral clustering, in three stages:
    – Data reduction
    – Spectral clustering of the reduced data
    – Assignment of each point of the data set
  • Our experiments used two datasets: a toy sample (2000 data points, 2 dimensions) and the Corel image data set (2074 images, 144 features).
  • On the Corel data set, our approach gave better performance than Parallel Spectral Clustering.
  • The experimental results illustrate that our approach can successfully cluster the samples in these data sets.
19. References
  [1] U. von Luxburg, "A Tutorial on Spectral Clustering", Max Planck Institute for Biological Cybernetics, Technical Report No. TR-149, August 2006.
  [2] D. Hamad, P. Beila, "Introduction to Spectral Clustering", Information and Communication Technologies: from Theory to Applications (ICTTA 2008), 3rd International Conference, April 2008.
  [3] A. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, vol. 31(3), pp. 264-323, 1999.
  [4] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, E. Y. Chang, "PSC: Parallel Spectral Clustering", 2008.
  [5] A. Strehl, J. Ghosh, "Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions", Journal of Machine Learning Research, vol. 3, pp. 583-617, March 2003.
  [6] R. Xu, D. Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, vol. 16, no. 3, May 2005.
  [7] "An Introduction to Cluster Analysis for Data Mining", 2000, http://www-users.cs.umn.edu/~han/dmclass/cluster_survey_10_02_00.pdf
  [8] G. Fung, "A Comprehensive Overview of Basic Clustering Algorithms", June 2, 2001, http://pages.cs.wisc.edu/~gfung/clustering.pdf
  [9] R. Ali, U. Ghani, A. Saeed, "Data Clustering and Its Applications", 1998, http://members.tripod.com/asim_saeed/paper.htm
  [10] K. Y. Yeung, W. L. Ruzzo, "Details of the Adjusted Rand Index and Clustering Algorithms: Supplement to the Paper 'An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data'" (to appear in Bioinformatics), May 3, 2001.
20. References (continued)
  [11] S. Wagner, D. Wagner, "Comparing Clusterings: An Overview", January 12, 2007, http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/812079
  [12] M. Meila, "Comparing Clusterings", Proceedings of the 22nd International Conference on Machine Learning (ACM International Conference Proceeding Series, vol. 119), Bonn, Germany, pp. 277-584, 2005, http://www.stat.washington.edu/www/research/report
  [13] P. N. Tan, M. Steinbach, V. Kumar, "Introduction to Data Mining", Chapter 8: Cluster Analysis: Basic Concepts and Algorithms, Addison-Wesley, March 25, 2006.
  [14] B. Mathias, S. Juri, "Spectral Clustering", October 23, 2008, http://www.inf.unibz.it/dis/teaching/DWDM/reports/1/spectral.pdf
  [15] B. Everitt, S. Landau, M. Leese, "Cluster Analysis", London: Arnold, 2001.
  [16] J. Hartigan, "Clustering Algorithms", New York: Wiley, 1975.
  [17] A. Jain, R. Dubes, "Algorithms for Clustering Data", Englewood Cliffs, NJ: Prentice-Hall, 1988.
  [18] T. Graepel, "Statistical Physics of Clustering Algorithms", Diplomarbeit, Technische Universität Berlin, FB Physik, Institut für Theoretische Physik, technical report 171822, http://stat.cs.tu-berlin.de/~guru/papers/diplom.ps, http://www.research.microsoft.com/~thoreg/papers/d
  [19] A. McCallum, K. Nigam, L. U. Ungar, "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching", Knowledge Discovery and Data Mining, pp. 169-178, 2000.
  [20] B. Auffarth, "Spectral Graph Clustering", Universitat de Barcelona, course report for Técnicas Avanzadas de Aprendizaje at Universitat Politècnica de Catalunya, January 15, 2007.
21. Related Works
  • Parallel Spectral Clustering (PSC): large-scale spectral clustering.
    – Two types of approach:
      • Sparsifying the similarity matrix
      • Nyström approximation
    – The n data points are distributed onto p machine nodes; on each node, PSC computes the similarities between local data.
    – The eigenvector matrix is stored across the distributed nodes to reduce per-node memory use.
22. Related Works
  • Canopy Clustering
    – For clustering large, high-dimensional datasets.
    – A very simple, fast, and accurate method for grouping objects into clusters.
    – Uses two distance thresholds T1 > T2.
    – The basic algorithm begins with a set of points and removes one at random as a canopy center; then, for each point:
      • if its distance from the center is < T1, add the point to the canopy;
      • if the distance is < T2, remove the point from the candidate set.
    – The algorithm loops until the initial set is empty.
    – A given point may occur in more than one canopy.
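The canopy procedure described above can be sketched as follows. The candidate-set bookkeeping is our reading of the algorithm: points inside T2 leave the candidate set, while points between T2 and T1 remain candidates and may join later canopies, which is how overlap arises.

```python
import random

def canopy(points, t1, t2, dist):
    """Canopy clustering sketch (requires T1 > T2). Returns a list of
    (center_index, member_indices) pairs; members may overlap."""
    assert t1 > t2
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        c = random.choice(tuple(remaining))          # random canopy center
        members = [i for i in remaining if dist(points[c], points[i]) < t1]
        canopies.append((c, members))
        # Points tightly bound to this canopy (within T2, including the
        # center itself) are removed from the candidate set.
        remaining -= {i for i in members if dist(points[c], points[i]) < t2}
    return canopies
```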