
- 1. 4th International Summer School "Achievements and Applications of Contemporary Informatics, Mathematics and Physics", National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009. Methods from Mathematical Data Mining (Supported by Optimization). Gerhard-Wilhelm Weber* and Başak Akteke-Öztürk, Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey; * also Faculty of Economics, Management and Law, University of Siegen, Germany, and Center for Research on Optimization and Control, University of Aveiro, Portugal. August 8, 2009
- 2. Clustering Theory: Cluster Number and Cluster Stability Estimation. Z. Volkovich, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel; Z. Barzily, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel; G.-W. Weber, Departments of Scientific Computing, Financial Mathematics and Actuarial Sciences, Institute of Applied Mathematics, Middle East Technical University, 06531 Ankara, Turkey; D. Toledano-Kitai, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
- 3. Clustering • An essential tool for "unsupervised" learning is cluster analysis, which categorizes data (objects, instances) into groups such that the likeness within a group is much higher than the likeness between groups. • This resemblance is often described by a distance function.
- 4. Clustering For a given set S ⊂ IR^d, a clustering algorithm CL constructs a clustered set CL(S, int-part, k) = Π(S) = (π_1(S), …, π_k(S)) such that CL(x) = CL(y) = i if x and y are similar (x, y ∈ π_i(S) for some i = 1,…,k), and CL(x) ≠ CL(y) if x and y are dissimilar.
- 5. Clustering The disjoint subsets π_i(S), i = 1,…,k, are named clusters: ∪_{i=1}^{k} π_i(S) = S, and π_i ∩ π_j = ∅ for i ≠ j.
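As a toy illustration of this definition (hypothetical 1-d data and centers, not from the talk), the following sketch assigns each point of S to its nearest center and checks that the resulting clusters are pairwise disjoint and cover S:

```python
# A minimal sketch of a clustering map CL: each point of the toy set S
# is assigned the index of its nearest center (all data are hypothetical).

def cl(x, centers):
    """Cluster label of x: index of the nearest center."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

S = [0.1, 0.2, 0.9, 1.1, 5.0, 5.2]   # toy 1-d data set
centers = [0.0, 1.0, 5.0]            # k = 3 assumed centers

clusters = {i: [x for x in S if cl(x, centers) == i] for i in range(len(centers))}

# The clusters pi_i(S) form a partition: pairwise disjoint, union = S.
assert sorted(x for pts in clusters.values() for x in pts) == sorted(S)
```

Each point receives exactly one label, so the π_i(S) are automatically disjoint; the assertion verifies that they also cover S.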
- 6. Clustering (Illustration: similar points, CL(x) = CL(y), fall into the same cluster; dissimilar points have CL(x) ≠ CL(y).)
- 7. Clustering The iterative clustering process is usually carried out in two phases: a partitioning phase and a quality-assessment phase. In the partitioning phase, a label is assigned to each element, under the assumption that, in addition to the observed features, each data item has a hidden, unobserved feature representing its cluster membership. The quality-assessment phase measures the grouping quality. The outcome of the clustering process is the partition that acquires the highest quality score. Besides the data itself, two essential input parameters are typically required: an initial partition and a suggested number of clusters, denoted here as • int-part; • k.
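The two phases can be sketched as follows (a toy k-means-style stand-in on 1-d data, not the clusterer used in the talk; for determinism the sketch enumerates all candidate center triples rather than iterating from a single initial partition):

```python
from itertools import combinations

# Sketch of the two-phase loop on toy 1-d data: the partitioning phase
# assigns a hidden cluster label to each item; the quality-assessment
# phase scores the grouping (negative within-cluster dispersion); the
# best-scoring partition wins. A real iterative algorithm would instead
# start from the input parameters int-part and k.

def partition(S, centers):
    """Partitioning phase: label each item with its nearest center."""
    return [min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2) for x in S]

def quality(S, labels, k):
    """Quality-assessment phase: higher is better (less dispersion)."""
    score = 0.0
    for i in range(k):
        pts = [x for x, l in zip(S, labels) if l == i]
        if pts:
            mu = sum(pts) / len(pts)
            score -= sum((x - mu) ** 2 for x in pts)
    return score

def cluster(S, k):
    """Return the highest-quality partition over all candidate centers."""
    best, best_q = None, float("-inf")
    for centers in combinations(S, k):
        labels = partition(S, centers)
        q = quality(S, labels, k)
        if q > best_q:
            best, best_q = labels, q
    return best

labels = cluster([0.1, 0.2, 0.9, 1.1, 5.0, 5.2], k=3)   # → [0, 0, 1, 1, 2, 2]
```

The returned labeling groups the three visually obvious pairs, which is exactly the partition with the highest quality score.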
- 8. The Problem Partitions generated by iterative algorithms are commonly sensitive to the initial partition fed in as an input parameter, so selecting a "good" initial partition is an essential clustering problem. Another problem arising here is choosing the right number of clusters. It is well known that this key task of cluster analysis is ill-posed: for instance, the "correct" number of clusters in a data set can depend on the scale in which the data are measured. In this talk, we address the latter problem, the determination of the number of clusters.
- 10. The Problem Many approaches to this problem exploit the within-cluster dispersion matrix (defined according to the pattern of a covariance matrix). The span of this matrix usually decreases as the number of groups rises, and may have a point at which it "falls"; such an "elbow" on the graph locates, in several known methods, the "true" number of clusters. Stability-based approaches to the cluster-validation problem evaluate the partitions' variability under repeated applications of a clustering algorithm. Low variability is understood as high consistency of the results obtained, and the number of clusters that maximizes cluster stability is accepted as an estimate of the "true" number of clusters.
- 11. The Concept In the current talk, the problem of determining the true number of clusters is addressed by the cluster stability approach. We propose a method for the study of cluster stability which assesses the geometrical stability of a partition. • We draw samples from the source data and estimate the clusters by means of each of the drawn samples. • We compare pairs of the partitions obtained. • A pair is considered consistent if the divisions obtained are close.
- 12. The Concept • We quantify this closeness by the number of edges connecting points from different samples in a minimal spanning tree (MST) constructed for each of the clusters. • We use the Friedman-Rafsky two-sample test statistic, which measures these quantities. Under the null hypothesis of homogeneity of the source data, this statistic is approximately normally distributed; thus, well-mingled samples within the clusters lead to an approximately normal distribution of the considered statistic.
- 13. The Concept Examples of MSTs produced by samples within a cluster: (figures)
- 14. The Concept The left-hand picture is an example of "a good cluster", where the number of edges connecting points from different samples (marked by solid red lines) is relatively large. The right-hand picture shows a "poor situation", in which only one (long) edge connects the (sub-)clusters.
- 15. The Two-Sample MST-Test Henze and Penrose (1979) considered the asymptotic behavior of R_mn, the number of edges of the MST which connect a point of S to a point of T. Suppose that |S| = m → ∞ and |T| = n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained (1/√(m+n)) (R_mn − 2mn/(m+n)) → N(0, σ_d²), where the convergence is in distribution and N(0, σ_d²) denotes the normal distribution with expectation 0 and variance σ_d² := r (r + C_d (1 − 2r)), for some constant C_d depending only on the space's dimension d.
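The statistic R_mn can be illustrated on toy data (assumed points and a plain Prim's-algorithm MST, not the authors' implementation): build the MST of the pooled sample and count the edges joining a point of S to a point of T.

```python
# Sketch of the two-sample MST statistic R_mn on hypothetical 2-d data:
# build the MST of the pooled sample and count cross-sample edges.

def mst_edges(points):
    """Prim's algorithm: return the n-1 edges of the Euclidean MST."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        u, v = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
        edges.append((u, v))
        in_tree.add(v)
    return edges

S = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0)]   # sample S, m = 3
T = [(0.1, 0.0), (0.9, 1.1), (1.2, 0.9)]   # sample T, n = 3

pooled = S + T
from_S = [True] * len(S) + [False] * len(T)
R_mn = sum(from_S[u] != from_S[v] for u, v in mst_edges(pooled))   # → 4

m, n = len(S), len(T)
mean_R = 2 * m * n / (m + n)   # mean of R_mn under homogeneity → 3.0
```

Here the two samples are well mingled, so R_mn lies above its null mean 2mn/(m+n); for large m, n the standardized quantity from the slide above is approximately N(0, σ_d²).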
- 16. Concept • Resting upon this fact, the standard score Y_j := √(K/(2m)) (R_j − m/K) of the mentioned edge quantity is calculated for each cluster j = 1,…,K, where m is the sample size and K denotes the number of clusters. • The partition quality Ỹ is represented by the worst cluster, corresponding to the minimal standard score value obtained.
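A small sketch of this scoring step follows. The normalization is reconstructed from the slide by taking roughly m/K points from each sample per cluster in the two-sample formula above, so treat the exact scale factor as an assumption rather than the authors' verified formula.

```python
# Per-cluster standard scores Y_j = sqrt(K/(2m)) * (R_j - m/K), where
# R_j is the cross-sample edge count in cluster j; the partition quality
# is the minimal (worst) score. The scale factor is a reconstruction.

def partition_score(R, m, K):
    """Worst-cluster standard score for cross-edge counts R (length K)."""
    scale = (K / (2 * m)) ** 0.5
    return min(scale * (r - m / K) for r in R)

# Hypothetical cross-edge counts for K = 3 clusters, sample size m = 60:
# under homogeneity E[R_j] is about m/K = 20, so the cluster with R_j = 9
# is the worst and determines the partition score.
score = partition_score([22, 19, 9], m=60, K=3)
```

A strongly negative score signals a cluster whose two samples are poorly mingled, i.e., an unstable partition.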
- 17. Concept • It is natural to expect that the true number of clusters is characterized by the empirical distribution of the partition standard score having the shortest left tail. • The proposed methodology is thus a sequential creation of the described distribution together with an estimation of its left asymmetry.
- 18. Concept One of the important problems appearing here is the so-called cluster coordination problem: the same cluster can be tagged differently across repeated reruns of the algorithm. This results from the inherent symmetry of partitions with respect to their cluster labels.
- 19. Concept We solve this problem in the following way. Let S = S_1 ∪ S_2, and consider three categorizations: Π_K := Cl(S, K), Π_{K,1} := Cl(S_1, K), Π_{K,2} := Cl(S_2, K). Thus we get two partitions for each of the samples S_i, i = 1, 2: the first is induced by Π_K and the second is Π_{K,i}.
- 20. Concept For each of the samples i = 1, 2, our purpose is to find the permutation ψ of the set {1,…,K} which minimizes the number of misclassified items: ψ_i* = arg min_ψ Σ_{x ∈ X} I( ψ(α_{K,i}(x)) ≠ α_K(x) ), i = 1, 2, where I(z) is the indicator function of the event z and α_K, α_{K,i} are the assignments defined by Π_K, Π_{K,i}, correspondingly.
- 21. Concept The well-known Hungarian method solves this problem with computational complexity O(K³). After changing the cluster labels of the partitions Π_{K,i}, i = 1, 2, consistently with ψ_i*, i = 1, 2, we can assume that these partitions are coordinated, i.e., that the clusters are consistently designated.
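The coordination step can be sketched as follows (brute force over label permutations, which is adequate for small K; the Hungarian method solves the same assignment problem in O(K³)). The label vectors are toy examples:

```python
from itertools import permutations

# Relabel a sample partition so it agrees as closely as possible with
# the labels induced by the full partition (toy label vectors below).

def coordinate(labels_full, labels_sample, K):
    """Return labels_sample relabeled to minimize mismatches with labels_full."""
    best = min(
        permutations(range(K)),
        key=lambda p: sum(p[s] != f for f, s in zip(labels_full, labels_sample)),
    )
    return [best[s] for s in labels_sample]

full   = [0, 0, 1, 1, 2, 2]
sample = [2, 2, 0, 0, 1, 1]        # same grouping, different label names
coordinated = coordinate(full, sample, K=3)   # → [0, 0, 1, 1, 2, 2]
```

After relabeling, the two partitions agree item by item even though the raw label names differed, which is exactly what "coordinated" means here.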
- 22. Algorithm
  1. Choose the parameters K*, J, m, Cl.
  2. For K = 2 to K*
  3.   For j = 1 to J
  4.     S_{j,1} = sample(X, m); S_{j,2} = sample(X \ S_{j,1}, m)
  5.     Calculate Π_{K,j} = Cl(S(j), K), Π_{K,j,1} = Cl(S_{j,1}, K), Π_{K,j,2} = Cl(S_{j,2}, K).
  6.     Solve the coordination problem.
- 23. Algorithm (continued)
  7.     Calculate Y_j(k), k = 1,…,K, and Ỹ_j^{(K)}.
  8.   End for j.
  9.   Calculate an asymmetry index (percentile) I_K for {Ỹ_j^{(K)} | j = 1,…,J}.
  10. End for K.
  11. The "true" number of clusters is selected as the K which yields the maximal value of the index.
  Here, sample(S, m) is a procedure which selects a random sample of size m from the set S, without replacement.
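The outer loop of the algorithm might be sketched like this. The clustering, coordination, and MST-scoring steps are collapsed into a stand-in score_fn (so this is only the skeleton of steps 2-11, not the authors' implementation); the names K*, J, m follow the slide.

```python
import random

# Sketch of the outer loop: for each candidate K, draw J disjoint sample
# pairs, score each pair's partition (score_fn stands in for clustering +
# coordination + MST scoring), summarize the J scores by an empirical
# percentile (the asymmetry index I_K), and report the K with the
# maximal index.

def percentile(values, q):
    """Crude empirical q-percentile (0 <= q < 1) of a list of numbers."""
    v = sorted(values)
    return v[min(len(v) - 1, int(q * len(v)))]

def estimate_k(X, K_star, J, m, score_fn, seed=0):
    rng = random.Random(seed)
    index = {}
    for K in range(2, K_star + 1):
        scores = []
        for _ in range(J):
            s1 = rng.sample(X, m)                               # sample(X, m)
            s2 = rng.sample([x for x in X if x not in s1], m)   # sample(X \ S1, m)
            scores.append(score_fn(s1, s2, K))                  # stands in for Y~
        index[K] = percentile(scores, 0.25)                     # asymmetry index I_K
    return max(index, key=index.get)

# Toy score peaking at K = 3, so the estimate is 3:
est = estimate_k(list(range(100)), K_star=7, J=5, m=10,
                 score_fn=lambda s1, s2, K: -abs(K - 3))
```

Plugging in the real pipeline for score_fn (PAM clustering, coordination, per-cluster MST scores, worst-cluster minimum) yields the method of the talk.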
- 24. Numerical Experiments We have carried out various numerical experiments on synthetic and real data sets. We chose K* = 7 in all tests and performed 10 trials per experiment. The results are presented via error-bar plots of the mean of the sample percentiles over the trials; the sizes of the error bars equal two standard deviations of the results within the trials. The standard version of the Partitioning Around Medoids (PAM) algorithm has been used for clustering, and the empirical percentiles of 25%, 75% and 90% have been used as the asymmetry indexes.
- 25. Numerical Experiments - Synthetic Data The synthesized data are mixtures of 2-dimensional Gaussian distributions with independent coordinates having the same standard deviation σ. The component means are placed on the unit circle, at an angular distance of 2π/k̂ between neighbors. Each data set contains 4000 items. Here, we took J = 100 (J: number of samples) and m = 200 (m: size of samples).
- 26. Synthetic Data - Example 1 The first data set has the parameters k̂ = 4 and σ = 0.3. As we see, all three indexes clearly indicate four clusters.
- 27. Synthetic Data - Example 2 The second synthetic data set has the parameters k̂ = 5 and σ = 0.3. The components obviously overlap in this case.
- 28. Synthetic Data - Example 2 As can be seen, the true number of clusters has been successfully found by all indexes.
- 29. Numerical Experiments - Real-World Data, First Data Set The first real data set was chosen from the text collection http://ftp.cs.cornell.edu/pub/smart/ . This set consists of the following three sub-collections: DC0: Medlars Collection (1033 medical abstracts), DC1: CISI Collection (1460 information science abstracts), DC2: Cranfield Collection (1400 aerodynamics abstracts).
- 30. Numerical Experiments - Real-World Data, First Data Set We picked the 600 "best" terms, following the common bag-of-words method. It is known that this collection is well separated by means of its first two leading principal components. Here, we again took J = 100 and m = 200.
- 31. Real-World Data - First Data Set All the indexes attain their maximal values at K = 3, i.e., the number of clusters is properly determined.
- 32. Numerical Experiments - Real-World Data, Second Data Set Another data set considered is the famous Iris Flower Data Set, available, for example, at http://archive.ics.uci.edu/ml/datasets/Iris . This data set is composed of 150 4-dimensional feature vectors from three equally sized sets of iris flowers. We chose J = 200 and a sample size of 70.
- 33. Real-World Data - Iris Flower Data Set Our method reveals a three-cluster structure.
- 34. Conclusions - The Rationale of Our Approach • In this paper, we propose a novel approach, based on the minimal spanning tree two-sample test, for cluster stability assessment. • The method quantifies the partitions' features through the test statistic computed within the clusters built by means of sample pairs. • The worst cluster, determined by the lowest standardized statistic value, characterizes the partition quality.
- 35. Conclusions - The Rationale of Our Approach • The departure from the theoretical model, which assumes well-mingled samples within the clusters, is described by the left tail of the score distribution. • The shortest tail corresponds to the "true" number of clusters. • All presented experiments detect the true number of clusters.
- 36. Conclusions • In the case of the five-component Gaussian data set, the true number of clusters was found even though the clusters overlap to a certain extent. • The four-component Gaussian data set contains sufficiently separated components; therefore, it is no revelation that the true number of clusters is attained here.
- 37. Conclusions • The analysis of the abstracts data set was carried out with 600 terms, and the true number of clusters was also detected. • The Iris Flower data set is rather difficult to analyze, since two of its clusters are not linearly separable. However, the true number of clusters was found here as well.
- 38. References
  Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees, ISI Proceedings of the 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies" (Neringa, Lithuania, May 20-23, 2008), 248-252.
  Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., On a minimal spanning tree approach in the cluster validation problem, to appear in the special issue of INFORMATICA on the occasion of the 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies" (Neringa, Lithuania, May 20-23, 2008), Dzemyda, G., Miettinen, K., and Sakalauskas, L., guest editors.
  Volkovich, V., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization (Bali, Indonesia, June 1-3, 2009), AIP Conference Proceedings 1159, Subseries: Mathematical and Statistical Physics, ISBN 978-0-7354-0696-4 (August 2009), 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest editors.
