Methods from Mathematical Data Mining (Supported by Optimization)

AACIMP 2009 Summer School lecture by Gerhard Wilhelm Weber. "Modern Operational Research and Its Mathematical Methods" course.


4th International Summer School
Achievements and Applications of Contemporary Informatics, Mathematics and Physics
National University of Technology of the Ukraine, Kiev, Ukraine, August 5-16, 2009

Methods from Mathematical Data Mining (Supported by Optimization)

Gerhard-Wilhelm Weber* and Başak Akteke-Öztürk
Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey
* Faculty of Economics, Management and Law, University of Siegen, Germany
  Center for Research on Optimization and Control, University of Aveiro, Portugal

August 8, 2009
Clustering Theory: Cluster Number and Cluster Stability Estimation

Z. Volkovich, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
Z. Barzily, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
G.-W. Weber, Departments of Scientific Computing, Financial Mathematics and Actuarial Sciences, Institute of Applied Mathematics, Middle East Technical University, 06531 Ankara, Turkey
D. Toledano-Kitai, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
Clustering

• An essential tool for "unsupervised" learning is cluster analysis, which suggests categorizing data (objects, instances) into groups such that the likeness within a group is much higher than that between groups.
• This resemblance is often described by a distance function.
Clustering

For a given set S ⊂ ℝ^d, a clustering algorithm CL constructs a clustered set

   CL(S, int-part, k) = Π(S) = (π_1(S), …, π_k(S)),

such that CL(x) = CL(y) = i if x and y are similar (x, y ∈ π_i(S) for some i = 1, …, k), and CL(x) ≠ CL(y) if x and y are dissimilar.
Clustering

The disjoint subsets π_i(S), i = 1, …, k, are named clusters:

   ⋃_{i=1}^{k} π_i(S) = S,  and  π_i(S) ∩ π_j(S) = ∅ for i ≠ j.
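These two defining properties (the clusters cover S and are pairwise disjoint) can be checked mechanically. The following is a minimal illustrative sketch in Python; the function name `is_partition` is ours, not part of the talk:

```python
def is_partition(clusters, s):
    """Check the two defining properties of a clustering:
    the clusters cover S, and they are pairwise disjoint."""
    union = set().union(*clusters)
    total = sum(len(c) for c in clusters)
    # Disjointness holds exactly when no element is counted twice.
    return union == set(s) and total == len(union)
```

For example, `is_partition([{1, 2}, {3}], {1, 2, 3})` holds, while `[{1, 2}, {2, 3}]` fails disjointness and `[{1}]` fails coverage.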
Clustering

[Figure: two point sets illustrating similar points, CL(x) = CL(y), versus dissimilar points, CL(x) ≠ CL(y).]
Clustering

The iterative clustering process is usually carried out in two phases: a partitioning phase and a quality assessment phase. In the partitioning phase, a label is assigned to each element, in view of the assumption that, in addition to the observed features, each data item has a hidden, unobserved feature representing cluster membership. The quality assessment phase measures the grouping quality. The outcome of the clustering process is the partition that acquires the highest quality score.

Apart from the data itself, two essential input parameters are typically required: an initial partition and a suggested number of clusters. Here, these parameters are denoted as
• int-part;
• k.
The Problem

Partitions generated by iterative algorithms are commonly sensitive to the initial partition fed in as an input parameter. The selection of "good" initial partitions is an essential clustering problem.

Another problem arising here is choosing the right number of clusters. It is well known that this key task of cluster analysis is ill-posed: for instance, the "correct" number of clusters in a data set can depend on the scale in which the data are measured.

In this talk, we address the latter problem, the determination of the number of clusters.
The Problem

Many approaches to this problem exploit the within-cluster dispersion matrix (defined according to the pattern of a covariance matrix). The span of this matrix (column space) usually decreases as the number of groups rises, and may have a point at which it "falls"; such an "elbow" in the graph locates, in several known methods, the "true" number of clusters.

Stability-based approaches to the cluster validation problem evaluate the partitions' variability under repeated applications of a clustering algorithm. Low variability is understood as high consistency of the results obtained, and the number of clusters that maximizes cluster stability is accepted as an estimate of the "true" number of clusters.
The Concept

In the current talk, the problem of determining the true number of clusters is addressed by the cluster stability approach. We propose a method for the study of cluster stability, resting on the geometrical stability of a partition:
• We draw samples from the source data and estimate the clusters by means of each of the drawn samples.
• We compare pairs of the partitions obtained.
• A pair is considered consistent if the divisions obtained are close.
The Concept

• We quantify this closeness by the number of edges connecting points from different samples in a minimal spanning tree (MST) constructed for each one of the clusters.
• We use the Friedman and Rafsky two-sample test statistic, which measures these quantities. Under the null hypothesis of homogeneity of the source data, this statistic is approximately normally distributed. So, the case of well-mingled samples within the clusters leads to a normal distribution of the considered statistic.
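The quantity being measured can be sketched in a few lines of plain Python: build a Euclidean MST over the pooled samples (Prim's algorithm) and count the edges joining the two samples. This is only an illustrative sketch; the helper names `mst_edges` and `cross_edge_count` are ours:

```python
import math

def mst_edges(points):
    """Prim's algorithm: return the edge list of a Euclidean MST over `points`."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        u, v = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((u, v))
        in_tree.add(v)
    return edges

def cross_edge_count(sample1, sample2):
    """The test statistic: number of MST edges joining the two samples."""
    points = list(sample1) + list(sample2)
    label = [0] * len(sample1) + [1] * len(sample2)
    return sum(1 for u, v in mst_edges(points) if label[u] != label[v])
```

Well-mingled samples yield many cross edges (e.g. interleaved points on a line), while well-separated samples are joined by a single bridge edge.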
The Concept

Examples of MSTs produced by samples within a cluster:

[Figure: two MSTs, one with many edges joining the two samples, one with a single connecting edge.]
The Concept

The left-side picture is an example of "a good cluster", where the quantity of edges connecting points from different samples (marked by solid red lines) is relatively large. The right-side picture depicts a "poor situation", in which only one (long) edge connects the (sub-)clusters.
The Two-Sample MST-Test

Henze and Penrose (1999) considered the asymptotic behavior of R_mn, the number of edges of the MST (built on the pooled sample) which connect a point of S to a point of T. Suppose that |S| = m → ∞ and |T| = n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained

   (1/√(m+n)) · (R_mn − 2mn/(m+n)) → N(0, σ_d²),

where the convergence is in distribution and N(0, σ_d²) denotes the normal distribution with expectation 0 and variance

   σ_d² := r (r + C_d (1 − 2r)),

for some constant C_d depending only on the space's dimension d.
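A quick numerical illustration of the limit variance (the function name is ours): for equal sample sizes, p = q = 1/2 gives r = 2pq = 1/2, so the dimension-dependent C_d term vanishes and σ_d² = 1/4 in every dimension.

```python
def hp_variance(p, c_d):
    """Limit variance of the Henze-Penrose statistic:
    sigma_d^2 = r (r + C_d (1 - 2r)), with q = 1 - p and r = 2 p q."""
    q = 1.0 - p
    r = 2.0 * p * q
    return r * (r + c_d * (1.0 - 2.0 * r))
```

So `hp_variance(0.5, c_d)` equals 0.25 regardless of c_d, which is why equal-size samples make the standardization below dimension-free.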
Concept

• Resting upon this fact, the standard score

     Y_j := √(2K/m) · (R_j − m/K)

  of the mentioned edge quantity is calculated for each cluster j = 1, …, K, where m is the sample size and K denotes the number of clusters.
• The partition quality Ỹ is represented by the worst cluster, corresponding to the minimal standard score value obtained.
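The score and the partition quality Ỹ can be sketched as follows, assuming (as on the slides) two equal-size samples of size m split over K clusters, so that each cluster holds about m/K points from each sample; the function names are ours:

```python
import math

def standard_score(r_j, m, k):
    """Y_j = sqrt(2K/m) * (R_j - m/K): standardized cross-edge count of cluster j.
    With equal samples the C_d term of the limit variance vanishes."""
    return math.sqrt(2.0 * k / m) * (r_j - m / k)

def partition_quality(cross_counts, m, k):
    """Y~: the worst (minimal) standardized score over the K clusters."""
    return min(standard_score(r, m, k) for r in cross_counts)
```

With m = 200 and K = 4, a cluster with exactly m/K = 50 cross edges scores 0, and clusters with fewer cross edges pull Ỹ below zero.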
Concept

• It is natural to expect that the true number of clusters can be characterized by the empirical distribution of the partition standard score having the shortest left tail.
• The proposed methodology is expressed as a sequential creation of the described distribution, together with an estimation of its left asymmetry.
Concept

One of the important problems appearing here is the so-called cluster coordination problem: the same cluster can be tagged differently within repeated reruns of the algorithm. This fact results from the inherent symmetry of partitions with respect to their cluster labels.
Concept

We solve this problem in the following way. Let S = S_1 ∪ S_2 and consider three categorizations:

   Π_K := Cl(S, K),  Π_{K,1} := Cl(S_1, K),  Π_{K,2} := Cl(S_2, K).

Thus, we get two partitions for each of the samples S_i, i = 1, 2: the first one is induced by Π_K and the second one is Π_{K,i}.
Concept

For each of the samples i = 1, 2, our purpose is to find the permutation ψ of the set {1, …, K} which minimizes the quantity of misclassified items:

   ψ_i* = arg min_ψ  Σ_{x ∈ S_i} I( ψ(α_{K,i}(x)) ≠ α_K(x) ),  i = 1, 2,

where I(z) is the indicator function of the event z, and α_K, α_{K,i} are the assignments defined by Π_K and Π_{K,i}, correspondingly.
Concept

The well-known Hungarian method for solving this problem has computational complexity O(K³). After changing the cluster labels of the partitions Π_{K,i}, i = 1, 2, consistently with ψ_i*, i = 1, 2, we can assume that these partitions are coordinated, i.e., that the clusters are consistently designated.
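For small K, the coordination problem can even be solved by brute force over all K! permutations; the Hungarian method replaces this with an O(K³) assignment solver (for instance `scipy.optimize.linear_sum_assignment` in SciPy). A brute-force sketch with 0-based labels, using an interface of our own choosing:

```python
from itertools import permutations

def coordinate_labels(ref_labels, labels, k):
    """Find the relabeling psi of {0,...,k-1} minimizing disagreements with the
    reference labeling (brute force, O(k!)). Returns psi and the relabeled list."""
    best = min(permutations(range(k)),
               key=lambda psi: sum(psi[a] != b for a, b in zip(labels, ref_labels)))
    return best, [best[a] for a in labels]
```

For instance, relabeling `[2, 2, 0, 0, 1]` against the reference `[0, 0, 1, 1, 2]` recovers the reference exactly, since the two partitions differ only by a label permutation.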
Algorithm

1. Choose the parameters: K*, J, m, Cl.
2. For K = 2 to K*
3.   For j = 1 to J
4.     S_{j,1} = sample(X, m),  S_{j,2} = sample(X \ S_{j,1}, m).
5.     Calculate Π_{K,j} = Cl(S^(j), K), Π_{K,j,1} = Cl(S_{j,1}, K), Π_{K,j,2} = Cl(S_{j,2}, K), where S^(j) = S_{j,1} ∪ S_{j,2}.
6.     Solve the coordination problem.
Algorithm

7.     Calculate Y_j(k), k = 1, …, K, and Ỹ_j^(K).
8.   End for j.
9.   Calculate an asymmetry index (percentile) I_K for { Ỹ_j^(K) | j = 1, …, J }.
10. End for K.
11. The "true" K is selected as the one which yields the maximal value of the index.

Here, sample(S, m) is a procedure which selects a random sample of size m from the set S, without replacement.
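The outer loop of steps 1-11 can be sketched as follows. This is a hypothetical interface of ours, not the authors' code: `score(s1, s2, k)` stands in for the whole inner work of steps 5-7 (clustering the union, coordinating labels, computing the worst-cluster score Ỹ), and the percentile of the J scores plays the role of the asymmetry index I_K.

```python
import random

def estimate_k(data, score, k_max, j_trials, m, percentile=0.25):
    """Pick the K in {2,...,k_max} maximizing an empirical percentile of the
    J worst-cluster scores. `score(s1, s2, k)` is any routine returning Y~."""
    index = {}
    for k in range(2, k_max + 1):
        scores = []
        for _ in range(j_trials):
            s1 = random.sample(data, m)                 # S_{j,1} = sample(X, m)
            rest = [x for x in data if x not in s1]     # X \ S_{j,1}
            s2 = random.sample(rest, m)                 # S_{j,2}
            scores.append(score(s1, s2, k))
        scores.sort()
        index[k] = scores[int(percentile * (len(scores) - 1))]
    return max(index, key=index.get)
```

Plugging in a score function that rewards some fixed K (a degenerate stand-in for the MST statistic) shows the selection mechanics.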
Numerical Experiments

We have carried out various numerical experiments on synthetic and real data sets. We choose K* = 7 in all tests, and we provide 10 trials for each experiment. The results are presented via error-bar plots of the sample percentiles' means within the trials; the sizes of the error bars equal two standard deviations of the results found inside the trials.

The standard version of the Partitioning Around Medoids (PAM) algorithm has been used for clustering. The empirical percentiles of 25%, 75% and 90% have been used as the asymmetry indexes.
Numerical Experiments – Synthetic Data

The synthesized data are mixtures of 2-dimensional Gaussian distributions with independent coordinates owning the same standard deviation σ. The mean values of the components are placed on the unit circle, at an angular distance of 2π/k̂ between neighbors. Each data set contains 4000 items. Here, we took J = 100 (J: number of samples) and m = 200 (m: size of samples).
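The described generator can be sketched as follows; the function name and the `seed` argument (added for reproducibility) are ours:

```python
import math, random

def gaussian_ring_mixture(n, k_true, sigma, seed=0):
    """n points from a mixture of k_true 2-D Gaussians with independent
    coordinates (common sigma), means equally spaced on the unit circle."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = rng.randrange(k_true)                       # pick a component
        angle = 2.0 * math.pi * c / k_true              # its mean on the circle
        data.append((rng.gauss(math.cos(angle), sigma),
                     rng.gauss(math.sin(angle), sigma)))
    return data
```

For example, `gaussian_ring_mixture(4000, 4, 0.3)` reproduces the setup of the first synthetic example.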
Synthetic Data – Example 1

The first data set has the parameters k̂ = 4 and σ = 0.3. As we see, all three indexes clearly indicate four clusters.
Synthetic Data – Example 2

The second synthetic data set has the parameters k̂ = 5 and σ = 0.3. The components are obviously overlapping in this case.
Synthetic Data – Example 2

As can be seen, the true number of clusters has been successfully found by all indexes.
Numerical Experiments – Real-World Data: First Data Set

The first real data set was chosen from the text collection http://ftp.cs.cornell.edu/pub/smart/ . This set consists of the following three sub-collections:
DC0: Medlars Collection (1033 medical abstracts),
DC1: CISI Collection (1460 information science abstracts),
DC2: Cranfield Collection (1400 aerodynamics abstracts).
Numerical Experiments – Real-World Data: First Data Set

We picked the 600 "best" terms, following the common bag-of-words method. It is known that this collection is well separated by means of its first two leading principal components. Here, we also took J = 100 and m = 200.
Real-World Data – First Data Set

All the indexes attain their maximal values at K = 3, i.e., the number of clusters is properly determined.
Numerical Experiments – Real-World Data: Second Data Set

Another considered data set is the famous Iris Flower Data Set, available, for example, at http://archive.ics.uci.edu/ml/datasets/Iris . This data set is composed of 150 4-dimensional feature vectors from three equally sized sets of iris flowers. We choose J = 200, and the sample size equals 70.
Real-World Data – Iris Flower Data Set

Our method reveals a three-cluster structure.
Conclusions – The Rationale of Our Approach

• In this paper, we propose a novel approach, based on the minimal spanning tree two-sample test, for cluster stability assessment.
• The method offers to quantify the partitions' features through the test statistic computed within the clusters built by means of sample pairs.
• The worst cluster, determined by the lowest standardized statistic value, characterizes the partition quality.
Conclusions – The Rationale of Our Approach

• The departure from the theoretical model, which suggests well-mingled samples within the clusters, is described by the left tail of the score distribution.
• The shortest tail corresponds to the "true" number of clusters.
• All presented experiments detect the true number of clusters.
Conclusions

• In the case of the five-component Gaussian data set, the true number of clusters was found even though a certain overlap of the clusters exists.
• The four-component Gaussian data set contains sufficiently separated components; therefore, it is no surprise that the true number of clusters is attained here.
Conclusions

• The analysis of the abstracts data set was carried out with 600 terms, and the true number of clusters was also detected.
• The Iris Flower data set is sufficiently difficult to analyze, due to the fact that two of its clusters are not linearly separable. However, the true number of clusters was found here as well.
References

Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees, ISI Proceedings of the 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies" (Neringa, Lithuania, May 20-23, 2008), 248-252.

Barzily, Z., Volkovich, Z.V., Akteke-Öztürk, B., and Weber, G.-W., On a minimal spanning tree approach in the cluster validation problem, to appear in the special issue of INFORMATICA on the occasion of the 20th Mini-EURO Conference "Continuous Optimization and Knowledge-Based Technologies" (Neringa, Lithuania, May 20-23, 2008), Dzemyda, G., Miettinen, K., and Sakalauskas, L., guest editors.

Volkovich, V., Barzily, Z., Weber, G.-W., and Toledano-Kitai, D., Cluster stability estimation based on a minimal spanning trees approach, Proceedings of the Second Global Conference on Power Control and Optimization (Bali, Indonesia, June 1-3, 2009), AIP Conference Proceedings 1159, Subseries: Mathematical and Statistical Physics, ISBN 978-0-7354-0696-4 (August 2009), 299-305; Hakim, A.H., Vasant, P., and Barsoum, N., guest editors.
