Machine Learning and Data Mining: 06 Clustering: Partitioning


    Machine Learning and Data Mining: 06 Clustering: Partitioning - Presentation Transcript

    1. Clustering: Partitioning Methods (Machine Learning and Data Mining)
    2. How do partitioning algorithms work?
        • They construct a partition of a data set containing n objects into a set of k clusters, so as to minimize the sum of squared distances
          $$\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
          where C_m is the center of cluster K_m and t_{mi} is an object in K_m
        • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
          • Global optimum: exhaustively enumerate all partitions
          • Heuristic methods: k-means and k-medoids algorithms
        • k-means: each cluster is represented by the center of the cluster
        • k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
        Prof. Pier Luca Lanzi
    3. K-means Clustering
        • Works with numeric data only
        1. Pick a number (K) of cluster centers (at random)
        2. Assign every item to its nearest cluster center (e.g. using Euclidean distance)
        3. Move each cluster center to the mean of its assigned items
        4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
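The four steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the `max_iter` and `seed` parameters, and the toy `points` are assumptions made for the example, and the stopping threshold is taken to be zero changed assignments.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop once no assignment changes (threshold of zero, an assumption).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On data this well separated the two groups are recovered whichever seeds are drawn; on harder data the result depends on the initial centers.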
    4. K-means Clustering
        • Partitional clustering approach
        • Each cluster is associated with a centroid (center point)
        • Each point is assigned to the cluster with the closest centroid
        • Number of clusters, K, must be specified
        • The basic algorithm is very simple
    5. K-means example (step 1): Pick 3 initial cluster centers k1, k2, k3 (randomly). [Figure: scatter plot, axes X and Y]
    6. K-means example (step 2): Assign each point to the closest cluster center. [Figure: scatter plot, axes X and Y]
    7. K-means example (step 3): Move each cluster center to the mean of each cluster. [Figure: scatter plot, axes X and Y]
    8. K-means example (step 4): Reassign points closest to a different new cluster center. [Figure: scatter plot, axes X and Y]
    9. K-means example (step 5): Re-compute cluster means. [Figure: scatter plot, axes X and Y]
    10. K-means example (step 6): Re-compute cluster means. [Figure: scatter plot, axes X and Y]
    11. Two different K-means Clusterings [Figure: three scatter plots (x vs. y): Original Points, Optimal Clustering, Sub-optimal Clustering]
    12. Importance of Choosing Initial Centroids [Figure: scatter plot (x vs. y) stepping through Iterations 1 to 6]
    13. Importance of Choosing Initial Centroids [Figure: six scatter plots (x vs. y), Iteration 1 to Iteration 6]
    14. Evaluating K-means Clusters
        • The most common measure is the sum of squared error (SSE)
          • For each point, the error is the distance to the nearest cluster center
          • To get SSE, we square these errors and sum them:
            $$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
          • x is a data point in cluster C_i and m_i is the representative point for cluster C_i
        • Given two clusterings, we can choose the one with the smallest error
        • One easy way to reduce SSE is to increase K, the number of clusters
        • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
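The SSE formula above translates directly into a few lines of Python. This is a small sketch; the function name `sse` and the toy data are assumptions for illustration, and `dist` is taken to be Euclidean distance.

```python
import math

def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its cluster's center."""
    return sum(math.dist(p, centers[c]) ** 2 for p, c in zip(points, labels))

# Four points in two clusters; every point lies exactly 1 unit from its center.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (10.0, 1.0)]
two_cluster_sse = sse(points, centers, labels)

# With K = 4 (one center per point) the SSE drops to zero, which is why
# SSE alone cannot be used to choose K.
four_cluster_sse = sse(points, points, [0, 1, 2, 3])
```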
    15. Importance of Choosing Initial Centroids … [Figure: scatter plot (x vs. y) stepping through Iterations 1 to 5]
    16. Importance of Choosing Initial Centroids … [Figure: five scatter plots (x vs. y), Iteration 1 to Iteration 5]
    17. Problems with Selecting Initial Points
        • If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small
          • The chance is relatively small when K is large
          • If clusters are the same size, n, then
            $$P = \frac{K!\, n^K}{(Kn)^K} = \frac{K!}{K^K}$$
          • For example, if K = 10, then probability = 10!/10^{10} = 0.00036
        • Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
        • Consider an example of five pairs of clusters
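The figure 10!/10^10 can be checked numerically. This is a small sketch under the slide's assumption of K equal-size clusters, with each of the K seeds treated as landing in a cluster independently; the function name is made up for the example.

```python
from math import factorial

def prob_one_seed_per_cluster(k):
    """Chance that k random seeds hit k equal-size clusters exactly once each: k!/k^k."""
    return factorial(k) / k ** k

p2 = prob_one_seed_per_cluster(2)    # 2!/2^2
p10 = prob_one_seed_per_cluster(10)  # 10!/10^10, about 0.00036
```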
    18. 10 Clusters Example: Starting with two initial centroids in one cluster of each pair of clusters. [Figure: scatter plot (x vs. y) stepping through Iterations 1 to 4]
    19. 10 Clusters Example: Starting with two initial centroids in one cluster of each pair of clusters. [Figure: four scatter plots (x vs. y), Iteration 1 to Iteration 4]
    20. 10 Clusters Example: Starting with some pairs of clusters having three initial centroids, while others have only one. [Figure: scatter plot (x vs. y) stepping through Iterations 1 to 4]
    21. 10 Clusters Example: Starting with some pairs of clusters having three initial centroids, while others have only one. [Figure: four scatter plots (x vs. y), Iteration 1 to Iteration 4]
    22. K-means
        • Result can vary significantly depending on the initial choice of seeds
        • Can get trapped in a local minimum [Figure: example showing initial cluster centers and instances]
        • To increase the chance of finding the global optimum: restart with different random seeds
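Restarting with different random seeds can be sketched as a loop that keeps the run with the lowest SSE. This is an illustrative sketch only: `kmeans_once`, `kmeans_restarts`, `n_restarts` and the toy data are assumed names and values, and the inner routine is a compact plain k-means.

```python
import math
import random

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns (sse, centers, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, labels))
    return sse, centers, labels

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the result with the smallest SSE."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])

# One-dimensional toy data with three natural groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]
best_sse, best_centers, best_labels = kmeans_restarts(points, k=3)
```

The best of several runs is never worse than the first run, but it is still only the best local optimum seen, not a guaranteed global one.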
    23. K-means Clustering
        • Initial centroids are often chosen randomly
          • Clusters produced vary from one run to another
        • The centroid is (typically) the mean of the points in the cluster
        • ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
        • K-means will converge for the common similarity measures mentioned above
        • Most of the convergence happens in the first few iterations
          • Often the stopping condition is changed to ‘Until relatively few points change clusters’
        • Complexity is O(n*K*I*d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
    24. Solutions to Initial Centroids Problem
        • Multiple runs
          • Helps, but probability is not on your side
        • Sample and use hierarchical clustering to determine initial centroids
        • Select more than k initial centroids and then select among these initial centroids
          • Select the most widely separated
        • Postprocessing
        • Bisecting K-means
          • Not as susceptible to initialization issues
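One way to read "select the most widely separated" is a farthest-first traversal: start from one point and repeatedly add the point farthest from all centers chosen so far. The sketch below is an assumption about how that idea could be coded, not the lecture's procedure; `farthest_first` and the toy data are illustrative names.

```python
import math
import random

def farthest_first(points, k, seed=0):
    """Pick k widely separated initial centers by farthest-first traversal."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point whose nearest already-chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

# Three groups of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 0.0), (11.0, 0.0), (10.0, 1.0),
          (0.0, 10.0), (0.0, 11.0), (1.0, 10.0)]
seeds = farthest_first(points, k=3)
```

Because it always reaches for extreme points, this kind of selection tends to pick outliers, which is one reason to apply it to a sample or a candidate set rather than to the raw data.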
    25. Handling Empty Clusters
        • The basic K-means algorithm can yield empty clusters
        • Several strategies for choosing a replacement centroid
          • Choose the point that contributes most to SSE
          • Choose a point from the cluster with the highest SSE
          • If there are several empty clusters, the above can be repeated several times
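The first strategy (reseed an empty cluster with the point that contributes most to SSE) might look like the sketch below. The function name and data are assumptions; the other centroids are left untouched here, on the assumption that the next k-means iteration recomputes them.

```python
import math

def refill_empty_clusters(points, centers, labels):
    """Give each empty cluster the point that currently contributes most to SSE."""
    centers, labels = list(centers), list(labels)
    for c in range(len(centers)):
        if c not in labels:
            # Index of the point with the largest squared distance to its own center.
            worst = max(range(len(points)),
                        key=lambda i: math.dist(points[i], centers[labels[i]]) ** 2)
            centers[c] = points[worst]
            labels[worst] = c  # its error is now zero, so it is not picked again
    return centers, labels

# Cluster 1 is empty: all three points were assigned to cluster 0.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 0]
centers = [(11.0 / 3, 0.0), (50.0, 50.0)]
new_centers, new_labels = refill_empty_clusters(points, centers, labels)
```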
    26. Updating Centers Incrementally
        • In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
        • An alternative is to update the centroids after each assignment (incremental approach)
          • Each assignment updates zero or two centroids
          • More expensive
          • Introduces an order dependency
          • Never get an empty cluster
          • Can use “weights” to change the impact
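The "zero or two centroids" remark follows from the running-mean update: moving a point changes only the mean it leaves and the mean it joins. The sketch below assumes centroids are plain means and that cluster sizes are tracked in `counts`; all names are illustrative.

```python
def move_point(x, src, dst, centers, counts):
    """Move point x from cluster src to cluster dst, updating both means in place."""
    if src == dst:
        return  # the point stays put: zero centroids change
    if counts[src] == 1:
        return  # assumption: never empty a cluster by moving its only member
    n_src, n_dst = counts[src], counts[dst]
    # Removing x from a mean m over n points gives m - (x - m) / (n - 1).
    centers[src] = tuple(m - (xi - m) / (n_src - 1) for m, xi in zip(centers[src], x))
    # Adding x to a mean m over n points gives m + (x - m) / (n + 1).
    centers[dst] = tuple(m + (xi - m) / (n_dst + 1) for m, xi in zip(centers[dst], x))
    counts[src] -= 1
    counts[dst] += 1

# Cluster 0 holds {0, 2} (mean 1), cluster 1 holds {10}; move the point 2 to cluster 1.
centers = [(1.0,), (10.0,)]
counts = [2, 1]
move_point((2.0,), 0, 1, centers, counts)
```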
    27. Pre-processing and Post-processing
        • Pre-processing
          • Normalize the data
          • Eliminate outliers
        • Post-processing
          • Eliminate small clusters that may represent outliers
          • Split ‘loose’ clusters, i.e., clusters with relatively high SSE
          • Merge clusters that are ‘close’ and that have relatively low SSE
        • These steps can also be used during the clustering process
    28. Bisecting K-means: Variant of K-means that can produce a partitional or a hierarchical clustering
    29. Bisecting K-means Example [Figure]
    30. Limitations of K-means
        • K-means has problems when clusters are of differing
          • Sizes
          • Densities
          • Non-globular shapes
        • K-means has problems when the data contains outliers
    31. Limitations of K-means: Differing Sizes [Figure: Original Points vs. K-means (3 Clusters)]
    32. Limitations of K-means: Differing Density [Figure: Original Points vs. K-means (3 Clusters)]
    33. Limitations of K-means: Non-globular Shapes [Figure: Original Points vs. K-means (2 Clusters)]
    34. Overcoming K-means Limitations [Figure: Original Points vs. K-means Clusters]
        • One solution is to use many clusters
        • This finds parts of clusters, which then need to be put together
    35. Overcoming K-means Limitations [Figure: Original Points vs. K-means Clusters]
    36. Overcoming K-means Limitations [Figure: Original Points vs. K-means Clusters]
    37. K-means clustering summary
        • Advantages
          • Simple, understandable
          • Items automatically assigned to clusters
        • Disadvantages
          • Must pick the number of clusters beforehand
          • All items forced into a cluster
          • Too sensitive to outliers
    38. K-means clustering
        • Strength
          • Relatively efficient
          • Often terminates at a local optimum
          • The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
        • Weakness
          • Applicable only when the mean is defined; what about categorical data?
          • Need to specify k, the number of clusters, in advance
          • Unable to handle noisy data and outliers
          • Not suitable for discovering clusters with non-convex shapes
    39. Variations of the K-Means Method
        • A few variants of k-means differ in
          • Selection of the initial k means
          • Dissimilarity calculations
          • Strategies to calculate cluster means
        • Handling categorical data: k-modes
          • Replacing means of clusters with modes
          • Using new dissimilarity measures to deal with categorical objects
          • Using a frequency-based method to update modes of clusters
        • A mixture of categorical and numerical data: k-prototype method
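The two ingredients k-modes swaps in, a dissimilarity for categorical objects and a frequency-based mode, can be sketched as follows. The choice of simple matching (count of differing attributes) as the dissimilarity is an assumption here, as are the function names and the toy records.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(cluster):
    """Per-attribute most frequent value, playing the role the mean plays in k-means."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
center = mode_of(cluster)
distance = mismatch(center, ("blue", "small"))
```

With these two functions in place of the mean and the Euclidean distance, the assign-and-update loop has the same shape as in k-means.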
    40. Comments on the K-Means Method
        • Strength
          • Relatively efficient: complexity is O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
          • Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
          • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
        • Weakness
          • Applicable only when the mean is defined; what about categorical data?
          • Need to specify k, the number of clusters, in advance
          • Unable to handle noisy data and outliers
          • Not suitable for discovering clusters with non-convex shapes
    41. K-Medoids Clustering
        • Instead of the mean, use medians of each cluster
          • Mean of 1, 3, 5, 7, 9 is 5
          • Mean of 1, 3, 5, 7, 1009 is 205
          • Median of 1, 3, 5, 7, 1009 is 5
          • The median is not affected by extreme values
        • For large databases, use sampling
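The arithmetic above is easy to reproduce, and the same data shows what a medoid is: the actual object with the smallest total distance to all the others. The `medoid` helper is an illustrative assumption, written for one-dimensional values.

```python
from statistics import mean, median

clean = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

mean_clean = mean(clean)                # 5
mean_outlier = mean(with_outlier)       # 205: dragged away by the outlier
median_outlier = median(with_outlier)   # 5: unaffected

def medoid(values):
    """The value in the list that minimizes the summed distance to all other values."""
    return min(values, key=lambda v: sum(abs(v - other) for other in values))

medoid_outlier = medoid(with_outlier)   # 5: also unaffected
```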
    42. K-Medoids Clustering
        • Find representative objects, called medoids, in clusters
        • PAM (Partitioning Around Medoids, 1987)
          • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
          • Works effectively for small data sets, but does not scale well for large data sets
        • CLARA (Kaufman & Rousseeuw, 1990)
        • CLARANS (Ng & Han, 1994): randomized sampling
        • Focusing + spatial data structure (Ester et al., 1995)
    43. A Typical K-Medoids Algorithm (PAM), illustrated for K = 2
        1. Arbitrarily choose k objects as the initial medoids
        2. Assign each remaining object to the nearest medoid (Total Cost = 20)
        3. Randomly select a non-medoid object, O_random
        4. Compute the total cost of swapping a medoid O with O_random (Total Cost = 26)
        5. Swap O and O_random if the quality is improved
        6. Repeat (do loop) until no change
        [Figure: five scatter plots on 0 to 10 axes, one per step]
    44. PAM (Partitioning Around Medoids)
        • Use real objects to represent the clusters
        1. Select k representative objects arbitrarily
        2. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
        3. For each pair of i and h
           • If TC_ih < 0, i is replaced by h
           • Then assign each non-selected object to the most similar representative object
        4. Repeat steps 2-3 until there is no change
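The PAM loop above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the first k objects serve as the arbitrary initial medoids, a swap is accepted as soon as it lowers the total distance, and TC_ih is obtained by recomputing the full cost rather than incrementally. The names `total_cost`, `pam` and the toy data are made up for the example.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum over all objects of the distance to the nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(range(k))  # step 1: arbitrary choice (here, the first k objects)
    cost = total_cost(points, medoids)
    improved = True
    while improved:           # step 4: repeat until no swap helps
        improved = False
        # Steps 2-3: try replacing the medoid at position pos with non-medoid h.
        for pos, h in product(range(k), range(len(points))):
            if h in medoids:
                continue
            candidate = medoids[:pos] + [h] + medoids[pos + 1:]
            new_cost = total_cost(points, candidate)
            if new_cost < cost:  # equivalent to TC_ih = new_cost - cost < 0
                medoids, cost = candidate, new_cost
                improved = True
    # Assign each object to its most similar (nearest) representative object.
    labels = [min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
              for p in points]
    return [points[m] for m in medoids], labels, cost

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
medoid_points, labels, cost = pam(points, k=2)
```

Every candidate swap costs a full pass over the data in this version, which makes the poor scaling of PAM on large data sets easy to see.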
    45. PAM Clustering: Total swapping cost
        $$TC_{ih} = \sum_j C_{jih}$$
        [Figure: four scatter plots on 0 to 10 axes showing objects i, h, j and t, one per case]
        • C_{jih} = 0
        • C_{jih} = d(j, h) - d(j, i)
        • C_{jih} = d(j, h) - d(j, t)
        • C_{jih} = d(j, t) - d(j, i)
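All four cases of C_jih reduce to one expression: the distance from j to its nearest medoid after the swap minus the distance to its nearest medoid before it, with i the medoid being removed, h the candidate replacing it, and t any other medoid. The sketch below uses that reading and sums over every object (including i and h themselves) so that TC_ih equals the change in total cost; the function name and data are assumptions, and at least two medoids are required.

```python
import math

def swap_cost(points, medoids, i, h):
    """TC_ih: change in total distance if medoid i is replaced by non-medoid h.

    medoids, i and h are indices into points; len(medoids) must be at least 2.
    """
    others = [m for m in medoids if m != i]
    total = 0.0
    for p in points:
        before = min(math.dist(p, points[m]) for m in medoids)
        after = min(math.dist(p, points[m]) for m in others + [h])
        total += after - before  # C_jih for this object
    return total

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Both medoids sit in the lower-left group; moving one to (8.0, 8.0) should pay off.
tc = swap_cost(points, medoids=[0, 1], i=0, h=3)
```

A negative value means the swap improves the clustering.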
    46. What Is the Problem with PAM?
        • PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
        • PAM works efficiently for small data sets but does not scale well for large data sets
          • O(k(n-k)^2) for each iteration, where n is # of data objects and k is # of clusters
        • Sampling-based method: CLARA (Clustering LARge Applications)
    47. CLARA (Clustering Large Applications)
        • Developed by Kaufman and Rousseeuw in 1990
        • It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
        • Strength: deals with larger data sets than PAM
        • Weakness
          • Efficiency depends on the sample size
          • A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
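The sample-then-PAM idea can be sketched as below. This is a rough illustration, not the published procedure: the sample count, sample size, the simplified `pam` routine and all names are assumptions. The key point it shows is that each candidate set of medoids is found on a sample but scored on the whole data set.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from each point to its nearest medoid (medoids are points)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(sample, k):
    """Simplified PAM on a small sample: accept any swap that lowers the cost."""
    medoids = list(sample[:k])
    best = cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for h in sample:
                if h in medoids:
                    continue
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                c = cost(sample, candidate)
                if c < best:
                    medoids, best, improved = candidate, c, True
    return medoids

def clara(points, k, n_samples=5, sample_size=4, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k)
        c = cost(points, medoids)  # score on the FULL data set, not the sample
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
medoids, full_cost = clara(points, k=2)
```

If a sample happens to miss a whole region of the data, no medoid can be placed there, which is the bias weakness noted above.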
    48. CLARANS (“Randomized” CLARA)
        • CLARANS stands for Clustering Algorithm based on RANdomized Search
        • CLARANS draws a sample of neighbors dynamically
        • The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
        • If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
        • It is more efficient and scalable than both PAM and CLARA
        • Focusing techniques and spatial access structures may further improve its performance (Ester et al., 1995)
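The graph search can be sketched as a randomized local search in which a neighbor of a node differs from it in exactly one medoid. This is an illustrative sketch under assumptions: the parameter names `num_local` (number of restarts) and `max_neighbor` (failed neighbor checks before a node is declared a local optimum) and their default values are chosen for the example, and the data set must contain more than k objects.

```python
import math
import random

def total_cost(points, medoid_idx):
    """Sum of distances from each point to its nearest medoid (medoids are indices)."""
    return sum(min(math.dist(p, points[m]) for m in medoid_idx) for p in points)

def clarans(points, k, num_local=3, max_neighbor=20, seed=0):
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, math.inf
    for _ in range(num_local):
        # Start from a randomly selected node: a set of k medoid indices.
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current)
        failures = 0
        while failures < max_neighbor:
            # Random neighbor: replace one medoid with one random non-medoid.
            pos = rng.randrange(k)
            h = rng.choice([j for j in range(n) if j not in current])
            neighbor = current[:pos] + [h] + current[pos + 1:]
            c = total_cost(points, neighbor)
            if c < current_cost:
                current, current_cost, failures = neighbor, c, 0
            else:
                failures += 1
        # current is treated as a local optimum; keep the best one seen so far.
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return sorted(best), best_cost

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
medoid_indices, cost = clarans(points, k=2)
```

Unlike CLARA, the search is never confined to a fixed sample of the data; only the neighbors examined at each node are sampled.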

    Pier Luca Lanzi, 3 years ago


    Course "Machine Learning and Data Mining" for the d more


    © All Rights Reserved
