Introduction to Clustering

Published in: Technology


  1. 1. Clustering Algorithms: An Introduction
  2. 2. Classification <ul><li>Method of Supervised learning </li></ul><ul><li>Learns a method for predicting the instance class from pre-labeled (classified) instances </li></ul>
  3. 3. Clustering <ul><li>Method of unsupervised learning </li></ul><ul><li>Finds “natural” grouping of instances given un-labeled data </li></ul>
  4. 4. Clustering Methods <ul><li>Many different methods and algorithms: </li></ul><ul><ul><li>For numeric and/or symbolic data </li></ul></ul><ul><ul><li>Deterministic vs. probabilistic </li></ul></ul><ul><ul><li>Exclusive vs. overlapping </li></ul></ul><ul><ul><li>Hierarchical vs. flat </li></ul></ul><ul><ul><li>Top-down vs. bottom-up </li></ul></ul>
  5. 5. Clusters: exclusive vs. overlapping [Figure: instances labeled a to k, shown grouped into clusters]
  6. 6. Example of Outlier [Figure: scatter of points marked x, with one isolated point labeled "Outlier"]
  7. 7. Methods of Clustering <ul><li>Hierarchical (Agglomerative): </li></ul><ul><ul><li>Initially, each point is in a cluster by itself. </li></ul></ul><ul><ul><li>Repeatedly combine the two “nearest” clusters into one. </li></ul></ul><ul><li>Point Assignment: </li></ul><ul><ul><li>Maintain a set of clusters. </li></ul></ul><ul><ul><li>Place points into their “nearest” cluster. </li></ul></ul>
  8. 8. Hierarchical clustering <ul><li>Bottom up </li></ul><ul><ul><li>Start with single-instance clusters </li></ul></ul><ul><ul><li>At each step, join the two closest clusters </li></ul></ul><ul><ul><li>Design decision: distance between clusters </li></ul></ul><ul><ul><ul><li>E.g. two closest instances in clusters vs. distance between means </li></ul></ul></ul><ul><li>Top down </li></ul><ul><ul><li>Start with one universal cluster </li></ul></ul><ul><ul><li>Find two clusters </li></ul></ul><ul><ul><li>Proceed recursively on each subset </li></ul></ul><ul><ul><li>Can be very fast </li></ul></ul><ul><li>Both methods produce a dendrogram </li></ul>
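The bottom-up procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: the function names (`single_link`, `mean_link`, `agglomerate`) and the sample points are assumptions made for the example, and the two linkage functions correspond to the slide's "two closest instances" and "distance between means" design choices. It stops at a requested number of clusters instead of building the full dendrogram.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between the two closest instances of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def mean_link(a, b):
    """Distance between the means (centroids) of clusters a and b."""
    def centroid(c):
        return tuple(sum(xs) / len(c) for xs in zip(*c))
    return dist(centroid(a), centroid(b))


def agglomerate(points, n_clusters, linkage=single_link):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    while len(clusters) > n_clusters:
        # Find the pair of cluster indices (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical sample data: two visibly separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerate(points, 2)
```

Each merge scans all cluster pairs, so the sketch is only practical for small inputs; recording the sequence of merges would give the dendrogram.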
  9. 9. Incremental clustering <ul><li>Heuristic approach (COBWEB/CLASSIT) </li></ul><ul><li>Form a hierarchy of clusters incrementally </li></ul><ul><li>Start: </li></ul><ul><ul><li>tree consists of empty root node </li></ul></ul><ul><li>Then: </li></ul><ul><ul><li>add instances one by one </li></ul></ul><ul><ul><li>update tree appropriately at each stage </li></ul></ul><ul><ul><li>to update, find the right leaf for an instance </li></ul></ul><ul><ul><li>May involve restructuring the tree </li></ul></ul><ul><li>Base update decisions on category utility </li></ul>
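The slide does not define category utility. For nominal attributes it is commonly written as below; the symbols are assumptions introduced here, with $C_1, \dots, C_k$ the clusters at one level of the tree, $a_i$ the $i$-th attribute and $v_{ij}$ its $j$-th value.

```latex
CU(C_1, \dots, C_k) =
  \frac{1}{k} \sum_{l=1}^{k} \Pr[C_l]
  \sum_{i} \sum_{j}
  \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)
```

The inner difference measures how much better an attribute value can be predicted once the cluster is known, and dividing by $k$ penalizes splitting into many small clusters.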
  10. 10. And in the Non-Euclidean Case? <ul><li>The only “locations” we can talk about are the points themselves. </li></ul><ul><ul><li>I.e., there is no “average” of two points. </li></ul></ul><ul><li>Approach 1: clustroid = point “closest” to other points. </li></ul><ul><ul><li>Treat clustroid as if it were centroid, when computing intercluster distances. </li></ul></ul>
  11. 11. “Closest” Point? <ul><li>Possible meanings: </li></ul><ul><ul><li>Smallest maximum distance to the other points. </li></ul></ul><ul><ul><li>Smallest average distance to other points. </li></ul></ul><ul><ul><li>Smallest sum of squares of distances to other points. </li></ul></ul><ul><ul><li>Etc., etc. </li></ul></ul>
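A small sketch of choosing a clustroid under each of the three meanings listed above. Jaccard distance on sets is used here only as one example of a non-Euclidean distance; it, the function names and the sample cluster are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: a distance on sets, which have no 'average'."""
    return 1.0 - len(a & b) / len(a | b)


# The three "closest" criteria from the slide, applied to a list of distances.
CRITERIA = {
    "max": max,                                   # smallest maximum distance
    "mean": lambda ds: sum(ds) / len(ds),         # smallest average distance
    "sum_sq": lambda ds: sum(d * d for d in ds),  # smallest sum of squares
}


def clustroid(cluster, distance, criterion="mean"):
    """Return the member of cluster that minimizes the chosen criterion.

    Assumes the cluster has at least two members.
    """
    score = CRITERIA[criterion]

    def cost(p):
        return score([distance(p, q) for q in cluster if q is not p])

    return min(cluster, key=cost)


# Hypothetical cluster of four sets.
cluster = [frozenset("abc"), frozenset("abd"), frozenset("abcd"), frozenset("abce")]
chosen = {name: clustroid(cluster, jaccard_distance, name) for name in CRITERIA}
```

The result is always one of the existing points, so it can stand in for a centroid when computing intercluster distances.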
  12. 12. k-Means Algorithm(s) <ul><li>Assumes Euclidean space. </li></ul><ul><li>Start by picking k, the number of clusters. </li></ul><ul><li>Initialize clusters by picking one point per cluster. </li></ul><ul><ul><li>Example: pick one point at random, then k-1 other points, each as far away as possible from the previous points. </li></ul></ul>
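The initialization example above can be sketched as follows. One assumption is needed: "as far away as possible from the previous points" is read here as maximizing the distance to the nearest already-chosen point. The function name, the `seed` parameter and the sample points are illustrative.

```python
import random
from math import dist


def farthest_first_init(points, k, seed=0):
    """Pick one point at random, then k-1 more, each far from those already picked."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Assumed reading of "as far away as possible": maximize the distance
        # to the nearest center chosen so far.
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Hypothetical data: three separated clumps at x = 0, x = 5 and x = 10.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0), (5.0, 10.0)]
centers = farthest_first_init(points, 3)
```

On data like this the three starting points land in three different clumps whichever point is drawn first. An outlier, being far from everything, would also tend to be picked.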
  13. 13. Populating Clusters <ul><li>For each point, place it in the cluster whose current centroid it is nearest. </li></ul><ul><li>After all points are assigned, fix the centroids of the k clusters. </li></ul><ul><li>Optional: reassign all points to their closest centroid. </li></ul><ul><ul><li>Sometimes moves points between clusters. </li></ul></ul>
  14. 14. Simple Clustering: K-means <ul><li>Works with numeric data only </li></ul><ul><li>Pick a number (K) of cluster centers (at random) </li></ul><ul><li>Assign every item to its nearest cluster center (e.g. using Euclidean distance) </li></ul><ul><li>Move each cluster center to the mean of its assigned items </li></ul><ul><li>Repeat the assignment and update steps until convergence (change in cluster assignments less than a threshold) </li></ul>
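The loop on this slide, written out as a minimal pure-Python sketch. Two assumptions differ slightly from the slide: convergence is taken to mean that no assignment changed at all (a threshold of zero), and a cluster that ends up empty keeps its previous center. The `init` parameter is an addition for reproducible runs and is not part of the slide's description.

```python
import random
from math import dist


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (centers, assignment) after running K-means on tuples of numbers."""
    rng = random.Random(seed)
    # Pick K cluster centers at random unless the caller supplies them.
    centers = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every item to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


# Hypothetical data: two separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, 2, init=[(0.0, 0.0), (0.0, 1.0)])
```

Even with both starting centers placed in the same group, the centers separate after one update on this data. In general the result depends on the starting centers, which is why runs are often repeated with different seeds.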
  15. 15. K-means clustering summary <ul><li>Advantages </li></ul><ul><li>Simple, understandable </li></ul><ul><li>Items automatically assigned to clusters </li></ul><ul><li>Disadvantages </li></ul><ul><li>Must pick the number of clusters beforehand </li></ul><ul><li>All items forced into a cluster </li></ul><ul><li>Sensitive to outliers </li></ul>
  16. 16. K-means variations <ul><li>K-medoids – instead of the mean, use the medoid of each cluster (its most central member; for one-dimensional data, the median) </li></ul><ul><ul><li>Mean of 1, 3, 5, 7, 9 is 5 </li></ul></ul><ul><ul><li>Mean of 1, 3, 5, 7, 1009 is 205 </li></ul></ul><ul><ul><li>Median of 1, 3, 5, 7, 1009 is 5 </li></ul></ul><ul><ul><li>Median advantage: not affected by extreme values </li></ul></ul><ul><li>For large databases, use sampling </li></ul>
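The arithmetic on this slide can be checked with Python's standard `statistics` module. The `medoid` helper is an illustrative addition: it picks the member with the smallest total absolute distance to the others, which for one-dimensional data like this coincides with the median.

```python
from statistics import mean, median

clean = [1, 3, 5, 7, 9]
skewed = [1, 3, 5, 7, 1009]  # same values, with 9 replaced by an extreme one


def medoid(values):
    """Member of values with the smallest total absolute distance to the rest."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


mean_clean = mean(clean)        # the single extreme value has not been added yet
mean_skewed = mean(skewed)      # pulled far upward by 1009
median_skewed = median(skewed)  # unchanged by the extreme value
medoid_skewed = medoid(skewed)  # always an actual member of the data
```

One extreme value moves the mean from 5 to 205, while the median and the medoid both stay at 5.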
  17. 17. Examples of Clustering Applications <ul><li>Marketing: discover customer groups and use them for targeted marketing and re-organization </li></ul><ul><li>Astronomy: find groups of similar stars and galaxies </li></ul><ul><li>Earthquake studies: observed earthquake epicenters should be clustered along continental faults </li></ul><ul><li>Genomics: find groups of genes with similar expression </li></ul><ul><li>And many more. </li></ul>
  18. 18. Clustering Summary <ul><li>unsupervised </li></ul><ul><li>many approaches </li></ul><ul><ul><li>K-means – simple, sometimes useful </li></ul></ul><ul><ul><ul><li>K-medoids is less sensitive to outliers </li></ul></ul></ul><ul><ul><li>Hierarchical clustering – works for symbolic attributes </li></ul></ul>
  19. 19. References <ul><li>This PPT is compiled from: </li></ul><ul><li>Data Mining: Concepts and Techniques, 2nd ed. </li></ul><ul><li>The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6 </li></ul>