Published on

Introduction to Clustering: Part I

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Clustering Algorithms: An Introduction
  2. 2. Classification <ul><li>Method of Supervised learning </li></ul><ul><li>Learns a method for predicting the instance class from pre-labeled (classified) instances </li></ul>
  3. 3. Clustering <ul><li>Method of unsupervised learning </li></ul><ul><li>Finds “natural” grouping of instances given un-labeled data </li></ul>
  4. 4. Clustering Methods <ul><li>Many different method and algorithms: </li></ul><ul><ul><li>For numeric and/or symbolic data </li></ul></ul><ul><ul><li>Deterministic vs. probabilistic </li></ul></ul><ul><ul><li>Exclusive vs. overlapping </li></ul></ul><ul><ul><li>Hierarchical vs. flat </li></ul></ul><ul><ul><li>Top-down vs. bottom-up </li></ul></ul>
  5. 5. Clusters: exclusive vs. overlapping a k j i h g f e d c b
  6. 6. Example of Outlier x x x x x x x x x x x x x x x x xx x x x x x x x x x x x x x x x x x x x x x x x Outlier
  7. 7. Methods of Clustering <ul><li>Hierarchical (Agglomerative): </li></ul><ul><ul><li>Initially, each point in cluster by itself. </li></ul></ul><ul><ul><li>Repeatedly combine the two “nearest” clusters into one. </li></ul></ul><ul><li>Point Assignment: </li></ul><ul><ul><li>Maintain a set of clusters. </li></ul></ul><ul><ul><li>Place points into their “nearest” cluster. </li></ul></ul>
  8. 8. Hierarchical clustering <ul><li>Bottom up </li></ul><ul><ul><li>Start with single-instance clusters </li></ul></ul><ul><ul><li>At each step, join the two closest clusters </li></ul></ul><ul><ul><li>Design decision: distance between clusters </li></ul></ul><ul><ul><ul><li>E.g. two closest instances in clusters vs. distance between means </li></ul></ul></ul><ul><li>Top down </li></ul><ul><ul><li>Start with one universal cluster </li></ul></ul><ul><ul><li>Find two clusters </li></ul></ul><ul><ul><li>Proceed recursively on each subset </li></ul></ul><ul><ul><li>Can be very fast </li></ul></ul><ul><li>Both methods produce a dendrogram </li></ul>
  9. 9. Incremental clustering <ul><li>Heuristic approach (COBWEB/CLASSIT) </li></ul><ul><li>Form a hierarchy of clusters incrementally </li></ul><ul><li>Start: </li></ul><ul><ul><li>tree consists of empty root node </li></ul></ul><ul><li>Then: </li></ul><ul><ul><li>add instances one by one </li></ul></ul><ul><ul><li>update tree appropriately at each stage </li></ul></ul><ul><ul><li>to update, find the right leaf for an instance </li></ul></ul><ul><ul><li>May involve restructuring the tree </li></ul></ul><ul><li>Base update decisions on category utility </li></ul>
  10. 10. And in the Non-Euclidean Case? <ul><li>The only “locations” we can talk about are the points themselves. </li></ul><ul><ul><li>I.e., there is no “average” of two points. </li></ul></ul><ul><li>Approach 1: clustroid = point “closest” to other points. </li></ul><ul><ul><li>Treat clustroid as if it were centroid, when computing intercluster distances. </li></ul></ul>
  11. 11. “ Closest” Point? <ul><li>Possible meanings: </li></ul><ul><ul><li>Smallest maximum distance to the other points. </li></ul></ul><ul><ul><li>Smallest average distance to other points. </li></ul></ul><ul><ul><li>Smallest sum of squares of distances to other points. </li></ul></ul><ul><ul><li>Etc., etc. </li></ul></ul>
  12. 12. k – Means Algorithm(s) <ul><li>Assumes Euclidean space. </li></ul><ul><li>Start by picking k , the number of clusters. </li></ul><ul><li>Initialize clusters by picking one point per cluster. </li></ul><ul><ul><li>Example: pick one point at random, then k -1 other points, each as far away as possible from the previous points. </li></ul></ul>
  13. 13. Populating Clusters <ul><li>For each point, place it in the cluster whose current centroid it is nearest. </li></ul><ul><li>After all points are assigned, fix the centroids of the k clusters. </li></ul><ul><li>Optional : reassign all points to their closest centroid. </li></ul><ul><ul><li>Sometimes moves points between clusters. </li></ul></ul>
  14. 14. Simple Clustering: K-means <ul><li>Works with numeric data only </li></ul><ul><li>Pick a number (K) of cluster centers (at random) </li></ul><ul><li>Assign every item to its nearest cluster center (e.g. using Euclidean distance) </li></ul><ul><li>Move each cluster center to the mean of its assigned items </li></ul><ul><li>Repeat steps 2,3 until convergence (change in cluster assignments less than a threshold) </li></ul>
  15. 15. K-means clustering summary <ul><li>Advantages </li></ul><ul><li>Simple, understandable </li></ul><ul><li>items automatically assigned to clusters </li></ul><ul><li>Disadvantages </li></ul><ul><li>Must pick number of clusters before hand </li></ul><ul><li>All items forced into a cluster </li></ul><ul><li>Too sensitive to outliers </li></ul>
  16. 16. K-means variations <ul><li>K-medoids – instead of mean, use medians of each cluster </li></ul><ul><ul><li>Mean of 1, 3, 5, 7, 9 is </li></ul></ul><ul><ul><li>Mean of 1, 3, 5, 7, 1009 is </li></ul></ul><ul><ul><li>Median of 1, 3, 5, 7, 1009 is </li></ul></ul><ul><ul><li>Median advantage: not affected by extreme values </li></ul></ul><ul><li>For large databases, use sampling </li></ul>5 205 5
  17. 17. Examples of Clustering Applications <ul><li>Marketing: discover customer groups and use them for targeted marketing and re-organization </li></ul><ul><li>Astronomy: find groups of similar stars and galaxies </li></ul><ul><li>Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults </li></ul><ul><li>Genomics: finding groups of gene with similar expression </li></ul><ul><li>And many more. </li></ul>
  18. 18. Clustering Summary <ul><li>unsupervised </li></ul><ul><li>many approaches </li></ul><ul><ul><li>K-means – simple, sometimes useful </li></ul></ul><ul><ul><ul><li>K-medoids is less sensitive to outliers </li></ul></ul></ul><ul><ul><li>Hierarchical clustering – works for symbolic attributes </li></ul></ul>
  19. 19. References <ul><li>This PPT is complied from: </li></ul><ul><li>Data Mining: Concepts and Techniques, 2nd ed. </li></ul><ul><li>The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6 </li></ul>
  20. 20. Visit more self help tutorials <ul><li>Pick a tutorial of your choice and browse through it at your own pace. </li></ul><ul><li>The tutorials section is free, self-guiding and will not involve any additional support. </li></ul><ul><li>Visit us at </li></ul>