Published in: Education, Technology

  1. Introduction to Machine Learning
     Lecture 18: Clustering
     Albert Orriols i Puig
     http://www.albertorriols.net
     aorriols@salle.url.edu
     Artificial Intelligence – Machine Learning
     Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
  2. Recap of Lecture 17
     • Clustering
     • Hierarchical clustering
  3. Today’s Agenda
     • Partitional clustering: K-means
     • Applications of clustering
     • Using Weka
  4. Partitional Clustering
     • Aim: assign a set of objects to K clusters with no hierarchical structure.
     • How? First approach: enumerate all partitions and keep the one that minimizes a measure of quality.
     • However, this is too expensive when the number of elements increases. E.g., organizing 30 objects into 3 groups gives roughly 2·10^14 possible assignments (3^30).
     • Hence, we need heuristic methods.
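To see how quickly exhaustive enumeration becomes infeasible, the sketch below counts the candidates for the example of 30 objects and 3 groups. It assumes two ways of counting that the slide does not spell out: labeled assignments (K^N, where each object picks one of K named groups) and unlabeled partitions into exactly K non-empty groups (the Stirling number of the second kind). The function name `stirling2` is illustrative only.

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Number of ways to split n distinct objects into k non-empty, unlabeled groups."""
    # Inclusion-exclusion: S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

labeled = 3 ** 30              # each of the 30 objects picks one of 3 named groups
unlabeled = stirling2(30, 3)   # partitions into exactly 3 non-empty groups

print(f"{labeled:.2e}")    # about 2.06e+14
print(f"{unlabeled:.2e}")  # about 3.43e+13
```

Either way of counting gives far too many candidates to evaluate one by one, which is the point of the slide.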
  5. Defining the Problem
     • The problem: map N objects into K clusters, where each object belongs to exactly one cluster.
     • Key factors: the criterion function and the algorithmic process.
     • We’ll see squared error algorithms.
  6. Squared Error Algorithms
     • Definition of squared error: assume a collection of objects x1, x2, …, xN that we want to organize into K clusters c1, c2, …, cK.
     • The squared error criterion is defined as: (equation not preserved in this transcript)
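The equation on this slide is missing from the transcript. As an assumption, the standard sum-of-squared-error criterion for partitional clustering, written with the objects x1, …, xN and clusters c1, …, cK named on the slide, is given below. The symbols J, m_k and N_k are notation introduced here, not taken from the slide.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - m_k \rVert^2,
\qquad \text{where} \qquad
m_k = \frac{1}{N_k} \sum_{x_i \in c_k} x_i
```

Here m_k is the mean (prototype) of cluster c_k and N_k is the number of objects assigned to it.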
  7. Formulation of the Problem
     • Goal: find the clustering that minimizes the squared error over all possible clusterings.
     • Characteristics of k-means:
       • It was discovered independently by several researchers across different disciplines.
       • It requires the user to specify the number of clusters, k.
       • In this way, we avoid the problem of determining the number of clusters.
       • It uses a heuristic procedure to search for the best prototypes.
  8. K-means
     The procedure:
     1. Initialize a k-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix M.
     2. Assign each object of the data set to the nearest cluster center (ci).
     3. Recalculate the cluster prototype matrix based on the current partition.
     4. Repeat steps 2 and 3 until there is no change for each cluster.
     Will this lead to the best solution? There is no guarantee. At least, it will lead to a locally optimal solution.
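The four steps of the procedure can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's own code: it assumes Euclidean distance, initial prototypes drawn from the data points (or passed in through the hypothetical `init` parameter), and that an empty cluster simply keeps its previous center.

```python
import random

def squared_error(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initial prototypes, either given or k distinct data points at random.
    centers = [tuple(c) for c in (init if init is not None else rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Step 4: stop when no object changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each prototype as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical toy data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, labels = kmeans(data, k=2)
print(labels, squared_error(data, centers, labels))
```

Different seeds or `init` values may end in different partitions, which is the local-optimum behavior the slide warns about.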
  9. Example of k-means (slides 9 to 12 are figures, not preserved in this transcript)
  13. Conservative k-means algorithm
      • The Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
      • A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost.
      • The smaller the clustering cost of a partition of data points, the better that clustering is.
      • Different methods (e.g., the squared error distortion) can be used to measure this clustering cost.
  14. Greedy k-means algorithm
      Select an arbitrary partition P into k clusters
      while forever
          bestChange ← 0
          for every cluster C
              for every element i not in C
                  if moving i to cluster C reduces its clustering cost
                      if cost(P) – cost(P_{i→C}) > bestChange
                          bestChange ← cost(P) – cost(P_{i→C})
                          i* ← i
                          C* ← C
          if bestChange > 0
              change partition P by moving i* to C*
          else
              return P
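A runnable sketch of this greedy scheme is shown below. It assumes the squared error distortion as the clustering cost and represents the partition P as a list of cluster labels. The names `cost` and `greedy_kmeans` are illustrative, and the cost is recomputed from scratch for every candidate move, which is simple but not efficient.

```python
def cost(points, labels, k):
    """Squared error distortion: squared distance of each point to its cluster mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in members)
    return total

def greedy_kmeans(points, labels, k):
    """Move one point at a time, always the single move that lowers the cost most."""
    labels = list(labels)  # work on a copy of the arbitrary starting partition
    while True:
        current = cost(points, labels, k)
        best_change, best_move = 0.0, None
        for c in range(k):                      # for every cluster C
            for i, old in enumerate(labels):    # for every element i not in C
                if old == c:
                    continue
                labels[i] = c                   # tentatively move i to C
                change = current - cost(points, labels, k)
                labels[i] = old                 # undo the tentative move
                if change > best_change:
                    best_change, best_move = change, (i, c)
        if best_move is None:
            return labels                       # no single move improves the cost
        i, c = best_move
        labels[i] = c

# Hypothetical 1-D example starting from a deliberately poor partition.
pts = [(0.0,), (0.1,), (10.0,), (10.1,)]
print(greedy_kmeans(pts, [0, 1, 0, 1], k=2))
```

Because every accepted move strictly lowers the cost, the loop terminates, though only at a partition that no single move can improve.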
  15. Some Remarks
      Further comments about k-means:
      • There is no efficient and universal method for identifying the initial partitions.
        • Run the algorithm many times with random initial partitions.
      • The iterative approach cannot guarantee convergence to the global optimum.
        • Incorporate techniques such as genetic algorithms (GAs) or simulated annealing (SA) to drive the search toward the global optimum.
      • It is sensitive to outliers and noise.
        • Some approaches, such as ISODATA and PAM, consider the effect of outliers.
      • The definition of “means” restricts the application to continuous variables.
        • New dissimilarity measures are needed to deal with categorical variables.
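To illustrate the sensitivity to outliers, the following sketch compares a cluster mean with a medoid, the kind of representative that PAM (Partitioning Around Medoids) uses in place of the mean. The data and helper names are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean, the prototype used by k-means."""
    return sum(values) / len(values)

def medoid(values):
    """The data point with the smallest total distance to all other points."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical cluster with one outlier
print(mean(cluster))    # 22.0, pulled far toward the outlier
print(medoid(cluster))  # 3.0, stays inside the bulk of the data
```

A single extreme value drags the mean away from the bulk of the data, while the medoid remains an actual, typical member of the cluster.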
  16. APPLICATIONS
  17. Traveling Salesman Problem
      • Up to millions of cities.
      • First organize the cities into clusters.
      • Results for 10k, 100k, and 1M cities (figures not preserved in this transcript).
  18. Bioinformatics – Gene Expression Data
      • Application to:
        • Genome sequencing projects
        • DNA microarray technologies
      • DNA microarray technology:
        • An effective and efficient way to measure the expression levels of thousands of genes simultaneously
        • Investigation of the role of the genes
      • Clustering: reveal hidden structures in biological data.
        • Assumption: functionally similar genes or proteins usually share similar patterns or primary sequence structures.
  19. Bioinformatics – Gene Expression Data (slides 19 and 20 are figures, not preserved in this transcript)
  21. Next Class: Genetic Fuzzy Systems