Lecture18

Published in: Education, Technology
Transcript of "Lecture18"

  1. Introduction to Machine Learning
     Lecture 18: Clustering
     Albert Orriols i Puig
     http://www.albertorriols.net
     aorriols@salle.url.edu
     Artificial Intelligence – Machine Learning
     Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
  2. Recap of Lecture 17
     Clustering
     Hierarchical clustering
  3. Today’s Agenda
     Partitional clustering: K-means
     Applications of clustering
     Using Weka
  4. Partitional Clustering
     Aim: assign a set of objects to K clusters with no hierarchical structure.
     How? First approach: enumerate all partitions and keep the one that minimizes a measure of quality.
     However, this is too expensive as the number of elements increases. E.g., organizing 30 objects into 3 groups allows about 2·10^14 possible assignments (3^30).
     Hence, we need heuristic methods.
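To make the size of the search space on this slide concrete, here is a small sketch that counts the candidates. The function names are illustrative, not from the lecture. It counts both labeled assignments (K^N) and partitions into non-empty unlabeled groups (Stirling numbers of the second kind); the slide does not say which count it intends.

```python
from math import comb, factorial


def labeled_assignments(n, k):
    """Ways to give each of n objects one of k cluster labels."""
    return k ** n


def stirling2(n, k):
    """Ways to split n objects into k non-empty, unlabeled groups
    (Stirling number of the second kind, via inclusion-exclusion)."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


print(labeled_assignments(30, 3))  # 205891132094649, about 2e14
print(stirling2(30, 3))            # 34314651811530, about 3.4e13
```

Either way, exhaustive enumeration is already impractical for 30 objects, which is why the lecture turns to heuristics.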
  5. Defining the Problem
     The problem is: map N objects into K clusters, where each object belongs to exactly one cluster.
     Key factors: the criterion function and the algorithm process.
     We’ll see squared error algorithms.
  6. Squared Error Algorithms
     Definition of squared error
     Assume a collection of objects x1, x2, …, xN.
     We want to organize them in K clusters c1, c2, …, cK.
     The squared error criterion is defined as [equation not captured in the transcript].
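The slide's formula was an image and is missing from the transcript. A common textbook form of the squared error criterion, which may differ in notation from the slide's own, uses the objects x_i and clusters c_k defined above; the symbols J, m_k and |c_k| are assumptions introduced here:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in c_k} \lVert x_i - m_k \rVert^2
\qquad \text{where} \qquad
m_k = \frac{1}{|c_k|} \sum_{x_i \in c_k} x_i
```

Here m_k is the mean (prototype) of cluster c_k and |c_k| is the number of objects assigned to it.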
  7. Formulation of the Problem
     Goal: find the clustering that minimizes the squared error over all possible clusterings.
     Characteristics of k-means:
     It was discovered by several researchers across different disciplines.
     It requires the user to specify the number of clusters, k. In this way, we avoid the problem of determining the number of clusters.
     It uses a heuristic procedure to search for the best prototypes.
  8. K-means
     The procedure:
     1. Initialize a k-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix M.
     2. Assign each object of the data set to the nearest cluster center (ci).
     3. Recalculate the cluster prototype matrix based on the current partition.
     4. Repeat steps 2 and 3 until there is no change for each cluster.
     Will this lead to the best solution? I don’t know. At least, it will lead to a locally optimal solution.
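The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's implementation. It assumes points are tuples of numbers and distances are squared Euclidean. It initializes by picking k data points as prototypes, which is one common choice and differs slightly from the random k-partition the slide mentions. The names `kmeans`, `squared_distance` and `mean_point` are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1 (assumption): use k randomly chosen data points as initial prototypes.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 4: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each prototype as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old prototype if a cluster becomes empty
                centers[j] = mean_point(members)
    return labels, centers


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels, centers)
```

On this toy data the two well-separated groups are recovered, but in general the result depends on the initialization, which is the local-optimum caveat the slide raises.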
  9. Example of k-means (figure only)
  10. Example of k-means (figure only)
  11. Example of k-means (figure only)
  12. Example of k-means (figure only)
  13. Conservative k-means alg.
      The Lloyd algorithm is fast, but in each iteration it moves many data points, not necessarily causing better convergence.
      A more conservative method would be to move one point at a time, and only if it improves the overall clustering cost.
      The smaller the clustering cost of a partition of data points is, the better that clustering is.
      Different methods (e.g., the squared error distortion) can be used to measure this clustering cost.
  14. Greedy k-means alg.
      Select an arbitrary partition P into k clusters
      while forever
          bestChange ← 0
          for every cluster C
              for every element i not in C
                  if moving i to cluster C reduces its clustering cost
                      if cost(P) - cost(P_{i→C}) > bestChange
                          bestChange ← cost(P) - cost(P_{i→C})
                          i* ← i
                          C* ← C
          if bestChange > 0
              Change partition P by moving i* to C*
          else
              return P
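The greedy procedure above can be sketched in Python like this. It assumes the clustering cost is the squared error distortion (sum of squared distances to each cluster mean), which the previous slide names as one possible choice. The function names and the list-of-labels representation of the partition P are illustrative assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_cost(points, labels, k):
    """Squared error distortion of the partition described by labels."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        center = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(squared_distance(p, center) for p in members)
    return total


def greedy_kmeans(points, labels, k):
    labels = list(labels)  # the arbitrary starting partition P
    while True:
        current = clustering_cost(points, labels, k)
        best_change, best_move = 0.0, None
        for c in range(k):                      # for every cluster C
            for i, lab in enumerate(labels):    # for every element i not in C
                if lab == c:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                change = current - clustering_cost(points, trial, k)
                if change > best_change:        # best cost reduction so far
                    best_change, best_move = change, (i, c)
        if best_move is None:
            return labels                       # no single move helps
        i, c = best_move
        labels[i] = c                           # move i* to C*


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(greedy_kmeans(data, [0, 1, 0, 1, 0, 1], k=2))
```

Recomputing the full cost for every candidate move keeps the sketch close to the pseudocode. A practical version would update the cost incrementally.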
  15. Some Remarks
      Further comments about k-means:
      There is no efficient and universal method for identifying the initial partitions. Remedy: run the algorithm many times with random initial partitions.
      The iterative approach cannot guarantee convergence to the global optimum. Remedy: incorporate techniques such as GAs or SA to push the search toward the global optimum.
      It is sensitive to outliers and noise. Some approaches, such as ISODATA and PAM, consider the effect of outliers.
      The definition of “means” restricts the application to continuous variables. Remedy: new dissimilarity measures to deal with categorical variables.
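As a small illustration of the outlier remark, the sketch below compares a cluster's mean with its medoid (the member with the smallest total distance to the others, the kind of prototype PAM uses) on one-dimensional toy data. The data and function names are made up for this example.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """Member of values with the smallest total distance to all members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


clean = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy = [1.0, 2.0, 3.0, 4.0, 100.0]  # last value replaced by an outlier

print(mean(clean), medoid(clean))  # 3.0 3.0
print(mean(noisy), medoid(noisy))  # 22.0 3.0
```

A single outlier drags the mean far from the bulk of the data while the medoid stays put, which is one reason medoid-based methods are less sensitive to noise.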
  16. APPLICATIONS
  17. Traveling Salesman Problem
      Up to millions of cities
      First organize cities in clusters
      Results for 10k cities, 100k cities and 1M cities (figures only)
  18. Bioinformatics – Gene Expression Data
      Application to genome sequencing projects and DNA microarray technologies.
      DNA microarray technology: an effective and efficient way to measure gene expression levels of thousands of genes simultaneously, used to investigate the role of the genes.
      Clustering: reveal hidden structures of biological data.
      Assumption: functionally similar genes or proteins usually share similar patterns or primary sequence structures.
  19. Bioinformatics – Gene Expression Data (figure only)
  20. Bioinformatics – Gene Expression Data (figure only)
  21. Next Class
      Genetic Fuzzy Systems