PART # 01
COURSE INSTRUCTOR:
 DR. FARHEEN QAZI
DEPARTMENT OF SOFTWARE ENGINEERING
SIR SYED UNIVERSITY OF ENGINEERING & TECHNOLOGY
CHAPTER#04
UNSUPERVISED LEARNING & ITS ALGORITHMS
TODAY’S AGENDA
 Unsupervised Learning
 Cluster Analysis
 Clustering Applications
 What is good clustering?
 Types of Clustering
 K-Means Clustering Basic Algorithm
 Advantages
 Disadvantages
 Summary
UNSUPERVISED LEARNING
 Unsupervised learning does not need to be trained with desired outcome data.
 Suppose the machine is given an image containing both dogs and cats,
which it has never seen before.
UNSUPERVISED LEARNING
 Thus the machine has no idea about the features of dogs and
cats, so it cannot label them as "dog" or "cat".
 But it can still group them according to their similarities,
patterns, and differences, i.e., the pictures can easily be separated
into two parts.
 The first part may contain all pictures having dogs in them, and the
second part all pictures having cats in them.
 Nothing was learned beforehand: there is no training data and there
are no labeled examples.
CLUSTER ANALYSIS
 Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
CLUSTER ANALYSIS
 Cluster: a collection of data objects
o Similar to one another within the same cluster
o Dissimilar to the objects in other clusters
 Cluster analysis
o Finding similarities between data according to the characteristics found
in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms
NOTION OF A CLUSTER CAN BE AMBIGUOUS
QUALITY: WHAT IS GOOD CLUSTERING?
 A good clustering method will produce high quality clusters
with
o high intra-class similarity
o low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
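Intra- and inter-cluster similarity can be summarized in a single number such as the silhouette coefficient (+1 for dense, well-separated clusters, near 0 for overlapping ones). A minimal sketch, assuming scikit-learn is available (the metric and library are illustrative additions, not part of the slides) and reusing the dataset of Example#01 later in this chapter:

```python
# Scoring clustering quality with the silhouette coefficient.
# scikit-learn and the choice of metric are assumptions for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # closer to 1 = better separation
```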
TYPES OF CLUSTERING
 A clustering is a set of clusters
o Important distinction between hierarchical and partitional
sets of clusters
 Partitional Clustering
o A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
 Hierarchical clustering
o A set of nested clusters organized as a hierarchical tree
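To make the distinction concrete, here is a minimal sketch of each kind, assuming NumPy, SciPy, and scikit-learn are available (the slides name only the concepts, not these libraries), again on the Example#01 data:

```python
# Partitional vs hierarchical clustering on the same toy data.
# scikit-learn / SciPy are assumptions; the slides describe only the concepts.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)

# Partitional: a flat division; every point is in exactly one of K subsets.
flat_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: a nested tree of clusters; cutting the tree yields 2 clusters.
tree = linkage(X, method="average")  # bottom-up agglomerative merge tree
nested_labels = fcluster(tree, t=2, criterion="maxclust")

print(flat_labels)    # e.g. [0 1 1 0 0 0]
print(nested_labels)  # e.g. [1 2 2 1 1 1] (fcluster labels start at 1)
```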
PARTITIONAL VS HIERARCHICAL
INTRODUCTION TO K-MEANS CLUSTERING
 K-means clustering is a type of unsupervised learning, which is
used when you have unlabeled data (i.e., data without defined
categories or groups).
 The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K.
 The algorithm works iteratively to assign each data point to
one of K groups based on the features that are provided. Data
points are clustered based on feature similarity.
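Before walking through the steps by hand, here is how the algorithm is typically invoked from a library; a minimal sketch, assuming scikit-learn (not prescribed by the slides) and the toy data of Example#01 below:

```python
# Minimal k-means usage sketch (scikit-learn is an assumption).
# K is chosen by the user; the algorithm finds K groups by feature similarity.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # final centroid of each cluster
```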
K-MEANS CLUSTERING BASIC ALGORITHM
Basic Algorithm:
 Step 1: select K
 Step 2: randomly select initial cluster seeds
[Figure: two randomly selected seeds, Seed 1 = 650 and Seed 2 = 200]
CONTD….
 An initial cluster seed represents the “mean value” of its
cluster.
 In the preceding figure:
o Cluster seed 1 = 650
o Cluster seed 2 = 200
 Step 3: calculate the distance from each object to each cluster
seed.
 What type of distance should we use?
o Squared Euclidean distance (the worked example below uses plain
Euclidean distance; both give the same closest-cluster assignments)
 Step 4: Assign each object to the closest cluster
CONTD….
[Figure: objects assigned to the closest seed (Seed 1 or Seed 2)]
CONTD….
 Step 5: Compute the new centroid for each cluster
o Cluster Seed 1 (new centroid) = 708.9
o Cluster Seed 2 (new centroid) = 214.2
CONTD….
 Step#06: Iterate & Stop
o Calculate distance from objects to cluster centroids.
o Assign objects to closest cluster
o Recalculate new centroids
 Stop based on convergence criteria
o No change in clusters
o Max iterations
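These six steps translate directly into a short loop. A from-scratch sketch in Python (NumPy assumed; the figure's 1D data points are not given, so the sketch reuses the 2D dataset of Example#01 below):

```python
# From-scratch k-means following Steps 1-6 above (NumPy assumed).
# Data reuses Example#01 from this chapter; the 1D figure's points are elided.
import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)

K = 2                                                # Step 1: select K
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), K, replace=False)]  # Step 2: random seeds

for _ in range(100):  # Step 6: iterate, capped by a max-iteration criterion
    # Step 3: distance from every object to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)        # Step 4: assign to closest cluster
    # Step 5: recompute each centroid as the mean of its members
    # (a production version would also guard against empty clusters)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):  # stop: no change in clusters
        break
    centroids = new_centroids

print(labels, centroids)
```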
FLOWCHART
EXAMPLE#01
 Calculate K-Means clustering for the following dataset with two
clusters. Tabulate all the assignments.

Sample No.   X     Y
1            185   72
2            170   56
3            168   60
4            179   68
5            182   72
6            188   77
CONTD….
 Step#01: select K
K = 2
 Step#02: randomly select initial cluster seeds

Initial Centroid   X     Y
C1                 185   72
C2                 170   56
CONTD….
 Step#03: calculate the distance from each object to each cluster
seed using the Euclidean distance
 d = √[(X1 − X2)² + (Y1 − Y2)²]
where (X1, Y1) is the object and (X2, Y2) is the cluster seed

Sample No.   X1    Y1    C1 (185, 72)   C2 (170, 56)
1            185   72    0              21.93
2            170   56    21.93          0
3            168   60    20.81          4.472
4            179   68    7.211          15
5            182   72    3              20
6            188   77    5.83           27.66
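The two distance columns can be reproduced directly from this formula; a quick check (NumPy assumed):

```python
# Reproduce the C1/C2 distance columns of the table above (NumPy assumed).
import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
c1, c2 = np.array([185.0, 72.0]), np.array([170.0, 56.0])

d1 = np.linalg.norm(X - c1, axis=1)  # 0, 21.93, 20.81, 7.211, 3, 5.83
d2 = np.linalg.norm(X - c2, axis=1)  # 21.93, 0, 4.472, 15, 20, 27.66
print(np.round(d1, 3), np.round(d2, 3))
```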
CONTD….
 Step#04: Assign each object to the closest cluster

Sample No.   X1    Y1    C1 (185, 72)   C2 (170, 56)   Assignment
1            185   72    0              21.93          C1
2            170   56    21.93          0              C2
3            168   60    20.81          4.472          C2
4            179   68    7.211          15             C1
5            182   72    3              20             C1
6            188   77    5.83           27.66          C1
CONTD….
 Step#05: Compute the new centroid for each cluster by taking the
mean of the points assigned to it (C1 & C2)
 C1 = (185 , 72) , (179 , 68) , (182 , 72) , (188 , 77)
Mean(X,Y) = [(185+179+182+188)/4 , (72+68+72+77)/4]
C1(new) = (183.5 , 72.25)
 C2 = (170 , 56) , (168 , 60)
Mean(X,Y) = [(170+168)/2 , (56+60)/2]
C2(new) = (169 , 58)
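The same update in code (NumPy assumed):

```python
# Step#05 in code: each new centroid is the mean of its cluster's points.
import numpy as np

c1_members = np.array([[185, 72], [179, 68], [182, 72], [188, 77]], dtype=float)
c2_members = np.array([[170, 56], [168, 60]], dtype=float)

print(c1_members.mean(axis=0))  # [183.5, 72.25]
print(c2_members.mean(axis=0))  # [169.0, 58.0]
```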
CONTD….
 Step#06: Iterate with the new centroids & stop based on the
convergence criterion (no change in clusters).
 Calculate the Euclidean distances

Sample No.   X1    Y1    C1 (183.5, 72.25)   C2 (169, 58)
1            185   72    1.521               21.26
2            170   56    21.13               2.24
3            168   60    19.76               2.24
4            179   68    6.2                 14.142
5            182   72    1.521               19.105
6            188   77    6.54                26.9
CONTD….
 Assign each object to the closest cluster and stop at this point,
because the same clusters are assigned as in the previous iteration

Sample No.   X1    Y1    C1 (183.5, 72.25)   C2 (169, 58)   Assignment
1            185   72    1.521               21.26          C1
2            170   56    21.13               2.24           C2
3            168   60    19.76               2.24           C2
4            179   68    6.2                 14.142         C1
5            182   72    1.521               19.105         C1
6            188   77    6.54                26.9           C1
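Running the loop from the "Basic Algorithm" sketch with this example's initial seeds (samples 1 and 2) reproduces the result and stops after the second pass; a compact verification (NumPy assumed):

```python
# Verify Example#01: same assignments on the second pass => converged.
import numpy as np

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = X[[0, 1]].copy()  # initial seeds: samples 1 and 2

for step in range(10):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):  # no change: stop
        break
    centroids = new_centroids

print(labels)     # [0 1 1 0 0 0]  -> C1: samples 1, 4, 5, 6; C2: samples 2, 3
print(centroids)  # [[183.5, 72.25], [169.0, 58.0]]
```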
CLUSTERING APPLICATIONS
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
ADVANTAGES
 It is fast
 Easy to understand
 Comparatively efficient
 Gives the best results when the data sets are distinct (well separated)
 Produces tighter clusters
 Clusters adapt as centroids are recomputed
 Flexible
 Easy to interpret
 Low computational cost
 Can enhance accuracy, e.g., when used as a preprocessing step
DISADVANTAGES
 The algorithm is only applicable if the mean is defined.
o For categorical data, use k-modes, where the centroid is
represented by the most frequent value of each feature (see the
sketch after this list).
 The user needs to specify k.
 The algorithm is sensitive to outliers
o Outliers are data points that are very far away from the other
data points.
o Outliers can be errors in the data recording or special data
points with very different values
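For the categorical case noted above, the k-modes idea replaces the mean with the per-feature mode. A tiny illustration in plain Python, with hypothetical category values (the data and helper are assumptions for illustration):

```python
# k-modes idea: the "centroid" of a categorical cluster is the most
# frequent value of each feature. Hypothetical category data.
from collections import Counter

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]

mode_centroid = tuple(
    Counter(values).most_common(1)[0][0]  # most frequent value per feature
    for values in zip(*cluster)           # iterate feature-wise
)
print(mode_centroid)  # ('red', 'small')
```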
SUMMARY
 Clustering has a long history and is still an active area
o There are a huge number of clustering algorithms
o More are still coming every year.
 We only introduced a few main algorithms. There are many
others, e.g.,
o density-based algorithms, sub-space clustering, scale-up
methods, neural-network-based methods, fuzzy clustering,
co-clustering, etc.
 Clustering is hard to evaluate, but very useful in practice. This
partially explains why a large number of new clustering
algorithms are still being devised every year.
 Clustering is highly application dependent.