What is Cluster Analysis?
Cluster analysis is a technique for combining observations into
groups or clusters such that:
• Each group is homogeneous with respect to certain
characteristics (that you specify)
• Each group is different from the other groups with respect to
the same characteristics
• Clustering is an example of an unsupervised learning
technique
Cluster Analysis
In general, it is hard to observe a response (Y) variable.
Applications:
Segmentation - grouping similar customers
Finance - clustering individual stocks
Location analysis - deciding the locations of warehouses
Historic applications of clustering
Example: Beer Data
Suppose I am interested in what influences a consumer’s choice behavior
when she is shopping for beer.
How important does she consider each of these qualities when deciding whether
or not to buy the six pack:
low COST of the six pack,
large SIZE of the bottle (volume),
high percentage of ALCOHOL in the beer,
the REPUTATION of the brand,
the COLOR of the beer,
nice AROMA of the beer,
and good TASTE of the beer.
Can I find similar groups of people based on their answers? If I can, how can I use this
information?
We can use a classification technique (discriminant analysis) to validate the clusters.
Cluster centroids
Attribute Cluster n°1 Cluster n°2 Cluster n°3
COST 82.5 21 38.3
SIZE 86.7 11 32.8
ALCOHOL 81.7 21 37.2
REPUTAT 28.3 49 61.1
COLOR 68.3 73 27.2
AROMA 60.8 69 20.6
TASTE 77.5 94 45.6
Example 2: Automobiles
[Figure: correspondence analysis biplot of 38 car models (Ford Country Squire Wagon, Mercury Grand Marquis, …, Mazda GLC, Fiat Strada) plotted against six attributes: Displacement, Cylinders, Weight, Horsepower, Drive_Ratio, and MPG. One region of the map is labeled “higher price sensitivity”.]
Attribute Cluster n°1 Cluster n°2 Cluster n°3
MPG 17.43 31.02 21.09
Weight 3.91 2.25 3.02
Drive_Ratio 2.42 3.31 3.23
Horsepower 138 77.65 110.92
Displacement 325.5 109.24 175.08
Cylinders 8 4 5.62
Hierarchical vs. Non-Hierarchical Clustering
• Hierarchical clustering does not require a priori knowledge
of the number of clusters
- Agglomerative hierarchical clustering is one of the most
popular hierarchical clustering methods.
• Non-hierarchical clustering requires the number of clusters to be known in advance.
- K-means is one of the most popular non-hierarchical
clustering methods.
Clustering: Hierarchical
Distance-based clustering
Main idea:
Clusters are developed based on the
distance between objects; closer
means more related.
Most used method:
AHC – agglomerative hierarchical
clustering
Distance-based clustering
[Figures: a scatter of data points plotted on Feature X vs. Feature Y. Agglomerative clustering merges the 1st, 2nd, 3rd, and 4th closest pairs in turn; each merge adds a join to the dendrogram, whose height is proportional to the distance between the merged items (height ~ distance).]
Hierarchical Clustering
Say we group points 0 and 1 together and leave the others as they are.
How do we compute the distance between a group that
has two (or more) members and the others?
Hierarchical Clustering Algorithms
Centroid Method
Nearest-Neighbor or Single-Linkage
Farthest-Neighbor or Complete-Linkage
Average-Linkage
Ward’s Method
Hierarchical Clustering
Single Linkage: clustering criterion based on the shortest distance
Complete Linkage: clustering criterion based on the longest distance
Hierarchical Clustering (Contd.)
Average Linkage: clustering criterion based on the average distance
Ward's Method: based on the loss of information resulting from grouping the
objects into clusters (minimizes within-cluster variation)
Hierarchical Clustering (Contd.)
Centroid Method: based on the distance between the group centroids (the point whose
coordinates are the means of all the observations in the cluster)
Example 3: Data
Consumer Income ($ 1000s) Education (years)
1 5 5
2 6 6
3 15 14
4 16 15
5 25 19
6 30 20
Geometrical View of Cluster Analysis
[Figure: scatter plot of the six consumers on Income vs. Education.]
Similarity Measures
Why are consumers 1 and 2 similar?
Distance(1,2) = (5 − 6)² + (5 − 6)² = 2
More generally, if there are p variables, we use the squared Euclidean distance:
Distance(i,j) = Σₖ (xᵢₖ − xⱼₖ)², summed over k = 1, …, p
Similarity Matrix
C1 C2 C3 C4 C5 C6
C1 0 2 181 221 625 850
C2 2 0 145 181 530 772
C3 181 145 0 2 125 261
C4 221 181 2 0 97 221
C5 625 530 125 97 0 26
C6 850 772 261 221 26 0
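As a check, the similarity matrix can be recomputed from the raw Example 3 data with a short Python sketch (a minimal illustration; a few C5/C6 entries in the deck's tables were computed with Education 19 and 20 transposed, so those cells may differ slightly).

```python
# Recompute the pairwise squared-distance ("similarity") matrix from the
# raw Example 3 data (Income in $1000s, Education in years).
data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

matrix = [[sq_dist(a, b) for b in data] for a in data]
for row in matrix:
    print(row)
# The three smallest off-diagonal entries identify the closest pairs:
# (C1,C2) = 2, (C3,C4) = 2, (C5,C6) = 26.
```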
Centroid Method
Each group is replaced by an average consumer
Cluster 1 – average income = 5.5 and average education = 5.5
Data for Five Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3 15 14
3 C4 16 15
4 C5 25 20
5 C6 30 19
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 162.5 0
C4 200.5 2 0
C5 590.5 125 97 0
C6 782.5 261 221 26 0
Data for Four Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3&C4 15.5 14.5
3 C5 25 20
4 C6 30 19
Similarity Matrix
C1&C2 C3&C4 C5 C6
C1&C2 0
C3&C4 181 0
C5 590.5 120.5 0
C6 782.5 230.5 26 0
Data for Three Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3&C4 15.5 14.5
3 C5&C6 27.5 19.5
Similarity Matrix
C1&C2 C3&C4 C5&C6
C1&C2 0
C3&C4 181 0
C5&C6 680 169 0
Dendrogram for the Data
C1 C2 C3 C4 C5 C6
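The merge sequence shown in the dendrogram can be reproduced with a short agglomerative loop using the centroid method (a minimal sketch; ties such as the two distance-2 pairs are broken by first occurrence).

```python
# Reproduce the merge sequence in the dendrogram with the centroid
# method: repeatedly merge the two clusters whose centroids are closest.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

clusters = {"C1": [(5, 5)], "C2": [(6, 6)], "C3": [(15, 14)],
            "C4": [(16, 15)], "C5": [(25, 19)], "C6": [(30, 20)]}

merges = []
while len(clusters) > 1:
    names = list(clusters)
    # pick the pair of clusters whose centroids are closest
    a, b = min(((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
               key=lambda p: sq_dist(centroid(clusters[p[0]]),
                                     centroid(clusters[p[1]])))
    clusters[a + "&" + b] = clusters.pop(a) + clusters.pop(b)
    merges.append(a + "&" + b)

print(merges)
# ['C1&C2', 'C3&C4', 'C5&C6', 'C3&C4&C5&C6', 'C1&C2&C3&C4&C5&C6']
```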
Single Linkage
The first cluster is formed in the same fashion.
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the minimum
of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 145 0
C4 181 2 0
C5 530 125 97 0
C6 772 261 221 26 0
Complete Linkage
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the
maximum of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 181 0
C4 221 2 0
C5 625 125 97 0
C6 850 772 261 221 26 0
Average Linkage
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the average
of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 163 0
C4 201 2 0
C5 578 125 97 0
C6 811 261 221 26 0
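The three linkage rules and the centroid method can be sketched directly, using cluster {C1, C2} versus consumer C3 as in the text (a minimal illustration with squared Euclidean distances).

```python
# Minimal sketch of the four cluster-distance rules, using squared
# Euclidean distances as in the worked example.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(g1, g2):
    # shortest distance over all cross-cluster pairs
    return min(sq_dist(a, b) for a in g1 for b in g2)

def complete_linkage(g1, g2):
    # longest distance over all cross-cluster pairs
    return max(sq_dist(a, b) for a in g1 for b in g2)

def average_linkage(g1, g2):
    # average distance over all cross-cluster pairs
    return sum(sq_dist(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

def centroid_method(g1, g2):
    # distance between the coordinate-wise mean points of each cluster
    c1 = [sum(col) / len(g1) for col in zip(*g1)]
    c2 = [sum(col) / len(g2) for col in zip(*g2)]
    return sq_dist(c1, c2)

g12, g3 = [(5, 5), (6, 6)], [(15, 14)]   # cluster {C1,C2} vs. consumer C3
print(single_linkage(g12, g3))    # 145
print(complete_linkage(g12, g3))  # 181
print(average_linkage(g12, g3))   # 163.0
print(centroid_method(g12, g3))   # 162.5
```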
Ward’s Method
Does not compute distances between clusters
Forms clusters by maximizing within-cluster homogeneity, i.e., minimizing the error sum of
squares (ESS)
ESS for a cluster with two observations (say, C1 and C2) = (5 − 5.5)² + (6 − 5.5)² + (5 − 5.5)² + (6 − 5.5)² = 1
Ward’s Method
Option CL1 CL2 CL3 CL4 CL5 ESS
1 C1,C2 C3 C4 C5 C6 1
2 C1,C3 C2 C4 C5 C6 90.5
3 C1,C4 C2 C3 C5 C6 110.5
4 C1,C5 C2 C3 C4 C6 312.5
5 C1,C6 C2 C3 C4 C5 410.5
6 C2,C3 C1 C4 C5 C6 72.5
7 C2,C4 C1 C3 C5 C6 90.5
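Ward's criterion for the candidate first merges can be checked directly (a minimal sketch; singleton clusters contribute zero ESS, so the total ESS of each option equals the ESS of its merged pair).

```python
from itertools import combinations

# ESS added by each candidate first merge under Ward's method.
data = {"C1": (5, 5), "C2": (6, 6), "C3": (15, 14),
        "C4": (16, 15), "C5": (25, 19), "C6": (30, 20)}

def ess(points):
    # error sum of squares: squared deviations from the cluster mean
    means = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, means))

for a, b in combinations(data, 2):
    print(f"{a},{b}: ESS = {ess([data[a], data[b]])}")
# Merging C1,C2 gives ESS = 1.0 (tied with C3,C4), so Ward's method merges it first.
```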
Clustering: K-Means
Centroid-based clustering
Main idea:
Minimize the squared distances of
all points in the cluster to the cluster
centroids.
Most used method:
k-means
K-means Algorithm
• Start with K center points (centroids)
• Assign each data point to its closest centroid
• Recompute the centroids of the resulting clusters
• Reassign each data point to its new closest centroid
• Repeat this process until the cluster centroids do not change
or a stopping criterion is met
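The steps above can be sketched in a few lines of Python and run on the Example 3 consumer data with C1, C3, and C5 as the initial centroids (a minimal illustration, not a production implementation).

```python
# Minimal k-means following the steps above, run on the six-consumer
# data from Example 3 with C1, C3, and C5 as the initial centroids.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]
labels, centers = kmeans(data, [data[0], data[2], data[4]])
print(labels)   # [0, 0, 1, 1, 2, 2]
print(centers)  # [(5.5, 5.5), (15.5, 14.5), (27.5, 19.5)]
```

The final centroids match the "New Cluster Centroids" table in the worked example later in the deck.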
K-Means: Step 0a
Step 0a: Randomly set cluster centroids.
K-Means: Step 0b
Step 0b: Assign all data points to the closest centroid. In our example we’ll use color coding.
K-Means: Step 1
Step 1a: Calculate distances to points. Step 1b: Relocate centroids to minimize point distances.
Step 1b: Relocate centroids to minimize point distances. Step 1c: Reassign nearest points.
K-Means: Step 2
Step 2a: Calculate distances to points. Step 2b: Relocate centroids to minimize point distances.
Step 2b: Relocate centroids to minimize point distances. Step 2c: Reassign nearest points.
K-Means: Iteration logic
Calculate distances to all points → relocate centroids to minimize point distances → reassign nearest points.
K-Means: Step N
After a while the shifting of the centroids stops. At that point we assume we have found the true
centroid locations, and clustering is finished. (N iterations later)
Weaknesses of K-means
• The algorithm is only applicable when the mean is
defined.
– For categorical data, use k-modes: the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away
from the other data points.
– Outliers can be errors in the data recording, or
special data points with very different values.
Outliers
Sensitivity to initial seeds
[Figures: two runs with different randomly selected seeds (centroids), each shown at iteration 1 and iteration 2, converging to different clusterings.]
Dealing with outliers and initial seeds
• For outliers, remove data points that are much further
away from the centroids than the other data points.
– To be safe, we may want to monitor these possible outliers over a few
iterations before deciding to remove them.
• If random initialization is used for the initial
seeds, run the algorithm multiple times and
keep the seeding that minimizes your clustering
error metric.
• Alternatively, carefully choose initial seeds such
that the distances among them are maximized.
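The multiple-runs idea can be sketched deterministically on the Example 3 data: try every choice of three distinct observations as seeds and keep the run with the smallest within-cluster sum of squares (used here as the error metric; a minimal sketch).

```python
from itertools import combinations

# Try every choice of 3 distinct seed points and keep the run with the
# lowest within-cluster sum of squares (inertia).
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # an empty cluster keeps its old centroid
            new.append(tuple(sum(col) / len(members) for col in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]
best_labels, best_inertia = min(
    (kmeans(data, list(seeds)) for seeds in combinations(data, 3)),
    key=lambda run: run[1])
print(best_labels, best_inertia)  # best run groups {C1,C2}, {C3,C4}, {C5,C6}
```

Bad seedings (e.g., C1, C2, C3) converge to a much worse inertia, which is exactly why multiple restarts help.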
Special data structures
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
K-means Summary
• Despite weaknesses, k-means is still the most
popular algorithm due to its simplicity and
efficiency
• No clear evidence that any other clustering
algorithm performs better in general
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!
Example 3 Again: Data
Consumer Income ($ 1000s) Education (years)
1 5 5
2 6 6
3 15 14
4 16 15
5 25 19
6 30 20
Geometrical View of Cluster Analysis
[Figure: scatter plot of the six consumers on Income vs. Education.]
Choose C1, C3, and C5 as the initial cluster centroids
Initial Assignment
Consumer  Distance from CL1 (C1)  Distance from CL2 (C3)  Distance from CL3 (C5)  Assigned to CL
C1 0 181 625 1
C2 2 145 557 1
C3 181 0 136 2
C4 221 2 106 2
C5 625 136 0 3
C6 821 250 26 3
New Cluster Centroids
Variable CL1 CL2 CL3
Income 5.5 15.5 27.5
Education 5.5 14.5 19.5
Distance Matrix
Consumer  Distance from CL1  Distance from CL2  Distance from CL3  Previous Assignment  Current Assignment
C1 0.5 200.5 716.5 1 1
C2 0.5 162.5 644.5 1 1
C3 162.5 0.5 186.5 2 2
C4 200.5 0.5 152.5 2 2
C5 590.5 120.5 6.5 3 3
C6 782.5 230.5 6.5 3 3