What is Cluster Analysis?
Cluster analysis is a technique for combining observations into
groups or clusters such that:
• Each group is homogeneous with respect to certain
characteristics (that you specify)
• Each group is different from the other groups with respect to
the same characteristics
• Clustering is an example of an unsupervised learning
technique
Cluster Analysis
In general, it is hard to observe a response (Y) variable.
Applications:
Segmentation - grouping similar customers
Finance - clustering individual stocks
Location analysis - deciding the locations of warehouses
Historic applications of clustering
Example: Beer Data
Suppose I am interested in what influences a consumer’s choice behavior
when she is shopping for beer.
How important does she consider each of these qualities when deciding whether
or not to buy the six pack:
low COST of the six pack,
large SIZE of the bottle (volume),
high percentage of ALCOHOL in the beer,
the REPUTATION of the brand,
the COLOR of the beer,
nice AROMA of the beer,
and good TASTE of the beer.
Can I find similar groups of people based on their answers? If I can, how can I use this
information?
We can use a classification technique (discriminant analysis) to validate the clusters.
Cluster centroids
Attribute Cluster n°1 Cluster n°2 Cluster n°3
COST 82.5 21 38.3
SIZE 86.7 11 32.8
ALCOHOL 81.7 21 37.2
REPUTAT 28.3 49 61.1
COLOR 68.3 73 27.2
AROMA 60.8 69 20.6
TASTE 77.5 94 45.6
Example 2: Automobiles
[Figure: correspondence analysis biplot of 38 car models (Ford Country Squire Wagon, Mercury Grand Marquis, …, Mazda GLC, Fiat Strada) plotted against six attributes: Displacement, Cylinders, Weight, Horsepower, Drive_Ratio, and MPG. One region of the map is labeled “higher price sensitivity”.]
Attribute Cluster n°1 Cluster n°2 Cluster n°3
MPG 17.43 31.02 21.09
Weight 3.91 2.25 3.02
Drive_Ratio 2.42 3.31 3.23
Horsepower 138 77.65 110.92
Displacement 325.5 109.24 175.08
Cylinders 8 4 5.62
Hierarchical vs. Non-Hierarchical Clustering
• Hierarchical clustering does not require a priori knowledge
of the number of clusters
- Agglomerative hierarchical clustering is one of the most
popular hierarchical clustering methods.
• Non-hierarchical clustering requires the number of clusters to be known in advance.
- K-means is one of the most popular non-hierarchical
clustering methods.
Clustering: Hierarchical
Distance-based clustering
Main idea:
Clusters are developed based on the
distance between objects; closer
means more related.
Most used method:
AHC – agglomerative hierarchical
clustering
Distance-based clustering
[Figures: a scatter of data points plotted on Feature X vs. Feature Y. Agglomerative clustering merges the 1st, 2nd, 3rd, and 4th closest pairs in turn; each merge adds a join to the dendrogram, whose height is proportional to the distance between the merged items (height ~ distance).]
Hierarchical Clustering
Say we group points 0 and 1 together and leave the others as they are.
How do we compute the distance between a group that
has two (or more) members and the others?
Hierarchical Clustering Algorithms
Centroid Method
Nearest-Neighbor or Single-Linkage
Farthest-Neighbor or Complete-Linkage
Average-Linkage
Ward’s Method
Hierarchical Clustering
Single Linkage: clustering criterion based on the shortest distance
Complete Linkage: clustering criterion based on the longest distance
Hierarchical Clustering (Contd.)
Average Linkage: clustering criterion based on the average distance
Ward's Method: based on the loss of information resulting from grouping the
objects into clusters (minimizes within-cluster variation)
Hierarchical Clustering (Contd.)
Centroid Method: based on the distance between the group centroids (the point whose
coordinates are the means of all the observations in the cluster)
Example 3: Data
Consumer Income ($ 1000s) Education (years)
1 5 5
2 6 6
3 15 14
4 16 15
5 25 19
6 30 20
Geometrical View of Cluster Analysis
[Figure: scatter plot of the six consumers on Income vs. Education.]
Similarity Measures
Why are consumers 1 and 2 similar?
Distance(1,2) = (5 − 6)² + (5 − 6)² = 2
More generally, if there are p variables, we use the squared Euclidean distance:
Distance(i,j) = Σₖ (xᵢₖ − xⱼₖ)², summed over k = 1, …, p
Similarity Matrix
C1 C2 C3 C4 C5 C6
C1 0 2 181 221 625 850
C2 2 0 145 181 530 772
C3 181 145 0 2 125 261
C4 221 181 2 0 97 221
C5 625 530 125 97 0 26
C6 850 772 261 221 26 0
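As a check, the similarity matrix can be recomputed from the raw Example 3 data with a short Python sketch (a minimal illustration; a few C5/C6 entries in the deck's tables were computed with Education 19 and 20 transposed, so those cells may differ slightly).

```python
# Recompute the pairwise squared-distance ("similarity") matrix from the
# raw Example 3 data (Income in $1000s, Education in years).
data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

matrix = [[sq_dist(a, b) for b in data] for a in data]
for row in matrix:
    print(row)
# The three smallest off-diagonal entries identify the closest pairs:
# (C1,C2) = 2, (C3,C4) = 2, (C5,C6) = 26.
```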
Centroid Method
Each group is replaced by an average consumer
Cluster 1 – average income = 5.5 and average education = 5.5
Data for Five Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3 15 14
3 C4 16 15
4 C5 25 20
5 C6 30 19
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 162.5 0
C4 200.5 2 0
C5 590.5 125 97 0
C6 782.5 261 221 26 0
Data for Four Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3&C4 15.5 14.5
3 C5 25 20
4 C6 30 19
Similarity Matrix
C1&C2 C3&C4 C5 C6
C1&C2 0
C3&C4 181 0
C5 590.5 120.5 0
C6 782.5 230.5 26 0
Data for Three Clusters
Cluster Members Income Education
1 C1&C2 5.5 5.5
2 C3&C4 15.5 14.5
3 C5&C6 27.5 19.5
Similarity Matrix
C1&C2 C3&C4 C5&C6
C1&C2 0
C3&C4 181 0
C5&C6 680 169 0
Dendrogram for the Data
C1 C2 C3 C4 C5 C6
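The merge sequence shown in the dendrogram can be reproduced with a short agglomerative loop using the centroid method (a minimal sketch; ties such as the two distance-2 pairs are broken by first occurrence).

```python
# Reproduce the merge sequence in the dendrogram with the centroid
# method: repeatedly merge the two clusters whose centroids are closest.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

clusters = {"C1": [(5, 5)], "C2": [(6, 6)], "C3": [(15, 14)],
            "C4": [(16, 15)], "C5": [(25, 19)], "C6": [(30, 20)]}

merges = []
while len(clusters) > 1:
    names = list(clusters)
    # pick the pair of clusters whose centroids are closest
    a, b = min(((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
               key=lambda p: sq_dist(centroid(clusters[p[0]]),
                                     centroid(clusters[p[1]])))
    clusters[a + "&" + b] = clusters.pop(a) + clusters.pop(b)
    merges.append(a + "&" + b)

print(merges)
# ['C1&C2', 'C3&C4', 'C5&C6', 'C3&C4&C5&C6', 'C1&C2&C3&C4&C5&C6']
```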
Single Linkage
The first cluster is formed in the same fashion.
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the minimum
of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 145 0
C4 181 2 0
C5 530 125 97 0
C6 772 261 221 26 0
Complete Linkage
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the
maximum of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 181 0
C4 221 2 0
C5 625 125 97 0
C6 850 772 261 221 26 0
Average Linkage
The distance between Cluster 1 (comprising consumers 1 and 2) and consumer 3 is the average
of Distance(1,3) = 181 and Distance(2,3) = 145
Similarity Matrix
C1&C2 C3 C4 C5 C6
C1&C2 0
C3 163 0
C4 201 2 0
C5 578 125 97 0
C6 811 261 221 26 0
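The three linkage rules and the centroid method can be sketched directly, using cluster {C1, C2} versus consumer C3 as in the text (a minimal illustration with squared Euclidean distances).

```python
# Minimal sketch of the four cluster-distance rules, using squared
# Euclidean distances as in the worked example.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def single_linkage(g1, g2):
    # shortest distance over all cross-cluster pairs
    return min(sq_dist(a, b) for a in g1 for b in g2)

def complete_linkage(g1, g2):
    # longest distance over all cross-cluster pairs
    return max(sq_dist(a, b) for a in g1 for b in g2)

def average_linkage(g1, g2):
    # average distance over all cross-cluster pairs
    return sum(sq_dist(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))

def centroid_method(g1, g2):
    # distance between the coordinate-wise mean points of each cluster
    c1 = [sum(col) / len(g1) for col in zip(*g1)]
    c2 = [sum(col) / len(g2) for col in zip(*g2)]
    return sq_dist(c1, c2)

g12, g3 = [(5, 5), (6, 6)], [(15, 14)]   # cluster {C1,C2} vs. consumer C3
print(single_linkage(g12, g3))    # 145
print(complete_linkage(g12, g3))  # 181
print(average_linkage(g12, g3))   # 163.0
print(centroid_method(g12, g3))   # 162.5
```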
Ward’s Method
Does not compute distances between clusters
Forms clusters by maximizing within-cluster homogeneity, i.e., minimizing the error sum of
squares (ESS)
ESS for a cluster with two observations (say, C1 and C2) = (5 − 5.5)² + (6 − 5.5)² + (5 − 5.5)² + (6 − 5.5)² = 1
Ward’s Method
Option CL1 CL2 CL3 CL4 CL5 ESS
1 C1,C2 C3 C4 C5 C6 1
2 C1,C3 C2 C4 C5 C6 90.5
3 C1,C4 C2 C3 C5 C6 110.5
4 C1,C5 C2 C3 C4 C6 312.5
5 C1,C6 C2 C3 C4 C5 410.5
6 C2,C3 C1 C4 C5 C6 72.5
7 C2,C4 C1 C3 C5 C6 90.5
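Ward's criterion for the candidate first merges can be checked directly (a minimal sketch; singleton clusters contribute zero ESS, so the total ESS of each option equals the ESS of its merged pair).

```python
from itertools import combinations

# ESS added by each candidate first merge under Ward's method.
data = {"C1": (5, 5), "C2": (6, 6), "C3": (15, 14),
        "C4": (16, 15), "C5": (25, 19), "C6": (30, 20)}

def ess(points):
    # error sum of squares: squared deviations from the cluster mean
    means = [sum(col) / len(points) for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, means))

for a, b in combinations(data, 2):
    print(f"{a},{b}: ESS = {ess([data[a], data[b]])}")
# Merging C1,C2 gives ESS = 1.0 (tied with C3,C4), so Ward's method merges it first.
```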
Clustering: K-Means
Centroid-based clustering
Main idea:
Minimize the squared distances of
all points in the cluster to the cluster
centroids.
Most used method:
k-means
K-means Algorithm
• Start with K center points (centroids)
• Assign each data point to its closest centroid
• Recompute the centroids of the resulting clusters
• Reassign each data point to its new closest centroid
• Repeat this process until the cluster centroids do not change
or a stopping criterion is met
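The steps above can be sketched in a few lines of Python and run on the Example 3 consumer data with C1, C3, and C5 as the initial centroids (a minimal illustration, not a production implementation).

```python
# Minimal k-means following the steps above, run on the six-consumer
# data from Example 3 with C1, C3, and C5 as the initial centroids.
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return labels, centroids

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]
labels, centers = kmeans(data, [data[0], data[2], data[4]])
print(labels)   # [0, 0, 1, 1, 2, 2]
print(centers)  # [(5.5, 5.5), (15.5, 14.5), (27.5, 19.5)]
```

The final centroids match the "New Cluster Centroids" table in the worked example later in the deck.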
K-Means: Step 0a
Step 0a: Randomly set cluster centroids.
K-Means: Step 0b
Step 0b: Assign all data points to the closest centroid. In our example we’ll use color coding.
K-Means: Step 1
Step 1a: Calculate distances to points. Step 1b: Relocate centroids to minimize point distances.
Step 1b: Relocate centroids to minimize point distances. Step 1c: Reassign nearest points.
K-Means: Step 2
Step 2a: Calculate distances to points. Step 2b: Relocate centroids to minimize point distances.
Step 2b: Relocate centroids to minimize point distances. Step 2c: Reassign nearest points.
K-Means: Iteration logic
Calculate distances to all points → relocate centroids to minimize point distances → reassign nearest points.
K-Means: Step N
After a while the shifting of the centroids stops. At that point we assume we have found the true
centroid locations, and clustering is finished. (N iterations later)
Weaknesses of K-means
• The algorithm is only applicable when the mean is
defined.
– For categorical data, use k-modes: the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away
from the other data points.
– Outliers can be errors in the data recording, or
special data points with very different values.
Outliers
Sensitivity to initial seeds
[Figures: two runs with different randomly selected seeds (centroids), each shown at iteration 1 and iteration 2, converging to different clusterings.]
Dealing with outliers and initial seeds
• For outliers, remove data points that are much further
away from the centroids than the other data points.
– To be safe, we may want to monitor these possible outliers over a few
iterations before deciding to remove them.
• If random initialization is used for the initial
seeds, run the algorithm multiple times and
keep the seeding that minimizes your clustering
error metric.
• Alternatively, carefully choose initial seeds such
that the distances among them are maximized.
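The multiple-runs idea can be sketched deterministically on the Example 3 data: try every choice of three distinct observations as seeds and keep the run with the smallest within-cluster sum of squares (used here as the error metric; a minimal sketch).

```python
from itertools import combinations

# Try every choice of 3 distinct seed points and keep the run with the
# lowest within-cluster sum of squares (inertia).
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # an empty cluster keeps its old centroid
            new.append(tuple(sum(col) / len(members) for col in zip(*members))
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

data = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 19), (30, 20)]
best_labels, best_inertia = min(
    (kmeans(data, list(seeds)) for seeds in combinations(data, 3)),
    key=lambda run: run[1])
print(best_labels, best_inertia)  # best run groups {C1,C2}, {C3,C4}, {C5,C6}
```

Bad seedings (e.g., C1, C2, C3) converge to a much worse inertia, which is exactly why multiple restarts help.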
Special data structures
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
K-means Summary
• Despite weaknesses, k-means is still the most
popular algorithm due to its simplicity and
efficiency
• No clear evidence that any other clustering
algorithm performs better in general
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!
Example 3 Again: Data
Consumer Income ($ 1000s) Education (years)
1 5 5
2 6 6
3 15 14
4 16 15
5 25 19
6 30 20
Geometrical View of Cluster Analysis
[Figure: scatter plot of the six consumers on Income vs. Education.]
Choose C1, C3, and C5 as the initial cluster centroids
Initial Assignment
Consumer  Distance from CL1 (C1)  Distance from CL2 (C3)  Distance from CL3 (C5)  Assigned to CL
C1 0 181 625 1
C2 2 145 557 1
C3 181 0 136 2
C4 221 2 106 2
C5 625 136 0 3
C6 821 250 26 3
New Cluster Centroids
Variable CL1 CL2 CL3
Income 5.5 15.5 27.5
Education 5.5 14.5 19.5
Distance Matrix
Consumer  Distance from CL1  Distance from CL2  Distance from CL3  Previous Assignment  Current Assignment
C1 0.5 200.5 716.5 1 1
C2 0.5 162.5 644.5 1 1
C3 162.5 0.5 186.5 2 2
C4 200.5 0.5 152.5 2 2
C5 590.5 120.5 6.5 3 3
C6 782.5 230.5 6.5 3 3