Clustering
Unsupervised learning: introduction
Machine Learning
Supervised learning
Training set:
In supervised learning, the output or label (y) of each data point is given, and classification is performed based on it.
Unsupervised learning (Clustering)
Training set:
In clustering, the grouping is based on the x's alone; there is no output or label.
Each x^(i) consists of many features, such as x1, x2, x3, ...
Clustering vs. Classification

Aspect           | Clustering                                         | Classification
Goal             | Group similar data points into clusters or groups  | Assign data points to predefined classes or labels
Supervision      | Unsupervised learning                              | Supervised learning
Data labels      | No predefined labels for clusters                  | Predefined class labels for each data point
Output           | Clusters or groups of data points                  | Class labels for each data point
Example usage    | Customer segmentation, anomaly detection           | Spam email detection, image classification
Evaluation       | Silhouette score, Davies-Bouldin index             | Accuracy, precision, recall, F1-score
Examples         | K-Means, DBSCAN, hierarchical clustering           | Logistic regression, decision trees, support vector machines
Distance metrics | Used to measure similarity between data points     | Not always required, but may be used for similarity measurement
Training data    | Unlabeled data                                     | Labeled data
Applications     | Grouping similar items, exploratory data analysis  | Predictive modeling, pattern recognition
Applications of clustering
• Market segmentation
• Social network analysis
• Organizing computing clusters
• Astronomical data analysis
(Image credit: NASA/JPL-Caltech/E. Churchwell, Univ. of Wisconsin, Madison)
Types of clustering

Hard Clustering
• Each data point belongs to exactly one cluster.
• Data points are assigned exclusively to a single cluster.
• Commonly used in algorithms like K-Means.

Soft Clustering (Fuzzy Clustering)
• Data points can belong to multiple clusters.
• Membership degrees vary, indicating the strength of belonging to each cluster.
• Used in algorithms like Gaussian Mixture Models (GMM).
Soft clustering, or fuzzy clustering, is important for handling uncertainty, complex data
structures, and nuanced relationships by allowing data points to belong to multiple
clusters with varying degrees of membership. It provides a more robust and flexible
approach compared to hard clustering when dealing with real-world data.
Example of soft clustering: customer segmentation
• Customers can belong to multiple segments
simultaneously, indicating varying degrees of affinity to
different product categories or shopping behaviors.
• A customer might be 70% associated with the
"Frequent Shoppers" segment, 40% with the "Discount
Seekers" segment, and 20% with the "Occasional
Shoppers" segment, showing the nuanced nature of
their shopping habits.
• This information enables the business to personalize
marketing strategies, recommending products or
discounts tailored to each customer's multiple
preferences, ultimately improving customer
engagement and sales.
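
To make this concrete, here is a minimal sketch of soft clustering with scikit-learn's GaussianMixture. The two-feature customer data is synthetic and purely illustrative; predict_proba returns each point's degree of membership in every cluster, the soft analogue of a hard label.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic customer features: [visits per month, fraction of purchases on discount]
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([8, 0.05], 0.5, size=(50, 2)),   # roughly "frequent shoppers"
    rng.normal([2, 0.40], 0.5, size=(50, 2)),   # roughly "discount seekers"
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: one membership probability per cluster, each row sums to 1
memberships = gmm.predict_proba(X)
print(memberships[:3].round(2))
```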
Most popular Clustering Algorithms
 K-Means: a clustering algorithm that partitions data into K distinct, non-overlapping clusters based on similarity, with the goal of minimizing the within-cluster variance.
 Hierarchical Clustering: a method of cluster analysis that builds a hierarchy of clusters by successively merging or splitting them based on data similarity, resulting in a tree-like structure called a dendrogram.
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise): a density-based clustering algorithm that identifies clusters by considering regions of high data-point density, effectively capturing clusters of varying shapes and sizes while also identifying noise points.
 Gaussian Mixture Models (GMM): represent data as a combination of multiple Gaussian distributions, enabling flexible modeling of complex data distributions.
 Agglomerative Clustering: a hierarchical clustering method that starts with each data point as its own cluster and iteratively merges the closest clusters until a hierarchy of clusters is formed, often visualized as a dendrogram.
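
For orientation, this is roughly how the algorithms above are invoked in scikit-learn (a sketch; the parameter values are arbitrary placeholders, not tuned recommendations):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 2)  # placeholder data

labels_km  = KMeans(n_clusters=3, n_init=10).fit_predict(X)        # hard, centroid-based
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # hierarchical, bottom-up
labels_db  = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)         # density-based; -1 marks noise
labels_gmm = GaussianMixture(n_components=3).fit(X).predict(X)     # soft model, hard labels via argmax
```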
Distance Metrics
What Are Distance Metrics?
• Distance metrics are measures used to quantify the similarity or dissimilarity
between data points in a clustering algorithm.
• They play a crucial role in determining how clusters are formed.
Common Distance Metrics:
• Euclidean Distance: Measures the straight-line distance between two points in
Euclidean space. Suitable for data with continuous features.
• Manhattan Distance: Calculates the sum of absolute differences between
corresponding elements of two data points. Applicable to data with grid-like
structures.
• Cosine Similarity: Measures the cosine of the angle between two vectors. Useful
for text data, document clustering, and cases where the magnitude of the data is
not as important as the direction.
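
As a quick illustration, all three metrics are a line each in NumPy (a sketch; scipy.spatial.distance offers the same via euclidean, cityblock, and cosine, where cosine is a distance, i.e. 1 minus the similarity):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only, ignores magnitude

print(euclidean, manhattan, cosine_sim)
```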
Clustering
K-means algorithm
K-means clustering
Step 1: Start with data points randomly distributed. (The data here has two features, but it can have more.)

Step 2: Decide the number of clusters and randomly pick the cluster centroids.

Step 3: Compute the distance from each data point to every centroid, then cluster on the minimum distance: each data point joins the cluster of its closest centroid.

Step 4: Recalculate the new centroids of the resulting clusters by taking the mean of each feature over the cluster's data points.
K-means algorithm

Input:
- K (number of clusters)
- Training set {x^(1), x^(2), ..., x^(m)}, where x^(i) ∈ R^n (drop the x_0 = 1 convention)

Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K ∈ R^n
Repeat {
    for i = 1 to m:
        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    for k = 1 to K:
        μ_k := average (mean) of the points assigned to cluster k
}
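
The pseudocode above maps almost line for line onto NumPy. A minimal from-scratch sketch (no empty-cluster handling and a fixed iteration count, both simplifications kept for brevity):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: X is (m, n); returns (assignments c, centroids mu)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centroids as K distinct training examples
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: c[i] = index of the centroid closest to x^(i)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Update step: mu_k = mean of the points assigned to cluster k
        mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return c, mu
```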
K-means for non-separated clusters
[Plot: T-shirt sizing, Weight vs. Height. K-means can still partition the data into useful groups even when the clusters are not well separated.]
Weight
Clustering
Optimization objective
K-means optimization objective
c^(i) = index of the cluster (1, 2, ..., K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ R^n)
μ_c(i) = cluster centroid of the cluster to which example x^(i) has been assigned

Optimization objective:
J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σᵢ ||x^(i) − μ_c(i)||²
minimized over c^(1), ..., c^(m) and μ_1, ..., μ_K.
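
With those definitions, the distortion is a one-line computation. This hypothetical helper reuses the c and mu arrays produced by the kmeans sketch above:

```python
import numpy as np

def distortion(X, c, mu):
    # J = (1/m) * Σ ||x^(i) − μ_c(i)||²; mu[c] picks each point's assigned centroid
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))
```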
Each K-means iteration can be read as coordinate descent on J: the cluster-assignment step minimizes J with respect to c^(1), ..., c^(m) while holding the centroids fixed, and the centroid-update step minimizes J with respect to μ_1, ..., μ_K while holding the assignments fixed. J therefore never increases between iterations.
Clustering
Random initialization
Random initialization
Should have K < m.
Randomly pick K training examples.
Set μ_1, ..., μ_K equal to these K examples.
Local optima
Depending on the initialization, K-means can get stuck in a local optimum of J. Mitigate this by running it many times and keeping the best result:

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c^(1), ..., c^(m), μ_1, ..., μ_K.
    Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K).
}
Pick the clustering that gave the lowest cost J.
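
As a sketch, the restart loop looks like this in Python, assuming X is the (m, n) data array and reusing the hypothetical kmeans and distortion helpers defined earlier:

```python
import numpy as np

best_J, best = np.inf, None
for seed in range(100):
    c, mu = kmeans(X, K=3, seed=seed)   # fresh random initialization each run
    J = distortion(X, c, mu)            # cost (distortion) of this run
    if J < best_J:
        best_J, best = J, (c, mu)       # keep the clustering with the lowest cost
```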
Clustering
Choosing the number of clusters
What is the right value of K?
Choosing the value of K
Elbow method: run K-means for a range of K (e.g. 1 to 8) and plot the cost function J against K. J always decreases as K grows; choose the K at the "elbow", where the decrease levels off.
[Plots: cost function J vs. no. of clusters K, 1-8.]
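
A sketch of the elbow computation with scikit-learn; inertia_ is the total within-cluster sum of squares, i.e. m times the distortion J defined earlier:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data

costs = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)   # within-cluster sum of squares for this k

# Inspect where the drop in cost levels off (the "elbow")
for k, c in zip(range(1, 9), costs):
    print(k, round(c, 2))
```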
Choosing the value of K
Sometimes, you’re running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for
how well it performs for that later purpose.
E.g. T-shirt sizing: the number of sizes to offer (say S/M/L vs. XS/S/M/L/XL) fixes the K that is useful downstream.
[Plots: Weight vs. Height, clustered for T-shirt sizing.]
A simple example of K-means (using K = 2)
The data set has seven points with two features (Variable 1, Variable 2):
1 = (1.0, 1.0), 2 = (1.5, 2.0), 3 = (3.0, 4.0), 4 = (5.0, 7.0), 5 = (3.5, 5.0), 6 = (4.5, 5.0), 7 = (3.5, 4.5)
[Scatter plot: Variable 2 vs. Variable 1, points 1-7.]
Step 1 (Initialization): randomly choose two centroids (K = 2). In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2: compute the distance from every point to each centroid.

Point   Distance to Centroid 1 (1.0, 1.0)     Distance to Centroid 2 (5.0, 7.0)
1       √((1 − 1)² + (1 − 1)²) = 0            √((5 − 1)² + (7 − 1)²) = 7.21
2       √((1 − 1.5)² + (1 − 2)²) = 1.12       √((5 − 1.5)² + (7 − 2)²) = 6.10
3       √((1 − 3)² + (1 − 4)²) = 3.61         √((5 − 3)² + (7 − 4)²) = 3.61
4       √((1 − 5)² + (1 − 7)²) = 7.21         √((5 − 5)² + (7 − 7)²) = 0
5       √((1 − 3.5)² + (1 − 5)²) = 4.72       √((5 − 3.5)² + (7 − 5)²) = 2.5
6       √((1 − 4.5)² + (1 − 5)²) = 5.31       √((5 − 4.5)² + (7 − 5)²) = 2.06
7       √((1 − 3.5)² + (1 − 4.5)²) = 4.30     √((5 − 3.5)² + (7 − 4.5)²) = 2.92

Assigning each point to its nearest centroid gives clusters {1, 2, 3} and {4, 5, 6, 7} (point 3 is equidistant and is kept with centroid 1). Recomputing the means gives new centroids m1 = (1.83, 2.33) and m2 = (4.12, 5.38).
Step 3: recompute the distances with the new centroids.

Point   Distance to Centroid 1 (1.83, 2.33)      Distance to Centroid 2 (4.12, 5.38)
1       √((1.83 − 1)² + (2.33 − 1)²) = 1.57      √((4.12 − 1)² + (5.38 − 1)²) = 5.38
2       √((1.83 − 1.5)² + (2.33 − 2)²) = 0.47    √((4.12 − 1.5)² + (5.38 − 2)²) = 4.29
3       √((1.83 − 3)² + (2.33 − 4)²) = 2.04      √((4.12 − 3)² + (5.38 − 4)²) = 1.78
4       √((1.83 − 5)² + (2.33 − 7)²) = 5.64      √((4.12 − 5)² + (5.38 − 7)²) = 1.84
5       √((1.83 − 3.5)² + (2.33 − 5)²) = 3.15    √((4.12 − 3.5)² + (5.38 − 5)²) = 0.73
6       √((1.83 − 4.5)² + (2.33 − 5)²) = 3.78    √((4.12 − 4.5)² + (5.38 − 5)²) = 0.54
7       √((1.83 − 3.5)² + (2.33 − 4.5)²) = 2.74  √((4.12 − 3.5)² + (5.38 − 4.5)²) = 1.08
Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}, with means:
Group 1: ((1 + 1.5)/2, (1 + 2)/2) = (1.25, 1.5)
Group 2: ((3 + 5 + 3.5 + 4.5 + 3.5)/5, (4 + 7 + 5 + 5 + 4.5)/5) = (3.9, 5.1)
Step 4: recompute the distances with the new centroids.

Point   Distance to Centroid 1 (1.25, 1.5)       Distance to Centroid 2 (3.9, 5.1)
1       √((1.25 − 1)² + (1.5 − 1)²) = 0.56       √((3.9 − 1)² + (5.1 − 1)²) = 5.02
2       √((1.25 − 1.5)² + (1.5 − 2)²) = 0.56     √((3.9 − 1.5)² + (5.1 − 2)²) = 3.92
3       √((1.25 − 3)² + (1.5 − 4)²) = 3.05       √((3.9 − 3)² + (5.1 − 4)²) = 1.42
4       √((1.25 − 5)² + (1.5 − 7)²) = 6.66       √((3.9 − 5)² + (5.1 − 7)²) = 2.20
5       √((1.25 − 3.5)² + (1.5 − 5)²) = 4.16     √((3.9 − 3.5)² + (5.1 − 5)²) = 0.41
6       √((1.25 − 4.5)² + (1.5 − 5)²) = 4.78     √((3.9 − 4.5)² + (5.1 − 5)²) = 0.61
7       √((1.25 − 3.5)² + (1.5 − 4.5)²) = 3.75   √((3.9 − 3.5)² + (5.1 − 4.5)²) = 0.72
▶ Therefore, the cluster assignments do not change.
▶ Thus, the algorithm halts here, and the final result consists of two clusters: {1, 2} and {3, 4, 5, 6, 7}.
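
The entire worked example can be replayed in a few lines of NumPy (a sketch; note that argmin breaks point 3's initial tie in favor of centroid 1, exactly as the tables above do):

```python
import numpy as np

# The seven points from the worked example
X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
mu = np.array([[1.0, 1.0], [5.0, 7.0]])  # initial centroids m1, m2

for step in range(3):
    d = np.linalg.norm(X[:, None] - mu[None, :], axis=2)  # point-to-centroid distances
    c = d.argmin(axis=1)                                  # nearest-centroid assignment
    mu = np.array([X[c == k].mean(axis=0) for k in range(2)])
    print(step, c, mu.round(2))
# Converges to clusters {1,2} and {3,4,5,6,7} with centroids (1.25, 1.5) and (3.9, 5.1)
```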
Pros and cons

Advantages of K-means
1. Relatively simple to implement.
2. Scales to large data sets.
3. Guaranteed to converge (to a local optimum).
4. Easily adapts to new examples.

Disadvantages of K-means
1. k must be chosen manually.
2. Results depend on the initial centroid values.
3. Struggles as the number of dimensions grows.