SlideShare a Scribd company logo
1 of 29
Download to read offline
Machine Learning using
Matlab
Lecture 9 Clustering part 1
Outline
● Unsupervised learning
● Agglomerative clustering
● K-means clustering
○ Bag-of-Words (BoW) model
● Other clustering algorithms
Unsupervised learning
● In unsupervised learning, the training data consists of a set of input vectors
without any corresponding target values
● Goals:
○ Clustering: discover groups of similar examples within the data
○ Density estimation: determine the distribution of data within the input space
○ Data visualization: project the data from a high-dimensional space down to two or three
dimensions
● Unsupervised learning was largely ignored by machine learning community
○ It is hard to say what the aim of unsupervised learning is
● Unsupervised learning is the future of machine learning
○ Large scale of unlabeled data
○ Very expensive and infeasible to annotate large data
Clustering - intuition
Unlabeled data After clustering
Clustering
● Suppose we have a data set , the objective of clustering is
to partition the data into different groups (clusters), where the data in each
group are similar to each other other than to those in other groups
● Clustering is one of the most commonly researched unsupervised learning
topic
● The only information clustering uses is the similarity between examples
● Clustering groups examples based of their mutual similarities
Clustering
● Clustering can help us:
○ Automatically organizing data
○ Understanding hidden structure in some data
● A good cluster algorithm is:
○ High within-cluster similarity
○ Low inter-cluster similarity
● Application:
○ Image segmentation
○ Social network analysis
○ Data compression
○ Gene expression data
Clustering
● Hierarchical clustering
○ Agglomerative: bottom-up
○ Divisive: top-down
● Centroid-based clustering
○ K-means clustering
○ Kernel k-means
○ Fuzzy c-means
● Distribution-based clustering
○ Gaussian Mixture Model (GMM)
● Density-based clustering
○ Mean-shift
○ Density-based spatial clustering of applications with noise (DBSCAN)
Distance measure
● Distance is inversely related to similarity, namely, large distance, low
similarity; small distance, high similarity
● Choice of the distance measure is very important for clustering
● Some commonly used distance metrics:
○ Euclidean distance (L2 norm):
○ Manhattan distance (L1 norm):
○ Mahalanobis distance: , where S is the covariance matrix
● Similarity is subjective and hard to define
● Different similarity criteria can lead to different clusters
The property of a distance measure
● Non-negative:
● Symmetry:
● Self-similarity:
● Triangle inequality:
● E.g., self-similarity, we shouldn’t conclude A looks like B, but B doesn't look
like A
Agglomerative clustering
● Initially, each point is treated as a
cluster
● Repeat:
○ Compute distance between all clusters in
terms of the defined distance measure
(store for efficiency)
○ Merge two nearest clusters into a new
cluster
○ Stop when there is only one cluster left
● Save both clustering and sequence of
cluster operations by “Dendrogram”
Agglomerative clustering
Agglomerative clustering
● Summary
○ Choose a cluster distance/dissimilarity method
○ Successively merge closet pair of clusters until only one cluster left
● Pros and cons
○ Easy to understand and implement
○ Don’t need to define the number of clusters
○ Possible to view partitions at different levels of granularities
○ Computational extensive, complexity is O(m2
logm)
Agglomerative clustering
Clustering gene expression data [Eisen et al 1998].
K-means clustering
● Input:
○ Number of clusters k
○ Training set
● Steps:
○ Randomly initialize k cluster centroids, i.e., mean/average
○ Repeat until converge
■ Assign each data to its closest centroid
■ Change the cluster center to the average of its assigned points
● Output:
○ Centroids of each cluster
○ The cluster label which is assigned to each data
K-means clustering
K-means clustering
● Let be the centroids of clusters
● Optimization objective:
● is the indicator to represent if the data i is assigned to cluster j
● Exact optimization of the k-means objective is NP-hard
K-means clustering
● The solution for k-means algorithm is heuristic
○ Fix centroids , optimize - assign step
○ Fix , optimize centroids - mean relocation step
● Each step is guaranteed to decrease the objective, thus guaranteed to
converge to a local optima.
K-means clustering
Algorithm:
● Randomly initialize k cluster centroids
● Repeat:
○ for i = 1 to m
■ Compute its indicator
○ for j = 1 to k
■ Compute the new cluster centroids
Random initialization
● K-means algorithm converges to a local optima, so it is extremely sensitive to
cluster center initialization.
● Bad initialization may lead to:
○ Poor convergence speed
○ Bad overall clustering
Handle bad initialization
1. Make initialized centroids evenly distribute in training set:
○ Choose first center as one of the examples, second which is the farthest from the first, third
which is the farthest from both, and so on
2. Try-and-see, namely, try multiple initializations and choose the best result:
○ For i = 1 to 100
■ Run K-means
■ Compute cost function
○ Pick the result that gave the lowest cost
Choosing the number of cluster k
● Elbow method: try different values of k, plot the k-means cost function J, and
find out the “elbow-point” in the figure
Image segmentation by k-means
K-means clustering
● K-means is the most popular and widely used
clustering algorithm
● Converge to a local optima
● Computational complexity in each iteration:
○ Assign data points to closest cluster center O(mk)
○ Change the cluster center to the average of its assigned
points O(m)
● “Hard assignment”, either 1 or 0 to each cluster
● Limitations:
○ Sensitive to outliers
○ Works well only for round shaped, and of roughly equal
sizes/density clusters
○ Can’t work well on non-linear clusters
Bag-of-Words model
● Origins from Natural Language Process (NLP):
○ A text (such as a sentence or a document) is represented as the bag (multiset) of its words,
disregarding grammar and even word order but keeping multiplicity.
● Orderless document representation: frequencies of words from a dictionary
● A simple example:
○ Document 1: John likes to watch movies. One of his favorite movie is XXX
○ Document 2: John also likes to play football.
○ Document 3: Tom likes to watch movie as well.
○ Dictionary: {movie, football}
○ [2 0] [0 1] [1 0]
Bag-of-Words model - intuition
Bag-of-Words model
Bag-of-Words model in computer vision
Algorithm:
● Extract scale-invariant features (SIFT [Lowe 2004], HOG [Dalal 2005], .etc)
from images
● Sample a subset of those features, construct a visual dictionary by k-means
(How many clusters k?)
● Quantize features using visual vocabulary
● Represent images by frequencies of “visual words”
The histogram representations of images can be used for classification using
some ML classification models.
Bag-of-Words model
● Pros:
○ flexible to geometry / deformations / viewpoint
○ compact summary of image content
○ provides vector representation for images
○ good recognition results in practice
● Cons:
○ basic model ignores geometry – must verify afterwards, or encode via features
○ background and foreground mixed when bag covers whole image
○ interest points or sampling: no guarantee to capture object-level parts
○ optimal vocabulary formation remains unclear
Agglomerative vs. k-means
Agglomerative cluster:
● gives different partitionings depending on
the level-of-resolution we are looking at
● doesn’t need the number of clusters to be
specified
● may be slow as it has to compute distance
many times O(m2
logm)
K-means clustering
● produces a single partitioning
● needs the number of clusters to be
specified
● more efficient O(m(k+1)i), where i is the
number of iteration

More Related Content

Similar to ML using MATLAB

CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
Nandhini S
 

Similar to ML using MATLAB (20)

A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn Algorithm
 
Natural Language Processing to Improve Student Engagement featuring Dr. Rebec...
Natural Language Processing to Improve Student Engagement featuring Dr. Rebec...Natural Language Processing to Improve Student Engagement featuring Dr. Rebec...
Natural Language Processing to Improve Student Engagement featuring Dr. Rebec...
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Hierarchical Clustering With KSAI
Hierarchical Clustering With KSAIHierarchical Clustering With KSAI
Hierarchical Clustering With KSAI
 
Algorithms__Data_Structures_-_iCSC_2018.pptx
Algorithms__Data_Structures_-_iCSC_2018.pptxAlgorithms__Data_Structures_-_iCSC_2018.pptx
Algorithms__Data_Structures_-_iCSC_2018.pptx
 
Unsupervised Learning
Unsupervised LearningUnsupervised Learning
Unsupervised Learning
 
Md2k 0219 shang
Md2k 0219 shangMd2k 0219 shang
Md2k 0219 shang
 
Machine learning using matlab.pdf
Machine learning using matlab.pdfMachine learning using matlab.pdf
Machine learning using matlab.pdf
 
Neural Network Approximation.pdf
Neural Network Approximation.pdfNeural Network Approximation.pdf
Neural Network Approximation.pdf
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentation
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
Embedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking systemEmbedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking system
 
hierarchical clustering.pptx
hierarchical clustering.pptxhierarchical clustering.pptx
hierarchical clustering.pptx
 
Evaluation of programs codes using machine learning
Evaluation of programs codes using machine learningEvaluation of programs codes using machine learning
Evaluation of programs codes using machine learning
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
 

Recently uploaded

Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 

Recently uploaded (20)

Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 

ML using MATLAB

  • 2. Outline ● Unsupervised learning ● Agglomerative clustering ● K-means clustering ○ Bag-of-Words (BoW) model ● Other clustering algorithms
  • 3. Unsupervised learning ● In unsupervised learning, the training data consists of a set of input vectors without any corresponding target values ● Goals: ○ Clustering: discover groups of similar examples within the data ○ Density estimation: determine the distribution of data within the input space ○ Data visualization: project the data from a high-dimensional space down to two or three dimensions ● Unsupervised learning was largely ignored by machine learning community ○ It is hard to say what the aim of unsupervised learning is ● Unsupervised learning is the future of machine learning ○ Large scale of unlabeled data ○ Very expensive and infeasible to annotate large data
  • 4. Clustering - intuition Unlabeled data After clustering
  • 5. Clustering ● Suppose we have a data set , the objective of clustering is to partition the data into different groups (clusters), where the data in each group are similar to each other other than to those in other groups ● Clustering is one of the most commonly researched unsupervised learning topic ● The only information clustering uses is the similarity between examples ● Clustering groups examples based of their mutual similarities
  • 6. Clustering ● Clustering can help us: ○ Automatically organizing data ○ Understanding hidden structure in some data ● A good cluster algorithm is: ○ High within-cluster similarity ○ Low inter-cluster similarity ● Application: ○ Image segmentation ○ Social network analysis ○ Data compression ○ Gene expression data
  • 7. Clustering ● Hierarchical clustering ○ Agglomerative: bottom-up ○ Divisive: top-down ● Centroid-based clustering ○ K-means clustering ○ Kernel k-means ○ Fuzzy c-means ● Distribution-based clustering ○ Gaussian Mixture Model (GMM) ● Density-based clustering ○ Mean-shift ○ Density-based spatial clustering of applications with noise (DBSCAN)
  • 8. Distance measure ● Distance is inversely related to similarity, namely, large distance, low similarity; small distance, high similarity ● Choice of the distance measure is very important for clustering ● Some commonly used distance metrics: ○ Euclidean distance (L2 norm): ○ Manhattan distance (L1 norm): ○ Mahalanobis distance: , where S is the covariance matrix ● Similarity is subjective and hard to define ● Different similarity criteria can lead to different clusters
  • 9. The property of a distance measure ● Non-negative: ● Symmetry: ● Self-similarity: ● Triangle inequality: ● E.g., self-similarity, we shouldn’t conclude A looks like B, but B doesn't look like A
  • 10. Agglomerative clustering ● Initially, each point is treated as a cluster ● Repeat: ○ Compute distance between all clusters in terms of the defined distance measure (store for efficiency) ○ Merge two nearest clusters into a new cluster ○ Stop when there is only one cluster left ● Save both clustering and sequence of cluster operations by “Dendrogram”
  • 12. Agglomerative clustering ● Summary ○ Choose a cluster distance/dissimilarity method ○ Successively merge closet pair of clusters until only one cluster left ● Pros and cons ○ Easy to understand and implement ○ Don’t need to define the number of clusters ○ Possible to view partitions at different levels of granularities ○ Computational extensive, complexity is O(m2 logm)
  • 13. Agglomerative clustering Clustering gene expression data [Eisen et al 1998].
  • 14. K-means clustering ● Input: ○ Number of clusters k ○ Training set ● Steps: ○ Randomly initialize k cluster centroids, i.e., mean/average ○ Repeat until converge ■ Assign each data to its closest centroid ■ Change the cluster center to the average of its assigned points ● Output: ○ Centroids of each cluster ○ The cluster label which is assigned to each data
  • 16. K-means clustering ● Let be the centroids of clusters ● Optimization objective: ● is the indicator to represent if the data i is assigned to cluster j ● Exact optimization of the k-means objective is NP-hard
  • 17. K-means clustering ● The solution for k-means algorithm is heuristic ○ Fix centroids , optimize - assign step ○ Fix , optimize centroids - mean relocation step ● Each step is guaranteed to decrease the objective, thus guaranteed to converge to a local optima.
  • 18. K-means clustering Algorithm: ● Randomly initialize k cluster centroids ● Repeat: ○ for i = 1 to m ■ Compute its indicator ○ for j = 1 to k ■ Compute the new cluster centroids
  • 19. Random initialization ● K-means algorithm converges to a local optima, so it is extremely sensitive to cluster center initialization. ● Bad initialization may lead to: ○ Poor convergence speed ○ Bad overall clustering
  • 20. Handle bad initialization 1. Make initialized centroids evenly distribute in training set: ○ Choose first center as one of the examples, second which is the farthest from the first, third which is the farthest from both, and so on 2. Try-and-see, namely, try multiple initializations and choose the best result: ○ For i = 1 to 100 ■ Run K-means ■ Compute cost function ○ Pick the result that gave the lowest cost
  • 21. Choosing the number of cluster k ● Elbow method: try different values of k, plot the k-means cost function J, and find out the “elbow-point” in the figure
  • 23. K-means clustering ● K-means is the most popular and widely used clustering algorithm ● Converge to a local optima ● Computational complexity in each iteration: ○ Assign data points to closest cluster center O(mk) ○ Change the cluster center to the average of its assigned points O(m) ● “Hard assignment”, either 1 or 0 to each cluster ● Limitations: ○ Sensitive to outliers ○ Works well only for round shaped, and of roughly equal sizes/density clusters ○ Can’t work well on non-linear clusters
  • 24. Bag-of-Words model ● Origins from Natural Language Process (NLP): ○ A text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. ● Orderless document representation: frequencies of words from a dictionary ● A simple example: ○ Document 1: John likes to watch movies. One of his favorite movie is XXX ○ Document 2: John also likes to play football. ○ Document 3: Tom likes to watch movie as well. ○ Dictionary: {movie, football} ○ [2 0] [0 1] [1 0]
  • 25. Bag-of-Words model - intuition
  • 27. Bag-of-Words model in computer vision Algorithm: ● Extract scale-invariant features (SIFT [Lowe 2004], HOG [Dalal 2005], .etc) from images ● Sample a subset of those features, construct a visual dictionary by k-means (How many clusters k?) ● Quantize features using visual vocabulary ● Represent images by frequencies of “visual words” The histogram representations of images can be used for classification using some ML classification models.
  • 28. Bag-of-Words model ● Pros: ○ flexible to geometry / deformations / viewpoint ○ compact summary of image content ○ provides vector representation for images ○ good recognition results in practice ● Cons: ○ basic model ignores geometry – must verify afterwards, or encode via features ○ background and foreground mixed when bag covers whole image ○ interest points or sampling: no guarantee to capture object-level parts ○ optimal vocabulary formation remains unclear
  • 29. Agglomerative vs. k-means Agglomerative cluster: ● gives different partitionings depending on the level-of-resolution we are looking at ● doesn’t need the number of clusters to be specified ● may be slow as it has to compute distance many times O(m2 logm) K-means clustering ● produces a single partitioning ● needs the number of clusters to be specified ● more efficient O(m(k+1)i), where i is the number of iteration