Unsupervised Learning: Clustering

Machine Learning
Unsupervised Learning: Clustering
Peter Chen
HyperPlanar, Chief Data Scientist

Peter Chen
HyperPlanar, Chief Data
Scientist
•20 Years Industry Experience –
Quantitative Investment Management,
Startups, Retail, Consulting, Software,
Energy, etc.
•Petco, Sempra Energy, Mitchell
International, EMC, Vistage, etc.
• Analytics & Data Science consulting
across a number of industries & clients
•B.S. from M.I.T.; graduate degrees from
Harvard University

•What is Clustering?
•K-means Algorithms
•Gaussian Mixture Models
•Hierarchical Clustering
•Methods for Selecting the number of clusters
•Evaluating the Quality of the Clustering: Silhouette Plots
•Applications in Industry
Topics

Prerequisites
•Python Programming
  o Understand and read basic Python code
  o Know the pandas, numpy, and matplotlib libraries
•Basic Mathematics
  o Basic Probability
  o Statistics

•A basic conceptual understanding of how clustering works
•An intuitive understanding of the mathematics behind various
clustering algorithms
•Walkthroughs of Python code examples showing how to use various
clustering algorithms
•Examples of how clustering is applied in various industry applications
What am I going to get from this course?

SECTION 1
Course Overview and Introductions

• What is Clustering?
• K-means clustering
• Gaussian Mixture Models (GMM)
• Hierarchical Clustering
• How to select the best number of clusters
• Industry Applications
Course Overview

• Grouping/Clustering is such a natural thing that humans do it
all the time!
•Kids separate Halloween candies by type
What is clustering?

• Biologists classify and group animals using
the system below
What is clustering?

• Watching tons of videos and instantly recognizing that they all show
different types of cats. (Note: A nontrivial amount of time on the internet
is spent watching cat videos!)
What is clustering?

•Can we teach machines to do clustering in a somewhat
automated fashion?
•Can a machine find groupings of similar things without us
explicitly telling it how to do it?
Big Question?

•Yes!
•Using the machine learning technique clustering, we can save
time and money.
•For example, suppose we have a million images in our
database and we want to label them automatically
•Hiring people to manually review the million images would
be costly and time-intensive
Big Answer

• We feed the clustering algorithm thousands or millions of
images and then let it group/cluster/categorize them, for example
into types of cats
Big Answer

• Create groups with high similarity among the members of each
cluster
•Create groups with significant dissimilarity (differences)
between members of different clusters
More exact definition of Clustering

•Partitions the data set into k clusters
•Each data point belongs to the cluster with the nearest mean
K-means Clustering: High level Idea

•Create some sample data with 5 clusters
K-means Algorithm : Code samples
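
The code from the original slide is not reproduced in this text, so the following is only a minimal sketch of how such sample data could be generated with numpy. The center coordinates, the spread, the seed, and the names `centers`, `points_per_cluster`, `X`, and `y_true` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Hypothetical cluster centers, chosen far apart for illustration
centers = np.array([[-8, -8], [-8, 8], [0, 0], [8, -8], [8, 8]], dtype=float)
points_per_cluster = 100

# Draw points around each center with a small Gaussian spread
X = np.vstack([
    center + rng.normal(scale=1.0, size=(points_per_cluster, 2))
    for center in centers
])

# Remember which center generated each point (a clustering algorithm never sees this)
y_true = np.repeat(np.arange(len(centers)), points_per_cluster)

print(X.shape)  # (500, 2)
```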

•Plot the same 5 clusters
K-means Algorithm: Code samples

•K-means automatically identifies the 5 clusters (color coded)
K-means Algorithm: Code samples
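
As a hedged illustration of this step, the sketch below uses scikit-learn's `KMeans` and `make_blobs`. The slide's own code is not available here, so the choice of library, the center coordinates, and the parameter values are assumptions rather than the author's exact code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical, well-separated centers for 5 sample clusters
centers = [(-8, -8), (-8, 8), (0, 0), (8, -8), (8, 8)]
X, y_true = make_blobs(n_samples=500, centers=centers,
                       cluster_std=1.0, random_state=0)

# Ask k-means for 5 clusters; n_init=10 reruns it from 10 different
# starting points and keeps the best result
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster index (0..4) per data point

print(kmeans.cluster_centers_.shape)  # (5, 2)
```

The `labels` array can then be used to color-code a scatter plot, for example with `plt.scatter(X[:, 0], X[:, 1], c=labels)` from matplotlib.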

How does k-means
do that?
K-means Algorithm

• Pick some random cluster centers (k of them)
• Repeat until converged:
Expectation step: Assign each point to the nearest cluster
center
Maximization step: Set each cluster center to the
mean of the points assigned to it
k-means Algorithm: Expectation-
Maximization

• Step 1: Decide the number of clusters (k)
• Step 2: Randomly assign a cluster center (centroid) to each cluster
• Step 3: Calculate the distance of each observation from each cluster centroid
• Step 4: Assign each observation to the cluster whose centroid is nearest
• Step 5: Recalculate each cluster centroid as the mean of ALL the observations in the cluster
• Step 6: Repeat the process starting at Step 3
• Step 7: Stop when no observation was reassigned from one cluster to another
k-means Algorithm: Step by Step
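
The steps above can be sketched in numpy as follows. This is a minimal illustration, not the course's reference implementation: the function name `kmeans`, the `max_iter` safety cap, the seed, and the choice to initialize centroids from randomly picked observations are all assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following Steps 1-7 (k is chosen by the caller)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 2: pick k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(max_iter):
        # Step 3: Euclidean distance from every observation to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

        # Step 4: assign each observation to its nearest centroid
        new_labels = dists.argmin(axis=1)

        # Step 7: stop when no observation changed cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 5: move each centroid to the mean of its members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
        # Step 6: the loop repeats from Step 3

    return labels, centroids
```

Calling `kmeans(X, 2)` on a small array with two well-separated groups of points returns one label per row plus the two final centroids. Which group receives label 0 depends on the random initialization.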

• We need to normalize our data points to get the clustering
right, because our input features are usually on different scales.
•Min-max normalization rescales every column to the range [0, 1]. In Python code:
•df_norm = (df-df.min()) / (df.max()-df.min())
k-means Algorithm: Normalizing
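
To see the normalization at work, here is a small sketch on a made-up pandas DataFrame. The column names and values are hypothetical and chosen only so that the two features sit on very different scales.

```python
import pandas as pd

# Hypothetical customer data: the two columns live on very different scales
df = pd.DataFrame({
    "age": [20, 30, 40, 60],
    "income": [30_000, 45_000, 60_000, 90_000],
})

# Min-max normalization: each column is rescaled to the range [0, 1]
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)
```

Without this step, distances would be dominated by `income`, simply because its raw numbers are larger. One caveat of this formula: a column whose values are all identical has `max - min = 0`, so the division produces NaN and such columns need separate handling.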

• We have to define the distance between two points
• Here are some popular distance measures:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
k-means Algorithm: Similarity
measures

• Straight-line distance: the square root of the sum of squared
differences between the two vectors
Similarity measures: Euclidean
distance

• Sum of the absolute differences between the two vectors
Similarity measures: Manhattan
distance

• Generalized distance metric: with p=1 it becomes the
Manhattan distance, with p=2 the Euclidean distance, etc.
Similarity measures: Minkowski
distance
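
The relationship between the three distance measures can be checked with a few lines of plain Python. This is a minimal sketch; the function name `minkowski` and the example points are chosen here for illustration.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)

print(minkowski(a, b, 1))  # Manhattan: |3| + |4| = 7.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```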

• Every distance metric must satisfy the following mathematical
properties:
• d(a, b) ≥ 0 Distance between 2 points must be non-negative
• d(a, b) = 0 ↔ a = b Distance between 2 points is zero iff they are the same
point
• d(a, b) = d(b, a) Symmetry.
• d(a, c) ≤ d(a, b) + d(b, c) Triangle inequality. Shortest distance between
two points is a straight line.
Can we just invent any distance
metric??
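
As a quick illustration of why not every formula qualifies, the sketch below checks the four properties on a handful of points. The helper name `satisfies_metric_properties` and the sample points are assumptions made for this example. Passing the check on a few points does not prove that a function is a metric, but a single failure is enough to rule it out.

```python
from itertools import product

def satisfies_metric_properties(d, points, tol=1e-12):
    """Check the four metric properties of d on the given sample points."""
    for a, b in product(points, repeat=2):
        if d(a, b) < 0:                      # non-negativity
            return False
        if (d(a, b) == 0) != (a == b):       # zero iff same point
            return False
        if abs(d(a, b) - d(b, a)) > tol:     # symmetry
            return False
    for a, b, c in product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:  # triangle inequality
            return False
    return True

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

points = [(0, 0), (1, 0), (2, 0), (0, 3)]
print(satisfies_metric_properties(euclidean, points))          # True
print(satisfies_metric_properties(squared_euclidean, points))  # False
```

Squared Euclidean distance fails on the triangle inequality: from (0, 0) to (2, 0) it gives 4, while going via (1, 0) gives 1 + 1 = 2.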

•The globally optimal result may not be achieved (the algorithm can
converge to a local optimum)
•Number of clusters must be selected beforehand
•Limited to linear cluster boundaries (hard spherical boundaries)
•Can be slow for large numbers of samples
k-means Algorithm: Issues

• Please see the IPython “Clustering Mini-Project” notebook on the course website
Mini-Project: Complete Code Solution
