Machine Learning
Unsupervised Learning: Clustering
Peter Chen
HyperPlanar, Chief Data Scientist
Peter Chen
HyperPlanar, Chief Data Scientist
•20 Years Industry Experience –
Quantitative Investment Management,
Startups, Retail, Consulting, Software,
Energy, etc.
•Petco, Sempra Energy, Mitchell
International, EMC, Vistage, etc.
• Analytics & Data Science consulting
across a number of industries & clients
• B.S. from M.I.T.; graduate degrees from Harvard University
•What is Clustering?
•K-means Algorithms
•Gaussian Mixture Models
•Hierarchical Clustering
•Methods for Selecting the number of clusters
•Evaluating the Quality of the Clustering: Silhouette Plots
•Applications in Industry
Topics
Prerequisites
• Python Programming
  o Understand and read basic Python code
  o Know the pandas, numpy, and matplotlib libraries
• Basic Mathematics
  o Basic Probability
  o Statistics
• A basic conceptual understanding of how clustering works
• An intuitive understanding of the mathematics behind various clustering algorithms
• A walk-through of Python code examples showing how to use various clustering algorithms
• A look at how clustering is applied in various industry applications
What am I going to get from this course?
SECTION 1
Course Overview and Introductions
• What is Clustering?
• K-means clustering
• Gaussian Mixture Models (GMM)
• Hierarchical Clustering
• How to select the best number of clusters
• Industry Applications
Course Overview
• Grouping/Clustering is such a natural thing that humans do it
all the time!
•Kids separate Halloween candies by type
What is clustering?
• Biologists classify and group animals using
the system below
What is clustering?
• Watching tons of videos and instantly recognizing they are all different
types of cats. (Note: A nontrivial amount of time on the internet is spent
watching cat videos! )
What is clustering?
• Can we teach machines to do clustering in a somewhat automated fashion?
• Can a machine find groupings of similar things without us explicitly telling it how to do it?
Big Question?
• Yes!
• Using the machine learning technique of clustering, we can save time and money.
• For example, suppose we have a million images in our database and we want to label them automatically.
• Hiring people to manually review the million images would be costly and time-intensive.
Big Answer
• We feed the clustering algorithm thousands or millions of images and then let it group/cluster/categorize them into groups such as cats
Big Answer
• Create groups with high similarity among the members of the same cluster
• Ensure significant dissimilarity (differences) between members of two different clusters
More exact definition of Clustering
SECTION 2
K-Means Clustering
•Partitions the data set into k clusters
•Each data point belongs to the cluster with the nearest mean
K-means Clustering: High level Idea
•Create some sample data with 5 clusters
K-means Algorithm: Code samples
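The code shown on this slide did not survive in the text, so the following is only a minimal sketch of how such sample data could be generated, assuming NumPy. The center coordinates, the spread, and the names (`centers`, `points_per_cluster`, `true_labels`) are illustrative choices, not the course's values.

```python
import numpy as np

# Fixed seed so the sketch is reproducible.
rng = np.random.default_rng(seed=0)

# Five hand-picked 2-D centers (illustrative values only).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
points_per_cluster = 100

# Around each center, sample Gaussian noise with standard deviation 0.8.
X = np.vstack([
    center + rng.normal(loc=0.0, scale=0.8, size=(points_per_cluster, 2))
    for center in centers
])

# The generating cluster index is kept only for checking results later;
# a clustering algorithm never sees these labels.
true_labels = np.repeat(np.arange(len(centers)), points_per_cluster)

print(X.shape)            # (500, 2)
print(true_labels.shape)  # (500,)
```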
• Plot the same 5 clusters
K-means Algorithm: Code samples
•K-means automatically identifies the 5 clusters (color coded)
K-means Algorithm: Code samples
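The original code for this slide is not in the text. One common way to do this, assuming scikit-learn is installed (the slides only list pandas, numpy, and matplotlib as prerequisites), might look like the sketch below. The data set and parameter values are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data drawn around 5 centers (illustrative values).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=0)

# Fit k-means with k=5; n_init=10 reruns the algorithm from 10 random
# initializations and keeps the best run.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster index (0..4) per point

print(kmeans.cluster_centers_.shape)  # (5, 2)
```

The `labels` array can then be passed as the color argument of a scatter plot to get a color-coded picture like the one described on the slide.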
How does k-means do that?
K-means Algorithm
• Pick k random cluster centers
• Repeat until converged:
    Expectation step: assign each point to the nearest cluster center
    Maximization step: set each cluster center to the mean of the points assigned to it
k-means Algorithm: Expectation-Maximization
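Written as formulas, the two steps could be expressed as follows. The notation is an assumption made here for illustration: x_i is a data point, \mu_j the center of cluster j, c_i the cluster index assigned to x_i, and C_j the set of points currently assigned to cluster j.

```latex
% Expectation (assignment) step: each point joins its nearest center
c_i \leftarrow \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2

% Maximization (update) step: each center moves to the mean of its points
\mu_j \leftarrow \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i,
\qquad C_j = \{\, i : c_i = j \,\}
```

Neither step can increase the total squared distance \sum_i \lVert x_i - \mu_{c_i} \rVert^2, which is why the loop converges, although possibly to a local optimum.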
• Step 1: Decide the number of clusters (k).
• Step 2: Randomly initialize the center (centroid) of each cluster.
• Step 3: Calculate the distance of each observation from each cluster centroid.
• Step 4: Assign each observation to the cluster whose centroid is nearest.
• Step 5: Recalculate each cluster centroid as the mean of ALL the observations in that cluster.
• Step 6: Repeat the process starting from Step 3.
• Step 7: Stop when no observation is reassigned from one cluster to another.
k-means Algorithm: Step by Step
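The steps above can be sketched from scratch with NumPy. This is a teaching sketch under stated assumptions, not the course's implementation: the function name `kmeans`, its parameters, and the choice of initializing centroids from randomly chosen data points are all illustrative.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array.

    Returns (labels, centroids). Step 1 (choosing k) is the caller's job.
    """
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)

    # Step 6: repeat Steps 3 to 5 until Step 7 says stop.
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every observation to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each observation to its nearest centroid.
        new_labels = distances.argmin(axis=1)
        # Step 7: stop when no observation changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: move each centroid to the mean of its assigned observations.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)

    return labels, centroids


# Tiny usage example: two obvious groups on a line.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids.ravel())
```

On this toy data the two centroids end up near 0.1 and 10.1; which of them gets index 0 depends on the random initialization.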
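• We need to normalize our data points to get the clustering right, because input features are usually on different scales and features with large values would otherwise dominate the distance.
• Min-max scaling in Python code (pandas, with df a DataFrame of numeric features):
• df_norm = (df - df.min()) / (df.max() - df.min())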
k-means Algorithm: Normalizing
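The pandas one-liner on this slide applies min-max scaling column by column. As a formula, with x_{ij} denoting the value of feature (column) j for observation (row) i (notation assumed here for illustration):

```latex
% Min-max scaling of feature j: subtract the column minimum,
% then divide by the column range
x'_{ij} = \frac{x_{ij} - \min_{r} x_{rj}}{\max_{r} x_{rj} - \min_{r} x_{rj}}
```

After scaling, every feature lies in the interval [0, 1]. A feature that is constant across all observations has a range of zero, so the formula is undefined for it; such columns should be dropped or handled separately.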
Similarity Measures
• We have to define the distance between two points
• Here are some popular distance measures:
1) Euclidean distance
2) Manhattan distance
3) Minkowski distance
k-means Algorithm: Similarity
measures
• Square root of the sum of squared differences between the two vectors (the straight-line distance)
Similarity measures: Euclidean
distance
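The formula from the original slide is not in the text; the standard definition, for two n-dimensional vectors a and b, is:

```latex
d_{\text{Euclidean}}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

% Worked example with a = (1, 2) and b = (4, 6):
% d = \sqrt{(1-4)^2 + (2-6)^2} = \sqrt{9 + 16} = 5
```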
• Sum of the absolute differences between the coordinates of the two vectors
Similarity measures: Manhattan
distance
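The formula from the original slide is not in the text; the standard definition, for two n-dimensional vectors a and b, is:

```latex
d_{\text{Manhattan}}(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert

% Worked example with a = (1, 2) and b = (4, 6):
% d = |1 - 4| + |2 - 6| = 3 + 4 = 7
```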
• Generalized distance metric with a parameter p: with p=1 it becomes the Manhattan distance, with p=2 the Euclidean distance, etc.
Similarity measures: Minkowski
distance
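The Minkowski distance of order p is the p-th root of the sum of the absolute coordinate differences raised to the power p. A minimal pure-Python sketch is shown below; the function name `minkowski_distance` and the example vectors are illustrative, not taken from the course.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p >= 1 between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (1.0, 2.0), (4.0, 6.0)
print(minkowski_distance(a, b, p=1))  # 7.0  (Manhattan: 3 + 4)
print(minkowski_distance(a, b, p=2))  # 5.0  (Euclidean: sqrt(9 + 16))
```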
• All distance metrics must satisfy the following mathematical properties:
• d(a, b) ≥ 0: the distance between two points must be non-negative.
• d(a, b) = 0 ↔ a = b: the distance between two points is zero iff they are the same point.
• d(a, b) = d(b, a): symmetry.
• d(a, c) ≤ d(a, b) + d(b, c): triangle inequality (the shortest distance between two points is a straight line).
Can we just invent any distance metric?
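To see why not every plausible-looking formula qualifies, the sketch below checks the triangle inequality on three points for the Euclidean distance and for the squared Euclidean distance (the same sum without the square root). The helper names are illustrative. Note that passing such a check on a handful of points does not prove a function is a metric, while a single failure is enough to rule it out.

```python
from itertools import product


def squared_euclidean(a, b):
    """Sum of squared differences (no square root): a candidate 'distance'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Ordinary Euclidean distance."""
    return squared_euclidean(a, b) ** 0.5


def satisfies_triangle_inequality(dist, points, tol=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every triple of the given points."""
    return all(
        dist(a, c) <= dist(a, b) + dist(b, c) + tol
        for a, b, c in product(points, repeat=3)
    )


points = [(0.0,), (1.0,), (2.0,)]
print(satisfies_triangle_inequality(euclidean, points))          # True
print(satisfies_triangle_inequality(squared_euclidean, points))  # False: 4 > 1 + 1
```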
Issues
• The globally optimal result may not be found; the algorithm can converge to a local optimum
• The number of clusters must be selected beforehand
• Limited to linear boundaries between clusters; works best when clusters are roughly spherical
• Can be slow for large numbers of samples
k-means Algorithm: Issues
• Please see the IPython “Clustering Mini-Project” notebook on the course website
Mini-Project: Complete Code Solution