2. Trees vs Clusters
Trees (Supervised Learning)
Provide: labeled data
Learning Task: predict the label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group the data by similarity
3. Trees vs Clusters
Trees (labeled data):

sepal length | sepal width | petal length | petal width | species
5.1          | 3.5         | 1.4          | 0.2         | setosa
5.7          | 2.6         | 3.5          | 1.0         | versicolor
6.7          | 2.5         | 5.8          | 1.8         | virginica
…            | …           | …            | …           | …

Clusters (unlabeled data):

sepal length | sepal width | petal length | petal width
5.1          | 3.5         | 1.4          | 0.2
5.7          | 2.6         | 3.5          | 1.0
6.7          | 2.5         | 5.8          | 1.8
…            | …           | …            | …

Trees Learning Task: given inputs “X” and labels “Y”, find a function “f” such that f(X) ≈ Y.
Clusters Learning Task: find “k” clusters such that the data in each cluster is self-similar.
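A minimal sketch of the two tasks side by side, assuming scikit-learn and its bundled iris dataset (the same measurements shown above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# X holds the four measurements, y the species labels
X, y = load_iris(return_X_y=True)

# Trees (supervised): learn f such that f(X) ≈ Y
tree = DecisionTreeClassifier().fit(X, y)

# Clusters (unsupervised): group X by similarity; y is never used
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```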
5. Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters
with a higher percentage of high-LTV users. Since they have
similar usage patterns, the remaining users in these clusters
may be good candidates for upsell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics
and an LTV (lifetime value) based on in-game purchases
• Assumption: usage correlates with LTV
[Figure: example clusters containing 0%, 3%, 7%, and 1% high-LTV users]
6. Item Discovery
• Dataset of 86 whiskies
• Each whisky scored on a scale from 0 to 4
for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover
whiskies that have similar taste.
[Plot: whiskies plotted by Smoky, Fruity, and Body scores]
7. Association
• Dataset of Lending Club loans
• Mark any loan that is currently or has ever
been late as “trouble”
GOAL: Cluster the loans by application profile to rank loan
quality by the percentage of trouble loans in each cluster's population
[Figure: clusters containing 0%, 3%, 7%, and 1% trouble loans]
8. Active Learning
GOAL: Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each
cluster to label the data (a sketch follows below).
• Dataset of diagnostic measurements of 768
patients.
• Want to test each patient for diabetes and
label the dataset to build a model, but the
test is expensive*
*For a more realistic example of high cost, imagine a dataset with a billion
transactions, each one needing to be labeled as fraud/not-fraud. Or a million images
which need to be labeled as cat/not-cat.
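A minimal sketch of this sampling strategy, assuming scikit-learn; the k=10 and the function name are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_representatives(X, k=10, seed=0):
    """Cluster the patient measurements, then return the index of the
    instance closest to each centroid -- the candidates to send for the
    (expensive) test instead of a random sample."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    picks = []
    for j, center in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == j)[0]
        dists = np.linalg.norm(X[members] - center, axis=1)
        picks.append(members[dists.argmin()])
    return picks
```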
11. Human Example
• A human used prior knowledge to select possible
features that separated the objects:
• “round”, “skinny”, “edges”, “hard”, etc.
• Items were then separated based on the chosen
features
• Separation quality was then tested to ensure:
• it met the criterion of K=3
• groups were sufficiently “distant”
• there was no crossover
12. Learning from Humans
Create features that capture these object differences:
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert the ratio (so it is always ≥ 1)
• Number of Surfaces
• distinct surfaces require “edges”, which have corners
• easier to count
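A minimal sketch of those two features in code; the measurements below are hypothetical stand-ins, not real data from the exercise:

```python
def length_width_ratio(length, width):
    """Ratio >= 1: values near 1 read as 'round', larger values as 'skinny'.
    When length/width < 1 we invert, per the rule above."""
    ratio = length / width
    return ratio if ratio >= 1 else 1 / ratio

# (length, width, num_surfaces) -- hypothetical measurements
objects = {
    "penny": (19.0, 19.0, 3),  # round: ratio of 1
    "screw": (25.0, 5.0, 2),   # skinny: large ratio
    "box":   (30.0, 20.0, 6),  # six distinct surfaces
}

features = {name: (length_width_ratio(l, w), n)
            for name, (l, w, n) in objects.items()}
```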
14. Now we can Plot
[Scatter plot, K=3: Num Surfaces vs Length/Width, with points for box, block, eraser, knob, penny, dime, bead, key, battery, and screw]
K-Means Key Insight:
We can find clusters using distances
in n-dimensional feature space
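“Distance” here is ordinary Euclidean distance; a minimal sketch, assuming NumPy:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional feature space."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# e.g. (length/width ratio, num surfaces) for two hypothetical objects
print(euclidean([1.0, 3.0], [5.0, 2.0]))
```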
15. Now we can Plot
[The same scatter plot, now with circles drawn around the three clusters]
K-Means:
Find the “best” circles: centers placed so that the distance
from each enclosed point to its center is minimized, with every point included
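A minimal sketch of the K-Means loop itself (Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, repeat until the centers stop moving. Plain NumPy, not a library K-Means:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance from every point to every center, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```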
20. Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• k-means++
• the first center is chosen randomly from the instances
• each subsequent center is chosen from the remaining
instances with probability proportional to its squared
distance from the instance's closest existing cluster center
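A minimal sketch of that k-means++ seeding rule, assuming NumPy:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """First center: a uniformly random instance. Each later center: drawn
    with probability proportional to squared distance to its nearest
    already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```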
22. Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
23. Distance to Missing Value?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Or ignore instances with missing values entirely
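A minimal sketch of the replacement idea for one numeric column (mean shown; swap in median, max, min, or zero):

```python
import numpy as np

def impute(column, strategy=np.nanmean):
    """Replace NaNs in a numeric column with a summary of the observed values."""
    column = np.asarray(column, dtype=float)
    return np.where(np.isnan(column), strategy(column), column)

print(impute([5.1, np.nan, 6.7]))  # NaN becomes the column mean, 5.9
```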
24. Distance to Categorical?
• Special distance function:
    if valA == valB:
        distance = 0  # or a scaling value
    else:
        distance = 1
• Assign the centroid the most common category among
the member instances
• Compute Euclidean distance as normal
One approach, similar to “k-prototypes”
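A minimal sketch of that special distance function; a hand-rolled illustration in the spirit of k-prototypes, not the k-prototypes library itself:

```python
import numpy as np

def categorical_distance(a, b, weight=1.0):
    """Per-feature distance is 0 on a match, `weight` (the scaling value)
    on a mismatch; the per-feature distances combine Euclidean-style."""
    per_feature = [0.0 if x == y else weight for x, y in zip(a, b)]
    return float(np.sqrt(np.sum(np.square(per_feature))))
```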
25. Distance to Categorical?
             feature_1   feature_2   feature_3
instance_1   red         cat         ball
instance_2   red         cat         ball
instance_3   red         cat         box
instance_4   blue        dog         fridge

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
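Running the sketch above on these rows reproduces the distances in the table:

```python
i1 = ["red", "cat", "ball"]
i2 = ["red", "cat", "ball"]
i3 = ["red", "cat", "box"]
i4 = ["blue", "dog", "fridge"]

print(categorical_distance(i1, i2))  # 0.0 -- all three features match
print(categorical_distance(i1, i3))  # 1.0 -- one mismatch
print(categorical_distance(i1, i4))  # 1.732... = sqrt(3), three mismatches
```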