2. Trees vs Clusters
Trees (Supervised Learning)
Provide: labeled data
Learning Task: predict the label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group the data by similarity
3. Trees vs Clusters
Trees (labeled data):

sepal length | sepal width | petal length | petal width | species
5.1          | 3.5         | 1.4          | 0.2         | setosa
5.7          | 2.6         | 3.5          | 1.0         | versicolor
6.7          | 2.5         | 5.8          | 1.8         | virginica
…            | …           | …            | …           | …

Clusters (unlabeled data):

sepal length | sepal width | petal length | petal width
5.1          | 3.5         | 1.4          | 0.2
5.7          | 2.6         | 3.5          | 1.0
6.7          | 2.5         | 5.8          | 1.8
…            | …           | …            | …

Trees Learning Task: given inputs “X” and labels “Y”, find a function “f” such that f(X) ≈ Y.
Clusters Learning Task: find “k” clusters such that the data in each cluster is self-similar.
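A minimal sketch of the two tasks side by side, assuming scikit-learn and its bundled iris dataset (the same measurements shown above):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# X holds the four measurements, y the species labels
X, y = load_iris(return_X_y=True)

# Trees (supervised): learn f such that f(X) ≈ Y
tree = DecisionTreeClassifier().fit(X, y)

# Clusters (unsupervised): group X by similarity; y is never used
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```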
5. Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters
with a higher percentage of high-LTV users. Since they have
similar usage patterns, the remaining users in these clusters
may be good candidates for upsell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics
and an LTV (lifetime value) based on in-game purchases
• Assumption: usage correlates with LTV
[Figure: example clusters containing 0%, 3%, 7%, and 1% high-LTV users]
6. Item Discovery
• Dataset of 86 whiskies
• Each whisky scored on a scale from 0 to 4
for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover
whiskies that have similar taste.
[Plot: whiskies plotted by Smoky, Fruity, and Body scores]
7. Association
• Dataset of Lending Club loans
• Mark any loan that is currently or has ever
been late as “trouble”
GOAL: Cluster the loans by application profile to rank loan
quality by the percentage of trouble loans in each cluster's population
[Figure: clusters containing 0%, 3%, 7%, and 1% trouble loans]
8. Active Learning
GOAL: Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each
cluster to label the data (a sketch follows below).
• Dataset of diagnostic measurements of 768
patients.
• Want to test each patient for diabetes and
label the dataset to build a model, but the
test is expensive*
*For a more realistic example of high cost, imagine a dataset with a billion
transactions, each one needing to be labeled as fraud/not-fraud. Or a million images
which need to be labeled as cat/not-cat.
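A minimal sketch of this sampling strategy, assuming scikit-learn; the k=10 and the function name are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_representatives(X, k=10, seed=0):
    """Cluster the patient measurements, then return the index of the
    instance closest to each centroid -- the candidates to send for the
    (expensive) test instead of a random sample."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    picks = []
    for j, center in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == j)[0]
        dists = np.linalg.norm(X[members] - center, axis=1)
        picks.append(members[dists.argmin()])
    return picks
```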
11. Human Example
• A human used prior knowledge to select possible
features that separated the objects:
• “round”, “skinny”, “edges”, “hard”, etc.
• Items were then separated based on the chosen
features
• Separation quality was then tested to ensure:
• it met the criterion of K=3
• groups were sufficiently “distant”
• there was no crossover
12. Learning from Humans
Create features that capture these object differences:
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert the ratio (so it is always ≥ 1)
• Number of Surfaces
• distinct surfaces require “edges”, which have corners
• easier to count
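A minimal sketch of those two features in code; the measurements below are hypothetical stand-ins, not real data from the exercise:

```python
def length_width_ratio(length, width):
    """Ratio >= 1: values near 1 read as 'round', larger values as 'skinny'.
    When length/width < 1 we invert, per the rule above."""
    ratio = length / width
    return ratio if ratio >= 1 else 1 / ratio

# (length, width, num_surfaces) -- hypothetical measurements
objects = {
    "penny": (19.0, 19.0, 3),  # round: ratio of 1
    "screw": (25.0, 5.0, 2),   # skinny: large ratio
    "box":   (30.0, 20.0, 6),  # six distinct surfaces
}

features = {name: (length_width_ratio(l, w), n)
            for name, (l, w, n) in objects.items()}
```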
14. Now we can Plot
[Scatter plot, K=3: Num Surfaces vs Length/Width, with points for box, block, eraser, knob, penny, dime, bead, key, battery, and screw]
K-Means Key Insight:
We can find clusters using distances
in n-dimensional feature space
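“Distance” here is ordinary Euclidean distance; a minimal sketch, assuming NumPy:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two points in n-dimensional feature space."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# e.g. (length/width ratio, num surfaces) for two hypothetical objects
print(euclidean([1.0, 3.0], [5.0, 2.0]))
```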
15. Now we can Plot
[The same scatter plot, now with circles drawn around the three clusters]
K-Means:
Find the “best” circles: centers placed so that the distance
from each enclosed point to its center is minimized, with every point included
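A minimal sketch of the K-Means loop itself (Lloyd's algorithm): assign every point to its nearest center, move each center to the mean of its points, repeat until the centers stop moving. Plain NumPy, not a library K-Means:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance from every point to every center, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```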
20. Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• k-means++
• the first center is chosen randomly from the instances
• each subsequent center is chosen from the remaining
instances with probability proportional to its squared
distance from the instance's closest existing cluster center
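A minimal sketch of that k-means++ seeding rule, assuming NumPy:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """First center: a uniformly random instance. Each later center: drawn
    with probability proportional to squared distance to its nearest
    already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```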
22. Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
23. Distance to Missing Value?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Or ignore instances with missing values entirely
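A minimal sketch of the replacement idea for one numeric column (mean shown; swap in median, max, min, or zero):

```python
import numpy as np

def impute(column, strategy=np.nanmean):
    """Replace NaNs in a numeric column with a summary of the observed values."""
    column = np.asarray(column, dtype=float)
    return np.where(np.isnan(column), strategy(column), column)

print(impute([5.1, np.nan, 6.7]))  # NaN becomes the column mean, 5.9
```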
24. Distance to Categorical?
• Special distance function:
    if valA == valB:
        distance = 0  # or a scaling value
    else:
        distance = 1
• Assign the centroid the most common category among
the member instances
• Compute Euclidean distance as normal
One approach, similar to “k-prototypes”
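A minimal sketch of that special distance function; a hand-rolled illustration in the spirit of k-prototypes, not the k-prototypes library itself:

```python
import numpy as np

def categorical_distance(a, b, weight=1.0):
    """Per-feature distance is 0 on a match, `weight` (the scaling value)
    on a mismatch; the per-feature distances combine Euclidean-style."""
    per_feature = [0.0 if x == y else weight for x, y in zip(a, b)]
    return float(np.sqrt(np.sum(np.square(per_feature))))
```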
25. Distance to Categorical?
             feature_1   feature_2   feature_3
instance_1   red         cat         ball
instance_2   red         cat         ball
instance_3   red         cat         box
instance_4   blue        dog         fridge

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
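Running the sketch above on these rows reproduces the distances in the table:

```python
i1 = ["red", "cat", "ball"]
i2 = ["red", "cat", "ball"]
i3 = ["red", "cat", "box"]
i4 = ["blue", "dog", "fridge"]

print(categorical_distance(i1, i2))  # 0.0 -- all three features match
print(categorical_distance(i1, i3))  # 1.0 -- one mismatch
print(categorical_distance(i1, i4))  # 1.732... = sqrt(3), three mismatches
```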