VSSML16 L3. Clusters and Anomaly Detection

BigML, Inc 2
Cluster Analysis
Poul Petersen
CIO, BigML, Inc
Finding Similarities

BigML, Inc 3Unsupervised Learning
Trees vs Clusters
Trees (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity

Trees vs Clusters
sepal

length
sepal

width
petal

length
petal

width
species
5.1 3.5 1.4 0.2 setosa
5.7 2.6 3.5 1.0 versicolor
6.7 2.5 5.8 1.8 virginica
… … … … …
sepal

length
sepal

width
petal

length
petal

width
5.1 3.5 1.4 0.2
5.7 2.6 3.5 1.0
6.7 2.5 5.8 1.8
… … … …
Inputs “X” Label “Y”
Learning Task:
Find function “f” such that:
f(X)≈Y
Learning Task:
Find “k” clusters such that
the data in each cluster is self
similar

Use Cases
• Customer segmentation
• Item discovery
• Similarity
• Recommender
• Active learning

Customer Segmentation
GOAL: Cluster the users by usage
statistics. Identify clusters with a
higher percentage of high LTV users.
Since they have similar usage
patterns, the remaining users in
these clusters may be good
candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage
statistics and a LTV based on in-game
purchases
• Assumption: Usage correlates to LTV
0%
3%
1%

Item Discovery
GOAL: Cluster the whiskies by flavor
profile to discover whiskies that have
similar taste.
• Dataset of 86 whiskies
• Each whiskey scored on a scale from
0 to 4 for each of 12 possible ﬂavor
characteristics.
Smoky
Fruity

Clustering Demo #1

Similarity
GOAL: Cluster the loans by
application profile to rank loan
quality by percentage of trouble
loans in population
• Dataset of Lending Club Loans
• Mark any loan that is currently or has
even been late as “trouble”
0%
3%
7%
1%

Active Learning
GOAL: Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each cluster
to label the data.
• Dataset of diagnostic measurements
of 768 patients.
• Want to test each patient for diabetes
and label the dataset to build a model
but the test is expensive*.

*For a more realistic example of high cost, imagine a dataset
with a billion transactions, each one needing to be labelled as
fraud/not-fraud. Or a million images which need to be labeled as
cat/not-cat.
2323
Active Learning

Human Example
Cluster into 3 groups…

Human Example

Learning from Humans
• Jesa used prior knowledge to select possible
features that separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc
• Items were then clustered based on the chosen
features
• Separation quality was then tested to ensure:
• met criteria of K=3
• groups were suﬃciently “distant”
• no crossover

• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences

Cluster Features
Object Length / Width Num Surfaces
penny 1 3
dime 1 3
knob 1 4
eraser 2.75 6
box 1 6
block 1.6 6
screw 8 3
battery 5 3
key 4.25 3
bead 1 2

Plot by Features
K=3
Num

Surfaces
Length / Width
box block eraser
knob
penny

dime
bead
key battery screw
K-Means Key Insight:

We can ﬁnd clusters using distances

in n-dimensional feature space

Plot by Features
Num

Surfaces
Length / Width
box block eraser
knob
penny

dime
bead
key battery screw
K-Means

Find “best” (minimum distance)

circles that include all points

K-Means Algorithm
K=3

Features Matter
Metal Other
Wood

Convergence
Convergence guaranteed

but not necessarily unique

Starting points important (K++)

Starting Points
• Random points or instances in n-dimensional space
• Chose points “farthest” away from each other
• but this is sensitive to outliers
• k++
• the ﬁrst center is chosen randomly from instances
• each subsequent center is chosen from the
remaining instances with probability proportional to
its squared distance from the point's closest existing
cluster center

Scaling
price
number of bedrooms
d = 160,000
d = 1

Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?

Distance to Missing Value?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values

Distance to Categorical?
• Special distance function
• if valA == valB then 
distance = 0 (or scaling value)  
else  
distance = 1
• Assign centroid the most common category of the
member instances
Approach: similar to “k-prototypes”

Distance to Categorical?
feature_1 feature_2 feature_3
instance_1 red cat ball
instance_2 red cat ball
instance_3 red cat box
instance_4 blue dog fridge
D = 0
D = 1
D = sqrt(3)
Compute Euclidean distance between discrete vectors

Text Vectors
1
Cosine Similarity
0
"hippo" "safari" "zebra" ….
1 0 0 …
1 1 0 …
0 0 0 …
Text Field #1
Text Field #2
Cosine Distance = 1 - Cosine Similarity
Features(thousands)

Finding K: G-Means

Finding K: G-Means
Let K=2
Keep 1, Split 1
New K=3

Finding K: G-Means
Let K=3
Keep 1, Split 2
New K=5

Finding K: G-Means
Let K=5
K=5

Clustering Demo #2

BigML, Inc 2
Anomaly Detection
Poul Petersen
CIO, BigML, Inc
Finding the Unusual

Clusters vs Anomalies
Clusters (Unsupervised Learning)


Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
Learning Task: Rank data by dissimilarity

Clusters vs Anomalies
sepal
length
sepal
width
petal
length
petal
width
5.1 3.5 1.4 0.2
5.7 2.6 3.5 1.0
6.7 2.5 5.8 1.8
… … … …
Learning Task:
Find “k” clusters such that the data
in each cluster is self similar
sepal
length
sepal
width
petal
length
petal
width
5.1 3.5 1.4 0.2
5.7 2.6 3.5 1.0
6.7 2.5 5.8 1.8
… … … …
Learning Task:
Assign value from 0 (similar) to 1
(dissimilar) to each instance.

• Unusual instance discovery
• Intrusion Detection
• Fraud
• Identify Incorrect Data
• Remove Outliers
• Model Competence / Input Data Drift
Use Cases

Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous
points and then remove them before modeling.
DATASET FILTERED
DATASET
ANOMALY
DETECTOR
CLEAN
MODEL

Anomaly Demo #1

Intrusion Detection
GOAL: Identify unusual command line behavior per user and
across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands,
ﬂags, working directories, etc.
• Assumption: Users typically issue the same
ﬂag patterns and work in certain directories
Per User Per Dir All User All Dir

Fraud
• Dataset of credit card transactions
• Additional user proﬁle information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
Card Level User Level Similar User Level

Model Competence
• After putting a model it into production,
data that is being predicted can become
statistically different than the training data.
• Train an anomaly detector at the same time
as the model.
G O A L : F o r e v e r y
prediction, compute
an anomaly score.
If the anomaly score is
high, then the model
may not be competent
and should not be
Training Data
PREDICTION
ANOMALY
SCORE
MODEL
ANOMALY
DETECTOR

Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (The average of the
squared diﬀerences from the Mean.)
• Depending on the number of instances, choose a
“multiple” of standard deviations to indicate an anomaly.
A multiple of 3 for 1000 instances removes ~ 3 outliers.

Univariate Approach
measurement
frequency
outliersoutliers
• Available in BigML API

Benford’s Law
• In real-life numeric sets the small digits occur disproportionately often as
leading signiﬁcant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API

Human Example
Most Unusual?

Human Example
“Round”“Skinny” “Corners”
“Skinny”
but not “smooth”
No
“Corners”
Not
“Round”
Key Insight

The “most unusual” object

is diﬀerent in some way from

every partition of the features.
Most unusual

Human Example
• Human used prior knowledge to select possible
features that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen
features
• Each cluster was then examined to see which
object ﬁt the least well in its cluster and did not ﬁt
any other cluster

• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences

Anomaly Features
Object Length / Width Num Surfaces Smooth
penny 1 3 TRUE
dime 1 3 TRUE
knob 1 4 TRUE
eraser 2.75 6 TRUE
box 1 6 TRUE
block 1.6 6 TRUE
screw 8 3 FALSE
battery 5 3 TRUE
key 4.25 3 FALSE
bead 1 2 TRUE

smooth = True
length/width > 5
box
blockeraser
knob
penny

dime
bead
key
battery
screw
num surfaces = 6
length/width =1
length/width < 2
Random Splits
Know that “splits” matter - don’t know the order

Isolation Forest
Grow a random decision tree until
each instance is in its own leaf
“easy” to isolate
“hard” to isolate
Depth
Now repeat the process several times and
use average Depth to compute anomaly
score: 0 (similar) -> 1 (dissimilar)

Isolation Forest Scoring
f_1 f_2 f_3
i_1 red cat ball
i_2 red cat ball
i_3 red cat box
i_4 blue dog pen
D = 3
D = 6
D = 2
Score

• A low anomaly score means the loan is similar to the
modeled loans.
• A high anomaly score means you can not trust the
model.
Model Competence
Prediction T T
Conﬁdence
86% 84%
Anomaly
Score
0.5367 0.7124
Competent? Y N
OPEN LOANS
PREDICTION
ANOMALY
SCORE
CLOSED LOAN
MODEL
CLOSED LOAN
ANOMALY DETECTOR

Anomaly Demo #2

VSSML16 L3. Clusters and Anomaly Detection

More Related Content

What's hot

Viewers also liked

Similar to VSSML16 L3. Clusters and Anomaly Detection

More from BigML, Inc

Recently uploaded

VSSML16 L3. Clusters and Anomaly Detection