July 4 - 6, 2022
2nd Edition
BigML, Inc #DutchMLSchool
My First Anomaly Detector
Mercè Martín
VP of Applications, BigML
Detecting the unusual
Unusual things can be easy to spot at first sight if there are:
• a small number of properties that make the difference
• a small number of instances to compare
What if there are lots of instances and properties?
Could we use an anomaly detector?
New data arrives → The model scores it → We decide the action
To decide about unusual things, we need to know how unusual they are.
Anomaly example
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140     135
Tue   Bob       3421     sign  food     46140     401
Tue   Alice     2456     pin   food     12222     234
Wed   Sally     6788     pin   gas      26339      94
Wed   Bob       3421     pin   tech     21350    2459
Wed   Bob       3421     pin   gas      46140      83
Thr   Sally     6788     sign  food     26339      51
• Amount $2,459 is higher than all other transactions
• It is the only transaction
  • in zip 21350
  • for the purchase class “tech”
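The observations above can be checked by hand on seven rows. A minimal sketch, using only the values from the example table; the helper name rare_values is hypothetical and this is a manual check, not how an anomaly detector works internally:

```python
from collections import Counter

# Transactions copied from the example table.
FIELDS = ("date", "customer", "account", "auth", "class", "zip", "amount")
ROWS = [
    ("Mon", "Bob", 3421, "pin", "clothes", "46140", 135),
    ("Tue", "Bob", 3421, "sign", "food", "46140", 401),
    ("Tue", "Alice", 2456, "pin", "food", "12222", 234),
    ("Wed", "Sally", 6788, "pin", "gas", "26339", 94),
    ("Wed", "Bob", 3421, "pin", "tech", "21350", 2459),
    ("Wed", "Bob", 3421, "pin", "gas", "46140", 83),
    ("Thr", "Sally", 6788, "sign", "food", "26339", 51),
]
transactions = [dict(zip(FIELDS, row)) for row in ROWS]


def rare_values(rows, field):
    """Return the values of `field` that appear in exactly one row."""
    counts = Counter(row[field] for row in rows)
    return {value for value, count in counts.items() if count == 1}


# The transaction with the highest amount.
largest = max(transactions, key=lambda row: row["amount"])
unique_zips = rare_values(transactions, "zip")
unique_classes = rare_values(transactions, "class")

print(largest["date"], largest["customer"], largest["amount"])
print("zip seen only once:", largest["zip"] in unique_zips)
print("class seen only once:", largest["class"] in unique_classes)
```

Note that a single rare value is not enough on its own: Alice's zip also appears only once. The Wednesday "tech" purchase stands out because several properties are unusual at the same time.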
The challenge
Detecting fraud
Sample: https://bit.ly/3a9a9Zm
First step
The First Decision
https://bigml.com/accounts/register
4Gb / 8 parallel tasks
DUTCHMLSCHOOL
The Data Dictionary
31 features
Field    Description
Time     Number of seconds elapsed between this transaction and the first transaction in the dataset
V1-V28   May be the result of a PCA dimensionality reduction to protect user identities and sensitive features
Amount   Transaction amount
Class    Label: 0 (normal) / 1 (fraud)
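As a quick sanity check on the dictionary above, the field list can be written out and counted. This is a sketch: the exact header spelling in the sample file is an assumption, and leaving Class out of the detector's inputs is a common practice assumed here rather than something the dictionary states.

```python
# Field names as listed in the data dictionary; the exact header
# spelling in the CSV file is an assumption.
pca_fields = [f"V{i}" for i in range(1, 29)]  # V1 ... V28
fields = ["Time"] + pca_fields + ["Amount", "Class"]

# Assumption: Class is the label, so it is kept out of the inputs
# and reserved for evaluating the anomaly scores afterwards.
input_fields = [name for name in fields if name != "Class"]

print(len(fields), len(input_fields))
```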
The Data
…
The Source
How to interpret your data?
• Field types
• Locale (decimals)
• Missing tokens
• Text / Items parsing
…
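To make the source settings concrete, here is a minimal sketch of how one raw CSV cell could be interpreted once the decimal separator and the missing tokens are known. The function parse_cell and the MISSING_TOKENS set are hypothetical, written for illustration; they are not BigML's parser or its defaults.

```python
# Hypothetical token set; real tools let you configure this per source.
MISSING_TOKENS = {"", "NA", "N/A", "null", "-", "?"}


def parse_cell(text, decimal_sep=".", missing_tokens=MISSING_TOKENS):
    """Interpret one raw CSV cell as None (missing), a float, or text."""
    cleaned = text.strip()
    if cleaned in missing_tokens:
        return None
    # The locale decides which character is the decimal separator and
    # which one only groups thousands.
    thousands_sep = "," if decimal_sep == "." else "."
    candidate = cleaned.replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return float(candidate)
    except ValueError:
        return cleaned  # not numeric: keep it as categorical/text


print(parse_cell("2,459.50"))
print(parse_cell("2.459,50", decimal_sep=","))
print(parse_cell("N/A"))
print(parse_cell("tech"))
```

The same string can mean different numbers under different locales, which is why the locale has to be settled before field types are trusted.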
The Dataset
How is data distributed?
• Histograms
• Statistics
• Number of missing values
• Number of errors
…
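The per-field summary listed above can be approximated with the standard library. A minimal sketch, assuming missing values have already been parsed to None; the function name summarize is hypothetical, and the amounts are taken from the transaction example earlier in the deck, with one missing value added for illustration.

```python
import statistics


def summarize(values):
    """Summarize one numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


# Amounts from the transaction example, plus one missing value
# added here for illustration.
amounts = [135, 401, 234, 94, 2459, 83, 51, None]
summary = summarize(amounts)
for name, value in summary.items():
    print(name, value)
```

The gap between the median and the mean already hints that one amount is far from the rest.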
And now… The Anomaly Detector
The Anomaly Detector
What insights does the Anomaly Detector provide?
• Score
• Field Importance
The Score
The Anomaly Score
Predicting the Anomaly Score
• 0 = Totally normal
• 1 = Totally anomalous
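The slides do not say how the score is computed. One widely used way to obtain a score on this 0 to 1 scale is the isolation forest normalization, sketched below as an illustration only: the function names are hypothetical, and whether BigML computes its score exactly this way is an assumption. The idea is that instances isolated after very few random splits score near 1, while instances that need many splits score lower.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, sample_size):
    """Map an instance's mean isolation depth to a score in (0, 1].

    sample_size must be at least 2, otherwise the normalizer is zero.
    """
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


n = 256  # hypothetical sample size per tree
print(anomaly_score(0.0, n))                     # isolated immediately
print(anomaly_score(average_path_length(n), n))  # average depth
print(anomaly_score(3.0, n) > anomaly_score(12.0, n))
```

Under this normalization an instance at the average depth scores 0.5, which is why thresholds are usually explored around that value rather than near 1.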
Normal or anomalous?
The Batch Anomaly Score
How is the anomaly score distributed in your dataset?
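One way to answer this question once every instance has a score is to bin the scores into a histogram. A minimal sketch: the scores below are hypothetical, and score_histogram is an illustrative helper, not part of any library.

```python
def score_histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for score in scores:
        index = min(int(score * bins), bins - 1)  # put 1.0 in the last bin
        counts[index] += 1
    return counts


# Hypothetical anomaly scores, for illustration only.
scores = [0.31, 0.33, 0.35, 0.36, 0.38, 0.41, 0.44, 0.52, 0.58, 0.71]
counts = score_histogram(scores)
for i, count in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```

A long, thin right tail in such a histogram is where candidate thresholds are usually looked for.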
Setting a Threshold
Defining the anomalous class
Taking advantage of labeled instances, we can decide the anomaly threshold.
threshold 1: 0.43
threshold 2: 0.5
threshold 3: 0.56
Setting a higher threshold generally improves precision but reduces recall.
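The trade-off can be seen by computing precision and recall at each of the three thresholds. A minimal sketch on hypothetical labeled scores (not results from the fraud dataset); it assumes an instance is flagged when its score is greater than or equal to the threshold, and precision_recall is an illustrative helper.

```python
def precision_recall(scored, threshold):
    """scored: (anomaly score, label) pairs, label 1 = fraud, 0 = normal."""
    # Assumption: a score at or above the threshold is flagged as anomalous.
    flagged = [label for score, label in scored if score >= threshold]
    true_positives = sum(flagged)
    total_fraud = sum(label for _, label in scored)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_fraud if total_fraud else 0.0
    return precision, recall


# Hypothetical labeled scores, for illustration only.
scored = [
    (0.35, 0), (0.40, 0), (0.44, 0), (0.45, 1), (0.48, 0),
    (0.52, 1), (0.54, 0), (0.57, 1), (0.60, 1), (0.66, 1),
]

for threshold in (0.43, 0.5, 0.56):
    precision, recall = precision_recall(scored, threshold)
    print(f"threshold={threshold} precision={precision} recall={recall}")
```

On this toy data, each step up in threshold drops some false alarms and also some true frauds, so precision rises while recall falls. Which point to choose depends on the relative cost of a missed fraud versus a false alarm.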
DutchMLSchool 2022 - My First Anomaly Detector
