Minority Report:
Using Anomaly Detection
to Identify a Minority Class
David Gerster
Vice President, Data Science
BigML
Traditional “Predictive Modeling”
• The famous Iris data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Images: Iris setosa, Iris versicolor, Iris virginica]
[Scatter plot of petal width (cm) and petal length (cm): Iris setosa, red dots; Iris versicolor, green dots; Iris virginica, blue dots]
[Scatter plot of petal width (cm) and petal length (cm)]
Congratulations! You just trained a model.
[Scatter plot of petal width (cm) and petal length (cm), with one prediction per new data point]
• Prediction: Iris setosa
• Prediction: Iris versicolor
• Prediction: Iris virginica
• Prediction: Iris virginica
[Same scatter plot and predictions: Iris setosa, Iris versicolor, Iris virginica, Iris virginica]
Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
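[Scatter plot of petal width (cm) and petal length (cm), shown next to the tree below]
“Decision Tree”
50 red, 50 blue, 50 green
• Width <= 0.8? → 50 red
• Width > 0.8? → 50 blue, 50 green
  • Width > 1.75? → 45 blue
  • Width <= 1.75? → 5 blue, 50 green
    • Length <= 5? → 1 blue, 48 green
    • Length > 5? → 4 blue, 2 green
“Leaf Nodes”: 50 red; 45 blue; 1 blue, 48 green; 4 blue, 2 green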
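The tree on this slide can be read as a chain of if/else tests. The sketch below writes it out by hand in Python, using the thresholds from the slide and predicting the majority color of each leaf (red = Iris setosa, green = Iris versicolor, blue = Iris virginica). The function name and the four sample measurements are illustrative assumptions, not taken from the talk.

```python
def predict_species(petal_width_cm: float, petal_length_cm: float) -> str:
    """Walk the slide's decision tree and return the majority species of the leaf."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"       # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"    # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"   # leaf: 1 blue, 48 green
    return "Iris virginica"        # leaf: 4 blue, 2 green


# Hypothetical measurements (width, length) for four new flowers.
new_flowers = [(0.2, 1.4), (1.3, 4.5), (2.1, 5.8), (1.6, 5.4)]
for width, length in new_flowers:
    print(width, length, "->", predict_species(width, length))
```

Scoring a new flower is just a walk from the root to a leaf, so the prediction costs at most three comparisons.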
Demo: Predictive Modeling
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• Since we have labels, this is supervised learning
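As a rough illustration of the supervised step, here is a minimal sketch using scikit-learn rather than BigML. It assumes scikit-learn is installed and uses its bundled breast cancer dataset as a stand-in: that dataset has 569 samples and is not the 699-biopsy dataset used in the demo. The tree depth and the train/test split are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 569 labeled samples, not the 699 biopsies from the demo.
# In this dataset, label 0 means malignant and label 1 means benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised learning: the known labels guide how the tree is split.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.3f}")
```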
What if we don’t have labels?
• Can we still get insight into our data if we don’t know the
colors of the dots?
• Enter anomaly detection
• Since we don’t have labels, this is unsupervised learning
10 lines are needed
to isolate this data point
(not anomalous)
Only 4 lines are needed
to isolate this data point
(highly anomalous)
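The idea behind these two pictures can be sketched in a few lines of standard-library Python: draw random axis-parallel lines, keep only the side containing the point of interest, and count how many lines it takes until the point is alone. This is a simplified illustration, not the full Isolation Forest algorithm, and the data (a dense cluster plus one far-away point) is made up for the example.

```python
import random


def cuts_to_isolate(target, points, rng, max_cuts=100):
    """Count random axis-parallel cuts until `target` is alone in its region."""
    region = list(points)
    cuts = 0
    while len(region) > 1 and cuts < max_cuts:
        axis = rng.randrange(len(target))      # choose a coordinate at random
        lo = min(p[axis] for p in region)
        hi = max(p[axis] for p in region)
        cut = rng.uniform(lo, hi)              # random line inside the region
        # Keep only the points on the same side of the line as the target.
        if target[axis] < cut:
            region = [p for p in region if p[axis] < cut]
        else:
            region = [p for p in region if p[axis] >= cut]
        cuts += 1
    return cuts


def average_cuts(target, points, trials=200, seed=0):
    """Average the cut count over many random trials to smooth out luck."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(target, points, rng) for _ in range(trials)) / trials


# Hypothetical data: a dense cluster around the origin plus one distant point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
central = (0.0, 0.0)
outlier = (8.0, 8.0)
points = cluster + [central, outlier]

print("central point, average cuts:", average_cuts(central, points))
print("outlier, average cuts:      ", average_cuts(outlier, points))
```

Points that need few cuts on average are easy to separate from the rest, which is what makes them anomalous in this scheme.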
Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
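The four steps above can be sketched with scikit-learn in place of BigML. This is a sketch under stated assumptions: scikit-learn is installed, its bundled breast cancer dataset (569 samples) stands in for the 699-biopsy dataset from the demo, and the hyperparameters are arbitrary. The true labels are set aside before training and are used only at the end to compare scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor, export_text

# Stand-in data (assumption): not the dataset from the demo.
data = load_breast_cancer()
X = data.data
held_back_labels = data.target   # 0 = malignant, 1 = benign; NOT used for training

# Steps 1 and 2: train an anomaly detector on the unlabeled measurements.
detector = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Step 3: anomaly scores become the new "labels".
# score_samples returns lower values for more abnormal rows, so negate it
# to get a score where higher means more anomalous.
anomaly_score = -detector.score_samples(X)

# Step 4: train a predictive model that explains the score from the features.
explainer = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, anomaly_score)
print(export_text(explainer, feature_names=list(data.feature_names)))

# Only now bring the held-back labels in, to see how the scores line up with them.
print("mean score, malignant:", anomaly_score[held_back_labels == 0].mean())
print("mean score, benign:   ", anomaly_score[held_back_labels == 1].mean())
```

A regression tree is used here because the anomaly score is a number rather than a class; binning the scores and training a classifier would be an equally reasonable choice.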
Who Needs Labels?
What if we remove the malignant biopsies?
• If we remove the malignant biopsies from the dataset and do
the whole process again …
• We find a similar result!
Minority Report
• This approach is well-suited for large unlabeled datasets,
especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!
Free BigML subscription
• Use code “CERN” for a free 3-month BigML Pro subscription
• Handles datasets up to 4GB
The original “Isolation Forest” paper
Q and A
David Gerster
VP Data Science, BigML
gerster@bigml.com

Anomaly Detection Using Isolation Forests
