BigML Inc
Today’s Webinar 
• Speaker: 
• Poul Petersen, CIO 
• Moderator: 
• Andrew Shikiar, VP Business Development 
• Enter questions into chat box – we’ll answer some 
via text; others at the end of the session 
• For direct follow-up, email us at info@bigml.com 
Agenda 
1 What’s New
2 Anomaly Detection
3 Coming Soon
4 Questions
Model Clusters 
Use models to discover rules that describe clusters 
[Figure: seven clusters, numbered 1 to 7]
Spicy  Body  Nutty
5.1    3.5   1.4
5.7    2.6   3.5
6.7    2.5   5.8
…      …     …
Spicy  Body  Nutty  In Cluster 5?
5.1    3.5   1.4    TRUE
5.7    2.6   3.5    FALSE
6.7    2.5   5.8    TRUE
…      …     …      …
Model Clusters 
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile, then 
discover rules that distinguish the clusters from each 
other. 
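The labeling step behind this approach can be sketched in a few lines of Python. This is only an illustration, not BigML's implementation: the function name `label_membership` and the cluster assignments are hypothetical, and the three rows are the sample values shown on the slide.

```python
def label_membership(rows, assignments, cluster_id):
    """Append a boolean 'in this cluster?' target to every row."""
    return [row + (assigned == cluster_id,)
            for row, assigned in zip(rows, assignments)]

# (spicy, body, nutty) sample rows from the slide.
rows = [(5.1, 3.5, 1.4), (5.7, 2.6, 3.5), (6.7, 2.5, 5.8)]
# Hypothetical cluster assignments, chosen to match the slide's TRUE/FALSE column.
assignments = [5, 2, 5]

labeled = label_membership(rows, assignments, cluster_id=5)
for row in labeled:
    print(row)
```

A supervised model trained on the labeled rows can then be read as a set of rules for membership in cluster 5.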
Missing Splits 
Real-world data is messy
• Define missing tokens: N/A, Null, etc 
• Filter out missing values 
• Add a new feature to replace missing values 
• Default numeric values in cluster 
• Proportional prediction for missing input data 
• Allow splits on missing values 
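Two of the options above, defining missing tokens and allowing splits on missing values, can be sketched as follows. This is a hedged illustration rather than BigML's implementation: the names `MISSING_TOKENS`, `parse_numeric`, and `goes_left` are hypothetical, and treating the empty string as missing is an added assumption.

```python
# Tokens named on the slide, plus the empty string (an assumption).
MISSING_TOKENS = {"", "N/A", "Null"}

def parse_numeric(raw):
    """Return a float, or None when the raw string is a missing token."""
    text = raw.strip()
    if text in MISSING_TOKENS:
        return None
    return float(text)

def goes_left(value, threshold, missing_goes_left):
    """Split predicate that routes missing values explicitly."""
    if value is None:
        return missing_goes_left
    return value <= threshold

values = [parse_numeric(s) for s in ["3.5", "N/A", "2.6", "Null"]]
# Send missing values down the left branch together with values <= 3.0.
left = [v for v in values if goes_left(v, 3.0, missing_goes_left=True)]
print(values, left)
```

Routing missing values through the predicate lets a tree treat "value is missing" as information instead of discarding the row.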
Online Predictions 
• Single predictions 
• Computed in real-time using browser JS 
• JS will be open sourced 
• Available for models, ensembles, and clusters 
Fast(er) Ensembles 
Fetch Dataset (“F” secs) -> Transform Dataset (“T” secs) -> Model Dataset (“M” secs) -> Store Model (“S” secs)
Insight: if the dataset fits in memory, we can perform the 
fetch and transform steps once and model quickly in memory 
Number of models: “n”
Time:
Old:     n * [ F + T + M + S ]
New:     F + T + n * [ M + S ]
Savings: ( n - 1 ) * [ F + T ]
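A quick arithmetic check of the three timing formulas, written as a small Python sketch. The function name `ensemble_times` and the timings (in seconds) are made up for illustration and are not measurements from BigML.

```python
def ensemble_times(n, fetch, transform, model, store):
    """Return (old, new, savings) in seconds for an ensemble of n models."""
    old = n * (fetch + transform + model + store)    # every model repeats all four steps
    new = fetch + transform + n * (model + store)    # fetch and transform happen once
    return old, new, old - new

# Hypothetical timings: F=5, T=3, M=2, S=1 seconds, 10 models.
old, new, savings = ensemble_times(n=10, fetch=5, transform=3, model=2, store=1)
print(old, new, savings)
```

The difference between the old and new totals matches ( n - 1 ) * [ F + T ], so the saving grows with the number of models and with the cost of fetching and transforming the dataset.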
Anomaly Detection 
An unsupervised algorithm to find unusual data quickly and easily
Learning Tasks 
Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict label
Cluster (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Anomalies (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: rank data by dissimilarity
Learning Tasks 
Inputs “X”: sepal length, sepal width, petal length, petal width; “Y”: species
sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …
Learning Task: Find function “f” such that: f(X)≈Y
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Assign a value from 0 (similar) to 1 (dissimilar) to each instance.
Anomalies 
Isolation Forest: 
Grow a random decision tree until each instance is in its own leaf.
[Figure: tree depth, from “easy” to isolate (shallow) to “hard” to isolate (deep)]
Now repeat the process several times and use the average depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
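The procedure above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not BigML's implementation: it uses every row in every tree (no subsampling), handles numeric fields only, and maps average depth to a score with the 2^(-depth / c(n)) normalization from the original isolation forest paper, since the slide does not say how depth is converted. The function names and the sample rows are hypothetical.

```python
import random

def tree_depths(rows, rng):
    """Grow one random tree; return the depth at which each row is isolated."""
    depths = [0] * len(rows)
    stack = [(list(range(len(rows))), 0)]
    while stack:
        idxs, depth = stack.pop()
        # Only fields whose values still differ inside this node can be split.
        spans = []
        for f in range(len(rows[0])):
            vals = [rows[i][f] for i in idxs]
            if min(vals) < max(vals):
                spans.append((f, min(vals), max(vals)))
        if not spans:  # a single row, or identical rows that cannot be separated
            for i in idxs:
                depths[i] = depth
            continue
        f, lo, hi = rng.choice(spans)
        cut = rng.uniform(lo, hi)
        left = [i for i in idxs if rows[i][f] <= cut]
        right = [i for i in idxs if rows[i][f] > cut]
        if not right:  # cut landed exactly on the maximum; draw again
            stack.append((idxs, depth))
            continue
        stack.append((left, depth + 1))
        stack.append((right, depth + 1))
    return depths

def anomaly_scores(rows, n_trees=200, seed=0):
    """Average isolation depth over many trees, mapped to a score in (0, 1)."""
    rng = random.Random(seed)
    n = len(rows)  # assumes at least two rows
    totals = [0] * n
    for _ in range(n_trees):
        for i, d in enumerate(tree_depths(rows, rng)):
            totals[i] += d
    # c(n): average path length of an unsuccessful binary-search-tree lookup.
    harmonic = sum(1.0 / k for k in range(1, n))
    c = 2.0 * harmonic - 2.0 * (n - 1) / n
    return [2.0 ** (-(t / n_trees) / c) for t in totals]

# Eight similar rows plus one row placed far away (hypothetical data).
rows = [(5.1, 3.5), (4.9, 3.0), (5.4, 3.9), (5.0, 3.4), (5.2, 3.6),
        (4.8, 3.1), (5.3, 3.7), (5.0, 3.3), (9.5, 0.5)]
scores = anomaly_scores(rows)
for row, score in zip(rows, scores):
    print(row, round(score, 3))
```

Rows that are isolated after only a few random splits get a small average depth and therefore a score closer to 1.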
Workflow
cluster: centroid + batchcentroid
anomaly: anomalyscore + batchanomalyscore
[Diagram: Clusters and Anomalies workflows built from DATASET, CLUSTER, ANOMALY, INSTANCE, CENTROID, ANOMALYSCORE, and DATASET + CSV nodes]
Use Cases 
• Unusual instance discovery 
• Intrusion Detection 
• Fraud 
• Identify Incorrect Data 
• Remove Outliers 
• Model Competence / Input Data Drift 
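As one example, outlier removal can be sketched as a filter on anomaly scores. The function name `split_by_score`, the scores, and the 0.6 threshold below are all hypothetical; a suitable threshold depends on the data.

```python
def split_by_score(rows, scores, threshold):
    """Separate rows into (kept, flagged) by anomaly score."""
    kept = [r for r, s in zip(rows, scores) if s < threshold]
    flagged = [r for r, s in zip(rows, scores) if s >= threshold]
    return kept, flagged

rows = ["a", "b", "c", "d"]
scores = [0.42, 0.71, 0.38, 0.55]  # hypothetical anomaly scores in [0, 1]
kept, flagged = split_by_score(rows, scores, threshold=0.6)
print(kept, flagged)
```

The flagged rows can be reviewed as possible fraud or incorrect data, while the kept rows form a cleaner training set.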
Anomalies 
• High dimensions - 10,000 fields 
• Mixed data: 
• numerical: 3.4 
• categorical: red, green, blue 
• date time: 2014-05-14T12:34:56 
Coming 
• unstructured text: “The quick brown fox…” 
• Computing anomaly score for new data 
• Using anomaly detectors programmatically 
Coming Soon 
• Config panel for anomaly detection 
• Project Management 
• In-memory sample server 
• Dynamic scatterplots 
Get Started Today!
RESOURCES: Join us for future webinars & hangouts
FEEDBACK: info@bigml.com
TWITTER: @bigmlcom

BigML Late Summer 2014 Release Webinar - Anomaly Detection!
