September 8-9, 2016
BigML, Inc 2
Cluster Analysis
Poul Petersen
CIO, BigML, Inc
Finding Similarities
BigML, Inc 3Unsupervised Learning
Trees vs Clusters
Trees (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label
Clusters (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity
Trees vs Clusters
Trees: Inputs “X” (the four measurements) and Label “Y” (species)
sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …
Learning Task: Find function “f” such that f(X) ≈ Y
Clusters: Inputs only
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
Use Cases
• Customer segmentation
• Item discovery
• Similarity
• Recommender
• Active learning
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a
higher percentage of high-LTV users. Since they have similar usage
patterns, the remaining users in these clusters may be good
candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.
• Assumption: Usage correlates with LTV.
[Figure: clusters annotated with percentages: 0%, 3%, 1%]
Item Discovery
GOAL: Cluster the whiskies by flavor profile to discover whiskies that
have a similar taste.
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
[Figure: clusters labeled by flavor: Smoky, Fruity]
Clustering Demo #1
Similarity
GOAL: Cluster the loans by application profile to rank loan quality by
the percentage of trouble loans in each cluster's population.
• Dataset of Lending Club loans
• Mark any loan that is currently late or has ever been late as “trouble”
[Figure: clusters annotated with percentages: 0%, 3%, 7%, 1%]
Active Learning
GOAL: Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each cluster
to label the data.
• Dataset of diagnostic measurements
of 768 patients.
• Want to test each patient for diabetes
and label the dataset to build a model
but the test is expensive*.
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud, or a million images that need to be labeled as cat/not-cat.
Active Learning
Human Example
Cluster into 3 groups…
Learning from Humans
• Jesa used prior knowledge to select possible
features that separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc
• Items were then clustered based on the chosen
features
• Separation quality was then tested to ensure:
• met criteria of K=3
• groups were sufficiently “distant”
• no crossover
Learning from Humans
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences
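A minimal sketch of the Length/Width rule above. The function names and the `length` and `width` parameters are hypothetical, chosen only for illustration; the slides do not say how the objects were measured.

```python
def length_width_ratio(length: float, width: float) -> float:
    """Return the long side divided by the short side, so the ratio is always >= 1."""
    if length <= 0 or width <= 0:
        raise ValueError("measurements must be positive")
    ratio = length / width
    # "less than 1 => invert": a 2 x 8 object is as skinny as an 8 x 2 one.
    return 1 / ratio if ratio < 1 else ratio


def shape_label(ratio: float) -> str:
    """Map the ratio to the informal labels used on the slide."""
    return "round" if ratio == 1 else "skinny"
```

Inverting ratios below 1 makes the feature independent of which side happened to be called the length.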
Cluster Features
Object   Length / Width  Num Surfaces
penny    1               3
dime     1               3
knob     1               4
eraser   2.75            6
box      1               6
block    1.6             6
screw    8               3
battery  5               3
key      4.25            3
bead     1               2
Plot by Features
[Figure: objects plotted by Num Surfaces and Length / Width, K=3. Groups: box, block, eraser; knob, penny, dime, bead; key, battery, screw]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
[Figure: the same plot of objects by Num Surfaces and Length / Width]
K-Means: Find “best” (minimum distance) circles that include all points
K-Means Algorithm
K=3
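The slide shows the algorithm only as a picture, so here is a minimal sketch of the standard k-means (Lloyd's) iteration, run on the values from the Cluster Features table. The choice of starting centers (penny, box, screw) is an assumption made for this example, and the features are left unscaled.

```python
# (Length / Width, Num Surfaces) pairs copied from the Cluster Features table.
objects = {
    "penny": (1, 3), "dime": (1, 3), "knob": (1, 4), "eraser": (2.75, 6),
    "box": (1, 6), "block": (1.6, 6), "screw": (8, 3), "battery": (5, 3),
    "key": (4.25, 3), "bead": (1, 2),
}


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to the
    mean of its points; stop when the assignments no longer change."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers


names = list(objects)
points = [objects[n] for n in names]
# Assumption: start from three of the instances (penny, box, screw).
start = [objects["penny"], objects["box"], objects["screw"]]
assignment, centers = kmeans(points, list(start))
clusters = [sorted(n for n, a in zip(names, assignment) if a == c) for c in range(3)]
```

With these starting points the key first joins the penny cluster and moves to the screw cluster on the second pass, after which the groups match the three shown in the plot.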
Features Matter
[Figure: objects grouped by material: Metal, Wood, Other]
Convergence
Convergence is guaranteed, but the result is not necessarily unique.
Starting points are important (k-means++, “K++”).
Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
  • but this is sensitive to outliers
• k-means++ (“k++”)
  • the first center is chosen randomly from the instances
  • each subsequent center is chosen from the remaining instances with probability proportional to the squared distance from the instance to its closest existing cluster center
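The k-means++ seeding rule described above can be sketched as follows. The function name `kmeans_pp_init` and the sample points are hypothetical, and the sketch assumes the data contains at least `k` distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k starting centers from `points` using k-means++ weighting."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its closest existing center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        # Already-chosen points have weight 0, so they are never drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical data: three tight groups of 2-D points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
seeds = kmeans_pp_init(sample, 3, random.Random(0))
```

Because the weights are squared distances, far-away points are favored without being forced, which is what makes this less sensitive to a single outlier than always picking the farthest point.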
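Scaling
[Figure: instances plotted by price and number of bedrooms; one pair is d = 160,000 apart along the price axis, another is d = 1 apart along the bedrooms axis]
Without scaling, the feature with the largest numeric range (price) dominates the distance.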
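One common remedy is to standardize each feature before measuring distance. The sketch below uses made-up houses chosen so that the raw distances reproduce the d = 160,000 and d = 1 from the slide; the slides do not say which scaling BigML applies, so z-score standardization is an assumption here.

```python
import math
import statistics

# Hypothetical houses: (price, number of bedrooms).
houses = [(240_000, 3), (400_000, 3), (400_000, 4)]


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1.
    Assumes no column is constant (its standard deviation would be 0)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]


raw_price_gap = euclidean(houses[0], houses[1])    # differs only in price
raw_bedroom_gap = euclidean(houses[1], houses[2])  # differs only in bedrooms
scaled = standardize(houses)
scaled_price_gap = euclidean(scaled[0], scaled[1])
scaled_bedroom_gap = euclidean(scaled[1], scaled[2])
```

After standardization the two gaps are the same size, so neither feature drowns out the other.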
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing Value?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
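The replacement strategies listed above can be sketched for one numeric column as follows. Missing values are represented as `None`; the function names and that representation are assumptions for illustration.

```python
import statistics


def impute(column, strategy="mean"):
    """Replace None entries in a numeric column using the chosen strategy."""
    present = [v for v in column if v is not None]
    strategies = {
        "maximum": max,
        "mean": statistics.mean,
        "median": statistics.median,
        "minimum": min,
        "zero": lambda values: 0,
    }
    fill = strategies[strategy](present)
    return [fill if v is None else v for v in column]


def drop_missing(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]
```

For example, `impute([1, None, 3, 8], "median")` fills the gap with 3, the median of the values that are present.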
Distance to Categorical?
• Special distance function
  • if valA == valB then distance = 0 (or scaling value) else distance = 1
• Assign centroid the most common category of the member instances
Approach: similar to “k-prototypes”
            feature_1  feature_2  feature_3
instance_1  red        cat        ball
instance_2  red        cat        ball
instance_3  red        cat        box
instance_4  blue       dog        fridge
Distances between rows: instance_1 to instance_2, D = 0; instance_2 to instance_3, D = 1; instance_3 to instance_4, D = sqrt(3)
Compute Euclidean distance between discrete vectors
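A minimal sketch of this distance, using the four instances from the table. Each field contributes 0 when the categories match and 1 when they differ (no scaling value is applied here), and the centroid helper takes the most common category per field; both function names are hypothetical.

```python
import math
from collections import Counter

instances = {
    "instance_1": ("red", "cat", "ball"),
    "instance_2": ("red", "cat", "ball"),
    "instance_3": ("red", "cat", "box"),
    "instance_4": ("blue", "dog", "fridge"),
}


def categorical_distance(a, b):
    """Euclidean distance where each field contributes 0 if equal, 1 otherwise."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))


def categorical_centroid(members):
    """Centroid of a cluster: the most common category in each field."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))


d_12 = categorical_distance(instances["instance_1"], instances["instance_2"])
d_23 = categorical_distance(instances["instance_2"], instances["instance_3"])
d_34 = categorical_distance(instances["instance_3"], instances["instance_4"])
centroid = categorical_centroid(
    [instances["instance_1"], instances["instance_2"], instances["instance_3"]]
)
```

The three distances come out as 0, 1 and sqrt(3), matching the values on the slide.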
Text Vectors
Each text field becomes a vector with one feature per term (thousands of features):
               "hippo"  "safari"  "zebra"  …
Text Field #1  1        0         0        …
Text Field #2  1        1         0        …
               0        0         0        …
Cosine Similarity runs from 1 to 0.
Cosine Distance = 1 - Cosine Similarity
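As a small worked example, cosine distance between term vectors can be computed like this. Treating an all-zero vector as maximally distant is an assumption of this sketch, since cosine similarity is undefined for it.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a vector with no terms shares nothing with any other.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Columns: "hippo", "safari", "zebra".
hippo_only = (1, 0, 0)
hippo_safari = (1, 1, 0)
distance = cosine_distance(hippo_only, hippo_safari)  # 1 - 1/sqrt(2)
```

Two fields that share one of two terms end up roughly 0.29 apart, while identical fields are at distance 0.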
Finding K: G-Means
Start with a small K, run k-means, and test each cluster: keep the clusters whose points look Gaussian and split the rest in two. Repeat until no cluster needs splitting.
• Let K=2: Keep 1, Split 1. New K=3
• Let K=3: Keep 1, Split 2. New K=5
• Let K=5: nothing left to split, so K=5
Clustering Demo #2
Anomaly Detection
Poul Petersen
CIO, BigML, Inc
Finding the Unusual
Clusters vs Anomalies
Clusters (Unsupervised Learning)

Provide: unlabeled data

Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
Provide: unlabeled data
Learning Task: Rank data by dissimilarity
Clusters vs Anomalies
Both tasks start from the same unlabeled data:
sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …
Learning Task (clusters): Find “k” clusters such that the data in each cluster is self-similar
Learning Task (anomalies): Assign a value from 0 (similar) to 1 (dissimilar) to each instance.
Use Cases
• Unusual instance discovery
• Intrusion Detection
• Fraud
• Identify Incorrect Data
• Remove Outliers
• Model Competence / Input Data Drift
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous
points and then remove them before modeling.
[Diagram: DATASET -> ANOMALY DETECTOR -> FILTERED DATASET -> CLEAN MODEL]
Anomaly Demo #1
Intrusion Detection
GOAL: Identify unusual command line behavior per user and
across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands,
flags, working directories, etc.
• Assumption: Users typically issue the same
flag patterns and work in certain directories
[Figure labels: Per User, Per Dir, All User, All Dir]
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
[Figure labels: Card Level, User Level, Similar User Level]
Model Competence
• After putting a model into production, the data being predicted can become statistically different from the training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be …
[Diagram: Training Data feeds both the MODEL and the ANOMALY DETECTOR; the MODEL outputs a PREDICTION and the ANOMALY DETECTOR outputs an ANOMALY SCORE]
Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (The average of the
squared differences from the Mean.)
• Depending on the number of instances, choose a
“multiple” of standard deviations to indicate an anomaly.
A multiple of 3 for 1000 instances removes ~ 3 outliers.
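The rule above can be sketched with the standard library. The “~ 3 outliers per 1000” figure follows from the normal distribution: about 0.27% of values lie more than 3 standard deviations from the mean. The function name and the sample heights are hypothetical.

```python
import statistics


def zscore_outliers(values, multiple=3.0):
    """Return the values that lie more than `multiple` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > multiple * stdev]


# Expected share of a normal population beyond 3 standard deviations (both tails).
tail_share = 2 * (1 - statistics.NormalDist().cdf(3))
expected_per_1000 = 1000 * tail_share  # roughly 2.7, i.e. "~ 3"

# Hypothetical heights in cm: twenty identical values and one extreme value.
heights = [170] * 20 + [250]
flagged = zscore_outliers(heights)
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples an extreme value may not reach the chosen multiple.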
[Figure: histogram of frequency vs. measurement, with outliers marked in both tails]
• Available in BigML API
Benford’s Law
• In real-life numeric sets the small digits occur disproportionately often as
leading significant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API
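Benford's Law gives the expected share of each leading digit d as log10(1 + 1/d), so 1 should lead about 30.1% of the time. The sketch below compares that expectation with an observed distribution; the helper names are hypothetical, and the powers of two are just a convenient test series, not data from the slides.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a non-zero number."""
    digits = f"{abs(x):.15e}"  # scientific notation, e.g. '3.140000000000000e+02'
    return int(digits[0])


def leading_digit_share(values):
    """Observed share of each leading digit 1..9, ignoring zeros."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


expected = benford_expected()
observed = leading_digit_share([2 ** n for n in range(100)])
```

A large gap between `observed` and `expected` for a dataset that should follow the law, such as accounting records, is a signal worth investigating.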
Human Example
Most Unusual?
[Figure: objects partitioned as “Round”, “Skinny”, and “Corners”; annotations: “Skinny” but not “smooth”, No “Corners”, Not “Round”; one object is marked Most unusual]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
Human Example
• Human used prior knowledge to select possible
features that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen
features
• Each cluster was then examined to see which
object fit the least well in its cluster and did not fit
any other cluster
Learning from Humans
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences
Anomaly Features
Object   Length / Width  Num Surfaces  Smooth
penny    1               3             TRUE
dime     1               3             TRUE
knob     1               4             TRUE
eraser   2.75            6             TRUE
box      1               6             TRUE
block    1.6             6             TRUE
screw    8               3             FALSE
battery  5               3             TRUE
key      4.25            3             FALSE
bead     1               2             TRUE
Random Splits
[Figure: the objects (box, block, eraser, knob, penny, dime, bead, key, battery, screw) partitioned by splits such as smooth = True, length/width > 5, num surfaces = 6, length/width = 1, length/width < 2]
Know that “splits” matter - don’t know the order
Isolation Forest
Grow a random decision tree until each instance is in its own leaf.
[Figure: random tree with a Depth axis; instances isolated near the root are “easy” to isolate, instances deep in the tree are “hard” to isolate]
Now repeat the process several times and use the average Depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
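The procedure above can be sketched in plain Python for numeric features. This is an illustration, not BigML's implementation: it grows full trees with no subsampling, and it converts the average depth to a score with the normalization from the original isolation forest paper, which is an assumption about how the 0 to 1 score is computed. The data is made up.

```python
import math
import random


def isolate(indices, data, rng, depth, depths):
    """Split a node at random until every point sits alone; record each point's depth."""
    features = [f for f in range(len(data[0]))
                if len({data[i][f] for i in indices}) > 1]
    if len(indices) <= 1 or not features:  # isolated, or identical points
        for i in indices:
            depths[i] = depth
        return
    f = rng.choice(features)  # random feature
    values = [data[i][f] for i in indices]
    cut = rng.uniform(min(values), max(values))  # random split value
    left = [i for i in indices if data[i][f] < cut]
    right = [i for i in indices if data[i][f] >= cut]
    if not left or not right:  # degenerate cut: draw again at the same depth
        isolate(indices, data, rng, depth, depths)
        return
    isolate(left, data, rng, depth + 1, depths)
    isolate(right, data, rng, depth + 1, depths)


def anomaly_scores(data, trees=100, seed=0):
    """Average isolation depth over many random trees, mapped to a (0, 1] score."""
    rng = random.Random(seed)
    n = len(data)  # assumes n >= 2
    total = [0] * n
    for _ in range(trees):
        depths = [0] * n
        isolate(list(range(n)), data, rng, 0, depths)
        total = [t + d for t, d in zip(total, depths)]
    # Normalizer: approximate average path length of an unsuccessful
    # binary-search-tree lookup among n points.
    c = 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return [2 ** (-(t / trees) / c) for t in total]


# Hypothetical data: 20 points in the unit square plus one obvious outlier.
point_rng = random.Random(1)
data = [(point_rng.random(), point_rng.random()) for _ in range(20)]
data.append((100.0, 100.0))
scores = anomaly_scores(data)
```

The outlier is usually cut off by the very first split, so its average depth is small and its score is the highest in the list.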
Isolation Forest Scoring
     f_1   f_2  f_3
i_1  red   cat  ball
i_2  red   cat  ball
i_3  red   cat  box
i_4  blue  dog  pen
[Figure: isolation depths from three trees (D = 3, D = 6, D = 2) combined into a Score]
Model Competence
• A low anomaly score means the loan is similar to the modeled loans.
• A high anomaly score means you cannot trust the model.
Prediction     T       T
Confidence     86%     84%
Anomaly Score  0.5367  0.7124
Competent?     Y       N
[Diagram: the CLOSED LOAN MODEL and the CLOSED LOAN ANOMALY DETECTOR are applied to OPEN LOANS, producing a PREDICTION and an ANOMALY SCORE]
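The competence check in the table can be sketched as a simple gate on the anomaly score. The slides do not state the cut-off, so the threshold of 0.6 below is a hypothetical value chosen only because it separates the two example loans; the class name is also an assumption.

```python
from dataclasses import dataclass

ANOMALY_THRESHOLD = 0.6  # hypothetical cut-off; the slides do not state one


@dataclass
class ScoredPrediction:
    prediction: str
    confidence: float
    anomaly_score: float

    @property
    def competent(self) -> bool:
        """Trust the prediction only when the input resembles the training data."""
        return self.anomaly_score < ANOMALY_THRESHOLD


# The two open loans from the table.
loans = [
    ScoredPrediction("T", 0.86, 0.5367),
    ScoredPrediction("T", 0.84, 0.7124),
]
competence = [loan.competent for loan in loans]
```

Both predictions have similar confidence, yet only the first passes the gate, which is the point of scoring the input separately from the model's own confidence.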
Anomaly Demo #2

VSSML16 L3. Clusters and Anomaly Detection

  • 2. BigML, Inc 2 Cluster Analysis Poul Petersen CIO, BigML, Inc Finding Similarities
  • 3. BigML, Inc 3Unsupervised Learning Trees vs Clusters Trees (Supervised Learning) Provide: labeled data Learning Task: be able to predict label Clusters (Unsupervised Learning) Provide: unlabeled data Learning Task: group data by similarity
  • 4. BigML, Inc 4Unsupervised Learning Trees vs Clusters sepal length sepal width petal length petal width species 5.1 3.5 1.4 0.2 setosa 5.7 2.6 3.5 1.0 versicolor 6.7 2.5 5.8 1.8 virginica … … … … … sepal length sepal width petal length petal width 5.1 3.5 1.4 0.2 5.7 2.6 3.5 1.0 6.7 2.5 5.8 1.8 … … … … Inputs “X” Label “Y” Learning Task: Find function “f” such that: f(X)≈Y Learning Task: Find “k” clusters such that the data in each cluster is self similar
  • 5. BigML, Inc 5Unsupervised Learning Use Cases • Customer segmentation • Item discovery • Similarity • Recommender • Active learning
  • 6. BigML, Inc 6Unsupervised Learning Customer Segmentation GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and a LTV based on in-game purchases • Assumption: Usage correlates to LTV 0% 3% 1%
  • 7. BigML, Inc 7Unsupervised Learning Item Discovery GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste. • Dataset of 86 whiskies • Each whiskey scored on a scale from 0 to 4 for each of 12 possible flavor characteristics. Smoky Fruity
  • 8. BigML, Inc 8Unsupervised Learning Clustering Demo #1
  • 9. BigML, Inc 9Unsupervised Learning Similarity GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population • Dataset of Lending Club Loans • Mark any loan that is currently or has ever been late as “trouble” 0% 3% 7% 1%
  • 10. BigML, Inc 10Unsupervised Learning Active Learning GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
  • 11. BigML, Inc 11Unsupervised Learning *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat. Active Learning
  • 12. BigML, Inc 12Unsupervised Learning Human Example Cluster into 3 groups…
  • 13. BigML, Inc 13Unsupervised Learning Human Example
  • 14. BigML, Inc 14Unsupervised Learning Learning from Humans • Jesa used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “edges”, “hard”, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • met criteria of K=3 • groups were sufficiently “distant” • no crossover
  • 15. BigML, Inc 15Unsupervised Learning Learning from Humans • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count Create features that capture these object differences
  • 16. BigML, Inc 16Unsupervised Learning Cluster Features
      Object   | Length / Width | Num Surfaces
      penny    | 1              | 3
      dime     | 1              | 3
      knob     | 1              | 4
      eraser   | 2.75           | 6
      box      | 1              | 6
      block    | 1.6            | 6
      screw    | 8              | 3
      battery  | 5              | 3
      key      | 4.25           | 3
      bead     | 1              | 2
  • 17. BigML, Inc 17Unsupervised Learning Plot by Features K=3 Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
  • 18. BigML, Inc 18Unsupervised Learning Plot by Features Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Find “best” (minimum distance) circles that include all points
  • 19. BigML, Inc 19Unsupervised Learning K-Means Algorithm K=3
  • 20. BigML, Inc 20Unsupervised Learning K-Means Algorithm K=3
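The K-Means slides show the algorithm only as pictures. As a rough sketch, the loop (assign each point to its nearest center, then move each center to the mean of its members) could look like the following in plain Python. The data is the Length/Width and Num Surfaces table from slide 16. The starting centers (penny, box, screw) and the use of unscaled Euclidean distance are assumptions made for illustration, not something the slides specify.

```python
import math

# (length/width, number of surfaces) from the "Cluster Features" slide
objects = {
    "penny": (1.0, 3), "dime": (1.0, 3), "knob": (1.0, 4),
    "eraser": (2.75, 6), "box": (1.0, 6), "block": (1.6, 6),
    "screw": (8.0, 3), "battery": (5.0, 3), "key": (4.25, 3),
    "bead": (1.0, 2),
}

def kmeans(points, centers, max_iter=100):
    """Assign each point to its nearest center, then recompute the means."""
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

names = list(objects)
points = [objects[n] for n in names]
# Assumed starting centers: the penny, box and screw instances.
start = [objects["penny"], objects["box"], objects["screw"]]
labels, centers = kmeans(points, list(start))

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

With these particular starting centers the sketch ends with the round objects (penny, dime, knob, bead), the six-surface objects (eraser, box, block) and the skinny objects (screw, battery, key) in separate clusters. Other starting points can converge to a different grouping.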
  • 21. BigML, Inc 21Unsupervised Learning Features Matter Metal Other Wood
  • 22. BigML, Inc 22Unsupervised Learning Convergence Convergence guaranteed but not necessarily unique Starting points important (K++)
  • 23. BigML, Inc 23Unsupervised Learning Starting Points • Random points or instances in n-dimensional space • Choose points “farthest” away from each other • but this is sensitive to outliers • k++ • the first center is chosen randomly from instances • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
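The k++ seeding rule described on this slide can be sketched as below. This is a minimal illustration, assuming the instances are numeric tuples and that there are at least k distinct instances; the function name `kmeans_pp_init` and the sample points are hypothetical.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """Pick k starting centers from `points` using the k++ rule."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # squared distance from each point to its closest existing center
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # sample the next center with probability proportional to that weight;
        # already chosen points have weight 0 and cannot be drawn again
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Illustrative (length/width, num surfaces) points
points = [(1.0, 3), (1.0, 4), (2.75, 6), (1.0, 6), (1.6, 6),
          (8.0, 3), (5.0, 3), (4.25, 3), (1.0, 2)]
seeds = kmeans_pp_init(points, 3, random.Random(0))
print(seeds)
```

Because far-away points get more weight, the seeds tend to be spread out, but the choice is still random, so a single outlier is not guaranteed to be picked.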
  • 24. BigML, Inc 24Unsupervised Learning Scaling price number of bedrooms d = 160,000 d = 1
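The scaling slide contrasts a price difference of 160,000 with a bedroom difference of 1. A small sketch of why that matters, and of one common remedy (standardizing each feature), follows. The house values are hypothetical, and z-score standardization is an assumed choice; the slides do not say which scaling is applied.

```python
import math
import statistics

# Hypothetical houses as (price, bedrooms); values are illustrative only.
houses = [(240_000, 2), (400_000, 3), (320_000, 4), (560_000, 5)]

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Raw distance between the first two houses: dominated by the price gap
raw = math.dist(houses[0], houses[1])

# After scaling, price and bedrooms contribute on a comparable scale
scaled_rows = standardize(houses)
scaled = math.dist(scaled_rows[0], scaled_rows[1])
print(raw, scaled)
```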
  • 25. BigML, Inc 25Unsupervised Learning Other Tricks • What is the distance to a “missing value”? • What is the distance between categorical values? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown “K”?
  • 26. BigML, Inc 26Unsupervised Learning Distance to Missing Value? • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
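The replacement options listed on this slide can be sketched as a small helper. This is only an illustration: missing values are assumed to be represented as `None`, and the names `impute` and `drop_incomplete` are hypothetical.

```python
import statistics

# One fill rule per option on the slide
STRATEGIES = {
    "maximum": max,
    "mean": statistics.fmean,
    "median": statistics.median,
    "minimum": min,
    "zero": lambda known: 0,
}

def impute(column, strategy="mean"):
    """Replace None entries in a numeric column using the chosen strategy."""
    known = [v for v in column if v is not None]
    fill = STRATEGIES[strategy](known)
    return [fill if v is None else v for v in column]

def drop_incomplete(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]

print(impute([1.0, None, 3.0], "mean"))
print(drop_incomplete([(1, 2), (None, 3)]))
```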
  • 27. BigML, Inc 27Unsupervised Learning Distance to Categorical? • Special distance function • if valA == valB then distance = 0 (or scaling value) else distance = 1 • Assign centroid the most common category of the member instances Approach: similar to “k-prototypes”
  • 28. BigML, Inc 28Unsupervised Learning Distance to Categorical? feature_1 feature_2 feature_3 instance_1 red cat ball instance_2 red cat ball instance_3 red cat box instance_4 blue dog fridge D = 0 D = 1 D = sqrt(3) Compute Euclidean distance between discrete vectors
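The categorical distance and the "most common category" centroid from the last two slides can be sketched together. The instances are the four rows shown on the slide; the function names are hypothetical, and the optional scaling value mentioned on the slide is left out.

```python
import math
from collections import Counter

instances = {
    "instance_1": ("red", "cat", "ball"),
    "instance_2": ("red", "cat", "ball"),
    "instance_3": ("red", "cat", "box"),
    "instance_4": ("blue", "dog", "fridge"),
}

def categorical_distance(a, b):
    """Euclidean distance where each field contributes 0 if equal, else 1."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

def mode_centroid(members):
    """Centroid: the most common category in each field of the members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

d_12 = categorical_distance(instances["instance_1"], instances["instance_2"])
d_23 = categorical_distance(instances["instance_2"], instances["instance_3"])
d_34 = categorical_distance(instances["instance_3"], instances["instance_4"])
centroid = mode_centroid([instances[f"instance_{i}"] for i in (1, 2, 3)])
print(d_12, d_23, d_34, centroid)
```

The three distances come out as 0, 1 and sqrt(3), which matches the D values printed on the slide if those refer to consecutive rows.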
  • 29. BigML, Inc 29Unsupervised Learning Text Vectors 1 Cosine Similarity 0 "hippo" "safari" "zebra" …. 1 0 0 … 1 1 0 … 0 0 0 … Text Field #1 Text Field #2 Cosine Distance = 1 - Cosine Similarity Features(thousands)
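The relation "Cosine Distance = 1 - Cosine Similarity" can be sketched for term vectors like the ones on this slide. Treating the first two rows as two text fields is an assumption about the figure, and returning a distance of 1 for an all-zero vector (where the cosine is undefined) is an illustrative choice.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty text is maximally distant
    return 1.0 - dot / (norm_u * norm_v)

# Columns: "hippo", "safari", "zebra"
field_1 = [1, 0, 0]
field_2 = [1, 1, 0]
distance = cosine_distance(field_1, field_2)
print(distance)
```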
  • 30. BigML, Inc 30Unsupervised Learning Finding K: G-Means
  • 31. BigML, Inc 31Unsupervised Learning Finding K: G-Means
  • 32. BigML, Inc 32Unsupervised Learning Finding K: G-Means Let K=2 Keep 1, Split 1 New K=3
  • 33. BigML, Inc 33Unsupervised Learning Finding K: G-Means Let K=3 Keep 1, Split 2 New K=5
  • 34. BigML, Inc 34Unsupervised Learning Finding K: G-Means Let K=5 K=5
  • 35. BigML, Inc 35Unsupervised Learning Clustering Demo #2
  • 36. BigML, Inc 2 Anomaly Detection Poul Petersen CIO, BigML, Inc Finding the Unusual
  • 37. BigML, Inc 3Unsupervised Learning Clusters vs Anomalies Clusters (Unsupervised Learning) Provide: unlabeled data Learning Task: group data by similarity Anomalies (Unsupervised Learning) Provide: unlabeled data Learning Task: Rank data by dissimilarity
  • 38. BigML, Inc 4Unsupervised Learning Clusters vs Anomalies sepal length sepal width petal length petal width 5.1 3.5 1.4 0.2 5.7 2.6 3.5 1.0 6.7 2.5 5.8 1.8 … … … … Learning Task: Find “k” clusters such that the data in each cluster is self similar sepal length sepal width petal length petal width 5.1 3.5 1.4 0.2 5.7 2.6 3.5 1.0 6.7 2.5 5.8 1.8 … … … … Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.
  • 39. BigML, Inc 5Unsupervised Learning • Unusual instance discovery • Intrusion Detection • Fraud • Identify Incorrect Data • Remove Outliers • Model Competence / Input Data Drift Use Cases
  • 40. BigML, Inc 6Unsupervised Learning Removing Outliers • Models need to generalize • Outliers negatively impact generalization GOAL: Use anomaly detector to identify most anomalous points and then remove them before modeling. DATASET FILTERED DATASET ANOMALY DETECTOR CLEAN MODEL
  • 41. BigML, Inc 7Unsupervised Learning Anomaly Demo #1
  • 42. BigML, Inc 8Unsupervised Learning Intrusion Detection GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion. • Dataset of command line history for users • Data for each user consists of commands, flags, working directories, etc. • Assumption: Users typically issue the same flag patterns and work in certain directories Per User Per Dir All User All Dir
  • 43. BigML, Inc 9Unsupervised Learning Fraud • Dataset of credit card transactions • Additional user profile information GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels. Card Level User Level Similar User Level
  • 44. BigML, Inc 10Unsupervised Learning Model Competence • After putting a model into production, the data being predicted can become statistically different from the training data. • Train an anomaly detector at the same time as the model. GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be Training Data PREDICTION ANOMALY SCORE MODEL ANOMALY DETECTOR
  • 45. BigML, Inc 11Unsupervised Learning Univariate Approach • Single variable: heights, test scores, etc • Assume the value is distributed “normally” • Compute standard deviation • a measure of how “spread out” the numbers are • the square root of the variance (The average of the squared differences from the Mean.) • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~ 3 outliers.
  • 46. BigML, Inc 12Unsupervised Learning Univariate Approach [Figure: frequency vs. measurement, with outliers marked in both tails] • Available in BigML API
  • 47. BigML, Inc 13Unsupervised Learning Benford’s Law • In real-life numeric sets the small digits occur disproportionately often as leading significant digits. • Applications include: • accounting records • electricity bills • street addresses • stock prices • population numbers • death rates • lengths of rivers • Available in BigML API
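Benford's Law gives the expected share of leading digit d as log10(1 + 1/d), so roughly 30.1% of values start with 1 and about 4.6% start with 9. The sketch below compares that expectation with observed frequencies. The powers of 2 are used only as a convenient stand-in for a real-life numeric set, and the helper names are hypothetical; this is not the BigML API.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected share of leading digit d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """First significant digit of a non-zero number."""
    x = abs(x)
    if x == 0:
        raise ValueError("zero has no leading significant digit")
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def leading_digit_frequencies(values):
    """Observed share of each leading digit, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

sample = [2 ** n for n in range(1, 101)]  # stand-in data set
observed = leading_digit_frequencies(sample)
for d in range(1, 10):
    print(d, round(benford_expected(d), 3), round(observed[d], 3))
```

A set of numbers whose observed frequencies deviate strongly from the expected ones may deserve a closer look.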
  • 48. BigML, Inc 14Unsupervised Learning Human Example Most Unusual?
  • 49. BigML, Inc 15Unsupervised Learning Human Example “Round” “Skinny” “Corners” “Skinny” but not “smooth” No “Corners” Not “Round” Key Insight The “most unusual” object is different in some way from every partition of the features. Most unusual
  • 50. BigML, Inc 16Unsupervised Learning Human Example • Human used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “smooth”, “corners” • Items were then separated based on the chosen features • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  • 51. BigML, Inc 17Unsupervised Learning Learning from Humans • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count • Smooth - true or false Create features that capture these object differences
  • 52. BigML, Inc 18Unsupervised Learning Anomaly Features
      Object   | Length / Width | Num Surfaces | Smooth
      penny    | 1              | 3            | TRUE
      dime     | 1              | 3            | TRUE
      knob     | 1              | 4            | TRUE
      eraser   | 2.75           | 6            | TRUE
      box      | 1              | 6            | TRUE
      block    | 1.6            | 6            | TRUE
      screw    | 8              | 3            | FALSE
      battery  | 5              | 3            | TRUE
      key      | 4.25           | 3            | FALSE
      bead     | 1              | 2            | TRUE
  • 53. BigML, Inc 19Unsupervised Learning smooth = True length/width > 5 box block eraser knob penny dime bead key battery screw num surfaces = 6 length/width = 1 length/width < 2 Random Splits Know that “splits” matter - don’t know the order
  • 54. BigML, Inc 20Unsupervised Learning Isolation Forest Grow a random decision tree until each instance is in its own leaf “easy” to isolate “hard” to isolate Depth Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
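The isolation idea (random splits, average depth, score between 0 and 1) can be sketched as follows, using the Anomaly Features table from slide 52 with Smooth encoded as 1/0. Several things here are assumptions rather than anything the slides state: the score normalization 2^(-average depth / c(n)) follows the commonly published Isolation Forest formulation, only the branch containing the instance of interest is followed instead of growing whole trees, and identical instances (penny and dime) are treated as isolated together.

```python
import math
import random

# (length/width, num surfaces, smooth as 1/0) from the "Anomaly Features" slide
data = {
    "penny": (1.0, 3, 1), "dime": (1.0, 3, 1), "knob": (1.0, 4, 1),
    "eraser": (2.75, 6, 1), "box": (1.0, 6, 1), "block": (1.6, 6, 1),
    "screw": (8.0, 3, 0), "battery": (5.0, 3, 1), "key": (4.25, 3, 0),
    "bead": (1.0, 2, 1),
}

def isolation_depth(point, rows, rng, depth=0):
    """Number of random splits needed until `point` is alone (or only with duplicates)."""
    if all(row == point for row in rows):
        return depth
    # pick a random feature that still varies, and a random split value in its range
    varying = [f for f in range(len(point)) if len({row[f] for row in rows}) > 1]
    f = rng.choice(varying)
    split = rng.uniform(min(row[f] for row in rows), max(row[f] for row in rows))
    # keep only the rows that fall on the same side of the split as `point`
    same_side = [row for row in rows if (row[f] < split) == (point[f] < split)]
    return isolation_depth(point, same_side, rng, depth + 1)

def average_path_length(n):
    """Normalizing constant c(n) for n > 2 instances (harmonic-number approximation)."""
    return 2 * (math.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n

def anomaly_scores(named_rows, trees=200, seed=0):
    """Average the isolation depth over many random trees and map it into (0, 1)."""
    rng = random.Random(seed)
    rows = list(named_rows.values())
    c = average_path_length(len(rows))
    scores = {}
    for name, point in named_rows.items():
        avg = sum(isolation_depth(point, rows, rng) for _ in range(trees)) / trees
        scores[name] = 2 ** (-avg / c)  # shallow (easy to isolate) gives a higher score
    return scores

scores = anomaly_scores(data)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {score:.3f}")
```

Instances that are easy to isolate, such as the screw, end up with a higher score than instances that sit in a dense group, such as the penny.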
  • 55. BigML, Inc 21Unsupervised Learning Isolation Forest Scoring f_1 f_2 f_3 i_1 red cat ball i_2 red cat ball i_3 red cat box i_4 blue dog pen D = 3 D = 6 D = 2 Score
  • 56. BigML, Inc 22Unsupervised Learning • A low anomaly score means the loan is similar to the modeled loans. • A high anomaly score means you cannot trust the model. Model Competence Prediction T T Confidence 86% 84% Anomaly Score 0.5367 0.7124 Competent? Y N OPEN LOANS PREDICTION ANOMALY SCORE CLOSED LOAN MODEL CLOSED LOAN ANOMALY DETECTOR
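The competence check on this slide amounts to comparing each prediction's anomaly score with a cutoff. A minimal sketch using the two rows shown on the slide follows. The cutoff of 0.6 is an assumption chosen so that the two rows reproduce the Y/N column; the slides do not state the threshold that was used.

```python
# Rows from the "Model Competence" slide: (prediction, confidence, anomaly score)
predictions = [("T", 0.86, 0.5367), ("T", 0.84, 0.7124)]

# Assumed cutoff; not given in the slides.
THRESHOLD = 0.6

def is_competent(anomaly_score, threshold=THRESHOLD):
    """Trust the model only when the input looks similar to the training data."""
    return anomaly_score < threshold

flags = [is_competent(score) for _, _, score in predictions]
for (label, confidence, score), ok in zip(predictions, flags):
    print(label, confidence, score, "Y" if ok else "N")
```

Note that both predictions have similar confidence, so the model's own confidence would not have separated them; the anomaly score is what flags the second one.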
  • 57. BigML, Inc 23Unsupervised Learning Anomaly Demo #2