SlideShare a Scribd company logo
Anomaly/Novelty Detection
with scikit-learn
Alexandre Gramfort
Telecom ParisTech - CNRS LTCI
alexandre.gramfort@telecom-paristech.fr
GitHub : @agramfort Twitter : @agramfort
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
2
Objective: Spot the red apple
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
3
“An outlier is an observation in a data set which appears to
be inconsistent with the remainder of that set of data.”
Johnson 1992
“An outlier is an observation which deviates so much from
the other observations as to arouse suspicions that it was
generated by a different mechanism.”
Hawkins 1980
Outlier/Anomaly
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
4
• Supervised AD
• Labels available for both normal data and anomalies
• Similar to rare class mining / imbalanced classification
• Semi-supervised AD (Novelty Detection)
• Only normal data available to train
• The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
• no labels, training set = normal + abnormal data
• Assumption: anomalies are very rare
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
5
• Supervised AD
• Labels available for both normal data and anomalies
• Similar to rare class mining / imbalanced classification
• Semi-supervised AD (Novelty Detection)
• Only normal data available to train
• The algorithm learns on normal data only
• Unsupervised AD (Outlier Detection)
• no labels, training set = normal + abnormal data
• Assumption: anomalies are very rare
Alexandre Gramfort Anomaly detection with scikit-learn
ML Taxonomy
6
Machine Learning
SupervisedUnsupervised
Regression Classif.
“Prediction”
Clustering Dim. Red. Anomaly
Novelty
Detection
Alexandre Gramfort Anomaly detection with scikit-learn
Applications
7
• Fraud detection
• Network intrusion
• Finance
• Insurance
• Maintenance
• Medicine (unusual symptoms)
• Measurement errors (from sensors)
Any application where
looking at unusual
observations is
relevant
Alexandre Gramfort Anomaly detection with scikit-learn
Big picture
9
Look for samples that are in
low density regions, isolated
Look for a region of the space
that is small in volume but
contains most of the samples
Density based approach (KDE,
Gaussian Ellipse, GMM)
Kernel methods
Nearest neighbors
Trees / Partitioning
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_components=1).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Low density regions
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_components=2).fit(X)
>>> log_dens = gmm.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Novelty detection via density estimates
>>> from sklearn.neighbors import KernelDensity
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)
>>> log_dens = kde.score_samples(X_plot)
>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
Kernel methods
Nearest neighbors
Trees / Partitioning
Our toy dataset
Kernel Approach
http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html
>>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
Nearest neighbors
Trees / Partitioning
Nearest Neighbors (NN) Approach
https://github.com/scikit-learn/scikit-learn/pull/5279
>>> est = LocalOutlierFactor(n_neighbors=5)
Local Outlier Factor (LOF)
https://en.wikipedia.org/wiki/Local_outlier_factor
Trees / Partitioning
Partitioning / Tree Approach
http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html
>>> est = IsolationForest(n_estimators=100)
Isolation Forest
An anomaly can be isolated with a
very shallow random tree
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Network intrusions
https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/
bench_isolation_forest.py
Alexandre Gramfort Anomaly detection with scikit-learn
Some caveats
25
• How to set model hyperparameters?
• How to evaluate performance in the unsupervised setup?
• In any AD method there is a notion of metric/similarity
between samples, e.g. Euclidian distance. Unclear how to define
it (think continuous, categorical features etc.)
http://scikit-learn.org/stable/modules/outlier_detection.html
Alexandre Gramfort
alexandre.gramfort@telecom-paristech.frContact:
GitHub : @agramfort Twitter : @agramfort
Questions?
1 position to work on Scikit-Learn and Scipy stack available !
Thanks @ngoix & @albertthomas88 for the work

More Related Content

What's hot

What's hot (20)

Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
 
Anomaly Detection
Anomaly DetectionAnomaly Detection
Anomaly Detection
 
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live ImplementationAnomaly Detection - Real World Scenarios, Approaches and Live Implementation
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
22 Machine Learning Feature Selection
22 Machine Learning Feature Selection22 Machine Learning Feature Selection
22 Machine Learning Feature Selection
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Outlier detection method introduction
Outlier detection method introductionOutlier detection method introduction
Outlier detection method introduction
 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Anomaly detection with machine learning at scale
Anomaly detection with machine learning at scaleAnomaly detection with machine learning at scale
Anomaly detection with machine learning at scale
 
Support Vector Machines ( SVM )
Support Vector Machines ( SVM ) Support Vector Machines ( SVM )
Support Vector Machines ( SVM )
 
Feedforward neural network
Feedforward neural networkFeedforward neural network
Feedforward neural network
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
Support vector machines (svm)
Support vector machines (svm)Support vector machines (svm)
Support vector machines (svm)
 
Isolation Forest
Isolation ForestIsolation Forest
Isolation Forest
 
Machine Learning With Logistic Regression
Machine Learning  With Logistic RegressionMachine Learning  With Logistic Regression
Machine Learning With Logistic Regression
 

More from agramfort

More from agramfort (7)

MNE sapien labs 2019
MNE sapien labs 2019MNE sapien labs 2019
MNE sapien labs 2019
 
MAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signalsMAIN Conf Talk: Learning representations from neural signals
MAIN Conf Talk: Learning representations from neural signals
 
SfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillationsSfN 2018: Machine learning and signal processing for neural oscillations
SfN 2018: Machine learning and signal processing for neural oscillations
 
ICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. GramfortICML 2018 Reproducible Machine Learning - A. Gramfort
ICML 2018 Reproducible Machine Learning - A. Gramfort
 
MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.MNE group analysis presentation @ Biomag 2016 conf.
MNE group analysis presentation @ Biomag 2016 conf.
 
Teaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTechTeaching ML with scikit-learn at Telecom ParisTech
Teaching ML with scikit-learn at Telecom ParisTech
 
Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013Paris machine learning meetup 17 Sept. 2013
Paris machine learning meetup 17 Sept. 2013
 

Recently uploaded

Recently uploaded (20)

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová10 Differences between Sales Cloud and CPQ, Blanka Doktorová
10 Differences between Sales Cloud and CPQ, Blanka Doktorová
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 

Anomaly/Novelty detection with scikit-learn

  • 1. Anomaly/Novelty Detection with scikit-learn Alexandre Gramfort Telecom ParisTech - CNRS LTCI alexandre.gramfort@telecom-paristech.fr GitHub : @agramfort Twitter : @agramfort
  • 2. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 2 Objective: Spot the red apple
  • 3. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 3 “An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” Johnson 1992 “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins 1980 Outlier/Anomaly
  • 4. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 4 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  • 5. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 5 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  • 6. Alexandre Gramfort Anomaly detection with scikit-learn ML Taxonomy 6 Machine Learning SupervisedUnsupervised Regression Classif. “Prediction” Clustering Dim. Red. Anomaly Novelty Detection
  • 7. Alexandre Gramfort Anomaly detection with scikit-learn Applications 7 • Fraud detection • Network intrusion • Finance • Insurance • Maintenance • Medicine (unusual symptoms) • Measurement errors (from sensors) Any application where looking at unusual observations is relevant
  • 8.
  • 9. Alexandre Gramfort Anomaly detection with scikit-learn Big picture 9 Look for samples that are in low density regions, isolated Look for a region of the space that is small in volume but contains most of the samples
  • 10. Density based approach (KDE, Gaussian Ellipse, GMM) Kernel methods Nearest neighbors Trees / Partitioning
  • 11. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=1).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7) Low density regions
  • 12. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=2).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  • 13. Novelty detection via density estimates >>> from sklearn.neighbors import KernelDensity >>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X) >>> log_dens = kde.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  • 18. Nearest Neighbors (NN) Approach https://github.com/scikit-learn/scikit-learn/pull/5279 >>> est = LocalOutlierFactor(n_neighbors=5)
  • 19. Local Outlier Factor (LOF) https://en.wikipedia.org/wiki/Local_outlier_factor
  • 21. Partitioning / Tree Approach http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html >>> est = IsolationForest(n_estimators=100)
  • 22. Isolation Forest An anomaly can be isolated with a very shallow random tree
  • 23.
  • 25. Alexandre Gramfort Anomaly detection with scikit-learn Some caveats 25 • How to set model hyperparameters? • How to evaluate performance in the unsupervised setup? • In any AD method there is a notion of metric/similarity between samples, e.g. Euclidian distance. Unclear how to define it (think continuous, categorical features etc.)
  • 27.
  • 28. Alexandre Gramfort alexandre.gramfort@telecom-paristech.frContact: GitHub : @agramfort Twitter : @agramfort Questions? 1 position to work on Scikit-Learn and Scipy stack available ! Thanks @ngoix & @albertthomas88 for the work