Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Anomaly/Novelty Detection
with scikit-learn
Alexandre Gramfort
Telecom ParisTech - CNRS LTCI
alexandre.gramfort@telecom-pa...
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
2
Objective: Spot the red apple
Alexandre Gramfort Anomaly detection with scikit-learn
What’s the problem?
3
“An outlier is an observation in a data set w...
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
4
• Supervised AD
• Labels available for both normal da...
Alexandre Gramfort Anomaly detection with scikit-learn
Types of AD
5
• Supervised AD
• Labels available for both normal da...
Alexandre Gramfort Anomaly detection with scikit-learn
ML Taxonomy
6
Machine Learning
SupervisedUnsupervised
Regression Cl...
Alexandre Gramfort Anomaly detection with scikit-learn
Applications
7
• Fraud detection
• Network intrusion
• Finance
• In...
Alexandre Gramfort Anomaly detection with scikit-learn
Big picture
9
Look for samples that are in
low density regions, iso...
Density based approach (KDE,
Gaussian Ellipse, GMM)
Kernel methods
Nearest neighbors
Trees / Partitioning
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_compon...
Novelty detection via density estimates
>>> from sklearn.mixture import GaussianMixture
>>> gmm = GaussianMixture(n_compon...
Novelty detection via density estimates
>>> from sklearn.neighbors import KernelDensity
>>> kde = KernelDensity(kernel='ga...
Kernel methods
Nearest neighbors
Trees / Partitioning
Our toy dataset
Kernel Approach
http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html
>>> est = OneClassSVM(nu=0.1, kernel="...
Nearest neighbors
Trees / Partitioning
Nearest Neighbors (NN) Approach
https://github.com/scikit-learn/scikit-learn/pull/5279
>>> est = LocalOutlierFactor(n_neig...
Local Outlier Factor (LOF)
https://en.wikipedia.org/wiki/Local_outlier_factor
Trees / Partitioning
Partitioning / Tree Approach
http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html
>>> est =...
Isolation Forest
An anomaly can be isolated with a
very shallow random tree
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Network intrusions
https://github.com/scikit-learn/scikit-learn/bl...
Alexandre Gramfort Anomaly detection with scikit-learn
Some caveats
25
• How to set model hyperparameters?
• How to evalua...
http://scikit-learn.org/stable/modules/outlier_detection.html
Alexandre Gramfort
alexandre.gramfort@telecom-paristech.frContact:
GitHub : @agramfort Twitter : @agramfort
Questions?
1 p...
Anomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learn
Upcoming SlideShare
Loading in …5
×

Anomaly/Novelty detection with scikit-learn

12,474 views

Published on

Talk given at PyData Paris / Scikit-Learn data 2016

Published in: Technology
  • Be the first to comment

Anomaly/Novelty detection with scikit-learn

  1. 1. Anomaly/Novelty Detection with scikit-learn Alexandre Gramfort Telecom ParisTech - CNRS LTCI alexandre.gramfort@telecom-paristech.fr GitHub : @agramfort Twitter : @agramfort
  2. 2. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 2 Objective: Spot the red apple
  3. 3. Alexandre Gramfort Anomaly detection with scikit-learn What’s the problem? 3 “An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” Johnson 1992 “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins 1980 Outlier/Anomaly
  4. 4. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 4 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  5. 5. Alexandre Gramfort Anomaly detection with scikit-learn Types of AD 5 • Supervised AD • Labels available for both normal data and anomalies • Similar to rare class mining / imbalanced classification • Semi-supervised AD (Novelty Detection) • Only normal data available to train • The algorithm learns on normal data only • Unsupervised AD (Outlier Detection) • no labels, training set = normal + abnormal data • Assumption: anomalies are very rare
  6. 6. Alexandre Gramfort Anomaly detection with scikit-learn ML Taxonomy 6 Machine Learning SupervisedUnsupervised Regression Classif. “Prediction” Clustering Dim. Red. Anomaly Novelty Detection
  7. 7. Alexandre Gramfort Anomaly detection with scikit-learn Applications 7 • Fraud detection • Network intrusion • Finance • Insurance • Maintenance • Medicine (unusual symptoms) • Measurement errors (from sensors) Any application where looking at unusual observations is relevant
  8. 8. Alexandre Gramfort Anomaly detection with scikit-learn Big picture 9 Look for samples that are in low density regions, isolated Look for a region of the space that is small in volume but contains most of the samples
  9. 9. Density based approach (KDE, Gaussian Ellipse, GMM) Kernel methods Nearest neighbors Trees / Partitioning
  10. 10. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=1).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7) Low density regions
  11. 11. Novelty detection via density estimates >>> from sklearn.mixture import GaussianMixture >>> gmm = GaussianMixture(n_components=2).fit(X) >>> log_dens = gmm.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  12. 12. Novelty detection via density estimates >>> from sklearn.neighbors import KernelDensity >>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X) >>> log_dens = kde.score_samples(X_plot) >>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)
  13. 13. Kernel methods Nearest neighbors Trees / Partitioning
  14. 14. Our toy dataset
  15. 15. Kernel Approach http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html >>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
  16. 16. Nearest neighbors Trees / Partitioning
  17. 17. Nearest Neighbors (NN) Approach https://github.com/scikit-learn/scikit-learn/pull/5279 >>> est = LocalOutlierFactor(n_neighbors=5)
  18. 18. Local Outlier Factor (LOF) https://en.wikipedia.org/wiki/Local_outlier_factor
  19. 19. Trees / Partitioning
  20. 20. Partitioning / Tree Approach http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html >>> est = IsolationForest(n_estimators=100)
  21. 21. Isolation Forest An anomaly can be isolated with a very shallow random tree
  22. 22. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html Network intrusions https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/ bench_isolation_forest.py
  23. 23. Alexandre Gramfort Anomaly detection with scikit-learn Some caveats 25 • How to set model hyperparameters? • How to evaluate performance in the unsupervised setup? • In any AD method there is a notion of metric/similarity between samples, e.g. Euclidian distance. Unclear how to define it (think continuous, categorical features etc.)
  24. 24. http://scikit-learn.org/stable/modules/outlier_detection.html
  25. 25. Alexandre Gramfort alexandre.gramfort@telecom-paristech.frContact: GitHub : @agramfort Twitter : @agramfort Questions? 1 position to work on Scikit-Learn and Scipy stack available ! Thanks @ngoix & @albertthomas88 for the work

×