Here is the anomalow-down!


Why should we care about anomalies? They demand our attention because they tell a different story from the norm. An anomaly might signify a patient's failing heart rate, fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.
What are the challenges in anomaly detection? As with many machine/statistical learning tasks, high-dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another is a high false positive rate.
In this talk we introduce two R packages – dobin and lookout – that address different challenges in anomaly detection. dobin is a dimension reduction technique catered specifically to anomaly detection. In that sense dobin is somewhat similar to PCA, but dobin puts anomalies at the forefront. We can use dobin as a pre-processing step and find anomalies using fewer dimensions.
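A minimal sketch of that workflow (the toy data and the planted anomaly below are our own illustration; the coords component follows the dobin documentation on CRAN):

library(dobin)

set.seed(1)
X <- matrix(runif(500 * 10), ncol = 10)  # 500 points in 10 dimensions
X <- rbind(X, rep(0.9, 10))              # row 501: a planted anomaly
res <- dobin(X)
head(res$coords[, 1:2])  # the data in the first two dobin components;
                         # an anomaly detector can now run on far fewer columns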
lookout, on the other hand, is an anomaly detection method that uses kernel density estimates and extreme value theory, with a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive parameter? lookout addresses this challenge by constructing a bandwidth appropriate for anomaly detection using topological data analysis, so the user doesn't need to specify one. Furthermore, lookout has a low false positive rate because it uses extreme value theory.
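A minimal sketch of calling lookout (the toy data is ours; the alpha argument and the outliers component follow the lookout documentation on CRAN):

library(lookout)

set.seed(1)
X <- rbind(matrix(rnorm(1000), ncol = 2),  # 500 bivariate normal points
           c(8, 8))                        # row 501: a planted anomaly
res <- lookout(X, alpha = 0.05)  # alpha: probability threshold for flagging
res$outliers                     # flagged indices with their outlier probabilities

Note that no bandwidth is passed; lookout chooses one internally.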
We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high.
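A sketch of the persistence diagnostic, assuming the persisting_outliers function and its autoplot method from the lookout package (X as in the previous sketch):

library(lookout)
library(ggplot2)               # for the autoplot method

per <- persisting_outliers(X)  # re-runs the detection over a range of bandwidths
autoplot(per)                  # persistence diagram: long-lived anomalies matter most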


Here is the anomalow-down!

  1. Here is the anomalow-down! Sevvandi Kandanaarachchi, RMIT University. Joint work with Rob Hyndman
  2. Why anomalies? • They tell a different story • Fraudulent credit card transactions amongst billions of legitimate transactions • Computer network intrusions • Astronomical anomalies – solar flares • Weather anomalies – tsunamis • Stock market anomalies – heralding a crash?
  3. Anomaly detection – why? • Take fraud and network intrusions, for example • Training a model on known fraud/intrusions/cyber attacks is not optimal, because there are always new types of fraud and attacks • You want to be alerted when weird things happen • Anomaly detection is used in these applications
  4. Is everything rosy?
  5. Some Current Challenges • High dimensionality of data – finding anomalies in high-dimensional data is hard; anomalies and normal points look similar • High false positives – we do not want an “alarm factory”, where confidence in the system goes down • Parameters need to be defined by the user – but expert knowledge is needed
  6. Overview • lookout – an anomaly detection method; low false positives; the user does not need to specify parameters; on CRAN • dobin – a dimension reduction method for anomaly detection; addresses the high dimensionality challenge; on CRAN
  7. dobin – dimension reduction for outlier detection. Sevvandi Kandanaarachchi, Rob Hyndman. JCGS (2021) 30:1, 204-219
  8. What is it? • A preprocessing technique, not an anomaly detection method • Original anomalies are still anomalies in the reduced-dimensional space
  9. What does it do? • Finds a set of new axes (basis vectors) that preserves anomalies • The first basis vector points in the direction of greatest anomalousness (largest kNN distances), the second in the direction of the second-largest kNN distances
  10. Example • Uniform distribution in 20 dimensions, plus one point at (0.9, 0.9, 0.9, …) – this is the outlier • In R: > dobin(X)
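A hedged completion of this example (the data generation is our reconstruction of the slide's description):

library(dobin)

set.seed(1)
X <- matrix(runif(500 * 20), ncol = 20)  # uniform distribution in 20 dimensions
X <- rbind(X, rep(0.9, 20))              # row 501: the outlier at (0.9, 0.9, ...)
res <- dobin(X)
plot(res$coords[, 1], res$coords[, 2],
     col = c(rep("black", 500), "red"),
     xlab = "dobin 1", ylab = "dobin 2")  # the outlier separates in two dimensions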
  11. lookout – leave-one-out KDE for outlier detection. Sevvandi Kandanaarachchi, Rob Hyndman. Preprint: https://bit.ly/lookoutliers
  12. lookout • An outlier detection method with low false positives – not an “alarm factory” – because of Extreme Value Theory (EVT) • EVT is used to model 100-year floods • Uses a Generalized Pareto Distribution
  13. lookout • The user does not need to specify parameters • Kernel density estimates need a bandwidth parameter, but a general-purpose bandwidth is not appropriate for anomaly detection • lookout selects the bandwidth using topological data analysis: bw(TDA) → KDE → EVT → outliers • Anomaly persistence: which anomalies are consistently identified as the bandwidth changes? • A visual representation of anomaly persistence
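A sketch of that pipeline in code (component names per the lookout documentation; treat them as assumptions if your package version differs):

library(lookout)

res <- lookout(X)  # X: any numeric matrix or data frame
res$bandwidth      # the bandwidth chosen via topological data analysis
res$outliers       # points whose EVT (GPD) tail probability falls below alpha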
  14. Example 1 • A 2D normal distribution with outliers at the far end (indices 501-505) • The persistence diagram: the outliers are identified for a large range of bandwidth values
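A hedged reconstruction of Example 1 (the slide does not give the generating code; the mean and sd below are our guesses matching the description):

library(lookout)
library(ggplot2)

set.seed(123)
X <- rbind(matrix(rnorm(1000), ncol = 2),                    # 500 bivariate normal points
           matrix(rnorm(10, mean = 5, sd = 0.2), ncol = 2))  # rows 501-505: far-end outliers
lookout(X)$outliers               # should flag rows 501-505
autoplot(persisting_outliers(X))  # the persistence diagram shown on the slide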
  15. Example 2 • A 2D bimodal distribution with outliers in the trough (indices 1001-1005) • The persistence diagram: again, the outliers are identified for a large range of bandwidth values
  16. Example 3 • Points in 3 normally distributed clusters, with anomalies away from them (indices 701-703) • The persistence diagram: the anomalies are identified for a broad range of bandwidth values
  17. Example 4 • Points in an annulus with anomalies in the middle (indices 1001-1010) • The persistence diagram
  18. Summary • dobin – a dimension reduction method for anomaly detection • lookout – an EVT-based method to find anomalies • Paper and preprint available: https://doi.org/10.1080/10618600.2020.1807353 and https://bit.ly/lookoutliers • Both packages are on CRAN
  19. Thank you!
