Why should we care about anomalies? They demand our attention because they tell a different story from the norm. An anomaly might signify a patient's failing heart rate, fraudulent credit card activity, or an early indication of a tsunami. Detecting anomalies is therefore extremely important.
What are the challenges in anomaly detection? As with many machine and statistical learning tasks, high-dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another is a high false positive rate.
In this talk we introduce two R packages, dobin and lookout, that address different challenges in anomaly detection. Dobin is a dimension reduction technique designed specifically for anomaly detection. It is somewhat similar to PCA, but it puts anomalies at the forefront. We can use dobin as a pre-processing step and then find anomalies using fewer dimensions.
On the other hand, lookout is an anomaly detection method that uses kernel density estimates and extreme value theory. But there is a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive bandwidth parameter? Lookout addresses this challenge by constructing an appropriate bandwidth for anomaly detection using topological data analysis, so the user doesn’t need to specify a bandwidth parameter. Furthermore, lookout has a low false positive rate because it uses extreme value theory.
We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high.
2. Why anomalies?
• They tell a different story
• Fraudulent credit card transactions amongst billions of legitimate transactions
• Computer network intrusions
• Astronomical anomalies – solar flares
• Weather anomalies – tsunamis
• Stock market anomalies – heralding a crash?
3. Anomaly detection – why?
• Take fraud and network intrusions for example
• Training a model on known types of fraud, intrusions, or cyber attacks is not optimal, because new types of fraud and attacks always appear!
• You want to be alerted when weird things happen.
• Anomaly detection is used in these applications.
5. Some Current Challenges
High dimensionality of data
• Finding anomalies in high dimensional data is hard
• Anomalies and normal points look similar
High false positives
• Do not want an “alarm factory”: confidence in the system goes down
Parameters need to be defined by the user
• But expert knowledge is needed
6. Overview
lookout – an anomaly detection method
Low false positives
User does not need to specify parameters
lookout – on CRAN
dobin – a dimension reduction method for anomaly detection
Addresses the high dimensionality challenge
dobin – on CRAN
8. What is it?
Original anomalies are still anomalies in the reduced-dimensional space
dobin is a preprocessing technique
It is not an anomaly detection method
9. What does it do?
Finds a set of new axes (basis vectors) that preserves anomalies
The first basis vector points in the direction of greatest anomalousness (largest kNN distances); the second basis vector points in the direction of the second-largest kNN distances
10. Example
• Uniform distribution in 20 dimensions
• One point at (0.9, 0.9, 0.9, ...)
• This is the outlier
• In R
• > dobin(X)
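The example above can be set up roughly as follows. This is a sketch under stated assumptions, not the code behind the slide: the sample size, the seed, and the placement of the outlier in the last row are all chosen here for illustration, and the `coords` element of the result is assumed to hold the data in the new basis. The `dobin` call only runs if the package is installed.

```r
# Sketch of the slide's example: 20-dimensional uniform data plus one outlier.
set.seed(1)                        # assumed seed, for reproducibility only
n <- 500                           # assumed sample size; the slide does not give one
d <- 20
X <- matrix(runif(n * d), nrow = n, ncol = d)
X <- rbind(X, rep(0.9, d))         # the outlier, stored as the last row (index n + 1)
X <- as.data.frame(X)

if (requireNamespace("dobin", quietly = TRUE)) {
  out <- dobin::dobin(X)           # the call shown on the slide
  # Assumption: `coords` contains the observations expressed in the dobin basis.
  Y <- out$coords[, 1:2]           # keep the first two dobin components
  print(Y[n + 1, ])                # coordinates of the planted outlier
}
```

Any anomaly detection method can then be run on the first few columns instead of all 20 dimensions.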
11. lookout – leave one out kde for outlier detection
Sevvandi Kandanaarachchi, Rob Hyndman
Preprint - https://bit.ly/lookoutliers
12. lookout
Outlier detection method
Low false positives
• Because of Extreme Value Theory (EVT)
• EVT is used to model 100-year floods
• Uses a Generalized Pareto Distribution
Not an “alarm factory”
13. lookout
User does not need to specify parameters
• Uses kernel density estimates (KDE), which need a bandwidth parameter
• But a general-purpose bandwidth is not appropriate for anomaly detection
• Selects the bandwidth using topological data analysis (TDA)
• bw(TDA) → KDE → EVT → outliers
Anomaly persistence
• Which anomalies are consistently identified as the bandwidth changes?
• Visual representation of anomaly persistence
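To make the KDE step of this pipeline concrete, here is a minimal base-R sketch of a leave-one-out kernel density estimate. It is an illustration only, not lookout's implementation: the Gaussian kernel, the helper name `loo_kde`, and the bandwidth `h = 0.5` are assumptions made here, the bandwidth is not the TDA-derived one, and the EVT step is omitted.

```r
# Leave-one-out Gaussian KDE: the density at each point is estimated from
# all the *other* points, so an isolated point cannot prop up its own density.
loo_kde <- function(X, h) {        # hypothetical helper, not part of lookout
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)
  D2 <- as.matrix(dist(X))^2                             # squared pairwise distances
  K <- exp(-D2 / (2 * h^2)) / (2 * pi * h^2)^(d / 2)     # isotropic Gaussian kernel
  diag(K) <- 0                                           # drop each point's own contribution
  rowSums(K) / (n - 1)
}

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),   # 100 typical points
           c(6, 6))                        # one planted outlier, row 101
dens <- loo_kde(X, h = 0.5)                # h = 0.5 is an arbitrary placeholder
which.min(dens)                            # index of the lowest-density point
```

Points with very low leave-one-out density are the outlier candidates; in lookout the decision about which of them to flag is made with EVT rather than a hand-picked cutoff.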
14. Example 1
2D normal distribution, with outliers at the far end. The outlying indices are 501 - 505.
The persistence diagram. The outliers get identified for a large range of bandwidth values.
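A dataset of this shape can be generated and passed to lookout along the following lines. This is a hedged sketch: the seed and the exact location and spread of the outliers are assumptions, and the function names `lookout()` and `persisting_outliers()` are assumed to be the package's interface for detection and for anomaly persistence respectively. The package calls only run if lookout is installed.

```r
# Sketch of Example 1: a 2D normal cloud with five outliers at rows 501-505.
set.seed(1)                                          # assumed seed
X <- rbind(
  matrix(rnorm(1000), ncol = 2),                     # rows 1-500: standard normal cloud
  matrix(rnorm(10, mean = 5, sd = 0.2), ncol = 2)    # rows 501-505: outliers (assumed location)
)
X <- as.data.frame(X)

if (requireNamespace("lookout", quietly = TRUE)) {
  lo <- lookout::lookout(X)                   # assumed interface: no bandwidth supplied
  print(lo)
  persist <- lookout::persisting_outliers(X)  # assumed interface for anomaly persistence
}
```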
15. Example 2
2D bimodal distribution, with outliers in the trough. The outliers have indices 1001 - 1005.
The persistence diagram. Again, the outliers get identified for a large range of bandwidth values.
16. Example 3
Points in 3 normally distributed clusters, with anomalies away from them. Anomalies have indices 701 - 703.
The persistence diagram. Anomalies get identified for a broad range of bandwidth values.
17. Example 4
Points in an annulus with anomalies in the middle.
Anomalies have indices 1001 - 1010
The persistence diagram.
18. Summary
• dobin - a dimension reduction method for anomaly detection
• lookout - an EVT-based method to find anomalies
• Paper and preprint available
• https://doi.org/10.1080/10618600.2020.1807353
• https://bit.ly/lookoutliers
• Both packages on CRAN