Anomaly detection is a topic with many different applications. From social media tracking to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact on your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
2. Anomaly Detection
Anomaly detection (also known as outlier detection) is the search for items or events which do
not conform to an expected pattern.
◦ This is domain specific
◦ E.g. intrusion detection, spikes
3. Anomaly detection
•Anomaly detection is applicable in a variety of domains:
• intrusion detection, fraud detection, fault detection, system health monitoring, event detection in
sensor networks, and detecting ecosystem disturbances.
It is often used in preprocessing to remove anomalous data from the dataset.
In supervised learning, removing the anomalous data from the dataset often results in a
statistically significant increase in accuracy.
4. Types of anomalies
Anomalies can be classified into the following three categories:
1. Point anomalies
2. Contextual anomalies
3. Collective anomalies
5. Point anomalies
•If an individual data instance can be considered anomalous with respect to the rest of the data,
then the instance is termed a point anomaly.
•This is the simplest type of anomaly and is the focus of the majority of research on anomaly
detection.
Example: credit card fraud detection.
◦ Data set: an individual’s credit card transactions.
◦ A transaction for which the amount spent is very high compared to the normal range of expenditure for
that person will be a point anomaly.
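The credit card example can be sketched in a few lines. This is a minimal illustration, not a production fraud detector: it assumes a single list of transaction amounts and uses the modified z-score (based on the median and the median absolute deviation). The function name, the sample data and the 3.5 cut-off are assumptions chosen for the example.

```python
from statistics import median

def point_anomalies(amounts, threshold=3.5):
    """Return indices of amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread in the data, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    return [i for i, a in enumerate(amounts)
            if 0.6745 * abs(a - med) / mad > threshold]

# Hypothetical transaction amounts for one cardholder
transactions = [23.5, 41.0, 18.2, 35.9, 27.4, 30.1, 22.8, 950.0, 26.3, 38.7]
flagged = point_anomalies(transactions)
print(flagged)
```

Only the 950.0 transaction stands far enough from this person's usual range to be flagged.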
7. Contextual anomalies
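•If a data instance is anomalous in a specific context, but not otherwise, it is termed a contextual anomaly.
•The contextual attributes are used to determine the context (or neighborhood) for that instance.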
•For example, in spatial data sets, the longitude and latitude of a location are the contextual
attributes. In time series data, time is a contextual attribute which determines the position of an
instance on the entire sequence.
Examples: network intrusion detection and social media volume
◦ The interesting objects are often not rare objects, but unexpected bursts in activity.
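A small sketch of the idea, assuming the hour of day is the contextual attribute and the number of posts is the value being checked. Each value is compared only with other values from the same context. The function name, the data and the threshold are hypothetical.

```python
from collections import defaultdict
from statistics import median

def contextual_anomalies(observations, threshold=3.5):
    """observations: list of (context, value) pairs.

    Returns the indices of values that are unusual for their own context.
    """
    by_context = defaultdict(list)
    for context, value in observations:
        by_context[context].append(value)

    flagged = []
    for i, (context, value) in enumerate(observations):
        values = by_context[context]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        # Modified z-score, computed within the context only
        if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
            flagged.append(i)
    return flagged

# Hypothetical (hour of day, posts per hour) pairs collected over six days
observations = [
    (3, 12), (18, 450), (3, 15), (18, 470), (3, 11), (18, 500),
    (3, 14), (18, 480), (3, 13), (18, 460), (3, 480), (18, 490),
]
flagged = contextual_anomalies(observations)
print(flagged)
```

A volume of 480 posts is ordinary at 18:00 but is flagged at 03:00, which is exactly the kind of unexpected burst described above.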
9. Collective anomalies
If a collection of related data instances is anomalous with respect to the entire data set, it is
termed as a collective anomaly. The individual data instances in a collective anomaly may not be
anomalies by themselves, but their occurrence together as a collection is anomalous.
There are two variations:
◦ Events in an unexpected order (ordered, e.g. a break in the rhythm of an ECG)
◦ Unexpected value combinations (unordered, e.g. buying a large number of expensive items)
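The ordered case can be illustrated with a sliding window. In this toy sketch the signal normally oscillates, so a single zero is unremarkable, but a run of zeros is not. The window size, the minimum range and the signal itself are assumptions made for the example; real ECG analysis is considerably more involved.

```python
def flat_windows(signal, window=4, min_range=1.0):
    """Return the start index of every window whose value range is below min_range."""
    starts = []
    for start in range(len(signal) - window + 1):
        chunk = signal[start:start + window]
        if max(chunk) - min(chunk) < min_range:
            starts.append(start)
    return starts

# Hypothetical rhythm: a repeating beat, a flat stretch, then the beat again
beat = [0, 1, 0, -1]
signal = beat * 5 + [0] * 8 + beat * 5
anomalous_starts = flat_windows(signal)
print(anomalous_starts)
```

Every value in the flat stretch also occurs in the normal rhythm, so none of them would be caught as a point anomaly. Only the windows that contain nothing but zeros are flagged.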
10. Anomaly detection techniques
Many techniques have been proposed. Some representative examples are:
◦ Distance based techniques (k-nearest neighbour, local outlier factor)
◦ One class support vector machines.
◦ Replicator neural networks.
◦ Cluster analysis based outlier detection.
◦ Pointing at records that deviate from learned association rules.
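As an illustration of the first family, the sketch below scores each point by the distance to its k-th nearest neighbour, so isolated points get high scores. It is a brute-force version using only the standard library. The function name, the value of k and the sample points are assumptions, and local outlier factor refines this idea by comparing local densities rather than raw distances.

```python
from math import dist

def knn_scores(points, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical 2-D data: one tight cluster and one isolated point
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3), (0.8, 0.8), (6.0, 6.5)]
scores = knn_scores(points)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
print(most_anomalous, round(scores[most_anomalous], 2))
```

The score is only a ranking. Turning it into a yes/no decision still requires a threshold that is appropriate for the domain.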
11. Anomaly detection in time series
Twitter Anomaly Detection package
◦ https://github.com/twitter/AnomalyDetection
12. Seasonal Hybrid ESD
Builds upon the generalized ESD test for detecting anomalies
◦ ESD stands for extreme Studentized deviate (Rosner 1983)
Given the upper bound, r, the generalized ESD test essentially performs r separate tests: a test
for one outlier, a test for two outliers, and so on up to r outliers.
Hypothesis test
◦ Null: There are no outliers in the data set
◦ Alternative: There are up to r outliers in the data set
Seasonal ESD applies time series decomposition to remove the seasonal component, then runs the test on what remains
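For reference, the r separate tests can be written out as follows. This is the standard textbook form of the generalized ESD test, assuming n observations, significance level α, sample mean x̄ and sample standard deviation s; the symbols R_i and λ_i are the usual notation, not taken from the slides.

```latex
% Test statistic for step i = 1, ..., r, computed on the n - i + 1
% observations that remain after removing the previous i - 1 extremes
R_i = \frac{\max_j \lvert x_j - \bar{x} \rvert}{s}

% Critical value for step i, where t_{p,\nu} is the p-th quantile of the
% t distribution with \nu degrees of freedom
\lambda_i = \frac{(n - i)\, t_{p,\,n-i-1}}
                 {\sqrt{\left(n - i - 1 + t_{p,\,n-i-1}^{2}\right)(n - i + 1)}},
\qquad p = 1 - \frac{\alpha}{2\,(n - i + 1)}
```

After each step the observation that maximises |x_j − x̄| is removed and the statistics are recomputed. The number of outliers is the largest i for which R_i > λ_i.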
13. Twitter anomaly detection algorithm
Extends Seasonal ESD by using robust statistics (median and median absolute deviation) in place of the mean and standard deviation
Parameters
◦ Max number of anomalies: expressed as a percentage of the data
◦ Direction: positive, negative or both
◦ Alpha: significance level
◦ Period: main period of the observations (e.g. 24 hours, or 7 days)
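The robust step can be sketched as below. This is not the package's code: it assumes the seasonal component has already been removed, it only computes the test statistics for the candidate anomalies, and it leaves out the comparison with the critical values at level alpha (which needs a t-distribution quantile, for instance from a statistics library). The function and argument names are hypothetical and merely echo the parameters listed above.

```python
from statistics import median

def signed_deviation(value, centre, direction):
    """Deviation from the centre in the direction of interest."""
    if direction == "pos":
        return value - centre
    if direction == "neg":
        return centre - value
    return abs(value - centre)  # "both"

def robust_esd_statistics(values, max_anoms=0.1, direction="both"):
    """Return [(index, statistic), ...] for up to max_anoms * n candidates.

    At each step the most extreme remaining value is scored against the
    median and the (scaled) median absolute deviation, then removed.
    """
    remaining = list(enumerate(values))
    r = int(max_anoms * len(values))  # upper bound on the number of anomalies
    results = []
    for _ in range(r):
        xs = [v for _, v in remaining]
        med = median(xs)
        # 1.4826 makes the MAD comparable to a standard deviation for normal data
        mad = 1.4826 * median(abs(v - med) for v in xs)
        if mad == 0:
            break
        pos = max(range(len(remaining)),
                  key=lambda k: signed_deviation(remaining[k][1], med, direction))
        index, value = remaining.pop(pos)
        results.append((index, signed_deviation(value, med, direction) / mad))
    return results

# Hypothetical residuals after removing the seasonal component
residuals = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 8.0, -0.3, 0.1, 0.0,
             -0.2, 0.2, 0.1, -0.1, -6.0, 0.3, 0.0, -0.2, 0.1, 0.2]
candidates = robust_esd_statistics(residuals, max_anoms=0.1, direction="both")
print([index for index, _ in candidates])
```

Because the median and the MAD are barely affected by the spike at index 6, the drop at index 14 still stands out clearly in the second step. With the mean and standard deviation, a large anomaly can inflate the scale and mask smaller ones.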
14. Applications of anomaly detection
Cybersecurity
◦ Intrusion detection
Fraud detection
Social media monitoring
Medical monitoring
15. Learn more
Tesseract Academy
◦ http://tesseract.academy
◦ https://www.youtube.com/watch?v=XEM2bYYxkTU
◦ Data science, big data and blockchain for executives and managers.
The Data Scientist
◦ Personal blog
◦ Covers data science, analytics, blockchain, tokenomics and many more subjects
◦ http://thedatascientist.com