The document summarizes an anomaly detection survey paper. It discusses different aspects of anomaly detection problems, including the nature of input data, types of anomalies, availability of data labels, and output types. It also describes several anomaly detection techniques, such as classification-based, nearest-neighbor-based, clustering-based, statistical, and spectral methods. For each technique, it provides the basic idea, categories, examples, advantages, and disadvantages.
Anomaly detection (or Outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and monitoring processes in various domains including energy, healthcare and finance.
In this workshop, we will discuss the core techniques in anomaly detection and recent advances in Deep Learning in this field.
Through case studies, we will discuss how anomaly detection techniques could be applied to various business problems. We will also demonstrate examples using R, Python, Keras and Tensorflow applications to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
What you will learn:
Anomaly Detection: An introduction
Graphical and Exploratory analysis techniques
Statistical techniques in Anomaly Detection
Machine learning methods for Outlier analysis
Evaluating performance in Anomaly detection techniques
Detecting anomalies in time series data
Case study 1: Anomalies in Freddie Mac mortgage data
Case study 2: Auto-encoder based Anomaly Detection for Credit risk with Keras and Tensorflow
Part 1
- Introduction
- Applications of Anomaly Detection
- AIOps
- GraphDB
Part 2
- Types of Anomaly Detection
- How to Identify Outliers in your Data
Part 3
- Anomaly Detection Techniques for Time Series
Anomaly detection is a topic with many different applications. From social media tracking, to cybersecurity, anomaly detection (or outlier detection) algorithms can have a huge impact in your organisation.
For the video please visit: https://www.youtube.com/watch?v=XEM2bYYxkTU
This slideshare has been produced by the Tesseract Academy (http://tesseract.academy), a company that educates decision makers in deep technical topics such as data science, analytics, machine learning and blockchain.
If you are interested in data science and related topics, make sure to also visit The Data Scientist: http://thedatascientist.com.
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning (QuantUniversity)
Anomaly detection (or Outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and monitoring processes in various domains including energy, healthcare and finance.
Anomaly Detection - Real World Scenarios, Approaches and Live Implementation (Impetus Technologies)
Detecting anomalous patterns in data can lead to significant actionable insights in a wide variety of application domains, such as fraud detection, network traffic management, predictive healthcare, energy monitoring and many more.
However, detecting anomalies accurately can be difficult. What qualifies as an anomaly is continuously changing and anomalous patterns are unexpected. An effective anomaly detection system needs to continuously self-learn without relying on pre-programmed thresholds.
Join our speakers Ravishankar Rao Vallabhajosyula, Senior Data Scientist, Impetus Technologies and Saurabh Dutta, Technical Product Manager - StreamAnalytix, in a discussion on:
Importance of anomaly detection in enterprise data, types of anomalies, and challenges
Prominent real-time application areas
Approaches, techniques and algorithms for anomaly detection
Sample use-case implementation on the StreamAnalytix platform
Organizations are collecting massive amounts of data from disparate sources. However, they continuously face the challenge of identifying patterns, detecting anomalies, and projecting future trends based on large data sets. Machine learning for anomaly detection provides a promising alternative for the detection and classification of anomalies.
Find out how you can implement machine learning to increase speed and effectiveness in identifying and reporting anomalies.
In this webinar, we will discuss:
How machine learning can help in identifying anomalies
Steps to approach an anomaly detection problem
Various techniques available for anomaly detection
Best algorithms that fit in different situations
Implementing an anomaly detection use case on the StreamAnalytix platform
To view the webinar - https://bit.ly/2IV2ahC
This presentation will cover topics such as: What is anomaly detection? What are the different types of data that may be used? What are the popular techniques that may be used to identify anomalies? What are the best practices in anomaly detection? What is the value of anomaly detection?
One of the determinants of a good anomaly detector is finding smart data representations that can easily evince deviations from the normal distribution. Traditional supervised approaches would require a strong assumption about what is normal and what is not, plus a non-negligible effort in labeling the training dataset. Deep auto-encoders work very well at learning high-level abstractions and non-linear relationships in the data without requiring data labels. In this talk we will review a few popular techniques used in shallow machine learning and propose two semi-supervised approaches for novelty detection: one based on reconstruction error and another based on lower-dimensional feature compression.
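The reconstruction-error idea above can be sketched with a linear auto-encoder, which is mathematically equivalent to PCA. The sketch below is an illustrative NumPy toy on invented data, not the speaker's implementation; the one-dimensional bottleneck and 99th-percentile threshold are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: "normal" points clustered near the line y = x.
t = rng.normal(0.0, 1.0, size=200)
normal = np.column_stack([t, t + rng.normal(0.0, 0.1, size=200)])

# A linear "auto-encoder" with a 1-D bottleneck is equivalent to PCA:
# encode = project onto the top principal component, decode = project back.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
component = vt[0]                      # learned 1-D code direction

def reconstruction_error(x):
    """Encode x to 1-D, decode it back, and return the squared error."""
    code = (x - mean) @ component              # encoder
    recon = mean + np.outer(code, component)   # decoder
    return ((x - recon) ** 2).sum(axis=1)

# Flag points whose error exceeds the 99th percentile of training errors.
threshold = np.percentile(reconstruction_error(normal), 99)

test = np.array([[1.0, 1.05],    # lies on the normal manifold
                 [2.0, -2.0]])   # far off it, so it should be flagged
errors = reconstruction_error(test)
print(errors > threshold)
```

A deep auto-encoder replaces the linear encode/decode pair with nonlinear networks, but the scoring logic (reconstruct, measure error, threshold) is the same.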
Unsupervised Anomaly Detection with Isolation Forest (Elena Sharova, PyData)
PyData London 2018
This talk will focus on the importance of correctly defining an anomaly when conducting anomaly detection using unsupervised machine learning. It will include a review of the Isolation Forest algorithm (Liu et al. 2008), and a demonstration of how this algorithm can be applied to transaction monitoring, specifically to detect money laundering.
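The isolation idea behind the algorithm can be sketched in plain Python: anomalies are easier to separate with random splits, so they end up with shorter average path lengths. This toy omits the paper's subsampling and average-path-length normalization, and the data is invented.

```python
import random
random.seed(42)

def build_tree(points, depth, max_depth):
    """Isolate points by recursively splitting on a random dimension and value."""
    if depth >= max_depth or len(points) <= 1:
        return ('leaf', len(points))
    dim = random.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ('leaf', len(points))
    split = random.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ('node', dim, split,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def path_length(tree, p, depth=0):
    """Number of splits needed before p lands in a leaf."""
    if tree[0] == 'leaf':
        return depth
    _, dim, split, left, right = tree
    return path_length(left if p[dim] < split else right, p, depth + 1)

# Normal points clustered around (0, 0), plus one distant outlier.
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)
forest = [build_tree(data + [outlier], 0, max_depth=10) for _ in range(50)]

def avg_path(p):
    return sum(path_length(t, p) for t in forest) / len(forest)

# Anomalies are isolated quickly, so their average path length is shorter.
print(avg_path(outlier) < avg_path((0.0, 0.0)))
```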
---
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
This presentation deals with the formal presentation of anomaly detection and outlier analysis, the types of anomalies and outliers, and different approaches to tackle anomaly detection problems.
Anomaly detection (or Outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and monitoring processes in various domains including energy, healthcare and finance. In this talk, we will introduce anomaly detection and discuss the various analytical and machine learning techniques used in this field. Through a case study, we will discuss how anomaly detection techniques could be applied to energy data sets. We will also demonstrate, using R and Apache Spark, an application to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
The Random Forest algorithm's widespread popularity stems from its user-friendly nature and adaptability, enabling it to tackle both classification and regression problems effectively. The algorithm's strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for various predictive tasks in machine learning.
One of the most important features of the Random Forest algorithm is that it can handle data sets containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It performs well on both classification and regression tasks. In this tutorial, we will understand the working of random forest and implement it on a classification task.
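The core bagging-and-voting idea can be illustrated in a few lines. The sketch below is a toy, not a full CART-based random forest: each "tree" is a one-level random stump trained on a bootstrap sample, and the data and labels are invented.

```python
import random
random.seed(1)

# Toy dataset: label is 1 when x0 + x1 > 0 (two continuous features).
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
y = [1 if a + b > 0 else 0 for a, b in X]

def train_stump(rows, labels):
    """One-level tree: random feature and threshold, oriented to fit the data."""
    f = random.randrange(2)
    t = random.uniform(-1, 1)
    pred = [1 if r[f] > t else 0 for r in rows]
    acc = sum(p == c for p, c in zip(pred, labels)) / len(labels)
    return (f, t, acc >= 0.5)   # flip the rule if the inverse fits better

def stump_predict(stump, x):
    f, t, keep = stump
    p = 1 if x[f] > t else 0
    return p if keep else 1 - p

# Forest: each stump is trained on a bootstrap resample of the data.
forest = []
for _ in range(101):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

def forest_predict(x):
    """Majority vote over all stumps in the forest."""
    votes = sum(stump_predict(s, x) for s in forest)
    return 1 if votes > len(forest) / 2 else 0

print(forest_predict((0.8, 0.7)), forest_predict((-0.8, -0.7)))
```

Each individual stump is a weak, noisy classifier, but the bootstrap-plus-voting ensemble is reliable on this toy problem, which is the intuition behind bagging.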
With the growth of computer networking, electronic commerce and web services, network security systems have become very important to protect information and networks against malicious usage or attacks. This report designs an Intrusion Detection System using two artificial neural networks: one for intrusion detection and another for attack classification.
This Naive Bayes Tutorial from Edureka will help you understand all the concepts of Naive Bayes classifier, use cases and how it can be used in the industry. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their concepts in Data Science and Machine Learning through Naive Bayes. Below are the topics covered in this tutorial:
1. What is Machine Learning?
2. Introduction to Classification
3. Classification Algorithms
4. What is Naive Bayes?
5. Use Cases of Naive Bayes
6. Demo – Employee Salary Prediction in R
Outlier analysis, Chapter 12, Data Mining: Concepts and Techniques (Ashikur Rahman)
This slide deck was prepared for a course of the Dept. of CSE, Islamic University of Technology (IUT).
Course: CSE 4739- Data Mining
This topic is based on:
Data Mining: Concepts and Techniques
Book by Jiawei Han
Chapter 12
Presentation at Advanced Intelligent Systems for Sustainable Development (AISSD 2021), 20-22 August 2021, organized by the Scientific Research Group in Egypt in collaboration with the Faculty of Computers and AI, Cairo University, and the Chinese University in Egypt.
Searching for Anomalies, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
*MLSEV 2020: Virtual Conference.
A Comparison of Non-Dictionary Based Approaches to Automate Cancer Detection Using Plaintext Medical Data, with Dr. Shaun Grannis, Dr. Brian Dixon, et al., presented at the Regenstrief WIP (7 Jan 2015)
Classification Using Decision Trees and Rules, Chapter 5 (monicafrancis71118)
Classification Using Decision
Trees and Rules
Chapter 5
Introduction
• Decision tree learners use a tree structure to model the relationships among the features and the potential outcomes.
• A structure of branching decisions leads to a final predicted class value.
• A decision begins at the root node, then is passed through decision nodes that require choices.
• Choices split the data across branches that indicate potential outcomes of a decision.
• The tree is terminated by leaf nodes that denote the action to be taken as the result of the series of decisions.
Decision Tree Example
Benefits
• The flowchart-like tree structure is not necessarily exclusively for the learner's internal use.
• The resulting structure is in a human-readable format.
• It provides insight into how and why the model works or doesn't work well for a particular task.
• Useful where the classification mechanism needs to be transparent for legal reasons, or where the results need to be shared with others to inform future business practices:
• Credit scoring models, where the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
• Marketing studies of customer behavior, such as satisfaction or churn, which will be shared with management or advertising agencies
• Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression
Applicability
• A widely used machine learning technique
• Can be applied to model almost any type of data, with excellent results
• Does not fit tasks where the data has a large number of nominal features with many levels, or a large number of numeric features.
• These result in a large number of decisions and an overly complex tree.
• Decision trees tend to overfit data, though this can be overcome by adjusting some simple parameters.
Divide and Conquer
• Decision trees are built using a heuristic called recursive partitioning.
• Known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets.
• It stops when the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met.
• The root node represents the entire dataset.
• The algorithm must choose a feature to split upon.
• It chooses the feature most predictive of the target class.
• The algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached.
Divide and Conquer
• Stopping Conditions
• All (or nearly all) of the examples at the node have the same class
• There are no remaining features to distinguish among the examples
• The tree has grown to a predefined size limit
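The recursive-partitioning procedure and the stopping conditions above can be sketched in plain Python. This toy uses a simple misclassification-count split criterion rather than the information gain or Gini measures used by real learners, and the data is invented.

```python
from collections import Counter

def majority(labels):
    """Final predicted class at a leaf: the most common label."""
    return Counter(labels).most_common(1)[0][0]

def misclass(labels):
    """How many examples the majority label would get wrong."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def build(rows, labels, depth=0, max_depth=3):
    # Stopping conditions: pure node, or predefined size (depth) limit.
    if len(set(labels)) == 1 or depth == max_depth:
        return majority(labels)
    best = None
    for f in range(len(rows[0])):                        # candidate features
        for t in sorted(set(r[f] for r in rows))[:-1]:   # candidate thresholds
            li = [i for i, r in enumerate(rows) if r[f] <= t]
            ri = [i for i, r in enumerate(rows) if r[f] > t]
            score = (misclass([labels[i] for i in li]) +
                     misclass([labels[i] for i in ri]))
            if best is None or score < best[0]:
                best = (score, f, t, li, ri)
    if best is None:       # no remaining feature distinguishes the examples
        return majority(labels)
    _, f, t, li, ri = best
    return (f, t,
            build([rows[i] for i in li], [labels[i] for i in li], depth + 1, max_depth),
            build([rows[i] for i in ri], [labels[i] for i in ri], depth + 1, max_depth))

def predict(tree, row):
    """Start at the root and follow decision nodes down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Tiny invented example: class 'B' when either coordinate is large.
rows = [(1, 1), (2, 8), (7, 2), (8, 9), (2, 2), (6, 7)]
labels = ['A', 'B', 'B', 'B', 'A', 'B']
tree = build(rows, labels)
print(predict(tree, (1, 2)), predict(tree, (9, 9)))
```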
Example
• Finding the potential of a movie: Box Office Bust, Mainstream Hit, or Critical Success
• Diagonal lines might have split the data even more cleanly.
• This is a limitation of the decision tree's knowledge representation, which uses axis-parallel splits.
Tester contribution to Testing Effectiveness: An Empirical Research (Natalia Juristo)
Software verification and validation might be improved by gaining knowledge of the variables influencing the effectiveness of testing techniques. Unless automated, testing is a human-intensive activity. It is not unreasonable to suppose then that the person applying a technique is a variable that affects effectiveness. We have conducted an empirical study on the extent to which tester and technique contribute to the effectiveness of two testing techniques (equivalence partitioning and branch testing). We seed three programs with several faults. Next we examine the theoretical effectiveness of the technique (measured as how likely a subject applying the technique strictly as prescribed is to generate a test case that exercises the fault).
To measure the observed effectiveness, we then conduct an observational study where the techniques are applied by master’s students. The difference between theoretical and observed effectiveness suggests the tester contribution. To determine the nature of this contribution, we conduct a qualitative empirical study in which we ask the students to explain why they do or do not detect each seeded fault. We have found that the human component can reduce or increase testing technique effectiveness. We have also found that testers do not contribute equally to the different techniques.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
1. Anomaly Detection: A Survey
by Varun Chandola, Arindam Banerjee, Vipin Kumar
Presenter: Jaehyeong Ahn
Department of Applied Statistics, Konkuk University
jayahn0104@gmail.com
2. Table of Contents
Introduction
• Rapid Introduction to Anomaly Detection Problem
• Different Aspects of Anomaly Detection Problem
Anomaly Detection Techniques
• Classification Based Anomaly Detection
• Nearest Neighbor Based Anomaly Detection
• Clustering Based Anomaly Detection
• Statistical Anomaly Detection
• Spectral Anomaly Detection
Conclusion
4. Rapid Introduction to Anomaly Detection Problem
What are Anomalies?
• Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior
Figure: Simple illustration of anomalies in a two-dimensional data set
• N1, N2: Normal regions; o1, o2, O3: Anomalies
5. Rapid Introduction to Anomaly Detection Problem
What are they called?
• Anomalies, Outliers, Discordant observations, Exceptions,
Aberrations, Surprises, Peculiarities, or Contaminants in different
application domains
What is Anomaly Detection?
• Anomaly Detection refers to the problem of finding patterns in data
that do not conform to expected behavior
6. Rapid Introduction to Anomaly Detection Problem
Various applications of Anomaly Detection
• fraud detection for credit card, insurance, or health care
• intrusion detection for cyber-security
• fault detection in safety critical systems
• military surveillance for enemy activities
Why is Detecting Anomalies Important?
• The importance of anomaly detection is due to the fact that, in spite of their small numbers, anomalies cause significant and critical issues
• For example, credit card fraud, cyber-intrusion, terrorist activity, or the breakdown of a system, etc.
7. Different Aspects of An Anomaly Detection
Problem
An anomaly detection problem can be characterized as a specific formulation by categorizing several different factors:
• Nature of Input Data
• Type of Anomaly
• Availability of Data Labels
• Output Type of Anomaly Detection Technique
8. Different Aspects of An Anomaly Detection
Problem
Nature of Input Data
Input data can be categorized based on the nature of attribute type and
the relationship among the data instances
• Attribute type
• Univariate vs Multivariate
• Categorical vs Continuous
• Same type vs Mixture of different types
• Relationship among data instances
• No relationship vs Related (e.g. sequence data, spatial data,
graph data)
9. Different Aspects of An Anomaly Detection
Problem
Type of Anomaly
An important aspect of an anomaly detection technique is the nature of
the desired anomaly.
Anomalies can be classified into three categories
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
10. Different Aspects of An Anomaly Detection
Problem
Point Anomalies
• If an individual data instance can be considered as anomalous with
respect to the rest of the data, then the instance is termed a point
anomaly
• This is the simplest type of anomaly and is the focus of majority of
research on anomaly detection
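One simple way to flag point anomalies (among many techniques covered later in the survey) is a z-score rule on invented data:

```python
# Flag point anomalies: values far from the mean in standard-deviation units.
data = [42, 45, 44, 43, 46, 44, 45, 43, 44, 120]   # one obvious point anomaly

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

# An individual instance is anomalous with respect to the rest of the data
# when its z-score exceeds a chosen cutoff (2 is arbitrary here).
anomalies = [x for x in data if abs(x - mean) / std > 2]
print(anomalies)   # [120]
```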
11. Different Aspects of An Anomaly Detection
Problem
Contextual Anomalies
• If a data instance is anomalous in a specific context, but not
otherwise, then it is termed a contextual anomaly
• Thus a data instance may be a contextual anomaly in a given context, while an identical data instance could be considered normal in a different context
• Most commonly explored in time-series data and spatial data
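A toy sketch of the contextual idea on invented monthly temperatures, where the calendar month serves as the context attribute: the same value is normal in one context and anomalous in another.

```python
from collections import defaultdict

# Monthly temperatures over three years. A reading of 24 is normal in July
# but anomalous in January: the month (context) decides, not the value alone.
temps = [('Jan', 2), ('Jul', 24), ('Jan', 3), ('Jul', 26), ('Jan', 24), ('Jul', 25)]

by_month = defaultdict(list)
for month, t in temps:
    by_month[month].append(t)

def median(xs):
    xs = sorted(xs)
    return xs[len(xs) // 2]

# A value is a contextual anomaly when it deviates strongly from the typical
# value for its own context (the threshold of 10 degrees is arbitrary here).
anomalies = [(m, t) for m, t in temps if abs(t - median(by_month[m])) > 10]
print(anomalies)   # only the January reading of 24, not the July ones
```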
12. Different Aspects of An Anomaly Detection
Problem
Collective Anomalies
• If a collection of related data instances is anomalous with
respect to the entire data set, it is termed a collective anomaly
• Thus the individual instances in a collective anomaly may not be
anomalies by themselves, but their occurrence together as a
collection is anomalous
13. Different Aspects of An Anomaly Detection
Problem
Data Labels
• Obtaining labeled data that is accurate is often prohibitively expensive
• Furthermore, getting a labeled set of anomalous data instances that covers all possible types of anomalous behavior is more difficult than getting labels for normal behavior (because anomalous behavior does not follow a standardized pattern)
• Based on the extent to which the labels are available, anomaly
detection techniques can operate in one of the three modes:
• Supervised Anomaly Detection
• Semi-Supervised Anomaly Detection
• Unsupervised Anomaly Detection
14. Different Aspects of An Anomaly Detection Problem
Supervised Anomaly Detection
• Requires full, accurate class labels
• The data set is typically extremely imbalanced
• Similar to solving an imbalanced classification problem
Semi-Supervised Anomaly Detection
• Requires labels only for the normal instances
• More widely applicable than supervised techniques
• Typical approach: build a model for the normal behavior, and use the
model to identify anomalies in the test data
Unsupervised Anomaly Detection
• Requires no label information, and is thus the most widely applicable
• Relies on the implicit assumption that anomalies are few and different
from the normal points (if this assumption does not hold, such
techniques suffer from a high false alarm rate)
15. Different Aspects of An Anomaly Detection Problem
Output of Anomaly Detection
The outputs produced by anomaly detection techniques are one of the
following two types:
Scores
• Scoring techniques assign an anomaly score to each instance in the
test data, depending on the degree to which that instance is
considered an anomaly
• Thus the outputs can be represented in a ranked form
Labels
• These techniques assign a label (normal or anomalous) to each test
instance
18. Classification Based
General Assumption
A classifier that can distinguish between normal and anomalous classes
can be learned in the given feature space
Basic Idea
It works similarly to a classification problem in supervised learning,
except that it typically obtains only normal class instances for training
• Training Phase
• Learns a classifier using the available labeled training data
• Testing Phase
• Classifies a test instance as normal or anomalous, using the
classifier
19. Classification Based
Two broad categories based on the availability of labels
• One-Class Classification based anomaly detection
• Assumption: all training instances have only one class label
• Learns a discriminative boundary around the normal instances
using a one-class classification algorithm
• Multi-Class Classification based anomaly detection
• Assumption: the training data contains labeled instances
belonging to multiple normal classes
• Learns discriminative boundaries around the normal class
instances
Both techniques identify a test instance as anomalous if it is not
classified as normal by the classifier
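The one-class setting above can be sketched with scikit-learn's OneClassSVM trained only on (assumed) normal points; the synthetic data and the kernel parameters (gamma, nu) are illustrative assumptions, not taken from the survey:

```python
# One-class classification sketch: learn a boundary around normal data,
# then flag test points that fall outside it as anomalous.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # normal instances only

X_test = np.array([[0.1, -0.2],   # close to the training mass
                   [6.0, 6.0]])   # far outside the learned boundary

clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)
pred = clf.predict(X_test)        # +1 = classified normal, -1 = anomalous
print(pred)
```

A test instance is declared anomalous exactly when it is not classified as normal, matching the rule stated above.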
20. Classification Based
Two broad categories based on the availability of labels
Figure: (a) One-Class Classification, (b) Multi-Class Classification
21. Classification Based
Various Classification-Based Anomaly Detection Techniques
• Neural Networks-Based
• Bayesian Networks-Based
• Support Vector Machines-Based
• · · ·
Computational Complexity
• It mainly depends on the classification algorithm being used
• The testing phase is usually very fast, since it uses a learned
model for classification
22. Classification Based
Advantages and Disadvantages
(+) It can make use of powerful algorithms that can distinguish between
instances belonging to different classes
(+) The testing phase is fast, since each test instance needs to be
compared against the precomputed model
(–) Requires the availability of accurate labels, which is often not
possible
(–) Returns a binary output, normal or anomalous, which can be a
disadvantage when a meaningful anomaly score (or ranking) is
desired
(some techniques obtain a probabilistic prediction score to overcome
this problem)
23. Nearest Neighbor Based
General Assumption
Normal data instances occur in dense neighborhoods, while anomalies
occur far from their closest neighbors
Basic Idea
• Lazy learning algorithm (the training stage merely stores the
training data)
• Training Phase
• Store training data
• Testing Phase
• Compute a specified distance between the test instance and all
training instances
• If the computed distance for a test instance is high, it is identified
as anomalous
24. Nearest Neighbor Based
Two Broad Categories of Nearest Neighbor Techniques
• Using Distance to kth Nearest Neighbor
• Use the distance of a data instance to its kth nearest neighbor
as the anomaly score
• Using Local Density of test instance
• Use the inverse of the local density of each data instance to
compute its anomaly score
25. Nearest Neighbor Based
Using Distance to kth Nearest Neighbor
• The anomaly score of a data instance is defined as its distance to
its kth nearest neighbor in a given data set
Figure: (a) Normal point, (b) Anomaly point
• Basic Extensions
• Modify the definition of the anomaly score (e.g. average, max, · · · )
• Use different distance measures to handle complex data types
• Improve the computational efficiency
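A minimal NumPy sketch of the distance-to-kth-nearest-neighbor score; the synthetic data, the function name, and the choice of Euclidean distance are illustrative assumptions:

```python
# kth-nearest-neighbor anomaly score: the distance from a test point to
# its k-th closest training point. Far-away points get high scores.
import numpy as np

def knn_anomaly_scores(X_train, X_test, k=3):
    # Pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)
    d.sort(axis=1)            # ascending distances per test point
    return d[:, k - 1]        # distance to the k-th nearest neighbor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 2))   # one dense "normal" cluster
X_test = np.array([[0.0, 0.0],        # inside the cluster -> low score
                   [10.0, 10.0]])     # far away -> high score
scores = knn_anomaly_scores(X_train, X_test, k=3)
print(scores)
```

The scores can then be ranked or thresholded, consistent with the score-based output type described earlier.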
26. Nearest Neighbor Based
Density-Based
• An instance that lies in a neighborhood with low density is declared
to be anomalous while an instance that lies in a dense neighborhood
is declared to be normal
Figure: (a) Normal point, (b) Anomaly point
• The density of an instance is usually estimated as k divided by the
radius of the hypersphere, centered at the given data instance, which
contains k other instances
27. Nearest Neighbor Based
Problem of Basic Nearest Neighbor Method
• If the data has regions of varying densities, the basic nearest
neighbor method fails to detect anomalies
Figure: Problem of Basic Nearest Neighbor method, retrieved from [2]
To handle this issue, Relative Density method has been proposed
28. Nearest Neighbor Based
Using Relative Density
Figure: Local Outlier Factor
• Basic Extensions
• Estimate the local density in a different way
• Handle more complex data types
• Improve the computational efficiency
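A relative-density sketch using scikit-learn's LocalOutlierFactor, which implements the LOF method of [2]; the synthetic clusters, the gap point, and the n_neighbors value are illustrative assumptions:

```python
# LOF compares each point's local density to that of its neighbors, so a
# point in a low-density gap next to a tight cluster is flagged even
# though its absolute k-NN distance is modest.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))    # tight cluster
sparse = rng.normal(loc=5.0, scale=2.0, size=(50, 2))   # diffuse cluster
outlier = np.array([[0.6, 0.6]])  # just outside the tight cluster
X = np.vstack([dense, sparse, outlier])

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)               # -1 marks detected outliers
scores = -lof.negative_outlier_factor_    # LOF >> 1 suggests an outlier
print(labels[-1], scores[-1])
```

This is exactly the varying-density situation where the basic nearest neighbor score fails but the relative density succeeds.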
29. Nearest Neighbor Based
Advantages and Disadvantages
(+) Unsupervised in nature
(+) Adapting nearest neighbor-based techniques to a different data type
is straightforward, and primarily requires defining an appropriate
distance measure for the given data
(–) If the general assumption does not conform to the given data set,
the detector fails to catch anomalies
(–) The computational complexity of the testing phase is a challenge,
O(N²), since it has to compute the distances between each test
instance and all training data points
(–) Performance greatly relies on a distance measure
30. Clustering Based
• Clustering is used to group similar data instances into clusters
• Clustering-based anomaly detection techniques can be
grouped into three categories based on their assumptions
• Normal data instances belong to a cluster in the data, while
anomalies do not belong to any cluster
• Normal data instances lie close to their closest cluster centroid,
while anomalies are far away from their closest cluster centroid
• Normal data instances belong to large and dense clusters,
while anomalies either belong to small or sparse clusters
31. Clustering Based
First Category
• Assumption
• Normal data instances belong to a cluster in the data, while
anomalies do not belong to any cluster
• Several clustering algorithms that do not force every data instance
to belong to a cluster can be used (e.g. DBSCAN, ROCK, SNN,
· · · )
• Disadvantage
• They are not optimized to find anomalies, since the main aim
of the underlying clustering algorithm is to find clusters
33. Clustering Based
Second Category
• Assumption
• Normal data instances lie close to their closest cluster centroid,
while anomalies are far away from their closest cluster centroid
• These techniques operate in two steps:
1 The data is clustered using a clustering algorithm
2 For each data instance, its distance to its closest cluster
centroid is calculated as its anomaly score
• Various clustering algorithms can be used (e.g. Self-Organizing
Maps (SOM), K-means Clustering, · · · )
• Disadvantage
• If the anomalies in the data form clusters by themselves, these
techniques will not be able to detect such anomalies
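The two-step procedure of this second category can be sketched with scikit-learn's KMeans; the synthetic two-cluster data and the choice of k=2 are illustrative assumptions:

```python
# Step 1: cluster the data with k-means.
# Step 2: score each instance by its distance to the closest centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(5, 5), scale=0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def centroid_scores(km, points):
    # Distance to the closest cluster centroid is the anomaly score.
    d = np.linalg.norm(points[:, None, :] - km.cluster_centers_[None, :, :],
                       axis=-1)
    return d.min(axis=1)

# A point near a centroid vs. a point between the two clusters
s = centroid_scores(km, np.array([[0.1, 0.1], [2.5, 2.5]]))
print(s)
```

As noted in the disadvantage above, this fails if the anomalies themselves form a cluster, since they would then get their own nearby centroid.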
35. Clustering Based
Third Category
• Assumption
• Normal data instances belong to large and dense clusters,
while anomalies either belong to small or sparse clusters
• Techniques based on this assumption declare instances belonging to
clusters whose size and/or density is below a certain threshold as
anomalous
• Several variations have been proposed for this category
• For example, Cluster-Based Local Outlier Factor (CBLOF)
captures the size of the cluster to which the data instance
belongs, as well as the distance of the data instance to its
cluster centroid
36. Clustering Based
Advantages and Disadvantages
(+) Can operate in an unsupervised mode
(+) Can be adapted to other complex data types by simply plugging in a
clustering algorithm that can handle the particular data type
(+) The testing phase is fast since the number of clusters against which
every test instance needs to be compared is a small constant
(–) Performance is highly dependent on the effectiveness of clustering
algorithms
(–) They are not optimized for anomaly detection
(–) The computational complexity for clustering the data is often a
bottleneck
37. Statistical Anomaly Detection
General Assumption
Normal data instances occur in high probability regions of a stochastic
model, while anomalies occur in the low probability regions of the
stochastic model
Basic Idea
• Fit a statistical model to the given data (usually for normal
behavior) and then apply a statistical inference test to determine if
an unseen instance belongs to this model or not
Two broad categories of statistical anomaly detection
• Parametric Techniques
• which assume the knowledge of the underlying distribution and
estimate the parameters from the given data
• Nonparametric Techniques
• which make no assumption about the underlying distribution,
and instead determine the model structure from the given data
38. Statistical Anomaly Detection
Parametric Techniques
• Assume that the normal data is generated by a parametric
distribution with parameters Θ and probability density function
f(x, Θ)
• The anomaly score of a test instance x is the inverse of the
probability density function f(x, Θ), where Θ is estimated from the
given data
• Alternatively, a statistical hypothesis test may be used
• The null hypothesis H: the data instance x has been
generated using the estimated distribution (with parameters Θ)
• If the statistical test rejects H, x is declared to be anomaly
• Since a statistical hypothesis test is associated with a test
statistic, the test statistic value can be obtained as an anomaly
score for x
39. Statistical Anomaly Detection
Parametric Techniques
• Basic Example: Gaussian Model-Based
• Assume that the data is generated from a Gaussian distribution
• The parameters are estimated using
Maximum Likelihood Estimates(MLE)
• The distance of a data instance to the estimated mean is the
anomaly score for that instance
• A certain threshold is applied to the anomaly scores to
determine the anomalies (e.g. µ±3σ, since this region contains
99.7% of the data)
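A minimal sketch of the Gaussian model-based example: fit µ and σ by maximum likelihood, then apply the µ ± 3σ threshold (the synthetic data is an illustrative assumption):

```python
# Gaussian model-based detection: MLE for mu and sigma, then flag
# points outside mu +/- 3*sigma (the central ~99.7% region).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=10.0, scale=2.0, size=1000)

mu, sigma = x.mean(), x.std()   # MLE estimates for a Gaussian

def is_anomaly(v, mu=mu, sigma=sigma):
    # Distance to the estimated mean, thresholded at 3 sigma
    return abs(v - mu) > 3 * sigma

print(is_anomaly(10.5), is_anomaly(25.0))  # False True
```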
40. Statistical Anomaly Detection
Nonparametric Techniques
• The model structure is not defined a priori, but is instead
determined from given data
• Basic Example: Histogram-Based
1 Build a histogram based on the different values in the training
data
2 Check if a test instance falls in any one of the bins of the
histogram
• The size of the bins used when building the histogram is key to
anomaly detection
• Alternatively, the (inverse) height of the bin that contains the
instance can be used as its anomaly score
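A histogram-based sketch with NumPy; the bin count and the convention of scoring by inverse bin height (treating values outside every bin as maximally anomalous) are illustrative assumptions:

```python
# Nonparametric histogram-based detection: build a histogram on the
# training data, then score a test value by the inverse height of the
# bin it falls into.
import numpy as np

rng = np.random.default_rng(5)
train = rng.normal(loc=0.0, scale=1.0, size=5000)
counts, edges = np.histogram(train, bins=30)

def hist_score(v):
    i = np.searchsorted(edges, v, side="right") - 1
    if i < 0 or i >= len(counts) or counts[i] == 0:
        return np.inf            # outside the histogram: anomalous
    return 1.0 / counts[i]       # sparse bin -> high anomaly score

# A central value scores far lower than a value in the tail
print(hist_score(0.0) < hist_score(3.4))  # True
```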
41. Statistical Anomaly Detection
Advantages and Disadvantages
(+) If the underlying data distribution assumption holds true, they
provide a statistically justifiable solution for anomaly detection
(+) The anomaly score is associated with a confidence interval, which
can be used as additional information
(–) They rely on the assumption that the data is generated from a
particular distribution.
(–) Choosing the best statistic is often not a straightforward task
42. Spectral Anomaly Detection
General Assumption
Data can be embedded into a lower dimensional subspace in which
normal instances and anomalies appear significantly different
Basic Idea
• Spectral techniques try to find an approximation of the data using a
combination of attributes that capture the bulk of the variability in
the data
• Thus these techniques seek to determine the subspaces in which the
anomalous instances can be easily identified
• Basic Example: Principal Component Analysis (PCA)
• Project the data into a lower dimensional space while
maximizing the variability
• Compute the distance between a data instance and the
subspace generated by PCA as the anomaly score
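A PCA reconstruction-error sketch with scikit-learn, in the spirit of reference [5]; the synthetic near-1-D data and the single retained component are illustrative assumptions:

```python
# Spectral detection via PCA: project onto the top component(s) and use
# the reconstruction error (distance to the PCA subspace) as the score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Normal data lies near a 1-D line embedded in 3-D space
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=1).fit(X)

def reconstruction_error(points):
    recon = pca.inverse_transform(pca.transform(points))
    return np.linalg.norm(points - recon, axis=1)

# A point on the line vs. a point off the subspace
e = reconstruction_error(np.array([[1.0, 2.0, -1.0],
                                   [1.0, -2.0, 1.0]]))
print(e)
```

This illustrates the general assumption: normal and anomalous instances look very different once the data is embedded in the lower dimensional subspace.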
44. Spectral Anomaly Detection
Advantages and Disadvantages
(+) These techniques automatically perform dimensionality reduction
and hence are suitable for handling high dimensional data
(+) Thus they can also be used as a preprocessing step before applying
any existing anomaly detection technique
(+) Can be used in unsupervised setting
(–) Useful only if the normal and anomalous instances are separable in
the lower dimensional embedding of the data
(–) High computational complexity
46. Conclusion
Relative Strengths and Weaknesses of Anomaly Detection
Techniques
• Each of the anomaly detection techniques discussed in the previous
sections has its own unique strengths and weaknesses
• Thus it is important to know which anomaly detection technique is
best suited for a given problem
Concluding Remarks
• We have discussed the ways in which an anomaly detection problem
can be characterized and have provided an overview of the various
techniques
• For each category, we have identified a unique assumption regarding
the notion of normal and anomalous data
• When applying a given technique to a particular domain, these
assumptions can be used as guidelines for application
47. Conclusion
The Contribution of this paper
• This survey is an attempt to provide a structured and broad
overview of extensive research on anomaly detection techniques
spanning multiple research areas and application domains
48. References
1 Chandola, Varun, Arindam Banerjee, and Vipin Kumar. ”Anomaly
detection: A survey.” ACM computing surveys (CSUR) 41, no. 3 (2009):
1-58.
2 Breunig, Markus M., Hans-Peter Kriegel, Raymond T. Ng, and Jörg
Sander. "LOF: identifying density-based local outliers." In Proceedings of
the 2000 ACM SIGMOD international conference on Management of
data, pp. 93-104. 2000.
3 Eik Loo, Chong, Mun Yong Ng, Christopher Leckie, and Marimuthu
Palaniswami. ”Intrusion detection for routing attacks in sensor networks.”
International Journal of Distributed Sensor Networks 2, no. 4 (2006):
313-332.
4 Anomaly Detection Using K means Clustering, Ashen Weerathunga,
https://ashenweerathunga.wordpress.com/2016/01/05/anomaly-
detection-using-k-means-clustering/
5 Anomaly detection using PCA reconstruction error,
https://stats.stackexchange.com/questions/259806/anomaly-detection-
using-pca-reconstruction-error