This document provides an overview of anomaly detection techniques. It defines anomalies as data points that are considerably different from most other points. Several causes of anomalies are discussed, including errors in data collection. Both model-based and model-free approaches to anomaly detection are described. Specific techniques covered include statistical approaches that assume a probability distribution for the data, distance-based approaches that measure distances to nearest neighbors, density-based approaches that measure local densities, and clustering-based approaches that identify outliers as points far from cluster centers. Strengths and weaknesses of each type of technique are also summarized.
Anomaly/Outlier Detection
What are anomalies/outliers?
The set of data points that are
considerably different than the
remainder of the data
Natural implication is that anomalies are relatively rare
One in a thousand occurs often if you have lots of data
Context is important, e.g., freezing temps in July
Can be important or a nuisance
10 foot tall 2 year old
Unusually high blood pressure
9/29/2019
Introduction to Data Mining, 2nd Edition
1
Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
Data from different classes
Measuring the weights of oranges, but a few grapefruit are mixed in
Natural variation
Unusually tall people
Data errors
200 pound 2 year old
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or contaminating objects
Weight recorded incorrectly
Grapefruit mixed in with the oranges
Noise doesn’t necessarily produce unusual values or objects
Noise is not interesting
Anomalies may be interesting if they are not a result of noise
Noise and anomalies are related but distinct concepts
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute
Height
Shape
Color
Can be hard to find an anomaly using all attributes
Noisy or irrelevant attributes
Object is only anomalous with respect to some attributes
However, an object may not be anomalous in any one attribute
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary categorization
An object is an anomaly or it isn’t
This is especially true of classification-based approaches
Other approaches assign a score to all points
This score measures the degree to which an object is an anomaly
This allows objects to be ranked
In the end, you often need a binary decision
Should this credit card transaction be flagged?
Still useful to have a score
How many anomalies are there?
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time
Swamping
Masking
Evaluation
How do you measure performance?
Supervised vs. unsupervised situations
Efficiency
Context
Professional basketball team
chap9_anomaly_detection.pptx
1. Anomaly Detection
Lecture Notes for Chapter 9
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
4/12/2021
Introduction to Data Mining, 2nd Edition Tan,
Steinbach, Karpatne, Kumar
1
2. Anomaly/Outlier Detection
What are anomalies/outliers?
– The set of data points that are
considerably different than the
remainder of the data
Natural implication is that
anomalies are relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July
Can be important or a nuisance
– Unusually high blood pressure
– 200 pound, 2 year old
3. Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels
Why did the Nimbus 7 satellite,
which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?
The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!
Source: http://www.epa.gov/ozone/science/hole/size.html
4. Causes of Anomalies
Data from different classes
– Measuring the weights of oranges, but a few grapefruit
are mixed in
Natural variation
– Unusually tall people
Data errors
– 200 pound 2 year old
5. Distinction Between Noise and Anomalies
Noise doesn’t necessarily produce unusual values or
objects
Noise is not interesting
Noise and anomalies are related but distinct concepts
6. Model-based vs Model-free
Model-based Approaches
Model can be parametric or non-parametric
Anomalies are those points that don’t fit well
Anomalies are those points that distort the model
Model-free Approaches
Anomalies are identified directly from the data without
building a model
Often the underlying assumption is that
most of the points in the data are normal
7. General Issues: Label vs Score
Some anomaly detection techniques provide only a
binary categorization
Other approaches measure the degree to which an
object is an anomaly
– This allows objects to be ranked
– Scores can also have associated meaning (e.g., statistical
significance)
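The difference between a label and a score can be sketched in a few lines of Python. The function names, the example scores, the threshold of 1.0, and the choice of n = 1 below are all hypothetical; the point is only that a score can be turned into a ranking or into a binary label, but not the other way around.

```python
def rank_by_score(scores):
    """Indices of the objects, most anomalous (highest score) first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def label_by_threshold(scores, threshold):
    """Binary labels: True where the score exceeds the threshold."""
    return [s > threshold for s in scores]

def label_top_n(scores, n):
    """Binary labels: True for the n highest-scoring objects."""
    top = set(rank_by_score(scores)[:n])
    return [i in top for i in range(len(scores))]

# Hypothetical anomaly scores for five objects.
scores = [0.2, 3.1, 0.4, 1.7, 0.3]
ranking = rank_by_score(scores)
by_threshold = label_by_threshold(scores, threshold=1.0)
top_one = label_top_n(scores, n=1)
```

Thresholding and top-n labeling can disagree on the same scores, which is one reason keeping the score is useful even when a binary decision is eventually needed.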
8. Anomaly Detection Techniques
Statistical Approaches
Proximity-based
– Anomalies are points far away from other points
Clustering-based
– Points far away from cluster centers are outliers
– Small clusters are outliers
Reconstruction-based
9. Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
Issues
– Identifying the distribution of a data set
Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
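As a minimal sketch of the probabilistic definition, assume univariate data and a normal model whose mean and standard deviation are estimated from the data itself. The function names, the sample weights, and the threshold of 2.5 are hypothetical. Each value gets a |z|-score and the two-sided tail probability of a value at least that extreme under the fitted model.

```python
import statistics

def gaussian_outlier_scores(values):
    """Return (z, p) per value: the |z|-score and the two-sided tail
    probability under a normal model fitted to the same values."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)            # sample standard deviation
    standard_normal = statistics.NormalDist() # mean 0, standard deviation 1
    scores = []
    for v in values:
        z = abs(v - mean) / std
        p = 2.0 * (1.0 - standard_normal.cdf(z))  # P(|Z| >= z)
        scores.append((z, p))
    return scores

def flag_outliers(values, z_threshold):
    """Values whose |z|-score exceeds the (hypothetical) threshold."""
    scored = gaussian_outlier_scores(values)
    return [v for v, (z, _) in zip(values, scored) if z > z_threshold]

# Hypothetical weights: ten readings near 10.0 and one suspicious reading.
weights = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]
flagged = flag_outliers(weights, z_threshold=2.5)
```

In this data the reading of 25.0 inflates the estimated standard deviation to about 4.5, so its own z-score is only slightly above 3. That is a small instance of anomalies distorting the parameters of the fitted distribution.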
11. Grubbs’ Test
Detect outliers in univariate data
Assume data comes from normal distribution
Detects one outlier at a time, remove the outlier,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
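Grubbs’ test statistic:
$$G = \frac{\max_i |X_i - \bar{X}|}{s}$$
Reject H0 if:
$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
where $\bar{X}$ is the sample mean, $s$ is the sample standard deviation, $N$ is the number of points, and $t_{(\alpha/N,\,N-2)}$ is the critical value of the t-distribution with $N-2$ degrees of freedom at significance level $\alpha/N$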
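One step of Grubbs' test can be sketched as follows. The function names and sample data are hypothetical, and the critical t value is passed in by the caller because the Python standard library has no t-distribution quantile function. The value 3.3 used below is only a placeholder, not a looked-up table entry.

```python
import math
import statistics

def grubbs_statistic(values):
    """Return (G, index), where G = max |X_i - mean| / s and index is the
    position of the value farthest from the mean."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / s, idx

def grubbs_critical_value(n, t_crit):
    """Rejection threshold for G, given the critical t value with n - 2
    degrees of freedom (taken from a table or a statistics library)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Hypothetical data: ten readings near 10.0 and one suspicious reading.
weights = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]
G, suspect = grubbs_statistic(weights)
threshold = grubbs_critical_value(len(weights), t_crit=3.3)  # placeholder t value
reject_h0 = G > threshold
```

To detect several outliers, the flagged value would be removed and the test repeated on the remaining data with N reduced by one.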
12. Statistically-based – Likelihood Approach
Assume the data set D contains samples from a
mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
Let Lt+1 (D) be the new log likelihood.
Compute the difference, Δ = Lt(D) – Lt+1(D)
If Δ > c (some threshold), then xt is declared as an anomaly
and moved permanently from M to A
13. Statistically-based – Likelihood Approach
Data distribution, D = (1 – λ) M + λ A
M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc.)
A is initially assumed to be a uniform distribution
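Likelihood at time t:
$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$
$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$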
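The likelihood approach can be sketched for univariate data under several assumptions that are not part of the slides. M is modeled as a normal distribution re-estimated from its current members, A is uniform over the observed range, and the mixing weight λ = 0.05 and threshold c = 1.0 are arbitrary. The log likelihood is the sum of |M| log(1 − λ), the log densities of the points in M, |A| log λ, and the log densities of the points in A. The sketch accepts a move when it raises the log likelihood by more than c (new value minus old value). Treat that sign convention as an assumption of this sketch.

```python
import math
import statistics

def log_likelihood(m_points, a_points, lam, uniform_density):
    """Mixture log likelihood with a normal model for M and a uniform model for A."""
    mu = statistics.fmean(m_points)
    sigma = statistics.pstdev(m_points)  # maximum-likelihood estimate; must be > 0
    ll = len(m_points) * math.log(1.0 - lam)
    ll += sum(-0.5 * math.log(2.0 * math.pi * sigma**2)
              - (x - mu) ** 2 / (2.0 * sigma**2) for x in m_points)
    ll += len(a_points) * (math.log(lam) + math.log(uniform_density))
    return ll

def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Move a point from M to A when doing so raises the log likelihood by more than c."""
    uniform_density = 1.0 / (max(data) - min(data))
    m, a = list(data), []
    ll = log_likelihood(m, a, lam, uniform_density)
    for x in data:
        m_try = list(m)
        m_try.remove(x)        # tentatively move x out of M ...
        a_try = a + [x]        # ... and into A
        ll_try = log_likelihood(m_try, a_try, lam, uniform_density)
        if ll_try - ll > c:    # keep the move only if the gain exceeds c
            m, a, ll = m_try, a_try, ll_try
    return a

# Hypothetical data: ten readings near 10.0 and one suspicious reading.
weights = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]
anomalies = likelihood_anomalies(weights)
```

Moving an ordinary reading into A lowers the log likelihood here, because the uniform density is far below what the normal model assigns to it. Moving the reading of 25.0 lets the normal model tighten around the remaining points, which raises the log likelihood sharply.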
14. Strengths/Weaknesses of Statistical Approaches
Firm mathematical foundation
Can be very efficient
Good results if distribution is known
In many cases, data distribution may not be known
For high dimensional data, it may be difficult to estimate
the true distribution
Anomalies can distort the parameters of the distribution
15. Distance-Based Approaches
The outlier score of an object is the distance to
its kth nearest neighbor
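A brute-force sketch of this score, assuming Euclidean distance and points given as coordinate tuples (the function name and the sample points are hypothetical):

```python
import math

def knn_outlier_scores(points, k):
    """Outlier score of each point: distance to its kth nearest neighbor.
    Brute force over all pairs, so the cost is quadratic in the number of points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a tight group of four points and one isolated point.
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
one_outlier = square + [(5, 5)]
scores_k1 = knn_outlier_scores(one_outlier, k=1)

# Two isolated points close to each other hide one another when k = 1.
two_outliers = square + [(5, 5), (5, 5.1)]
pair_k1 = knn_outlier_scores(two_outliers, k=1)
pair_k2 = knn_outlier_scores(two_outliers, k=2)
```

With k = 1 the two neighboring outliers each receive a score of about 0.1, lower than the points in the square. With k = 2 their scores rise above 5, which illustrates the sensitivity to the choice of k.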
16. One Nearest Neighbor - One Outlier
[Figure: data set with point D labeled; scale: Outlier Score]
17. One Nearest Neighbor - Two Outliers
[Figure: data set with point D labeled; scale: Outlier Score]
18. Five Nearest Neighbors - Small Cluster
[Figure: data set with point D labeled; scale: Outlier Score]
19. Five Nearest Neighbors - Differing Density
[Figure: data set with point D labeled; scale: Outlier Score]
20. Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive – O(n²)
Sensitive to parameters
Sensitive to variations in density
Distance becomes less meaningful in high-
dimensional space
21. Density-Based Approaches
Density-based Outlier: The outlier score of an
object is the inverse of the density around the
object.
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k
neighbors
– DBSCAN definition
If there are regions of different density, this
approach can have problems
22. Relative Density
Consider the density of a point relative to that of
its k nearest neighbors
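Let $\mathbf{y}_1, \dots, \mathbf{y}_k$ be the $k$ nearest neighbors of $\mathbf{x}$
$\mathrm{density}(\mathbf{x}, k) = \dfrac{1}{\mathrm{dist}(\mathbf{x}, k)} = \dfrac{1}{\mathrm{dist}(\mathbf{x}, \mathbf{y}_k)}$
$\text{relative density}(\mathbf{x}, k) = \dfrac{\sum_{i=1}^{k} \mathrm{density}(\mathbf{y}_i, k)/k}{\mathrm{density}(\mathbf{x}, k)} \approx \dfrac{\mathrm{dist}(\mathbf{x}, k)}{\sum_{i=1}^{k} \mathrm{dist}(\mathbf{y}_i, k)/k} = \dfrac{\mathrm{dist}(\mathbf{x}, \mathbf{y}_k)}{\sum_{i=1}^{k} \mathrm{dist}(\mathbf{y}_i, k)/k}$
The middle step replaces the mean of the inverse distances with the inverse of the mean distance, so it is exact only when all neighbors have the same kth-neighbor distance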
Can use average distance instead
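A small sketch of the relative density score as defined on this slide (average density of the k nearest neighbors divided by the density of the point itself). The helper names and the toy data are assumptions for illustration: a dense grid, a sparse grid, and one point that sits just outside the dense grid.

```python
import math

def kth_neighbor_info(points, i, k):
    """Return (indices of the k nearest neighbors of points[i], distance to the k-th)."""
    order = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    nearest = order[:k]
    return [j for _, j in nearest], nearest[-1][0]

def density(points, i, k):
    """Inverse of the distance to the k-th nearest neighbor."""
    return 1.0 / kth_neighbor_info(points, i, k)[1]

def relative_density_score(points, i, k):
    """Average density of the neighbors divided by the density of the point."""
    neighbors, _ = kth_neighbor_info(points, i, k)
    avg_neighbor_density = sum(density(points, j, k) for j in neighbors) / k
    return avg_neighbor_density / density(points, i, k)

# Hypothetical data: dense 3x3 grid (spacing 0.1), sparse 3x3 grid (spacing 1.0),
# and one point X that lies 0.5 away from the dense grid.
dense = [(0.1 * a, 0.1 * b) for a in range(3) for b in range(3)]
sparse = [(10.0 + a, 10.0 + b) for a in range(3) for b in range(3)]
points = dense + sparse + [(0.7, 0.1)]
x_index = len(points) - 1

k = 3
x_score = relative_density_score(points, x_index, k)
sparse_scores = [relative_density_score(points, i, k) for i in range(9, 18)]
# Plain k-NN distance ranks X below the sparse-grid points; relative density does not.
x_knn_dist = kth_neighbor_info(points, x_index, k)[1]
sparse_knn_dist = kth_neighbor_info(points, 13, k)[1]
```

In this toy setup the point next to the dense grid gets a much higher relative density score than any point of the sparse grid, even though its raw k-nearest-neighbor distance is smaller.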
23. Relative Density Outlier Scores
[Figure: points shaded by outlier score (scale 1 to 6); labeled points A, C, and D; annotated scores 6.85, 1.33, and 1.40]
24. Relative Density-based: LOF approach
For each point, compute the density of its local neighborhood
Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the densities of p's nearest neighbors to the density of p
Outliers are the points with the largest LOF values
[Figure: data set with two labeled points, p1 and p2]
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
25. Strengths/Weaknesses of Density-Based Approaches
Simple
Expensive – O(n^2)
Sensitive to parameters
Density becomes less meaningful in high-dimensional space
26. Clustering-Based Approaches
An object is a cluster-based outlier if it does not strongly belong to any cluster
– For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
   Outliers can impact the clustering produced
– For density-based clusters, an object is an outlier if its density is too low
   Can’t distinguish between noise and outliers
– For graph-based clusters, an object is an outlier if it is not well connected
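For the prototype-based case, a minimal sketch: cluster with a few plain k-means (Lloyd) iterations, then score each point by its distance to the closest centroid. The starting centroids, the toy points, and the function names are assumptions made for the example, not part of the slides.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations from the given starting centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def centroid_distance_scores(points, centroids):
    """Outlier score: distance from each point to its closest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Hypothetical data: two compact groups and one point (index 8) between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9), (4, 4)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (9.0, 9.0)])
scores = centroid_distance_scores(points, centroids)
most_anomalous = scores.index(max(scores))
```

Note that the in-between point is assigned to the first cluster and pulls its centroid away from the compact group, a small instance of outliers affecting the clustering produced.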
27. Distance of Points from Closest Centroids
[Figure: points shaded by outlier score (scale 0.5 to 4.5); labeled points D, C, and A; annotated scores 1.2, 0.17, and 4.6]
28. Relative Distance of Points from Closest Centroid
[Figure: points shaded by outlier score (scale 0.5 to 4)]
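One way to make the centroid distance relative, assumed here because the slide gives only the title, is to divide each point's distance to its closest centroid by the median distance of the points assigned to that centroid. The sketch below uses fixed, hypothetical centroids and toy data; it does not handle a cluster whose median distance is zero.

```python
import math
from statistics import median

def relative_centroid_scores(points, centroids):
    """Distance to the closest centroid divided by that cluster's median distance."""
    assigned = []
    for p in points:
        # Closest centroid and the distance to it.
        d, c = min((math.dist(p, cen), idx) for idx, cen in enumerate(centroids))
        assigned.append((c, d))
    # Median distance to the centroid within each non-empty cluster.
    med = {}
    for idx in range(len(centroids)):
        ds = [d for c, d in assigned if c == idx]
        if ds:
            med[idx] = median(ds)
    return [d / med[c] for c, d in assigned]

# Hypothetical data: a compact cluster around (0, 0) with one stray point (index 4),
# and a diffuse cluster around (10, 10).
points = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.5, 0.0),
          (12.0, 10.0), (8.0, 10.0), (10.0, 12.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed fixed for the example
relative = relative_centroid_scores(points, centroids)
raw = [min(math.dist(p, c) for c in centroids) for p in points]
```

Here the stray point has a smaller raw distance than every point of the diffuse cluster but the largest relative distance, which is the adjustment for differing cluster densities.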
29. Strengths/Weaknesses of Clustering-Based Approaches
Simple
Many clustering techniques can be used
Can be difficult to decide on a clustering technique
Can be difficult to decide on number of clusters
Outliers can distort the clusters
30. Reconstruction-Based Approaches
Based on the assumption that there are patterns in the distribution of the normal class that can be captured using lower-dimensional representations
Reduce the data to a lower-dimensional representation
– E.g., use Principal Components Analysis (PCA) or autoencoders
Measure the reconstruction error for each object
– The difference between the original object and its reconstruction from the reduced-dimensionality version
31. Reconstruction Error
Let $\mathbf{x}$ be the original data object
Find the representation of the object in a lower-dimensional space
Project the object back to the original space
Call this object $\hat{\mathbf{x}}$
Reconstruction Error($\mathbf{x}$) = $\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert$
Objects with large reconstruction errors are anomalies
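The steps above can be sketched with a one-component PCA written in plain Python. This is an illustration under assumptions: the leading principal direction is found by power iteration, the data set is a made-up toy example, and the function names are hypothetical.

```python
import math

def top_principal_direction(centered, iterations=100):
    """Leading eigenvector of the covariance matrix via power iteration."""
    dim = len(centered[0])
    n = len(centered)
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(dim)]
           for a in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def reconstruction_errors(data):
    """Euclidean norm of x - x_hat after projecting onto one principal direction."""
    dim = len(data[0])
    mean = [sum(r[a] for r in data) / len(data) for a in range(dim)]
    centered = [[r[a] - mean[a] for a in range(dim)] for r in data]
    v = top_principal_direction(centered)
    errors = []
    for r in centered:
        coord = sum(r[a] * v[a] for a in range(dim))   # lower-dimensional representation
        recon = [coord * v[a] for a in range(dim)]     # projected back (still centered)
        # The mean cancels in x - x_hat, so the centered difference is enough.
        errors.append(math.sqrt(sum((r[a] - recon[a]) ** 2 for a in range(dim))))
    return errors

# Hypothetical data: five points on the line y = x and one point off the line (index 5).
data = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (3, 0)]
errors = reconstruction_errors(data)
most_anomalous = errors.index(max(errors))
```

Because the off-line point is part of the data used to fit the direction, it also tilts the fitted component slightly, so the on-line points end up with small non-zero errors.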
33. Basic Architecture of an Autoencoder
An autoencoder is a multi-layer neural network
The number of input and output neurons is equal
to the number of original attributes.
34. Strengths and Weaknesses
Does not require assumptions about distribution
of normal class
Can use many dimensionality reduction
approaches
The reconstruction error is computed in the
original space
– This can be a problem if dimensionality is high
35. One Class SVM
Uses an SVM approach to classify normal objects
Uses the given data to construct such a model
This data may contain outliers
But the data does not contain class labels
How to build a classifier given one class?
36. How Does One-Class SVM Work?
Uses the “origin” trick
Use a Gaussian kernel
– Every point mapped to a unit hypersphere
– Every point in the same orthant (quadrant)
Aim to maximize the distance of the separating
plane from the origin
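The two Gaussian-kernel properties listed above follow directly from the kernel definition. The short derivation below assumes the usual form of the Gaussian kernel with width parameter $\sigma$ (the slide does not give the formula).

```latex
K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle
  = \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{y} \rVert^2}{2\sigma^2} \right)

% Unit hypersphere: every mapped point has norm 1.
\lVert \phi(\mathbf{x}) \rVert^2 = K(\mathbf{x}, \mathbf{x}) = \exp(0) = 1

% Positive inner products: the angle between any two mapped points is below 90 degrees.
\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y}) > 0
```

Since all mapped points lie on the unit hypersphere with pairwise angles below 90 degrees, a hyperplane can separate them from the origin, and the method looks for the one farthest from the origin.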
37. Two-dimensional One Class SVM
38. Equations for One-Class SVM
Equation of hyperplane
𝜙 is the mapping to high dimensional space
Weight vector is
ν is (an upper bound on) the fraction of outliers
Optimization condition is the following
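The equations themselves are not reproduced in this transcript. As a hedged reference, the standard one-class SVM formulation (Schölkopf et al.) reads as follows; the notation ($\rho$ for the offset, $\alpha_i$ for the dual coefficients, $\xi_i$ for the slack variables, $n$ for the number of objects) is an assumption and may differ from the slide.

```latex
% Hyperplane in the mapped space
\langle \mathbf{w}, \phi(\mathbf{x}) \rangle = \rho

% Weight vector as a combination of mapped training objects
\mathbf{w} = \sum_{i=1}^{n} \alpha_i \, \phi(\mathbf{x}_i)

% Optimization problem
\min_{\mathbf{w},\, \boldsymbol{\xi},\, \rho} \;
  \frac{1}{2} \lVert \mathbf{w} \rVert^2
  + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho
\quad \text{subject to} \quad
  \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle \ge \rho - \xi_i, \;\; \xi_i \ge 0

% An object is flagged as an outlier when it falls on the origin side of the hyperplane
f(\mathbf{x}) = \operatorname{sign}\!\left( \langle \mathbf{w}, \phi(\mathbf{x}) \rangle - \rho \right)
```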
39. Finding Outliers with a One-Class SVM
Decision boundary with 𝜈 = 0.1
40. Finding Outliers with a One-Class SVM
Decision boundary with 𝜈 = 0.05 and 𝜈 = 0.2
41. Strengths and Weaknesses
Strong theoretical foundation
Choice of ν is difficult
Computationally expensive
42. Information Theoretic Approaches
Key idea is to measure how much information
decreases when you delete an observation
Anomalies should show higher gain
Normal points should have less gain
43. Information Theoretic Example
Survey of height and weight for 100 participants
Eliminating the last group gives a gain of
2.08 − 1.89 = 0.19
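The survey table is not reproduced in this transcript, so the sketch below uses assumed group sizes, chosen only because they sum to 100 and reproduce the entropies 2.08 and 1.89 quoted above. It treats "information" as the Shannon entropy (in bits) of the group frequencies, which is also an assumption.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of the frequency distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Assumed (hypothetical) group sizes for the 100 participants; the last group is the small one.
counts = [20, 15, 40, 20, 5]

h_all = entropy(counts)                 # entropy with every group present
h_without_last = entropy(counts[:-1])   # entropy after deleting the last group
gain = h_all - h_without_last
print(round(h_all, 2), round(h_without_last, 2), round(gain, 2))
```

Deleting a small, unusual group lowers the entropy noticeably, while deleting a few participants from one of the large groups would change it much less.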
44. Strengths and Weaknesses
Solid theoretical foundation
Theoretically applicable to all kinds of data
Difficult and computationally expensive to
implement in practice
45. Evaluation of Anomaly Detection
If class labels are present, then use standard
evaluation approaches for rare class such as
precision, recall, or false positive rate
– FPR is also known as the false alarm rate
For unsupervised anomaly detection use
measures provided by the anomaly method
– E.g. reconstruction error or gain
Can also look at histograms of anomaly scores.
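When labels are available, the rare-class measures named above can be computed from the confusion counts. A minimal sketch, assuming label 1 marks an anomaly and 0 a normal object; the labels and predictions are made up for the example.

```python
def rare_class_metrics(y_true, y_pred):
    """Return (precision, recall, false positive rate) with 1 = anomaly, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false alarm rate
    return precision, recall, fpr

# Hypothetical labels: 2 anomalies among 10 objects; the detector flags 2 objects.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
precision, recall, fpr = rare_class_metrics(y_true, y_pred)
```

Accuracy is deliberately left out: with rare anomalies, a detector that flags nothing would still look accurate.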
46. Distribution of Anomaly Scores
Anomaly scores should show a tail