SlideShare a Scribd company logo
1 of 24
Anomaly/Outlier Detection
What are anomalies/outliers?
The set of data points that are
considerably different than the
remainder of the data
Natural implication is that anomalies are relatively rare
One in a thousand occurs often if you have lots of data
Context is important, e.g., freezing temps in July
Can be important or a nuisance
10 foot tall 2 year old
Unusually high blood pressure
9/29/2019
Introduction to Data Mining, 2nd Edition
1
Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman, Gardinar and Shanklin) were
puzzled by data gathered by the British Antarctic Survey
showing that ozone levels for Antarctica had dropped 10%
below normal levels
Why did the Nimbus 7 satellite, which had instruments aboard
for recording ozone levels, not record similarly low ozone
concentrations?
The ozone concentrations recorded by the satellite were so low
they were being treated as outliers by a computer program and
discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
9/29/2019
Introduction to Data Mining, 2nd Edition
2
Causes of Anomalies
Data from different classes
Measuring the weights of oranges, but a few grapefruit are
mixed in
Natural variation
Unusually tall people
Data errors
200 pound 2 year old
9/29/2019
Introduction to Data Mining, 2nd Edition
3
Distinction Between Noise and Anomalies
Noise is erroneous, perhaps random, values or contaminating
objects
Weight recorded incorrectly
Grapefruit mixed in with the oranges
Noise doesn’t necessarily produce unusual values or objects
Noise is not interesting
Anomalies may be interesting if they are not a result of noise
Noise and anomalies are related but distinct concepts
9/29/2019
Introduction to Data Mining, 2nd Edition
4
General Issues: Number of Attributes
Many anomalies are defined in terms of a single attribute
Height
Shape
Color
Can be hard to find an anomaly using all attributes
Noisy or irrelevant attributes
Object is only anomalous with respect to some attributes
However, an object may not be anomalous in any one attribute
9/29/2019
Introduction to Data Mining, 2nd Edition
5
General Issues: Anomaly Scoring
Many anomaly detection techniques provide only a binary
categorization
An object is an anomaly or it isn’t
This is especially true of classification-based approaches
Other approaches assign a score to all points
This score measures the degree to which an object is an
anomaly
This allows objects to be ranked
In the end, you often need a binary decision
Should this credit card transaction be flagged?
Still useful to have a score
How many anomalies are there?
9/29/2019
Introduction to Data Mining, 2nd Edition
6
Other Issues for Anomaly Detection
Find all anomalies at once or one at a time
Swamping
Masking
Evaluation
How do you measure performance?
Supervised vs. unsupervised situations
Efficiency
Context
Professional basketball team
9/29/2019
Introduction to Data Mining, 2nd Edition
7
Variants of Anomaly Detection Problems
scores greater than some threshold t
-n
largest anomaly scores
Given a data set D, containing mostly normal (but unlabeled)
data points, and a test point x, compute the anomaly score of x
with respect to D
9/29/2019
Introduction to Data Mining, 2nd Edition
8
Model-Based Anomaly Detection
Build a model for the data and see
Unsupervised
Anomalies are those points that don’t fit well
Anomalies are those points that distort the model
Examples:
Statistical distribution
Clusters
Regression
Geometric
Graph
Supervised
Anomalies are regarded as a rare class
Need to have training data
9/29/2019
Introduction to Data Mining, 2nd Edition
9
Additional Anomaly Detection Techniques
Proximity-based
Anomalies are points far away from other points
Can detect this graphically in some cases
Density-based
Low density points are outliers
Pattern matching
Create profiles or templates of atypical but important events or
objects
Algorithms to detect these patterns are usually simple and
efficient
9/29/2019
Introduction to Data Mining, 2nd Edition
10
Visual Approaches
Boxplots or scatter plots
Limitations
Not automatic
Subjective
9/29/2019
Introduction to Data Mining, 2nd Edition
11
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
Apply a statistical test that depends on
Data distribution
Parameters of distribution (e.g., mean, variance)
Number of expected outliers (confidence limit)
Issues
Identifying the distribution of a data set
Heavy tailed distribution
Number of attributes
Is the data a mixture of distributions?
9/29/2019
Introduction to Data Mining, 2nd Edition
12
Normal Distributions
One-dimensional Gaussian
Two-dimensional Gaussian
9/29/2019
Introduction to Data Mining, 2nd Edition
13
Grubbs’ Test
Detect outliers in univariate data
Assume data comes from normal distribution
Detects one outlier at a time, remove the outlier, and repeat
H0: There is no outlier in data
HA: There is at least one outlier
Grubbs’ test statistic:
Reject H0 if:
9/29/2019
Introduction to Data Mining, 2nd Edition
14
Statistical-based – Likelihood Approach
Assume the data set D contains samples from a mixture of two
probability distributions:
M (majority distribution)
A (anomalous distribution)
General Approach:
Initially, assume all the data points belong to M
Let Lt(D) be the log likelihood of D at time t
For each point xt that belongs to M, move it to A
Let Lt+1 (D) be the new log likelihood.
– Lt+1 (D)
and moved permanently from M to A
9/29/2019
Introduction to Data Mining, 2nd Edition
15
Statistical-based – Likelihood Approach
Data distribution, D = (1 –
M is a probability distribution estimated from data
Can be based on any modeling method (naïve Bayes, maximum
entropy, etc)
A is initially assumed to be uniform distribution
Likelihood at time t:
9/29/2019
Introduction to Data Mining, 2nd Edition
16
Strengths/Weaknesses of Statistical Approaches
Firm mathematical foundation
Can be very efficient
Good results if distribution is known
In many cases, data distribution may not be known
For high dimensional data, it may be difficult to estimate the
true distribution
Anomalies can distort the parameters of the distribution
9/29/2019
Introduction to Data Mining, 2nd Edition
17
Distance-Based Approaches
Several different techniques
An object is an outlier if a specified fraction of the objects is
more than a specified distance away (Knorr, Ng 1998)
Some statistical definitions are special cases of this
The outlier score of an object is the distance to its kth nearest
neighbor
9/29/2019
Introduction to Data Mining, 2nd Edition
18
One Nearest Neighbor - One Outlier
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
19
One Nearest Neighbor - Two Outliers
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
20
Five Nearest Neighbors - Small Cluster
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
21
Five Nearest Neighbors - Differing Density
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
22
Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive – O(n2)
Sensitive to parameters
Sensitive to variations in density
Distance becomes less meaningful in high-dimensional space
9/29/2019
Introduction to Data Mining, 2nd Edition
23
Density-Based Approaches
Density-based Outlier: The outlier score of an object is the
inverse of the density around the object.
Can be defined in terms of the k nearest neighbors
One definition: Inverse of distance to kth neighbor
Another definition: Inverse of the average distance to k
neighbors
DBSCAN definition
If there are regions of different density, this approach can have
problems
9/29/2019
Introduction to Data Mining, 2nd Edition
24
Relative Density
Consider the density of a point relative to that of its k nearest
neighbors
9/29/2019
Introduction to Data Mining, 2nd Edition
25
Relative Density Outlier Scores
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
26
Density-based: LOF approach
For each point, compute the density of its local neighborhood
Compute local outlier factor (LOF) of a sample p as the average
of the ratios of the density of sample p and the density of its
nearest neighbors
Outliers are points with largest LOF value
p2
p1
In the NN approach, p2 is not considered as outlier, while LOF
approach find both p1 and p2 as outliers
9/29/2019
Introduction to Data Mining, 2nd Edition
27
Strengths/Weaknesses of Density-Based Approaches
Simple
Expensive – O(n2)
Sensitive to parameters
Density becomes less meaningful in high-dimensional space
9/29/2019
Introduction to Data Mining, 2nd Edition
28
Clustering-Based Approaches
Clustering-based Outlier: An object is a cluster-based outlier if
it does not strongly belong to any cluster
For prototype-based clusters, an object is an outlier if it is not
close enough to a cluster center
For density-based clusters, an object is an outlier if its density
is too low
For graph-based clusters, an object is an outlier if it is not well
connected
Other issues include the impact of outliers on the clusters and
the number of clusters
9/29/2019
Introduction to Data Mining, 2nd Edition
29
Distance of Points from Closest Centroids
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
30
Relative Distance of Points from Closest Centroid
Outlier Score
9/29/2019
Introduction to Data Mining, 2nd Edition
31
Strengths/Weaknesses of Distance-Based Approaches
Simple
Many clustering techniques can be used
Can be difficult to decide on a clustering technique
Can be difficult to decide on number of clusters
Outliers can distort the clusters
9/29/2019
Introduction to Data Mining, 2nd Edition
32
x
y
-4
-3
-2
-1
0
1
2
3
4
5
-5
-4
-3
-2
-1
0
1
2
3
4
5
6
7
8
probability
density
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
s
X
X
G
-
=
max
2
2
)
2
,
/
(
)
2
,
/
(
2
)
1
(
-
-
+
-
-
>
N
N
N
N
t
N
t
N
N
G
a
a
å
å
Õ
Õ
Õ
Î
Î
Î
Î
=
+
+
+
-
=
÷
÷
ø
ö
ç
ç
è
æ
÷
÷
ø
ö
ç
ç
è
æ
-
=
=
t
i
t
t
i
t
t
i
t
t
t
i
t
t
A
x
i
A
t
M
x
i
M
t
t
A
x
i
A
A
M
x
i
M
M
N
i
i
D
t
x
P
A
x
P
M
D
LL
x
P
x
P
x
P
D
L
)
(
log
log
)
(
log
)
1
log(
)
(
)
(
)
(
)
1
(
)
(
)
(
|
|
|
|
1
l
l
l
l
D
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
D
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
D
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
D
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
1
2
3
4
5
6
6.85
1.33
1.40
A
C
D
0.5
1
1.5
2
2.5
3
3.5
4
4.5
D
C
A
1.2
0.17
4.6
0.5
1
1.5
2
2.5
3
3.5
4

More Related Content

Similar to AnomalyOutlier DetectionWhat are anomaliesoutliersThe set.docx

Chapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptChapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptSubrata Kumer Paul
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Salah Amean
 
Data Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsData Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsNiloy Sikder
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionKhalid Elshafie
 
Outlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataOutlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataIJERA Editor
 
Datascience Introduction WebSci Summer School 2014
Datascience Introduction WebSci Summer School 2014Datascience Introduction WebSci Summer School 2014
Datascience Introduction WebSci Summer School 2014Claudia Wagner
 
Chapter 4 Clustering I02142018 Intro
Chapter 4 Clustering I02142018      IntroChapter 4 Clustering I02142018      Intro
Chapter 4 Clustering I02142018 IntroTawnaDelatorrejs
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSZac Darcy
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Term_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangTerm_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangShengzhe Wang
 
Chapter 4 Clustering I02142018 Intro.docx
Chapter 4 Clustering I02142018      Intro.docxChapter 4 Clustering I02142018      Intro.docx
Chapter 4 Clustering I02142018 Intro.docxketurahhazelhurst
 
Introduction to Classification
Introduction to ClassificationIntroduction to Classification
Introduction to Classificationamitpraseed
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 

Similar to AnomalyOutlier DetectionWhat are anomaliesoutliersThe set.docx (20)

12 outlier
12 outlier12 outlier
12 outlier
 
Chapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.pptChapter 12. Outlier Detection.ppt
Chapter 12. Outlier Detection.ppt
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
 
Chapter 12 outlier
Chapter 12 outlierChapter 12 outlier
Chapter 12 outlier
 
Data Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & SystemsData Mining Primitives, Languages & Systems
Data Mining Primitives, Languages & Systems
 
Chapter 10 Anomaly Detection
Chapter 10 Anomaly DetectionChapter 10 Anomaly Detection
Chapter 10 Anomaly Detection
 
Outlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional DataOutlier Detection Using Unsupervised Learning on High Dimensional Data
Outlier Detection Using Unsupervised Learning on High Dimensional Data
 
Datascience Introduction WebSci Summer School 2014
Datascience Introduction WebSci Summer School 2014Datascience Introduction WebSci Summer School 2014
Datascience Introduction WebSci Summer School 2014
 
Chapter 4 Clustering I02142018 Intro
Chapter 4 Clustering I02142018      IntroChapter 4 Clustering I02142018      Intro
Chapter 4 Clustering I02142018 Intro
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Term_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangTerm_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_Wang
 
Data analysis00 commonprobabilitymodels
Data analysis00 commonprobabilitymodelsData analysis00 commonprobabilitymodels
Data analysis00 commonprobabilitymodels
 
Chapter 4 Clustering I02142018 Intro.docx
Chapter 4 Clustering I02142018      Intro.docxChapter 4 Clustering I02142018      Intro.docx
Chapter 4 Clustering I02142018 Intro.docx
 
Introduction to Classification
Introduction to ClassificationIntroduction to Classification
Introduction to Classification
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 

More from amrit47

APA, The assignment require a contemporary approach addressing Race,.docx
APA, The assignment require a contemporary approach addressing Race,.docxAPA, The assignment require a contemporary approach addressing Race,.docx
APA, The assignment require a contemporary approach addressing Race,.docxamrit47
 
APA style and all questions answered ( no min page requirements) .docx
APA style and all questions answered ( no min page requirements) .docxAPA style and all questions answered ( no min page requirements) .docx
APA style and all questions answered ( no min page requirements) .docxamrit47
 
Apa format1-2 paragraphsreferences It is often said th.docx
Apa format1-2 paragraphsreferences It is often said th.docxApa format1-2 paragraphsreferences It is often said th.docx
Apa format1-2 paragraphsreferences It is often said th.docxamrit47
 
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docx
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docxAPA format2-3 pages, double-spaced1. Choose a speech to review. It.docx
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docxamrit47
 
APA format  httpsapastyle.apa.orghttpsowl.purd.docx
APA format     httpsapastyle.apa.orghttpsowl.purd.docxAPA format     httpsapastyle.apa.orghttpsowl.purd.docx
APA format  httpsapastyle.apa.orghttpsowl.purd.docxamrit47
 
APA format2-3 pages, double-spaced1. Choose a speech to review. .docx
APA format2-3 pages, double-spaced1. Choose a speech to review. .docxAPA format2-3 pages, double-spaced1. Choose a speech to review. .docx
APA format2-3 pages, double-spaced1. Choose a speech to review. .docxamrit47
 
APA Formatting AssignmentUse the information below to create.docx
APA Formatting AssignmentUse the information below to create.docxAPA Formatting AssignmentUse the information below to create.docx
APA Formatting AssignmentUse the information below to create.docxamrit47
 
APA style300 words10 maximum plagiarism  Mrs. Smith was.docx
APA style300 words10 maximum plagiarism  Mrs. Smith was.docxAPA style300 words10 maximum plagiarism  Mrs. Smith was.docx
APA style300 words10 maximum plagiarism  Mrs. Smith was.docxamrit47
 
APA format1. What are the three most important takeawayslessons.docx
APA format1. What are the three most important takeawayslessons.docxAPA format1. What are the three most important takeawayslessons.docx
APA format1. What are the three most important takeawayslessons.docxamrit47
 
APA General Format Summary APA (American Psychological.docx
APA General Format Summary APA (American Psychological.docxAPA General Format Summary APA (American Psychological.docx
APA General Format Summary APA (American Psychological.docxamrit47
 
Appearance When I watched the video of myself, I felt that my b.docx
Appearance When I watched the video of myself, I felt that my b.docxAppearance When I watched the video of myself, I felt that my b.docx
Appearance When I watched the video of myself, I felt that my b.docxamrit47
 
apa format1-2 paragraphsreferencesFor this week’s .docx
apa format1-2 paragraphsreferencesFor this week’s .docxapa format1-2 paragraphsreferencesFor this week’s .docx
apa format1-2 paragraphsreferencesFor this week’s .docxamrit47
 
APA Format, with 2 references for each question and an assignment..docx
APA Format, with 2 references for each question and an assignment..docxAPA Format, with 2 references for each question and an assignment..docx
APA Format, with 2 references for each question and an assignment..docxamrit47
 
APA-formatted 8-10 page research paper which examines the potential .docx
APA-formatted 8-10 page research paper which examines the potential .docxAPA-formatted 8-10 page research paper which examines the potential .docx
APA-formatted 8-10 page research paper which examines the potential .docxamrit47
 
APA    STYLE 1.Define the terms multiple disabilities and .docx
APA    STYLE 1.Define the terms multiple disabilities and .docxAPA    STYLE 1.Define the terms multiple disabilities and .docx
APA    STYLE 1.Define the terms multiple disabilities and .docxamrit47
 
APA STYLE  follow this textbook answer should be summarize for t.docx
APA STYLE  follow this textbook answer should be summarize for t.docxAPA STYLE  follow this textbook answer should be summarize for t.docx
APA STYLE  follow this textbook answer should be summarize for t.docxamrit47
 
APA7Page length 3-4, including Title Page and Reference Pag.docx
APA7Page length 3-4, including Title Page and Reference Pag.docxAPA7Page length 3-4, including Title Page and Reference Pag.docx
APA7Page length 3-4, including Title Page and Reference Pag.docxamrit47
 
APA format, 2 pagesThree general sections 1. an article s.docx
APA format, 2 pagesThree general sections 1. an article s.docxAPA format, 2 pagesThree general sections 1. an article s.docx
APA format, 2 pagesThree general sections 1. an article s.docxamrit47
 
APA Style with minimum of 450 words, with annotations, quotation.docx
APA Style with minimum of 450 words, with annotations, quotation.docxAPA Style with minimum of 450 words, with annotations, quotation.docx
APA Style with minimum of 450 words, with annotations, quotation.docxamrit47
 
APA FORMAT1.  What are the three most important takeawayslesson.docx
APA FORMAT1.  What are the three most important takeawayslesson.docxAPA FORMAT1.  What are the three most important takeawayslesson.docx
APA FORMAT1.  What are the three most important takeawayslesson.docxamrit47
 

More from amrit47 (20)

APA, The assignment require a contemporary approach addressing Race,.docx
APA, The assignment require a contemporary approach addressing Race,.docxAPA, The assignment require a contemporary approach addressing Race,.docx
APA, The assignment require a contemporary approach addressing Race,.docx
 
APA style and all questions answered ( no min page requirements) .docx
APA style and all questions answered ( no min page requirements) .docxAPA style and all questions answered ( no min page requirements) .docx
APA style and all questions answered ( no min page requirements) .docx
 
Apa format1-2 paragraphsreferences It is often said th.docx
Apa format1-2 paragraphsreferences It is often said th.docxApa format1-2 paragraphsreferences It is often said th.docx
Apa format1-2 paragraphsreferences It is often said th.docx
 
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docx
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docxAPA format2-3 pages, double-spaced1. Choose a speech to review. It.docx
APA format2-3 pages, double-spaced1. Choose a speech to review. It.docx
 
APA format  httpsapastyle.apa.orghttpsowl.purd.docx
APA format     httpsapastyle.apa.orghttpsowl.purd.docxAPA format     httpsapastyle.apa.orghttpsowl.purd.docx
APA format  httpsapastyle.apa.orghttpsowl.purd.docx
 
APA format2-3 pages, double-spaced1. Choose a speech to review. .docx
APA format2-3 pages, double-spaced1. Choose a speech to review. .docxAPA format2-3 pages, double-spaced1. Choose a speech to review. .docx
APA format2-3 pages, double-spaced1. Choose a speech to review. .docx
 
APA Formatting AssignmentUse the information below to create.docx
APA Formatting AssignmentUse the information below to create.docxAPA Formatting AssignmentUse the information below to create.docx
APA Formatting AssignmentUse the information below to create.docx
 
APA style300 words10 maximum plagiarism  Mrs. Smith was.docx
APA style300 words10 maximum plagiarism  Mrs. Smith was.docxAPA style300 words10 maximum plagiarism  Mrs. Smith was.docx
APA style300 words10 maximum plagiarism  Mrs. Smith was.docx
 
APA format1. What are the three most important takeawayslessons.docx
APA format1. What are the three most important takeawayslessons.docxAPA format1. What are the three most important takeawayslessons.docx
APA format1. What are the three most important takeawayslessons.docx
 
APA General Format Summary APA (American Psychological.docx
APA General Format Summary APA (American Psychological.docxAPA General Format Summary APA (American Psychological.docx
APA General Format Summary APA (American Psychological.docx
 
Appearance When I watched the video of myself, I felt that my b.docx
Appearance When I watched the video of myself, I felt that my b.docxAppearance When I watched the video of myself, I felt that my b.docx
Appearance When I watched the video of myself, I felt that my b.docx
 
apa format1-2 paragraphsreferencesFor this week’s .docx
apa format1-2 paragraphsreferencesFor this week’s .docxapa format1-2 paragraphsreferencesFor this week’s .docx
apa format1-2 paragraphsreferencesFor this week’s .docx
 
APA Format, with 2 references for each question and an assignment..docx
APA Format, with 2 references for each question and an assignment..docxAPA Format, with 2 references for each question and an assignment..docx
APA Format, with 2 references for each question and an assignment..docx
 
APA-formatted 8-10 page research paper which examines the potential .docx
APA-formatted 8-10 page research paper which examines the potential .docxAPA-formatted 8-10 page research paper which examines the potential .docx
APA-formatted 8-10 page research paper which examines the potential .docx
 
APA    STYLE 1.Define the terms multiple disabilities and .docx
APA    STYLE 1.Define the terms multiple disabilities and .docxAPA    STYLE 1.Define the terms multiple disabilities and .docx
APA    STYLE 1.Define the terms multiple disabilities and .docx
 
APA STYLE  follow this textbook answer should be summarize for t.docx
APA STYLE  follow this textbook answer should be summarize for t.docxAPA STYLE  follow this textbook answer should be summarize for t.docx
APA STYLE  follow this textbook answer should be summarize for t.docx
 
APA7Page length 3-4, including Title Page and Reference Pag.docx
APA7Page length 3-4, including Title Page and Reference Pag.docxAPA7Page length 3-4, including Title Page and Reference Pag.docx
APA7Page length 3-4, including Title Page and Reference Pag.docx
 
APA format, 2 pagesThree general sections 1. an article s.docx
APA format, 2 pagesThree general sections 1. an article s.docxAPA format, 2 pagesThree general sections 1. an article s.docx
APA format, 2 pagesThree general sections 1. an article s.docx
 
APA Style with minimum of 450 words, with annotations, quotation.docx
APA Style with minimum of 450 words, with annotations, quotation.docxAPA Style with minimum of 450 words, with annotations, quotation.docx
APA Style with minimum of 450 words, with annotations, quotation.docx
 
APA FORMAT1.  What are the three most important takeawayslesson.docx
APA FORMAT1.  What are the three most important takeawayslesson.docxAPA FORMAT1.  What are the three most important takeawayslesson.docx
APA FORMAT1.  What are the three most important takeawayslesson.docx
 

Recently uploaded

Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxLimon Prince
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxCeline George
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportDenish Jangid
 
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSean M. Fox
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...Gary Wood
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................MirzaAbrarBaig5
 
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjStl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjMohammed Sikander
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppCeline George
 
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdfContoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdfcupulin
 
8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital ManagementMBA Assignment Experts
 
Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...EduSkills OECD
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...EADTU
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17Celine George
 
How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17Celine George
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSAnaAcapella
 

Recently uploaded (20)

Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptx
 
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of TransportBasic Civil Engineering notes on Transportation Engineering & Modes of Transport
Basic Civil Engineering notes on Transportation Engineering & Modes of Transport
 
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................
 
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjStl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio App
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdfContoh Aksi Nyata Refleksi Diri ( NUR ).pdf
Contoh Aksi Nyata Refleksi Diri ( NUR ).pdf
 
Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"
 
8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management8 Tips for Effective Working Capital Management
8 Tips for Effective Working Capital Management
 
Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...Andreas Schleicher presents at the launch of What does child empowerment mean...
Andreas Schleicher presents at the launch of What does child empowerment mean...
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17
 
How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17
 
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPSSpellings Wk 4 and Wk 5 for Grade 4 at CAPS
Spellings Wk 4 and Wk 5 for Grade 4 at CAPS
 
Supporting Newcomer Multilingual Learners
Supporting Newcomer  Multilingual LearnersSupporting Newcomer  Multilingual Learners
Supporting Newcomer Multilingual Learners
 

AnomalyOutlier DetectionWhat are anomaliesoutliersThe set.docx

  • 1. Anomaly/Outlier Detection What are anomalies/outliers? The set of data points that are considerably different than the remainder of the data Natural implication is that anomalies are relatively rare One in a thousand occurs often if you have lots of data Context is important, e.g., freezing temps in July Can be important or a nuisance 10 foot tall 2 year old Unusually high blood pressure 9/29/2019 Introduction to Data Mining, 2nd Edition 1 Importance of Anomaly Detection Ozone Depletion History In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
  • 2. The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html 9/29/2019 Introduction to Data Mining, 2nd Edition 2 Causes of Anomalies Data from different classes Measuring the weights of oranges, but a few grapefruit are mixed in Natural variation Unusually tall people Data errors 200 pound 2 year old 9/29/2019 Introduction to Data Mining, 2nd Edition 3 Distinction Between Noise and Anomalies Noise is erroneous, perhaps random, values or contaminating
  • 3. objects Weight recorded incorrectly Grapefruit mixed in with the oranges Noise doesn’t necessarily produce unusual values or objects Noise is not interesting Anomalies may be interesting if they are not a result of noise Noise and anomalies are related but distinct concepts 9/29/2019 Introduction to Data Mining, 2nd Edition 4 General Issues: Number of Attributes Many anomalies are defined in terms of a single attribute Height Shape Color Can be hard to find an anomaly using all attributes Noisy or irrelevant attributes Object is only anomalous with respect to some attributes However, an object may not be anomalous in any one attribute 9/29/2019 Introduction to Data Mining, 2nd Edition 5 General Issues: Anomaly Scoring Many anomaly detection techniques provide only a binary
  • 4. categorization An object is an anomaly or it isn’t This is especially true of classification-based approaches Other approaches assign a score to all points This score measures the degree to which an object is an anomaly This allows objects to be ranked In the end, you often need a binary decision Should this credit card transaction be flagged? Still useful to have a score How many anomalies are there? 9/29/2019 Introduction to Data Mining, 2nd Edition 6 Other Issues for Anomaly Detection Find all anomalies at once or one at a time Swamping Masking Evaluation How do you measure performance? Supervised vs. unsupervised situations Efficiency Context Professional basketball team 9/29/2019 Introduction to Data Mining, 2nd Edition
  • 5. 7 Variants of Anomaly Detection Problems scores greater than some threshold t -n largest anomaly scores Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D 9/29/2019 Introduction to Data Mining, 2nd Edition 8 Model-Based Anomaly Detection Build a model for the data and see Unsupervised Anomalies are those points that don’t fit well Anomalies are those points that distort the model Examples: Statistical distribution Clusters Regression Geometric Graph Supervised Anomalies are regarded as a rare class Need to have training data
  • 6. 9/29/2019 Introduction to Data Mining, 2nd Edition 9 Additional Anomaly Detection Techniques Proximity-based Anomalies are points far away from other points Can detect this graphically in some cases Density-based Low density points are outliers Pattern matching Create profiles or templates of atypical but important events or objects Algorithms to detect these patterns are usually simple and efficient 9/29/2019 Introduction to Data Mining, 2nd Edition 10 Visual Approaches Boxplots or scatter plots Limitations Not automatic Subjective 9/29/2019 Introduction to Data Mining, 2nd Edition
  • 7. 11 Statistical Approaches Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data. Usually assume a parametric model describing the distribution of the data (e.g., normal distribution) Apply a statistical test that depends on Data distribution Parameters of distribution (e.g., mean, variance) Number of expected outliers (confidence limit) Issues Identifying the distribution of a data set Heavy tailed distribution Number of attributes Is the data a mixture of distributions? 9/29/2019 Introduction to Data Mining, 2nd Edition 12 Normal Distributions One-dimensional Gaussian Two-dimensional Gaussian 9/29/2019 Introduction to Data Mining, 2nd Edition 13
  • 8. Grubbs’ Test Detect outliers in univariate data Assume data comes from normal distribution Detects one outlier at a time, remove the outlier, and repeat H0: There is no outlier in data HA: There is at least one outlier Grubbs’ test statistic: Reject H0 if: 9/29/2019 Introduction to Data Mining, 2nd Edition 14 Statistical-based – Likelihood Approach Assume the data set D contains samples from a mixture of two probability distributions: M (majority distribution) A (anomalous distribution) General Approach: Initially, assume all the data points belong to M Let Lt(D) be the log likelihood of D at time t For each point xt that belongs to M, move it to A Let Lt+1 (D) be the new log likelihood. – Lt+1 (D) and moved permanently from M to A
  • 9. 9/29/2019 Introduction to Data Mining, 2nd Edition 15 Statistical-based – Likelihood Approach Data distribution, D = (1 – M is a probability distribution estimated from data Can be based on any modeling method (naïve Bayes, maximum entropy, etc) A is initially assumed to be uniform distribution Likelihood at time t: 9/29/2019 Introduction to Data Mining, 2nd Edition 16 Strengths/Weaknesses of Statistical Approaches Firm mathematical foundation Can be very efficient Good results if distribution is known In many cases, data distribution may not be known For high dimensional data, it may be difficult to estimate the true distribution Anomalies can distort the parameters of the distribution
  • 10. 9/29/2019 Introduction to Data Mining, 2nd Edition 17 Distance-Based Approaches Several different techniques An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998) Some statistical definitions are special cases of this The outlier score of an object is the distance to its kth nearest neighbor 9/29/2019 Introduction to Data Mining, 2nd Edition 18 One Nearest Neighbor - One Outlier Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 19
  • 11. One Nearest Neighbor - Two Outliers Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 20 Five Nearest Neighbors - Small Cluster Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 21 Five Nearest Neighbors - Differing Density Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 22 Strengths/Weaknesses of Distance-Based Approaches Simple Expensive – O(n2) Sensitive to parameters
  • 12. Sensitive to variations in density Distance becomes less meaningful in high-dimensional space 9/29/2019 Introduction to Data Mining, 2nd Edition 23 Density-Based Approaches Density-based Outlier: The outlier score of an object is the inverse of the density around the object. Can be defined in terms of the k nearest neighbors One definition: Inverse of distance to kth neighbor Another definition: Inverse of the average distance to k neighbors DBSCAN definition If there are regions of different density, this approach can have problems 9/29/2019 Introduction to Data Mining, 2nd Edition 24 Relative Density Consider the density of a point relative to that of its k nearest neighbors
  • 13. 9/29/2019 Introduction to Data Mining, 2nd Edition 25 Relative Density Outlier Scores Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 26 Density-based: LOF approach For each point, compute the density of its local neighborhood Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors Outliers are points with largest LOF value p2 p1 In the NN approach, p2 is not considered as outlier, while LOF approach find both p1 and p2 as outliers 9/29/2019 Introduction to Data Mining, 2nd Edition
  • 14. 27 Strengths/Weaknesses of Density-Based Approaches Simple Expensive – O(n2) Sensitive to parameters Density becomes less meaningful in high-dimensional space 9/29/2019 Introduction to Data Mining, 2nd Edition 28 Clustering-Based Approaches Clustering-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center For density-based clusters, an object is an outlier if its density is too low For graph-based clusters, an object is an outlier if it is not well connected Other issues include the impact of outliers on the clusters and the number of clusters
  • 15. 9/29/2019 Introduction to Data Mining, 2nd Edition 29 Distance of Points from Closest Centroids Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 30 Relative Distance of Points from Closest Centroid Outlier Score 9/29/2019 Introduction to Data Mining, 2nd Edition 31
  • 16. Strengths/Weaknesses of Distance-Based Approaches Simple Many clustering techniques can be used Can be difficult to decide on a clustering technique Can be difficult to decide on number of clusters Outliers can distort the clusters 9/29/2019 Introduction to Data Mining, 2nd Edition 32 x y -4 -3 -2 -1 0 1 2 3 4 5 -5 -4 -3 -2 -1