Chapter 6
Outlier detection
Syllabus:
What are outliers? Types, Challenges; Outlier Detection Methods: Supervised, Semi-Supervised,
Unsupervised, Proximity-Based, Clustering-Based.
What is Outlier Detection?
Outlier detection (also known as anomaly detection) is the process of finding data objects with
behaviors that are very different from expectation. Such objects are called outliers or anomalies.
Beyond fraud detection, outlier detection is important in many applications, such as medical
care, public safety and security, industrial damage detection, image processing, sensor/video network
surveillance, and intrusion detection.
What is an Outlier?
An outlier is a data object that deviates significantly from the rest of the objects, as if it were
generated by a different mechanism. For ease of presentation within this chapter, we may refer to data
objects that are not outliers as “normal” or expected data. Similarly, we may refer to outliers as
“abnormal” data.
Outliers are different from noisy data. Noise is a random error or variance in a measured variable.
In general, noise is not interesting in data analysis, including outlier detection.
For example, in credit card fraud detection, a customer’s purchase behavior can be modeled as a
random variable. A customer may generate some “noise transactions” that may seem like “random
errors” or “variance,” such as by buying a bigger lunch one day, or having one more cup of coffee
than usual. Such transactions should not be treated as outliers; otherwise, the credit card
company would incur heavy costs from verifying so many transactions.
The company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.
Types of Outliers :-
In general, outliers can be classified into three categories:
➢ Global Outliers
➢ Contextual Outliers
➢ Collective Outliers
➢ Global Outliers :-
In a given data set, a data object is a global outlier if it deviates significantly from the rest of
the data set.
Global outliers are sometimes called point anomalies, and are the simplest type of outliers.
Most outlier detection methods are aimed at finding global outliers.
Examples :-
To detect global outliers, a critical issue is to find an appropriate measurement of deviation
with respect to the application in question. Various measurements have been proposed, and,
based on these, outlier detection methods are partitioned into different categories. We will
come to this issue in detail later.
Global outlier detection is important in many applications. Consider intrusion detection in
computer networks, for example. If the communication behavior of a computer is very different
from the normal patterns (e.g., a large number of packets is broadcast in a short time), this
behavior may be considered as a global outlier and the corresponding computer is a suspected
victim of hacking. As another example, in trading transaction auditing systems, transactions that
do not follow the regulations are considered as global outliers and should be held for further
examination.
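As an illustration of one possible measurement of deviation, the sketch below scores each value with a modified z-score built from the median and the median absolute deviation (MAD). It is only a sketch under stated assumptions: the function name `global_outliers`, the packet counts, and the cutoff of 3.5 (a commonly used convention) are illustrative and do not come from the chapter.

```python
from statistics import median

def global_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that extreme values barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviation against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical packets-per-second counts observed for one computer.
packets_per_second = [52, 48, 50, 47, 53, 49, 51, 950, 50, 46]
flagged = global_outliers(packets_per_second)
```

A median-based score is used here instead of the mean and standard deviation because a single extreme value inflates the standard deviation and can mask itself in a small sample.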
➢ Contextual Outliers :-
In a given data set, a data object is a contextual outlier if it deviates significantly with
respect to a specific context of the object.
Contextual outliers are also known as conditional outliers because they are conditional on the
selected context.
Therefore, in contextual outlier detection, the context has to be specified as part of the
problem definition.
Generally, in contextual outlier detection, the attributes of the data objects in question are
divided into two groups:
i. Contextual attributes
ii. Behavioral attributes
Examples :-
“The temperature today is 28°C. Is it exceptional (i.e., an outlier)?” It depends, for example,
on the time and location! If it is in winter in Toronto, yes, it is an outlier. If it is a summer day in
Toronto, then it is normal. Unlike global outlier detection, in this case, whether or not today’s
temperature value is an outlier depends on the context—the date, the location, and possibly some
other factors.
i. Contextual attributes :-
The contextual attributes of a data object define the object’s context. In the temperature
example, the contextual attributes may be date and location.
ii. Behavioral attributes :-
These define the object’s characteristics, and are used to evaluate whether the object is
an outlier in the context to which it belongs. In the temperature example, the behavioral
attributes may be the temperature, humidity, and pressure.
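The split into contextual and behavioral attributes can be sketched as follows: records are grouped by their contextual attributes, and the behavioral attribute is then judged only against its own group. This is a minimal sketch, assuming a single behavioral attribute and a z-score test; the function name `contextual_outliers`, the threshold, and the readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, threshold=2.0):
    """Flag records whose behavioral value is unusual within its own context.

    Each record is (context, value): context is a tuple of contextual
    attributes, value is the behavioral attribute.
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        mu, sigma = mean(group), pstdev(group)
        # Deviation is measured against the record's own context only.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature readings in degrees Celsius.
readings = (
    [(("Toronto", "winter"), t) for t in [-6, -3, -8, -2, -5, -4, -7, 28]]
    + [(("Toronto", "summer"), t) for t in [26, 28, 27, 25, 29, 28, 24, 27]]
)
flagged = contextual_outliers(readings)
```

In this sample the same value, 28, appears in both contexts but is flagged only in the winter group.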
➢ Collective Outliers :-
Suppose you are a supply-chain manager of AllElectronics. You handle thousands of orders
and shipments every day. If the shipment of an order is delayed, it may not be considered an
outlier because, statistically, delays occur from time to time. However, you have to pay attention
if 100 orders are delayed on a single day. Those 100 orders as a whole form an outlier, although
each of them may not be regarded as an outlier if considered individually. You may have to take a
close look at those orders collectively to understand the shipment problem.
Given a data set, a subset of data objects forms a collective outlier if the objects as a whole
deviate significantly from the entire data set. Importantly, the individual data objects may not be
outliers.
The black objects as a whole form a collective outlier because the density of those objects is
much higher than that of the rest of the data set. However, no black object is individually an outlier
with respect to the whole data set.
Collective outlier detection has many important applications. For example, in intrusion
detection, a denial-of-service packet from one computer to another is considered normal, and
not an outlier at all.
However, if several computers keep sending denial-of-service packets to each other, they as
a whole should be considered as a collective outlier. The computers involved may be suspected of
being compromised by an attack.
As another example, a stock transaction between two parties is considered normal. However,
a large set of transactions of the same stock among a small group of parties in a short period are collective
outliers because they may be evidence of some people manipulating the market.
Unlike global or contextual outlier detection, in collective outlier detection we have to
consider not only the behavior of individual objects, but also that of groups of objects.
Therefore, to detect collective outliers, we need background knowledge of the relationship
among data objects such as distance or similarity measurements between objects.
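The delayed-orders example can be sketched by judging each day's delays as a group rather than one order at a time. The sketch assumes, purely for illustration, that orders are delayed independently at a known historical rate, so the daily count of delays is roughly binomial; the function name `collective_delay_days`, the rate, and the figures are hypothetical.

```python
from math import sqrt

def collective_delay_days(daily_counts, delay_rate, k=3.0):
    """Return the days whose number of delayed orders is unusually high as a group.

    daily_counts maps day -> (total_orders, delayed_orders).
    """
    flagged = []
    for day, (total, delayed) in daily_counts.items():
        # Mean and standard deviation of a binomial count of delays.
        expected = total * delay_rate
        spread = sqrt(total * delay_rate * (1 - delay_rate))
        if delayed > expected + k * spread:
            flagged.append(day)
    return flagged

# Hypothetical figures: about 1% of orders are delayed on an ordinary day.
daily_counts = {
    "Mon": (2000, 18),
    "Tue": (2100, 25),
    "Wed": (1900, 100),
    "Thu": (2050, 21),
}
flagged_days = collective_delay_days(daily_counts, delay_rate=0.01)
```

No single delayed order is inspected here; only the size of the group of delays on a given day is compared with what the assumed model expects.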
Challenges of Outlier Detection :-
Outlier detection is useful in many applications yet faces many challenges, such as the
following:
➢ Modeling normal objects and outliers effectively.
➢ Application-specific outlier detection.
➢ Handling noise in outlier detection.
➢ Understandability.
➢ Modeling normal objects and outliers effectively :-
The quality of outlier detection depends heavily on the modeling of normal (nonoutlier) objects and
outliers.
Often, building a comprehensive model for data normality is very challenging, if not
impossible.
This is partly because it is hard to enumerate all possible normal behaviors in an application.
The border between data normality and abnormality (outliers) is often not clear cut.
Instead, there can be a wide gray area. Consequently, while some outlier detection
methods assign to each object in the input data set a label of either “normal” or “outlier,” other
methods assign to each object a score measuring the “outlier-ness” of the object.
➢ Application-specific outlier detection :-
Technically, choosing the similarity/distance measure and the relationship model to describe
data objects is critical in outlier detection.
Unfortunately, such choices are often application-dependent. Different applications may have
very different requirements.
For example, in clinical data analysis, a small deviation may be important enough to justify an
outlier.
In contrast, in marketing analysis, objects are often subject to larger fluctuations, and
consequently a substantially larger deviation is needed to justify an outlier.
Outlier detection’s high dependency on the application type makes it impossible to develop a
universally applicable outlier detection method.
Instead, individual outlier detection methods that are dedicated to specific applications must
be developed.
➢ Handling noise in outlier detection :-
As mentioned earlier, outliers are different from noise. It is also well known that the quality
of real data sets tends to be poor.
Noise often unavoidably exists in data collected in many applications. Noise may be present
as deviations in attribute values or even as missing values.
Low data quality and the presence of noise bring a huge challenge to outlier detection. They
can distort the data, blurring the distinction between normal objects and outliers.
Moreover, noise and missing data may “hide” outliers and reduce the effectiveness of outlier
detection—an outlier may appear “disguised” as a noise point, and an outlier detection method
may mistakenly identify a noise point as an outlier.
➢ Understandability :-
In some application scenarios, a user may want to not only detect outliers, but also
understand why the detected objects are outliers.
To meet the understandability requirement, an outlier detection method has to provide some
justification of the detection. For example, a statistical method can be used to justify the degree to
which an object may be an outlier based on the likelihood that the object was generated by the
same mechanism that generated the majority of the data.
The smaller the likelihood, the less likely it is that the object was generated by the same
mechanism, and the more likely it is that the object is an outlier.
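A statistical justification of this kind might look like the sketch below, which fits a normal distribution to the data and reports how probable it is to see a value at least as far from the mean as the object in question. The normal model, the function name `tail_probability`, and the measurements are assumptions made for illustration only.

```python
from statistics import NormalDist

def tail_probability(sample, x):
    """Two-sided probability of a value at least as far from the mean as x,
    under a normal distribution fitted to the sample."""
    model = NormalDist.from_samples(sample)
    distance = abs(x - model.mean)
    return 2 * (1 - model.cdf(model.mean + distance))

# Hypothetical measurements assumed to come from the normal mechanism.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
p_typical = tail_probability(sample, 10.1)
p_extreme = tail_probability(sample, 12.0)
```

The returned probability can be shown to the user alongside the detection: a value near 1 supports "normal", while a value near 0 supports "outlier".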
Outlier Detection Methods :-
There are many outlier detection methods in the literature and in practice. Here, we present
two orthogonal ways to categorize outlier detection methods.
First, we categorize outlier detection methods according to whether the sample of data for
analysis is given with domain expert–provided labels that can be used to build an outlier detection
model.
Second, we divide methods into groups according to their assumptions regarding normal
objects versus outliers.
The methods covered in this chapter are:
➢ Supervised Methods
➢ Unsupervised Methods
➢ Semi-Supervised Methods
➢ Proximity-Based Methods
➢ Clustering-Based Methods
➢ Supervised Methods :-
Supervised methods model data normality and abnormality. Domain experts examine and
label a sample of the underlying data.
Outlier detection can then be modeled as a classification problem. The sample is used for
training and testing.
In some applications, the experts may label just the normal objects, and any other objects not
matching the model of normal objects are reported as outliers.
Other methods model the outliers and treat objects not matching the model of outliers as
normal.
The two classes (i.e., normal objects versus outliers) are imbalanced. That is, the
population of outliers is typically much smaller than that of normal objects.
Therefore, methods for handling imbalanced classes may be used, such as oversampling
(i.e., replicating) outliers to increase their share of the training set used to construct the
classifier.
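Oversampling by replication can be sketched as below. This is one simple variant, assuming the goal is to make the two classes equally large; the function name `oversample_outliers`, the label strings, and the transaction amounts are hypothetical.

```python
import random

def oversample_outliers(samples, seed=0):
    """Replicate outlier samples at random until both classes are the same size.

    samples is a list of (features, label) pairs with label "normal" or "outlier".
    """
    normal = [s for s in samples if s[1] == "normal"]
    outliers = [s for s in samples if s[1] == "outlier"]
    if not outliers or len(outliers) >= len(normal):
        return list(samples)  # nothing to replicate, or already balanced
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    extra = rng.choices(outliers, k=len(normal) - len(outliers))
    return normal + outliers + extra

# Hypothetical labeled sample: 8 normal transactions and 2 fraudulent ones.
labeled = [((amount,), "normal") for amount in [12, 15, 9, 20, 14, 11, 18, 13]]
labeled += [((950,), "outlier"), ((1200,), "outlier")]
balanced = oversample_outliers(labeled)
```

Replication of this kind is normally applied to the training portion only, so that copies of the same outlier do not appear in both the training and the test data.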
Due to the small population of outliers in data, the sample data examined by domain
experts and used in training may not even sufficiently represent the outlier distribution. The
lack of outlier samples can limit the capability of classifiers built as such. To tackle these
problems, some methods “make up” artificial outliers.
In many outlier detection applications, catching as many outliers as possible (i.e., the
sensitivity or recall of outlier detection) is far more important than not mislabeling normal
objects as outliers.
Consequently, when a classification method is used for supervised outlier detection, its
results have to be interpreted with the application's interest in recall in mind.
Supervised methods of outlier detection must be careful in how they train and how they
interpret classification rates, because outliers are rare in comparison to the other
data samples.
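The need to interpret classification rates carefully can be seen in a small sketch. It compares accuracy with recall for a trivial classifier that labels every object as normal; the function name `recall_and_precision` and the test set of 2 outliers among 100 objects are hypothetical.

```python
def recall_and_precision(true_labels, predicted_labels, positive="outlier"):
    """Compute recall and precision for the outlier class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical test set: 2 outliers among 100 objects.
truth = ["outlier"] * 2 + ["normal"] * 98
always_normal = ["normal"] * 100  # a classifier that never reports an outlier
accuracy = sum(t == p for t, p in zip(truth, always_normal)) / len(truth)
recall, precision = recall_and_precision(truth, always_normal)
```

The trivial classifier is right on 98 of the 100 objects, yet it catches none of the outliers, which is why recall rather than overall accuracy is the figure to watch.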
➢ Unsupervised Methods :-
Unsupervised outlier detection methods make an implicit assumption: The normal objects are
somewhat “clustered.” In other words, an unsupervised outlier detection method expects that
normal objects follow a pattern far more frequently than outliers.
Normal objects do not have to fall into one group sharing high similarity. Instead, they can
form multiple groups, where each group has distinct features.
However, an outlier is expected to occur far away in feature space from any of those groups
of normal objects. This assumption may not always hold. For example, suppose the normal objects
do not share any strong patterns and are instead uniformly distributed, while the collective
outliers share high similarity in a small area.
Unsupervised methods cannot detect such outliers effectively. In some applications, normal
objects are diversely distributed, and many such objects do not follow strong patterns. For
instance, in some intrusion detection and computer virus detection problems, normal activities are
very diverse and many do not fall into high-quality clusters.
In such scenarios, unsupervised methods may have a high false positive rate—they may
mislabel many normal objects as outliers (intrusions or viruses in these applications), and let
many actual outliers go undetected.
Due to the high similarity between intrusions and viruses (i.e., they have to attack key
resources in the target systems), modeling outliers using supervised methods may be far more
effective.
Many clustering methods can be adapted to act as unsupervised outlier detection methods.
The central idea is to find clusters first, and then the data objects not belonging to any cluster are
detected as outliers.
However, such methods suffer from two issues.
First, a data object not belonging to any cluster may be noise instead of an outlier.
Second, it is often costly to find clusters first and then find outliers. It is usually assumed that
there are far fewer outliers than normal objects.
The latest unsupervised outlier detection methods develop various smart ideas to tackle
outliers directly without explicitly and completely finding clusters.
➢ Semi-Supervised Methods :-
In many applications, although obtaining some labeled examples is feasible, the number of
such labeled examples is often small.
We may encounter cases where only a small set of the normal and/or outlier objects are
labeled, but most of the data are unlabeled. Semi-supervised outlier detection methods were
developed to tackle such scenarios.
Semi-supervised outlier detection methods can be regarded as applications of semi-supervised
learning methods.
For example, when some labeled normal objects are available, we can use them, together
with unlabeled objects that are close by, to train a model for normal objects.
The model of normal objects can then be used to detect outliers: objects that do not fit the
model of normal objects are classified as outliers.
If only some labeled outliers are available, semi-supervised outlier detection is trickier. A
small number of labeled outliers are unlikely to represent all the possible outliers.
Therefore, building a model for outliers based on only a few labeled outliers is unlikely to be
effective.
To improve the quality of outlier detection, we can get help from models for normal objects
learned from unsupervised methods.
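The case where some labeled normal objects are available can be sketched as follows. The sketch assumes one-dimensional values, treats unlabeled objects within a fixed radius of a labeled normal object as "close by", and models normal objects by a mean and standard deviation; all of these choices, and the name `semi_supervised_outliers`, are illustrative assumptions.

```python
from statistics import mean, pstdev

def semi_supervised_outliers(labeled_normal, unlabeled, radius=1.0, k=3.0):
    """Grow the normal set with unlabeled values near a labeled normal value,
    then flag unlabeled values that do not fit the resulting model."""
    # Step 1: unlabeled objects close to a labeled normal object are treated as normal.
    nearby = [u for u in unlabeled
              if any(abs(u - n) <= radius for n in labeled_normal)]
    training = list(labeled_normal) + nearby
    # Step 2: model normal objects by the mean and spread of the training set.
    mu, sigma = mean(training), pstdev(training)
    # Step 3: objects that do not fit the model are reported as outliers.
    return [u for u in unlabeled if abs(u - mu) > k * sigma]

# Hypothetical data: three labeled normal values and seven unlabeled ones.
labeled_normal = [10.0, 10.5, 9.5]
unlabeled = [9.8, 10.2, 11.0, 9.0, 10.7, 25.0, 10.1]
outliers = semi_supervised_outliers(labeled_normal, unlabeled)
```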
➢ Proximity-Based Methods :-
Proximity-based methods assume that an object is an outlier if the nearest neighbors of the
object are far away in feature space, that is, the proximity of the object to its neighbors
significantly deviates from the proximity of most of the other objects to their neighbors in the
same data set.
There are two major types of proximity-based outlier detection, namely distance-based and
density-based outlier detection.
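A distance-based variant can be sketched by scoring each object with the distance to its k-th nearest neighbor, so that objects whose neighbors are far away receive high scores. The function name `knn_distance_scores`, the choice k=2, and the points are assumptions made for illustration; the sketch compares every pair of points and needs more than k points.

```python
from math import dist

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores

# Hypothetical two-dimensional objects: four close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_distance_scores(points)
most_isolated = points[scores.index(max(scores))]
```

The scores can be ranked, as here, or compared against an application-specific distance threshold to produce "normal" and "outlier" labels.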
➢ Clustering-Based Methods :-
Clustering-based methods assume that the normal data objects belong to large and dense
clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
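The assumption above can be sketched with a deliberately simple clustering step: points closer than a radius `eps` are linked into the same cluster, and members of clusters below a minimum size are reported as outliers. The linking rule is a stand-in for a real clustering algorithm, and the function name `small_cluster_outliers`, the parameters, and the points are hypothetical.

```python
from math import dist

def small_cluster_outliers(points, eps=1.0, min_size=3):
    """Link points closer than eps into clusters, then flag members of small clusters."""
    unvisited = set(range(len(points)))
    flagged = []
    while unvisited:
        # Grow one cluster from an arbitrary seed by following eps-links.
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [j for j in unvisited
                      if dist(points[current], points[j]) <= eps]
            for j in linked:
                unvisited.remove(j)
            cluster.extend(linked)
            frontier.extend(linked)
        # Objects in small clusters (including singletons) are treated as outliers.
        if len(cluster) < min_size:
            flagged.extend(cluster)
    return sorted(points[i] for i in flagged)

# Hypothetical data: two groups of four points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.6), (0.8, 0.5),
          (5.0, 5.0), (5.4, 5.2), (5.1, 5.6), (5.5, 5.7),
          (10.0, 0.0)]
outliers = small_cluster_outliers(points)
```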

More Related Content

Similar to Chapter 6.pdf

Term_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangTerm_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangShengzhe Wang
 
Dn31538540
Dn31538540Dn31538540
Dn31538540IJMER
 
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdfDrog3
 
Data Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionData Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionIOSR Journals
 
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...yieldWerx Semiconductor
 
Modeling and Detection of Data Leakage Fraud
Modeling and Detection of Data Leakage FraudModeling and Detection of Data Leakage Fraud
Modeling and Detection of Data Leakage FraudIOSR Journals
 
IRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET Journal
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.docbutest
 
Jpdcs1(data lekage detection)
Jpdcs1(data lekage detection)Jpdcs1(data lekage detection)
Jpdcs1(data lekage detection)Chaitanya Kn
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detectionrejii
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTIJERA Editor
 
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...ijiert bestjournal
 
Outlier analysis,Chapter-12, Data Mining: Concepts and Techniques
Outlier analysis,Chapter-12, Data Mining: Concepts and TechniquesOutlier analysis,Chapter-12, Data Mining: Concepts and Techniques
Outlier analysis,Chapter-12, Data Mining: Concepts and TechniquesAshikur Rahman
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET Journal
 
Digital Forensics for Artificial Intelligence (AI ) Systems.pdf
Digital Forensics for Artificial Intelligence (AI ) Systems.pdfDigital Forensics for Artificial Intelligence (AI ) Systems.pdf
Digital Forensics for Artificial Intelligence (AI ) Systems.pdfMahdi_Fahmideh
 
An Efficient Approach for Outlier Detection in Wireless Sensor Network
An Efficient Approach for Outlier Detection in Wireless Sensor NetworkAn Efficient Approach for Outlier Detection in Wireless Sensor Network
An Efficient Approach for Outlier Detection in Wireless Sensor NetworkIOSR Journals
 

Similar to Chapter 6.pdf (20)

Term_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_WangTerm_Paper_Shengzhe_Wang
Term_Paper_Shengzhe_Wang
 
Dn31538540
Dn31538540Dn31538540
Dn31538540
 
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
 
2007.02500.pdf
2007.02500.pdf2007.02500.pdf
2007.02500.pdf
 
Data Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage DetectionData Allocation Strategies for Leakage Detection
Data Allocation Strategies for Leakage Detection
 
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
 
Modeling and Detection of Data Leakage Fraud
Modeling and Detection of Data Leakage FraudModeling and Detection of Data Leakage Fraud
Modeling and Detection of Data Leakage Fraud
 
IRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET- Data Leakage Detection System
IRJET- Data Leakage Detection System
 
DATA MINING.doc
DATA MINING.docDATA MINING.doc
DATA MINING.doc
 
Jpdcs1(data lekage detection)
Jpdcs1(data lekage detection)Jpdcs1(data lekage detection)
Jpdcs1(data lekage detection)
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detection
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOT
 
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...
AN IMPROVED FRAMEWORK FOR OUTLIER PERIODIC PATTERN DETECTION IN TIME SERIES U...
 
Outlier analysis,Chapter-12, Data Mining: Concepts and Techniques
Outlier analysis,Chapter-12, Data Mining: Concepts and TechniquesOutlier analysis,Chapter-12, Data Mining: Concepts and Techniques
Outlier analysis,Chapter-12, Data Mining: Concepts and Techniques
 
Idea
IdeaIdea
Idea
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation Forest
 
J017446568
J017446568J017446568
J017446568
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Digital Forensics for Artificial Intelligence (AI ) Systems.pdf
Digital Forensics for Artificial Intelligence (AI ) Systems.pdfDigital Forensics for Artificial Intelligence (AI ) Systems.pdf
Digital Forensics for Artificial Intelligence (AI ) Systems.pdf
 
An Efficient Approach for Outlier Detection in Wireless Sensor Network
An Efficient Approach for Outlier Detection in Wireless Sensor NetworkAn Efficient Approach for Outlier Detection in Wireless Sensor Network
An Efficient Approach for Outlier Detection in Wireless Sensor Network
 

More from DrGnaneswariG

The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...DrGnaneswariG
 

More from DrGnaneswariG (7)

Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
Chapter 4.pdf
Chapter 4.pdfChapter 4.pdf
Chapter 4.pdf
 
Chapter 2.pdf
Chapter 2.pdfChapter 2.pdf
Chapter 2.pdf
 
Chapter 7.pdf
Chapter 7.pdfChapter 7.pdf
Chapter 7.pdf
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Chapter 1.pdf
Chapter 1.pdfChapter 1.pdf
Chapter 1.pdf
 
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Ka...
 

Recently uploaded

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 

Recently uploaded (20)

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...