Chapter 6
Outlier detection
Syllabus:
What are outliers? Types, Challenges; Outlier Detection Methods: Supervised, Semi-Supervised,
Unsupervised, Proximity-Based, Clustering-Based.
What is outlier detection?
Outlier detection (also known as anomaly detection) is the process of finding data objects with
behaviors that are very different from expectation. Such objects are called outliers or anomalies.
Outlier detection is important in many applications in addition to fraud detection such as medical
care, public safety and security, industry damage detection, image processing, sensor/video network
surveillance, and intrusion detection.
What is an Outlier?
An outlier is a data object that deviates significantly from the rest of the objects, as if it were
generated by a different mechanism. For ease of presentation within this chapter, we may refer to data
objects that are not outliers as “normal” or expected data. Similarly, we may refer to outliers as
“abnormal” data.
Outliers are different from noisy data. Noise is a random error or variance in a measured variable.
In general, noise is not interesting in data analysis, including outlier detection.
For example, in credit card fraud detection, a customer’s purchase behavior can be modeled as a
random variable. A customer may generate some “noise transactions” that may seem like “random
errors” or “variance,” such as by buying a bigger lunch one day, or having one more cup of coffee
than usual. Such transactions should not be treated as outliers; otherwise, the credit card
company would incur heavy costs from verifying that many transactions.
The company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.
Types of Outliers :-
In general, outliers can be classified into three categories:
Global Outliers
Contextual Outliers
Collective Outliers
Global Outliers :-
In a given data set, a data object is a global outlier if it deviates significantly from the rest of
the data set.
Global outliers are sometimes called point anomalies, and are the simplest type of outliers.
Most outlier detection methods are aimed at finding global outliers.
Examples :-
To detect global outliers, a critical issue is to find an appropriate measurement of deviation
with respect to the application in question. Various measurements are proposed,
and, based on these, outlier detection methods are partitioned into different categories. We will
come to this issue in detail later.
Global outlier detection is important in many applications. Consider intrusion detection in
computer networks, for example. If the communication behavior of a computer is very different
from the normal patterns (e.g., a large number of packets is broadcast in a short time), this
behavior may be considered a global outlier and the corresponding computer is a suspected
victim of hacking. As another example, in trading transaction auditing systems, transactions that
do not follow the regulations are considered global outliers and should be held for further
examination.
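A common way to sketch global outlier detection on a one-dimensional sample is a z-score test
against the sample mean. The function name, the toy data, and the threshold of 2 standard
deviations below are illustrative assumptions, not part of the text:

```python
from statistics import mean, stdev

def global_outliers(values, z_thresh=2.0):
    """Flag values whose z-score exceeds z_thresh as global outliers.

    The threshold is an illustrative choice; extreme values also inflate
    the sample standard deviation, so very strict thresholds can mask them.
    """
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > z_thresh]

data = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 deviates strongly from the rest
print(global_outliers(data))  # only the extreme value is reported
```

More robust variants replace the mean and standard deviation with the median and the median
absolute deviation, which are less distorted by the outliers themselves.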
Contextual Outliers :-
In a given data set, a data object is a contextual outlier if it deviates significantly with
respect to a specific context of the object.
Contextual outliers are also known as conditional outliers because they are conditional on the
selected context.
Therefore, in contextual outlier detection, the context has to be specified as part of the
problem definition.
Generally, in contextual outlier detection, the attributes of the data objects in question are
divided into two groups:
i. Contextual attributes
ii. Behavioral attributes
Examples :-
“The temperature today is 28 °C. Is it exceptional (i.e., an outlier)?” It depends, for example,
on the time and location! If it is a winter day in Toronto, yes, it is an outlier. If it is a summer
day in Toronto, then it is normal. Unlike global outlier detection, in this case, whether or not
today’s temperature value is an outlier depends on the context: the date, the location, and
possibly some other factors.
i. Contextual attributes :-
The contextual attributes of a data object define the object’s context. In the temperature
example, the contextual attributes may be date and location.
ii. Behavioral attributes :-
These define the object’s characteristics, and are used to evaluate whether the object is
an outlier in the context to which it belongs. In the temperature example, the behavioral
attributes may be the temperature, humidity, and pressure.
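The split into contextual and behavioral attributes can be sketched as follows: each record
carries a context key (here a hypothetical location-month string) and a behavioral value (the
temperature), and a value is flagged only when it deviates from other values sharing its
context. The threshold and the data are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_outliers(records, z_thresh=2.0):
    """records: list of (context, value) pairs. A value is a contextual
    outlier when it deviates strongly from values sharing its context."""
    groups = defaultdict(list)
    for ctx, v in records:
        groups[ctx].append(v)
    flagged = []
    for ctx, v in records:
        vals = groups[ctx]
        if len(vals) < 3:
            continue  # too little data in this context to judge
        m, s = mean(vals), stdev(vals)
        if s > 0 and abs(v - m) / s > z_thresh:
            flagged.append((ctx, v))
    return flagged

# 28 degrees is normal in July but a contextual outlier in January
readings = [("Toronto-Jan", t) for t in (-5, -7, -4, -6, -5, -6, -4, -7, 28)] + \
           [("Toronto-Jul", t) for t in (26, 28, 27, 29)]
print(contextual_outliers(readings))
```

Note that the same behavioral value, 28, is flagged in one context and passes in the other,
which is exactly the property that distinguishes contextual from global outliers.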
Collective Outliers :-
Suppose you are a supply-chain manager of AllElectronics. You handle thousands of orders
and shipments every day. If the shipment of an order is delayed, it may not be considered an
outlier because, statistically, delays occur from time to time. However, you have to pay attention
if 100 orders are delayed on a single day. Those 100 orders as a whole form an outlier, although
each of them may not be regarded as an outlier if considered individually. You may have to take a
close look at those orders collectively to understand the shipment problem.
Given a data set, a subset of data objects forms a collective outlier if the objects as a whole
deviate significantly from the entire data set. Importantly, the individual data objects may not be
outliers.
As a classic illustration, a group of closely packed objects forms a collective outlier because
the density of those objects is much higher than in the rest of the data set. However, no object
in the group is an outlier individually with respect to the whole data set.
Collective outlier detection has many important applications. For example, in intrusion
detection, a denial-of-service packet from one computer to another is considered normal, and
not an outlier at all.
However, if several computers keep sending denial-of-service packets to each other, they as
a whole should be considered a collective outlier. The computers involved may be suspected of
being compromised by an attack.
As another example, a stock transaction between two parties is considered normal. However,
a large set of transactions of the same stock among a small group of parties in a short period
is a collective outlier because it may be evidence of some people manipulating the market.
Unlike global or contextual outlier detection, in collective outlier detection we have to
consider not only the behavior of individual objects, but also that of groups of objects.
Therefore, to detect collective outliers, we need background knowledge of the relationship
among data objects such as distance or similarity measurements between objects.
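The delayed-shipments example above can be sketched as a count per day: no single delayed order
is an outlier, but a day whose delay count as a whole far exceeds the norm is flagged. The
threshold of 50 delays per day is an assumed illustrative value:

```python
from collections import Counter

def collective_outlier_days(delayed_order_dates, threshold=50):
    """Each delayed order alone is normal; a day on which the number of
    delays as a whole far exceeds the norm is a collective outlier."""
    per_day = Counter(delayed_order_dates)
    return {day: n for day, n in per_day.items() if n >= threshold}

dates = ["2024-03-01"] * 3 + ["2024-03-02"] * 2 + ["2024-03-03"] * 100
print(collective_outlier_days(dates))  # only 2024-03-03 stands out
```

The grouping by day is the background knowledge mentioned above: without a notion of which
objects belong together, the collective pattern cannot be seen.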
Challenges of Outlier Detection :-
Outlier detection is useful in many applications yet faces many challenges such as the
following:
Modeling normal objects and outliers effectively.
Application-specific outlier detection.
Handling noise in outlier detection.
Understandability.
Modeling normal objects and outliers effectively :-
Outlier detection quality highly depends on the modeling of normal (nonoutlier) objects and
outliers.
Often, building a comprehensive model for data normality is very challenging, if not
impossible.
This is partly because it is hard to enumerate all possible normal behaviors in an application.
The border between data normality and abnormality (outliers) is often not clear cut.
Instead, there can be a wide gray area. Consequently, while some outlier detection
methods assign to each object in the input data set a label of either “normal” or “outlier,” other
methods assign to each object a score measuring the “outlier-ness” of the object.
Application-specific outlier detection :-
Technically, choosing the similarity/distance measure and the relationship model to describe
data objects is critical in outlier detection.
Unfortunately, such choices are often application-dependent. Different applications may have
very different requirements.
For example, in clinical data analysis, a small deviation may be important enough to justify an
outlier.
In contrast, in marketing analysis, objects are often subject to larger fluctuations, and
consequently a substantially larger deviation is needed to justify an outlier.
Outlier detection’s high dependency on the application type makes it impossible to develop a
universally applicable outlier detection method.
Instead, individual outlier detection methods that are dedicated to specific applications must
be developed.
Handling noise in outlier detection :-
As mentioned earlier, outliers are different from noise. It is also well known that the quality
of real data sets tends to be poor.
Noise often unavoidably exists in data collected in many applications. Noise may be present
as deviations in attribute values or even as missing values.
Low data quality and the presence of noise bring a huge challenge to outlier detection. They
can distort the data, blurring the distinction between normal objects and outliers.
Moreover, noise and missing data may “hide” outliers and reduce the effectiveness of outlier
detection—an outlier may appear “disguised” as a noise point, and an outlier detection method
may mistakenly identify a noise point as an outlier.
Understandability :-
In some application scenarios, a user may want to not only detect outliers, but also
understand why the detected objects are outliers.
To meet the understandability requirement, an outlier detection method has to provide some
justification of the detection. For example, a statistical method can be used to justify the
degree to which an object may be an outlier based on the likelihood that the object was
generated by the same mechanism that generated the majority of the data.
The smaller the likelihood, the less likely the object was generated by the same
mechanism, and the more likely the object is an outlier.
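The statistical justification described above can be sketched by scoring each object with its
likelihood under a Gaussian model fitted to the data; the smaller the likelihood, the higher the
outlier-ness. This is a minimal sketch assuming one-dimensional data and a normal distribution
for the generating mechanism:

```python
import math
from statistics import mean, stdev

def outlier_scores(values):
    """Score each value by the Gaussian likelihood of having been generated
    by the mechanism that produced the majority of the data; a lower
    likelihood means a higher outlier-ness."""
    m, s = mean(values), stdev(values)
    def likelihood(v):
        return math.exp(-((v - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    return {v: likelihood(v) for v in values}

scores = outlier_scores([10, 11, 9, 10, 12, 95])
# 95 receives a far smaller likelihood than the values near the mean,
# which is the justification a user can be shown for flagging it
```

Reporting the likelihood itself, rather than a bare "outlier" label, is what makes the
detection understandable to the user.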
Outlier Detection Methods :-
There are many outlier detection methods in the literature and in practice. Here, we present
two orthogonal ways to categorize outlier detection methods.
First, we categorize outlier detection methods according to whether the sample of data for
analysis is given with domain expert–provided labels that can be used to build an outlier detection
model.
Second, we divide methods into groups according to their assumptions regarding normal
objects versus outliers.
Some of the Methods in Outlier Detection Methods are;
Supervised Methods
Unsupervised Methods
Semi-Supervised Methods
Proximity-Based Methods
Clustering-Based Methods
Supervised Methods
Supervised methods model data normality and abnormality. Domain experts examine and
label a sample of the underlying data.
Outlier detection can then be modeled as a classification problem. The sample is used for
training and testing.
In some applications, the experts may label just the normal objects, and any other objects not
matching the model of normal objects are reported as outliers.
Other methods model the outliers and treat objects not matching the model of outliers as
normal.
The two classes (i.e., normal objects versus outliers) are imbalanced. That is, the
population of outliers is typically much smaller than that of normal objects.
Therefore, methods for handling imbalanced classes may be used, such as oversampling
(i.e., replicating) outliers to increase their representation in the training set used to
construct the classifier.
Due to the small population of outliers in data, the sample data examined by domain
experts and used in training may not even sufficiently represent the outlier distribution. The
lack of outlier samples can limit the capability of classifiers built as such. To tackle these
problems, some methods “make up” artificial outliers.
In many outlier detection applications, catching as many outliers as possible (i.e., the
sensitivity or recall of outlier detection) is far more important than not mislabeling normal
objects as outliers.
Consequently, when a classification method is used for supervised outlier detection, it
has to be interpreted appropriately so as to consider the application interest on recall.
Supervised methods of outlier detection must be careful in how they train and how they
interpret classification rates due to the fact that outliers are rare in comparison to the other
data samples.
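The oversampling step mentioned above can be sketched as replicating the rare outlier examples
until they reach a target share of the training set. The target ratio of 0.3 and the toy data
are assumptions for illustration:

```python
import random

def oversample_outliers(samples, labels, target_ratio=0.3, seed=0):
    """Replicate outlier examples until they make up roughly target_ratio
    of the training set (a simple class-rebalancing step)."""
    rng = random.Random(seed)
    outliers = [s for s, y in zip(samples, labels) if y == "outlier"]
    xs, ys = list(samples), list(labels)
    while outliers and ys.count("outlier") / len(ys) < target_ratio:
        xs.append(rng.choice(outliers))  # duplicate a random outlier example
        ys.append("outlier")
    return xs, ys

x = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [99]]
y = ["normal"] * 9 + ["outlier"]
x2, y2 = oversample_outliers(x, y)
```

A classifier trained on the rebalanced set is less likely to ignore the outlier class; the
complementary trick of generating artificial outliers addresses the case where even the labeled
outliers do not cover the outlier distribution.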
Unsupervised Methods
Unsupervised outlier detection methods make an implicit assumption: The normal objects are
somewhat “clustered.” In other words, an unsupervised outlier detection method expects that
normal objects follow a pattern far more frequently than outliers.
Normal objects do not have to fall into one group sharing high similarity. Instead, they can
form multiple groups, where each group has distinct features.
However, an outlier is expected to occur far away in feature space from any of those groups
of normal objects. This assumption may not hold all the time. For example, the normal objects
may not share any strong patterns; instead, they may be uniformly distributed, while the
collective outliers share high similarity in a small area.
Unsupervised methods cannot detect such outliers effectively. In some applications, normal
objects are diversely distributed, and many such objects do not follow strong patterns. For
instance, in some intrusion detection and computer virus detection problems, normal activities are
very diverse and many do not fall into high-quality clusters.
In such scenarios, unsupervised methods may have a high false positive rate—they may
mislabel many normal objects as outliers (intrusions or viruses in these applications), and let
many actual outliers go undetected.
Due to the high similarity between intrusions and viruses (i.e., they have to attack key
resources in the target systems), modeling outliers using supervised methods may be far more
effective.
Many clustering methods can be adapted to act as unsupervised outlier detection methods.
The central idea is to find clusters first, and then the data objects not belonging to any cluster are
detected as outliers.
However, such methods suffer from two issues.
First, a data object not belonging to any cluster may be noise instead of an outlier.
Second, it is often costly to find clusters first and then find outliers. It is usually assumed that
there are far fewer outliers than normal objects.
The latest unsupervised outlier detection methods develop various smart ideas to tackle
outliers directly without explicitly and completely finding clusters.
Semi-Supervised Methods
In many applications, although obtaining some labeled examples is feasible, the number of
such labeled examples is often small.
We may encounter cases where only a small set of the normal and/or outlier objects are
labeled, but most of the data are unlabeled. Semi-supervised outlier detection methods were
developed to tackle such scenarios.
Semi-supervised outlier detection methods can be regarded as applications of semi-supervised
learning methods.
For example, when some labeled normal objects are available, we can use them, together
with unlabeled objects that are close by, to train a model for normal objects.
The model of normal objects then can be used to detect outliers—those objects not fitting the
model of normal objects are classified as outliers.
If only some labeled outliers are available, semi-supervised outlier detection is trickier. A
small number of labeled outliers are unlikely to represent all the possible outliers.
Therefore, building a model for outliers based on only a few labeled outliers is unlikely to be
effective.
To improve the quality of outlier detection, we can get help from models for normal objects
learned from unsupervised methods.
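The idea of training a normal model from labeled normal objects together with nearby unlabeled
objects can be sketched as follows. The expansion distance, the z-score threshold, and the
one-dimensional data are assumed illustrative choices:

```python
from statistics import mean, stdev

def semi_supervised_detect(labeled_normals, unlabeled, expand_dist=2.0, z_thresh=2.5):
    """Grow the normal model: unlabeled points close to a labeled normal
    point are assumed normal too; the rest are scored against that model."""
    normals = list(labeled_normals)
    for u in unlabeled:
        if any(abs(u - n) <= expand_dist for n in labeled_normals):
            normals.append(u)  # close to a labeled normal: treat as normal
    m, s = mean(normals), stdev(normals)
    # objects not fitting the model of normal objects are reported as outliers
    return [u for u in unlabeled if abs(u - m) / s > z_thresh]

flagged = semi_supervised_detect([10.0, 11.0, 9.5], [10.5, 9.0, 11.5, 42.0])
print(flagged)  # only the point far from every labeled normal is reported
```

Only three points carry labels here, yet the model is fitted on six, which is the whole point
of the semi-supervised setting: the unlabeled data sharpen the model of normality.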
Proximity-Based Methods
Proximity-based methods assume that an object is an outlier if the nearest neighbors of the
object are far away in feature space, that is, the proximity of the object to its neighbors
significantly deviates from the proximity of most of the other objects to their neighbors in the
same data set.
There are two major types of proximity-based outlier detection, namely distance-based and
density-based outlier detection.
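A minimal sketch of the distance-based variant: a point is flagged when even its k-th nearest
neighbor lies beyond a distance threshold, so that its proximity to its neighbors deviates from
that of the other points. Both k and the threshold are assumed illustrative values:

```python
import math

def knn_outliers(points, k=2, dist_thresh=5.0):
    """Flag a point as an outlier when its k-th nearest neighbor is farther
    away than dist_thresh (a distance-based proximity test)."""
    flagged = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        if dists[k - 1] > dist_thresh:  # even the k-th neighbor is far away
            flagged.append(p)
    return flagged

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (20, 20)]
print(knn_outliers(pts))  # the isolated point is flagged
```

Density-based methods refine this by comparing each point's local density to that of its
neighbors rather than using a single global distance threshold.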
Clustering-Based Methods
Clustering-based methods assume that the normal data objects belong to large and dense
clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
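This assumption can be sketched with a naive greedy single-link grouping (a simplification for
illustration, not a full clustering algorithm): points within eps of an existing cluster member
join that cluster, and members of clusters smaller than min_size are reported as outliers. The
eps and min_size values are assumptions:

```python
import math

def cluster_outliers(points, eps=2.0, min_size=3):
    """Greedy single-link grouping: a point joins the first cluster that has
    a member within eps; members of clusters smaller than min_size are
    reported as outliers."""
    clusters = []
    for p in points:
        home = None
        for c in clusters:
            if any(math.dist(p, q) <= eps for q in c):
                home = c
                break
        if home is None:
            clusters.append([p])  # start a new cluster
        else:
            home.append(p)
    # small or sparse clusters are treated as outliers
    return [p for c in clusters for p in c if len(c) < min_size]

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50), (51, 50)]
print(cluster_outliers(pts))  # the two isolated points form a small cluster
```

Because the pass is single and greedy, point order can split clusters that a proper algorithm
would merge; real implementations use methods such as DBSCAN or k-means followed by a
cluster-size check.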