MASTER THESIS
Master's Programme in Embedded and Intelligent Systems, 120 credits
Estimating p-values for outlier detection
Henrik Norrman
Computer Science and Engineering, 30 credits
Halmstad 2014-06-16
Estimating p-values for outlier detection
Master thesis in Computer Science and Engineering, 30 credits
2014
Author: Henrik Norrman
Supervisors: Thorsteinn Rögnvaldsson, Stefan Byttner, Eric Järpe
Opponent: Tuve Löfström (University of Borås)
Examiner: Antanas Verikas
School of Information Science, Computer and Electrical Engineering
Halmstad University
PO Box 823, SE-301 18 HALMSTAD
Sweden
Estimating p-values for outlier detection
Henrik Norrman
© Copyright Henrik Norrman, 2014. All rights reserved.
Master thesis report IDE1408
School of Information Science, Computer and Electrical Engineering
Halmstad University
Typeset in 11pt Palatino (LaTeX)
Preface
This work is a thesis for the Master’s Programme in Embedded and Intelligent Systems
(120 credits) at Halmstad University. It is the original work of the author and was conducted between December 2013 and June 2014.
The author would like to express his gratitude to his supervisors Thorsteinn Rögnvaldsson, Stefan Byttner, and Eric Järpe, from Halmstad University, for their support and time.
The author would also like to thank his opponent, Tuve Löfström from the University of
Borås, for his thoughtful feedback and constructive criticism.
Additionally, the author would like to thank his lecturers, examiners, and classmates for making the last two years a truly enjoyable experience.
Finally, the author would like to thank his family and friends for their support and patience.
Abstract
Outlier detection is useful in a vast number of domains, wherever there is data and a need for analysis. The research area related to outlier detection is large and the number of available approaches is constantly growing. Most of the approaches produce a binary result: either outlier or not. In this work, approaches that are able to detect outliers by producing a p-value estimate are investigated. Approaches that estimate p-values are interesting since their results can easily be compared against each other, followed over time, or used with a variable threshold.
Four approaches are subjected to a variety of tests to measure their suitability when the data is distributed in a number of ways. The first approach, the R2S, was developed at Halmstad University and is based on finding the mid-point of the data. The second approach is based on one-class support vector machines (OCSVM). The third and fourth approaches are both based on conformal anomaly detection (CAD), but use different nonconformity measures (NCM): the Mahalanobis distance to the mean and a variation of k-NN.
The R2S and the CAD Mahalanobis are both good at estimating p-values from data generated by unimodal and symmetrical distributions. The CAD k-NN is good at estimating p-values when the data is generated by a bimodal or extremely asymmetric distribution. The OCSVM does not excel in any scenario, but produces good average results in most of the tests. The approaches are also subjected to real data, where they all produce comparable results.
Contents
Preface i
Abstract iii
1 Introduction 1
1.1 The nature of an outlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What is a p-value? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Benefits of using a p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Research question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Research scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6 Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 The nature of the data . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Prior knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Approaches to outlier detection . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Statistical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.5 High-dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Outlier detection by p-value estimation . . . . . . . . . . . . . . . . . . . . 12
2.4 Conformal anomaly detection (CAD) . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Non-conformity measure (NCM) . . . . . . . . . . . . . . . . . . . . 14
2.5 R2S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 The evolution of the support vector machine . . . . . . . . . . . . . . . . . 15
2.6.1 The linearly separable case . . . . . . . . . . . . . . . . . . . . . . . 15
2.6.2 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.3 Soft margins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.4 One class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Methods 23
3.1 Approach selection criteria and discussion . . . . . . . . . . . . . . . . . . 23
3.2 Adaptation and implementations . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Implementation of the CAD . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 First NCM: Mahalanobis distance . . . . . . . . . . . . . . . . . . . 24
3.3.2 Second NCM: k-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Selecting a k-value for k-NN . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Implementation of the R2S . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Implementation of the OCSVM . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Selecting the kernel and the kernel parameters . . . . . . . . . . . . . . . . 27
3.7 Test plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8 Calculating true p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8.1 The normal Gaussian distribution . . . . . . . . . . . . . . . . . . . 33
3.8.2 The Student’s t-distribution . . . . . . . . . . . . . . . . . . . . . . . 34
3.8.3 The symmetric beta distribution . . . . . . . . . . . . . . . . . . . . 35
3.8.4 The non-symmetric beta distribution . . . . . . . . . . . . . . . . . . 35
3.8.5 The log-normal distribution . . . . . . . . . . . . . . . . . . . . . . . 36
3.8.6 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8.7 Bimodal normal distribution . . . . . . . . . . . . . . . . . . . . . . 37
4 Results 39
4.1 Generated univariate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.2 t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.3 Symmetric beta distribution . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.4 Non-symmetric beta distribution . . . . . . . . . . . . . . . . . . . . 49
4.1.5 Log-normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1.6 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.7 Bimodal normal distribution . . . . . . . . . . . . . . . . . . . . . . 58
4.1.8 Normal distribution with uniform noise . . . . . . . . . . . . . . . . 61
4.1.9 Normal distribution with normal noise . . . . . . . . . . . . . . . . 64
4.2 Generated bivariate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Elliptic normal distribution . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.3 Bimodal normal distribution . . . . . . . . . . . . . . . . . . . . . . 74
4.2.4 Normal distribution with uniform noise . . . . . . . . . . . . . . . . 77
4.2.5 Normal distribution with normal noise . . . . . . . . . . . . . . . . 80
4.2.6 Log-normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.2.7 Normal/log-normal distribution . . . . . . . . . . . . . . . . . . . . 86
4.2.8 Collected results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5 Discussion 93
5.1 Interpreting the estimations . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 About the p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Interpreting comparisons and RMSE values . . . . . . . . . . . . . . . . . . 97
5.4 R2S versus CAD Mahalanobis . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Continuation of the research . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 Conclusion 99
Bibliography 101
Chapter 1
Introduction
The gathering and accumulation of information used to be hard. At first it was limited to travel between minds through speech, a game of telephone spanning generations. Then humans invented numerals, mathematics, and writing. Information could be stored for generations and still convey the author's original ideas several decades down the line. The invention of numerals and mathematics gave us a more abstract way of representing quantities, bookkeeping, and census collection. These skills might have been available only to a lucky and select few and controlled by institutions, but change was coming. Gutenberg's printing press did not ease the gathering of information much, but it revolutionized its spread and distribution.
Industrialization created new needs for information gathering, as managers needed to control processes and wanted tools for decision making. The amount and rate of information were for a while manageable by pens, paper, and brains. Then computers arrived, and both information gathering and storage now grow at an exponential rate, far beyond the limits of the human brain. One can ask whether it is information gathering that drives storage density to new heights, or the other way around; the need or the ability. In order for us to make sense of the information, to make it useful, it needs to be distilled by removing the noise and highlighting the parts that fill a purpose.
One way of understanding the data is to find the parts that do not conform to the normal, those that give us clues about if and when something is wrong. There are many methods for doing this, and this work will attempt to summarize the current state of the art. Some of them will be selected for an in-depth review and performance evaluation.
1.1 The nature of an outlier
Throughout the literature, many different words are used to describe the concept of a measurement that does not seem to belong with other measurements from the same source. Novelty, anomaly, noise, deviation, exception, discordant observation, aberration, surprise, peculiarity, contaminant: in the context of outlier detection, all these words mean almost the same thing. In this work, the word outlier will be used to describe this concept in order to ease the confusion. Outliers can be classified based on the source and the situation, such as:
Point outliers are the simplest form. The points in the data set do not have any mutual relationship besides being created by the same process. In this case the points
can be compared to each other or to a model without any loss of information. This case
has a clear connection to detecting outliers in a global sense.
Contextual or conditional outliers means there is some underlying relationship between the points. Whether a point is an outlier depends on where or when it appears, e.g. in relation to time (temporal or behavioral) or to its neighbors (contextual). Imagine measurements of temperature. A peculiarly high temperature during summer is expected, and a peculiarly low temperature during winter is also expected. But a peculiarly high temperature during winter might be considered an outlier. This has some relation to detecting outliers in a local sense.
Outliers can appear for many reasons: human error, errors in the measurement equipment, natural deviations, or faults in systems, among others.
Outlier detection may sound rather abstract from a layman's point of view, but in the abstraction lies the flexibility. None of the approaches in this work depend on what the numbers actually represent; they are in that sense information neutral. As long as there is data that is properly represented in a numerical or categorical sense, outlier detection can be applied and be useful. Large amounts of data might be hard to get a grasp on, and then the outliers, the exceptions, might contain more information. Although some approaches mentioned in this work actually construct models of the data distribution, outlier detection can also be useful for cleaning the data before using a secondary modeling approach.
1.2 What is a p-value?
Central to this work is the concept of the p-value, which comes from hypothesis testing. To make this precise, let us first define the level of significance. In hypothesis testing of a null hypothesis versus an alternative, the level of significance is the risk of rejecting the null hypothesis when it is actually true. Having observed the realization of the test statistic, the p-value is then the smallest level of significance at which the test would have rejected the null hypothesis.
Our null hypothesis in this case is that a data point is generated according to the same distribution as the previously seen data points, and the alternative hypothesis is that it is not. Hypothesis testing is a trivial task when the distribution under the null hypothesis is known, but that is not always something that can be assumed in real applications. The problem investigated in this work is how well a handful of approaches can estimate the p-value of data points without knowledge of the null distribution, but with access to a set of training data generated according to the same null distribution. Different kinds and variations of distributions are tested to get an idea of what assumptions the approaches make about the null distribution, and in some of the tests the training data is contaminated with noise to observe the effect on the approaches. The data points are artificially generated so that the true null distribution can be used as a control. Based on knowledge of the null distributions, the true p-values of the test data points can be calculated and the error of the approaches can be measured. The calculation of the true p-values is explored in Section 3.8.
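As a sketch of the control calculation described above, the true p-value of a point under a known null distribution can be computed directly from its cumulative distribution function. The two-sided convention below is one common choice for symmetric distributions; it is an illustration only, and the thesis derives the exact calculations per distribution in Section 3.8.

```python
from scipy.stats import norm

def true_p_value(x, dist=norm(0, 1)):
    """Two-sided p-value of observing a point at least as extreme as x
    under a known null distribution (illustrative convention)."""
    cdf = dist.cdf(x)
    return 2 * min(cdf, 1 - cdf)

print(true_p_value(0.0))  # a point at the centre is maximally typical
print(true_p_value(3.0))  # a point 3 sigma out gets a very small p-value
```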
1.3 Benefits of using a p-value
There are a number of benefits to representing a data point's outlier score as a p-value. The p-value, as defined in Section 1.2, is a well-understood concept. It is a convenient score to use since it is bounded between 0 and 1, and it can also be treated as a probability score of a point being an outlier. By using a standardized score as the output of different approaches, it is possible to compare their results, something that is hard to do with legacy scores. Often, especially in time-dependent systems, change has to be tracked. If that is the case, then it might not be possible to use a fixed training set; in the spirit of learning systems, the training set has to be able to change to adapt to the system. Changing the training set might, depending on the approach used, have an effect on a legacy score. It is therefore not necessarily sound to compare scores produced by the same approach given different training sets, so there is a clear need for a standardized score. Lastly, one of the more interesting and useful characteristics of the p-value is that, given that the null hypothesis is true, the p-value is uniformly distributed. So even if the approach changes, or the training set, given the same null hypothesis, it is possible to catch deviations in the process being observed by comparing the distribution of the p-values to a uniform distribution between 0 and 1. This is utilized when examining the real data in Section 3.7.2 and Section 4.3.
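This uniformity property is easy to check numerically: if p-values are computed under the correct null, their empirical distribution should be indistinguishable from uniform. The sketch below uses hypothetical simulated data and a Kolmogorov–Smirnov statistic as one way to quantify the comparison.

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)

# Two-sided p-values computed under the (true) null N(0, 1)
cdf = norm.cdf(samples)
p_values = 2 * np.minimum(cdf, 1 - cdf)

# If the null holds, the p-values should be uniform on [0, 1]
stat, _ = kstest(p_values, "uniform")
print(f"KS statistic vs. uniform: {stat:.4f}")  # small => consistent with uniform
```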
1.4 Research question
The research question is split into several points:
• What are the state-of-the-art approaches in the research area of outlier detection?
• Identify those approaches that estimate p-values or those that can be adapted to
estimate p-values. Select a set of them.
• How well do these selected approaches estimate the p-value given sets of synthetic
data generated from a selection of distributions?
• The distributions should be selected to test different aspects such as symmetry,
modality and robustness to noise.
• Subject the approaches to tests based on non-synthetic data and attempt to interpret
the results.
1.5 Research scope
The research is subject to the following constraints:
• A requirement of the selected approaches is that they should be able to accept input
in the form of a matrix of distances between the data points, and not the input
points themselves.
• One of the approaches should be the R2S approach developed for a project at Halm-
stad University.
• One of the approaches should be based on support vector machines.
• The search area is huge, so the approaches in the overview should be limited to those found in the surveys mentioned in Section 2.2 and those found using the keywords “outlier”, “anomaly”, and “novelty”, or variations thereof.
1.6 Structure of this work
Chapter 2 starts by introducing the format of, and the expectations on, the kind of data to which this work applies. It continues with an overview of current state-of-the-art approaches that are suitable for outlier detection. Then follows a continuation of Sections 1.2 and 1.3, exploring the rationale for and usefulness of p-values and how they apply to the approaches mentioned in the chapter. The approaches selected for further comparison are then described.
Chapter 3 begins by explaining the reasoning behind the selection of the approaches and
their implementations are then described. It is followed by an explanation of the test plan
and the true p-value calculations.
Chapter 4 presents the results of the tests where the data comes from a univariate distribution, a bivariate distribution, or a real source.
Chapter 5 discusses and analyzes the results and presents the author's interpretation of them.
Chapter 6 summarizes the results and the findings in this work.
Chapter 2
Background
2.1 The data
2.1.1 The nature of the data
The data is generally assumed to be a set or a list of samples. In this work the input data can also be a matrix of distances between the samples, or points. These samples are also referred to as objects, records, points, vectors, patterns, events, cases, observations, or entities throughout the literature. In this work, the words samples and points will be used. Each sample consists of several attributes, also called variables, characteristics, features, fields, or dimensions. In most of this work, attributes will be used, and dimensions will be used in a more general sense to describe the number of variables expected per sample. The distance between the samples or points can be e.g. the Euclidean distance, the Mahalanobis distance, or any other distance measure suitable for the task.
A set that only contains samples with one attribute is thus called 1-dimensional or univariate. With more than one, it is referred to as N-dimensional, where N is the number of attributes per sample, or simply multivariate. The attributes themselves can be numeric, continuous or discrete, or categorical, with or without order. A sample where all the values are of the same type is called monotyped; when the attributes are of different types, it is called multityped. Some of the approaches presented require the data to be entirely numeric. Converting categorical values to numerical ones is unproblematic and commonly done in preprocessing, but it should be noted that, especially for unordered categorical data, it might have an effect on the outlier detection method used and the final result. The data is usually represented in this work as matrices, where each row is a sample and the columns represent the different attributes.
The data that is given a priori and used in the creation of a solution is in this work referred to as the training set. How this process is conducted is explained in more detail in Chapter 3.
The relations between the samples depend on the context of the problem. They might be independent, but also linearly ordered, e.g. as a time series. They can also be spatially related, as in neighborhoods, or both (spatio-temporal), e.g. as climate data. The relations matter greatly when an outlier detection approach is selected, although this work will primarily focus on independent data.
2.1.2 Prior knowledge
The more that is known about the data beforehand, the easier it will be to select a suitable outlier detection approach, e.g. whether the data comes from an identifiable distribution or whether preprocessing, such as normalization or whitening, is necessary. The data might also be labeled, meaning that the outlier status, or class membership, of all or some of the samples is already known. The state of membership knowledge can be classified according to three cases.
Case 1, none of the samples' memberships are known. This is known as the unsupervised case. Given that some data is available, but with normal data and outliers mixed in unknown proportions, it is possible, using a robust approach, to estimate or model the distribution. The word robust in this case means that the selected approach is insensitive to individual outliers, given reasonable proportions. The fewer outliers there are in the data, the less robust the approach needs to be in order to produce a satisfactory solution.
Case 2, all the correct labels are known for the given samples. This is known as the
supervised case.
Case 3, the correct label is known for only normal or only outlier data, or only one of the two classes is available as training data. This case benefits those approaches that are constructed to find an optimal boundary around the data. Since there are only instances of one class, there is no notion of robustness to consider; the problem is more along the lines of model estimation. In this work the focus will be on data of case 1, where the proportion of outliers in the data is small, and of case 3, where only data known to belong to the normal class is available.
2.2 Approaches to outlier detection
Most of the approaches make use of some kind of distance measure. The distance is a proxy for how much the tested data sample deviates from what is considered normal. The approaches differ in which distance measure is used and how normality is determined.
The approaches also differ in what kind of training data they need. Some approaches work best when trained only on normal data and perform badly when there are outliers in the training data. Other approaches are less susceptible to this, which is referred to as robust in this work. Robustness is insensitivity to outliers in the training data; it does not mean that an approach is objectively better, as that depends on the available training data and the application. As an example, a very robust approach might be badly suited for cases where the normal data comes from more than one distribution, since it would only generalize to whichever distribution it deemed more important. On the other hand, a more sensitive approach would be able to adapt to two distributions, but also to noise in the training data.
The following summary of the area of outlier detection and related approaches is based
on a collection of reviews. These are Chandola et al. [2009], Hodge and Austin [2004],
Markou and Singh [2003a,b], Kriegel et al. [2009b], Patcha and Park [2007], Gogoi et al.
[2011], Frisén [2003], Khan and Madden [2010], Zhang et al. [2010], Gupta et al. [2014].
2.2.1 Statistical
This section covers some approaches that one might call statistical. What sets these approaches apart from the others is that they assume that normal data follows a certain kind of statistical model, and they then test how well any new data fits that model. If the new data has a low probability of belonging to the model, the data is deemed to be an outlier. This category is split into two subcategories depending on what they assume is known about the data and the process generating the data.
Parametric
Parametric approaches assume quite a lot about the data, but if the assumptions prove true, these approaches are among the most effective. Usually it is assumed that the data follows a Gaussian distribution, and a model is created accordingly. The parameters of the model can be estimated using e.g. maximum likelihood estimates. One easy way to use a model to find outliers is simply to calculate how far a data point is from the mean and use a threshold to determine how likely the point is to be an outlier. Shewhart [1931], an early pioneer in this subject, used a threshold of 3 standard deviations: further away than that, and the point was considered an outlier. Another very simple way is to calculate the Z value and compare it to a significance level, which is Grubbs' test [Grubbs, 1969]. This can be extended to multivariate models using the Mahalanobis distance from the mean.
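The 3-standard-deviation rule can be sketched in a few lines. This is a minimal, hypothetical illustration; the function and variable names are not taken from any particular implementation.

```python
import numpy as np

def z_score_outliers(train, test, threshold=3.0):
    """Flag test points that lie further than `threshold` standard
    deviations from the training mean (Shewhart's 3-sigma rule)."""
    mu, sigma = train.mean(), train.std()
    z = np.abs((test - mu) / sigma)
    return z > threshold

rng = np.random.default_rng(1)
train = rng.standard_normal(1000)
flags = z_score_outliers(train, np.array([0.5, 10.0]))
print(flags)  # the point at 10 is flagged, the one at 0.5 is not
```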
One approach is to visualize the data as a box plot [Tukey, 1977]. A box plot is a very simple and well-known way to get an idea of the characteristics of a distribution, although it assumes the data follows a univariate distribution. It shows five significant values: the lower extreme, the lower quartile, the median, the upper quartile, and the upper extreme. It is then visually easy to guess whether a data point is an outlier.
For multivariate data, assuming a Gaussian model and calculating the mean and covariance matrix is a simple solution. The squared Mahalanobis distance then follows a χ2-distribution whose degrees of freedom correspond to the dimensionality of the data. One significant problem is that these calculations are very sensitive to outliers in the training data, which can significantly affect the performance of the approach.
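The Gaussian recipe can be sketched as follows. This is an illustrative implementation, not the thesis's code; it assumes the training data is roughly Gaussian, so that the squared Mahalanobis distance can be referred to a χ2-distribution with as many degrees of freedom as there are dimensions.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_p_values(train, test):
    """p-values from the squared Mahalanobis distance to the training
    mean, referred to a chi-squared distribution (Gaussian assumption)."""
    mu = train.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train, rowvar=False))
    diff = test - mu
    # d2[i] = diff[i] @ cov_inv @ diff[i] for every test point
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return chi2.sf(d2, df=train.shape[1])

rng = np.random.default_rng(2)
train = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=2000)
p = mahalanobis_p_values(train, np.array([[0.0, 0.0], [4.0, -4.0]]))
print(p)  # near 1 at the mean, near 0 far from it
```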
From this follows another approach, one where the model is not used explicitly; rather, points are removed from the training set based on how large an impact they have on the calculated covariance of the total set. This approach is called the minimum covariance determinant and was introduced by Rousseeuw and Leroy [1987].
Another significant problem is that these approaches perform worse as the dimensionality increases, which is known as the curse of dimensionality, a problem that most approaches using some kind of distance measure suffer from.
Non-parametric
Among the statistical approaches that are not parametric in the sense described in the previous section, there are two major subcategories.
The first, and the simplest, is to use histograms to create a profile of the data, counting the data points that fall within certain ranges and assigning them to bins. Using
histograms for outlier detection is done in two steps. The first step is to build a histogram of the data. The second step is to test a data point: if it falls into a bin with a high number of training samples, it is assumed to be an instance of normal data. If it falls into a bin with a low number of training samples, or none at all, the test point is assumed to be an outlier. The number of training samples in the bin of the test point corresponds in that sense to an outlier score. The biggest challenge when using histograms for anomaly detection is selecting a suitable size for the bins. If the bins are too large, outliers will fall into bins with a lot of training data, which leads to many false negatives. On the other hand, bins that are too small risk normal data falling into empty bins, resulting in a high number of false positives.
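The two steps might look as follows in a minimal univariate sketch, where the bin count itself serves as the score and all names are illustrative. Points falling outside the histogram's range score zero, i.e. maximally outlying.

```python
import numpy as np

def histogram_scores(train, test, bins=20):
    """Outlier score = number of training samples in the bin a test
    point falls into (lower count => more outlying)."""
    counts, edges = np.histogram(train, bins=bins)
    idx = np.searchsorted(edges, test, side="right") - 1
    scores = np.zeros(len(test))
    inside = (idx >= 0) & (idx < len(counts))
    scores[inside] = counts[idx[inside]]
    return scores

rng = np.random.default_rng(3)
train = rng.standard_normal(1000)
print(histogram_scores(train, np.array([0.0, 8.0])))  # dense bin vs. far outside
```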
The other major category of statistical non-parametric outlier detection consists of approaches that are based on kernel functions. One of them is Parzen window estimation [Parzen, 1962]. This approach resembles the density estimation approaches mentioned in the previous section, except that it uses kernel functions. Statistical tests can then be used to determine whether a test data point is an outlier or not.
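A Parzen-window (kernel) density estimate is available off the shelf, e.g. as `gaussian_kde` in SciPy. The sketch below only computes the estimated density; how to turn a low density into an outlier decision (the threshold or statistical test) would still have to be chosen.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
train = rng.standard_normal(1000)

# Kernel density estimate built from the training data;
# a low estimated density at a test point suggests an outlier.
kde = gaussian_kde(train)
density = kde([0.0, 6.0])
print(density)  # high density near the bulk of the data, low far away
```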
2.2.2 Proximity
A lot of approaches to outlier detection are based on the idea that normal data is found in dense areas and outliers are found in sparse areas. Whether or not a point is within a dense area can be determined by calculating the distance or similarity between the point and its nearest neighbors and comparing the result to that of other points. How to calculate the distance is not fixed; any similarity measure that is positive-definite and symmetric may work, but effectiveness will vary. It is common to use the Euclidean or Mahalanobis distance if the values are continuous, and there are many other, more complex measures that are suitable for e.g. categorical data. Approaches that are based on proximity can be divided into two main groups: those that use the distance to a certain number of nearest neighbors as the outlier score, and those that use the distance measure to calculate the relative density around a data point and use that value as an outlier score. In this section, clustering and depth-based approaches will also be covered.
k-Nearest neighbor
k-Nearest neighbor is a very commonly used classifier, where an instance of data is assumed to belong to the same class as its nearest neighbors, often using voting if k > 1. This concept is easily adapted for use as an outlier detector by using the distance to the nearest neighbor and some threshold. Varying the k parameter causes different behaviors. Another way, as done by Ramaswamy et al. [2000], is to select the proportion of the total set of data points with the largest distance to their neighbors and declare them outliers, but this approach requires some idea of the proportion of outliers in the data set.
There are three common ways to extend the k-nearest neighbor approach. The first is to change how the outlier score is determined from the similarities between points, for example by changing the k-value (to an absolute number or a number proportional to the total data set), by calculating the outlier score from the sum of the distances to the nearest neighbors, or by combining these in a different way. The second is to use a different kind of distance measure. The third is to increase the efficiency of the algorithm since, based on the definition, the computing time grows with the square of the number of points in the data set.
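A brute-force version of the k-nearest-neighbor outlier score, using the distance to the k-th nearest training point, can be sketched as below. This is the quadratic-time formulation whose efficiency the extensions above try to improve; all names are illustrative.

```python
import numpy as np

def knn_outlier_scores(train, test, k=5):
    """Outlier score = distance to the k-th nearest training neighbour
    (brute force: all pairwise distances are computed)."""
    # Pairwise Euclidean distances, shape (len(test), len(train))
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k - 1]

rng = np.random.default_rng(5)
train = rng.standard_normal((500, 2))
scores = knn_outlier_scores(train, np.array([[0.0, 0.0], [6.0, 6.0]]))
print(scores)  # the distant point gets a much larger score
```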
Relative density
This group of approaches is very closely related to the k-nearest neighbor, but rather than selecting a fixed number of nearest neighbors and basing decisions on that, these approaches select the points that are within a certain range of the test data point, i.e. the points within an imagined hypersphere around the test point.
One approach that uses this idea is presented in Knorr and Ng [1997]. It selects the points that lie within a certain radius of the test point. If the number of such points is less than or equal to a selected percentage of the total number of points, the test point is considered an outlier. Approaches that need user-set parameters are generally not preferred in an unsupervised scenario, but this one is interesting since its percentage can easily be compared to a p-value.
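A minimal sketch of this idea (hypothetical helper name; Euclidean distance assumed):

```python
import math

def db_outlier(train, x, radius, frac):
    """Knorr-and-Ng-style test (simplified): x is an outlier if at most
    the fraction `frac` of the training points lie within `radius` of x."""
    inside = sum(1 for p in train if math.dist(x, p) <= radius)
    return inside / len(train) <= frac

train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.2, 0.1), (0.1, 0.2)]
```

The fraction `inside / len(train)`, caught before the thresholding step, is the quantity that can be compared to a p-value.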
One problem with these approaches is that they have difficulty with data sets that contain areas of varying density. Breunig et al. [1999, 2000] solve this by giving each data point an outlier score based on the ratio of the average local density around the neighbors of the data point to the local density at the data point itself, which is called the local outlier factor (LOF).
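A simplified density-ratio score in the spirit of LOF can be sketched as follows (the real LOF uses reachability distances; this toy version uses plain average k-NN distances):

```python
import math

def avg_knn_dist(points, x, k):
    """Average distance from x to its k nearest neighbors in `points`,
    excluding x itself if present."""
    dists = sorted(math.dist(x, p) for p in points if p != x)
    return sum(dists[:k]) / k

def density_ratio(train, x, k=3):
    """Average density around x's neighbors divided by the density at x;
    values well above 1 suggest an outlier (simplified LOF-like score)."""
    nbrs = sorted(train, key=lambda p: math.dist(x, p))[:k]
    dens_x = 1.0 / (avg_knn_dist(train, x, k) + 1e-12)
    dens_nbrs = [1.0 / (avg_knn_dist(train, p, k) + 1e-12) for p in nbrs]
    return (sum(dens_nbrs) / k) / dens_x

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
```

A point in a dense region scores near (or below) 1, while a point far from the data scores much higher.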
This approach has been extended both to improve the efficiency and to relate to different structures in the data. One example is the connectivity-based outlier factor (COF) by Tang et al. [2002], where the subsequent neighbors are also tested, making it able to handle data sets where the data follows a line-like structure.
Another extension is outlier detection using in-degree number, which also follows the neighbors of the neighbors of the data point, returning the number of points that have the test data point in their k-nearest neighbor lists. This group of approaches suffers from the same computational complexity as the k-nearest neighbor algorithms, and a lot of research has gone into improving the efficiency.
Clustering
Clustering is a group of approaches that are extensively used in an unsupervised setting.
These approaches focus on finding groups in the data based on similarities without actu-
ally knowing what the similarity is. This is connected to the problem of outlier detection
based on what the clustering approach decides to do with those instances of data that do
not seem to belong to any of the discovered clusters. This is the basis of clustering-based
outlier detection. There are three main subcategories in the clustering approach. The first consists of approaches that find points that do not belong to any cluster; such an approach does not force a point into a cluster, but instead keeps track of it as an outlier. One clustering algorithm that does this is DBSCAN [Ester et al., 1996]. The drawback is that these algorithms are not optimized for this purpose; their main task is still clustering.
The second group are those approaches that produce a centroid point for each cluster. The
distance of each data point to their associated cluster center can act as a measurement of
how likely they are to be outliers.
The third approach is that whole clusters can represent outliers, the difference lying in how sparse the data points are within these clusters. By combining a clustering algorithm with the means mentioned in the previous section, the sparseness can be calculated and outliers can be determined based on a threshold or by proportion. The biggest difference between the clustering approaches and the nearest neighbor and relative density approaches is that the clustering approaches compare distances or similarities to the cluster centers, rather than to other data points.
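The centroid-based variant can be sketched directly: given cluster centers from any clustering algorithm (the example centers below are hypothetical illustration values, e.g. as produced by k-means), the distance to the nearest center acts as the outlier score.

```python
import math

def centroid_outlier_score(centers, x):
    """Distance from x to its nearest cluster center; larger means more outlying."""
    return min(math.dist(x, c) for c in centers)

centers = [(0.0, 0.0), (10.0, 10.0)]   # e.g. obtained from k-means
near = centroid_outlier_score(centers, (0.5, 0.2))   # close to a center
far = centroid_outlier_score(centers, (5.0, 5.0))    # between the clusters
```

A threshold or a proportion on this score then separates outliers from cluster members.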
Depth
One group of approaches base the outlier detection on depth. They create a shell around
the data based on the outermost points and assign them the depth score of 1. The algo-
rithm then continues to create shells around the data, continuously increasing the depth
score. The shells are often referred to as convex hull layers. It is then up to the user to
determine at what depth the data points are to be considered outliers. There are two com-
monly used algorithms for this approach, ISODEPTH [Ruts and Rousseeuw, 1996] and
FDC [Johnson et al., 1998]. According to Kriegel et al. [2009b] they are only efficient on low-dimensional data sets. One difference from most of the other approaches mentioned is that they return a discrete value, the number of the layer, instead of a continuous score, which can have some impact on the accuracy.
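The peeling procedure can be sketched for 2-D data (a toy illustration built on Andrew's monotone chain for the convex hull, not the ISODEPTH or FDC algorithms themselves):

```python
def convex_hull(pts):
    """Andrew's monotone chain: hull vertices of a set of 2-D points."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for seq, out in ((pts, lower), (reversed(pts), upper)):
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
    return lower[:-1] + upper[:-1]

def depth_scores(pts):
    """Peel convex hull layers; the outermost layer gets depth 1."""
    depth, remaining, scores = 1, list(pts), {}
    while remaining:
        hull = convex_hull(remaining)
        for p in hull:
            scores[p] = depth
        remaining = [p for p in remaining if p not in hull]
        depth += 1
    return scores
```

Points on a square's corners receive depth 1, while the point in the middle is only reached on the second peel.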
2.2.3 Neural networks
There are a lot of approaches that fall under the neural network category. The survey by Markou and Singh [2003b] is divided into two parts, the second of which focuses explicitly on neural network based approaches.
Probably the most common technique associated with neural networks is the multi-layer perceptron (MLP). MLPs are used primarily for classification, but can be adapted to work as outlier detectors. It is stated by Bishop [1994] that when MLPs fail, it is often because of data that the trained MLP has not seen before, i.e. the kind of data that can be seen as outliers. When making such errors the MLPs show a very low confidence in their decision, which can be used as a measure of outlierness.
Another approach to this is presented by Le Cun et al. [1990], where the activation level of the winning node has to be higher than a certain threshold, the activation level of the second winning node has to be lower than another threshold, and additionally the absolute difference between the two has to be higher than yet another threshold.
The above mentioned approaches are not strictly unsupervised. One approach that is
wholly unsupervised is the self-organizing map (SOM) introduced by Kohonen [1988].
The SOM approach could also be categorized as a clustering approach since it produces
a similar output, but the underlying technique is built upon artificial neurons. Ypma and Duin [1997] use this technique for outlier detection; they argue that the best solution is to train the system on normal operation to get a representation of what normal is, and to use that as a comparison.
In [Marsland et al., 1999] an approach based on habituation is introduced, which learns to ignore repeated stimuli in the same way as the human brain does. The approach is further developed in [Marsland et al., 2000a], where the SOM technique is used together with a filter in order to do outlier detection. In Marsland et al. [2000b] temporal Kohonen maps are used; they are based on Kohonen's SOM but use what is called a “leaky integrator” which decays over time, i.e. it allows the system to forget, making the approach suitable for adaptive and real-time systems.
Neural trees, introduced by Martinez [1998], combine the unsupervised learning of competitive neural networks with a binary tree structure. When used for outlier detection they split the data space into cells, where dense areas are split into smaller cells and sparse areas into larger cells. This approach is also suitable for online learning.
2.2.4 Machine learning
This category encompasses a lot of different kinds of approaches based on different theories and techniques, but most rely on sets of rules generated by an algorithm, or on systematically removing or adding points. One of these techniques is decision trees, used specifically if the data is categorical, which can be useful for detecting errors in databases. Skalak and Rissland [1990] train C4.5 decision trees with preselected normal classes, and John [1995] builds on this approach by using repeated pruning and retraining of the decision tree. The pruned nodes get to represent the outliers.
Fawcett and Provost [1999] use a rule-based system for detecting outliers based on profiling cases of normal data, and apply this idea to intrusion detection. Arning et al. [1996] apply another approach based on pruning categorical data, by identifying subsets and removing those that cause the greatest reduction in complexity.
Some approaches apply Bayesian reasoning to outlier detection. A naive Bayesian model technique is described by Elnahrawy and Nath [2004] to detect local outliers, which also factors in spatio-temporal correlations. In the structure of a network, each node calculates the probability of its inputs being in certain subintervals of the whole interval. If the probability of a sensed reading in its subinterval is less than that of another subinterval it is considered an outlier, but this particular approach is only applicable to one-dimensional data. Janakiram et al. [2006] describe an approach based on a Bayesian belief network model that is applied to streaming data. The nodes in this network not only consider their own readings, but also their neighbors'. A data point is considered an outlier if it falls beyond the range of its expected class. Hill et al. [2007] introduce two approaches based on dynamic Bayesian belief models: one where the current state is based on historical data and the posterior probabilities are calculated over a sliding window, and another that allows for several data streams at once.
Yet another way to find outliers, which is very different from those presented so far, is
the isolation forest approach as described by Liu et al. [2008]. The data is partitioned
randomly and recursively according to a tree structure, until all data points are isolated,
and, based on the idea that outliers are “few and different”, they will be found closer to the root. The path length to a data point can then act as an outlier score; a smaller score means the point is more likely to be an outlier. The trees can be trained on parts of the training set and then combined into an ensemble of trees, a forest.
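The core idea can be sketched with a single random isolation tree (a simplified illustration, not the full isolation forest algorithm, which subsamples the data and sets the depth cap differently):

```python
import random

def isolation_depth(data, x, depth=0, max_depth=10):
    """Depth at which x is isolated by random axis-aligned splits;
    outliers tend to need fewer splits than points in dense regions."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(x))
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return depth
    split = random.uniform(lo, hi)
    # Follow the side of the split that the test point x falls on.
    side = [p for p in data if (p[dim] < split) == (x[dim] < split)]
    return isolation_depth(side, x, depth + 1, max_depth)

def avg_depth(data, x, trees=200):
    """Average isolation depth over an ensemble of random trees."""
    return sum(isolation_depth(data, x) for _ in range(trees)) / trees

random.seed(0)
cluster = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
```

Averaged over the ensemble, a point far from the cluster is isolated at a shallower depth than a point inside it.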
2.2.5 High-dimensionality
A lot of the approaches mentioned so far have a problem with high-dimensional data.
This is a common issue known as the curse of dimensionality, and it stems from the fact that most distance measures become less meaningful the more dimensions are involved. It is intuitively easy to grasp through the idea that a distance measure is a way of reducing the dimensionality to 1, distilling the most important differences (varying of course with the distance measure chosen), and in doing so a lot of information is lost.
One alternative is to use principal component analysis (PCA), which is a way of reducing
dimensionality while still attempting to keep most of the information. The new dimensions are called principal components; the first one corresponds to the largest eigenvalue and has the most informational content. Further principal components are ordered by decreasing eigenvalue and have decreasing informational content. The idea is that only the first one, or the first few, will be interesting. These principal components can then be used in the familiar way with other outlier detection approaches.
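A sketch of this use of PCA (assuming NumPy; the data and the number of retained components are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated data

# Principal components are the eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue (informational content).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]        # keep the first two components

Z = Xc @ W                        # reduced data, ready for any outlier detector
```

Any of the distance-based approaches above can then be run on `Z` instead of the original high-dimensional `X`.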
An approach that is based on angles, rather than distances, is described in Kriegel et al.
[2008]. It is based on the idea that normal data will have other data points all around and
therefore have angles to other points that are evenly distributed. The outlier points will
have most of the other points in the same direction, at a smaller interval of angles. This
approach is called angle-based outlier degree (ABOD).
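A toy two-dimensional sketch of the angle-variance idea (a simplified illustration; the published ABOD additionally weights the angles by the distances involved):

```python
import math

def abod_score(train, x):
    """Variance of the angles from x to all pairs of training points;
    a low variance suggests an outlier (most points in one direction)."""
    angles = []
    for i, a in enumerate(train):
        for b in train[i + 1:]:
            va = (a[0] - x[0], a[1] - x[1])
            vb = (b[0] - x[0], b[1] - x[1])
            na, nb = math.hypot(*va), math.hypot(*vb)
            if na == 0 or nb == 0:
                continue
            cosang = (va[0] * vb[0] + va[1] * vb[1]) / (na * nb)
            angles.append(math.acos(max(-1.0, min(1.0, cosang))))
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)

ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
```

From the center of the ring the angles span the full circle (high variance); from a point far to the side, all training points lie in roughly the same direction (low variance).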
Aggarwal and Yu [2000] describe an approach that divides the data space into an evenly sized grid, counts the number of data points in each grid cell, and computes a sparsity coefficient. The points in the cells with the lowest score are considered outliers. The quality of the result is highly dependent on the grid resolution and position.
Another, seemingly more complicated, approach is the subspace outlier degree (SOD), presented in Kriegel et al. [2009a]. It explores how well a point fits a subspace defined by its nearest neighbors. The subspace is a hyperplane that is parallel to the axes and of lower dimensionality than the input space. It is found by minimizing the variance of the selected nearest neighbors. If the test point deviates a lot from this hyperplane it is considered an outlier.
2.3 Outlier detection by p-value estimation
The usefulness of using an approach that estimates a p-value is described in Section 1.3.
Most of the approaches listed in Section 2.2 do not inherently produce a p-value. Those statistically based approaches that create models of the data can calculate one, e.g. by estimating the parameters of a Gaussian normal distribution. Some of the other approaches do as well, such as the one described in Knorr and Ng [1997], where a measure similar to a p-value is estimated before being tested against a threshold. Modifying the algorithm to capture the measure before the threshold is applied will yield a p-value.
One approach that can act as a framework for estimating p-values using any other approach that produces a real-valued score, whether it is a legacy score or not, is the CAD, explained in Section 2.4. Similarly to the mentioned DB approach, the CAD in its original form also applies a threshold; omitting this will yield a p-value.
The R2S approach, developed for a project at Halmstad University and described in Section 2.5, inherently produces a p-value.
All three of the above-mentioned approaches estimate the p-value by calculating the ratio of the number of data points in the training set that are more different than the test point, according to some arbitrary difference measure, to the total number of points in the training set. One approach that does not do this is the application and adaptation of the OCSVM found in this work. It uses the feature of the OCSVM to estimate a boundary at a certain proportion of outliers in the training set. More details on the exact approach are found in Section 2.6.
2.4 Conformal anomaly detection (CAD)
The conformal anomaly detector (CAD) is an approach to detect outliers that is presented
in a recent thesis by Laxhammar [2014]. It is based on the conformal predictor described
by Gammerman and Vovk [2007].
The conformal predictor is a technique for obtaining a measure of confidence in the predictions made by traditional machine learning approaches. The most important property of the conformal predictor is its automatic validity under the randomness assumption, i.e. that the data points are generated independently from the same probability distribution. Central to the conformal predictor is the p-value calculation, done through a frequency test. It is up to a traditional machine learning approach to produce a value of how “strange” a data point is compared to the rest of the data points in the set. The ratio of the number of points that are more “strange” than the current data point to the total number of data points in the set is the p-value. By the law of large numbers, the larger the total set of data points is, the more accurate the p-value will be. In Gammerman and Vovk [2007] experiments are done with classification using a support vector machine on a set of handwritten numerals; the results are compared with a Bayes-optimal confidence predictor, with encouraging results. Nearest neighbors are also mentioned as an example of a difference measure.
In the thesis by Laxhammar [2014] the conformal predictor is adapted to be used as an outlier detection approach, called the conformal anomaly detector (CAD). Although the algorithm is presented as using a threshold to get a binary result, the threshold is applied to the estimated p-value, calculated in a similar way as in the conformal predictor, i.e. by using a difference measure. The difference measure is now a measure of how different a data point is from what is expected as normal, i.e. a measure of outlierness. Laxhammar [2014] calls this value a non-conformity measure (NCM). Any approach to outlier detection that outputs a real-valued score can be used as a non-conformity measure, and the choice will affect the behavior of the estimated p-value.
The CAD can be implemented in a number of ways. One choice is whether to use the
online or the offline version. In the online version the training set is continuously updated
and new samples are continuously added, as described by Equation 2.1, where α is the
result of the NCM for the corresponding data point. In the offline version the training set
is kept constant.
p = |{i = 1, . . . , N + 1 : αi ≥ αN+1}| / (N + 1)    (2.1)
Where |{. . . }| is the cardinality of the set. The validity can be improved by implementing a smoothing factor τ, a random sample from a uniform distribution between 0 and 1. The smoothing factor is implemented according to Equation 2.2.
p = ( |{i = 1, . . . , N + 1 : αi > αN+1}| + τ |{i = 1, . . . , N + 1 : αi = αN+1}| ) / (N + 1)    (2.2)
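Equations 2.1 and 2.2 translate almost directly into code (a sketch; the NCM scores are assumed to be precomputed by some outlier-scoring approach):

```python
import random

def conformal_p(train_ncm, test_ncm, smoothed=False):
    """p-value from NCM scores as in Equations 2.1 and 2.2: the fraction
    of the N+1 scores that are at least as 'strange' as the test score."""
    scores = train_ncm + [test_ncm]
    n = len(scores)
    if not smoothed:
        return sum(a >= test_ncm for a in scores) / n
    tau = random.random()            # smoothing factor in [0, 1)
    gt = sum(a > test_ncm for a in scores)
    eq = sum(a == test_ncm for a in scores)
    return (gt + tau * eq) / n
```

In the online version `train_ncm` grows with each accepted sample; in the offline version it stays fixed.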
The main drawback of the CAD is that it is computationally inefficient: the NCM has to be recalculated for each test sample and for each sample in the training set. This means that as the total number of points in the data set increases, so does the computational time; how much it increases also depends on the complexity of the chosen NCM. In order to avoid this, a different version of the CAD is introduced where the training set is split into a proper training set and a calibration set. In that way the NCMs of the calibration set can be precomputed based on the proper training set, which reduces the computational complexity. This new version of the CAD is called inductive, and the original is referred to as transductive.
2.4.1 Non-conformity measure (NCM)
The CAD will produce valid results for any real-valued non-conformity measure, but the performance is highly dependent on the type of non-conformity measure selected, and on how well it can discriminate between outlier data points and members of the normal data.
2.5 R2S
The R2S approach is a very simple and intuitive algorithm that is used as a baseline in this work. It relies heavily on the concept of a distance matrix: a matrix of the distances between each pair of data points in the input set.
X = {x1, . . . , xN} (2.3)
dij = dist(xi, xj) ∀i = 1, . . . , N ∀j = 1, . . . , N (2.4)
    ⎡ d11  d12  · · ·  d1N ⎤
D = ⎢ d21  d22  · · ·  d2N ⎥
    ⎢  ⋮    ⋮    ⋱     ⋮  ⎥    (2.5)
    ⎣ dN1  dN2  · · ·  dNN ⎦
In Equation 2.3, X is the input data set, each xi is a data point, and N is the total number of data points. In Equation 2.4, dist(·, ·) is any suitable real-valued distance function, and finally, in Equation 2.5, D is the distance matrix.
The R2S builds on the observation that the data point with the smallest total distance to all other data points is the point with the smallest row sum in the distance matrix of the whole data set, and it will be the data point closest to the mean of all the data samples. The R2S algorithm selects the data point in the training set with the minimum row sum of the total distance matrix and uses it as the central sample. The set of distances from the remaining data samples to this central sample is then used as the empirical distribution. The distance measures used in this work are the Mahalanobis distance for the synthetic data and the Hellinger distance for the non-synthetic data; the distance measure can easily be exchanged for any application where another measure is better suited. The p-value for a test sample is then estimated as the fraction of samples in the training set that lie further away from the central sample than the test sample does, as shown in equation (2.6), where dic is the distance from sample i to the central sample, and dmc is the distance from the test sample to the central sample.
p = |{i = 1, . . . , N : dic > dmc}| / N    (2.6)
Comparing how the R2S approach and the CAD estimate the p-value, it is clear that the R2S approach is essentially the CAD with the distance to the central data sample as its NCM function.
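The whole pipeline is small enough to sketch end to end (Euclidean distance here stands in for the Mahalanobis and Hellinger distances used in the thesis):

```python
import math

def r2s_p_value(train, x_test):
    """R2S sketch: find the central sample (minimum distance-matrix row
    sum), then report the fraction of training samples that lie farther
    from it than the test sample does (Equation 2.6)."""
    row_sums = [sum(math.dist(a, b) for b in train) for a in train]
    center = train[row_sums.index(min(row_sums))]
    d_mc = math.dist(x_test, center)
    return sum(math.dist(p, center) > d_mc for p in train) / len(train)

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (2.0, 2.0)]
```

A test sample near the center gets a large p-value; a sample beyond all training points gets p = 0.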
2.6 The evolution of the support vector machine
In this section the evolution of the original support vector machine, as a two-class classi-
fier, into the one-class support vector machine utilized in this work is described. In order
to do this in a suitably pedagogical way, the simplest use of the support vector machine
is described first in most detail, then further iterations will build on this by sequentially
adding those features that eventually enable the one-class support vector machine.
The support vector machine algorithm was invented by Vladimir N. Vapnik and described in Boser et al. [1992], where it is referred to as the optimal margin classifier; this is the version covered in Sections 2.6.1 and 2.6.2. The binary classifying support vector machine and some of its variations are well described in Campbell [2002].
2.6.1 The linearly separable case
The linear case is the simplest case of the support vector machine. It covers the case where data points are linearly separable into two classes. A data point, or a sample, is
represented as a vector xi. For the training data, the label of each data point is known,
and represented by yi. The label is either +1 or −1 depending on which of the classes the
data point belongs to. This is described in mathematical terms as:
(xi, yi),   i = 1, . . . , N

where   yi = +1 if xi ∈ class A
        yi = −1 if xi ∈ class B
(2.7)
Since the data points are linearly separable there exists a hyperplane between them, and
their class membership can be determined using the decision function, whether above or
below 0:
D(x) = w · x + b (2.8)
From this, the problem is clear: the vector w and the scalar b need to be found. There exist an infinite number of w that solve the problem, since there are an infinite number of planes that can separate the data points. But to get the “best” result based on the training data, there must be a way of finding the “optimal” solution. An optimal solution is one that maximizes the distance between the hyperplane and the closest training data points, i.e. the margin. So the problem is: maximize the margin. The margin M can be defined according to:
yiD(xi) / ‖w‖ ≥ M    (2.9)
But the actual scale of M is not important, only the w at which the margin is maximal, so by fixing the scale so that M‖w‖ = 1 the problem is equivalent to:

minimize over w    (1/2)(w · w)
subject to         yi(w · xi + b) ≥ 1
(2.10)
The constraints can be incorporated into the expression so that the problem is trans-
formed into that of minimizing the primal Lagrangian:
minimize over w, b, α    L(w, b, α) = (1/2)(w · w) − ∑i αi(yi(w · xi + b) − 1)
subject to               αi ≥ 0
(2.11)
According to Wolfe’s theorem the partial derivatives set to zero with respect to w and b
can be substituted into the primal, which gives the new problem of minimizing the dual
Lagrangian [Wolfe, 1961]:
∂L/∂w = w − ∑i αiyixi = 0   ⇒   w = ∑i αiyixi
∂L/∂b = −∑i αiyi = 0
(2.12)
minimize over α    (1/2) ∑ij αiαjyiyj(xi · xj) − ∑i αi
subject to         ∑i αiyi = 0
                   αi ≥ 0
(2.13)
This minimization problem is solved with quadratic programming and the result is the vector α. Only a few of the values in α will be > 0; the corresponding xi are called the support vectors, and only those are actually needed by the new decision function, which, after substituting equation (2.12) into (2.8), looks like:
D(x) = w · x + b = ∑i αiyi(xi · x) + b    (2.14)
The b variable has no impact on the optimization and can be calculated separately using
the result of the optimization. One way of calculating the b variable is by taking the most
extreme support vectors of each class and calculating an average.
b = −(1/2) ( max{i | yi = −1} ∑j yjαjK(xi, xj)  +  min{i | yi = +1} ∑j yjαjK(xi, xj) )    (2.15)
2.6.2 Nonlinearity
The previous section describes the case where the two classes are linearly separable, but this is certainly not always the case. Even if the two classes are not linearly separable, there might still be a way to map the data into a higher-dimensional space where they become linearly separable. Translating each data point and then comparing them in the higher dimension is essentially what needs to be done, but this is not reasonable due to computational constraints.
What can be done is to use what is called the kernel trick. By using a function that produces the scalar product of two points in the high-dimensional space without doing the point-by-point translation, a lot of the necessary computations are skipped. This is possible simply because the two important equations, the decision function (2.14) and the optimization problem (2.13), never use the data points explicitly; they only use the dot product. In the linear case the dot product is the kernel. Replacing the dot product with another kernel function is the same as computing the dot product in the high-dimensional space. The equations are changed according to the following:
minimize over α    (1/2) ∑ij αiαjyiyjK(xi, xj) − ∑i αi
subject to         ∑i αiyi = 0
                   αi ≥ 0
(2.16)
D(x) = ∑i αiyiK(xi, x) + b    (2.17)
Where K(x, y) is the kernel function. By selecting different kernels, different higher-dimensional spaces are represented, and by selecting the Gaussian kernel an infinite-dimensional space is emulated, which makes the support vector machine a very powerful tool indeed. Table 2.1 shows some commonly utilized kernels.
Kernel        K(xi, xj)
Linear        xi · xj
Polynomial    (1 + xi · xj)^d
Gaussian      exp(−‖xi − xj‖² / (2σ²))
Sigmoid       tanh(β xi · xj + b)

Table 2.1: Commonly used kernels for support vector machines.
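The kernels in Table 2.1 translate directly into code (a sketch; the parameter values d, σ, β and b are user choices, and the helper names are hypothetical):

```python
import math

def k_linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def k_polynomial(x, y, d=3):
    return (1 + k_linear(x, y)) ** d

def k_gaussian(x, y, sigma=1.0):
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def k_sigmoid(x, y, beta=1.0, b=0.0):
    return math.tanh(beta * k_linear(x, y) + b)
```

Any of these can be substituted for the dot product in equations (2.16) and (2.17).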
2.6.3 Soft margins
C-SVM
It is very likely that the training data contains data points that are erroneous in some way. In order to minimize the impact of these erroneous data points on the final solution, a slack variable is introduced in a paper by Cortes and Vapnik [1995]. This variable, ξi, allows for some exceptions during training. The C parameter controls how much of the error the slack variables are able to soak up.
The optimization problem is changed to (2.18), where it is clear that C controls the balance between the ordinary SVM problem from (2.10) and the sum of the slack variables.
minimize over w, ξ    (1/2) w · w + C ∑i ξi
subject to            yi(w · xi + b) ≥ 1 − ξi
                      ξi ≥ 0
(2.18)
By calculating the dual Lagrangian from this new problem, as in Section 2.6.1, a solution (2.19) emerges that is surprisingly similar to (2.13), the difference being the upper bound on the support vector weights.
minimize over α    (1/2) ∑ij αiαjyiyjK(xi, xj) − ∑i αi
subject to         ∑i αiyi = 0
                   0 ≤ αi ≤ C
(2.19)
ν-SVM
The concept of the C parameter was thought to be too abstract, so another version of the soft-margin SVM, called the ν-SVM, was introduced in a paper by Schölkopf et al. [2000]. This version uses a different parameter, ν, that is bounded in the interval 0 to 1 and thus slightly more intuitive for the end user. The resulting problem formulation (2.20) becomes a bit more complicated, the dual Lagrangian (2.21) also changes, and the limits on the support vectors are a bit more involved.
minimize over w, ξ, ρ    (1/2) w · w − νρ + (1/N) ∑i ξi
subject to               yi(w · xi + b) ≥ ρ − ξi
                         ξi ≥ 0
                         ρ ≥ 0
(2.20)
minimize over α    (1/2) ∑ij αiαjyiyjK(xi, xj)
subject to         ∑i αiyi = 0
                   ∑i αi ≥ ν
                   0 ≤ αi ≤ 1/N
(2.21)
2.6.4 One class
All of the previously mentioned derivatives of the support vector machine are used for two-class classification, but the subject of this work is outlier detection. The soft-margin SVMs and the introduction of the slack variable concept have allowed the further development of the SVM into a one-class classifier. What in the previously covered SVM versions is a hyperplane dividing two classes is now rather a border that describes the outer limits of one class. By controlling the exceptions through slack variables, the number of outliers can be controlled. There are two major contenders in this field: the support vector domain descriptor (SVDD) and the one-class support vector machine (OCSVM).
SVDD
The support vector domain descriptor was introduced in the paper by Tax and Duin [1999]. This approach attempts to find a hypersphere with radius R and center a that should contain most of the training data points. By using the idea of slack variables as introduced in 2.6.3 and incorporating them into the constraints, the problem can be stated according to 2.22. The first constraint contains the distance from the data point to the center point a. The constant C controls the trade-off between the volume of the sphere and the number of data points that are rejected. Also note that there is no mention of the yi variable, the label in the binary classification case.
minimize over R, ξ, a    R² + C ∑i ξi
subject to               (xi − a)(xi − a)ᵀ ≤ R² + ξi
                         ξi ≥ 0
(2.22)
The Lagrangian is constructed, the partial derivatives are set to zero, and after replacing the distance measure with the kernel function the problem statement becomes 2.23.
minimize over α    ∑ij αiαj(xi · xj) − ∑i αi(xi · xi)
subject to         ∑i αi = 1
                   ∑i αixi = a
                   0 ≤ αi ≤ C
(2.23)
Where the dot products can be replaced with a kernel as in 2.6.2. Tax and Duin [1999] find the Gaussian kernel to be preferable, and use the width of the kernel to control the behavior of the algorithm. A small width causes the algorithm to behave similarly to a Parzen density estimation, whereas a larger width causes the algorithm to become more general. The C will, just as for the C-SVM, be the upper bound on the support vector weights. Schölkopf et al. [2001] present a solution using a ν-parameter, as in the ν-SVM, instead of the C for the SVDD.
OCSVM
The one-class support vector machine was introduced in Schölkopf et al. [2001] and, just as the SVDD, it is designed to estimate a boundary around some given data rather than perform binary classification. Rather than using hyperspheres, the OCSVM attempts, with the help of the kernel function, to separate the data points from the origin with a maximum-margin hyperplane. Just as in the SVDD, the slack variables ξi are used to tolerate outliers. The problem is stated as 2.24. Notice that, just as for the SVDD, there is no mention of the label variable yi, and the C that controls the weight of the slack variables is replaced with 1/(νN), where N is the size of the training set and the ν-variable works approximately as it does in the ν-SVM.
minimize over w, ξ, ρ    (1/2) w · w + (1/(νN)) ∑i ξi − ρ
subject to               w · xi ≥ ρ − ξi
                         ξi ≥ 0
(2.24)
The ν parameter will in effect be:
• An upper bound on the fraction of outliers.
• A lower bound on the fraction of support vectors.
• Given the right kernel, equal to both the fraction of outliers and the fraction of support vectors.
Thus, using ν = 1 will cause the result to be equal to a Parzen density estimation, another approach to density estimation and outlier detection mentioned in 2.2.
By incorporating the constraints in 2.24, creating the Lagrangian, and setting the partial derivatives to zero, the same procedure as described earlier, the optimization problem (2.25) emerges.
minimize over α    (1/2) ∑ij αiαjK(xi, xj)
subject to         ∑i αi = 1
                   0 ≤ αi ≤ 1/(νN)
(2.25)
The decision function 2.26 takes on a slightly different appearance; recall that the label parameter yi is not necessary.
D(x) = ∑i αiK(xi, x) − ρ    (2.26)
The ρ can be calculated using equation 2.27, where xi is any point whose corresponding αi > 0, i.e. one of the support vectors.
ρ = ∑j αjK(xj, xi)    (2.27)
Campbell and Bennett [2001] mention that there is a disadvantage in this approach stemming from the assumption about the origin, which acts as a prior for where the outliers are supposed to lie. Still, they successfully managed to detect novel outliers when testing on both synthetic and real data. The approach is also criticized in Manevitz and Yousef [2002] for the necessity of selecting parameters, particularly for its sensitivity to the choice of kernel parameters.
In order to use the OCSVM approach to estimate the p-value, the fact that the ν-value
represents the fractions of outliers is utilized, which is more thoroughly described in
Section 3.5.
An alternative approach is presented in Hempstalk et al. [2008], where regular outlier
detection is performed and additional data points are then created for the outlier class. The
generated outliers are used together with the original data to train a binary classifier
with a probability output. The two-class SVM with probability output was first introduced by
Platt [2000] and improved by Lin et al. [2007].
Chapter 3
Methods
3.1 Approach selection criteria and discussion
Obviously it is impossible to do a complete evaluation of all available approaches. A
selection has to be made, and some criteria for this selection are already covered by the
research question in Section 1.4. The number of approaches needs to be kept small enough
to fit the scope of a master thesis, but large enough to cover a fair selection of the state-
of-the-art techniques covered in Section 2.2.
The first approach selected is one that is used in a project at Halmstad University. This
approach is based on finding the most central point and calculating the distances from
that point to the rest. It is referred to as the R2S approach in this work.
The second approach is based on support vector machines (SVM). There are two well-known
approaches to outlier detection based on the SVM technique: the one-class support
vector machine (OCSVM) and the support vector domain descriptor (SVDD). Out of
these two, the OCSVM is selected. Since one of the requirements is p-value estimation,
the selected approach needs to be able to handle that. Unfortunately, neither of the above-mentioned
SVM-derived approaches does so natively, but thanks to the nature of the ν
parameter of the OCSVM it can be utilized to get a quantized p-value estimate.
The third approach is a very recent one, introduced by Laxhammar [2014], based
on the work of Gammerman and Vovk [2007] and called conformal anomaly detection
(CAD). The approach itself is in a sense a way to convert values from a difference measure
into a p-value estimate. The difference measure used is called, in the context of the
CAD approach, the non-conformity measure (NCM). Depending on which NCM is used,
the CAD will show different kinds of behavior and performance. Two different
NCMs are selected: one based on a statistical, parametric approach through
calculating the Mahalanobis distance, and another that uses a non-parametric approach
in the form of a k-nearest-neighbor-based NCM.
3.2 Adaptation and implementations
One of the requirements on the approaches selected for further investigation in this work
is that their result is given as a p-value estimate. This is not a native function of any of the
selected approaches except for the R2S approach. The CAD algorithm returns a binary
value, as described in the paper by Laxhammar [2014], but by catching the score before the
threshold is applied, a p-value estimate is acquired. As for the OCSVM, the adaptation to
p-value estimation is not equally straightforward. The approaches are generally implemented in
MATLAB [MATLAB, 2011], and the LibSVM [Chang and Lin, 2011] library, used for the
OCSVM, is implemented in C and accessible from MATLAB through a MEX interface.
3.3 Implementation of the CAD
The implementation of the CAD follows, as closely as possible, the algorithm described in
Chapter 4 of Laxhammar [2014]. It is based on the transductive CAD. The reason for
selecting the transductive CAD is that it is more suitable when a fixed training set is used.
Were the inductive CAD selected, the training set would have to be divided into a
proper training set and a calibration set, and the proportion between the two would add
another unnecessary parameter.
Since the training sets are fixed and the test points do not conform to a random distribution,
it would not be suitable to add the test points to the training set as they are tested;
doing so would make the result dependent on the order in which the test points were tested.
Therefore the implementation can be viewed as being of the offline variety.
The implementation follows the algorithm below:

zN+1 ← t
for i ← 1 to N + 1 do
    αi ← NCM({z1, . . . , zN+1} \ {zi}, zi)
end for
p ← |{i = 1, . . . , N + 1 : αi ≥ αN+1}| / (N + 1)

where (z1, . . . , zN) is the training set, t is the test data point, and p is the resulting p-value.
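As a concrete illustration, the algorithm above can be sketched in Python (the thesis implementation is in MATLAB); `cad_p_value` and `dist_to_mean` are illustrative names, and the distance-to-mean NCM is only a toy stand-in for the measures described below:

```python
def cad_p_value(train, t, ncm):
    """Transductive conformal anomaly detection (CAD).

    train : list of training points z_1 .. z_N
    t     : the test point
    ncm   : callable(rest, z) -> non-conformity score of z relative to rest
    Returns the estimated p-value for t.
    """
    z = list(train) + [t]  # z_{N+1} <- t
    n1 = len(z)
    # Non-conformity score of every point against all the others
    alphas = [ncm(z[:i] + z[i + 1:], z[i]) for i in range(n1)]
    # Fraction of points at least as non-conforming as the test point
    return sum(1 for a in alphas if a >= alphas[-1]) / n1


# A toy NCM for illustration: distance to the mean of the remaining points.
def dist_to_mean(rest, x):
    return abs(x - sum(rest) / len(rest))


train = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.2]
print(cad_p_value(train, 0.0, dist_to_mean))  # central point: p close to 1
print(cad_p_value(train, 5.0, dist_to_mean))  # distant point: smallest possible p, 1/(N+1)
```

Note that each of the N + 1 scores is recomputed with the test point included, which is what makes the procedure transductive rather than inductive.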
3.3.1 First NCM: Mahalanobis distance
The first of the nonconformity measures used is a very simple alternative. It is the Ma-
halanobis distance. The covariance matrix of the training data is calculated and used to
calculate the Mahalanobis distance from the mean of the training data to the test point.
The result will be the distance from the mean of the assumed distribution.
This approach can by itself be used as a rudimentary outlier detection approach with a
carefully selected threshold for determining whether a data point is an outlier or not,
although there is no need to select a threshold here, since it is used within the CAD framework.
It assumes that the data belongs to a unimodal distribution, where the mean has the
highest p-value, and that the underlying distribution is symmetric around the mean.
These assumptions will likely have a detrimental effect on the p-value estimation
in situations where they prove false.
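A sketch of this NCM for bivariate data, in Python rather than the MATLAB used in the thesis, with the 2×2 covariance matrix inverted in closed form (`mahalanobis_ncm` is an illustrative name):

```python
def mahalanobis_ncm(train, x):
    """Mahalanobis distance of a 2-D point x from the training mean.

    train : list of (x, y) tuples; x : (x, y) test point.
    """
    n = len(train)
    mx = sum(p[0] for p in train) / n
    my = sum(p[1] for p in train) / n
    # Sample covariance matrix entries
    sxx = sum((p[0] - mx) ** 2 for p in train) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in train) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in train) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Inverse of a 2x2 matrix: swap diagonal, negate off-diagonal, divide by det
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dx, dy = x[0] - mx, x[1] - my
    d2 = dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy
    return d2 ** 0.5
```

Points far from the mean, in units scaled by the covariance, receive larger non-conformity scores.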
3.3.2 Second NCM: k-NN
The second nonconformity measure is based on the idea behind nearest neighbor
approaches: data points that are likely to be outliers have a large distance to their
neighbors, while data points that belong to the normal data have a shorter distance
to their neighbors. Since the NCM scores are only used in a comparative context, there
is no effective difference between using the mean or the sum of the distances to the nearest
neighbors (for a fixed k the sum is simply k times the mean); therefore the sum is used.
Choosing the number of nearest neighbors is explained in Section 3.3.3 below.
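This NCM can be sketched as follows (a Python stand-in for the thesis's MATLAB code; `knn_ncm` is an illustrative name):

```python
def knn_ncm(train, x, k):
    """Sum of Euclidean distances from x to its k nearest neighbors in train.

    Points are tuples of floats of any fixed dimensionality.
    """
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(p, x)) ** 0.5 for p in train
    )
    return sum(dists[:k])
```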
3.3.3 Selecting a k-value for k-NN
Selecting the k-value for the nearest neighbor NCM affects the behavior of the
whole approach. A very large k-value will likely yield behavior similar to the approach
using the Mahalanobis distance, while a very small k-value will likely yield an approach
that is too sensitive to the individual points in the training data.
The difficulty in selecting a good value for k is that the scenario is unsupervised:
there is no way of finding the optimal k since the underlying distribution is unknown,
unlike the supervised two-class case where a close-to-optimal parameter is easily found
with techniques like k-fold cross-validation. In this work the approaches are compared in a
controlled context, and since the CAD Mahalanobis and the R2S approaches both assume
that the data is unimodal and symmetric, they are expected to perform best when the
data comes from a normal distribution. It is therefore, in the author's view, reasonable to
select a k-value such that the approach performs comparably well to those other
methods when the data is based on the normal distribution.
Through testing, a good value for k was found to be 30% of the size of the training set.
Figure 3.1 shows examples of results of trials of different values of k on a single training
set of 100 samples generated from a normal distribution.
In real applications, where the shape of the training data is unknown, there are no shortcuts
like this. Instead the k-value has to be based on a qualified guess, where additional
information about the training data will aid the guesswork.
3.4 Implementation of the R2S
The R2S method returns the expected p-value by default, so no adaptation is necessary.
The implementation follows the description verbatim.
3.5 Implementation of the OCSVM
The basic functionality of the support vector machine is provided by a library called
LibSVM [Chang and Lin, 2011]. The LibSVM library contains an implementation of
Schölkopf's OCSVM [Schölkopf et al., 2001]. As mentioned in Section 2.6.4, the OCSVM
utilizes a ν-value parameter to give the algorithm an idea of the proportion of
outliers to expect within the training data; this information is required since the algorithm
works in an unsupervised scenario. An unsupervised approach that still requires
information beyond the training data is somewhat contradictory in concept, but that is a
more philosophical discussion and will not be explored further here.
Figure 3.1: Examples of results of different values of k for the CAD k-NN from a single
data set of 100 samples generated from a normal distribution. Panels: (a) 10%, (b) 20%,
(c) 30%, (d) 40% nearest neighbors.
Since the ν-value describes the actual proportion of outliers in the training set, and
thus the proportion of data points that are excluded and classed as outliers, the ν-value
will correspond, in a sense, to a threshold on the p-value. By training several
OCSVM models at different levels of the ν-value, the proper p-values can be estimated
and searched for.
This process is automated in the implementation used in this work by dividing the 0 to 1
span into several parts, where the ν-value corresponds to k/R; k is the current level
trained or searched and R is the total number of levels, or divisions, of the 0 to 1 interval.
The implementation used in this work is designed in two versions, both yielding the
same results but with different performances based on the format of the testing data.
These two cases can be referred to as:
• Doing an online search, which is suitable when having a small amount of test sam-
ples and/or a changing set of training data.
• Doing a batch evaluation of a large set of test samples or where the training data is
static.
The latter case can be divided into two distinct phases, one phase of training, which
will result in a list of trained models, and a second phase of testing using the previously
trained models.
Since the models can be trained in a controlled order with an increasing ν-value, it is
a trivial task to find between which two ν-values the binary result changes from
a classification as a normal data point to a classification as an outlying data point. The
correct p-value lies somewhere between the ν-values used to train those two OCSVM models.
It should now be clear that the more OCSVM models are trained, the more accurate the p-value
estimation will be. The number of models used is referred to in this work as the resolution
(R). When doing the search, utilizing that the OCSVM models are trained and accessed
in order, it is efficient to do a binary search by sequentially dividing the search space
in half and testing the middle OCSVM model. This means that each test sample only
needs to be evaluated by a very small proportion of the OCSVM models, ⌈log₂(R + 1)⌉ to
be exact. When using the online version of the approach, this is also the number
of OCSVM models that actually need to be trained, which proves very effective in terms of
time performance. The binary search also gives a convenient method for choosing a suitable
resolution, calculated from the number of search levels used. This number of levels
is referred to as the depth (d), which is related to the resolution (R) according to:
R = 2^d − 1                    (3.1)
Then R will have a size that is optimal for searching and d will equal the number of tests
done for each tested data point.
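The search procedure can be sketched as follows. Reproducing LibSVM here would be out of scope, so a simple quantile threshold on precomputed outlier scores stands in for the list of trained OCSVM models; `estimate_p_value` and `is_outlier` are illustrative names, but the binary search over the R = 2^d − 1 ν-levels mirrors the description above:

```python
def estimate_p_value(train_scores, score_x, depth=7):
    """Quantized p-value via binary search over nu-levels.

    A stand-in for the list of trained OCSVM models: the 'model' at level k
    (nu = k / (R + 1)) flags a point as an outlier when its outlier score
    exceeds the (1 - nu)-quantile of the training scores.
    depth : number of probes d; the resolution is R = 2**depth - 1.
    """
    scores = sorted(train_scores)  # ascending outlyingness
    n = len(scores)
    R = 2 ** depth - 1

    def is_outlier(level):
        nu = level / (R + 1)
        cut = scores[min(n - 1, int((1.0 - nu) * n))]
        return score_x >= cut

    # Invariant: classification is "normal" at lo and "outlier" at hi,
    # so the p-value lies between the corresponding nu-values.
    lo, hi = 0, R + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_outlier(mid):
            hi = mid
        else:
            lo = mid
    return hi / (R + 1)  # quantized estimate of the p-value
```

Each call performs exactly d probes, matching the depth/resolution trade-off described in the text.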
3.6 Selecting the kernel and the kernel parameters
The Gaussian kernel was selected because it is the most versatile with respect to the
input data, though even a simpler kernel is expected to work for low-dimensional
data sets of simple distributions. A polynomial kernel was tested and appeared to produce
comparable results for even degrees of the kernel, but not for odd degrees. The free
parameter of the Gaussian kernel is the width, known as the γ value, which corresponds
to 1/(2σ²), so a larger γ produces a narrower kernel. A kernel that is too narrow will
result in overfitting to the data, while a kernel that is too wide will cause the approach to
overgeneralize. The γ value was set to 0.1 and kept fixed throughout all of the tests,
since a value of 0.1 showed a good overall performance of the estimation on the normal and
symmetric distributions.
In Figure 3.2 the effects of different kernel widths are visualized on a single data set of 100
samples generated according to a normal distribution. Figure 3.2a shows a very narrow
kernel and the effects of overfitting. Figure 3.2b and Figure 3.2c look very similar, but
the slightly wider kernel in the latter explains the blunter peak. Even wider kernels, as in
Figure 3.2d, cause artifacts. Figure 3.2b corresponds to the chosen kernel width of γ = 0.1.
As for the real data, the Hellinger distance is used, which does not require a parameter.
3.7 Test plan
The first question to answer is how well the different approaches estimate the p-values
for different kinds of distributions, and whether any of the approaches is more suitable for
particular distributions. This is investigated by generating sets of data according to different
distributions that are then used as training sets for the approaches. A set of evenly
distributed test values is used to sample the resulting p-value curve, which is compared to
the true p-values calculated from the parameters of the distribution that was used to
Figure 3.2: Examples of results of different widths of the Gaussian kernel from a training
set of 100 samples generated from a normal distribution. Panels: (a) γ = 1, (b) γ = 0.1,
(c) γ = 0.01, (d) γ = 0.001.
generate the data. This procedure is done several times to get a result that is statistically
valid. This question is extended by also exploring how the approaches behave when the
training data is based on a mixture of distributions, and how the approaches handle noise
of different forms in the training data.
The second question to answer is how the approaches change when the number of samples
in the training set increases. In order to compare the results from the
approaches to the calculated true values, a measure of error is needed; the root of
the mean of the squared error (RMSE) is deemed suitable. The result is presented as a plot where
an increasing number of samples in the training set is related to the RMSE value. The
error is calculated over test data points generated evenly.
This test is conducted over a selected range of sample sizes due to the computational
requirements. A curve is fitted to these RMSE values and sample size levels in order to
get an idea of the general curve that these results follow. The curve is based on
formula (3.2). Using the square root of the sample size is appropriate since it relates to
the standard error of the mean: decreasing the error by a factor of N requires N² times
as many observations.
RMSE(N) = α + β/√N,    α ≥ 0, β ≥ 0                    (3.2)
This means that the parameter α will represent an approximate limit and the parameter β
an approximate rate of convergence. As such, α attempts to represent
the lowest error achievable using the approach, the model bias, while β represents
how fast the approach reaches this lowest error and how much the
approach benefits from additional training samples. Since it is impossible to have a
negative error or a negative rate of convergence, α and β are constrained to be non-negative
when fitting the curve.
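The constrained fit can be sketched with ordinary least squares in the variable 1/√N, clamping a negative estimate to zero and refitting the remaining parameter (a simplification of a proper non-negative least-squares solver; `fit_rmse_curve` is an illustrative name):

```python
def fit_rmse_curve(sample_sizes, rmse_values):
    """Least-squares fit of RMSE(N) = alpha + beta / sqrt(N), alpha, beta >= 0.

    The model is linear in u = 1 / sqrt(N); if the unconstrained solution
    turns negative, the offending parameter is clamped to zero and the
    other one refitted.
    """
    u = [1.0 / n ** 0.5 for n in sample_sizes]
    y = list(rmse_values)
    m = len(u)
    su, sy = sum(u), sum(y)
    suu = sum(ui * ui for ui in u)
    suy = sum(ui * yi for ui, yi in zip(u, y))
    # Closed-form simple linear regression of y on u
    beta = (m * suy - su * sy) / (m * suu - su * su)
    alpha = (sy - beta * su) / m
    if beta < 0:            # error grows with N: fall back to a constant fit
        beta, alpha = 0.0, max(0.0, sy / m)
    elif alpha < 0:         # force the limit to zero, refit the slope
        alpha, beta = 0.0, suy / suu
    return alpha, beta
```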
3.7.1 Synthetic data
The purpose of the test data is to gauge the different abilities and performances of the
approaches. To test the approaches a set of randomized data points are created. These
data points are to be distributed according to a handful of selected distributions and
configurations in order to make up a battery of tests to cover and gauge some of the
interesting features and differences of the approaches.
There will be a total of 16 tests, each of which is a unique combination of distributions
and parameters. Each of the tests will be done at set levels of number of samples of a
training set, at 100, 200, 500, 1000, 2000 samples. Each of these tests will be performed
with 30 sets of generated samples in order to reach a decent generalization of the results.
All the combinations of these will result in 2400 (16 × 5 × 30) different training sets to be
tested on the four approaches, to a total of 9600 combinations.
The 16 distributions that act as sources for the training data are:
1. The first test is based on a univariate normal distribution with mean 0 and standard
deviation 1. The approaches that assume a unimodal symmetric distribution, R2S and
CAD Mahalanobis, are expected to excel here. This test acts as a baseline for the rest,
and in that sense the parameters of the other approaches are selected to perform
decently at this test in order to create a fair comparison between all the approaches.
Shown in Figure 3.3.
2. In the second test, data from a t-distribution will be generated, also univariate, with
a degree of freedom of 3. This distribution is similar to the normal distribution, but
with thicker tails. Shown in Figure 3.4.
Figure 3.3: Probability density function of a Gaussian normal distribution with µ = 0 and σ = 1.
Figure 3.4: Probability density function of a t-distribution with ν = 3.
3. The third test is based on a beta distribution. A beta distribution can be either symmetric
or non-symmetric depending on the parameters chosen. Both these cases are
tested. This is also a univariate distribution. The symmetric case differs from the
normal distribution by being wider and only being defined in the range from 0 to 1.
The parameters for the symmetric beta are set at α = β = 2. Shown in Figure 3.5.
(The α and β used here to define the shape of the beta distribution are not the same
as those used in the curve fitting procedure.)
4. The fourth test is also generated from a univariate beta distribution, but in this case
it is non-symmetric. The parameters used are α = 2 and β = 5. A non-symmetric
distribution will challenge the approaches’ assumptions about symmetry, and those
approaches that do not make those assumptions are expected to have a slight ad-
vantage. Shown in Figure 3.6.
Figure 3.5: Probability density function of a beta distribution with α = β = 2.
Figure 3.6: Probability density function of a beta distribution with α = 2 and β = 5.
5. The fifth test is based on the log-normal distribution, with the µ = 0 and the σ = 1.
This distribution is even more asymmetric than the beta distribution and is ex-
pected to display a clearer view of how well the approaches adapt to asymmetry.
Shown in Figure 3.7.
6. The sixth test is an exponential distribution, with the parameter λ = 1. The asym-
metry of this distribution is extreme in the sense that there is no notion of a left tail,
i.e. it is 0 for values less than 0. Shown in Figure 3.8.
Figure 3.7: Probability density function of a log-normal distribution with µ = 0 and σ = 1.
Figure 3.8: Probability density function of an exponential distribution with λ = 1.
7. The seventh test is based on a mixture of two normal distributions, one with µ = −2
and the other with µ = 2. This test shows how the different approaches manage to
estimate a distribution with two peaks, and is expected to pose a challenge to all the
approaches. Shown in Figure 3.9.
8. In the eighth test, noise is introduced into the training set. The noise is generated
according to a uniform distribution on the interval −4 to 4. The underlying distribution
is a normal distribution with µ = 0 and σ = 1; 90% of the data comes from the normal
distribution and 10% from the uniform distribution. The point is to show how well the
approaches manage to estimate the normal distribution when uniform noise is mixed
into the training set.
9. The ninth test also explores the effects of noise, but this time the noise is generated
according to a second normal distribution with µ = 2 and σ = 1. The setup is similar to
the eighth test in that 90% of the data comes from the expected distribution and 10%
from the noise distribution.
10. The tenth test is the first test that is bivariate, i.e. two-dimensional. The data will be
generated according to a normal distribution with µ = (0, 0) and the σ = 1 in each
dimension. Shown in Figure 3.10.
Figure 3.9: Multimodal probability density function of normal distributions with µ1 = −2, µ2 = 2 and σ = 1.
Figure 3.10: Bivariate probability density function of a normal distribution with µ = (0, 0) and Σ = I.
11. The eleventh test is generated according to a bivariate normal distribution, but in this
case with twice the standard deviation along the horizontal axis, i.e. it is elliptic:

Σ = [ 4  0
      0  1 ]                    (3.3)

Shown in Figure 3.11.
12. The twelfth test is a mixture of two bivariate normal distributions. Both will have
a σ = 1, but with different means. One will have µ1 = (−2, −2) and the other
µ2 = (2, 2). As with the univariate combination of two normals this will also serve
to show how well the approaches estimate two peaks, and those approaches that
assume a single mean are expected to fail. Shown in Figure 3.12.
13. In the thirteenth test data from a normal distribution with µ = (0, 0) and σ =
1 will be mixed with 10% noise. The noise is generated according to a uniform
distribution that covers the test area.
32 Chapter 3. Methods
Figure 3.11: Bivariate probability density function of a normal distribution with µ = (0, 0), Σ00 = 4, Σ11 = 1 and Σ01 = Σ10 = 0.
Figure 3.12: Bivariate probability density function of normal distributions with µ1 = (−2, −2), µ2 = (2, 2) and Σ = I.
14. The fourteenth test is also a test with noise but just as in the univariate case, this
time the noise is generated according to a second normal distribution, with a σ = 1
and the µ = (2, 2). 10% of the data points in the training set are from this distribu-
tion, the rest are from a normal distribution with µ = (0, 0) and σ = 1, and will act
as the assumed real data from where the p-value is to be estimated.
15. The fifteenth test is based on a bivariate log-normal distribution, with the µ = 0
and the σ = 1. As a bivariate, non-symmetric distribution this is expected to be a
significant challenge for most of the approaches. Shown in Figure 3.13.
16. The sixteenth test is a combination of a log-normal distribution and a normal dis-
tribution in such a sense that the data points are distributed normally along the
horizontal axis and log-normally along the vertical axis. Shown in Figure 3.14.
Figure 3.13: Bivariate probability density function of a log-normal distribution with µ = (0, 0) and Σ = I.
Figure 3.14: Bivariate probability density function of a normal/log-normal distribution with µ = (0, 0) and Σ = I.
3.7.2 Real data
The approaches are also tested on data from a fleet of buses. Data were collected between
August 2011 and December 2013. The data were collected during normal operation and
each bus drove approximately 100000 km per year.

Distribution                 Mean    Variance  Skewness  Kurtosis
Normal (µ = 0, σ = 1)        0       1         0         3
t (ν = 3)                    0       3         –         –
Beta (α = β = 2)             0.5     0.05      0         2.1429
Beta (α = 2, β = 5)          0.2857  0.0255    0.5963    2.88
Log-normal (µ = 0, σ = 1)    1.6487  4.6708    6.1849    113.936
Exponential (λ = 1)          1       1         2         9

Table 3.1: Summary of distribution parameters.

More than one hundred signals were
sampled on-board at one hertz. GPS receivers were also used to give the vehicles'
positions. The workshop visits were recorded by service personnel and confirmed with GPS
data.
The data from the on-board loggers were compressed into histograms. The Hellinger
distance between the histograms was used for comparison when applying the approaches.
3.8 Calculating true p-values
3.8.1 The normal Gaussian distribution
Calculating the p-value of a data point supposedly generated according to a univariate
normal distribution, given that the mean and the standard deviation are known, is a trivial
task. It can be calculated as the integral of the probability density function 3.4 from
the test point towards infinity. The integral of the probability density function is the
cumulative distribution function; its total integral, from negative infinity to positive
infinity, equals 1, and the density is symmetric around the mean, according to equation 3.5.
fX(x) = (1/(σ√(2π))) e^(−(1/2)((x−µ)/σ)²)                    (3.4)

P(X ≥ x) = ∫_x^∞ fX(y) dy = 1/2 − ∫_µ^x fX(y) dy = 1/2 − (1/2) erf((x − µ)/(σ√2))                    (3.5)
Given this, the problem can be simplified to the finite integral from the mean to
the test point. This only covers the right tail event, but the distribution, being symmetric,
allows us to get the equivalent two tail event by doubling the value of the one tail
event. The exact function for this is presented in equation (3.6), and visualized in Figure
3.15.
p = 1 − erf(|x − µ|/(σ√2))                    (3.6)
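Equation (3.6) translates directly into code; a Python sketch using the standard library's error function (`normal_p_value` is an illustrative name):

```python
import math


def normal_p_value(x, mu=0.0, sigma=1.0):
    """Two-tailed p-value under N(mu, sigma^2), per equation (3.6)."""
    return 1.0 - math.erf(abs(x - mu) / (sigma * math.sqrt(2.0)))
```

At the mean the p-value is exactly 1, and it decays symmetrically in both directions.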
The p-value for a bivariate normal distribution is calculated by first computing the
Mahalanobis distance of each point from the mean and then using the χ²_k cumulative
distribution function with the degrees of freedom corresponding to the dimensionality of
the data, k = 2 in this case. The resulting equation is 3.7. The result is shown in Figure 3.16.

p = e^(−(x−µ)Σ⁻¹(x−µ)ᵀ / 2)                    (3.7)
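A Python sketch of equation (3.7) for the bivariate case, with the 2×2 covariance inverse written out explicitly (`bivariate_normal_p_value` is an illustrative name):

```python
import math


def bivariate_normal_p_value(x, mu, cov):
    """p-value under a bivariate normal, equation (3.7): p = exp(-d^2 / 2),
    where d^2 is the squared Mahalanobis distance of x from mu.

    x, mu : (x1, x2) tuples; cov : ((a, b), (c, d)) covariance matrix.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu) cov^{-1} (x - mu)^T via the closed-form inverse
    d2 = (dx * dx * d - dx * dy * (b + c) + dy * dy * a) / det
    return math.exp(-d2 / 2.0)
```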
Figure 3.15: p-value of a univariate normal distribution.
Figure 3.16: p-value of a bivariate normal distribution.
3.8.2 The Student’s t-distribution
Calculating the p-value for a t-distribution is similar in manner to the normal distribution,
even though the probability density function and its integral, the cumulative
distribution function, are different. The probability density function is presented in (3.8).
To clarify the presentation, the part of the equation that is constant for a fixed
degree of freedom (τ) is replaced by C; the value of C for the τ-value of 3 used throughout
this work is also given, (3.9). Through the same procedure as for the normal
distribution, the distance between the mean, which is fixed at 0 for the t-distribution,
and the test point forms the basis for the integration towards infinity, as presented
in (3.10):
fX(x) = (Γ((τ+1)/2) / (√(τπ) Γ(τ/2))) (1 + x²/τ)^(−(τ+1)/2)                    (3.8)

C = Γ((τ+1)/2) / (√(τπ) Γ(τ/2)),    C|τ=3 ≈ 0.3676 . . .                    (3.9)

P(X ≥ x) = ∫_x^∞ fX(y) dy = 1/2 − ∫_0^x fX(y) dy = 1/2 − C ∫_0^x (1 + y²/τ)^(−(τ+1)/2) dy                    (3.10)
The t-distribution, being symmetric, allows the same conversion from a one tail event
to a two tail event by doubling. The final function is presented in equation (3.11) and is
visualized in Figure 3.17.
p|τ=3 = 1 − 2C ∫_0^|x| (1 + y²/3)⁻² dy                    (3.11)
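The integral in (3.11) has no elementary closed form that the text relies on, so a numerical sketch is natural; here Simpson's rule in Python (`t_p_value` is an illustrative name):

```python
import math


def t_p_value(x, tau=3, steps=1000):
    """Two-tailed p-value for the t-distribution (equation 3.11), with the
    integral evaluated numerically by Simpson's rule (steps must be even)."""
    C = math.gamma((tau + 1) / 2) / (math.sqrt(tau * math.pi) * math.gamma(tau / 2))

    def f(y):
        # Unnormalized density (1 + y^2 / tau)^(-(tau + 1) / 2)
        return (1.0 + y * y / tau) ** (-(tau + 1) / 2)

    b = abs(x)
    h = b / steps
    s = f(0.0) + f(b)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * f(i * h)
    return 1.0 - 2.0 * C * (s * h / 3.0)
```

The constant C evaluates to roughly 0.3676 for τ = 3, matching (3.9).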
Figure 3.17: p-value for the t-distribution with a degree of freedom of 3.
3.8.3 The symmetric beta distribution
The beta distribution has two parameters; when these are equal, the probability density
function (3.12) is symmetric. For clarity, the constant part of the equation, based on the
values of α and β, is replaced by C (3.13). Since fixed values of α = β = 2 are used throughout
this work whenever the symmetric beta distribution is referenced, these values are inserted
in equation (3.14) to provide a clearer view.
fX(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1)                    (3.12)

C = Γ(α + β) / (Γ(α)Γ(β)),    C|α=β=2 = 6                    (3.13)

fX(x)|α=β=2 = 6x(1 − x)                    (3.14)
The same approach as for the normal distribution is applied here, and the result is a
simple polynomial, but the results differ for the left tail event (3.15) and the
right tail event (3.16). The final function for the p-value of the symmetric beta
distribution is presented in equation (3.17) and visualized in Figure 3.18.
P(X ≤ x)|α=β=2 = ∫_0^x fX(z) dz = −2x³ + 3x²                    (3.15)

P(X ≥ x)|α=β=2 = ∫_x^1 fX(z) dz = 2x³ − 3x² + 1                    (3.16)

p|α=β=2 = { 4x³ − 6x² + 2,   if x ≥ 1/2
          { −4x³ + 6x²,      otherwise                    (3.17)
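Equation (3.17) in code form, as a small Python check of the piecewise polynomial (`sym_beta_p_value` is an illustrative name):

```python
def sym_beta_p_value(x):
    """Two-tailed p-value for Beta(2, 2), per equation (3.17)."""
    if x >= 0.5:
        return 4 * x**3 - 6 * x**2 + 2
    return -4 * x**3 + 6 * x**2
```

The p-value peaks at 1 at the center x = 1/2 and falls to 0 at both endpoints of the support.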
3.8.4 The non-symmetric beta distribution
The non-symmetric beta distribution, selected here with α = 2 and β = 5, poses a larger
challenge by not being symmetric around any point. The constant and the density function
are derived as in Section 3.8.3 and presented in (3.18) and (3.19). The equations for the left
tail event (3.20) and the right tail event (3.21) are constructed analogously to the symmetric
beta distribution in Section 3.8.3.

Figure 3.18: p-value for the symmetric beta distribution with α = β = 2.
Figure 3.19: p-value function of the non-symmetric beta distribution with α = 2 and β = 5.
C|α=2,β=5 = 30                    (3.18)

fX(x)|α=2,β=5 = 30x(1 − x)⁴                    (3.19)

P(X ≤ x)|α=2,β=5 = ∫_0^x fX(z) dz = 5x⁶ − 24x⁵ + 45x⁴ − 40x³ + 15x²                    (3.20)

P(X ≥ x)|α=2,β=5 = ∫_x^1 fX(z) dz = −5x⁶ + 24x⁵ − 45x⁴ + 40x³ − 15x² + 1                    (3.21)
Since the probability density function is not symmetric, a simple doubling procedure
cannot be used to get the p-value. From hypothesis testing theory, two tail events can be
combined by doubling the smaller of the two tail probabilities, and the final function for
the p-value is constructed according to that theory, presented in 3.22 and visualized in
Figure 3.19. A disclaimer is necessary here: the p-value for non-symmetric distributions
is not well defined [Kulinskaya, 2008], but this calculation serves its purpose in this work
and this is taken into consideration when the results are evaluated.
p|α=2,β=5 = 2 · min(P(X ≤ x), P(X ≥ x)) (3.22)
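The doubled-smaller-tail rule of (3.22), with the polynomial CDF from (3.20), can be sketched in Python (`beta25_p_value` is an illustrative name):

```python
def beta25_p_value(x):
    """p-value for Beta(2, 5) via the doubled smaller tail, equation (3.22)."""
    # P(X <= x), the polynomial from equation (3.20)
    left = 5 * x**6 - 24 * x**5 + 45 * x**4 - 40 * x**3 + 15 * x**2
    right = 1.0 - left  # P(X >= x), equation (3.21)
    return 2.0 * min(left, right)
```

The p-value is 0 at both endpoints of the support and reaches 1 at the median, where the two tails are equal.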
3.8.5 The log-normal distribution
Calculating the p-value of points generated according to a log-normal distribution is
easily done by transforming the coordinates with the log function and then calculating the
p-values in the same way as for the normal distribution (3.6). The result is shown in
Figure 3.20. The same approach is used for the bivariate log-normal distributions, but
using equation 3.7. The results are shown in Figure 3.21 for the log-normal case, and in
Figure 3.22 for the normal/log-normal case.
Figure 3.20: p-value of a univariate log-normal distribution.
Figure 3.21: p-value of a bivariate log-normal distribution.
Figure 3.22: p-value of a normal/log-normal distribution.
3.8.6 Exponential distribution
The exponential distribution has only one tail, the right tail, with its peak at zero. There is zero probability of observing a value below zero, i.e. to the left of the peak. It therefore follows that the p-value is calculated as a sole right-tail event, given by equation 3.24 and visualized with λ = 1 in Figure 3.23.

f_X(x) = \lambda e^{-\lambda x} \qquad (3.23)

p = P(X \ge x) = \int_x^\infty f_X(z)\,dz = e^{-\lambda x} \qquad (3.24)
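Since the p-value here is the sole right-tail event P(X ≥ x) for Exp(λ), the computation reduces to one exponential (a minimal sketch; the function name is illustrative):

```python
import math

def exponential_p_value(x, lam=1.0):
    """Right-tail p-value under Exp(lam): P(X >= x) = exp(-lam*x); 1 for x <= 0."""
    return math.exp(-lam * max(x, 0.0))
```

With λ = 1 this gives p = 1 at x = 0, decaying toward 0 for large x, as in Figure 3.23.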
3.8.7 Bimodal normal distribution
When calculating the p-value of a mixture of two normal distributions, the question being asked is how likely it is that a value was generated according to either of the two distributions. By that logic, the p-value function of the mixture should have two peaks, one corresponding to the peak of each of the two normal distributions.
[Figure 3.23: p-value of an exponential distribution.]
[Figure 3.24: p-value of a mixture of two normal distributions.]
The p-value
functions for the two distributions can be combined in a logical-or sense by repeatedly testing at small intervals of p-values, counting the results, and estimating a curve from them. This logic-based approach to combining the p-values produces a result equal to simply selecting the larger of the two values at any given point. Such a combination is recognizable from fuzzy logic theory, where the max-function is used as the logical-or operator. The resulting equation is thus 3.25, and the result, with µ1 = −2 and µ2 = 2, is visualized in Figure 3.24. The same approach is used to combine p-values for two bivariate normal distributions.
p = max(pµ=−2, pµ=2) (3.25)
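Equation 3.25 can be sketched as the fuzzy-logic-or (max) of the two component p-values, assuming the standard two-tailed normal p-value for each component (helper names are illustrative):

```python
import math

def normal_p_value(x, mu=0.0, sigma=1.0):
    """Two-tailed p-value under N(mu, sigma^2)."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))

def bimodal_p_value(x):
    """Eq. 3.25: logical-or of the two components via the max-function."""
    return max(normal_p_value(x, mu=-2.0), normal_p_value(x, mu=2.0))
```

The resulting curve has two peaks of height 1, at x = −2 and x = 2, matching Figure 3.24.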
Chapter 4
Results
4.1 Generated univariate data
4.1.1 Normal distribution
Figure 4.1 shows the estimations of the approaches when the training data is drawn from a Gaussian normal distribution with parameters µ = 0 and σ = 1. The lines show the upper and lower confidence limits at the 95% significance level, calculated from 30 unique randomized sets of training data. Figure 4.1a shows the result of the R2S approach. Judging from the figure, the R2S approach seems to produce an even result; the confidence limits do not deviate visibly at different distances from the mean. Figure 4.1b shows the result of the OCSVM approach. The OCSVM approach looks slightly worse than the R2S approach, especially in the tail, where the confidence limits are wider. This hints at a less trustworthy result for lower p-values. Figure 4.1c shows the result of the CAD Mahalanobis approach. The CAD Mahalanobis seems to produce a result identical to that of the R2S approach. Figure 4.1d shows the result of the CAD k-NN approach. The CAD k-NN looks rather bad at first glance, especially for higher p-values, although the tail does not look as bad.
Figure 4.2 shows how the RMSE value changes when the sample set size is varied. The function 3.2 is fitted to the RMSE data in order to get an estimated limit (α) and rate of convergence (β). The numerical results of the fitted function, with their respective standard errors, are presented in Table 4.1. The approaches with the lowest limits are highlighted.
The R2S and the CAD Mahalanobis produce the best results in the RMSE sense for this kind of distribution. The OCSVM follows closely after. The CAD k-NN approach gets penalized for its rather bad results at high p-values.
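The RMSE between estimated and true p-values, the quantity plotted in Figure 4.2, can be computed with a small helper (a generic sketch, not code from the thesis):

```python
import math

def rmse(estimated, true):
    """Root-mean-square error between estimated and true p-values."""
    assert len(estimated) == len(true) and len(true) > 0
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true))
```

A perfect estimator gives 0; an estimator off by 1 everywhere gives 1.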
Figure 4.3 shows the test points plotted with the true p-value on the horizontal axis and the estimated p-value on the vertical axis. The data is from one test example trained with 100 samples. An optimal estimation would follow the diagonal line. The plots show that all the approaches seem to slightly and systematically underestimate the p-value for low p-values. The CAD k-NN's divergence is obvious.
  • 1. MASTER THESIS Master's Programme in Embedded and Intelligent Systems, 120 credits Estimating p-values for outlier detection Henrik Norrman Computer Science and Engineering, 30 credits Halmstad 2014-06-16
  • 2.
  • 3. Estimating p-values for outlier detection Master thesis in Computer Science and Engineering, 30 credits 2014 Author: Henrik Norrman Supervisors: Thorsteinn Rögnvaldsson, Stefan Byttner, Eric Järpe Opponent: Tuve Löfström (University of Borås) Examiner: Antanas Verikas School of Information Science, Computer and Electrical Engineering Halmstad University PO Box 823, SE-301 18 HALMSTAD Sweden
  • 4. Estimating p-values for outlier detection Henrik Norrman © Copyright Henrik Norrman, 2014. All rights reserved. Master thesis report IDE1408 School of Information Science, Computer and Electrical Engineering Halmstad University Typset in 11pt Palatino (L A TEX)
  • 5. Preface This work is a thesis for the Master’s Programme in Embedded and Intelligent Systems (120 credits) at Halmstad University. It is the original work of the author and was con- ducted between December 2013 to June 2014. The author would like express his gratitude to the supervisors Thorsteinn Rögnvaldsson, Stefan Byttner, and Eric Järpe, from Halmstad University, for their support and time. The author would also like to thank his opponent, Tuve Löfström from the University of Borås, for his thoughtful feedback and constructive criticism. Additionally the author would like to thank his lecturers, examiners and classmates for making the last two years a truly enjoyable experience. Finally, the author would like to thank his family and friends for their support and pa- tience. i
  • 6.
  • 7. Abstract Outlier detection is useful in a vast numbers of different domains, wherever there is data and a need for analysis. The research area related to outlier detection is large and the number of available approaches is constantly growing. Most of the approaches produce a binary result: either outlier or not. In this work approaches that are able to detect outliers by producing a p-value estimate are investigated. Approaches that estimate p-values are interesting since it allows their results to easily be compared against each other, followed over time, or be used with a variable threshold. Four approaches are subjected to a variety of tests to attempt to measure their suitability when the data is distributed in a number of ways. The first approach, the R2S, is devel- oped at Halmstad University. Based on finding the mid-point of the data. The second approach is based on one-class support vector machines (OCSVM). The third and fourth approaches are both based on conformal anomaly detection (CAD), but using different nonconformity measures (NCM). The Mahalanobis distance to the mean and a variation of k-NN are used as NCMs. The R2S and the CAD Mahalanobis are both good at estimating p-values from data gen- erated by unimodal and symmetrical distributions. The CAD k-NN is good at estimating p-values when the data is generated by a bimodal or extremely asymmetric distribution. The OCSVM does not excel in any scenario, but produces good average results in most of the tests. The approaches are also subjected to real data, where they all produce com- parable results. iii
  • 8.
  • 9. Contents Preface i Abstract iii 1 Introduction 1 1.1 The nature of an outlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 What is a p-value? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Benefits of using a p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Research question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Research scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.6 Structure of this work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 The nature of the data . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Prior knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Approaches to outlier detection . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Statistical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2 Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.3 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.4 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.5 High-dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Outlier detection by p-value estimation . . . . . . . . . . . . . . . . . . . . 12 2.4 Conformal anomaly detection (CAD) . . . . . . . . . . . . . . . . . . . . . . 13 2.4.1 Non-conformity measure (NCM) . . . . . . . . . . . . . . . . . . . . 14 2.5 R2S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.6 The evolution of the support vector machine . . . . . . . . . . . . . . . . . 15 2.6.1 The linearly separable case . . . . . . . . . . . . . . . . . . . . . . . 15 2.6.2 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.6.3 Soft margins . 
2.6.4 One class
3 Methods
  3.1 Approach selection criteria and discussion
  3.2 Adaptation and implementations
  3.3 Implementation of the CAD
    3.3.1 First NCM: Mahalanobis distance
    3.3.2 Second NCM: k-NN
    3.3.3 Selecting a k-value for k-NN
  3.4 Implementation of the R2S
  3.5 Implementation of the OCSVM
  3.6 Selecting the kernel and the kernel parameters
  3.7 Test plan
    3.7.1 Synthetic data
    3.7.2 Real data
  3.8 Calculating true p-values
    3.8.1 The normal Gaussian distribution
    3.8.2 The Student's t-distribution
    3.8.3 The symmetric beta distribution
    3.8.4 The non-symmetric beta distribution
    3.8.5 The log-normal distribution
    3.8.6 Exponential distribution
    3.8.7 Bimodal normal distribution
4 Results
  4.1 Generated univariate data
    4.1.1 Normal distribution
    4.1.2 t-distribution
    4.1.3 Symmetric beta distribution
    4.1.4 Non-symmetric beta distribution
    4.1.5 Log-normal distribution
    4.1.6 Exponential distribution
    4.1.7 Bimodal normal distribution
    4.1.8 Normal distribution with uniform noise
    4.1.9 Normal distribution with normal noise
  4.2 Generated bivariate data
    4.2.1 Normal distribution
    4.2.2 Elliptic normal distribution
    4.2.3 Bimodal normal distribution
    4.2.4 Normal distribution with uniform noise
    4.2.5 Normal distribution with normal noise
    4.2.6 Log-normal distribution
    4.2.7 Normal/log-normal distribution
    4.2.8 Collected results
  4.3 Real data
5 Discussion
  5.1 Interpreting the estimations
  5.2 About the p-values
  5.3 Interpreting comparisons and RMSE values
  5.4 R2S versus CAD Mahalanobis
  5.5 Continuation of the research
6 Conclusion
Bibliography
Chapter 1 Introduction

The gathering and accumulation of information used to be hard. At first it was limited to travel between minds through speech, a game of telephone spanning generations. Then humans invented numerals, mathematics, and writing. Information could be stored for generations and still convey the author's original ideas several decades down the line. The invention of numerals and mathematics gave us more abstract ways of representing quantities, bookkeeping, and census collection. These skills may have been available only to a select few and controlled by institutions, but change was coming. Gutenberg's printing press did not ease the gathering of information much, but it revolutionized its spread and distribution. Industrialization created new needs for information gathering, as managers needed to control processes and wanted tools for decision making.

The amount and rate of information were for a while manageable by pens, paper, and brains. Then computers arrived, and both information gathering and storage now grow at an exponential rate, far beyond the limits of the human brain. One can ask whether it is information gathering that drives storage density to new heights, or the other way around: the need or the ability.

In order for us to make sense of the information, to make it useful, it needs to be distilled, by removing the noise and highlighting the parts that fill a purpose. One way of understanding the data is to find the parts that do not conform to the normal, those that give us clues about if and when something is wrong. There are many methods for doing this, and this work will attempt to summarize the current state of the art. Some of them will be selected for an in-depth review and performance evaluation.
1.1 The nature of an outlier

Throughout the literature, many different words are used to describe the concept of a measurement that does not seem to belong with other measurements from the same source. Novelty, anomaly, noise, deviation, exception, discordant observation, aberration, surprise, peculiarity, contaminant: in the context of outlier detection, all these words mean almost the same thing. In this work, the word outlier will be used to describe this concept in order to avoid confusion.

Outliers can be classified based on the source and the situation, such as:

Point outliers are the simplest form. The points in the data set do not have any mutual relationship besides being created by the same process. In this case the points
can be compared to each other or to a model without any loss of information. This case has a clear connection to detecting outliers in a global sense.

Contextual or conditional outliers mean there is some underlying relationship between the points. Whether a point is an outlier depends on where or when it appears, e.g. in relation to time (temporal or behavioral) or to its neighbors (contextual). Imagine measurements of temperature. A peculiarly high temperature during summer is expected, and a peculiarly low temperature during winter is also expected. But a peculiarly high temperature during winter might be considered an outlier. This has some relation to detecting outliers in a local sense.

Outliers can have many causes: human error, errors in the measurement equipment, natural deviations, faults in systems, among others. Outlier detection may sound rather abstract from a layman's point of view, but in the abstraction lies the flexibility. None of the approaches in this work depend on what the numbers actually represent; they are in that sense information neutral. As long as there is data that is properly represented in a numerical or categorical sense, outlier detection can be applied and be useful. Large amounts of data can be hard to get a grasp on, and then the outliers, the exceptions, might contain the more interesting information. Although some approaches mentioned in this work actually construct models of the data distribution, outlier detection can also be useful for cleaning the data before using a secondary modeling approach.

1.2 What is a p-value?

Central to this work is the concept of the p-value, which comes from hypothesis testing. To specify what this is, let us first begin with the level of significance. In hypothesis testing of a null hypothesis versus an alternative, the level of significance is the risk of rejecting the null hypothesis when it is actually true.
Having observed the realization of the test statistic, the p-value is the smallest level of significance at which the test would have rejected the null hypothesis. Our null hypothesis in this case is that a data point is generated according to the same distribution as the previously seen data points, and the alternative hypothesis is that it is not.

Hypothesis testing is a trivial task when the distribution under the null hypothesis is known, but that is not always something that can be assumed in real applications. The problem investigated in this work is how well a handful of approaches can estimate the p-value of data points without knowledge of the null distribution, but with access to a set of training data generated according to that same null distribution. Different kinds and variations of distributions are tested to get an idea of what assumptions the approaches make about the null distribution, and in some of the tests the training data is contaminated with noise to study the effect on the approaches. The data points are artificially generated so that the true null distribution can be used as a control. Based on knowledge of the null distributions, the true p-values of the test data points can be calculated and the error of the approaches can be measured. The calculation of the true p-values is explored in Section 3.8.
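As an illustration of this setting, the following sketch (plain Python; all names are illustrative choices, not taken from the thesis) estimates an empirical p-value as the fraction of training points that are at least as "strange" as the test point, here using the distance from the training mean as a simple strangeness measure:

```python
import random
import statistics

def empirical_p_value(train, x):
    """Empirical p-value: fraction of training points at least as far
    from the training mean as x is (a conformal-style estimate)."""
    m = statistics.mean(train)
    sx = abs(x - m)
    ge = sum(1 for v in train if abs(v - m) >= sx)
    return (ge + 1) / (len(train) + 1)  # the +1 counts the test point itself

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(1000)]
print(empirical_p_value(train, 0.0))  # close to the mean: p near 1
print(empirical_p_value(train, 3.5))  # deep in the tail: p near 0
```

With training data drawn from the true null distribution, such an estimate approaches the true p-value for this particular strangeness measure; the thesis compares more refined estimators of this kind against p-values calculated analytically from the known null distributions.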
1.3 Benefits of using a p-value

There are a number of benefits to representing a data point's outlier score as a p-value. The p-value, as defined in Section 1.2, is a well-understood concept. It is a convenient score to use since it is bounded between 0 and 1, and it can also be treated as a probability score of a point being an outlier. By using a standardized score as the output of different approaches it is possible to compare their results, something that is hard to do with legacy scores.

Often, especially in time-dependent systems, change has to be tracked. If that is the case then it might not be possible to use a fixed training set; in the spirit of learning systems, the training set has to be able to change to adapt to the system. Changing the training set might, depending on the approach used, have an effect on the legacy score. It is therefore not necessarily sound to compare scores produced by the same approach but given different training sets, so there is a clear need for a standardized score.

Lastly, one of the more interesting and useful characteristics of the p-value is that, given that the null hypothesis is true, the p-value is uniformly distributed. So even if the approach or the training set changes, given the same null hypothesis it is possible to catch deviations in the process being observed by comparing the distribution of the p-values to a uniform distribution between 0 and 1. This is utilized when examining the real data in Section 3.7.2 and Section 4.3.

1.4 Research question

The research question is split into several points:

• What are the state-of-the-art approaches in the research area of outlier detection?

• Identify those approaches that estimate p-values, or those that can be adapted to estimate p-values. Select a set of them.
• How well do these selected approaches estimate the p-value given sets of synthetic data generated from a selection of distributions? The distributions should be selected to test different aspects such as symmetry, modality, and robustness to noise.

• Subject the approaches to tests based on non-synthetic data and attempt to interpret the results.

1.5 Research scope

The research is subject to the following constraints:

• A requirement on the selected approaches is that they should be able to accept input in the form of a matrix of distances between the data points, and not the input points themselves.

• One of the approaches should be the R2S approach developed for a project at Halmstad University.

• One of the approaches should be based on support vector machines.
• The search area is huge, and the approaches in the overview should be limited to those found in the surveys mentioned in Section 2.2 and those found using the keywords "outlier", "anomaly", and "novelty", or variations thereof.

1.6 Structure of this work

Chapter 2 starts by introducing the format and the expectations of the kind of data to which this work applies. It continues with an overview of the current state-of-the-art approaches that are suitable for outlier detection. Then follows a continuation of Sections 1.2 and 1.3, exploring the reasons for and usefulness of p-values and how they apply to the approaches mentioned in that chapter. The approaches selected for further comparison are then described.

Chapter 3 begins by explaining the reasoning behind the selection of the approaches, after which their implementations are described. It is followed by an explanation of the test plan and the true p-value calculations.

Chapter 4 presents the results of the tests where the data comes from a univariate distribution, a bivariate distribution, or a real source.

Chapter 5 contains a discussion where the results are analyzed, together with the author's interpretation of the results.

Chapter 6 summarizes the results and the findings of this work.
Chapter 2 Background

2.1 The data

2.1.1 The nature of the data

The data is generally assumed to be a set or a list of samples. In this work the input data can also be a matrix of distances between the samples or points. These samples may also be referred to as objects, records, points, vectors, patterns, events, cases, observations, or entities throughout the literature. In this work, the words samples and points will be used. Each sample consists of several attributes, also called variables, characteristics, features, fields, or dimensions. In most of this work, attributes will be used, and dimensions will be used in a more general sense, describing the number of variables expected per sample. The distances between the samples or points can be e.g. the Euclidean distance, the Mahalanobis distance, or any other distance measure suitable for the task.

A set that only contains samples with one attribute is called 1-dimensional or univariate. With more than one attribute it is referred to as N-dimensional, where N is the number of attributes per sample, or simply multivariate. The attributes themselves can be numeric, continuous or discrete, or categorical, with or without order. A sample where all the values are of the same type is called monotyped, and when the attributes are of different types, multityped. Some of the approaches presented require the data to be entirely numeric. Converting categorical values to numerical ones is unproblematic and commonly done in preprocessing, but it should be noted that, especially for unordered categorical data, it might have an effect on the outlier detection method used and on the final result.

The data is usually represented in this work as matrices, where each row is a sample and the columns represent the different attributes. The data that is given a priori and used in the creation of a solution is in this work referred to as the training set.
How this process is conducted is explained in more detail in Chapter 3.

The relations between the samples depend on the context of the problem. They might be independent, but also linearly ordered, e.g. as a time series. They can also be spatially related, as in neighborhoods, or both (spatio-temporal), e.g. as climate data. The relations are of great significance when an outlier detection approach is selected, although this work will primarily focus on independent data.
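For illustration, a pairwise Euclidean distance matrix of the kind the approaches accept as input can be built as follows (plain Python; the function names are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples given as attribute lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_matrix(samples, dist=euclidean):
    """Symmetric n-by-n matrix of pairwise distances; rows are samples."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(samples[i], samples[j])
    return d

data = [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]]  # three bivariate samples
print(distance_matrix(data))
# → [[0.0, 5.0, 4.0], [5.0, 0.0, 3.0], [4.0, 3.0, 0.0]]
```

Any other distance measure suitable for the task, e.g. the Mahalanobis distance, can be substituted for the `dist` argument.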
2.1.2 Prior knowledge

The more that is known about the data beforehand, the easier it will be to select a suitable outlier detection approach: e.g. whether the data comes from an identifiable distribution, or whether preprocessing such as normalization or whitening is necessary. The data might also be labeled, meaning that the outlier status, or class membership, of all or some of the samples is already known. The state of membership knowledge can be classified according to three cases.

Case 1: none of the samples' memberships are known. This is known as the unsupervised case. Given that some data is available, but with normal data and outliers mixed in unknown proportions, it is possible, using a robust approach, to estimate or model the distribution. The word robust in this case means that the approach selected is insensitive to individual outliers, given reasonable proportions. The fewer outliers there are in the data, the less robust the approach needs to be in order to produce a satisfactory solution.

Case 2: all the correct labels are known for the given samples. This is known as the supervised case.

Case 3: the correct label is known for only normal or only outlier data, or only normal or only outlier data is available as training data. This case benefits those approaches that are constructed to find an optimal boundary around the data. Since there are only instances of one class there is no notion of robustness to consider; the problem is more a matter of model estimation.

In this work the focus will be on data of case 1, where the proportion of outliers in the data is small, and of case 3, where only data that is known to belong to the normal class is available.

2.2 Approaches to outlier detection

Most of the approaches make use of some kind of distance measure. The distance is an analogy for how much the tested data sample deviates from what is considered normal.
The approaches differ in which distance measure is used and in how normality is determined. The approaches also differ in what kind of training data they need. Some approaches work best when trained only on normal data and perform badly when there are outliers in the training data. Other approaches might not be as susceptible to this, which is referred to as robust in this work. Robustness is the concept of insensitivity to outliers in the training data; it does not mean that an approach is objectively better, as that depends on the training data available and the application. As an example, a very robust approach might be badly suited for cases where the normal data comes from more than one distribution, since it would only generalize to whichever distribution it deemed more important. On the other hand, a more sensitive approach would be able to adapt to two distributions, but also to noise in the training data.

The following summary of the area of outlier detection and related approaches is based on a collection of reviews: Chandola et al. [2009], Hodge and Austin [2004], Markou and Singh [2003a,b], Kriegel et al. [2009b], Patcha and Park [2007], Gogoi et al. [2011], Frisén [2003], Khan and Madden [2010], Zhang et al. [2010], Gupta et al. [2014].
2.2.1 Statistical

This section covers some approaches that one might call statistical. What sets these approaches apart from the others is that they assume that normal data belongs to a certain kind of statistical model and then test how well any new data fits that model. If the new data has a low probability of belonging to the model, the data is deemed to be an outlier. This category is split into two subcategories depending on what is assumed known about the data and the process generating it.

Parametric

Parametric approaches assume quite a lot about the data, but if the assumptions prove true these approaches are among the most effective. Usually it is assumed that the data belongs to a Gaussian distribution, and a model is created accordingly. The parameters of the model can be estimated using e.g. maximum likelihood estimates. One easy way to use a model to find outliers is simply to calculate how far a data point is from the mean and use a threshold to determine how likely the point is to be an outlier. Shewhart [1931], an early pioneer in this subject, used a threshold of 3 standard deviations. Further away than that, and the point was considered an outlier. Another very simple way is to calculate the Z value and compare it to a significance level, known as Grubbs' test [Grubbs, 1969]. This can be extended to multivariate models using the Mahalanobis distance from the mean.

One approach is to visualize the data as a box plot [Tukey, 1977]. A box plot is a very simple and well-known way to get an idea of the characteristics of a distribution, although it assumes the data belongs to a univariate distribution. It shows five significant values: the lower extreme, the lower quartile, the median, the upper quartile, and the upper extreme. It is then visually easy to guess whether a data point is an outlier.
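The 3-standard-deviation rule can be sketched directly (plain Python; the helper names are illustrative, and the rule implicitly assumes roughly Gaussian data):

```python
import statistics

def is_outlier(x, mu, sd, k=3.0):
    """Shewhart-style rule: flag points further than k standard
    deviations from the mean (k = 3 in the classical control chart)."""
    return abs(x - mu) > k * sd

readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
mu = statistics.mean(readings)   # ≈ 10.0
sd = statistics.stdev(readings)  # ≈ 0.2, so the band is roughly [9.4, 10.6]
print(is_outlier(10.1, mu, sd))  # → False (inside the 3-sigma band)
print(is_outlier(12.0, mu, sd))  # → True
```

Grubbs' test refines this idea by comparing the largest standardized deviation against a critical value derived from a significance level and the sample size, rather than a fixed k.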
However, assuming a Gaussian model and calculating the mean and covariance matrix for multivariate data is a simple solution. The squared Mahalanobis distance then follows a χ2-distribution whose degrees of freedom correspond to the dimensionality of the data. One significant problem is that these calculations are very sensitive to outliers in the training data, which can significantly affect the performance of the approach. From this follows another approach, one where the model is not used explicitly; rather, points are removed from the training set based on how large an impact they have on the calculated covariance of the total set. This approach is called the minimum covariance determinant and was introduced by Rousseeuw and Leroy [1987]. Another significant problem is that these approaches perform worse as the dimensionality increases, which is known as the curse of dimensionality, a problem suffered by most approaches that use some kind of distance measure.

Non-parametric

In the category of statistical approaches that are not parametric in the same sense as those described in the previous section, there are two major subcategories. The first one, and the simplest, is to use histograms to create a profile of the data, counting the data points that are within certain ranges and assigning them to bins. Using the
histograms for outlier detection is done in two steps. The first step is to build a histogram of the data. The second step is to test a data point: if it falls into a bin with a high number of training samples, it is assumed to be an instance of normal data. If it falls into a bin with a low number of training samples, or none at all, the test point is assumed to be an outlier. The number of training samples in the bin of the test point will in that sense correspond to an outlier score. The biggest challenge when using histograms for anomaly detection is selecting a suitable size for the bins. If the bins are too large, outliers will fall into bins with a lot of training data, which will lead to many false negatives. On the other hand, too small bins risk normal data falling into empty bins, resulting in a high number of false positives.

The other major category of statistical non-parametric outlier detection consists of approaches based on kernel functions. One of them is Parzen window estimation [Parzen, 1962]. This approach resembles the density estimation approaches mentioned in the previous section, except that it uses kernel functions. Statistical tests can then be used to determine whether a test data point is an outlier or not.

2.2.2 Proximity

A lot of approaches to outlier detection are based on the idea that normal data is found in dense areas and outliers are found in sparse areas. Whether or not a point is within a dense area can be determined by calculating the distance or similarity between the point and its nearest neighbors and comparing the result to that of other points. How to calculate the distance is not fixed; any similarity measure that is positive-definite and symmetric may work, but the effectiveness will vary. It is common to use the Euclidean or Mahalanobis distance if the values are continuous, and there are a lot of other, more complex measures that are suitable for e.g.
categorical data. Approaches that are based on proximity can be divided into two main groups: those that use the distance to a certain number of nearest neighbors as the outlier score, and those that use the distance measure to calculate the relative density around a data point and use that value as an outlier score. In this section, clustering and depth-based approaches will also be covered.

k-Nearest neighbor

k-Nearest neighbor is a very commonly used classifier, where an instance of data is assumed to belong to the same class as its nearest neighbors, often using voting if k > 1. This concept is easily adapted for use as an outlier detector by using the distance to the nearest neighbor and some threshold. Varying the k parameter causes different behaviors. Another way, as done by Ramaswamy et al. [2000], is to select a proportion of the total set of data points with the largest distances to their neighbors and declare them outliers, but this approach requires some idea of the proportion of outliers in the data set.

There are three common ways to extend the k-nearest neighbor approach. The first one is to change how an outlier score is determined based on the similarities between points, for example by changing the k-value, either to an absolute number or to a number proportional to the total data set, by calculating an outlier score based on the sum of the distances to the nearest neighbors, or by combining them in a different way. The second way is to use a different kind of distance measure. The third is based on increasing the efficiency of the algorithm
since, by definition, the computing time grows with the square of the number of points in the data set.

Relative density

This group of approaches is very closely related to k-nearest neighbor, but rather than selecting a fixed number of nearest neighbors and basing decisions on that, these approaches select those points that are within a certain range of the test data point, i.e. those points that are within an imagined hypersphere around the test point. One approach that uses this idea is presented in Knorr and Ng [1997]. It selects those points that are within a certain radius of the test point. If the number of those points is less than or equal to a selected percentage of the total number of points, the test point is considered an outlier. Approaches that need user-set parameters are generally not preferred in an unsupervised scenario, but this one is interesting since its percentage can easily be compared to a p-value.

One problem with these approaches is that they have difficulty with data sets that contain areas of varying density. Breunig et al. [1999, 2000] solve this by giving each data point an outlier score based on the ratio between the average local density of the neighborhood where the data point resides and the local density of the data point itself, which is called the local outlier factor (LOF). This approach has been extended both to improve the efficiency and to relate to different structures in the data. One example is the connectivity-based outlier factor (COF) by Tang et al. [2002], where the subsequent neighbors are also tested; it is thus able to handle data sets where the data is positioned following a line-like structure. Another extension is outlier detection using in-degree number, which also follows the neighbors of the neighbors of the data point, returning the number of points that have the test data point in their k-nearest neighbor lists.
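The basic k-nearest-neighbor outlier score, the distance to the k-th nearest training point, can be sketched as follows (plain Python; a naive linear scan per query, which illustrates why the efficiency improvements mentioned above matter):

```python
import math

def knn_score(train, x, k=3):
    """Outlier score of x: distance to its k-th nearest training point.
    Larger scores mean sparser neighborhoods, i.e. more outlier-like."""
    dists = sorted(math.dist(x, t) for t in train)
    return dists[k - 1]

# A tight cluster near the origin, plus two test points.
train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]]
print(knn_score(train, [0.05, 0.0]))  # inside the cluster: small score
print(knn_score(train, [5.0, 5.0]))   # far away: large score
```

Scoring every point against every other point in this way costs time quadratic in the data set size, which is the complexity issue noted in the text.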
This group of approaches suffers from the same computational complexity as the k-nearest neighbor algorithms, and a lot of research has gone into improving the efficiency.

Clustering

Clustering is a group of approaches that are extensively used in unsupervised settings. These approaches focus on finding groups in the data based on similarities, without actually knowing what the similarity represents. This is connected to the problem of outlier detection through what the clustering approach decides to do with those instances of data that do not seem to belong to any of the discovered clusters. This is the basis of clustering-based outlier detection. There are three main subcategories in the clustering approach.

The first consists of those approaches that find points that do not belong to any cluster. The approach does not force a point to belong to a cluster, but instead keeps track of such points as outliers. One clustering algorithm that does this is DBSCAN [Ester et al., 1996]. The drawback is that these algorithms are not optimized for this purpose; their main task is still clustering.

The second group consists of those approaches that produce a centroid point for each cluster. The distance of each data point to its associated cluster center can act as a measure of how likely it is to be an outlier.
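The centroid-based idea just described can be sketched with a plain k-means step followed by a distance-to-nearest-center score (plain Python; a simplified Lloyd iteration with illustrative names, not any particular published method):

```python
import math
import random

def kmeans(data, k, iters=20, seed=0):
    """Plain k-means (Lloyd iterations): returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in data:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # leave an empty centroid where it was
                centroids[j] = [sum(vals) / len(g) for vals in zip(*g)]
    return centroids

def centroid_score(x, centroids):
    """Outlier score: distance to the nearest cluster center."""
    return min(math.dist(x, c) for c in centroids)

# Two well-separated clusters; score a central and an in-between point.
rng = random.Random(4)
data = ([[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(100)]
        + [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(100)])
centers = kmeans(data, 2)
print(centroid_score([0.1, 0.0], centers))  # near a center: small
print(centroid_score([2.5, 2.5], centers))  # between the clusters: large
```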
The third approach is that whole clusters can represent outliers, the difference being how sparse the data points are within these clusters. By combining a clustering algorithm with the means mentioned in the previous section, the sparseness can be calculated and outliers can be determined based on a threshold or by proportion. The biggest difference between the clustering approaches and the nearest neighbor and relative density approaches is that the clustering approaches compare distances or similarities to the cluster centers, rather than to other data points.

Depth

One group of approaches bases the outlier detection on depth. They create a shell around the data based on the outermost points and assign those points a depth score of 1. The algorithm then continues to create shells around the data, continuously increasing the depth score. The shells are often referred to as convex hull layers. It is then up to the user to determine at what depth the data points are to be considered outliers. There are two commonly used algorithms for this approach, ISODEPTH [Ruts and Rousseeuw, 1996] and FDC [Johnson et al., 1998]. According to Kriegel et al. [2009b] they are only efficient on low-dimensional data sets. One difference from most of the other approaches mentioned is that they return a discrete value, the number of the layer, instead of a continuous score, which can have some impact on the accuracy.

2.2.3 Neural networks

There are a lot of approaches that fall under the neural network category. The survey by Markou and Singh [2003b] is divided into two parts, the second of which focuses explicitly on neural network based approaches. Probably the most common technique associated with neural networks is the multi-layer perceptron (MLP). MLPs are used primarily for classification, but can be adapted to work as outlier detectors.
It is stated by Bishop [1994] that when MLPs fail, it is often because of data that the trained MLP has not seen before, the kind of data that can be seen as an outlier. So when making errors the MLPs show a very low confidence in their decisions. This can be used as a measure of outlierness. Another approach to this is presented by Le Cun et al. [1990], where the activation level of the winning node has to be higher than a certain threshold, that of the second winning node has to be lower than another threshold, and additionally the absolute difference between these has to be higher than yet another threshold.

The above-mentioned approaches are not strictly unsupervised. One approach that is wholly unsupervised is the self-organizing map (SOM) introduced by Kohonen [1988]. The SOM approach could also be categorized as a clustering approach since it produces a similar output, but the underlying technique is built upon artificial neurons. Ypma and Duin [1997] use this technique for outlier detection; they argue that the best solution is to train the system on normal operation to get a representation of what normal is, and use that as a comparison.

In Marsland et al. [1999] an approach based on habituation is introduced, which learns to ignore repeated stimuli in the same way the human brain does. The approach is further developed in Marsland et al. [2000a], where the SOM technique is used
together with a filter in order to do outlier detection. In Marsland et al. [2000b] temporal Kohonen maps are used; they are based on Kohonen's SOM but use what is called a "leaky integrator" which decays over time, i.e. it allows the system to forget, which makes this approach suitable for adaptive and real-time systems.

Neural trees, introduced by Martinez [1998], is an approach that combines the unsupervised learning of competitive neural networks with a binary tree structure. When used for outlier detection it splits the data space into cells, where the dense areas are split into smaller cells and the sparse areas into larger cells. This approach is also suitable for online learning.

2.2.4 Machine learning

This category encompasses many different kinds of approaches based on different theories and techniques, but most are based on sets of rules that are generated by an algorithm or by systematically removing or adding points. One of these techniques is decision trees, used specifically when the data is categorical, which can be useful for detecting errors in databases. Skalak and Rissland [1990] train C4.5 decision trees with preselected normal classes, and John [1995] builds on this approach by using repeated pruning and retraining of the decision tree. The pruned nodes get to represent the outliers.

Fawcett and Provost [1999] use a rule-based system for detecting outliers based on profiling cases of normal data, and apply this idea to intrusion detection. Arning et al. [1996] apply another approach, based on pruning categorical data, by identifying subsets and removing those that cause the greatest reduction in complexity.

There are also those approaches that take a Bayesian approach to outlier detection. A naive Bayesian model technique for detecting local outliers, one that also factors in the spatio-temporal correlations, is described by Elnahrawy and Nath [2004].
In the structure of a network, the nodes calculate the probability of their inputs falling in certain subintervals of the whole interval. If the probability of a sensed reading in its subinterval is less than that of another subinterval it is considered an outlier, but this particular approach is only applicable to one-dimensional data. Janakiram et al. [2006] describe an approach based on a Bayesian belief network model that is applied to streaming data. The nodes in this network not only consider their own readings, but also those of their neighbors. A data point is considered an outlier if it falls beyond the range of its expected class. Hill et al. [2007] introduce two approaches based on dynamic Bayesian belief models: one where the current state is based on historical data and the posterior probabilities are calculated over a sliding window, and another that allows for several data streams at once.

Yet another way to find outliers, very different from those presented so far, is the isolation forest approach described by Liu et al. [2008]. The data is partitioned randomly and recursively according to a tree structure until all data points are isolated, and based on the idea that outliers are "few and different" they will be found closer to the root. The path length to a data point can then act as an outlier score; a smaller score means the point is more likely to be an outlier. The trees can be trained on parts of the training set and then combined into an ensemble of trees, a forest.
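To make the isolation idea concrete, the following minimal Python sketch (illustrative, not from the thesis) builds random isolation trees and uses the average path length as the score. It omits the path-length correction term c(n) that the original paper applies at non-singleton leaves, so it shows only the core mechanism.

```python
import random

def build_tree(points, depth, max_depth, rng):
    # Leaf: a single point, identical points, or the depth limit reached.
    if len(points) <= 1 or depth >= max_depth:
        return ("leaf",)
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf",)
    split = rng.uniform(lo, hi)  # random split between the observed extremes
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    # Follow the random splits until x reaches a leaf.
    if tree[0] == "leaf":
        return depth
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)

def isolation_score(forest, x):
    # Mean path length over the ensemble; outliers isolate near the root,
    # so a SMALLER score indicates a more likely outlier.
    return sum(path_length(t, x) for t in forest) / len(forest)
```

Trained on a cluster of normal points, a distant point should receive a noticeably shorter average path than a point in the dense center.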
2.2.5 High-dimensionality

Many of the approaches mentioned so far have problems with high-dimensional data. This common issue is known as the curse of dimensionality and stems from the fact that most distance measures become less meaningful as more dimensions are involved. It is intuitively easy to grasp: a distance measure is a way to reduce the dimensionality to 1, distilling the most important differences (depending, of course, on the distance measure chosen), and in doing so a lot of information is lost.

One alternative is to use principal component analysis (PCA), which is a way of reducing dimensionality while still attempting to keep most of the information. The new dimensions are called principal components, where the first is the one with the largest eigenvalue and has the most informational content. Further principal components are arranged according to their eigenvalues and have decreasing informational content. The idea is that only the first one, or the first few, will be interesting. These principal components can then be used in a familiar way with other outlier detection approaches.

An approach based on angles, rather than distances, is described in Kriegel et al. [2008]. It builds on the idea that a normal data point has other data points all around it and therefore has angles to other points that are evenly distributed. An outlier point has most of the other points in the same direction, within a smaller interval of angles. This approach is called angle-based outlier degree (ABOD).

Aggarwal and Yu [2000] describe an approach that divides the data space into an evenly sized grid, counts the number of data points in each grid cell, and computes a sparsity coefficient. The points in the cells with the lowest scores are considered outliers. The quality of the result is highly dependent on the grid resolution and position.
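A stripped-down sketch of the cell-counting idea follows (Python, illustrative only; the original work additionally searches lower-dimensional projections for sparse cells with an evolutionary algorithm, which is omitted here).

```python
from collections import Counter

def grid_cells(points, cell_size):
    # Map each point to the integer grid cell it falls in.
    return [tuple(int(c // cell_size) for c in p) for p in points]

def grid_occupancy_scores(points, cell_size):
    # Score each point by the occupancy of its cell; a low count
    # suggests the point lies in a sparse region, i.e. is an outlier.
    cells = grid_cells(points, cell_size)
    counts = Counter(cells)
    return [counts[c] for c in cells]
```

With a tight cluster plus one isolated point, the isolated point ends up alone in its cell and receives the lowest score; the sensitivity to `cell_size` mirrors the grid-resolution dependence noted above.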
Another, seemingly more complicated approach is the subspace outlier degree (SOD), presented in Kriegel et al. [2009a]. It explores how well a point fits a subspace defined by its nearest neighbors. The subspace is a hyperplane that is parallel to the axes and of lower dimensionality than the input space. It is found by minimizing the variance of the selected nearest neighbors. If the test point deviates a lot from this hyperplane it is considered to be an outlier.

2.3 Outlier detection by p-value estimation

The usefulness of an approach that estimates a p-value is described in Section 1.3. Most of the approaches listed in Section 2.2 do not inherently produce a p-value. The statistically based approaches that create models of the data can calculate one, e.g. by estimating the parameters of a Gaussian normal distribution. Some of the other approaches do as well, such as the one described in Knorr and Ng [1997], where a measure similar to a p-value is estimated before being tested against a threshold. Modifying the algorithm to catch the measure before the threshold will yield a p-value.

One approach that can act as a framework for estimating p-values using any other real-valued-score-producing approach, whether it is a legacy score or not, is the CAD, explained in Section 2.4. Similarly to the mentioned DB approach, the CAD in its original form also applies a threshold; omitting this will yield a p-value.
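The ratio-based p-value estimation discussed in this section fits in a few lines. The sketch below (Python, illustrative; the thesis implementations are in MATLAB) assumes that real-valued strangeness scores for the training points are already available, and includes the test point in the count, following the convention of Equation 2.1 in Section 2.4.

```python
def estimated_p_value(train_scores, test_score):
    """Fraction of scores in the bag (training scores plus the test
    score) that are at least as strange as the test score."""
    scores = list(train_scores) + [test_score]
    n_at_least_as_strange = sum(1 for a in scores if a >= test_score)
    return n_at_least_as_strange / len(scores)
```

A test point stranger than everything in the training set gets the smallest possible p-value, 1/(N + 1); a test point less strange than everything gets p = 1.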
The R2S approach, developed for a project at Halmstad University and described in Section 2.5, does inherently produce a p-value.

All three of the above mentioned approaches estimate the p-value by calculating the ratio of the number of data points in the training set that are more different from normal than the test point, according to some arbitrary difference measure, to the total number of points in the training set. One approach that does not do this is the application and adaptation of the OCSVM found in this work, which uses the ability of the OCSVM to estimate a boundary at a certain proportion of outliers in the training set. More details on the exact approach are found in Section 2.6.

2.4 Conformal anomaly detection (CAD)

The conformal anomaly detector (CAD) is an approach to detect outliers presented in a recent thesis by Laxhammar [2014]. It is based on the conformal predictor described by Gammerman and Vovk [2007].

The conformal predictor is a technique to get a measure of confidence in the predictions made by traditional machine learning approaches. Its most important property is automatic validity under the randomness assumption, i.e. that the data points are generated independently from the same probability distribution. Central to the conformal predictor is the p-value calculation, done through a frequency test. It is up to a traditional machine learning approach to produce a value of how "strange" a data point is compared to the rest of the data points in the set. The ratio of the number of points that are more "strange" than the current data point to the total number of data points in the set is the p-value. By the law of large numbers, the larger the total set of data points is, the more accurate the p-value will be. In Gammerman and Vovk [2007] experiments are carried out with classifications using a support vector machine on a set of handwritten numerals.
The results are compared with a Bayes-optimal confidence predictor, with encouraging results. Nearest neighbors are also mentioned as an example of a difference measure.

In the thesis by Laxhammar [2014] the conformal predictor is adapted for use as an outlier detection approach, called the conformal anomaly detector (CAD). Although the algorithm is presented as using a threshold to get a binary result, the threshold is applied to the estimated p-value, calculated in a similar way as in the conformal predictor, i.e. by using a difference measure. The difference measure is now a measure of how different a data point is from what is expected to be normal, i.e. a measure of outlierness. Laxhammar [2014] calls this value a non-conformity measure (NCM). Any approach to outlier detection that outputs a real-valued score can be used as a non-conformity measure, and the choice will affect the behavior of the estimated p-value.

The CAD can be implemented in a number of ways. One choice is whether to use the online or the offline version. In the online version the training set is continuously updated and new samples are continuously added, as described by Equation 2.1, where αi is the result of the NCM for the corresponding data point. In the offline version the training set is kept constant.

p = |{i = 1, . . . , N + 1 : αi ≥ αN+1}| / (N + 1)    (2.1)
Where |{. . .}| is the cardinality of the set. The validity can be improved by implementing a smoothing factor τ, a random sample from a uniform distribution between 0 and 1. The smoothing factor is incorporated according to Equation 2.2.

p = (|{i = 1, . . . , N + 1 : αi > αN+1}| + τ |{i = 1, . . . , N + 1 : αi = αN+1}|) / (N + 1)    (2.2)

The main drawback of the CAD is that it is computationally inefficient: it requires the NCM to be recalculated for each test sample and, in addition, for each sample in the training set. This means that as the total number of points in the data set increases, so does the computational time. How much the computational time increases also depends on the complexity of the chosen NCM. To avoid this, a different version of the CAD is introduced where the training set is split into a proper training set and a calibration set. That way the NCMs of the calibration set can be precomputed based on the proper training set, which reduces the computational complexity. This new version of the CAD is called inductive, and the original is referred to as transductive.

2.4.1 Non-conformity measure (NCM)

The CAD will produce valid results for any real-valued nonconformity measure, but the performance is highly dependent on the type of nonconformity measure selected, and how well it can discriminate between outlier data points and data points that are members of the normal data.

2.5 R2S

The R2S approach is a very simple and intuitive algorithm that is used as a baseline for this work. It relies heavily on the concept of a distance matrix, a matrix of the distances between each pair of data points in the input set.

X = {x1, . . . , xN}    (2.3)

dij = dist(xi, xj),  for all i = 1, . . . , N and j = 1, . . . , N    (2.4)

D = [ d11  d12  · · ·  d1N
      d21  d22  · · ·  d2N
      ...  ...  ...    ...
      dN1  dN2  · · ·  dNN ]    (2.5)

In Equation 2.3, X is the input data set and each xi is a data point; N is the total number of data points. In Equation 2.4, dist(·, ·) is any suitable real-valued distance function, and in Equation 2.5, D is the distance matrix.

The R2S approach builds on the observation that the data point with the smallest total distance to all other data points, i.e. the one with the smallest row sum in the distance matrix of the whole data set, is the data point that is closest to the mean of
all the data samples. The R2S algorithm selects the data point in the training set with the minimum row sum from the total distance matrix and uses this as the central sample. The set of distances from the remaining data samples to this central data sample is then used as the empirical distribution. The distance measures used in this work are the Mahalanobis distance for the synthetic data and the Hellinger distance for the non-synthetic data. The distance measure can easily be exchanged for any application where another distance measure is better suited. The p-value for a test sample is then estimated as the fraction of samples in the training set that lie further away from the central data sample than the test sample, as shown in Equation (2.6), where dic is the distance from sample i to the central sample and dmc is the distance from the test sample to the central sample.

p = |{i = 1, . . . , N : dic > dmc}| / N    (2.6)

Comparing how the R2S approach and the CAD estimate the p-value, it is clear that the R2S approach is essentially equal to the CAD with the distance to the central data sample as the NCM.

2.6 The evolution of the support vector machine

This section describes the evolution of the original support vector machine, a two-class classifier, into the one-class support vector machine utilized in this work. To do this in a suitably pedagogical way, the simplest use of the support vector machine is described first and in most detail; further iterations then build on this by sequentially adding the features that eventually enable the one-class support vector machine.

The support vector machine algorithm was invented by Vladimir N. Vapnik and described in Boser et al. [1992], where it is referred to as the optimal margin classifier; this is the version covered in Sections 2.6.1 and 2.6.2.
The binary classifying support vector machine and some of its variations are well described in Campbell [2002].

2.6.1 The linearly separable case

The linear case is the simplest case of the support vector machine. It covers the case where the data points are linearly separable into two classes. A data point, or sample, is represented as a vector xi. For the training data, the label of each data point is known and represented by yi. The label is either +1 or −1 depending on which of the classes the data point belongs to. In mathematical terms:

(xi, yi),  where  yi = +1 if xi ∈ class A,  yi = −1 if xi ∈ class B,  i = 1, . . . , N    (2.7)

Since the data points are linearly separable there exists a hyperplane between them, and their class membership can be determined using the decision function, depending on whether its value is above or
below 0:

D(x) = w · x + b    (2.8)

From this, the problem is clear: the vector w and the scalar b need to be found. There exist infinitely many w that solve the problem, since there are infinitely many planes that can separate the data points. But to get the "best" result based on the training data, there must be a way of finding the "optimal" solution: one that maximizes the distance between the hyperplane and the closest training data points, i.e. the margin. So the problem is maximizing the margin. The margin M can be defined according to:

yi D(xi) / ||w|| ≥ M    (2.9)

But the actual scale of M is not important, only the w at which it is maximized, so by normalizing the scale of M by ||w|| the problem is equivalent to:

minimize over w:  (1/2)(w · w)
subject to:  yi(w · xi + b) ≥ 1    (2.10)

The constraints can be incorporated into the expression so that the problem is transformed into minimizing the primal Lagrangian:

minimize over w, b, α:  L(w, b, α) = (1/2)(w · w) − Σi αi(yi(w · xi + b) − 1)
subject to:  αi ≥ 0    (2.11)

According to Wolfe's theorem, the partial derivatives with respect to w and b, set to zero, can be substituted into the primal, which gives the new problem of minimizing the dual Lagrangian [Wolfe, 1961]:

∂L/∂w = w − Σi αi yi xi = 0  ⇒  w = Σi αi yi xi
∂L/∂b = −Σi αi yi = 0    (2.12)

minimize over α:  (1/2) Σij αi αj yi yj (xi · xj) − Σi αi
subject to:  Σi αi yi = 0,  αi ≥ 0    (2.13)

This minimization problem is solved with quadratic programming and the result is the vector α. Only a few of the values in α will be > 0; the corresponding xi are called the support vectors, and only those are actually needed by the new decision function that, through substituting Equation (2.12) into (2.8), becomes:

D(x) = w · x + b = Σi αi yi (xi · x) + b    (2.14)
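As an illustration of the decision function in Equation (2.14), the following sketch (illustrative Python, not the thesis's implementation) evaluates D(x) for a two-point toy problem whose dual solution can be found by hand: for the points (−1, 0) with label −1 and (1, 0) with label +1, the optimal solution is α1 = α2 = 1/2 and b = 0, giving w = (1, 0).

```python
def dot(u, v):
    # Linear kernel: the ordinary dot product.
    return sum(a * b for a, b in zip(u, v))

def svm_decision(x, support, alphas, labels, b, kernel=dot):
    # Equation (2.14): D(x) = sum_i alpha_i * y_i * K(x_i, x) + b
    return sum(a * y * kernel(s, x)
               for a, y, s in zip(alphas, labels, support)) + b
```

Evaluating D at the two support vectors returns +1 and −1 respectively, exactly the active constraints of (2.10), and any point on the separating hyperplane x1 = 0 returns 0.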
The b variable has no impact on the optimization and can be calculated separately using the result of the optimization. One way of calculating b is by taking the most extreme support vectors of each class and computing an average:

b = −(1/2) [ max over {i | yi = −1} of Σj yj αj K(xi, xj)  +  min over {i | yi = +1} of Σj yj αj K(xi, xj) ]    (2.15)

2.6.2 Nonlinearity

The previous section describes the case when the two classes are linearly separable, but this is certainly not always the case. Even if the two classes are not linearly separable, there might still be a way to map the data into a higher-dimensional space where they become linearly separable. Translating each data point and then comparing them in the higher dimension is essentially what needs to be done, but this is not feasible due to computational constraints. What can be done is to use what is called the kernel trick. By using a function that produces the scalar product of two points in the high-dimensional space, without doing the point-by-point translation, a lot of the necessary computation is skipped.

This is possible simply because the two important equations, the decision function (2.14) and the optimization problem (2.13), never use the data points explicitly; they only use the dot product. In the linear case the dot product is the kernel. Replacing the dot product with another kernel function is the same as calculating the dot product in a higher-dimensional space. The equations are changed according to the following:

minimize over α:  (1/2) Σij αi αj yi yj K(xi, xj) − Σi αi
subject to:  Σi αi yi = 0,  αi ≥ 0    (2.16)

D(x) = Σi αi yi K(xi, x) + b    (2.17)

Where K(x, y) is the kernel function. Different kernels represent different higher-dimensional spaces, and by selecting the Gaussian kernel an infinite-dimensional space is emulated, which makes the support vector machine a very powerful tool indeed.
Table 2.1 shows some commonly utilized kernels.

    Kernel        K(xi, xj)
    Linear        xi · xj
    Polynomial    (1 + xi · xj)^d
    Gaussian      exp(−||xi − xj||² / (2σ²))
    Sigmoid       tanh(β xi · xj + b)

Table 2.1: Commonly used kernels for support vector machines.
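The kernels in Table 2.1 are one-liners; a Python sketch follows (parameter names d, sigma, beta, b as in the table; illustrative code, not the thesis's implementation).

```python
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    return (1 + linear_kernel(x, y)) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2)); equals 1 when x == y.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(x, y, beta=1.0, b=0.0):
    return math.tanh(beta * linear_kernel(x, y) + b)
```

Any of these can replace `dot` in the decision function of Section 2.6.1, which is exactly the kernel trick described above.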
2.6.3 Soft margins

C-SVM

It is very likely that the training data contains data points that are erroneous in some way. In order to minimize the impact of these erroneous data points on the final solution, a slack variable is introduced in a paper by Cortes and Vapnik [1995]. This variable, ξi, allows for some exceptions during training. The C parameter controls how much of the error the slack variables are able to soak up. The optimization problem is changed to (2.18), where it is clear that C controls the balance between the ordinary SVM problem from (2.10) and the sum of the slack variables.

minimize over w, ξ:  (1/2)(w · w) + C Σi ξi
subject to:  yi(w · xi + b) ≥ 1 − ξi,  ξi ≥ 0    (2.18)

By deriving the dual Lagrangian from this new problem, as in Section 2.6.1, a solution (2.19) emerges that is surprisingly similar to (2.13), the difference being the upper bound on the support vector weights.

minimize over α:  (1/2) Σij αi αj yi yj K(xi, xj) − Σi αi
subject to:  Σi αi yi = 0,  0 ≤ αi ≤ C    (2.19)

ν-SVM

The concept of the C parameter was thought to be too abstract, so another version of the soft-margin SVM was introduced in a paper by Schölkopf et al. [2000], called the ν-SVM. This version uses a different parameter, ν, that is bounded to the interval 0 to 1 and is thus slightly more intuitive for the end user. The resulting problem formulation (2.20) becomes a bit more complicated, the dual Lagrangian (2.21) also becomes different, and the limits on the support vectors are a bit more involved.

minimize over w, ξ, ρ:  (1/2)(w · w) − νρ + (1/N) Σi ξi
subject to:  yi(w · xi + b) ≥ ρ − ξi,  ξi ≥ 0,  ρ ≥ 0    (2.20)

minimize over α:  (1/2) Σij αi αj yi yj K(xi, xj)
subject to:  Σi αi yi = 0,  Σi αi ≥ ν,  0 ≤ αi ≤ 1/N    (2.21)
2.6.4 One class

All of the previously mentioned derivatives of the support vector machine are used for two-class classification, but the subject of this work is outlier detection. The soft-margin SVMs, and the introduction of the slack variable concept, have allowed the further development of the SVM into a one-class classifier. What in the previously covered SVM versions is the hyperplane that divides the two classes is now rather a border that describes the outer limits of one class. By controlling the exceptions through slack variables, the amount of outliers can be controlled. There are two major contenders in this field: the support vector domain descriptor (SVDD) and the one-class support vector machine (OCSVM).

SVDD

The support vector domain descriptor was introduced in the paper by Tax and Duin [1999]. This approach attempts to find a hypersphere, with radius R and center a, that contains most of the training data points. By using the idea of slack variables as introduced in Section 2.6.3 and incorporating them into the constraints, the problem can be stated according to (2.22). The first constraint contains the distance from the data point to the center point a. The constant C controls the trade-off between the volume of the sphere and the number of data points that are rejected. Also note that there is no mention of the yi variable, which is the label in the binary classification case.

minimize over R, ξ, a:  R² + C Σi ξi
subject to:  (xi − a)(xi − a)ᵀ ≤ R² + ξi,  ξi ≥ 0    (2.22)

The Lagrangian is constructed and the partial derivatives are set to zero, giving the problem statement (2.23), where the center is recovered as a = Σi αi xi.

minimize over α:  Σij αi αj (xi · xj) − Σi αi (xi · xi)
subject to:  Σi αi = 1,  0 ≤ αi ≤ C    (2.23)

Where the dot products can be replaced with a kernel as in Section 2.6.2.
Tax and Duin [1999] find the Gaussian kernel to be preferable, and propose that the width of the kernel be used to control the behavior of the algorithm. A small width makes the algorithm similar to a Parzen density estimation, whereas a larger kernel makes it more general. The C will, just as for the C-SVM, be the upper bound on the size of the support vector weights. Schölkopf et al. [2001] present a solution for using a ν parameter, as in the ν-SVM, instead of the C for the SVDD.
OCSVM

The one-class support vector machine was introduced in Schölkopf et al. [2001] and, just like the SVDD, it is designed to estimate a boundary around some given data rather than perform binary classification. Rather than using hyperspheres, the OCSVM attempts, with the help of the kernel function, to separate the data points from the origin with a maximum-margin hyperplane. Just as in the SVDD, the slack variables ξi are used to tolerate outliers. The problem is stated as (2.24). Notice that, just as for the SVDD, there is no mention of the label variable yi, and that C, which controls the weight of the slack variables, is replaced with 1/(νN). N is the size of the training set and the ν variable works approximately as it does in the ν-SVM.

minimize over w, ξ, ρ:  (1/2)(w · w) + (1/(νN)) Σi ξi − ρ
subject to:  w · xi ≥ ρ − ξi,  ξi ≥ 0    (2.24)

The ν parameter will in effect be:

• An upper bound on the fraction of outliers.
• A lower bound on the fraction of support vectors.
• Given the right kernel, equal to both the fraction of outliers and the fraction of support vectors.

Consequently, using ν = 1 will cause the result to equal a Parzen density estimation, another approach to density estimation and outlier detection mentioned in Section 2.2. By incorporating the constraints in (2.24), creating the Lagrangian, and setting the partial derivatives to zero, the same procedure as described earlier, the optimization problem (2.25) emerges.

minimize over α:  (1/2) Σij αi αj K(xi, xj)
subject to:  Σi αi = 1,  0 ≤ αi ≤ 1/(νN)    (2.25)

The decision function (2.26) takes on a slightly different appearance; remember that the label parameter yi is not necessary.

D(x) = Σi αi K(xi, x) − ρ    (2.26)

The ρ can be calculated using Equation (2.27), for any xi whose corresponding αi > 0, i.e. one of the support vectors.
ρ = Σj αj K(xj, xi)    (2.27)

Campbell and Bennett [2001] mention that there is a disadvantage in this approach, stemming from the assumption about the origin, which acts as a prior for where the outliers are
supposed to lie. Still, they successfully managed to detect novel outliers when testing on both synthetic and real data. The approach is also criticized in Manevitz and Yousef [2002] for the necessity of selecting parameters, and in particular for its sensitivity to the choice of kernel parameters. In order to use the OCSVM approach to estimate the p-value, the fact that the ν value represents the fraction of outliers is utilized, which is more thoroughly described in Section 3.5.

An alternative approach is presented in Hempstalk et al. [2008], where regular outlier detection is performed and additional data points are then created for the outlier class. The generated outliers are used together with the original data to train a binary classifier with a probability output. The two-class SVM with probability output was first introduced by Platt [2000] and improved by Lin et al. [2007].
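The ν property listed above is easy to observe empirically. The thesis uses LibSVM from MATLAB; the sketch below instead uses scikit-learn's OneClassSVM, which wraps the same LibSVM implementation, to check that the fraction of training points classified as outliers stays close to (and roughly bounded by) ν. The data set and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))      # unimodal "normal" training data

nu = 0.1
model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
pred = model.predict(X)            # +1 = normal, -1 = outlier
outlier_fraction = float(np.mean(pred == -1))
```

For a Gaussian kernel on data like this, `outlier_fraction` lands near ν, which is the property exploited in Section 3.5 to turn a family of OCSVM models into a p-value estimator.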
Chapter 3 Methods

3.1 Approach selection criteria and discussion

Obviously it is impossible to do a complete evaluation of all available approaches. A selection has to be made, and some criteria for this selection are already covered in the research question in Section 1.4. The number of approaches needs to be kept small enough to fit the scope of a master thesis, but large enough to cover a fair selection of the state-of-the-art techniques covered in Section 2.2.

The first approach selected is one that is used in a project at Halmstad University. This approach is based on finding the most central point and calculating the distances from that point to the rest. It is referred to as the R2S approach in this work.

The second approach is based on support vector machines (SVM). There are two well-known approaches to outlier detection based on the SVM technique: the one-class support vector machine (OCSVM) and the support vector domain descriptor (SVDD). Of these two, the OCSVM is selected. Since one of the requirements is p-value estimation, the selected approach needs to be able to handle that. Unfortunately neither of the above mentioned support vector machine derivatives does so natively, but thanks to the nature of the ν parameter of the OCSVM it can be utilized to get a quantized p-value estimate.

The third approach is a very recent one introduced by Laxhammar [2014], based on the work of Gammerman and Vovk [2007], called conformal anomaly detection (CAD). The approach itself is in a sense a way to convert values from a difference measure into a p-value estimate. The difference measure used is called, in the context of the CAD approach, the non-conformity measure (NCM). Depending on which NCM is used, the CAD will show different kinds of behavior and performance.
Two different NCMs are selected: one based on a statistical, parametric approach, calculating the Mahalanobis distance, and another that uses a nonparametric approach in the form of a k-nearest-neighbor based NCM.

3.2 Adaptation and implementations

One of the requirements on the approaches selected for further investigation in this work is that their result is given as a p-value estimate. This is not a native function of any of the selected approaches except for the R2S approach. The CAD algorithm returns a binary
value, as described in the paper by Laxhammar [2014], but by catching the score before the threshold a p-value estimate is acquired. As for the OCSVM, the adaptation to p-value estimation is not equally straightforward. The approaches are generally implemented in MATLAB [MATLAB, 2011], and the LibSVM [Chang and Lin, 2011] library, used for the OCSVM, is implemented in C and accessible from MATLAB through a MEX interface.

3.3 Implementation of the CAD

The implementation of the CAD follows as closely as possible the algorithm described in Chapter 4 of Laxhammar [2014]. It is based on the transductive CAD, which is more suitable when a fixed training set is used. Were the inductive CAD selected, the training set would have to be divided into a proper training set and a calibration set, and the proportions between the two would add yet another unnecessary parameter.

Since the training sets are fixed and the test points do not conform to a random distribution, it would not be suitable to add test points to the training set as they are tested; it would make the result dependent on the order in which the test points were tested. The implementation can therefore be viewed as being of the offline variety. The implementation is as follows:

    z_{N+1} ← t
    for i ← 1 to N + 1 do
        α_i ← NCM({z_1, . . . , z_{N+1}} \ z_i, z_i)
    end for
    p ← |{i = 1, . . . , N + 1 : α_i ≥ α_{N+1}}| / (N + 1)

Where (z1, . . . , zN) is the training set, t is the test data point, and p is the resulting p-value.

3.3.1 First NCM: Mahalanobis distance

The first of the nonconformity measures used is a very simple alternative: the Mahalanobis distance. The covariance matrix of the training data is calculated and used to compute the Mahalanobis distance from the mean of the training data to the test point. The result is the distance from the mean of the assumed distribution.
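The transductive loop with this Mahalanobis NCM can be sketched in Python (illustrative, not the thesis's MATLAB code; numpy is used for the covariance computation, and a pseudo-inverse stands in for the plain inverse to keep the sketch robust).

```python
import numpy as np

def mahalanobis_ncm(others, z):
    # Mahalanobis distance from z to the mean of the other points.
    others = np.asarray(others)
    mean = others.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(others, rowvar=False))
    d = np.asarray(z) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def cad_p_value(train, t, ncm):
    # Transductive CAD: recompute the NCM for every point in the
    # extended bag (training points plus the test point), then apply
    # the frequency test.  The test point's own score is alphas[-1].
    bag = list(train) + [t]
    alphas = [ncm([z for j, z in enumerate(bag) if j != i], bag[i])
              for i in range(len(bag))]
    return sum(a >= alphas[-1] for a in alphas) / len(bag)
```

Note the cost the text warns about: one NCM evaluation per bag member per test point, which is what the inductive variant avoids.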
This approach can in itself be used as a rudimentary outlier detection approach, with a carefully selected threshold for determining whether a data point is an outlier or not; there is, however, no need to select a threshold in this case, since the measure is used within the CAD framework. It assumes that the data belongs to a unimodal distribution, where the mean has the highest p-value, and that the underlying distribution is symmetric around the mean. These assumptions will likely have a detrimental effect on the p-value estimation in situations where they prove false.

3.3.2 Second NCM: k-NN

The second nonconformity measure is based on the idea behind the nearest neighbor approaches: data points that are likely to be outliers have a large distance to their
neighbors, whereas data points that belong to the normal data have a shorter distance to their neighbors. Since the NCM values are only used in a comparative context, there is no effective difference between using the mean or the sum of the distances to the nearest neighbors; the sum is therefore used. Choosing the number of nearest neighbors is explained in Section 3.3.3 below.

3.3.3 Selecting a k-value for k-NN

The choice of k for the nearest-neighbor NCM affects the behavior of the whole approach. Selecting a very large k will likely yield an approach that behaves similarly to the one using the Mahalanobis distance. Selecting a very small k will likely yield an approach that is too sensitive to the individual points in the training data.

The difficulty in selecting a good value for k is that the scenario is unsupervised: there is no way of finding the optimal k since the underlying distribution is unknown. This is unlike the supervised two-class case, where finding a close to optimal parameter is simple using techniques like k-fold validation. In this work the approaches are compared in a controlled context, and since the CAD Mahalanobis and the R2S approaches both assume that the data is unimodal and symmetric, they are expected to perform best when the data comes from a normal distribution. It is therefore, in the author's view, reasonable to select a k-value such that the approach performs comparably well to those other methods when the data is based on the normal distribution. Through testing, a good value for k was found to be 30% of the size of the training set. Figure 3.1 shows examples of results for different values of k on a single training set of 100 samples generated from a normal distribution. In real applications, where the shape of the training data is unknown, there are no shortcuts like this.
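For concreteness, the k-NN NCM (sum of distances to the k nearest neighbors) and the 30% heuristic might look like the following sketch; this is illustrative Python, not the thesis's MATLAB implementation, and the function names are invented.

```python
def knn_ncm(others, z, k):
    # Sum of the Euclidean distances from z to its k nearest
    # neighbours among the other points.
    dists = sorted(
        sum((a - b) ** 2 for a, b in zip(p, z)) ** 0.5 for p in others
    )
    return sum(dists[:k])

def heuristic_k(n_train, fraction=0.3):
    # The 30%-of-training-set heuristic found to work well in this chapter.
    return max(1, round(fraction * n_train))
```

Plugged into the CAD loop of Section 3.3 in place of the Mahalanobis NCM, this gives the second, nonparametric variant of the detector.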
Instead the k-value has to be based on a qualified guess, where additional information about the training data will aid the guesswork.

3.4 Implementation of the R2S

The R2S method returns the expected p-value by default, so no actual adaptation is necessary. The implementation follows the description verbatim.

3.5 Implementation of the OCSVM

The basic functionality of the support vector machine is provided by the LibSVM library [Chang and Lin, 2011], which contains an implementation of Schölkopf's OCSVM [Schölkopf et al., 2001]. As mentioned in Section 2.6.4, the OCSVM utilizes a ν parameter to give the algorithm an idea of the proportion of outliers to expect within the training data; this information is required since the algorithm works in an unsupervised scenario. That the OCSVM is an unsupervised approach, yet requires information beyond the training data, is somewhat contradictory, but that is a more philosophical discussion and will not be further explored here.
Figure 3.1: Example of results of different values of k for the CAD k-NN from a single data set of 100 samples generated from a normal distribution. Panels: (a) 10%, (b) 20%, (c) 30%, and (d) 40% nearest neighbors.

Since the ν-value describes the actual proportion of outliers in the training set, and thus the proportion of data points that are excluded and classed as outliers, the ν-value will correspond, in a sense, to a threshold on the p-value. By training several OCSVM models at different levels of the ν-value, the proper p-values can be estimated and searched for. This process is automated in the implementation used in this work by dividing the 0-to-1 span into several parts, where the ν-value corresponds to k/R; here k is the current level trained or searched and R is the total number of levels, or divisions, of the 0-to-1 interval. The implementation used in this work is designed in two versions, both yielding the same results but with different performance depending on the format of the testing data. These two cases can be referred to as: • Doing an online search, which is suitable when there is a small number of test samples and/or a changing set of training data. • Doing a batch evaluation, which is suitable for a large set of test samples or when the training data is static. The latter case can be divided into two distinct phases: one phase of training, which results in a list of trained models, and a second phase of testing using the previously trained models. Since the models can be trained in a controlled order with an increasing ν-value, it is a trivial task to find between which of these ν-values the binary result changes from
a classification as a normal data point to a classification as an outlying data point. The correct p-value lies somewhere between the ν-values used to train those two OCSVM models. It should now be clear that the more OCSVM models are trained, the more accurate the p-value estimation will be. The number of models used is referred to in this work as the resolution (R). When doing the search, exploiting that the OCSVM models are trained and accessed in order of increasing ν, it is efficient to do a binary search by repeatedly halving the search space and testing the middle OCSVM model. This means that each test sample only needs to be evaluated by a very small proportion of the OCSVM models, log₂(R) to be exact. When using the online version of the approach, this is also the number of OCSVM models that actually need to be trained, which makes the approach very effective in terms of time performance. Doing this binary search in relation to the resolution gives a convenient method of determining a suitable resolution from the number of search levels used. This number is referred to as the depth (d), which is related to the resolution (R) according to:

R = 2^d − 1 (3.1)

Then R will have a size that is optimal for searching, and d will equal the number of tests done for each tested data point. 3.6 Selecting the kernel and the kernel parameters The Gaussian kernel was selected because it is the most versatile in relation to the input data, but even a simpler kernel is expected to work for low-dimensional data sets of simple distributions. A polynomial kernel was tested and appeared to produce comparable results for even degrees of the kernel, but not for odd degrees. The free parameter of the Gaussian kernel is the width, known as the γ value, which corresponds to 1/(2σ²), so a larger γ produces a narrower kernel.
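The batch version described above can be sketched as follows. This is an editor's illustration with scikit-learn's OneClassSVM standing in for LibSVM; the ladder-training and search functions are assumptions, not the thesis implementation, and the binary search relies on the (approximate) monotonicity of the OCSVM decisions in ν.

```python
# Sketch of the batch approach: train a ladder of one-class SVMs at
# increasing nu-values, then binary-search for the nu at which a test
# point flips from inlier to outlier.  Editor's sketch, not the
# author's LibSVM-based code.
import numpy as np
from sklearn.svm import OneClassSVM

def train_ladder(X, depth=6, gamma=0.1):
    """Train R = 2**depth - 1 models with nu = k / (R + 1), k = 1..R."""
    R = 2 ** depth - 1
    models = []
    for k in range(1, R + 1):
        m = OneClassSVM(kernel="rbf", gamma=gamma, nu=k / (R + 1))
        m.fit(X)
        models.append(m)
    return models

def estimate_p(models, x):
    """Binary search for the nu-level where the prediction flips.

    Models are ordered by increasing nu; a point classed as an outlier
    at one level is (approximately) an outlier at all higher levels, so
    the flip point is found in about log2(R) evaluations.
    """
    lo, hi = 0, len(models)
    while lo < hi:
        mid = (lo + hi) // 2
        if models[mid].predict(x.reshape(1, -1))[0] == -1:  # outlier
            hi = mid
        else:
            lo = mid + 1
    # the p-value lies between nu[lo - 1] and nu[lo]; report the
    # upper end of the bracketing interval
    return (lo + 1) / (len(models) + 1)
```

In the online version the same search is run, but each queried model is trained on demand, so only about log₂(R) models are ever fitted.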
A kernel that is too narrow will result in overfitting to the data, and a kernel that is too wide will cause the approach to overgeneralize. The γ value was set to 0.1 and kept at that value throughout all of the tests; a value of 0.1 showed good overall estimation performance on the normal and symmetric distributions. In Figure 3.2 the effects of different kernel widths are visualized on a single data set of 100 samples generated according to a normal distribution. Figure 3.2a shows a very narrow kernel and the effects of overfitting. Figures 3.2b and 3.2c look very similar, but the slightly wider kernel in the latter explains the blunter peak. Even wider kernels (Figure 3.2d) cause artifacts. Figure 3.2b corresponds to the chosen kernel width of γ = 0.1. For the real data the Hellinger distance is used, which does not require a parameter. 3.7 Test plan The first question to answer is how well the different approaches estimate the p-values for different kinds of distributions, and whether any of the approaches is more suitable for particular distributions. This is investigated by generating sets of data according to different distributions, which are then used as training sets for the approaches. A set of evenly distributed test values is used to sample the resulting p-value curve, which is compared to the true p-values calculated from the parameters of the distribution that was used to
Figure 3.2: Examples of results of different widths of the Gaussian kernel from a training set of 100 samples generated from a normal distribution. Panels: (a) γ = 1, (b) γ = 0.1, (c) γ = 0.01, and (d) γ = 0.001.

generate the data. This procedure is repeated several times to get a statistically valid result. The question is extended by also exploring how the approaches behave when the training data is based on a mixture of distributions, and how the approaches handle noise of different forms in the training data. The second question to answer is how the approaches change when the number of samples in the training set increases. In order to compare the results from the approaches to the calculated true values, a measure of error is needed; the root of the mean of the squared error (RMSE) is deemed suitable. The result is presented as a plot relating an increasing number of samples in the training set to the RMSE value. The error is calculated over test data points generated evenly over the test range. This test is conducted over a selected range of sample sizes due to the computational requirements. A curve is fitted to these RMSE values and sample-size levels in order to get an idea of the general curve that these results follow. The curve is based on formula (3.2). Using the square root of the sample size is appropriate since it relates the standard error of the mean to the sample size, i.e. decreasing the error by a factor of N requires N² observations.

RMSE(N) = α + β/√N,  α ≥ 0, β ≥ 0 (3.2)

This means that the parameter α will represent an approximate limit and the parameter β will
represent an approximate rate of convergence. As such, α attempts to represent the lowest error achievable using the approach, the model bias, while β represents how fast the approach reaches this lowest error, and thus how much the approach benefits from additional training samples. Since it is impossible to have a negative error or a negative rate of convergence, α and β are constrained to be non-negative when fitting the curve. 3.7.1 Synthetic data The purpose of the test data is to gauge the different abilities and performances of the approaches. To test the approaches, sets of randomized data points are created, distributed according to a handful of selected distributions and configurations, making up a battery of tests that covers some of the interesting features and differences of the approaches. There are a total of 16 tests, each a unique combination of distributions and parameters. Each test is done at set training-set sizes of 100, 200, 500, 1000, and 2000 samples, and each is performed with 30 sets of generated samples in order to reach a decent generalization of the results. All combinations of these result in 2400 (16 × 5 × 30) different training sets to be tested on the four approaches, for a total of 9600 combinations. The 16 distributions acting as sources for the training data are: 1. The first test is based on a univariate normal distribution with mean 0 and standard deviation 1. The approaches that assume a unimodal symmetric distribution, R2S and CAD Mahalanobis, are expected to excel here. This test acts as a baseline for the rest, and in that sense the parameters of the other approaches are selected to perform decently on this test, to create a fair comparison for all the approaches. Shown in Figure 3.3. 2.
In the second test, data from a t-distribution will be generated, also univariate, with 3 degrees of freedom. This distribution is similar to the normal distribution, but with thicker tails. Shown in Figure 3.4.

Figure 3.3: Probability density function of a Gaussian normal distribution with µ = 0 and σ = 1.
Figure 3.4: Probability density function of a t-distribution with ν = 3.

3. The third test is based on a beta distribution. A beta distribution can be either symmetric or non-symmetric depending on the parameters chosen. Both these cases are
tested. This is also a univariate distribution. The symmetric case differs from the normal distribution by being wider and only being defined on the range from 0 to 1. The parameters for the symmetric beta are set at α = β = 2. Shown in Figure 3.5. (The α and β used here to define the shape of the beta distribution are not the same as those used in the curve-fitting procedure.)

4. The fourth test is also generated from a univariate beta distribution, but in this case a non-symmetric one. The parameters used are α = 2 and β = 5. A non-symmetric distribution will challenge the approaches' assumptions about symmetry, and those approaches that do not make such assumptions are expected to have a slight advantage. Shown in Figure 3.6.

Figure 3.5: Probability density function of a beta distribution with α = β = 2.
Figure 3.6: Probability density function of a beta distribution with α = 2 and β = 5.

5. The fifth test is based on the log-normal distribution, with µ = 0 and σ = 1. This distribution is even more asymmetric than the beta distribution and is expected to give a clearer view of how well the approaches adapt to asymmetry. Shown in Figure 3.7.

6. The sixth test is an exponential distribution, with the parameter λ = 1. The asymmetry of this distribution is extreme in the sense that there is no left tail at all, i.e. the density is 0 for values less than 0. Shown in Figure 3.8.

Figure 3.7: Probability density function of a log-normal distribution with µ = 0 and σ = 1.
Figure 3.8: Probability density function of an exponential distribution with λ = 1.

7. The seventh test is based on a mixture of two normal distributions, one with µ = −2 and the other with µ = 2. This test will show how the
different approaches manage to estimate a distribution with two peaks, and is expected to pose a challenge to all the approaches. Shown in Figure 3.9.

8. In the eighth test, noise is introduced into the training set. In this case the noise is generated according to a uniform distribution on the interval −4 to 4. The underlying distribution is a normal distribution with µ = 0 and σ = 1; 90% of the data comes from the normal distribution and 10% from the uniform distribution. The point is to show how well the approaches estimate the normal distribution when uniform noise is mixed into the training set.

9. The ninth test also explores the effects of noise, but this time the noise is generated according to a second normal distribution with µ = 2 and σ = 1. The setup is similar to the eighth test in that 90% of the data comes from the expected distribution and 10% from the noise distribution.

10. The tenth test is the first bivariate, i.e. two-dimensional, test. The data is generated according to a normal distribution with µ = (0, 0) and σ = 1 in each dimension. Shown in Figure 3.10.

Figure 3.9: Multimodal probability density function of normal distributions with µ1 = −2, µ2 = 2 and σ = 1.
Figure 3.10: Bivariate probability density function of a normal distribution with µ = (0, 0) and Σ = I.

11. The eleventh test is generated according to a bivariate normal distribution, but in this case with twice the standard deviation along the horizontal axis, i.e. it is elliptic:

Σ = ( 4 0
      0 1 ) (3.3)

Shown in Figure 3.11.

12. The twelfth test is a mixture of two bivariate normal distributions. Both have σ = 1 but different means: one has µ1 = (−2, −2) and the other µ2 = (2, 2).
As with the univariate combination of two normals, this will also serve to show how well the approaches estimate two peaks, and those approaches that assume a single mean are expected to fail. Shown in Figure 3.12.

13. In the thirteenth test, data from a normal distribution with µ = (0, 0) and σ = 1 is mixed with 10% noise. The noise is generated according to a uniform distribution that covers the test area.
Figure 3.11: Bivariate probability density function of a normal distribution with µ = (0, 0), Σ00 = 4, Σ11 = 1 and Σ01 = Σ10 = 0.
Figure 3.12: Bivariate probability density function of normal distributions with µ1 = (−2, −2), µ2 = (2, 2) and Σ = I.

14. The fourteenth test also contains noise, but just as in the univariate case, the noise is generated according to a second normal distribution, with σ = 1 and µ = (2, 2). 10% of the data points in the training set are from this distribution; the rest are from a normal distribution with µ = (0, 0) and σ = 1, which acts as the assumed real data for which the p-value is to be estimated.

15. The fifteenth test is based on a bivariate log-normal distribution, with µ = 0 and σ = 1. As a bivariate, non-symmetric distribution this is expected to be a significant challenge for most of the approaches. Shown in Figure 3.13.

16. The sixteenth test is a combination of a log-normal distribution and a normal distribution, such that the data points are distributed normally along the horizontal axis and log-normally along the vertical axis. Shown in Figure 3.14.

Figure 3.13: Bivariate probability density function of a log-normal distribution with µ = (0, 0) and Σ = I.
Figure 3.14: Bivariate probability density function of a normal/log-normal distribution with µ = (0, 0) and Σ = I.

3.7.2 Real data

The approaches are also tested on data from a fleet of buses. Data were collected between August 2011 and December 2013, during normal operation, and each bus drove approximately 100000 km per year. More than one hundred signals were
Distribution | Mean | Variance | Skewness | Kurtosis
Normal (µ = 0, σ = 1) | 0 | 1 | 0 | 3
t (ν = 3) | 0 | 3 | – | –
Beta (α = β = 2) | 0.5 | 0.05 | 0 | 2.1429
Beta (α = 2, β = 5) | 0.2857 | 0.0255 | 0.5963 | 2.88
Log-normal (µ = 0, σ = 1) | 1.6487 | 4.6708 | 6.1849 | 113.936
Exponential (λ = 1) | 1 | 1 | 2 | 9
Table 3.1: Summary of distribution parameters.

sampled on-board at one hertz. GPS receivers were also used to give the vehicles' positions. The workshop visits were recorded by service personnel and confirmed with GPS data. The data from the on-board loggers were compressed into histograms. The Hellinger distance between the histograms was used for comparison when applying the approaches.

3.8 Calculating true p-values

3.8.1 The normal Gaussian distribution

Calculating the p-value of a data point supposedly generated according to a univariate normal distribution, given that the mean and the standard deviation are known, is a trivial task. It can be calculated from the integral of the probability density function (3.4) from the test point towards infinity. The integral of the probability density function is the cumulative distribution function; its total integral, from negative infinity to positive infinity, equals 1, and it is symmetric around the mean, according to equation (3.5).

fX(x) = (1/(σ√(2π))) e^(−(1/2)((x − µ)/σ)²) (3.4)

P(X ≥ x) = ∫_x^∞ fX(y) dy = 1/2 − ∫_0^x fX(y) dy = 1/2 − (1/2) erf((x − µ)/(σ√2)) (3.5)

Given this, the problem can be simplified to a finite integral from zero to the test point. This only covers the right-tail event, but since the distribution is symmetric, the equivalent two-tail event is obtained by doubling the one-tail value. The exact function for this is presented in equation (3.6) and visualized in Figure 3.15.

p = 1 − erf(|x − µ|/(σ√2)) (3.6)

The p-value for a bivariate normal distribution is calculated by first calculating the Mahalanobis distance from the mean of each point and then using the χ²_k cumulative distribution
function with the degrees of freedom corresponding to the dimensionality of the data, k = 2 in this case. The resulting equation is (3.7), and the result is shown in Figure 3.16.

p = e^(−(x − µ) Σ⁻¹ (x − µ)ᵀ / 2) (3.7)

Figure 3.15: p-value of a univariate normal distribution.
Figure 3.16: p-value of a bivariate normal distribution.

3.8.2 The Student's t-distribution

Calculating the p-value for a t-distribution is done in a similar manner to the normal distribution, even though the probability density function and its integral, the cumulative distribution function, are different. The probability density function is presented in (3.8). In order to clarify the presentation, the part of the equation that is constant for a fixed degree of freedom (τ) is replaced by C; also presented is the value of C given a τ-value of 3, which is used throughout this work (3.9). Through the same procedure as for the normal distribution, the distance between the mean, which is fixed at 0 for the t-distribution, and the test point forms the basis for the integration towards infinity, as presented in (3.10):

fX(x) = (Γ((τ+1)/2) / (√(τπ) Γ(τ/2))) (1 + x²/τ)^(−(τ+1)/2) (3.8)

C = Γ((τ+1)/2) / (√(τπ) Γ(τ/2)),  C|τ=3 ≈ 0.3676 (3.9)

P(X ≥ x) = ∫_x^∞ fX(y) dy = 1/2 − ∫_0^x fX(y) dy = 1/2 − C ∫_0^x (1 + y²/τ)^(−(τ+1)/2) dy (3.10)

The t-distribution, being symmetric, allows the same conversion from a one-tail event to a two-tail event by doubling. The final function is presented in equation (3.11) and visualized in Figure 3.17.

p|τ=3 = 1 − 2C ∫_0^|x| (1 + y²/3)^(−2) dy (3.11)
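The closed forms above are easy to evaluate numerically. The sketch below is an editor's illustration: it computes (3.6) with the standard error function, and evaluates the two-tail t-distribution p-value of (3.11) via SciPy's survival function instead of the explicit integral.

```python
# Numerical evaluation of the true p-values: equation (3.6) for the
# normal case and the two-tail t-distribution p-value of (3.11).
# Editor's sketch, not the thesis code.
import math
from scipy import stats

def p_normal(x, mu=0.0, sigma=1.0):
    # equation (3.6): two-tail p-value of a univariate normal
    return 1.0 - math.erf(abs(x - mu) / (sigma * math.sqrt(2.0)))

def p_t(x, tau=3):
    # equivalent to (3.11): double the right-tail probability
    return 2.0 * stats.t.sf(abs(x), df=tau)
```

Both functions return 1 at the mean and decrease towards 0 in the tails, matching Figures 3.15 and 3.17.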
Figure 3.17: p-value for the t-distribution with 3 degrees of freedom.

3.8.3 The symmetric beta distribution

The beta distribution has two parameters; when these are equal, the probability density function (3.12) is symmetric. For clarity, the constant part of the equation, based on the values of α and β, is replaced by C (3.13). Since the fixed values α = β = 2 are used throughout this work whenever the symmetric beta distribution is referenced, these are inserted in equation (3.14), where they provide a clearer view.

fX(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1) (3.12)

C = Γ(α + β) / (Γ(α)Γ(β)),  C|α=β=2 = 6 (3.13)

fX(x)|α=β=2 = 6x(1 − x) (3.14)

The same approach as for the normal distribution is applied here, and the result is a simple polynomial, but the results differ between the left-tail event (3.15) and the right-tail event (3.16). The final function for the p-value of the symmetric beta distribution is presented in equation (3.17) and visualized in Figure 3.18.

P(X ≤ x)|α=β=2 = ∫_0^x fX(z) dz = −2x³ + 3x² (3.15)

P(X ≥ x)|α=β=2 = ∫_x^1 fX(z) dz = 2x³ − 3x² + 1 (3.16)

p|α=β=2 = 4x³ − 6x² + 2 if x ≥ 1/2, −4x³ + 6x² otherwise (3.17)

3.8.4 The non-symmetric beta distribution

The non-symmetric beta distribution, as selected here with α = 2 and β = 5, poses a larger challenge since it is not symmetric around any point. The corresponding equations, derived
from Section 3.8.3, are presented in (3.18) and (3.19). The equations for the left-tail event (3.20) and the right-tail event (3.21) are constructed analogously to the symmetric beta distribution in Section 3.8.3.

Figure 3.18: p-value for the symmetric beta distribution with α = β = 2.
Figure 3.19: p-value function of the non-symmetric beta distribution with α = 2 and β = 5.

C|α=2,β=5 = 30 (3.18)

fX(x)|α=2,β=5 = 30x(1 − x)⁴ (3.19)

P(X ≤ x)|α=2,β=5 = ∫_0^x fX(z) dz = 5x⁶ − 24x⁵ + 45x⁴ − 40x³ + 15x² (3.20)

P(X ≥ x)|α=2,β=5 = ∫_x^1 fX(z) dz = −5x⁶ + 24x⁵ − 45x⁴ + 40x³ − 15x² + 1 (3.21)

Since the probability density function is not symmetric, a simple doubling procedure cannot be used to get the actual p-value. From hypothesis-testing theory it is known that two tail events can be combined by doubling the smaller of the two tail probabilities, and the final function for the p-value is constructed according to that theory, presented in (3.22) and visualized in Figure 3.19. A disclaimer is necessary here: the p-value for non-symmetric distributions is not well defined [Kulinskaya, 2008], but this calculation serves its purpose in this work, and this is taken into consideration when the results are evaluated.

p|α=2,β=5 = 2 · min(P(X ≤ x), P(X ≥ x)) (3.22)

3.8.5 The log-normal distribution

Calculating the p-value of points generated according to a log-normal distribution is done easily by transforming the coordinates with the log function and then calculating the p-values in the same way as for the normal distribution (3.6). The result is shown in Figure 3.20. The same approach is used for the bivariate log-normal distributions, but
using equation (3.7). The results are shown in Figure 3.21 for the log-normal case, and in Figure 3.22 for the normal/log-normal case.

Figure 3.20: p-value of a univariate log-normal distribution.
Figure 3.21: p-value of a bivariate log-normal distribution.
Figure 3.22: p-value of a normal/log-normal distribution.

3.8.6 Exponential distribution

The exponential distribution has only one tail, the right tail, with its peak at zero. There is zero probability of getting a value below zero, i.e. to the left of the peak. It therefore follows that the p-value is calculated as a sole right-tail event, equal to equation (3.24) and visualized with λ = 1 in Figure 3.23.

fX(x) = λe^(−λx) (3.23)

p = P(X ≥ x) = ∫_x^∞ fX(z) dz = e^(−λx) (3.24)

3.8.7 Bimodal normal distribution

When calculating the p-value of a mixture of two normal distributions, the question asked is how likely it is that a value was generated according to either of the two distributions; by that logic the p-value function of the mixture should have two peaks, one corresponding to the peak of each of the two normal distributions. The p-value
functions for the two distributions can be combined in a logical-or sense by repeatedly testing at small intervals of p-values, counting the results, and estimating a curve from that result. This logic-based approach to combining the p-values produces a result equal to simply selecting the larger of the two values at any given point. Such a combination is recognizable from fuzzy-logic theory, where the max function is used as a logical-or gate. The resulting equation is (3.25), and the result, with µ1 = −2 and µ2 = 2, is visualized in Figure 3.24. The same approach is used to combine the p-values of two bivariate normal distributions.

p = max(p_{µ=−2}, p_{µ=2}) (3.25)

Figure 3.23: p-value of an exponential distribution.
Figure 3.24: p-value of a mixture of two normal distributions.
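The tail rules above, the doubled-minimum of (3.22), the right-tail event of (3.24), and the max combination of (3.25), can all be evaluated in a few lines. The sketch below is an editor's illustration using SciPy's distribution objects, not the thesis code.

```python
# True p-values for the non-symmetric beta (3.22), the exponential
# (3.24), and the bimodal normal mixture (3.25).  Editor's sketch.
import math
from scipy import stats

def p_beta(x, a=2.0, b=5.0):
    # (3.22): double the smaller of the two tail probabilities
    return 2.0 * min(stats.beta.cdf(x, a, b), stats.beta.sf(x, a, b))

def p_exponential(x, lam=1.0):
    # (3.24): sole right-tail event, p = exp(-lambda * x)
    return stats.expon.sf(x, scale=1.0 / lam)

def p_normal(x, mu=0.0, sigma=1.0):
    # (3.6): two-tail p-value of a univariate normal
    return 1.0 - math.erf(abs(x - mu) / (sigma * math.sqrt(2.0)))

def p_mixture(x):
    # (3.25): logical-or combination with mu1 = -2, mu2 = 2
    return max(p_normal(x, mu=-2.0), p_normal(x, mu=2.0))
```

Note that 2 · min of the two tails can never exceed 1, since the two tail probabilities sum to 1, so no explicit capping is needed in (3.22).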
Chapter 4 Results

4.1 Generated univariate data

4.1.1 Normal distribution

Figure 4.1 shows the estimations of the approaches when the training data is drawn from a Gaussian normal distribution with the parameters µ = 0 and σ = 1. The lines show the upper and lower confidence limits at the 95% confidence level, calculated from 30 unique, randomized sets of training data. Figure 4.1a shows the result of the R2S approach, which seems to produce an even result; the confidence limits do not deviate visibly at different distances from the mean. Figure 4.1b shows the result of the OCSVM approach, which looks slightly worse than the R2S approach, especially in the tail where the confidence limits are wider. This hints at a less trustworthy result for lower p-values. Figure 4.1c shows the result of the CAD Mahalanobis approach, which seems to produce a result identical to the R2S approach. Figure 4.1d shows the result of the CAD k-NN approach, which looks rather bad at first glance, especially for higher p-values, although the tail does not look as bad. Figure 4.2 shows the change in the RMSE value when the sample-set size is varied. The function (3.2) is fitted to the RMSE data in order to get an estimated limit (α) and rate of convergence (β). The numerical results of the fitted function, with their respective standard errors, are presented in Table 4.1; the approaches with the lowest limits are highlighted. The R2S and the CAD Mahalanobis produce the best results in the RMSE sense for this kind of distribution, with the OCSVM following closely after. The CAD k-NN approach is penalized for its rather bad results at high p-values. Figure 4.3 shows the test points plotted with the true p-value on the horizontal axis and the estimated p-value on the vertical axis. The data is from one test example trained with 100 samples. An optimal estimation would follow the diagonal line.
The plots show that all the approaches seem to systematically estimate the p-value slightly low for low p-values. The divergence of the CAD k-NN is obvious.
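The curve fit of (3.2) behind Table 4.1 can be reproduced with a bounded least-squares fit. The sketch below is an editor's illustration: the RMSE values are invented purely to demonstrate the procedure, and SciPy's curve_fit stands in for whatever fitting routine the thesis used.

```python
# Fitting RMSE(N) = alpha + beta / sqrt(N) with alpha, beta >= 0,
# as in equation (3.2).  The RMSE values are invented solely to
# illustrate the fitting procedure.
import numpy as np
from scipy.optimize import curve_fit

def rmse_model(n, alpha, beta):
    return alpha + beta / np.sqrt(n)

sizes = np.array([100.0, 200.0, 500.0, 1000.0, 2000.0])
rmse = np.array([0.080, 0.062, 0.046, 0.038, 0.033])  # illustrative

# non-negativity constraints on both parameters, as described in 3.7
(alpha, beta), _ = curve_fit(rmse_model, sizes, rmse,
                             bounds=([0.0, 0.0], [np.inf, np.inf]))
```

The fitted alpha then estimates the error floor (the model bias) and beta the rate of convergence, i.e. how much the approach benefits from additional training samples.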