Statistics Netherlands
                 Division Research and Development
                 Department of Statistical Methods




                             ROBUST MULTIVARIATE OUTLIER DETECTION


                                      Peter de Boer and Vincent Feltkamp




                    Summary: Two robust multivariate outlier detection methods, based on the
                    Mahalanobis distance, are reported: the projection method and the Kosinski
                    method. The ability of those methods to detect outliers is exhaustively tested.
                    A comparison is made between the two methods as well as a comparison with
                    other robust outlier detection methods that are reported in the literature.




                 The opinions in this paper are those of the authors and do not necessarily reflect
                 those of Statistics Netherlands.

Project number:  RSM-80820
BPA number:      324-00-RSM/INTERN
Date:            19 July 2000

1. Introduction

The statistical process can be separated into three steps. The input phase involves the
collection of data by means of surveys and registers. The throughput phase involves
preparing the raw data for tabulation purposes, weighting and variance estimation.
The output phase involves the publication of population totals, means, correlations,
etc., which come out of the throughput phase.
Data editing is one of the first steps in the throughput process. It is the procedure for
detecting and adjusting individual errors in data. Editing also comprises the
detection and treatment of correct but influential records, i.e. records that have a
substantial contribution to the aggregates to be published.
The search for suspicious records, i.e. records that are possibly wrong or influential,
can be done in basically two ways. The first way is by examining each record and
looking for strange or wrong fields or combinations of fields. In this view a record
includes all fields referring to a particular unit, be it a person, household or business
unit, even if those fields are stored in separate files, like files containing survey data
and files containing auxiliary data.
The second way is by comparing each record with the other records. Even if the
fields of a particular record obey all the edit rules one has laid down, the record
could be an outlier. An outlier is a record that does not follow the bulk of the
records.
The data can be seen as a rectangular file, each row denoting a particular record and
each column a particular variable. The first way of searching for suspicious data can
be seen as searching in rows, the second way as searching in columns. It is remarked
that some, and possibly many, errors can be detected in both ways.
Records could be outliers while their outlyingness is not apparent from examining the
variables, or columns, one by one. For instance, a company that has a relatively
large turnover but has paid relatively little tax might not be an outlier in either of
these variables separately, but could be an outlier considering the combination. Outliers
involving more than one variable are multivariate outliers.
In order to quantify how far a record lies from the bulk of the data, one needs a
measure of distance. In the case of categorical data no useful distance measure
exists, but in the case of continuous data the so-called Mahalanobis distance is often
employed.
A distance measure should be robust against the presence of outliers. It is known
that the classical Mahalanobis distance is not. This means that the very outliers that
are to be detected seriously hamper their own detection. Hence, a robust
version of the Mahalanobis distance is needed.
In this report two robust multivariate outlier detection algorithms for continuous
data, based on the Mahalanobis distance, are presented. In the next section the
classical Mahalanobis distance is introduced and ways to robustify this distance
measure are discussed. In sections 3 and 4 the two algorithms, the Kosinski method
and the projection method, are presented in turn. In section 5 a
comparison between the two algorithms is made as well as a comparison with other
algorithms reported in the outlier literature. A practical example, and problems
involved with it, is the subject of section 6. In section 7 some concluding remarks
are made.



2. The Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point and the
center of all points, with respect to the scale of the data and, in the multivariate
case, with respect to the shape of the data as well. It is remarked that in regression
analysis another distance measure is more convenient: not the distance
between a point and the center of the data, but the distance between the point and the
regression plane (see also section 5).

Suppose we have a continuous data set $y_1, y_2, \ldots, y_n$. The vectors $y_i$ are p-dimensional, i.e. $y_i = (y_{i1}\; y_{i2}\; \ldots\; y_{ip})^t$, where $y_{iq}$ denotes a real number. The classical squared Mahalanobis distance is defined by

$$MD_i^2 = (y_i - \bar y)^t C^{-1} (y_i - \bar y)$$

where $\bar y$ and C denote the mean and the covariance matrix respectively:

$$\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad C = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar y)(y_i - \bar y)^t$$

In the case of one-dimensional data the covariance matrix reduces to the variance
and the Mahalanobis distance to $MD_i = |y_i - \bar y| / \sigma$, where $\sigma$ denotes the standard
deviation.
Another point of view results by noting that the Mahalanobis distance is the solution
of a maximization problem. The maximization problem is defined as follows. The
data points $y_i$ can be projected on a projection vector a. The outlyingness of the
point $y_i$ is the squared projected distance $(a^t(y_i - \bar y))^2$, with respect to the
projected variance $a^t C a$. Assuming that the covariance matrix C is positive
definite, there exists a non-singular matrix A such that $A^t C A = I$. Using the
Cauchy-Schwarz inequality we have





$$\begin{aligned}
\frac{(a^t(y_i - \bar y))^2}{a^t C a}
&= \frac{\left(a^t (A^t)^{-1} A^t (y_i - \bar y)\right)^2}{a^t C a} \\
&\le \frac{(A^{-1}a)^t(A^{-1}a)\;(y_i - \bar y)^t A A^t (y_i - \bar y)}{a^t C a} \\
&= \frac{a^t (A A^t)^{-1} a\;(y_i - \bar y)^t A A^t (y_i - \bar y)}{a^t C a} \\
&= (y_i - \bar y)^t C^{-1} (y_i - \bar y) \\
&= MD_i^2
\end{aligned}$$

with equality if and only if $A^{-1}a = c\,A^t(y_i - \bar y)$ for some constant c. Hence



$$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar y))^2}{a^t C a}$$

i.e., the Mahalanobis distance is equal to the supremum of the outlyingness of $y_i$
over all possible projection vectors.
If the data set $y_i$ is multivariate normal the squared Mahalanobis distances $MD_i^2$
follow the $\chi^2$ distribution with p degrees of freedom.
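To make the definition concrete, the following short sketch (Python with NumPy/SciPy; the report's own prototypes were written in Borland Pascal and Visual Basic, so this is only an illustration with hypothetical names) computes the classical squared Mahalanobis distances and flags the points that exceed the $\chi^2_{p,0.99}$ cutoff.

import numpy as np
from scipy.stats import chi2

def classical_mahalanobis_sq(Y):
    """Squared classical Mahalanobis distance of every row of the n x p array Y."""
    ybar = Y.mean(axis=0)                              # mean vector
    C = np.atleast_2d(np.cov(Y, rowvar=False))         # covariance matrix (divisor n-1)
    diff = Y - ybar
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)

# flag points whose squared distance exceeds the 99% chi-square quantile
Y = np.random.default_rng(0).normal(size=(100, 3))
md2 = classical_mahalanobis_sq(Y)
outliers = np.where(md2 > chi2.ppf(0.99, df=Y.shape[1]))[0]

For the standard normal data generated here roughly 1% of the points are expected to be flagged, which is exactly the swamping effect quantified in section 3.3.2.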

The classical Mahalanobis distance suffers however from the masking and
swamping effect. Outliers seriously affect the mean and the covariance matrix in
such a way that the Mahalanobis distance of outliers could be small (masking),
while the Mahalanobis distance of points which are not outliers could be large
(swamping).
Therefore, robust estimates of the center and the covariance matrix should be found
in order to calculate a useful Mahalanobis distance. In the univariate case the most
robust choice is the median (med) and the median of absolute deviations (mad)
replacing the mean and the standard deviation respectively. The med and mad have a
robustness of 50%. The robustness of a quantity is defined as the maximum
percentage of data points that can be moved arbitrarily far away while the change in
that quantity remains bounded.
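A small numerical sketch (Python, hypothetical data) of what this 50% robustness means in practice: moving 40% of the observations arbitrarily far away ruins the mean and the standard deviation, while the med and the mad remain of the same order as for the clean data.

import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(size=100)            # 100 well-behaved observations
bad = clean.copy()
bad[:40] = 1.0e6                        # move 40% of the points arbitrarily far away

for name, x in (("clean", clean), ("contaminated", bad)):
    med = np.median(x)
    mad = np.median(np.abs(x - med))    # median of absolute deviations
    print(f"{name:>12}: mean={x.mean():9.3g} std={x.std():9.3g} "
          f"med={med:6.3g} mad={mad:6.3g}")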
It is not trivial to generalize the robust one-dimensional Mahalanobis distance to the
multivariate case. Several robust estimators for the location and scale of multivariate
data have been developed. We have tested two methods, the projection method and
the Kosinski method. Other methods for robust outlier detection will be discussed in
section 5, where we will compare the different methods on their ability to detect
outliers.
In the next two sections the Kosinski method and the projection method will be
discussed in detail.





3. The Kosinski method

3.1 The principle of Kosinski
The method discussed in this section was quite recently published by Kosinski. The
idea of Kosinski is basically the following:
1) start with a few, say g, points, denoted the “good” part of the data set;
2) calculate the mean and the covariance matrix of those points;
3) calculate the Mahalanobis distances of the complete data set;
4) increase the good part of the data set by one point by selecting the g+1 points
   with the smallest Mahalanobis distances and define g=g+1;
5) return to step 2 or stop as soon as the good part contains more than half the data
   set and the smallest Mahalanobis distance of the remaining points is higher than
   a predefined cutoff value. At the end the remaining part, or the “bad” part,
   should contain the outliers.
In order to assure that the good part will contain no outliers at the end, it is important
to start the algorithm with points that are all good. In the paper by Kosinski this
problem is solved by repetitively choosing a small set of random points, and
performing the algorithm for each set. The number of sets of points to start with is
taken high enough to be sure that at least one set contains no outliers.
We made two major adjustments to the Kosinski algorithm. The first one is the
choice of the starting data set. The required property of the starting data set is that
it contains no outliers. It does not matter how these points are found. We choose the
starting data set by robustly estimating the center of the data set and selecting the
p+1 closest points. In the case of a p-dimensional data set, p+1 points are needed to
get a useful starting data set, since the covariance matrix of a set of at most p points
is always non-invertible. A separation of the data set in p+1 good points and n-p-1
bad points is called an elemental partition.
The center is estimated by calculating the mean of the data set, neglecting all
univariate robustly detected outliers. This is of course just a crude estimate, but it is
satisfactory for the purpose of selecting a good starting data set. Another crude
estimate of the center that was tried out was the coordinate-wise median. The
coordinate-wise median appeared to result in less satisfactory starting data sets.
The p+1 points closest to the mean are chosen, where closest is defined by an
ordinary distance measure. In order to take the different scales and units of the
different dimensions into account, the data set is scaled coordinate-wise before the
mean is calculated, i.e. each component of each point is divided by the median of
absolute deviations of the dimension concerned. It is remarked that, after the first
p+1 points are selected, the algorithm continues with the original unscaled data.
It is, of course, possible to construct a data set for which this algorithm fails to select
p+1 points that are all good points. However, in all the data sets used in this
report, artificial and real, this choice of a starting data set worked very well.





This adjustment results in a spectacular gain in computer time, since the algorithm
has to be run only once instead of many times. Kosinski estimates the required
number of random starting data sets in his own original algorithm to be
approximately 35 in the case of 2-dimensional data sets, and up to 10000 in 10
dimensions.
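Kosinski's numbers are presumably derived from the standard elemental-set argument (our reconstruction, not stated in the report): a random set of p+1 points is outlier-free with probability about $(1-\varepsilon)^{p+1}$ when a fraction $\varepsilon$ of the data are outliers, so the number m of random starting sets needed to obtain at least one clean set with probability P satisfies

$$1 - \left(1 - (1-\varepsilon)^{p+1}\right)^m \ge P
\quad\Longrightarrow\quad
m \ge \frac{\ln(1-P)}{\ln\!\left(1 - (1-\varepsilon)^{p+1}\right)} .$$

With $\varepsilon$ close to 0.5 and P = 0.99 this gives $m \approx 35$ for p = 2 and roughly $10^4$ for p = 10, in line with the figures quoted above; the single deterministic start used here avoids this cost entirely.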
The other adjustment is in the expansion of the good part. In the Kosinski paper the
increment is always one point. We implemented an increment proportional to the
good part already found, for instance 10%. This means that the good part is
increased by 10% at each step. This speeds up the algorithm as well,
especially in large data sets. The original algorithm with one-point increment scales
with $n^2$, where n is the number of data points, while the algorithm with proportional
increment scales with $n \ln n$. This adjustment, too, was tested and performed
very well.
In the remainder of this report, “the Kosinski method” denotes the adjusted Kosinski
method, unless otherwise noted.
3.2 The Kosinski algorithm
The purpose of the algorithm is, given a set of n multivariate data points
$y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be
summarized as follows.
Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$,
where $y_i = (y_{i1}\; \ldots\; y_{ip})^t$.



Step 1. Choose an elemental partition
A good part of p+1 points is found as follows.

•    Calculate the med and mad for each dimension q:

$$M_q = \operatorname*{med}_k\, y_{kq}, \qquad S_q = \operatorname*{med}_l\, \left| y_{lq} - M_q \right|$$


•    Divide each component q of each data point i by the mad of the dimension
     concerned. The scaled data points are denoted by the superscript s:

$$y^s_{iq} = \frac{y_{iq}}{S_q}$$

•    Declare a point to be a univariate outlier if at least one component of the data
     point is farther than 2.5 standard deviations away from the scaled median. The
     standard deviation is approximated by 1.484 times the mad (see section 4.1 for
     the background of the factor 1.484). So calculate for each component q of each
     point i:






$$u_{iq} = \frac{1}{1.484}\left| y^s_{iq} - \frac{M_q}{S_q} \right|$$

     If $u_{iq} > 2.5$ for any q, then point i is a univariate outlier.

•    Calculate the mean of the data set, neglecting the univariate outliers:

$$\bar y^s = \frac{1}{n_0} \sum_{\substack{i=1 \\ y_i\ \text{not an outlier}}}^{n} y^s_i$$

     where $n_0$ denotes the number of points that are not univariate outliers.

•    Select the p+1 points that are closest to the mean. Define those points to be the
     good part of the data set. So calculate:

$$d_i = \left\| y^s_i - \bar y^s \right\|$$

     The g = p+1 points with the smallest $d_i$ form the good part, denoted by G.
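As an illustration, Step 1 could be coded as follows (a minimal Python sketch, not the Borland Pascal prototype; the function name and the use of NumPy are our own).

import numpy as np

def elemental_partition(Y):
    """Step 1 (sketch): return the indices of the p+1 starting points."""
    n, p = Y.shape
    M = np.median(Y, axis=0)                      # med per dimension
    S = np.median(np.abs(Y - M), axis=0)          # mad per dimension
    Ys = Y / S                                    # coordinate-wise scaled data
    u = np.abs(Ys - M / S) / 1.484                # univariate outlyingness u_iq
    keep = (u <= 2.5).all(axis=1)                 # points that are no univariate outlier
    ybar_s = Ys[keep].mean(axis=0)                # mean of the non-outlying points
    d = np.linalg.norm(Ys - ybar_s, axis=1)       # distance of every (scaled) point
    return np.argsort(d)[:p + 1]                  # the p+1 closest points form G

The mad S is assumed to be nonzero in every dimension, which holds for genuinely continuous data.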
Step 2. Iteratively increase the good part
The good part is increased until a certain stop criterion is fulfilled.

•    Continue with the original data set $y_i$, not with the scaled data set $y^s_i$.

•    Calculate the mean and the covariance matrix of the good part:

$$\bar y = \frac{1}{g}\sum_{i\in G} y_i, \qquad C = \frac{1}{g-1}\sum_{i\in G}(y_i - \bar y)(y_i - \bar y)^t$$

•    Calculate the Mahalanobis distance of all the data points:

$$MD_i^2 = (y_i - \bar y)^t C^{-1}(y_i - \bar y)$$



•    Calculate the number of points with a Mahalanobis distance smaller than a
     predefined cutoff value. A useful cutoff value is $\chi^2_{p,1-\alpha}$, with α = 1%.



•    Increase the good part with a predefined percentage (a useful percentage is 20%)
     by selecting the points with the smallest Mahalanobis distances, but not more
     than up to
     a) half the data set if the good part is smaller than half the data set
     (g < h = [½(n+p+1)]);
     b) the number of points with a Mahalanobis distance smaller than the cutoff if
     the good part is larger than half the data set.

•    Stop the algorithm if the good part was already larger than half the data set and
     no more points were added in the last iteration.
Step 3. Out: outlyingnesses
The outlyingness of each point is now simply the Mahalanobis distance of the point,
calculated with the mean and the covariance matrix of the good part of the data set.
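Steps 2 and 3 can then be sketched as a forward search on top of the helper above (again a Python sketch, with the report's suggested defaults of a 20% increment and α = 1%; this is our reading of the adjusted algorithm, not the original program).

import numpy as np
from scipy.stats import chi2

def kosinski_outlyingness(Y, increment=0.20, alpha=0.01):
    """Return the squared Mahalanobis distance of every point, computed from
    the mean and covariance matrix of the final good part."""
    n, p = Y.shape
    cutoff = chi2.ppf(1 - alpha, df=p)
    h = (n + p + 1) // 2                               # "half the data set"
    good = elemental_partition(Y)                      # p+1 starting points (Step 1)
    while True:
        g = len(good)
        ybar = Y[good].mean(axis=0)
        C = np.atleast_2d(np.cov(Y[good], rowvar=False))
        diff = Y - ybar
        md2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)
        grown = int(np.ceil(g * (1 + increment)))      # proportional increment
        limit = h if g < h else int((md2 < cutoff).sum())
        target = min(grown, limit)
        if g >= h and target <= g:                     # nothing left to add: stop
            return md2
        good = np.argsort(md2)[:target]                # points with smallest distances

Point i is then declared an outlier when the returned value exceeds the same $\chi^2_{p,1-\alpha}$ cutoff.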



3.3 Test results
A prototype/test program was implemented in a Borland Pascal 7.0 environment.
Documentation of the program is published elsewhere. We successively tested the
choice of the elemental partition by means of the mean, the number of swamped
observations in data sets containing no outliers, the number of masked and swamped
observations in data sets containing outliers, the algorithm with proportional
increment, and the time performance of the proportional increment of the good part
compared to the one-point increment. Finally, we tested the sensitivity of the
number of detected outliers to the cutoff value and the increment percentage in some
known data sets.
3.3.1 Elemental partition
First of all, the choice of the elemental partition was tested with the generated data
set published by Kosinski. The Kosinski data set is a kind of worst-case data set. It
contains a large fraction of outliers (40% of the data) and the outliers are distributed
with a variance much smaller than the variance of the good points.
Before using the mean, we calculated the coordinate-wise median as a robust
estimator of the center of the data, and selected the three closest points. This strategy
failed. Although the median has a 50%-robustness, the 40% outliers strongly shift
the median. Hence, one of the three selected points appeared to be an outlier. As a
consequence, the forward search algorithm indicated all points to be good points, i.e.
all the outliers were masked.
This was the reason we searched for another robust measure of the location of the
data. One of the simplest ideas is to search for univariate outliers first, and to
calculate the mean of the points that are outliers in none of the dimensions.
The selected points, the three points closest to the mean, all appeared to be good
points. Moreover, the forward search algorithm, applied with this elemental
partition, successfully distinguished the outliers from the good points.
All following tests were performed using this “mean” to select the first p+1 points.
For all tested data sets the selected p+1 points appeared to be good points, resulting
in a successful forward search. It is possible, in principle, to construct a data set for
which this selection algorithm still fails, for instance a data set with a large fraction
of outliers which are univariately invisible and with no unambiguous dividing line
between the group of outliers and the group of good points. This is, however, a very
hypothetical situation.
3.3.2 Swamping
A simulation study was performed in order to determine the average fraction of
swamped observations in normally distributed data sets. In large data sets a few
points are almost always indicated to be outliers, even if the whole data set nicely
follows a normal distribution. This is due to the cutoff value. If a cutoff value of
$\chi^2_{p,1-\alpha}$ is used as discriminator between good points and outliers in a
p-dimensional standard normal data set, a fraction of α data points will have a
Mahalanobis distance larger than the cutoff value.



For each dimension p between 1 and 8 we generated 100 standard normal data sets
of 100 points. The Kosinski algorithm was run twice on each data set, once with a
cutoff value $\chi^2_{p,0.99}$, and once with $\chi^2_{p,0.95}$. Each point that is indicated to be an

outlier is a swamped observation since there are no true outliers by construction. We
calculated the average fraction of swamped observations (i.e. the number of
swamped observations of each data set divided by 100, the number of points in the
data set, averaged over all 100 data sets). Results are shown in Table 3.1.



   α      p=1      2      3      4      5      6      7      8
  0.01   0.015  0.011  0.010  0.008  0.008  0.008  0.007  0.007
  0.05   0.239  0.112  0.081  0.070  0.059  0.052  0.045  0.042
Table 3.1. The average fraction of swamped observations of the simulations of 100
generated p-dimensional data sets of 100 points for each p between 1 and 8, with
cutoff value $\chi^2_{p,1-\alpha}$.




For α = 0.01 the fraction of swamped observations is very close to the value of α
itself. These results are very similar to the results of the original Kosinski algorithm.
For α = 0.05, however, the average fraction of swamped observations is much larger
than 0.05 for the lower dimensions, especially for p=1 and p=2. The reason for this
is the following. Consider a one-dimensional standard normal data set. If the
variance of all points is used, the outlyingness of a fraction of α points will be larger
than $\chi^2_{1,1-\alpha}$. However, in the Kosinski algorithm the variance is calculated from all
points except at least that fraction of α points with the largest outlyingnesses. This
variance is smaller than the variance of all points. Hence, the Mahalanobis distances
are overestimated and too many points are indicated to be outliers. This is a self-
magnifying effect: more outliers lead to a smaller variance, which leads to more
points being indicated as outliers, and so on.
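A rough back-of-the-envelope calculation (ours, not Kosinski's) illustrates the size of this effect for p = 1 and α = 0.05. If the 5% of points with the largest distances are excluded, the variance is essentially that of a standard normal truncated to $|z| < 1.96$,

$$\operatorname{Var} \approx 1 - \frac{2 \cdot 1.96\,\varphi(1.96)}{0.95} \approx 0.76 ,$$

so every point with $|z| > 1.96\sqrt{0.76} \approx 1.71$ (roughly 9% of the data instead of 5%) now exceeds the cutoff; iterating the argument drives the swamped fraction up further, consistent with the 0.239 observed for p = 1 in Table 3.1.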
The effect is strongest in one dimension. In higher dimensions the points with a
large Mahalanobis distance lie "all around", and therefore have less influence on the
variance in the separate directions.
Apparently, the effect is quite strong for α = 0.05, but almost negligible for α = 0.01. In
the remaining tests α = 0.01 is used, unless otherwise stated.
3.3.3 Masking and swamping
The ability of the algorithm to detect outliers was tested in another simulation. We
generated data sets in the same way as is done in the Kosinski paper in order to get a
fair comparison between the original and our adjusted Kosinski algorithm. Thus we
generated data sets of 100 points containing good points as well as outliers. Both the
good points and the outliers were generated from a multivariate normal distribution,
with $\sigma^2 = 40$ for the good points and $\sigma^2 = 1$ for the bad points. The distance between
the center of the good points and the bad points is denoted by d. The vector between
the centers is along the vector of 1’s.
We varied the dimension (p = 2, 5), the fraction of outliers (0.10 to 0.45), and the
distance (d = 20 to 60). We calculated the fraction of masked outliers (the number of
masked outliers of each data set divided by the number of outliers) and the fraction
of swamped points (the number of swamped points of each data set divided by the
number of good points), both averaged over 100 simulation runs for each set of
parameters p, d, and fraction of outliers. Results are shown in Table 3.2.



                     p = 2                                          p = 5
  fraction of    fraction of    fraction of       fraction of    fraction of    fraction of
    outliers     masked obs.   swamped obs.         outliers     masked obs.   swamped obs.
  d=20                                            d=25
    0.10            0.81          0.009             0.10            0.90          0.008
    0.20            0.89          0.014             0.20            0.91          0.021
    0.30            0.88          0.022             0.30            0.93          0.146
    0.40            0.86          0.146             0.40            0.97          0.551
    0.45            0.88          0.350             0.45            1.00          0.855
  d=30                                            d=40
    0.10            0.03          0.011             0.10            0.00          0.008
    0.20            0.00          0.011             0.20            0.04          0.008
    0.30            0.01          0.010             0.30            0.03          0.022
    0.40            0.05          0.043             0.40            0.02          0.020
    0.45            0.01          0.019             0.45            0.01          0.014
  d=40                                            d=60
    0.10            0.00          0.011             0.10            0.00          0.008
    0.20            0.00          0.011             0.20            0.00          0.007
    0.30            0.00          0.011             0.30            0.00          0.009
    0.40            0.00          0.009             0.40            0.00          0.010
    0.45            0.00          0.010             0.45            0.00          0.008
Table 3.2. Average fraction of masked and swamped observations of 2- and
5-dimensional data sets over 100 simulation runs. Each data set consisted of 100
points with a certain fraction of outliers. The good (bad) points were generated from
a multivariate normal distribution with $\sigma^2 = 40$ ($\sigma^2 = 1$) in each direction. The
distance between the center of the good points and the bad points is denoted by d.
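For readers who want to reproduce this design, the data generation can be sketched as follows (Python; the exact generator and seed used for the report are not documented, so this is an approximation in which the outlier center is placed at distance d along the vector of 1's).

import numpy as np

def contaminated_data(n=100, p=2, frac_out=0.30, d=30.0, seed=0):
    """Good points ~ N(0, 40*I), outliers ~ N(center, I), with ||center|| = d."""
    rng = np.random.default_rng(seed)
    n_out = int(round(frac_out * n))
    center = d * np.ones(p) / np.sqrt(p)              # along the vector of 1's
    good = rng.normal(0.0, np.sqrt(40.0), size=(n - n_out, p))
    bad = center + rng.normal(0.0, 1.0, size=(n_out, p))
    Y = np.vstack([good, bad])
    is_outlier = np.r_[np.zeros(n - n_out, bool), np.ones(n_out, bool)]
    return Y, is_outlier

The masked fraction is then the share of true outliers whose outlyingness stays below the cutoff, and the swamped fraction the share of good points above it.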


The following conclusions can be drawn from these results. The algorithm is said to
be performing well if the fraction of masked outliers is close to zero and the fraction
of swamped observations is close to α = 0.01. The first conclusion is: the larger the
distance between the good points and the bad points the better the algorithm
performs. This conclusion is not surprising and is in agreement with Kosinski’s
results. Secondly, the higher the dimension, the worse the performance of the
algorithm. In five dimensions the algorithm starts to perform well at d=40 and is close
to perfect at d=60, while in two dimensions the performance is good at d=30 and
perfect at d=40. The original algorithm did not show such a dependence
on the dimension. It is remarked, however, that the paper by Kosinski does not give
enough details for a good comparison on this point. Third, for both two and five
dimensions the adjusted algorithm performs worse than the original algorithm. The
original algorithm is almost perfect at d=25 for both p=2 and p=5, while the adjusted
algorithm is not perfect until d=40 or d=60. This is the price that is paid for the large
gain in computer time. The fourth conclusion is: the performance of the algorithm
hardly depends on the fraction of outliers, in agreement with Kosinski's
results. In some cases, the algorithm even seems to perform better for higher
fractions. This is, however, due to the relatively small number of points (100) per data
set. For very large data sets and very large numbers of simulation runs this artifact
will disappear.



  p     d     fr      inc     masked   swamped
  2    20    0.10      1p      0.79     0.010
  2    20    0.10     10%      0.80     0.009
  2    20    0.10    100%      0.80     0.009
  2    20    0.40      1p      0.86     0.225
  2    20    0.40     10%      0.86     0.146
  2    20    0.40    100%      0.89     0.093
  2    30    0.10      1p      0.00     0.011
  2    30    0.10     10%      0.03     0.011
  2    30    0.10    100%      0.02     0.011
  2    30    0.40      1p      0.05     0.042
  2    30    0.40     10%      0.05     0.043
  2    30    0.40    100%      0.08     0.038
  2    40    0.10      1p      0.00     0.011
  2    40    0.10     10%      0.00     0.011
  2    40    0.10    100%      0.00     0.011
  2    40    0.40      1p      0.00     0.010
  2    40    0.40     10%      0.00     0.009
  2    40    0.40    100%      0.02     0.009
  5    40    0.10      1p      0.00     0.008
  5    40    0.10     10%      0.00     0.008
  5    40    0.10    100%      0.01     0.008
  5    40    0.40      1p      0.01     0.016
  5    40    0.40     10%      0.01     0.016
  5    40    0.40    100%      0.06     0.035
Table 3.3. Average fraction of masked and swamped observations for p-dimensional
data sets with a fraction of fr outliers on a distance d from the good points (for more
details about the data sets see Table 3.2), calculated with runs with either one-point
increment (1p) or proportional increment (10% or 100% of the good part).


3.3.4 Proportional increment
Until now all tests have been performed using the one-point increment, i.e. at each
step of the algorithm the size of the good part is increased by just one point. In
section 3.1 it was already mentioned that a gain in computer time is possible by
increasing the size of the good part by more than one point per step. The
simulations on the masked and swamped observations were repeated with the
proportional increment algorithm. The increment was
tested for percentages up to 100% (which means that the size of the good part is
doubled at each step).
The results of Table 3.1, showing the average fraction of swamped observations in
outlier-free data sets, did not change. Small changes showed up for large
percentages in the presence of outliers. A summary of the results is shown in Table
3.3. In order to avoid an unnecessary profusion of data we only show the results for
p=2 in some relevant cases and, as an illustration, in a few cases for p=5.
A general conclusion from the table is that for a wide range of percentages the
proportional increment algorithm works satisfactorily. For a percentage of 100%
outliers are masked slightly more frequently than for lower percentages. The
differences between 10% increment and one-point increment are negligible.
3.3.5 Time dependence
To illustrate the possible gain with the proportional increment we measured the time
per run for p-dimensional data sets of n points, with p ranging from 1 to 8 and n
from 50 to 400. The simulations were performed with outlier-free generated data
sets so that the complete data sets had to be included in the good part. This was done
in order to obtain useful information about the dependence of the simulation times
on the number of points. Table 3.4 shows the results for the simulation runs with
one-point increment. The results for the runs with a proportional increment of 10%
are shown in Table 3.5.



   n      p=1      2      3      4      5      6      7      8
   50     0.09   0.18   0.29   0.45   0.64   0.84   1.08   1.35
  100     0.36   0.68   1.05   1.75   2.5    3.3    4.3    5.5
  200     1.46   2.8    4.6    7.0    10
  400     6.2    12
Table 3.4. Time (in seconds) per run on p-dimensional data sets of n points, using
the one-point increment.


    n       p=1        2        3       4       5         6         7         8
    50      0.05      0.10     0.16    0.23    0.31     0.39      0.52      0.62
   100      0.14      0.24     0.39    0.56    0.76     1.00      1.25      1.55
   200      0.33      0.60     0.92    1.35    1.90
   400      0.80      1.40
Table 3.5. Time (in seconds) per run on p-dimensional data sets of n points, using
the proportional increment (perc=10%).


Let us denote the time per run as a function of n for fixed p by $t_p$, and the time per
run as a function of p for fixed n by $t_n$. For the one-point increment simulations $t_p$ is
approximately proportional to $n^2$. This is as expected since there are O(n) steps with
an increment of one point and at each step the Mahalanobis distance has to be
calculated for each point (O(n)) and sorted (O(n ln n)). For the simulations with
proportional increment $t_p$ is approximately O(n ln n), due to the fact that only
O(ln n) steps are needed instead of O(n). As a consequence there is a substantial
gain in the time per run, ranging from a factor of 2 for 50 points up to a factor of 8
for 400 points.
The time per run for fixed n, $t_n$, is approximately proportional to $p^{1.5}$, for both one-
point and proportional increment runs. The exponent 1.5 is just an empirical average
over the range p = 1..8 and is the result of several O(p) and O(p²) steps. Since the
exponent is much smaller than 2 it is more efficient to search for outliers in one
p-dimensional run than in ½p(p−1) 2-dimensional runs, one for each pair of
dimensions, even if one is not interested in outliers in more than 2 dimensions.
Consider for instance p=8, n=50. One run takes 0.62 seconds. However, a total of 1.4
seconds would be needed for the 28 runs, one in each pair of dimensions, each run
taking 0.05 seconds.
3.3.6 Sensitivity to parameters
The Kosinski algorithm was tested on the twelve data sets described in section 5. A
full description of the outliers and a comparison of the results with the results of the
projection algorithm as well as with other methods described in the literature is
given in that section. In the present section we restrict the discussion to the
sensitivity of the number of outliers to the cutoff and the increment percentage.

The algorithm was run with a cutoff $\chi^2_{p,1-\alpha}$ for α = 1% as well as α = 5%.

Furthermore, both one-point increment and proportional increment (in the range 0-
40%) were used. The number of detected outliers of the twelve data sets is shown in
Table 3.6.
It is clear that the number of outliers for a specific data set is not the same for each
set of parameters. It is remarked that, in all cases, if different sets of parameters lead
to the same number of outliers, the outliers are exactly the same points. Moreover, if
one set of parameters leads to more outliers than another set, all outliers detected by
the latter are also detected by the former (these are empirical results).
Let us first discuss the differences between the detection with α = 1% and with α = 5%.
It is obvious that in many cases α = 5% results in slightly more outliers than α = 1%.
However, in two cases the differences are substantial, i.e. in the Stackloss data and
in the Factory data.
In the Stackloss data five outliers are found for α = 5% using moderate increments,
while α = 1% shows no outliers at all. The reason for this difference is the relatively
small number of points relative to the dimension of the data set. It has been argued
by Rousseeuw that the ratio n/p should be larger than 5 in order to be able to detect
outliers reliably. If n/p is smaller than 5 one comes to a point where it is not useful
to speak about outliers since there is no real bulk of data.
With n=21 and p=4 the Stackloss data lie on the edge of meaningful outlier
detection. Moreover, if the five points which are indicated as outliers with α = 5% are
left out, only 16 good points remain, resulting in a ratio n/p=4. In such a case any
outlier detection algorithm will presumably fail to find outliers consistently.








  Data set                    p     n      inc      α=5%   α=1%
 1. Kosinski                  2    100      1p        42     40
                                          ≤ 40%       42     40
 2. Brain mass                2     28      1p         5      3
                                          ≤ 10%        5      3
                                         15-20%        4      3
                                         30-40%        3      3
 3. Hertzsprung-Russel        2     47      1p         7      6
                                          ≤ 30%        7      6
                                            40%        6      6
 4. Hadi                      3     25      1p         3      3
                                           ≤ 5%        3      3
                                            10%        3      0
                                         15-25%        3      3
                                            30%        3      0
                                            40%        3      3
 5. Stackloss                 4     21      1p         5      0
                                          ≤ 17%        5      0
                                         18-24%        4      0
                                         25-30%        1      0
                                            40%        0      0
 6. Salinity                  4     28      1p         4      2
                                          ≤ 30%        4      2
                                            40%        2      2
 7. HBK                       4     75      1p        15     14
                                          ≤ 30%       15     14
                                            40%       14     14
 8. Factory                   5     50      1p        20      0
                                          ≤ 40%       20      0
 9. Bush fire                 5     38      1p        16     13
                                          ≤ 40%       16     13
 10. Wood gravity             6     20      1p         6      5
                                          ≤ 20%        6      5
                                            30%        6      6
                                            40%        6      5
 11. Coleman                  6     20      1p         7      7
                                          ≤ 40%        7      7
 12. Milk                     8     85      1p        20     17
                                          ≤ 30%       20     17
                                            40%       18     15
Table 3.6. Number of outliers detected by the Kosinski algorithm with a cutoff of
$\chi^2_{p,1-\alpha}$, for α = 5% and α = 1% respectively, with either one-point (1p) or proportional
increment in the range 0-40%.





The Factory data is an interesting case. For α = 5% twenty outliers are detected,
which is 40% of all points, while detection with α = 1% shows no outliers.
Explorative data analysis shows that about half the data set is quite narrowly
concentrated in a certain region, while the other half is distributed over a much
larger space. There is however no clear distinction between these two parts. The
more widely distributed part is rather a very thick tail of the other part. In such a
case the effect that the algorithm with α = 5% tends to detect too many outliers, which
was explained in the discussion of Table 3.1, is very strong. It is questionable whether
the indicated points should be considered outliers.
Let us now discuss the sensitivity of the number of detected outliers to the
increment. At low percentages the number of outliers is always the same as for the
one-point increment; in fact, at very low percentages the proportional increment
procedure leads to an increment of just one point per step, making the two
algorithms equal. For most data sets the number of outliers is constant for a wide
range of percentages and starts to differ slightly only at 30-40% or higher. Three of
the twelve data sets behave differently: the Brain mass data, the Hadi data, and the
Stackloss data.
The Brain mass data shows 5 outliers at low percentages for α = 5%. At percentages
around 15% the number of outliers is only 4 and at 30% only 3. So the number of
outliers changes earlier (at 15%) than in most other data sets (at 30% or higher). For α = 1% the
number of outliers is constant over the whole range. In fact, the three outliers which
are found at 30-40% for α = 5% are exactly the same as the three outliers found for
α = 1%. The two outliers which are missed at higher percentages for α = 5% both lie
just above the cutoff value. Therefore it is disputable whether they are real outliers at
all.
The Hadi data shows strange behavior. At all percentages for α = 5% and at most
percentages for α = 1% three outliers are found. However, near 10% and near 30% no
outliers are detected. Again, the three outliers are disputable. All have a
Mahalanobis distance just above the cutoff (see Table 5.2). Hence it is not strange
that sometimes these three points are included in the good part (the three points lie
close together; hence, the inclusion of one of them in the good part leads to low
Mahalanobis distances for the other two as well). On the other hand, it is also not a
big problem, since it is rather a matter of taste than a matter of science whether to
call these three points outliers or good points.
The Stackloss data shows a decreasing number of outliers for α = 5% at relatively low
percentages, as in the Brain mass data. Here, the sensitivity to the percentage is
related to the low ratio n/p, as discussed previously.
In conclusion, for increments up to 30% the same outliers are found as with the one-
point increment. In the cases where this is not true, the supposed outliers always have an
outlyingness slightly above or below the cutoff, so that missing such outliers has no
big consequences. Furthermore, relatively low cutoff values could lead to
disproportionate swamping.




4. The projection method

4.1 The principle of projection
The projection method is based on the idea that outliers in univariate data are easily
recognized, visually as well as by computational means. In one dimension the
Mahalanobis distance is simply $|y_i - \bar y| / \sigma$. A robust version of the univariate
outlyingness is found by replacing the mean by the med and replacing the standard
deviation by the mad. Denoting the robust outlyingness by $u_i$, this leads to

$$u_i = \frac{|y_i - M|}{S}$$

where M and S denote the med and the mad respectively:

$$M = \operatorname*{med}_k\, y_k, \qquad S = \operatorname*{med}_l\, |y_l - M|$$


In the case of multivariate data the idea is to “look” at the data set from all possible
directions and to “see” whether a particular data point lies far away from the bulk of
the data points. Looking in this context means projecting the data set on a projection
vector a; seeing means calculating the outlyingness as is done in univariate data. The
ultimate outlyingness of a point is just the maximum of the outlyingnesses over all
projection directions.
The outlyingness defined in this way corresponds to the multivariate Mahalanobis
distance as is shown in section 2. Recalling the expression for the Mahalanobis
distance:

$$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar y))^2}{a^t C a}$$

Robustifying the Mahalanobis distance leads to

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{S}$$

Now M and S are defined as follows:

$$M = \operatorname*{med}_k\, a^t y_k, \qquad S = \operatorname*{med}_l\, |a^t y_l - M|$$

It is remarked that $MD_i^2$ corresponds to $u_i^2$.

How is the maximum calculated? The outlyingness

$$\frac{|a^t y_i - M|}{S}$$


as a function of a could possess several local maxima, making gradient search
methods infeasible. Therefore the outlyingness is calculated on a grid of a finite
number of projection vectors. The grid should be fine enough to calculate
the maximum outlyingness with sufficient accuracy.
This robust measure of outlyingness was first developed by Stahel and Donoho.
More recent work on this subject has been reported by Maronna and Yohai. These
authors used the outlyingness in order to calculate a weighted mean and covariance
matrix. Outliers were given small weights so that the Stahel-Donoho estimator of the
mean was robust against the presence of outliers. It is of course possible to use the
weighted mean and covariance matrix to calculate a weighted Mahalanobis distance.
This is not done in the projection method discussed here.

The robust outlyingness $u_i$ was slightly adjusted for the following reason. The mad
of univariate standard normal data, which have a standard deviation of 1 by definition,
is 0.674 = 1/1.484. In order to assure that, in the limiting case of an infinitely large
multivariate normal data set, the outlyingness $u_i^2$ is equal to the squared
Mahalanobis distance, the mad in the denominator is multiplied by 1.484:

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{1.484\, S}$$
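The factor is the usual consistency correction for the mad (a standard result, added here for completeness): for normally distributed data with standard deviation $\sigma$ the mad converges to $\Phi^{-1}(3/4)\,\sigma \approx 0.674\,\sigma$, so that

$$1.484\,S \approx \frac{S}{\Phi^{-1}(3/4)} \longrightarrow \sigma ,$$

which makes the denominator a consistent estimate of the standard deviation of the projected data and $u_i^2$ directly comparable to $MD_i^2$.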

4.2 The projection algorithm
The purpose of the algorithm is, given a set of n multivariate data points
$y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be
summarized as follows.
Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$,
with $y_i = (y_{i1}\; \ldots\; y_{ip})^t$.



Step 1. Define a grid

There are $\binom{p}{q}$ subsets of q dimensions in the total set of p dimensions. The
“maximum search dimension” q is predefined. Projection vectors a in a certain
subset are parameterized by the angles $\theta_1, \theta_2, \ldots, \theta_{q-1}$:

$$a = \begin{pmatrix}
\cos\theta_1 \\
\cos\theta_2 \sin\theta_1 \\
\cos\theta_3 \sin\theta_2 \sin\theta_1 \\
\vdots \\
\cos\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \\
\sin\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1
\end{pmatrix}$$


A certain predefined step size step (in degrees) is used to define the grid.

The first angle $\theta_1$ can take the values $i \cdot step_1$, with $step_1$ the largest angle smaller
than or equal to $step$ for which $180/step_1$ is an integer value, and with $i = 1, 2, \ldots, 180/step_1$.

The second angle can take the values $j \cdot step_2$, with $step_2$ the largest angle smaller
than or equal to $step_1 / \cos\theta_1$ for which $180/step_2$ is an integer value, and with
$j = 1, 2, \ldots, 180/step_2$.

The r-th angle can take the values $k \cdot step_r$, with $step_r$ the largest angle smaller than
or equal to $step_{r-1} / \cos\theta_{r-1}$ for which $180/step_r$ is an integer value, and with
$k = 1, 2, \ldots, 180/step_r$.

Such a grid is defined in each subset of q dimensions.
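A sketch of this grid construction for a single q-dimensional subset is given below (Python; the handling of angles whose cosine is zero or negative is not specified in the description above, so the use of the absolute value and the 180-degree cap are our assumptions).

import math

def _round_step(limit):
    """Largest step <= limit (in degrees) such that 180/step is an integer."""
    return 180.0 / math.ceil(180.0 / limit)

def angle_grid(q, step):
    """All grid tuples (theta_1, .., theta_{q-1}) for one subset of q dimensions."""
    def extend(prefix, prev_step):
        if len(prefix) == q - 1:                          # all q-1 angles chosen
            return [prefix]
        if prefix:                                        # step depends on the previous angle
            c = abs(math.cos(math.radians(prefix[-1])))   # assumption: use |cos|
            limit = min(prev_step / c, 180.0) if c > 1e-12 else 180.0
        else:
            limit = step
        s = _round_step(limit)
        tuples = []
        for i in range(1, int(round(180.0 / s)) + 1):
            tuples += extend(prefix + (i * s,), s)
        return tuples
    return extend((), step)

def direction(angles):
    """Unit projection vector a corresponding to the spherical angles (degrees)."""
    a, sin_prod = [], 1.0
    for th in angles:
        a.append(math.cos(math.radians(th)) * sin_prod)
        sin_prod *= math.sin(math.radians(th))
    a.append(sin_prod)                                    # last component: product of sines
    return a

For the full method this grid is built in each of the $\binom{p}{q}$ coordinate subsets, with the resulting q-dimensional direction embedded in p dimensions by putting zeros in the remaining coordinates.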
Step 2. Outlyingness for each grid point

For each grid point a, calculate the outlyingness of each data point $y_i$:

•    Calculate the projections $a^t y_i$.

•    Calculate the median $M_a = \operatorname*{med}_k\, a^t y_k$.

•    Calculate the mad $L_a = \operatorname*{med}_l\, |a^t y_l - M_a|$.

•    Calculate the outlyingness $u_i(a) = \dfrac{|a^t y_i - M_a|}{1.484\, L_a}$.

Step 3. Out: outlyingness

The outlyingness $u_i$ is the maximum over the grid:

$$u_i = \sup_a u_i(a)$$
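Combining the pieces, the whole procedure for the simplest case q = p can be sketched as follows (Python; angle_grid and direction are the helper functions sketched in Step 1, and the guard against a zero mad is our own addition).

import numpy as np

def projection_outlyingness(Y, step=10.0):
    """Robust outlyingness of every row of Y, maximized over the grid (q = p)."""
    n, p = Y.shape
    u = np.zeros(n)
    for angles in angle_grid(p, step):
        a = np.array(direction(angles))
        proj = Y @ a                                 # projections a^t y_i
        M = np.median(proj)                          # med of the projected points
        L = np.median(np.abs(proj - M))              # mad of the projected points
        if L > 0.0:                                  # skip degenerate directions
            u = np.maximum(u, np.abs(proj - M) / (1.484 * L))
    return u

As in the Kosinski method, point i is flagged as an outlier when $u_i^2$ exceeds $\chi^2_{p,1-\alpha}$.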


4.3 Test results
A prototype/test program was implemented in an Excel/Visual Basic environment.
Documentation of the program is published elsewhere. We successively tested the
number of swamped observations in data sets containing no outliers, the number of
masked observations in data sets containing outliers, the time dependence of the
algorithm on the parameters step and q, and the sensitivity of the number of detected
outliers to these parameters in some known data sets.





4.3.1 Swamping
A simulation study was performed in order to determine the average fraction of
swamped observations in normally distributed data sets. See section 3.3.2 for more
detailed remarks about the swamping effect and about generating the data sets. The
results of the simulations are shown in Table 4.1.



   α     step    p=1      2      3      4      5
  1%      10    0.010  0.011  0.016  0.018  0.023
  5%      10    0.049  0.052  0.067  0.071  0.088
  1%      30    0.010  0.010  0.012  0.011  0.012
  5%      30    0.049  0.049  0.051  0.049  0.058
Table 4.1. The average fraction of swamped observations of the simulations on
several generated p-dimensional data sets of 100 points, with cutoff value $\chi^2_{p,1-\alpha}$
and step size step. The parameter q is equal to p.



        p=2, q=2                    p=5, q=2                    p=5, q=5
 fraction of  fraction of    fraction of  fraction of    fraction of  fraction of
   outliers   masked obs.      outliers   masked obs.      outliers   masked obs.
 d=20                        d=30                        d=30
   0.12          0.83          0.12          1.00          0.12          0.22
   0.23          1.00          0.23          1.00          0.23          0.54
   0.34          1.00          0.34          1.00          0.34          1.00
   0.45          1.00          0.45          1.00          0.45          1.00
 d=40                        d=50                        d=50
   0.12          0.00          0.12          0.00          0.12          0.00
   0.23          0.00          0.23          0.67          0.23          0.00
   0.34          0.62          0.34          1.00          0.34          0.65
   0.45          1.00          0.45          1.00          0.45          1.00
 d=50                        d=80                        d=60
   0.12          0.00          0.12          0.00          0.12          0.00
   0.23          0.00          0.23          0.00          0.23          0.00
   0.34          0.00          0.34          0.00          0.34          0.00
   0.45          1.00          0.45          1.00          0.45          1.00
 d=90                        d=140                       d=120
   0.12          0.00          0.12          0.00          0.12          0.00
   0.23          0.00          0.23          0.00          0.23          0.00
   0.34          0.00          0.34          0.00          0.34          0.00
   0.45          0.00          0.45          0.00          0.45          0.00
Table 4.2. Average fraction of masked outliers of 2- and 5-dimensional generated
data sets (see also section 3.3.3).


For low dimensions the average fraction of swamped observations tends to be almost
equal to α. The fraction increases, however, with increasing dimension. This is due to
the decreasing ratio n/p. It is remarkable that for step size 30 the fraction of
swamped observations seems to be much better than for step size 10. This is just a
coincidence. The fact that more observations are declared to be outliers is
compensated by the fact that the outlyingnesses are usually smaller when large step
sizes are used. In fact, the differences between step sizes 10 and 30 are so large for
the higher dimensions that they indicate that a step size of 30 could be too coarse to
result in reliable outlyingnesses.


4.3.2 Masking and swamping
The ability of the projection algorithm to detect outliers was tested by generating
data sets that contain good points as well as outliers. See section 3.3.3 for details on
how the data sets were generated.
Results are shown in Table 4.2. In all cases, the ability to detect the outliers is
strongly dependent on the fraction of contamination. If there are many outliers, they
can only be detected if they lie very far away from the cloud of good points. This is
due to the fact that, although the med and the mad have a robustness of 50%, a large
concentrated fraction of outliers strongly shifts the med towards the cloud of outliers
and enlarges the mad.
In higher dimensions it is more difficult to detect the outliers, as in the Kosinski
method. The ability to detect the outliers also depends on the maximum search
dimension q. If q is taken equal to p, fewer outliers are masked.
4.3.3 Time dependence
The time dependence of the projection algorithm on the step size step and the
maximum search dimension q is shown in Table 4.3.



   n    p    q    step       t             n    p    q    step       t
  400   2    2     36      13.0          100    2    2     9        8.0
  400   2    2     18      21.0          100    3    2     9       19.3
  400   2    2      9      32.7          100    4    2     9       33.5
  400   2    2      4.5    56.8          100    5    2     9       50.1
                                         100    6    2     9       71.4
  400   3    3     36      28.1          100    7    2     9       98.9
  400   3    3     18      68.6          100    8    2     9      128.0
  400   3    3      9     209.1
  400   3    3      4.5   719.3          100    5    1     9        5.9
                                         100    5    2     9       50.1
   50   5    2      9      26.3          100    5    3     9      479.8
  100   5    2      9      50.1          100    5    4     9     2489.1
  200   5    2      9     107.7          100    5    5     9     4692.1
  400   5    2      9     202.9
Table 4.3. Time t (in seconds) per run on p-dimensional data sets of n points using
maximum search dimension q and step size step (in degrees).


                                                                         p  180 q −1
Asymptotically the time per run should be proportional to (n ln n) (
                                                                                 ) ,
                                                                         q  step
                            p
since for each of the   subsets a grid is defined with a number of grid points of
                      q
                       

                                         19
Robust multivariate outlier detection

                 180 q −1
the order of (        ) , and at each grid point the median of the projected points has
                 step
to be calculated (n ln n). The results in the table roughly confirm this theoretical
estimation. The most important conclusion from the table is that the time per run
strongly increases with the search dimension q. This makes the algorithm only
useful for relatively low dimensions.
4.3.4 Sensitivity to parameters
The projection method was tested on the twelve data sets that are fully described
in section 5, as was done for the Kosinski method (see section 3.3.6). The results
are shown in Table 4.4.
Let us first discuss the differences between α = 5% and α = 1%. In almost all cases the
number of outliers detected with α = 5% is larger than with α = 1%. This is
completely due to stronger swamping. It is remarked that there is no algorithmic
dependence on the cutoff value, as there is in the Kosinski method. In the projection
method a set of outlyingnesses is calculated and only after the calculation is a certain
cutoff value used to discriminate between good and bad points. Hence, a smaller
cutoff value leads to more outliers, but all points still have the same outlyingness. In
the Kosinski method the cutoff value is already used during the algorithm: the cutoff
is used to decide whether more points should be added to the good part. A
smaller cutoff leads not only to more outliers but also to a different set of
outlyingnesses, since the mean and the covariance matrix are calculated with a
different set of points. As a consequence, where the Kosinski method can show a
rather strong sensitivity to the cutoff value, this sensitivity is absent in the
projection method.
Now let us discuss the dependence of the number of outliers on the maximum search
dimension q. In the Hertzsprung-Russel data set and in the HBK data set the number
of outliers found with q=1 is already as large as found with higher values of q. In the
Brain mass data set and in the Milk data set, the number of outliers for q=1 is,
however, much smaller than for large values of q. In those cases, many outliers are
truly multivariate.
In the Hadi data set, the Factory data set and the Bush fire data set there is also a
rather large discrepancy between q=2 and q=3. It is remarked that the Hadi data set
was constructed so that all outliers are invisible when only two dimensions are inspected
(see section 5.2.4). Also in the other two data sets it is clear that many outliers can
only be found by inspecting three or more dimensions at the same time.
If q is higher than three, only slightly more outliers are found than for q=3.
Differences can be explained by the fact that searching in higher dimensions with
the projection method leads to more outliers (see section 4.3.1).





    Data set                     p       n         q   step   α=5%  α=1%
1. Kosinski                      2      100        2    10      78    34
                                                   2    20      77    34
                                                   2    30      42    31
2. Brain mass                    2      28         2    5      9      6
                                                   2   10      9      4
                                                   2   30      8      4
                                                   1   n/a     3      1
3. Hertzsprung-Russel            2      47         2    1      7      6
                                                   2   30      6      5
                                                   2   90      6      5
                                                   1   n/a     6      5
4. Hadi                          3      25         3    5      11     5
                                                   3   10       8     0
                                                   2   10       0     0
5. Stackloss                     4      21         4    5      14     9
                                                   4   10      10     9
                                                   4   15       8     6
                                                   4   20       9     7
                                                   4   30       6     6
6. Salinity                      4      28         4   10      12     8
                                                   4   20       9     7
                                                   3   30       6     4
7. HBK                           4      75         4   10      15    14
                                                   4   20      14    14
                                                   1   n/a     14    14
8. Factory                       5      50         5   10      24    18
                                                   5   20      14     9
                                                   4   10      24    17
                                                   3   10      22    14
                                                   2   10       9     9
9. Bush fire                     5      38         5   10      24    19
                                                   5   20      19    17
                                                   4   10      22    19
                                                   3   10      21    17
                                                   2   10      13    12
10. Wood gravity                 6      20         5   20      14    14
                                                   5   30      12    11
                                                   3   10      15    14
11. Coleman                      6      20         5   20      10     8
                                                   5   30       4     4
12. Milk                         8      85        5    20      18      14
                                      5      30      15      13
                                      4      20      16      14
                                      4      30      15      13
                                      3      20      15      13
                                      3      30      15      12
                                      2      20      13      11
                                      2      30      12       7
                                      1      n/a      6       5
Table 4.4. Number of outliers detected by the projection algorithm with a cutoff of $\sqrt{\chi^2_{p,1-\alpha}}$, for α=1% respectively α=5%, with maximum search dimension q and angular step size step (in degrees).



The sensitivity to the step size is not large in most cases. In cases like the Hadi data,
the Stackloss data, the Salinity data and the Coleman data, the sensitivity can be
explained by the sparsity of the data sets. A step size of 10-20 degrees seems to work well
in most cases.
In conclusion, the number of outliers is not very sensitive to the parameters q and
step, although the sensitivity is not completely negligible. In most practical cases
q=3 and step=10 work well enough.



5. Comparison of methods

In this section the projection method and the Kosinski method are compared with
each other as well as with other robust outlier detection methods. In section 5.1 we
briefly describe some other methods reported in the literature. The comparison is
made by applying the projection method and the Kosinski method to data sets that
have been analyzed by at least one of the other methods. Those data sets and the results of
the said methods are described in section 5.2. In section 5.3 the results are discussed.
Unfortunately, most papers on outlier detection methods say very little about
the efficiency of the methods, i.e. how fast the algorithms are and how the running time depends on
the number of points and the dimension of the data set. Therefore we restrict the
discussion to the ability to detect outliers.

5.1 Other methods
It is important to note that two different types of outliers are distinguished in the outlier
literature. The first type, which is used in this report, is a point that lies far
away from the bulk of the data. The second type is a point that lies far away from the
regression plane formed by the bulk of the data. The two types will be denoted as
bulk outliers and regression outliers, respectively.
Of course, outliers are often outliers according to both points of view. That is why we
also compare the results of the projection method and the Kosinski method, which are
both bulk outlier methods, with regression outlier methods. A point that is
declared an outlier according to both definitions is called a bad leverage point. A
point that lies far away from the bulk of the points but close to the regression plane is
called a good leverage point.
Rousseeuw (1987, 1990) developed the minimum volume ellipsoid (MVE) estimator
in order to robustly detect bulk outliers. The principle is to search for the ellipsoid
of minimal volume that covers at least half the data points. The mean
and the covariance matrix of the points inside that ellipsoid are inserted in the
expression for the Mahalanobis distance. This method is costly due to the
complexity of the search for the minimum volume ellipsoid.
A related technique is based on the minimum covariance determinant (MCD)
estimator, which is employed by Rocke. The aim is to search for the set of points,
containing at least half the data, for which the determinant of the covariance matrix
is minimal. Again, the mean and the covariance matrix determined by that set of
points are inserted in the Mahalanobis distance expression. This method is also
rather complex, although it has been substantially optimized by Rocke.
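
For readers who wish to experiment with the MCD principle, the sketch below uses the off-the-shelf FAST-MCD implementation in scikit-learn rather than Rocke's optimized algorithm; the data set and cutoff are arbitrary choices for illustration only:

# Illustration of the MCD principle (off-the-shelf FAST-MCD from scikit-learn,
# not Rocke's algorithm): the robust location and scatter are plugged into the
# Mahalanobis distance and points are flagged against a chi-square cutoff.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(95, 4)),
               rng.normal(6, 0.5, size=(5, 4))])   # 5 planted outliers

mcd = MinCovDet(random_state=0).fit(X)
d = np.sqrt(mcd.mahalanobis(X))                    # robust Mahalanobis distances
cutoff = np.sqrt(chi2.ppf(0.99, df=X.shape[1]))
print(np.where(d > cutoff)[0])                     # indices of flagged points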
Hadi (1992) developed a bulk outlier method that is very similar to the Kosinski
method. He also starts with a set of p+1 “good” points and increases the good set
one point at a time. The difference lies in the choice of the first p+1 points. Hadi
orders the n points using another robust measure of outlyingness. The question
arises why that other outlyingness would not be appropriate for outlier detection. A
reason could be that an arbitrary robust measure of outlyingness deviates relatively
strongly from the “real” Mahalanobis distance.
Atkinson combines the MVE method of Rousseeuw and the forward search
technique also employed by Kosinski. A few sets of p+1 randomly chosen points are
used for a forward search. The set that results in the ellipsoid with minimal volume
is used for the calculation of the Mahalanobis distances.
Maronna employed a projection-like method that is slightly more complicated. The
outlyingnesses are calculated like in the projection method. Then, weights are
assigned to each point, with low weights for the outlying points, i.e. the influence of
outliers is restricted. The mean and the covariance matrix are calculated using these
weights. They form the Stahel-Donoho estimator for location and scatter. Finally,
Maronna inserts this mean and this covariance matrix in the expression for the
Mahalanobis distance.
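
The weighting step can be sketched as follows; since the weight function used by Maronna is not specified here, a simple Huber-type weight is assumed, and the function names are ours:

# Generic sketch of the Stahel-Donoho idea: downweight points with a large
# projection outlyingness, compute a weighted mean and covariance, and insert
# them into the Mahalanobis distance. The weight function is an assumption.
import numpy as np

def stahel_donoho_estimates(X, outlyingness, c=2.5):
    w = np.minimum(1.0, (c / outlyingness) ** 2)   # assumed Huber-type weights
    mu = np.average(X, axis=0, weights=w)
    Xc = X - mu
    cov = (w[:, None] * Xc).T @ Xc / w.sum()       # weighted covariance matrix
    return mu, cov

def mahalanobis(X, mu, cov):
    Xc = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(cov), Xc))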
Egan proposes resampling by the half-mean method (RHM) and the smallest half-
volume method (SHV). In the RHM method several randomly selected portions of
the data are generated. In each case the outlyingnesses are calculated. For each point
it is counted how many times it has a large outlyingness; a point is declared a true
outlier if this happens often. In the SHV method the distance between each pair of
points is calculated and put in a matrix. The column with the smallest sum of the
smallest n/2 distances is selected. The corresponding n/2 points form the smallest
half-volume. The mean and the covariance of those points are inserted in the
Mahalanobis distance expression.
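
A minimal sketch of the SHV idea as described above is given below; details such as whether a point's zero distance to itself counts among its smallest n/2 distances are assumptions:

# Sketch of the smallest half-volume (SHV) method: take the column of the
# pairwise distance matrix with the smallest sum of its n/2 smallest entries,
# use the corresponding n/2 points for the mean and covariance, and compute
# Mahalanobis distances for all points.
import numpy as np

def shv_mahalanobis(X):
    n, p = X.shape
    h = n // 2
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    sums = np.sort(D, axis=0)[:h, :].sum(axis=0)   # per column: sum of h smallest
    j = np.argmin(sums)                            # column with the smallest sum
    half = np.argsort(D[:, j])[:h]                 # the h points of that half-volume
    mu = X[half].mean(axis=0)
    cov = np.cov(X[half], rowvar=False)
    Xc = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", Xc, np.linalg.inv(cov), Xc))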
The detection of regression outliers is mainly done with the least median of squares
(LMS) method. The LMS method was developed by Rousseeuw (1984, 1987, 1990).
Instead of minimizing the sum of the squares of the residuals in the least squares
method (which should rather be called the least sum of squares method in this
context) the median of the squares is minimized. Outliers are simply the points with
large residuals as calculated with the regression coefficients determined with the
LMS method.
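
The following is a simple randomized sketch of the LMS principle, not Rousseeuw's PROGRESS implementation; the number of random subsets is an arbitrary choice:

# Randomized sketch of least median of squares (LMS) regression: fit exact
# regressions on many random (p+1)-point subsets and keep the fit with the
# smallest median squared residual; points with large residuals under that
# fit are the regression outliers.
import numpy as np

def lms_fit(X, y, n_trials=3000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # add an intercept column
    best_coef, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        med = np.median((y - A @ coef) ** 2)       # median of squared residuals
        if med < best_med:
            best_med, best_coef = med, coef
    return best_coef, best_med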
Hadi (1993) uses a forward search to detect the regression outliers. The regression
coefficients of a small good set are determined. The set is increased by subsequently
adding the points with the smallest residuals and recalculating the regression
coefficients until a certain stop criterion is fulfilled. A small good set has to be found
beforehand.




Atkinson combines forward search and LMS. A few sets of p+1 randomly chosen
points are used in a forward search. The set that results in the smallest LMS is used
for the final determination of the regression residuals.
A completely different approach is the genetic algorithm for detection of regression
outliers by Walczak. We will not describe this approach here since it lies beyond the
scope of deterministic calculation of outlyingnesses.
Fung developed an adding-back algorithm for confirmation of regression outliers.
Once points are declared to be outliers by any other robust method, the points are
added back to the data set in a stepwise way. The extent to which the estimates of
the regression coefficients are affected by the adding back of a point is used as a
diagnostic measure to decide whether that point is a real outlier. This method was
developed since robust outlier methods tend to declare too many points to be
outliers.
5.2 Known data sets
In this section the projection method and the Kosinski method are compared by
running both algorithms on the twelve data sets given in Table 5.1. Most of
these data sets are well described in the robust outlier detection literature. Hence, we
are able to compare the results of the two algorithms with known results.
The outlyingnesses as calculated by the projection method and the Kosinski method
are shown in Table 5.2, Table 5.4 and Table 5.5. In both methods the cutoff value
for α=1% is used. In the Kosinski method a proportional increment of 20% was
used. The outlyingnesses of the projection method were calculated with q=p (if p<6;
if p>5 then q=5) and the lowest step size that is shown in Table 4.4.
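
For reference, the cutoff values listed in the headers of Tables 5.2-5.5 (3.035 for p=2 up to 4.482 for p=8 at α=1%) can be reproduced as the square root of the corresponding χ² quantile, as the following check (not code from this report) shows:

# Reproduce the alpha = 1% cutoff values of Tables 5.2-5.5: they equal the
# square root of the chi-square quantile with p degrees of freedom.
from scipy.stats import chi2

for p in (2, 3, 4, 5, 6, 8):
    print(p, round(chi2.ppf(0.99, df=p) ** 0.5, 3))
# prints 3.035, 3.368, 3.644, 3.884, 4.1 and 4.482 respectively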
We will now discuss the data sets one by one.



      Data set            p   n Source
1. Kosinski               2 100 Ref. [1]
2. Brain mass             2 28 Ref. [3]
3. Hertzsprung-Russel     2 47 Ref. [3]
4. Hadi                   3 25 Ref. [4]
5. Stackloss              4 21 Ref. [3]
6. Salinity               4 28 Ref. [3]
7. HBK                    4 75 Ref. [3]
8. Factory                5 50 This work
9. Bush fire              5 38 Ref. [5]
10. Wood gravity          6 20 Ref. [6]
11. Coleman               6 20 Ref. [3]
12. Milk                  8 85 Ref. [7]
Table 5.1. The name, the dimension p, the number of points n, and the source of the
tested data sets.


5.2.1 Kosinski data
The Kosinski data form a data set that is difficult to handle from a point of view of
robust outlier detection. The two-dimensional data set contains 100 points. Points 1-40 are generated from a bivariate normal distribution with $\mu_1 = 18$, $\mu_2 = -18$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = 0$, and are considered to be outliers. Points 41-100 are good points and are a sample from the bivariate normal distribution with $\mu_1 = 0$, $\mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 40$, $\rho = 0.7$.
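
A data set with the same structure can be generated as follows (a sketch, not Kosinski's original code; the random seed is arbitrary):

# Generate a data set with the structure described above: 40 tightly
# clustered outliers around (18, -18) and 60 good points from a strongly
# correlated bivariate normal distribution.
import numpy as np

rng = np.random.default_rng(0)

# Points 1-40: outliers with mu = (18, -18), sigma1^2 = sigma2^2 = 1, rho = 0.
outliers = rng.multivariate_normal([18, -18], np.eye(2), size=40)

# Points 41-100: good points with sigma1^2 = sigma2^2 = 40 and rho = 0.7.
cov_good = 40 * np.array([[1.0, 0.7],
                          [0.7, 1.0]])
good = rng.multivariate_normal([0, 0], cov_good, size=60)

X = np.vstack([outliers, good])   # rows 0-39 are the planted outliers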



The Kosinski method correctly identifies all outliers (see Table 5.2). The projection
method identifies none of the outliers and declares many good points to be outliers.
The reason for this failure is the large contamination and the small scatter of the
outliers. Since there are so many outliers they strongly shift the median towards the
outliers. Hence, the outliers are not detected. Furthermore, since they are narrowly
distributed, they almost completely determine the median of absolute deviations in
the projection direction perpendicular to the vector pointing from the center of the
good points to the center of the outliers. Hence, many points, lying at the end points
of the ellipsoid of good points, have a large outlyingness.
It is remarked that this is not an arbitrarily chosen data set. It was generated
by Kosinski in order to demonstrate the superiority of his own method over other
methods.
5.2.2 Brain mass data
The Brain mass data contain three outliers according to the Kosinski method: points
6, 16 and 25. Those points are also indicated to be outliers by Rousseeuw (1990) and
Hadi (1992). Those authors also declare point 14 to be an outlier, but with an
outlyingness slightly above the cutoff. The projection method declares points 6, 14,
16, 17, 20 and 25 to be outliers.
5.2.3 Hertzsprung-Russel data
The two methods produce almost the same outlyingnesses for all points. Both
declare points 11, 20, 30 and 34 to be large outliers, in agreement with results by
Rousseeuw (1987) and Hadi (1993). However, the projection method and the
Kosinski method also declare points 7 and 14 to be outliers, and point 9 is an outlier
according to the Kosinski method. The outlyingness of these three points is
relatively small. Visual inspection of the data (see page 28 in Rousseeuw (1987))
shows that these points are indeed moderately outlying.
5.2.4 Hadi data
The Hadi data set is an artificial one. It contains three variables $x_1$, $x_2$ and $y$. The two predictors were originally created as uniform(0,15) and were then transformed to have a correlation of 0.5. The target variable was then created by $y = x_1 + x_2 + \varepsilon$ with $\varepsilon \sim N(0,1)$. Finally, cases 1-3 were perturbed to have predictor values around (15,15) and to satisfy $y = x_1 + x_2 + 4$.
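
A sketch of a data set in the same spirit is given below; the exact transformation used to induce the correlation of 0.5 and the exact perturbation of cases 1-3 are assumptions, since they are not specified in full detail here:

# Generate Hadi-like data (assumed construction details): two correlated
# uniform(0,15) predictors, y = x1 + x2 + N(0,1) noise, and cases 1-3
# perturbed to predictor values near (15,15) with y = x1 + x2 + 4.
import numpy as np

rng = np.random.default_rng(42)
n = 25
u = rng.uniform(0, 15, size=(n, 2))
x1 = u[:, 0]
x2 = 0.5 * u[:, 0] + np.sqrt(1 - 0.5**2) * u[:, 1]   # correlation of about 0.5
y = x1 + x2 + rng.normal(0, 1, size=n)

x1[:3] = rng.uniform(14, 16, size=3)                 # perturb cases 1-3
x2[:3] = rng.uniform(14, 16, size=3)
y[:3] = x1[:3] + x2[:3] + 4

X = np.column_stack([x1, x2, y])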

The Kosinski method finds the outliers, with a relatively small outlyingness. The
projection method finds these outliers too but also declares two good points to be
outliers.





  A:                Kosinski                     Brain mass   Hertzsprung-Russel        Hadi
  B:                 3,035                         3,035            3,035               3,368
  C:     Proj   Kos            Proj     Kos          Proj Kos         Proj Kos            Proj   Kos
     1   2,59   7,45    51     4,37     1,01     1 1,79 0,75      1 0,80 1,20       1    4,75    3,47
     2   2,80   7,96    52     1,53     0,98     2 1,05 1,13      2 1,39 1,46       2    4,75    3,47
     3   2,46   7,14    53     2,22     1,05     3 0,37 0,16      3 1,41 1,83       3    4,76    3,46
     4   2,87   8,21    54     4,69     1,32     4 0,65 0,13      4 1,39 1,46       4    2,86    1,84
     5   2,78   7,97    55     3,97     1,50     5 1,99 0,92      5 1,42 1,90       5    0,96    0,70
     6   2,59   7,48    56     3,47     1,44     6 8,40 6,19      6 0,80 1,04       6    3,43    1,57
     7   2,84   8,09    57     4,59     2,55     7 2,08 1,27      7 5,55 6,35       7    2,21    0,91
     8   2,75   7,89    58     2,27     0,37     8 0,66 0,55      8 1,44 1,38       8    0,46    0,36
     9   2,51   7,22    59     2,96     0,51     9 0,94 0,91      9 2,59 3,26       9    0,99    0,35
    10   2,45   7,12    60     2,22     0,54    10 1,93 0,99     10 0,61 0,93      10    1,74    1,34
    11   2,69   7,71    61     4,94     1,83    11 1,23 0,51     11 11,01 12,67    11    2,50    1,65
    12   2,84   8,12    62     5,07     1,29    12 0,96 0,90     12 0,91 1,21      12    1,54    1,13
    13   2,77   7,95    63     4,66     1,13    13 0,64 0,60     13 0,79 0,88      13    2,81    1,25
    14   2,68   7,72    64     1,68     1,17    14 3,87 2,21     14 3,04 3,51      14    0,98    0,68
    15   2,37   6,95    65     3,32     1,03    15 2,22 1,44     15 1,55 1,22      15    2,65    1,37
    16   2,46   7,17    66     2,25     1,03    16 7,54 5,63     16 1,23 0,99      16    0,97    0,84
    17   2,64   7,59    67     2,59     1,13    17 3,18 1,83     17 2,17 1,80      17    3,31    1,64
    18   2,40   6,96    68     3,89     1,04    18 0,90 0,92     18 2,17 2,04      18    3,17    1,39
    19   2,46   7,11    69     1,82     0,88    19 3,00 1,43     19 1,77 1,54      19    2,78    1,49
    20   2,45   7,15    70     5,96     1,59    20 3,59 1,71     20 11,26 13,01    20    2,94    1,37
    21   2,70   7,71    71     2,29     0,70    21 1,54 0,66     21 1,35 1,07      21    0,90    0,66
    22   2,62   7,54    72     3,91     0,86    22 0,50 0,25     22 1,62 1,28      22    1,61    1,27
    23   2,82   8,11    73     2,15     1,30    23 0,66 0,74     23 1,60 1,41      23    3,89    1,39
    24   2,68   7,67    74     6,76     2,00    24 2,18 1,11     24 1,21 1,10      24    2,80    1,22
    25   2,37   6,88    75     6,20     2,01    25 8,97 6,75     25 0,34 0,58      25    2,04    1,12
    26   2,75   7,86    76     3,37     0,77    26 2,61 1,24     26 1,04 0,78
    27   2,67   7,70    77     2,67     0,49    27 2,59 1,41     27 0,88 1,07
    28   2,85   8,14    78     1,83     0,50    28 1,13 1,17     28 0,36 0,33
    29   2,78   7,98    79     4,19     2,45                     29 1,43 1,60
    30   2,78   8,00    80     2,71     0,46                     30 11,61 13,48
    31   2,45   7,14    81     4,49     1,12                     31 1,36 1,09
    32   2,91   8,29    82     2,74     0,79                     32 1,59 1,48
    33   2,51   7,27    83     1,62     0,31                     33 0,49 0,52
    34   2,33   6,80    84     2,81     0,47                     34 11,87 13,88
    35   2,68   7,72    85     5,94     1,57                     35 1,50 1,50
    36   2,82   8,08    86     3,50     1,01                     36 1,57 1,70
    37   2,52   7,31    87     1,38     1,93                     37 1,27 1,13
    38   2,65   7,66    88     2,21     1,57                     38 0,49 0,52
    39   2,49   7,18    89     5,47     1,73                     39 1,14 1,03
    40   2,61   7,52    90     3,07     1,44                     40 1,17 1,52
    41   1,89   0,50    91     2,94     1,54                     41 0,88 0,60
    42   1,84   0,41    92     6,02     1,59                     42 0,46 0,30
    43   7,94   2,03    93     3,65     0,80                     43 0,81 0,77
    44   3,04   0,61    94     3,89     0,98                     44 0,61 0,80
    45   2,35   0,67    95     6,68     1,64                     45 1,17 1,19
    46   6,42   1,76    96     2,50     0,84                     46 0,58 0,37
    47   5,36   1,68    97     4,59     1,32                     47 1,41 1,20
    48   3,74   0,77    98     5,65     1,46
    49   3,92   0,92    99     2,12     1,64
    50   6,53   1,78 100       2,31     0,30
Table 5.2. The outlyingness of each point of the Kosinski, the Brain mass, the Hertzsprung-
Russel and the Hadi data. A: Name of data set. B: Cutoff value for α=1%; outlyingnesses
higher than the cutoff are shown in bold. C: Method (Proj: projection method; Kos: Kosinski
method).





The projection method finds consistently larger outlyingnesses than the Kosinski
method, roughly a factor of 2 for most points. This is related to the sparsity of the data
set. Consider for instance the extreme case of three points in two dimensions. Every
point will have an infinitely large outlyingness according to the projection method.
This can be understood by noting that the mad of the projected points is zero if the
projections of two points coincide; the remaining point then has an infinite
outlyingness. For data sets with more points the situation is less extreme, but as long
as there are relatively few points the projection outlyingnesses will be relatively
large. In such a case the cutoff values based on the $\chi^2$-distribution are in fact too
low, leading to the swamping effect.
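
The three-point example can be verified numerically as follows, assuming the outlyingness is computed as |x·v − med| / mad over projection directions v:

# Three points in two dimensions: projecting onto a direction perpendicular to
# the segment joining two of them makes those two projections coincide, so the
# mad of the projected values is zero and the third point has an infinite
# outlyingness.
import numpy as np

A, B, C = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.5, 1.0])
v = np.array([0.0, 1.0])                 # direction perpendicular to A - B
proj = np.array([A @ v, B @ v, C @ v])   # projected values: [0, 0, 1]
med = np.median(proj)
mad = np.median(np.abs(proj - med))
print(mad)                               # 0.0 -> outlyingness of C is infinite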
5.2.5 Stackloss data
The Stackloss data outlyingnesses show large differences between the two methods.
One of the reasons is the sensitivity of the Kosinski results to the cutoff value in this
case, as is discussed in section 3. If a cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$ is used instead of $\sqrt{\chi^2_{4,0.99}} = 3.644$, the Kosinski method shows outlyingnesses as in Table 5.3.




        outl.             outl.                  outl.
   1    4.73          8 0.98                15 1.07
   2    3.30          9 0.76                16 0.87
   3    4.42         10 0.98                17 1.14
   4    4.19         11 0.83                18 0.71
   5    0.63         12 0.93                19 0.80
   6    0.76         13 1.24                20 1.04
   7    0.87         14 1.04                21 3.80
Table 5.3. The outlyingnesses of the Stackloss data, calculated with the Kosinski method with cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$. Outlyingnesses above this value are shown in bold; outlyingnesses that are even higher than $\sqrt{\chi^2_{4,0.99}} = 3.644$ are shown in bold italic.


Here 5 points have an outlyingness exceeding the cutoff value for α=5%, four of
them (points 1, 3, 4 and 21) even above the value for α=1%. Even in this case the
differences with the projection method are large. The projection outlyingnesses are
up to 5 times larger than the Kosinski ones.
For comparison, Walczak and Atkinson declared points 1, 3, 4 and 21 to be outliers,
Rocke also indicated point 2 as an outlier, while points 1, 2, 3 and 21 are outliers
according to Hadi (1992). These results are comparable with the results of the
Kosinski method with α=5%. Hence, considering the results in Table 5.4, the
Kosinski method finds too few outliers and the projection method too many. In
both cases the origin lies in the low n/p ratio.





  A:    Stackloss         Salinity                          HBK                          Factory
  B:      3,644            3,644                            3,644                         3,884
  C:    Proj Kos             Proj    Kos           Proj    Kos           Proj   Kos         Proj Kos
    1   8,42 1,62        1 2,67      1,29     1   30,38   32,34     51   1,99   1,64    1 5,23 2,12
    2   6,92 1,53        2 2,58      1,46     2   31,36   33,36     52   2,20   2,06    2 5,66 1,67
    3   8,14 1,45        3 4,65      1,84     3   32,81   34,90     53   3,18   2,80    3 5,55 1,91
    4   9,00 1,51        4 3,54      1,63     4   32,60   34,97     54   2,13   1,96    4 4,57 2,05
    5   1,74 0,41        5 6,06      4,06     5   32,71   34,92     55   1,57   1,22    5 3,28 2,34
    6   2,33 0,82        6 3,12      1,41     6   31,42   33,49     56   1,78   1,46    6 2,19 1,48
    7   3,45 1,31        7 2,62      1,25     7   32,34   34,33     57   1,81   1,61    7 2,27 1,49
    8   3,45 1,24        8 2,87      1,59     8   31,35   33,24     58   1,67   1,55    8 1,85 1,23
    9   2,15 1,11        9 3,31      1,90     9   32,13   34,35     59   0,89   1,13    9 2,15 1,17
   10   4,26 1,16       10 2,08      0,91    10   31,84   33,86     60   2,08   2,05   10 3,56 1,70
   11   3,01 1,11       11 2,76      1,24    11   28,95   32,68     61   1,78   1,99   11 3,64 1,87
   12   3,30 1,34       12 0,77      0,43    12   29,42   33,82     62   2,29   2,00   12 3,67 1,99
   13   3,25 1,01       13 2,36      1,28    13   29,42   33,82     63   1,70   1,70   13 2,24 1,43
   14   3,75 1,15       14 2,52      1,24    14   33,97   36,63     64   1,62   1,75   14 2,13 1,79
   15   3,90 1,20       15 3,71      2,16    15    1,99    1,89     65   1,90   1,85   15 1,84 1,29
   16   2,88 0,85       16 14,83     8,08    16    2,33    2,03     66   1,78   1,87   16 3,52 2,34
   17   7,09 1,78       17 3,68      1,60    17    1,65    1,74     67   1,34   1,20   17 2,42 1,79
   18   3,56 0,98       18 1,84      0,82    18    0,86    0,70     68   2,93   2,20   18 5,55 2,49
   19   3,07 1,04       19 2,93      1,79    19    1,54    1,18     69   1,97   1,56   19 5,65 1,76
   20   2,48 0,61       20 2,00      1,22    20    1,67    1,95     70   1,59   1,93   20 5,91 2,83
   21   8,85 2,11       21 2,50      0,95    21    1,57    1,76     71   0,75   1,01   21 4,35 1,90
                        22 3,34      1,23    22    1,90    1,70     72   1,00   0,83   22 2,20 1,63
                        23 5,20      2,07    23    1,72    1,72     73   1,70   1,53   23 2,77 1,62
                        24 4,62      1,90    24    1,70    1,56     74   1,77   1,80   24 2,14 0,90
                        25 0,77      0,42    25    2,06    1,83     75   2,44   1,98   25 3,11 2,13
                        26 1,80      0,87    26    1,73    1,80                        26 2,27 1,31
                        27 2,85      1,11    27    2,17    2,01                        27 4,88 2,02
                        28 3,72      1,48    28    1,41    1,13                        28 5,08 2,67
                                             29    1,33    1,13                        29 4,49 2,59
                                             30    2,04    1,86                        30 1,91 1,27
                                             31    1,61    1,53                        31 1,13 0,83
                                             32    1,78    1,70                        32 2,00 1,34
                                             33    1,55    1,45                        33 3,13 2,05
                                             34    2,10    2,07                        34 2,43 1,70
                                             35    1,41    1,80                        35 5,96 2,82
                                             36    1,63    1,61                        36 5,78 2,25
                                             37    1,75    1,87                        37 5,75 1,83
                                             38    2,01    1,86                        38 4,14 1,62
                                             39    2,16    1,93                        39 3,16 2,19
                                             40    1,25    1,17                        40 2,77 1,62
                                             41    1,65    1,81                        41 2,75 1,86
                                             42    1,91    1,72                        42 2,56 1,67
                                             43    2,50    2,17                        43 4,54 2,15
                                             44    2,04    1,91                        44 4,25 1,89
                                             45    2,07    1,86                        45 3,91 2,14
                                             46    2,04    1,91                        46 2,10 1,52
                                             47    2,92    2,56                        47 1,06 0,84
                                             48    1,40    1,70                        48 1,47 1,10
                                             49    1,73    2,01                        49 3,34 2,16
                                             50    1,05    1,36                        50 2,51 1,39
Table 5.4. The outlyingness of each point of the Stackloss, the Salinity, the HBK
and the Factory data. A, B, C: see Table 5.2.





  A:     Bush fire      Wood gravity        Coleman                      Milk
  B:       3,884           4,100             4,100                       4,482
  C:     Proj Kos            Proj Kos          Proj Kos         Proj    Kos            Proj    Kos
    1    3,48 1,38       1 4,72 2,65       1 3,56 2,84     1    9.06    9,46     51    2.62    1,98
    2    3,27 1,04       2 2,71 1,20       2 4,92 6,37     2   10.57   10,81     52    3.64    2,98
    3    2,76 1,11       3 3,68 2,19       3 6,76 2,94     3    4.04    5,09     53    2.38    2,22
    4    2,84 1,02       4 14,45 33,75     4 2,99 1,53     4    3.86    2,83     54    1.22    1,16
    5    3,85 1,40       5 3,02 2,80       5 2,70 1,43     5    2.23    2,52     55    1.68    1,69
    6    4,92 1,90       6 16,19 38,83     6 5,74 10,43    6    2.97    2,84     56    1.10    1,01
    7   11,79 4,37       7 7,90 5,00       7 3,11 2,23     7    2.36    2,35     57    1.96    2,19
    8   17,96 11,87      8 15,85 37,88     8 1,48 1,83     8    2.32    2,08     58    2.05    1,95
    9   18,36 12,18      9 6,12 2,72       9 2,49 5,95     9    2.58    2,49     59    1.47    2,21
   10   14,75 7,64      10 8,59 2,37      10 5,71 12,04   10    2.20    1,98     60    2.04    1,76
   11   12,31 6,76      11 5,38 3,04      11 5,07 7,70    11    5.28    4,60     61    1.48    1,42
   12    6,17 2,38      12 6,79 2,65      12 4,31 2,77    12    6.65    6,05     62    2.64    2,07
   13    5,83 1,77      13 7,14 1,98      13 3,49 2,92    13    5.63    5,38     63    2.33    2,60
   14    2,30 1,59      14 2,38 2,09      14 1,95 2,16    14    6.17    5,48     64    2.58    1,90
   15    4,70 1,55      15 2,40 1,47      15 6,11 6,56    15    5.47    5,73     65    1.85    1,56
   16    3,43 1,38      16 4,74 2,86      16 2,18 2,30    16    3.84    4,56     66    2.01    1,64
   17    3,06 0,92      17 6,07 2,12      17 3,78 5,95    17    3.59    4,76     67    3.28    2,59
   18    2,75 1,41      18 3,28 2,49      18 7,86 3,09    18    3.74    3,30     68    2.41    2,33
   19    2,82 1,38      19 18,33 44,49    19 3,48 2,11    19    2.43    2,85     69   46.45   44,61
   20    2,89 1,20      20 7,16 2,07      20 2,80 1,56    20    4.14    3,44     70    1.99    1,87
   21    2,47 1,13                                        21    2.26    2,08     71    2.19    2,27
   22    2,44 1,73                                        22    1.69    1,59     72    3.24    3,02
   23    2,46 1,04                                        23    1.81    2,04     73    6.89    6,99
   24    3,44 1,04                                        24    2.28    2,05     74    5.01    4,90
   25    1,90 0,91                                        25    2.81    2,83     75    2.02    2,03
   26    1,69 0,97                                        26    1.83    2,09     76    4.77    4,51
   27    2,27 0,99                                        27    4.24    3,71     77    1.35    1,43
   28    3,31 1,35                                        28    3.29    3,04     78    1.49    1,87
   29    4,82 1,83                                        29    3.19    2,57     79    2.93    2,66
   30    5,06 2,18                                        30    1.47    1,39     80    1.40    1,38
   31    6,00 5,66                                        31    2.87    2,29     81    2.59    2,34
   32   13,48 14,08                                       32    2.37    2,66     82    2.14    2,42
   33   15,34 16,35                                       33    1.78    1,33     83    3.00    2,56
   34   15,10 16,11                                       34    2.09    1,96     84    3.88    3,06
   35   15,33 16,43                                       35    2.73    2,10     85    2.19    2,36
   36   15,02 16,04                                       36    2.66    2,32
   37   15,17 16,30                                       37    2.61    2,23
   38   15,25 16,41                                       38    2.23    2,07
                                                          39    2.27    2,07
                                                          40    3.31    2,89
                                                          41   10.63   10,11
                                                          42    3.69    3,04
                                                          43    3.20    2,85
                                                          44    7.67    6,08
                                                          45    1.99    2,28
                                                          46    1.78    2,41
                                                          47    5.19    5,35
                                                          48    2.92    2,58
                                                          49    3.43    2,70
                                                          50    3.96    2,69
Table 5.5. The outlyingness of each point of the Bush fire, the Wood gravity, the
Coleman, and the Milk data. A, B, C: see Table 5.2.





5.2.6 Salinity data
The outlyingnesses of the Salinity data are roughly two times larger for the
projection method than for the Kosinski method. As a consequence, the latter
shows just 2 outliers (points 5 and 16), the former 8. Rousseeuw (1987) and
Walczak agree that points 5, 16, 23 and 24 are outliers, with points 23 and 24
lying just above the cutoff. Fung finds the same points at first, but after
applying his adding-back algorithm he concludes that point 16 is the only outlier.
The projection method shows too many outliers, while the Kosinski method misses
points 23 and 24.
5.2.7 HBK data
In the case of the HBK data the projection method and the Kosinski method agree
completely. Both indicate points 1-14 to be outliers. This is also in agreement with
the results of the original Kosinski method and of Egan, Hadi (1992,1993), Rocke,
Rousseeuw (1987,1990), Fung and Walczak. It is remarked that some of these
authors only find points 1-10 as outliers, but they use the “regression” definition of
an outlier. The HBK data set is an artificial one, in which the good points lie along a
regression plane. Points 1-10 are bad leverage points, i.e. they lie far away from the
center of the good points and from the regression plane as well. Points 11-14 are
good leverage points, i.e. although they lie far away from the bulk of the data they
still lie close to the regression plane. If one considers the distance from the
regression plane, the points 11-14 are not outliers.
5.2.8 Factory data
The Factory data set is a new one1. It is given in Table 5.6.
The outlyingnesses show a big discrepancy between the two methods. The
projection outlyingnesses are much larger than the Kosinski ones, resulting in 18
versus 0 outliers. The outlyingnesses are so large due to the shape of the data: about
half the data set is quite narrowly concentrated around the center, while the
other half forms a rather thick tail. Hence, in many projection directions the mad is
very small, leading to large outlyingnesses for the points in the tail. It is remarked
that the projection outliers agree well with the Kosinski outliers found with a
cutoff for α=5% (see also section 3.3.6).




1
  The Factory data is a generated data set, originally used in an exercise on regression
analysis in the CBS course “multivariate technics with SPSS”. It is interesting to note that
the regression coefficients change radically if the points that are indicated to be outliers by
the projection method and by the Kosinski method with a low cutoff are removed from the data
set. In other words, the regression coefficients are mainly determined by the “outlying”
points.


       x1      x2    x3 x4        x5           x1     x2    x3 x4      x5
   1 14.9 7.107 21 129 11.609 26 12.3 12.616 20 192 11.478
   2 8.4 6.373 22 141 10.704 27 4.1 14.019 20 177 14.261
   3 21.6 6.796 22 153 10.942 28 6.8 16.631 23 185 15.300
   4 25.2 9.208 20 166 11.332 29 6.2 14.521 19 216 10.181
   5 26.3 14.792 25 193 11.665 30 13.7 13.689 22 188 13.475
   6 27.2 14.564 23 189 14.754 31               18 14.525 21 192 14.155
   7 22.2 11.964 20 175 13.255 32 22.8 14.523 21 183 15.401
   8 17.7 13.526 23 186 11.582 33 26.5 18.473 22 205 14.891
   9 12.5 12.656 20 190 12.154 34 26.1 15.718 22 200 15.459
 10 4.2 14.119 20 187 12.438 35 14.8 7.008 21 124 10.768
 11 6.9 16.691 22 195 13.407 36 18.7 6.274 21 145 12.435
 12 6.4 14.571 19 206 11.828 37 21.2 6.711 22 153 9.655
 13 13.3 13.619 22 198 11.438 38 25.1 9.257 22 169 10.445
 14 18.2 14.575 22 192 11.060 39 26.3 14.832 25 191 13.150
 15 22.8 14.556 21 191 14.951 40 27.5 14.521 24 177 14.067
 16 26.1 18.573 21 200 16.987 41 17.6 13.533 24 186 12.184
 17 26.3 15.618 22 200 12.472 42 12.4 12.618 21 194 12.427
 18 14.8 7.003 22 130 9.920 43 4.3 14.178 20 181 14.863
 19 18.2 6.368 22 144 10.773 44                  6 16.612 21 192 14.274
 20 21.3 6.722 21 123 15.088 45 6.6 14.513 20 213 10.706
 21     25 9.258 20 157 13.510 46 13.1 13.656 22 192 13.191
 22 26.1 14.762 24 183 13.047 47 18.2 14.525 21 191 12.956
 23 27.4 14.464 23 177 15.745 48 22.8 14.486 21 189 13.690
 24 22.4 11.864 21 175 12.725 49 26.2 18.527 22 200 17.551
 25 17.9 13.576 23 167 12.119 50 26.1 15.578 22 204 13.530
Table 5.6. The Factory data (n=50, p=5). The average temperature (x1, in degrees
Celsius), the production (x2, in 1000 pieces), the number of working days (x3), the
number of employees (x4) and the water consumption (x5, in 1000 liters) at a factory
in 50 successive months.


5.2.9 Bushfire data
The outliers found by the adjusted Kosinski method (points 7-11, 31-38) agree
perfectly with those found by the original algorithm of Kosinski and with the results
by Rocke and Maronna. The projection method shows as additional outliers points 6,
12, 13, 15, 29 and 30. Due to the large contamination the projected median is shifted
strongly, leading to relatively large outlyingnesses for the good points and,
consequently, many swamped points.
5.2.10 Wood gravity data
Rousseeuw (1984), Hadi (1993), Atkinson, Rocke and Egan declare points 4, 6, 8
and 19 to be outliers. The Kosinski method finds these outliers too, but additionally
declares point 7 to be an outlier. The projection method shows strange results. Fourteen points have an
outlyingness above the cutoff, which is 70% of the data set. This is of course not
realistic. The reason is again the sparsity of the data set. Hence, it is rather surprising
that the Kosinski method and the methods by other authors perform relatively well
in this case.
5.2.11 Coleman data
The Coleman data contain 8 outliers according to the projection method, 7 according
to the Kosinski method. However, the two methods agree on only 5 points (2, 6, 10, 11, 15).

Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 

Recently uploaded (20)

Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
Biting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdfBiting mechanism of poisonous snakes.pdf
Biting mechanism of poisonous snakes.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 

ROBUST MULTIVARIATE OUTLIER DETECTION

  • 2. Robust multivariate outlier detection 1. Introduction The statistical process can be separated in three steps. The input phase involves the collection of data by means of surveys and registers. The throughput phase involves preparing the raw data for tabulation purposes, weighting and variance estimating. The output phase involves the publication of population totals, means, correlations, etc., which have come out of the throughput phase. Data editing is one of the first steps in the throughput process. It is the procedure for detecting and adjusting individual errors in data. Editing also comprises the detection and treatment of correct but influential records, i.e. records that have a substantial contribution to the aggregates to be published. The search for suspicious records, i.e. records that are possibly wrong or influential, can be done in basically two ways. The first way is by examining each record and looking for strange or wrong fields or combinations of fields. In this view a record includes all fields referring to a particular unit, be it a person, household or business unit, even if those fields are stored in separate files, like files containing survey data and files containing auxiliary data. The second way is by comparing each record with the other records. Even if the fields of a particular record obey all the edit rules one has laid down, the record could be an outlier. An outlier is a record, which does not follow the bulk of the records. The data can be seen as a rectangular file, each row denoting a particular record and each column a particular variable. The first way of searching for suspicious data can be seen as searching in rows, the second way as searching in columns. It is remarked that some and possibly many errors can be detected by both ways. Records could be outliers while their outlyingness is not apparent by examining the variables, or columns, one by one. For instance, a company that has a relatively large turnover but that has paid relatively little taxes might be no outlier in either one of the variables, but could be an outlier considering the combination. Outliers involving more than one variable are multivariate outliers. In order to quantify how far a record lies from the bulk of the data, one needs a measure of distance. In the case of categorical data no useful distance measure exists, but in the case of continuous data the so-called Mahalanobis distance is often employed. A distance measure should be robust against the presence of outliers. It is known that the classical Mahalanobis distance is not. This means that the outliers, which are to be detected, seriously hamper the detection of those outliers. Hence, a robust version of the Mahalanobis distance is needed. In this report two robust multivariate outlier detection algorithms for continuous data, based on the Mahalanobis distance, are reported. In the next section the classical Mahalanobis distance is introduced and ways to robustify this distance measure are discussed. In sections 3 and 4 the two algorithms, successively the 1
Kosinski method and the projection method, are presented. In section 5 a comparison between the two algorithms is made, as well as a comparison with other algorithms reported in the outlier literature. A practical example, and the problems involved with it, is the subject of section 6. In section 7 some concluding remarks are made.

2. The Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point and the center of all points, with respect to the scale of the data and, in the multivariate case, with respect to the shape of the data as well. It is remarked that in regression analysis another distance measure is more convenient: instead of the distance between a point and the center of the data, one uses the distance between the point and the regression plane (see also section 5).

Suppose we have a continuous data set $y_1, y_2, .., y_n$. The vectors $y_i$ are p-dimensional, i.e. $y_i = (y_{i1}\; y_{i2}\; ..\; y_{ip})^t$, where $y_{iq}$ denotes a real number. The classical squared Mahalanobis distance is defined by

$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$

where $\bar{y}$ and $C$ denote the mean and the covariance matrix respectively:

$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

$C = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^t$

In the case of one-dimensional data the covariance matrix reduces to the variance and the Mahalanobis distance to $MD_i = |y_i - \bar{y}| / \sigma$, where $\sigma$ denotes the standard deviation.
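As an illustration, the classical quantities above can be computed directly with numpy. This is a minimal sketch of our own, not the authors' implementation; the function name classical_mahalanobis_sq is ours.

```python
# Minimal sketch: classical squared Mahalanobis distances from the ordinary
# mean and covariance matrix (the non-robust version discussed above).
import numpy as np

def classical_mahalanobis_sq(Y):
    """Y is an (n, p) array; returns the n squared Mahalanobis distances."""
    center = Y.mean(axis=0)
    C = np.cov(Y, rowvar=False)              # 1/(n-1) normalisation, as in the text
    C_inv = np.linalg.inv(C)
    diffs = Y - center
    return np.einsum('ij,jk,ik->i', diffs, C_inv, diffs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(100, 3))
    print(classical_mahalanobis_sq(Y)[:5])
```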
Another point of view results by noting that the Mahalanobis distance is the solution of a maximization problem. The maximization problem is defined as follows. The data points $y_i$ can be projected on a projection vector $a$. The outlyingness of the point $y_i$ is the squared projected distance $(a^t(y_i - \bar{y}))^2$ with respect to the projected variance $a^t C a$. Assuming that the covariance matrix $C$ is positive definite, there exists a non-singular matrix $A$ such that $A^t C A = I$. Using the Cauchy-Schwarz inequality we have

$\frac{(a^t(y_i - \bar{y}))^2}{a^t C a} = \frac{\big(a^t (A^t)^{-1} A^t (y_i - \bar{y})\big)^2}{a^t C a} \le \frac{(A^{-1}a)^t (A^{-1}a)\,(y_i - \bar{y})^t A A^t (y_i - \bar{y})}{a^t C a} = \frac{a^t (AA^t)^{-1} a\,(y_i - \bar{y})^t A A^t (y_i - \bar{y})}{a^t C a} = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y}) = MD_i^2$

with equality if and only if $A^{-1}a = c\,A^t(y_i - \bar{y})$ for some constant c. Hence

$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar{y}))^2}{a^t C a}$

i.e., the Mahalanobis distance is equal to the supremum of the outlyingness of $y_i$ over all possible projection vectors.

If the data set $y_i$ is multivariate normal, the squared Mahalanobis distances $MD_i^2$ follow the $\chi^2$ distribution with p degrees of freedom.

The classical Mahalanobis distance suffers, however, from the masking and swamping effects. Outliers seriously affect the mean and the covariance matrix in such a way that the Mahalanobis distance of outliers could be small (masking), while the Mahalanobis distance of points which are not outliers could be large (swamping). Therefore, robust estimates of the center and the covariance matrix should be found in order to calculate a useful Mahalanobis distance. In the univariate case the most robust choice is the median (med) and the median of absolute deviations (mad), replacing the mean and the standard deviation respectively. The med and mad have a robustness of 50%. The robustness of a quantity is defined as the maximum percentage of data points that can be moved arbitrarily far away while the change in that quantity remains bounded.

It is not trivial to generalize the robust one-dimensional Mahalanobis distance to the multivariate case. Several robust estimators for the location and scale of multivariate data have been developed. We have tested two methods, the projection method and the Kosinski method. Other methods for robust outlier detection will be discussed in section 5, where we will compare the different methods on their ability to detect outliers. In the next two sections the Kosinski method and the projection method will be discussed in detail.
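The supremum identity derived above can be checked numerically: random unit vectors approximate the supremum from below and should come close to the squared Mahalanobis distance. This is our own quick check, reusing the classical_mahalanobis_sq sketch given earlier.

```python
# Numerical check of MD_i^2 = sup_a (a^t(y_i - ybar))^2 / (a^t C a) over unit
# vectors a, using random directions as an approximation of the supremum.
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(200, 3))
center, C = Y.mean(axis=0), np.cov(Y, rowvar=False)

A = rng.normal(size=(20000, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)           # unit projection vectors
proj = (Y - center) @ A.T                               # a^t (y_i - ybar) per direction
best = ((proj ** 2) / np.einsum('ij,jk,ik->i', A, C, A)).max(axis=1)

md2 = classical_mahalanobis_sq(Y)
print(np.max(np.abs(best - md2) / md2))                 # small relative deviation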
3. The Kosinski method

3.1 The principle of Kosinski

The method discussed in this section was quite recently published by Kosinski. The idea of Kosinski is basically the following:

1) start with a few, say g, points, denoted the "good" part of the data set;
2) calculate the mean and the covariance matrix of those points;
3) calculate the Mahalanobis distances of the complete data set;
4) increase the good part of the data set with one point by selecting the g+1 points with the smallest Mahalanobis distance and define g=g+1;
5) return to step 2, or stop as soon as the good part contains more than half the data set and the smallest Mahalanobis distance of the remaining points is higher than a predefined cutoff value.

At the end the remaining part, or the "bad" part, should contain the outliers. In order to ensure that the good part contains no outliers at the end, it is important to start the algorithm with points that are all good. In the paper by Kosinski this problem is solved by repeatedly choosing a small set of random points and performing the algorithm for each set. The number of starting sets is taken high enough to be sure that at least one set contains no outliers.

We made two major adjustments to the Kosinski algorithm. The first one is the choice of the starting data set. The required property of the starting data set is that it contains no outliers; it does not matter how these points are found. We choose the starting data set by robustly estimating the center of the data set and selecting the p+1 closest points. In the case of a p-dimensional data set, p+1 points are needed to get a useful starting data set, since the covariance matrix of a set of at most p points is always non-invertible. A separation of the data set into p+1 good points and n-p-1 bad points is called an elemental partition.

The center is estimated by calculating the mean of the data set, neglecting all univariate robustly detected outliers. This is of course just a crude estimate, but it is satisfactory for the purpose of selecting a good starting data set. Another crude estimate of the center that was tried out was the coordinate-wise median, which appeared to result in less satisfactory starting data sets. The p+1 points closest to the mean are chosen, where "closest" is defined by an ordinary distance measure. In order to take the different scales and units of the different dimensions into account, the data set is coordinate-wise scaled before the mean is calculated, i.e. each component of each point is divided by the median of absolute deviations of the dimension concerned. It is remarked that, after the first p+1 points are selected, the algorithm continues with the original unscaled data.

It is, of course, possible to construct a data set for which this algorithm fails to select p+1 points that are all good points. However, in all the data sets exploited in this report, artificial and real, this choice of a starting data set worked very well.
This adjustment results in a spectacular gain in computer time, since the algorithm has to be run only once instead of many times. Kosinski estimates the required number of random starting data sets in his own original algorithm to be approximately 35 in the case of 2-dimensional data sets, and up to 10000 in 10 dimensions.

The other adjustment is in the expansion of the good part. In the Kosinski paper the increment is always one point. We implemented an increment proportional to the good part already found, for instance 10%. This means that the good part is increased by a factor of 10% at each step. This speeds up the algorithm as well, especially in large data sets. The original algorithm with one-point increment scales with $n^2$, where n is the number of data points, while the algorithm with proportional increment scales with $n \ln n$. Also this adjustment was tested and appeared to work very well. In the remainder of this report, "the Kosinski method" denotes the adjusted Kosinski method, unless otherwise noted.

3.2 The Kosinski algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, .., y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be summarized as follows.

Step 0. In: data set
The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, .., y_n$, where $y_i = (y_{i1}\; ..\; y_{ip})^t$.

Step 1. Choose an elemental partition
A good part of p+1 points is found as follows (a code sketch follows this list).
• Calculate the med and mad for each dimension q:
  $M_q = \mathrm{med}_k\, y_{kq}$
  $S_q = \mathrm{med}_l\, |y_{lq} - M_q|$
• Divide each component q of each data point i by the mad of the dimension concerned. The scaled data points are denoted by the superscript s:
  $y_{iq}^s = y_{iq} / S_q$
• Declare a point to be a univariate outlier if at least one component of the data point is farther than 2.5 standard deviations away from the scaled median. The standard deviation is approximated by 1.484 times the mad (see section 4.1 for the background of the factor 1.484). So calculate for each component q of each point i:
  $u_{iq} = \frac{1}{1.484}\left| y_{iq}^s - \frac{M_q}{S_q} \right|$
  If $u_{iq} > 2.5$ for any q, then point i is a univariate outlier.
• Calculate the mean of the data set, neglecting the univariate outliers:
  $\bar{y}^s = \frac{1}{n_0} \sum_{i:\, y_i \text{ is no outlier}} y_i^s$
  where $n_0$ denotes the number of points that are not univariate outliers.
• Select the p+1 points that are closest to the mean and define those points to be the good part of the data set. So calculate:
  $d_i = \| y_i^s - \bar{y}^s \|$
  The g = p+1 points with the smallest $d_i$ form the good part, denoted by G.
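The following is a compact sketch of Step 1 in Python, following the recipe above; the function name elemental_partition and the parameter z_cut are ours, and degenerate cases (for instance a zero mad in some dimension) are not handled.

```python
# Sketch of Step 1: choose the elemental partition of p+1 starting points.
import numpy as np

def elemental_partition(Y, z_cut=2.5):
    """Return the indices of the p+1 starting points for the forward search."""
    n, p = Y.shape
    med = np.median(Y, axis=0)                     # M_q per dimension
    mad = np.median(np.abs(Y - med), axis=0)       # S_q per dimension
    Ys = Y / mad                                   # coordinate-wise scaled data
    u = np.abs(Ys - med / mad) / 1.484             # approximate robust z-scores
    is_uni_outlier = (u > z_cut).any(axis=1)
    center = Ys[~is_uni_outlier].mean(axis=0)      # mean neglecting univariate outliers
    d = np.linalg.norm(Ys - center, axis=1)
    return np.argsort(d)[: p + 1]                  # the p+1 closest points
```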
Step 2. Iteratively increase the good part
The good part is increased until a certain stop criterion is fulfilled (a code sketch follows this list).
• Continue with the original data set $y_i$, not with the scaled data set $y_i^s$.
• Calculate the mean and the covariance matrix of the good part:
  $\bar{y} = \frac{1}{g} \sum_{i \in G} y_i$
  $C = \frac{1}{g-1} \sum_{i \in G} (y_i - \bar{y})(y_i - \bar{y})^t$
• Calculate the Mahalanobis distance of all the data points:
  $MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$
• Calculate the number of points with a Mahalanobis distance smaller than a predefined cutoff value. A useful cutoff value is $\chi^2_{p,1-\alpha}$, with α=1%.
• Increase the good part with a predefined percentage (a useful percentage is 20%) by selecting the points with the smallest Mahalanobis distances, but not more than up to
  a) half the data set, i.e. $h = [\tfrac{1}{2}(n+p+1)]$, if the good part is smaller than half the data set (g < h);
  b) the number of points with a Mahalanobis distance smaller than the cutoff, if the good part is larger than half the data set.
• Stop the algorithm if the good part was already larger than half the data set and no more points were added in the last iteration.

Step 3. Out: outlyingnesses
The outlyingness of each point is now simply the Mahalanobis distance of the point, calculated with the mean and the covariance matrix of the good part of the data set.
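A condensed sketch of Steps 2 and 3 is given below. It assumes the elemental_partition() helper from the previous sketch, uses one-point increments (the basic variant; the proportional increment of section 3.1 would grow the good part by a percentage instead), and interprets the stop criterion as "stop once no point outside the good part falls below the cutoff". The function name kosinski_outlyingness is ours.

```python
# Sketch of the forward search (Steps 2-3) with one-point increments.
import numpy as np
from scipy.stats import chi2

def kosinski_outlyingness(Y, alpha=0.01):
    n, p = Y.shape
    cutoff = chi2.ppf(1 - alpha, df=p)            # chi^2_{p,1-alpha}
    good = list(elemental_partition(Y))
    h = (n + p + 1) // 2                          # "half the data set"
    while True:
        G = Y[good]
        mean = G.mean(axis=0)
        C_inv = np.linalg.inv(np.cov(G, rowvar=False))
        diffs = Y - mean
        md2 = np.einsum('ij,jk,ik->i', diffs, C_inv, diffs)
        order = np.argsort(md2)
        if len(good) < h:
            good = list(order[: len(good) + 1])    # grow towards half the data set
        else:
            inside = int((md2 <= cutoff).sum())    # points below the cutoff
            new_size = max(len(good), min(inside, len(good) + 1))
            if new_size == len(good):
                return md2                         # outlyingness of every point
            good = list(order[:new_size])
```

The returned squared distances can be compared with the same chi-square cutoff to flag the outliers.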
3.3 Test results

A prototype/test program was implemented in a Borland Pascal 7.0 environment. Documentation of the program is published elsewhere. We successively tested the choice of the elemental partition by means of the mean, the amount of swamped observations in data sets containing no outliers, the amount of masked and swamped observations in data sets containing outliers, the algorithm with proportional increment, and the time performance of the proportional increment of the good part compared to the one-point increment. Finally, we tested the sensitivity of the number of detected outliers to the cutoff value and the increment percentage in some known data sets.

3.3.1 Elemental partition

First of all, the choice of the elemental partition was tested with the generated data set published by Kosinski. The Kosinski data set is a kind of worst-case data set. It contains a large fraction of outliers (40% of the data) and the outliers are distributed with a variance much smaller than the variance of the good points.

Before using the mean, we calculated the coordinate-wise median as a robust estimator of the center of the data, and selected the three closest points. This strategy failed. Although the median has a 50% robustness, the 40% outliers strongly shift the median. Hence, one of the three selected points appeared to be an outlier. As a consequence, the forward search algorithm indicated all points to be good points, i.e. all the outliers were masked. This was the reason we searched for another robust measure of the location of the data. One of the simplest ideas is to search for univariate outliers first, and to calculate the mean of the points that are outliers in none of the dimensions. The selected points, the three points closest to the mean, all appeared to be good points. Moreover, the forward search algorithm, applied with this elemental partition, successfully distinguished the outliers from the good points.

All following tests were performed using this "mean" to select the first p+1 points. For all tested data sets the selected p+1 points appeared to be good points, resulting in a successful forward search. It is possible, in principle, to construct a data set for which this selection algorithm still fails, for instance a data set with a large fraction of outliers which are univariately invisible and with no unambiguous dividing line between the group of outliers and the group of good points. This is, however, a very hypothetical situation.

3.3.2 Swamping

A simulation study was performed in order to determine the average fraction of swamped observations in normally distributed data sets. In large data sets almost always a few points are indicated to be outliers, even if the whole data set nicely follows a normal distribution. This is due to the cutoff value. If a cutoff value of $\chi^2_{p,1-\alpha}$ is used as discriminator between good points and outliers in a p-dimensional standard normal data set, a fraction of α of the data points will have a Mahalanobis distance larger than the cutoff value.
For each dimension p between 1 and 8 we generated 100 standard normal data sets of 100 points. The Kosinski algorithm was run twice on each data set, once with a cutoff value $\chi^2_{p,0.99}$ and once with $\chi^2_{p,0.95}$. Each point that is indicated to be an outlier is a swamped observation, since there are no true outliers by construction. We calculated the average fraction of swamped observations (i.e. the number of swamped observations of each data set divided by 100, the number of points in the data set, averaged over all 100 data sets). Results are shown in Table 3.1.

α       p=1     2       3       4       5       6       7       8
0.01    0.015   0.011   0.010   0.008   0.008   0.008   0.007   0.007
0.05    0.239   0.112   0.081   0.070   0.059   0.052   0.045   0.042

Table 3.1. The average fraction of swamped observations of the simulations of 100 generated p-dimensional data sets of 100 points for each p between 1 and 8, with cutoff value $\chi^2_{p,1-\alpha}$.

For α=0.01 the fraction of swamped observations is very close to the value of α itself. These results are very similar to the results of the original Kosinski algorithm. For α=0.05, however, the average fraction of swamped observations is much larger than 0.05 for the lower dimensions, especially for p=1 and p=2. The reason for this is the following. Consider a one-dimensional standard normal data set. If the variance of all points is used, the outlyingness of a fraction of α points will be larger than $\chi^2_{1,1-\alpha}$. However, in the Kosinski algorithm the variance is calculated from all points except at least that fraction of α points with the largest outlyingnesses. This variance is smaller than the variance of all points. Hence, the Mahalanobis distances are overestimated and too many points are indicated to be outliers. This is a self-magnifying effect: more outliers lead to a smaller variance, which leads to more points being indicated as outliers, and so on. The effect is strongest in one dimension. In higher dimensions the points with a large Mahalanobis distance lie "all around"; therefore they influence the variance in the separate directions less. Apparently, the effect is quite strong for α=0.05, but almost negligible for α=0.01. In the remaining tests α=0.01 is used, unless otherwise stated.
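The swamping experiment can be re-created along the following lines. This is our own small re-creation (not the authors' Pascal program), reusing the kosinski_outlyingness() sketch given earlier and restricting p to two values to keep the run short.

```python
# Swamping experiment: outlier-free standard normal data, so every flagged
# point is a swamped observation.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
alpha, n_sets, n_points = 0.01, 100, 100
for p in (2, 5):
    cutoff = chi2.ppf(1 - alpha, df=p)
    swamped = [
        (kosinski_outlyingness(rng.normal(size=(n_points, p)), alpha) > cutoff).mean()
        for _ in range(n_sets)
    ]
    print(p, np.mean(swamped))       # average fraction of swamped observations
```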
3.3.3 Masking and swamping

The ability of the algorithm to detect outliers was tested in another simulation. We generated data sets in the same way as is done in the Kosinski paper, in order to get a fair comparison between the original and our adjusted Kosinski algorithm. Thus we generated data sets of 100 points containing good points as well as outliers. Both the good points and the outliers were generated from a multivariate normal distribution, with $\sigma^2 = 40$ for the good points and $\sigma^2 = 1$ for the bad points. The distance between the center of the good points and the bad points is denoted by d. The vector between the centers is along the vector of 1's. We varied the dimension (p=2, 5), the fraction of outliers (0.10 to 0.45), and the distance (d=20 to 60). We calculated the fraction of masked outliers (the number of masked outliers of each data set divided by the number of outliers) and the fraction of swamped points (the number of swamped points of each data set divided by the number of good points), both averaged over 100 simulation runs for each set of parameters p, d, and fraction of outliers. Results are shown in Table 3.2.

p=2                fraction of outliers:  0.10    0.20    0.30    0.40    0.45
  d=20   fraction masked                  0.81    0.89    0.88    0.86    0.88
         fraction swamped                 0.009   0.014   0.022   0.146   0.350
  d=30   fraction masked                  0.03    0.00    0.01    0.05    0.01
         fraction swamped                 0.011   0.011   0.010   0.043   0.019
  d=40   fraction masked                  0.00    0.00    0.00    0.00    0.00
         fraction swamped                 0.011   0.011   0.011   0.009   0.010

p=5                fraction of outliers:  0.10    0.20    0.30    0.40    0.45
  d=25   fraction masked                  0.90    0.91    0.93    0.97    1.00
         fraction swamped                 0.008   0.021   0.146   0.551   0.855
  d=40   fraction masked                  0.00    0.04    0.03    0.02    0.01
         fraction swamped                 0.008   0.008   0.022   0.020   0.014
  d=60   fraction masked                  0.00    0.00    0.00    0.00    0.00
         fraction swamped                 0.008   0.007   0.009   0.010   0.008

Table 3.2. Average fraction of masked and swamped observations of 2- and 5-dimensional data sets over 100 simulation runs. Each data set consisted of 100 points with a certain fraction of outliers. The good (bad) points were generated from a multivariate normal distribution with $\sigma^2 = 40$ ($\sigma^2 = 1$) in each direction. The distance between the center of the good points and the bad points is denoted by d.
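Contaminated data sets of the kind described above can be generated as follows; this is a sketch under our reading of the setup (function and parameter names are ours), with the centers a Euclidean distance d apart along the all-ones direction.

```python
# Sketch of the contaminated test data: good points with variance 40 per
# coordinate, a tight cluster of outliers with variance 1, centers d apart.
import numpy as np

def contaminated_data(n=100, p=2, frac=0.20, d=30.0, seed=0):
    rng = np.random.default_rng(seed)
    n_bad = int(round(frac * n))
    n_good = n - n_bad
    good = rng.normal(scale=np.sqrt(40.0), size=(n_good, p))
    shift = d * np.ones(p) / np.sqrt(p)        # shift of length d along the 1-vector
    bad = shift + rng.normal(scale=1.0, size=(n_bad, p))
    labels = np.r_[np.zeros(n_good, dtype=int), np.ones(n_bad, dtype=int)]
    return np.vstack([good, bad]), labels
```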
The following conclusions can be drawn from these results. The algorithm is said to perform well if the fraction of masked outliers is close to zero and the fraction of swamped observations is close to α=0.01. The first conclusion is: the larger the distance between the good points and the bad points, the better the algorithm performs. This conclusion is not surprising and is in agreement with Kosinski's results. Secondly, the higher the dimension, the worse the performance of the algorithm. In five dimensions the algorithm starts to perform well at d=40 and is close to perfect at d=60, while in two dimensions the performance is good at d=30 and perfect at d=40. The original algorithm did not show such a dependence on the dimension. It is remarked, however, that the paper by Kosinski does not give enough details for a good comparison on this point. Third, for both two and five dimensions the adjusted algorithm performs worse than the original algorithm. The original algorithm is almost perfect at d=25 for both p=2 and p=5, while the adjusted algorithm is not perfect until d=40 or d=60. This is the price that is paid for the large gain in computer time. The fourth conclusion is: the performance of the algorithm is almost independent of the fraction of outliers, in agreement with Kosinski's results. In some cases the algorithm even seems to perform better for higher fractions. This is however due to the relatively small number of points (100) per data set. For very large data sets and very large numbers of simulation runs this artifact will disappear.

p   d    fr     inc    masked   swamped
2   20   0.10   1p     0.79     0.010
2   20   0.10   10%    0.80     0.009
2   20   0.10   100%   0.80     0.009
2   20   0.40   1p     0.86     0.225
2   20   0.40   10%    0.86     0.146
2   20   0.40   100%   0.89     0.093
2   30   0.10   1p     0.00     0.011
2   30   0.10   10%    0.03     0.011
2   30   0.10   100%   0.02     0.011
2   30   0.40   1p     0.05     0.042
2   30   0.40   10%    0.05     0.043
2   30   0.40   100%   0.08     0.038
2   40   0.10   1p     0.00     0.011
2   40   0.10   10%    0.00     0.011
2   40   0.10   100%   0.00     0.011
2   40   0.40   1p     0.00     0.010
2   40   0.40   10%    0.00     0.009
2   40   0.40   100%   0.02     0.009
5   40   0.10   1p     0.00     0.008
5   40   0.10   10%    0.00     0.008
5   40   0.10   100%   0.01     0.008
5   40   0.40   1p     0.01     0.016
5   40   0.40   10%    0.01     0.016
5   40   0.40   100%   0.06     0.035

Table 3.3. Average fraction of masked and swamped observations for p-dimensional data sets with a fraction fr of outliers at a distance d from the good points (for more details about the data sets see Table 3.2), calculated with runs with either one-point increment (1p) or proportional increment (10% or 100% of the good part).
3.3.4 Proportional increment

Until now all tests have been performed using the one-point increment, i.e. at each step of the algorithm the size of the good part is increased with just one point. In section 3.1 it was already mentioned that a gain in computer time is possible by increasing the size of the good part with more than one point per step. The simulations on the masked and swamped observations were repeated with the proportional increment algorithm. The increment with a certain percentage was tested for percentages up to 100% (which means that the size of the good part is doubled at each step). The results of Table 3.1, showing the average fraction of swamped observations in outlier-free data sets, did not change. Small changes showed up for large percentages in the presence of outliers. A summary of the results is shown in Table 3.3. In order to avoid an unnecessary profusion of data we only show the results for p=2 in some relevant cases and, as an illustration, in a few cases for p=5. A general conclusion from the table is that for a wide range of percentages the proportional increment algorithm works satisfactorily. For a percentage of 100% outliers are masked slightly more frequently than for lower percentages. The differences between the 10% increment and the one-point increment are negligible.
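The proportional increment rule, as we read it, amounts to growing the good part by a fixed percentage of its current size (at least one point) instead of one point per step; a small sketch with names of our own choosing:

```python
# Proportional increment: geometric growth of the good part, so roughly
# O(ln n) steps instead of O(n).
def next_size(g, perc=0.10, upper=None):
    g_new = g + max(1, int(perc * g))
    return g_new if upper is None else min(g_new, upper)

sizes, g = [], 4
while g < 100:
    sizes.append(g)
    g = next_size(g, perc=0.10, upper=100)
print(sizes)
```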
3.3.5 Time dependence

To illustrate the possible gain with the proportional increment we measured the time per run for p-dimensional data sets of n points, with p ranging from 1 to 8 and n from 50 to 400. The simulations were performed with outlier-free generated data sets, so that the complete data sets had to be included in the good part. This was done in order to obtain useful information about the dependence of the simulation times on the number of points. Table 3.4 shows the results for the simulation runs with one-point increment. The results for the runs with a proportional increment of 10% are shown in Table 3.5.

n      p=1     2      3      4      5      6      7      8
50     0.09    0.18   0.29   0.45   0.64   0.84   1.08   1.35
100    0.36    0.68   1.05   1.75   2.5    3.3    4.3    5.5
200    1.46    2.8    4.6    7.0    10
400    6.2     12

Table 3.4. Time (in seconds) per run on p-dimensional data sets of n points, using the one-point increment.

n      p=1     2      3      4      5      6      7      8
50     0.05    0.10   0.16   0.23   0.31   0.39   0.52   0.62
100    0.14    0.24   0.39   0.56   0.76   1.00   1.25   1.55
200    0.33    0.60   0.92   1.35   1.90
400    0.80    1.40

Table 3.5. Time (in seconds) per run on p-dimensional data sets of n points, using the proportional increment (perc=10%).

Let us denote the time per run as a function of n for fixed p by $t_p$, and the time per run as a function of p for fixed n by $t_n$. For the one-point increment simulations $t_p$ is approximately proportional to $n^2$. This is as expected, since there are O(n) steps with an increment of one point, and at each step the Mahalanobis distance has to be calculated for each point (O(n)) and sorted (O(n ln n)). For the simulations with proportional increment $t_p$ is approximately O(n ln n), due to the fact that only O(ln n) steps are needed instead of O(n). As a consequence there is a substantial gain in the time per run, ranging from a factor of 2 for 50 points up to a factor of 8 for 400 points.

The time per run for fixed n, $t_n$, is approximately proportional to $p^{1.5}$, for both one-point and proportional increment runs. The exponent 1.5 is just an empirical average over the range p=1..8 and is the result of several O(p) and O(p²) steps. Since the exponent is much smaller than 2, it is more efficient to search for outliers in one p-dimensional run than in ½p(p-1) 2-dimensional runs, one for each pair of dimensions, even if one is not interested in outliers in more than 2 dimensions. Consider for instance p=8, n=50. One run takes 0.62 seconds. However, a total of 1.4 seconds would be needed for the 28 runs over each pair of dimensions, each run taking 0.05 seconds.

3.3.6 Sensitivity to parameters

The Kosinski algorithm was tested on the twelve data sets described in section 5. A full description of the outliers and a comparison of the results with the results of the projection algorithm, as well as with other methods described in the literature, is given in that section. In the present section we restrict the discussion to the sensitivity of the number of outliers to the cutoff and the increment percentage.

The algorithm was run with a cutoff $\chi^2_{p,1-\alpha}$ for α=1% as well as α=5%. Furthermore, both one-point increment and proportional increment (in the range 0-40%) were used. The number of detected outliers of the twelve data sets is shown in Table 3.6. It is clear that the number of outliers for a specific data set is not the same for each set of parameters. It is remarked that, in all cases, if different sets of parameters lead to the same number of outliers, the outliers are exactly the same points. Moreover, if one set of parameters leads to more outliers than another set, all outliers detected by the latter are also detected by the former (these are empirical results).

Let us first discuss the differences between the detection with α=1% and with α=5%. It is obvious that in many cases α=5% results in slightly more outliers than α=1%. However, in two cases the differences are substantial, i.e. in the Stackloss data and in the Factory data. In the Stackloss data five outliers are found for α=5% using moderate increments, while α=1% shows no outliers at all. The reason for this difference is the small number of points relative to the dimension of the data set. It has been argued by Rousseeuw that the ratio n/p should be larger than 5 in order to be able to detect outliers reliably. If n/p is smaller than 5 one comes to a point where it is not useful to speak about outliers, since there is no real bulk of data. With n=21 and p=4 the Stackloss data lie on the edge of meaningful outlier detection. Moreover, if the five points which are indicated as outliers with α=5% are left out, only 16 good points remain, resulting in a ratio n/p=4. In such a case any outlier detection algorithm will presumably fail to find outliers consistently.
Data set               p   n    inc      α=5%   α=1%
1. Kosinski            2   100  1p       42     40
                                ≤40%     42     40
2. Brain mass          2   28   1p       5      3
                                ≤10%     5      3
                                15-20%   4      3
                                30-40%   3      3
3. Hertzsprung-Russel  2   47   1p       7      6
                                ≤30%     7      6
                                40%      6      6
4. Hadi                3   25   1p       3      3
                                ≤5%      3      3
                                10%      3      0
                                15-25%   3      3
                                30%      3      0
                                40%      3      3
5. Stackloss           4   21   1p       5      0
                                ≤17%     5      0
                                18-24%   4      0
                                25-30%   1      0
                                40%      0      0
6. Salinity            4   28   1p       4      2
                                ≤30%     4      2
                                40%      2      2
7. HBK                 4   75   1p       15     14
                                ≤30%     15     14
                                40%      14     14
8. Factory             5   50   1p       20     0
                                ≤40%     20     0
9. Bush fire           5   38   1p       16     13
                                ≤40%     16     13
10. Wood gravity       6   20   1p       6      5
                                ≤20%     6      5
                                30%      6      6
                                40%      6      5
11. Coleman            6   20   1p       7      7
                                ≤40%     7      7
12. Milk               8   85   1p       20     17
                                ≤30%     20     17
                                40%      18     15

Table 3.6. Number of outliers detected by the Kosinski algorithm with a cutoff of $\chi^2_{p,1-\alpha}$, for α=1% and α=5% respectively, with either one-point (1p) or proportional increment in the range 0-40%.
The Factory data is an interesting case. For α=5% twenty outliers are detected, which is 40% of all points, while detection with α=1% shows no outliers. Explorative data analysis shows that about half the data set is quite narrowly concentrated in a certain region, while the other half is distributed over a much larger space. There is however no clear distinction between these two parts. The more widely distributed part is rather a very thick tail of the other part. In such a case the effect that the algorithm with α=5% tends to detect too many outliers, which was explained in the discussion of Table 3.1, is very strong. It is questionable whether the indicated points should be considered outliers at all.

Let us now discuss the sensitivity of the number of detected outliers to the increment. At low percentages the number of outliers is always the same as for the one-point increment: in fact, at very low percentages the proportional increment procedure leads to an increment of just one point per step, making the two algorithms equal. For most data sets the number of outliers is constant over a wide range of percentages and starts to differ slightly only at 30-40% or higher. Three of the twelve data sets behave differently: the Brain mass data, the Hadi data, and the Stackloss data.

The Brain mass data shows 5 outliers at low percentages for α=5%. At percentages around 15% the number of outliers is only 4, and at 30% only 3. So the number of outliers changes earlier (at 15%) than in most other data sets (at 30% or higher). For α=1% the number of outliers is constant over the whole range. In fact, the three outliers which are found at 30-40% for α=5% are exactly the same as the three outliers found for α=1%. The two outliers which are missed at higher percentages for α=5% both lie just above the cutoff value. Therefore it is disputable whether they are real outliers at all.

The Hadi data shows strange behavior. At all percentages for α=5%, and at most percentages for α=1%, three outliers are found. However, near 10% and near 30% no outliers are detected. Again, the three outliers are disputable. All have a Mahalanobis distance just above the cutoff (see Table 5.2). Hence it is not strange that sometimes these three points are included in the good part (the three points lie close together; hence the inclusion of one of them in the good part leads to low Mahalanobis distances for the other two as well). On the other hand, it is also not a big problem, since it is rather a matter of taste than a matter of science whether to call the three points outliers or good points.

The Stackloss data shows a decreasing number of outliers for α=5% at relatively low percentages, as in the Brain mass data. Here, the sensitivity to the percentage is related to the low ratio n/p, as discussed previously.

In conclusion, for increments up to 30% the same outliers are found as with the one-point increment. In cases where this is not true, the supposed outliers always have an outlyingness slightly above or below the cutoff, so that missing such outliers has no big consequences. Furthermore, relatively low cutoff values could lead to disproportionate swamping.
4. The projection method

4.1 The principle of projection

The projection method is based on the idea that outliers in univariate data are easily recognized, visually as well as by computational means. In one dimension the Mahalanobis distance is simply $|y_i - \bar{y}| / \sigma$. A robust version of the univariate outlyingness is found by replacing the mean by the med and the standard deviation by the mad. Denoting the robust outlyingness by $u_i$, this leads to

$u_i = \frac{|y_i - M|}{S}$

where M and S denote the med and the mad respectively:

$M = \mathrm{med}_k\, y_k$

$S = \mathrm{med}_l\, |y_l - M|$

In the case of multivariate data the idea is to "look" at the data set from all possible directions and to "see" whether a particular data point lies far away from the bulk of the data points. Looking in this context means projecting the data set on a projection vector a; seeing means calculating the outlyingness as is done for univariate data. The ultimate outlyingness of a point is just the maximum of the outlyingnesses over all projection directions. The outlyingness defined in this way corresponds to the multivariate Mahalanobis distance, as is shown in section 2. Recalling the expression for the Mahalanobis distance:

$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t(y_i - \bar{y}))^2}{a^t C a}$

Robustifying the Mahalanobis distance leads to

$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{S}$

Now M and S are defined as follows:

$M = \mathrm{med}_k\, a^t y_k$

$S = \mathrm{med}_l\, |a^t y_l - M|$

It is remarked that $MD_i^2$ corresponds to $u_i^2$.
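As a toy illustration of this principle, the maximum can be approximated with randomly drawn unit vectors; the paper instead uses the angular grid of section 4.2 and a rescaled mad, as explained below. The function name and the n_dirs parameter are ours.

```python
# Toy sketch of the projection principle with random unit directions.
import numpy as np

def projection_outlyingness_random(Y, n_dirs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    A = rng.normal(size=(n_dirs, p))
    A /= np.linalg.norm(A, axis=1, keepdims=True)      # unit projection vectors
    u = np.zeros(n)
    for a in A:
        z = Y @ a                                      # projected data points
        med = np.median(z)
        mad = np.median(np.abs(z - med))
        if mad > 0:
            u = np.maximum(u, np.abs(z - med) / mad)   # robust outlyingness |a^t y_i - M| / S
    return u
```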
How is the maximum calculated? The outlyingness $|a^t y_i - M| / S$ as a function of a could possess several local maxima, making gradient search methods unfeasible. Therefore the outlyingness is calculated on a grid of a finite number of projection vectors. The grid should be fine enough to calculate the maximum outlyingness with sufficient accuracy.

This robust measure of outlyingness was first developed by Stahel and Donoho. More recent work on this subject has been reported by Maronna and Yohai. These authors used the outlyingness to calculate a weighted mean and covariance matrix. Outliers were given small weights, so that the Stahel-Donoho estimator of the mean was robust against the presence of outliers. It is of course possible to use the weighted mean and covariance matrix to calculate a weighted Mahalanobis distance. This is not done in the projection method discussed here.

The robust outlyingness $u_i$ was slightly adjusted for the following reason. The mad of univariate standard normal data, which has a standard deviation of 1 by definition, is 0.674 = 1/1.484. In order to ensure that, in the limiting case of an infinitely large multivariate normal data set, the outlyingness $u_i^2$ is equal to the squared Mahalanobis distance, the mad in the denominator is multiplied by 1.484:

$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{1.484\, S}$
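The consistency factor quoted above can be verified in one line: for a standard normal distribution the mad converges to the 0.75 quantile of N(0,1), whose reciprocal is approximately 1.4826.

```python
# Quick check of the 1.484 factor: 1 / Phi^{-1}(0.75) ~ 1.4826.
from scipy.stats import norm
print(1.0 / norm.ppf(0.75))
```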
4.2 The projection algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, .., y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be summarized as follows.

Step 0. In: data set
The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, .., y_n$, with $y_i = (y_{i1}\; ..\; y_{ip})^t$.

Step 1. Define a grid
There are $\binom{p}{q}$ subsets of q dimensions in the total set of p dimensions. The "maximum search dimension" q is predefined. Projection vectors a in a certain subset are parameterized by the angles $\theta_1, \theta_2, .., \theta_{q-1}$ (a code sketch of the grid and of the full scan follows Step 3):

$a = \begin{pmatrix} \cos\theta_1 \\ \cos\theta_2 \sin\theta_1 \\ \cos\theta_3 \sin\theta_2 \sin\theta_1 \\ \vdots \\ \cos\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \\ \sin\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \end{pmatrix}$

A certain predefined step size step (in degrees) is used to define the grid. The first angle $\theta_1$ can take the values $i \cdot step_1$, with $step_1$ the largest angle smaller than or equal to step for which $180/step_1$ is an integer, and with $i = 1, 2, .., 180/step_1$. The second angle can take the values $j \cdot step_2$, with $step_2$ the largest angle smaller than or equal to $step_1/\cos\theta_1$ for which $180/step_2$ is an integer, and with $j = 1, 2, .., 180/step_2$. The r-th angle can take the values $k \cdot step_r$, with $step_r$ the largest angle smaller than or equal to $step_{r-1}/\cos\theta_{r-1}$ for which $180/step_r$ is an integer, and with $k = 1, 2, .., 180/step_r$. Such a grid is defined in each subset of q dimensions.

Step 2. Outlyingness for each grid point
For each grid point a, calculate the outlyingness of each data point $y_i$:
• Calculate the projections $a^t y_i$.
• Calculate the median $M_a = \mathrm{med}_k\, a^t y_k$.
• Calculate the mad $L_a = \mathrm{med}_l\, |a^t y_l - M_a|$.
• Calculate the outlyingness $u_i(a) = \frac{|a^t y_i - M_a|}{1.484\, L_a}$.

Step 3. Out: outlyingness
The outlyingness $u_i$ is the maximum over the grid: $u_i = \sup_a u_i(a)$.
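The following sketch puts Steps 1-3 together. It uses a regular angular grid with a fixed step per angle (the adaptive refinement of step_r described above is skipped for brevity) and scans every q-dimensional subset of coordinates; the function names are ours.

```python
# Sketch of the projection algorithm: angular grid plus scan over all
# q-dimensional coordinate subsets.
import itertools
import numpy as np

def angular_grid(q, step_deg=10):
    """Yield unit vectors a(theta_1, .., theta_{q-1}) on a regular angle grid."""
    angles = np.deg2rad(np.arange(step_deg, 180 + step_deg, step_deg))
    for thetas in itertools.product(angles, repeat=q - 1):
        a = np.empty(q)
        sin_prod = 1.0
        for r, th in enumerate(thetas):
            a[r] = np.cos(th) * sin_prod           # cos(theta_r) * sin(theta_{r-1}) .. sin(theta_1)
            sin_prod *= np.sin(th)
        a[q - 1] = sin_prod                        # last component: product of all sines
        yield a

def projection_outlyingness(Y, q=2, step_deg=10):
    """Maximum robust outlyingness u_i over all grid directions in all subsets."""
    n, p = Y.shape
    u = np.zeros(n)
    for dims in itertools.combinations(range(p), q):
        Ysub = Y[:, dims]
        for a in angular_grid(q, step_deg):
            z = Ysub @ a                           # projections a^t y_i
            med = np.median(z)
            mad = np.median(np.abs(z - med))
            if mad > 0:
                u = np.maximum(u, np.abs(z - med) / (1.484 * mad))
    return u                                       # compare u_i**2 with the chi^2 cutoff
```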
4.3 Test results

A prototype/test program was implemented in an Excel/Visual Basic environment. Documentation of the program is published elsewhere. We successively tested the amount of swamped observations in data sets containing no outliers, the amount of masked observations in data sets containing outliers, the time dependence of the algorithm on the parameters step and q, and the sensitivity of the number of detected outliers to these parameters in some known data sets.

4.3.1 Swamping

A simulation study was performed in order to determine the average fraction of swamped observations in normally distributed data sets. See section 3.3.2 for more detailed remarks about the swamping effect and about generating the data sets. The results of the simulations are shown in Table 4.1.

α     step   p=1     2       3       4       5
1%    10     0.010   0.011   0.016   0.018   0.023
5%    10     0.049   0.052   0.067   0.071   0.088
1%    30     0.010   0.010   0.012   0.011   0.012
5%    30     0.049   0.049   0.051   0.049   0.058

Table 4.1. The average fraction of swamped observations of the simulations on several generated p-dimensional data sets of 100 points, with cutoff value $\chi^2_{p,1-\alpha}$ and step size step. The parameter q is equal to p.

           fraction of outliers:  0.12   0.23   0.34   0.45
p=2, q=2   d=20                   0.83   1.00   1.00   1.00
           d=40                   0.00   0.00   0.62   1.00
           d=50                   0.00   0.00   0.00   1.00
           d=90                   0.00   0.00   0.00   0.00
p=5, q=2   d=30                   1.00   1.00   1.00   1.00
           d=50                   0.00   0.67   1.00   1.00
           d=80                   0.00   0.00   0.00   1.00
           d=140                  0.00   0.00   0.00   0.00
p=5, q=5   d=30                   0.22   0.54   1.00   1.00
           d=50                   0.00   0.00   0.65   1.00
           d=60                   0.00   0.00   0.00   1.00
           d=120                  0.00   0.00   0.00   0.00

Table 4.2. Average fraction of masked outliers of 2- and 5-dimensional generated data sets (see also section 3.3.3); entries are the fraction of masked outliers for each fraction of outliers.

For low dimensions the average fraction of swamped observations tends to be almost equal to α. The fraction increases, however, with increasing dimension. This is due to the decreasing ratio n/p. It is remarkable that with a step size of 30 the fraction of swamped observations seems to be much better than for step size 10. This is just a coincidence.
The fact that more observations are declared to be outliers is compensated by the fact that the outlyingnesses are usually smaller when large step sizes are used. In fact, the differences between step sizes 10 and 30 are so large for the higher dimensions that this is an indication that a step size of 30 could be too coarse to yield reliable outlyingnesses.

4.3.2 Masking and swamping

The ability of the projection algorithm to detect outliers was tested by generating data sets that contain good points as well as outliers. See section 3.3.3 for details on how the data sets were generated. Results are shown in Table 4.2. In all cases, the ability to detect the outliers is strongly dependent on the contamination of outliers. If there are many outliers, they can only be detected if they lie very far away from the cloud of good points. This is due to the fact that, although the med and the mad have a robustness of 50%, a large concentrated fraction of outliers strongly shifts the med towards the cloud of outliers and enlarges the mad. In higher dimensions it is more difficult to detect the outliers, as in the Kosinski method. The ability to detect the outliers also depends on the maximum search dimension q. If q is taken equal to p, fewer outliers are masked.

4.3.3 Time dependence

The time dependence of the projection algorithm on the step size step and the maximum search dimension q is shown in Table 4.3.

n     p   q   step   t
400   2   2   36     13.0
400   2   2   18     21.0
400   2   2   9      32.7
400   2   2   4.5    56.8
400   3   3   36     28.1
400   3   3   18     68.6
400   3   3   9      209.1
400   3   3   4.5    719.3
50    5   2   9      26.3
100   5   2   9      50.1
200   5   2   9      107.7
400   5   2   9      202.9
100   2   2   9      8.0
100   3   2   9      19.3
100   4   2   9      33.5
100   5   2   9      50.1
100   6   2   9      71.4
100   7   2   9      98.9
100   8   2   9      128.0
100   5   1   9      5.9
100   5   2   9      50.1
100   5   3   9      479.8
100   5   4   9      2489.1
100   5   5   9      4692.1

Table 4.3. Time t (in seconds) per run on p-dimensional data sets of n points, using maximum search dimension q and step size step (in degrees).

Asymptotically the time per run should be proportional to $(n \ln n)\,\binom{p}{q}\left(\frac{180}{step}\right)^{q-1}$, since for each of the $\binom{p}{q}$ subsets a grid is defined with a number of grid points of the order of $(180/step)^{q-1}$, and at each grid point the median of the projected points has to be calculated ($n \ln n$).
The results in the table roughly confirm this theoretical estimate. The most important conclusion from the table is that the time per run strongly increases with the search dimension q. This makes the algorithm only useful for relatively low dimensions.

4.3.4 Sensitivity to parameters

The projection method was tested with the twelve data sets that are fully described in section 5, as was done for the Kosinski method (see section 3.3.6). The results are shown in Table 4.4.

Let us first discuss the differences between α=5% and 1%. In almost all cases the number of outliers detected with α=5% is larger than with α=1%. This is completely due to stronger swamping. It is remarked that there is no algorithmic dependence on the cutoff value, unlike in the Kosinski method. In the projection method a set of outlyingnesses is calculated, and after the calculation a certain cutoff value is used to discriminate between good and bad points. Hence, a smaller cutoff value leads to more outliers, but all points still have the same outlyingness. In the Kosinski method the cutoff value is already used during the algorithm: the cutoff is used to decide whether more points should be added to the good part. A smaller cutoff leads not only to more outliers but also to a different set of outlyingnesses, since the mean and the covariance matrix are calculated with a different set of points. As a consequence, in cases where the Kosinski method possibly shows a rather strong sensitivity to the cutoff value, this sensitivity is missing in the projection method.

Now let us discuss the dependence of the number of outliers on the maximum search dimension q. In the Hertzsprung-Russel data set and in the HBK data set the number of outliers found with q=1 is already as large as that found with higher values of q. In the Brain mass data set and in the Milk data set, the number of outliers for q=1 is however much smaller than for large values of q. In those cases, many outliers are truly multivariate. In the Hadi data set, the Factory data set and the Bush fire data set there is also a rather large discrepancy between q=2 and q=3. It is remarked that the Hadi data set was constructed so that all outliers are invisible when looking at two dimensions only (see section 5.2.4). Also in the other two data sets it is clear that many outliers can only be found by inspecting three or more dimensions at the same time. If q is higher than three, only slightly more outliers are found than for q=3. The differences can be explained by the fact that searching in higher dimensions with the projection method leads to more outliers (see section 4.3.1).
Data set               p   n    q   step   α=5%   α=1%
1. Kosinski            2   100  2   10     78     34
                                2   20     77     34
                                2   30     42     31
2. Brain mass          2   28   2   5      9      6
                                2   10     9      4
                                2   30     8      4
                                1   n/a    3      1
3. Hertzsprung-Russel  2   47   2   1      7      6
                                2   30     6      5
                                2   90     6      5
                                1   n/a    6      5
4. Hadi                3   25   3   5      11     5
                                3   10     8      0
                                2   10     0      0
5. Stackloss           4   21   4   5      14     9
                                4   10     10     9
                                4   15     8      6
                                4   20     9      7
                                4   30     6      6
6. Salinity            4   28   4   10     12     8
                                4   20     9      7
                                3   30     6      4
7. HBK                 4   75   4   10     15     14
                                4   20     14     14
                                1   n/a    14     14
8. Factory             5   50   5   10     24     18
                                5   20     14     9
                                4   10     24     17
                                3   10     22     14
                                2   10     9      9
9. Bush fire           5   38   5   10     24     19
                                5   20     19     17
                                4   10     22     19
                                3   10     21     17
                                2   10     13     12
10. Wood gravity       6   20   5   20     14     14
                                5   30     12     11
                                3   10     15     14
11. Coleman            6   20   5   20     10     8
                                5   30     4      4
12. Milk               8   85   5   20     18     14
                                5   30     15     13
                                4   20     16     14
                                4   30     15     13
                                3   20     15     13
                                3   30     15     12
                                2   20     13     11
                                2   30     12     7
                                1   n/a    6      5

Table 4.4. Number of outliers detected by the projection algorithm with a cutoff of $\chi^2_{p,1-\alpha}$, for α=1% and α=5% respectively, with maximum search dimension q and angular step size step (in degrees).
The sensitivity to the step size is not large in most cases. In cases like the Hadi data, the Stackloss data, the Salinity data and the Coleman data, the sensitivity can be explained by the sparsity of the data sets. A step size of 10-20 degrees seems to work well in most cases.

In conclusion, the number of outliers is not very sensitive to the parameters q and step, although the sensitivity is not completely negligible. In most practical cases q=3 and step=10 work well enough.

5. Comparison of methods

In this section the projection method and the Kosinski method are compared with each other as well as with other robust outlier detection methods. In section 5.1 we briefly describe some other methods reported in the literature. The comparison is made by applying the projection method and the Kosinski method to data sets that have been analyzed by at least one of the other methods. Those data sets and the results of the other methods are described in section 5.2. In section 5.3 the results are discussed. Unfortunately, most papers on outlier detection methods say very little about the efficiency of the methods, i.e. how fast the algorithms are and how the running time depends on the number of points and the dimension of the data set. Therefore we restrict the discussion to the ability to detect outliers.

5.1 Other methods

It is important to note that two different types of outliers are distinguished in the outlier literature. The first type, which is the one used in this report, is a point that lies far away from the bulk of the data. The second type is a point that lies far away from the regression plane formed by the bulk of the data. The two types will be denoted by bulk outliers and regression outliers, respectively. Of course, outliers are often outliers according to both points of view. That is why we compare the results of the projection method and the Kosinski method, which are both bulk outlier methods, also with regression outlier methods. A point that is an outlier according to both points of view is called a bad leverage point. A point that lies far away from the bulk of the points but close to the regression plane is called a good leverage point.

Rousseeuw (1987, 1990) developed the minimum volume ellipsoid (MVE) estimator in order to detect bulk outliers robustly. The principle is to search for the ellipsoid, covering at least half the data points, for which the volume is minimal. The mean and the covariance matrix of the points inside the ellipsoid are inserted in the expression for the Mahalanobis distance. This method is costly due to the complexity of the algorithm that searches for the minimum volume ellipsoid.

A related technique is based on the minimum covariance determinant (MCD) estimator. This technique is employed by Rocke. The aim is to find the set of points, containing at least half the data, for which the determinant of the covariance matrix is minimal. Again, the mean and the covariance matrix determined by that set of points are inserted in the Mahalanobis distance expression. This method is also rather complex, although it has been substantially optimized by Rocke.
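As an illustration of the common pattern behind MVE and MCD, plugging a robust location and scatter estimate into the Mahalanobis distance, the sketch below uses the MCD implementation available in scikit-learn. The library choice, the simulated data and the χ²-based cutoff are assumptions for illustration; they are not part of the methods compared in this report.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
p = 4
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=70)
X[:5] += 8.0                               # plant a few obvious outliers

mcd = MinCovDet(random_state=0).fit(X)     # robust location and scatter (MCD)
dist = np.sqrt(mcd.mahalanobis(X))         # robust Mahalanobis distances
cutoff = np.sqrt(chi2.ppf(0.99, df=p))     # alpha = 1% cutoff
print(np.flatnonzero(dist > cutoff))       # indices of the flagged points
```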
Hadi (1992) developed a bulk outlier method that is very similar to the Kosinski method. He also starts with a set of p+1 "good" points and increases the good set one point at a time. The difference lies in the choice of the first p+1 points. Hadi orders the n points using another robust measure of outlyingness. The question arises why that other outlyingness measure would not itself be appropriate for outlier detection. A reason could be that an arbitrary robust measure of outlyingness deviates relatively strongly from the "real" Mahalanobis distance.

Atkinson combines the MVE method of Rousseeuw with the forward search technique also employed by Kosinski. A few sets of p+1 randomly chosen points are used for a forward search. The set that results in the ellipsoid with minimal volume is used for the calculation of the Mahalanobis distances.

Maronna employed a projection-like method, but a slightly more complicated one. The outlyingnesses are calculated as in the projection method. Then weights are assigned to each point, with low weights for the outlying points, i.e. the influence of outliers is restricted. The mean and the covariance matrix are calculated using these weights; they form the Stahel-Donoho estimator for location and scatter. Finally, Maronna inserts this mean and this covariance matrix into the expression for the Mahalanobis distance.

Egan proposes resampling by the half-mean method (RHM) and the smallest half-volume method (SHV). In the RHM method several randomly selected portions of the data are generated. In each case the outlyingnesses are calculated. For each point it is counted how many times it has a large outlyingness; it is declared a true outlier if this happens often. In the SHV method the distance between each pair of points is calculated and put in a matrix. The column with the smallest sum of the smallest n/2 distances is selected, and the corresponding n/2 points form the smallest half-volume. The mean and the covariance matrix of those points are inserted in the Mahalanobis distance expression. (A sketch of the SHV idea is given below.)

The detection of regression outliers is mainly done with the least median of squares (LMS) method, developed by Rousseeuw (1984, 1987, 1990). Instead of minimizing the sum of the squared residuals, as in the least squares method (which in this context should rather be called the least sum of squares method), the median of the squared residuals is minimized. Outliers are simply the points with large residuals, as calculated with the regression coefficients determined by the LMS method.

Hadi (1993) uses a forward search to detect regression outliers. The regression coefficients of a small good set are determined. The set is increased by subsequently adding the points with the smallest residuals and recalculating the regression coefficients until a certain stopping criterion is fulfilled. A small good set has to be found beforehand.
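The following is a minimal sketch of the SHV selection step as described above, assuming Euclidean distances between points; Egan's exact distance measure and any further refinements are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import chi2

def shv_outlyingness(X, alpha=0.01):
    """Smallest half-volume sketch: choose the column of the pairwise distance
    matrix with the smallest sum of its n/2 smallest entries, take those n/2
    points as the clean half, and use their mean and covariance in the
    Mahalanobis distance."""
    n, p = X.shape
    h = n // 2
    D = cdist(X, X)                                  # pairwise distance matrix
    col_scores = np.sort(D, axis=0)[:h, :].sum(axis=0)
    best = np.argmin(col_scores)                     # column defining the half
    half = np.argsort(D[:, best])[:h]                # the smallest half-volume
    mean = X[half].mean(axis=0)
    cov = np.cov(X[half], rowvar=False)
    diff = X - mean
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    outlyingness = np.sqrt(d2)
    cutoff = np.sqrt(chi2.ppf(1.0 - alpha, df=p))
    return outlyingness, np.flatnonzero(outlyingness > cutoff)
```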
Atkinson combines forward search and LMS. A few sets of p+1 randomly chosen points are used in a forward search. The set that results in the smallest LMS is used for the final determination of the regression residuals.

A completely different approach is the genetic algorithm for the detection of regression outliers by Walczak. We will not describe this approach here, since it lies beyond the scope of deterministic calculation of outlyingnesses.

Fung developed an adding-back algorithm for the confirmation of regression outliers. Once points are declared to be outliers by any other robust method, the points are added back to the data set in a stepwise way. The extent to which the estimates of the regression coefficients are affected by adding a point back is used as a diagnostic measure to decide whether that point is a real outlier. This method was developed because robust outlier methods tend to declare too many points to be outliers.

5.2 Known data sets

In this section the projection method and the Kosinski method are compared by running both algorithms on the twelve data sets given in Table 5.1. Most of these data sets are well described in the robust outlier detection literature; hence we are able to compare the results of the two algorithms with known results. The outlyingnesses calculated by the projection method and the Kosinski method are shown in Table 5.2, Table 5.4 and Table 5.5. In both methods the cutoff value for α=1% is used. In the Kosinski method a proportional increment of 20% was used. The outlyingnesses of the projection method were calculated with q=p (if p<6; if p>5 then q=5) and the lowest step size shown in Table 4.4. We will now discuss the data sets one by one.

Data set                 p    n    Source
1.  Kosinski             2   100   Ref. [1]
2.  Brain mass           2    28   Ref. [3]
3.  Hertzsprung-Russel   2    47   Ref. [3]
4.  Hadi                 3    25   Ref. [4]
5.  Stackloss            4    21   Ref. [3]
6.  Salinity             4    28   Ref. [3]
7.  HBK                  4    75   Ref. [3]
8.  Factory              5    50   This work
9.  Bush fire            5    38   Ref. [5]
10. Wood gravity         6    20   Ref. [6]
11. Coleman              6    20   Ref. [3]
12. Milk                 8    85   Ref. [7]

Table 5.1. The name, the dimension p, the number of points n, and the source of the tested data sets.

5.2.1 Kosinski data

The Kosinski data form a data set that is difficult to handle from the point of view of robust outlier detection. The two-dimensional data set contains 100 points. Points 1-40 are generated from a bivariate normal distribution with µ1 = 18, µ2 = −18, σ1² = σ2² = 1, ρ = 0, and are considered to be outliers. Points 41-100 are good points, sampled from the bivariate normal distribution with µ1 = 0, µ2 = 0, σ1² = σ2² = 40, ρ = 0.7.
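For readers who want to reproduce a data set of this type, the following is a minimal sketch of the construction just described; the random seed and the exact sample values are of course not those of Kosinski.

```python
import numpy as np

rng = np.random.default_rng(42)

# Points 1-40: tightly clustered outliers around (18, -18), unit variances, rho = 0.
outliers = rng.multivariate_normal([18.0, -18.0], np.eye(2), size=40)

# Points 41-100: good points around (0, 0) with variances 40 and correlation 0.7.
rho, var = 0.7, 40.0
cov_good = np.array([[var, rho * var],
                     [rho * var, var]])
good = rng.multivariate_normal([0.0, 0.0], cov_good, size=60)

X = np.vstack([outliers, good])   # rows 0-39 correspond to points 1-40
```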
The Kosinski method correctly identifies all outliers (see Table 5.2). The projection method identifies none of the outliers and declares many good points to be outliers. The reason for this failure is the large contamination and the small scatter of the outliers. Since there are so many outliers, they strongly shift the median towards the outliers; hence the outliers are not detected. Furthermore, since the outliers are narrowly distributed, they almost completely determine the median of absolute deviations in the projection direction perpendicular to the vector pointing from the center of the good points to the center of the outliers. As a result, many points lying at the end points of the ellipsoid of good points have a large outlyingness. It is remarked that this is not an arbitrarily chosen data set: it was generated by Kosinski in order to demonstrate the superiority of his own method over other methods.

5.2.2 Brain mass data

The Brain mass data contain three outliers according to the Kosinski method: points 6, 16 and 25. Those points are also indicated to be outliers by Rousseeuw (1990) and Hadi (1992). Those authors also declare point 14 to be an outlier, but with an outlyingness only slightly above the cutoff. The projection method declares points 6, 14, 16, 17, 20 and 25 to be outliers.

5.2.3 Hertzsprung-Russel data

The two methods produce almost the same outlyingnesses for all points. Both declare points 11, 20, 30 and 34 to be large outliers, in agreement with the results of Rousseeuw (1987) and Hadi (1993). However, the projection method and the Kosinski method also declare points 7 and 14 to be outliers, and point 9 is an outlier according to the Kosinski method. The outlyingnesses of these three points are relatively small. Visual inspection of the data (see page 28 in Rousseeuw (1987)) shows that these points are indeed moderately outlying.

5.2.4 Hadi data

The Hadi data set is an artificial one. It contains three variables, x1, x2 and y. The two predictors were originally created as uniform(0,15) and were then transformed to have a correlation of 0.5. The target variable was then created as y = x1 + x2 + ε with ε ~ N(0,1). Finally, cases 1-3 were perturbed to have predictor values around (15,15) and to satisfy y = x1 + x2 + 4. The Kosinski method finds these outliers, with relatively small outlyingnesses. The projection method finds them too, but also declares two good points to be outliers.
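A rough reconstruction of this type of data set is sketched below. The way the correlation of 0.5 is induced and the size of the perturbation noise are assumptions, since the original transformation is not specified here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 25

# Two predictors, nominally uniform(0, 15); correlation 0.5 is induced by mixing
# (an assumption -- the original transformation is not specified).
u1, u2 = rng.uniform(0, 15, n), rng.uniform(0, 15, n)
x1 = u1
x2 = 0.5 * u1 + np.sqrt(1 - 0.5**2) * u2

y = x1 + x2 + rng.normal(0.0, 1.0, n)      # regression relation with N(0,1) noise

# Perturb cases 1-3: predictors near (15, 15) and y = x1 + x2 + 4.
x1[:3] = 15 + rng.normal(0, 0.2, 3)
x2[:3] = 15 + rng.normal(0, 0.2, 3)
y[:3] = x1[:3] + x2[:3] + 4

X = np.column_stack([x1, x2, y])           # 25 cases, the first three outlying
```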
  • 27. Robust multivariate outlier detection A: Kosinski Brain mass Hertzsprung-Russel Hadi B: 3,035 3,035 3,035 3,368 C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos 1 2,59 7,45 51 4,37 1,01 1 1,79 0,75 1 0,80 1,20 1 4,75 3,47 2 2,80 7,96 52 1,53 0,98 2 1,05 1,13 2 1,39 1,46 2 4,75 3,47 3 2,46 7,14 53 2,22 1,05 3 0,37 0,16 3 1,41 1,83 3 4,76 3,46 4 2,87 8,21 54 4,69 1,32 4 0,65 0,13 4 1,39 1,46 4 2,86 1,84 5 2,78 7,97 55 3,97 1,50 5 1,99 0,92 5 1,42 1,90 5 0,96 0,70 6 2,59 7,48 56 3,47 1,44 6 8,40 6,19 6 0,80 1,04 6 3,43 1,57 7 2,84 8,09 57 4,59 2,55 7 2,08 1,27 7 5,55 6,35 7 2,21 0,91 8 2,75 7,89 58 2,27 0,37 8 0,66 0,55 8 1,44 1,38 8 0,46 0,36 9 2,51 7,22 59 2,96 0,51 9 0,94 0,91 9 2,59 3,26 9 0,99 0,35 10 2,45 7,12 60 2,22 0,54 10 1,93 0,99 10 0,61 0,93 10 1,74 1,34 11 2,69 7,71 61 4,94 1,83 11 1,23 0,51 11 11,01 12,67 11 2,50 1,65 12 2,84 8,12 62 5,07 1,29 12 0,96 0,90 12 0,91 1,21 12 1,54 1,13 13 2,77 7,95 63 4,66 1,13 13 0,64 0,60 13 0,79 0,88 13 2,81 1,25 14 2,68 7,72 64 1,68 1,17 14 3,87 2,21 14 3,04 3,51 14 0,98 0,68 15 2,37 6,95 65 3,32 1,03 15 2,22 1,44 15 1,55 1,22 15 2,65 1,37 16 2,46 7,17 66 2,25 1,03 16 7,54 5,63 16 1,23 0,99 16 0,97 0,84 17 2,64 7,59 67 2,59 1,13 17 3,18 1,83 17 2,17 1,80 17 3,31 1,64 18 2,40 6,96 68 3,89 1,04 18 0,90 0,92 18 2,17 2,04 18 3,17 1,39 19 2,46 7,11 69 1,82 0,88 19 3,00 1,43 19 1,77 1,54 19 2,78 1,49 20 2,45 7,15 70 5,96 1,59 20 3,59 1,71 20 11,26 13,01 20 2,94 1,37 21 2,70 7,71 71 2,29 0,70 21 1,54 0,66 21 1,35 1,07 21 0,90 0,66 22 2,62 7,54 72 3,91 0,86 22 0,50 0,25 22 1,62 1,28 22 1,61 1,27 23 2,82 8,11 73 2,15 1,30 23 0,66 0,74 23 1,60 1,41 23 3,89 1,39 24 2,68 7,67 74 6,76 2,00 24 2,18 1,11 24 1,21 1,10 24 2,80 1,22 25 2,37 6,88 75 6,20 2,01 25 8,97 6,75 25 0,34 0,58 25 2,04 1,12 26 2,75 7,86 76 3,37 0,77 26 2,61 1,24 26 1,04 0,78 27 2,67 7,70 77 2,67 0,49 27 2,59 1,41 27 0,88 1,07 28 2,85 8,14 78 1,83 0,50 28 1,13 1,17 28 0,36 0,33 29 2,78 7,98 79 4,19 2,45 29 1,43 1,60 30 2,78 8,00 80 2,71 0,46 30 11,61 13,48 31 2,45 7,14 81 4,49 1,12 31 1,36 1,09 32 2,91 8,29 82 2,74 0,79 32 1,59 1,48 33 2,51 7,27 83 1,62 0,31 33 0,49 0,52 34 2,33 6,80 84 2,81 0,47 34 11,87 13,88 35 2,68 7,72 85 5,94 1,57 35 1,50 1,50 36 2,82 8,08 86 3,50 1,01 36 1,57 1,70 37 2,52 7,31 87 1,38 1,93 37 1,27 1,13 38 2,65 7,66 88 2,21 1,57 38 0,49 0,52 39 2,49 7,18 89 5,47 1,73 39 1,14 1,03 40 2,61 7,52 90 3,07 1,44 40 1,17 1,52 41 1,89 0,50 91 2,94 1,54 41 0,88 0,60 42 1,84 0,41 92 6,02 1,59 42 0,46 0,30 43 7,94 2,03 93 3,65 0,80 43 0,81 0,77 44 3,04 0,61 94 3,89 0,98 44 0,61 0,80 45 2,35 0,67 95 6,68 1,64 45 1,17 1,19 46 6,42 1,76 96 2,50 0,84 46 0,58 0,37 47 5,36 1,68 97 4,59 1,32 47 1,41 1,20 48 3,74 0,77 98 5,65 1,46 49 3,92 0,92 99 2,12 1,64 50 6,53 1,78 100 2,31 0,30 Table 5.2. The outlyingness of each point of the Kosinski, the Brain mass, the Hertzsprung- Russel and the Hadi data. A: Name of data set. B: Cutoff value for • =1%; outlyingnesses higher than the cutoff are shown in bold. C: Method (Proj: projection method; Kos: Kosinski method). 26
The projection method finds consistently larger outlyingnesses than the Kosinski method, roughly a factor of 2 for most points. This is related to the sparsity of the data set. Consider for instance the extreme case of three points in two dimensions. Every point will have an infinitely large outlyingness according to the projection method. This can be understood by noting that the mad of the projected points is zero if the projection vector intersects two points; the remaining point then has an infinite outlyingness. For data sets with more points the situation is less extreme, but as long as there are relatively few points the projection outlyingnesses will be relatively large. In such a case the cutoff values based on the χ²-distribution are in fact too low, leading to the swamping effect.

5.2.5 Stackloss data

The Stackloss data outlyingnesses show large differences between the two methods. One of the reasons is the sensitivity of the Kosinski results to the cutoff value in this case, as is discussed in section 3. If a cutoff value √(χ²_{4,0.95}) = 3.080 is used instead of √(χ²_{4,0.99}) = 3.644, the Kosinski method shows the outlyingnesses given in Table 5.3.

      outl.         outl.         outl.
 1    4.73      8   0.98     15   1.07
 2    3.30      9   0.76     16   0.87
 3    4.42     10   0.98     17   1.14
 4    4.19     11   0.83     18   0.71
 5    0.63     12   0.93     19   0.80
 6    0.76     13   1.24     20   1.04
 7    0.87     14   1.04     21   3.80

Table 5.3. The outlyingnesses of the Stackloss data, calculated with the Kosinski method with cutoff value √(χ²_{4,0.95}) = 3.080. Outlyingnesses above this value are shown in bold; outlyingnesses that are even higher than √(χ²_{4,0.99}) = 3.644 are shown in bold italic.

Here 5 points have an outlyingness exceeding the cutoff value for α=5%, four of them (points 1, 3, 4 and 21) even above the value for α=1%. Even in this case the differences with the projection method are large: the projection outlyingnesses are up to 5 times larger than the Kosinski ones. For comparison, Walczak and Atkinson declared points 1, 3, 4 and 21 to be outliers, Rocke indicated point 2 as an outlier as well, while points 1, 2, 3 and 21 are outliers according to Hadi (1992). These results are comparable with the results of the Kosinski method with α=5%. Hence, considering the results in Table 5.4, the Kosinski method finds too few outliers and the projection method too many. In both cases the origin lies in the low n/p ratio.
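The cutoff values quoted above and in the table headers are consistent with the square root of a χ² quantile. The snippet below shows how such cutoffs can be computed; the use of scipy for the quantile is an assumption, not part of the report's implementation.

```python
from scipy.stats import chi2

def cutoff(p, alpha):
    """Outlyingness cutoff: square root of the (1 - alpha) quantile of the
    chi-square distribution with p degrees of freedom."""
    return chi2.ppf(1.0 - alpha, df=p) ** 0.5

print(f"{cutoff(4, 0.05):.3f}")   # 3.080, the alpha = 5% cutoff for p = 4
print(f"{cutoff(4, 0.01):.3f}")   # 3.644, the alpha = 1% cutoff for p = 4
```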
  • 29. Robust multivariate outlier detection A: Stackloss Salinity HBK Factory B: 3,644 3,644 3,644 3,884 C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos 1 8,42 1,62 1 2,67 1,29 1 30,38 32,34 51 1,99 1,64 1 5,23 2,12 2 6,92 1,53 2 2,58 1,46 2 31,36 33,36 52 2,20 2,06 2 5,66 1,67 3 8,14 1,45 3 4,65 1,84 3 32,81 34,90 53 3,18 2,80 3 5,55 1,91 4 9,00 1,51 4 3,54 1,63 4 32,60 34,97 54 2,13 1,96 4 4,57 2,05 5 1,74 0,41 5 6,06 4,06 5 32,71 34,92 55 1,57 1,22 5 3,28 2,34 6 2,33 0,82 6 3,12 1,41 6 31,42 33,49 56 1,78 1,46 6 2,19 1,48 7 3,45 1,31 7 2,62 1,25 7 32,34 34,33 57 1,81 1,61 7 2,27 1,49 8 3,45 1,24 8 2,87 1,59 8 31,35 33,24 58 1,67 1,55 8 1,85 1,23 9 2,15 1,11 9 3,31 1,90 9 32,13 34,35 59 0,89 1,13 9 2,15 1,17 10 4,26 1,16 10 2,08 0,91 10 31,84 33,86 60 2,08 2,05 10 3,56 1,70 11 3,01 1,11 11 2,76 1,24 11 28,95 32,68 61 1,78 1,99 11 3,64 1,87 12 3,30 1,34 12 0,77 0,43 12 29,42 33,82 62 2,29 2,00 12 3,67 1,99 13 3,25 1,01 13 2,36 1,28 13 29,42 33,82 63 1,70 1,70 13 2,24 1,43 14 3,75 1,15 14 2,52 1,24 14 33,97 36,63 64 1,62 1,75 14 2,13 1,79 15 3,90 1,20 15 3,71 2,16 15 1,99 1,89 65 1,90 1,85 15 1,84 1,29 16 2,88 0,85 16 14,83 8,08 16 2,33 2,03 66 1,78 1,87 16 3,52 2,34 17 7,09 1,78 17 3,68 1,60 17 1,65 1,74 67 1,34 1,20 17 2,42 1,79 18 3,56 0,98 18 1,84 0,82 18 0,86 0,70 68 2,93 2,20 18 5,55 2,49 19 3,07 1,04 19 2,93 1,79 19 1,54 1,18 69 1,97 1,56 19 5,65 1,76 20 2,48 0,61 20 2,00 1,22 20 1,67 1,95 70 1,59 1,93 20 5,91 2,83 21 8,85 2,11 21 2,50 0,95 21 1,57 1,76 71 0,75 1,01 21 4,35 1,90 22 3,34 1,23 22 1,90 1,70 72 1,00 0,83 22 2,20 1,63 23 5,20 2,07 23 1,72 1,72 73 1,70 1,53 23 2,77 1,62 24 4,62 1,90 24 1,70 1,56 74 1,77 1,80 24 2,14 0,90 25 0,77 0,42 25 2,06 1,83 75 2,44 1,98 25 3,11 2,13 26 1,80 0,87 26 1,73 1,80 26 2,27 1,31 27 2,85 1,11 27 2,17 2,01 27 4,88 2,02 28 3,72 1,48 28 1,41 1,13 28 5,08 2,67 29 1,33 1,13 29 4,49 2,59 30 2,04 1,86 30 1,91 1,27 31 1,61 1,53 31 1,13 0,83 32 1,78 1,70 32 2,00 1,34 33 1,55 1,45 33 3,13 2,05 34 2,10 2,07 34 2,43 1,70 35 1,41 1,80 35 5,96 2,82 36 1,63 1,61 36 5,78 2,25 37 1,75 1,87 37 5,75 1,83 38 2,01 1,86 38 4,14 1,62 39 2,16 1,93 39 3,16 2,19 40 1,25 1,17 40 2,77 1,62 41 1,65 1,81 41 2,75 1,86 42 1,91 1,72 42 2,56 1,67 43 2,50 2,17 43 4,54 2,15 44 2,04 1,91 44 4,25 1,89 45 2,07 1,86 45 3,91 2,14 46 2,04 1,91 46 2,10 1,52 47 2,92 2,56 47 1,06 0,84 48 1,40 1,70 48 1,47 1,10 49 1,73 2,01 49 3,34 2,16 50 1,05 1,36 50 2,51 1,39 Table 5.4. The outlyingness of each point of the Stackloss, the Salinity, the HBK and the Factory data. A, B, C: see Table 5.2. 28
  • 30. Robust multivariate outlier detection A: Bush fire Wood gravity Coleman Milk B: 3,884 4,100 4,100 4,482 C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos 1 3,48 1,38 1 4,72 2,65 1 3,56 2,84 1 9.06 9,46 51 2.62 1,98 2 3,27 1,04 2 2,71 1,20 2 4,92 6,37 2 10.57 10,81 52 3.64 2,98 3 2,76 1,11 3 3,68 2,19 3 6,76 2,94 3 4.04 5,09 53 2.38 2,22 4 2,84 1,02 4 14,45 33,75 4 2,99 1,53 4 3.86 2,83 54 1.22 1,16 5 3,85 1,40 5 3,02 2,80 5 2,70 1,43 5 2.23 2,52 55 1.68 1,69 6 4,92 1,90 6 16,19 38,83 6 5,74 10,43 6 2.97 2,84 56 1.10 1,01 7 11,79 4,37 7 7,90 5,00 7 3,11 2,23 7 2.36 2,35 57 1.96 2,19 8 17,96 11,87 8 15,85 37,88 8 1,48 1,83 8 2.32 2,08 58 2.05 1,95 9 18,36 12,18 9 6,12 2,72 9 2,49 5,95 9 2.58 2,49 59 1.47 2,21 10 14,75 7,64 10 8,59 2,37 10 5,71 12,04 10 2.20 1,98 60 2.04 1,76 11 12,31 6,76 11 5,38 3,04 11 5,07 7,70 11 5.28 4,60 61 1.48 1,42 12 6,17 2,38 12 6,79 2,65 12 4,31 2,77 12 6.65 6,05 62 2.64 2,07 13 5,83 1,77 13 7,14 1,98 13 3,49 2,92 13 5.63 5,38 63 2.33 2,60 14 2,30 1,59 14 2,38 2,09 14 1,95 2,16 14 6.17 5,48 64 2.58 1,90 15 4,70 1,55 15 2,40 1,47 15 6,11 6,56 15 5.47 5,73 65 1.85 1,56 16 3,43 1,38 16 4,74 2,86 16 2,18 2,30 16 3.84 4,56 66 2.01 1,64 17 3,06 0,92 17 6,07 2,12 17 3,78 5,95 17 3.59 4,76 67 3.28 2,59 18 2,75 1,41 18 3,28 2,49 18 7,86 3,09 18 3.74 3,30 68 2.41 2,33 19 2,82 1,38 19 18,33 44,49 19 3,48 2,11 19 2.43 2,85 69 46.45 44,61 20 2,89 1,20 20 7,16 2,07 20 2,80 1,56 20 4.14 3,44 70 1.99 1,87 21 2,47 1,13 21 2.26 2,08 71 2.19 2,27 22 2,44 1,73 22 1.69 1,59 72 3.24 3,02 23 2,46 1,04 23 1.81 2,04 73 6.89 6,99 24 3,44 1,04 24 2.28 2,05 74 5.01 4,90 25 1,90 0,91 25 2.81 2,83 75 2.02 2,03 26 1,69 0,97 26 1.83 2,09 76 4.77 4,51 27 2,27 0,99 27 4.24 3,71 77 1.35 1,43 28 3,31 1,35 28 3.29 3,04 78 1.49 1,87 29 4,82 1,83 29 3.19 2,57 79 2.93 2,66 30 5,06 2,18 30 1.47 1,39 80 1.40 1,38 31 6,00 5,66 31 2.87 2,29 81 2.59 2,34 32 13,48 14,08 32 2.37 2,66 82 2.14 2,42 33 15,34 16,35 33 1.78 1,33 83 3.00 2,56 34 15,10 16,11 34 2.09 1,96 84 3.88 3,06 35 15,33 16,43 35 2.73 2,10 85 2.19 2,36 36 15,02 16,04 36 2.66 2,32 37 15,17 16,30 37 2.61 2,23 38 15,25 16,41 38 2.23 2,07 39 2.27 2,07 40 3.31 2,89 41 10.63 10,11 42 3.69 3,04 43 3.20 2,85 44 7.67 6,08 45 1.99 2,28 46 1.78 2,41 47 5.19 5,35 48 2.92 2,58 49 3.43 2,70 50 3.96 2,69 Table 5.5. The outlyingness of each point of the Bush fire, the Wood gravity, the Coleman, and the Milk data. A, B, C: see Table 5.2. 29
5.2.6 Salinity data

The outlyingnesses of the Salinity data are roughly two times larger for the projection method than for the Kosinski method. As a consequence, the latter shows just 2 outliers (points 5 and 16), the former 8. Rousseeuw (1987) and Walczak agree that points 5, 16, 23 and 24 are outliers, with points 23 and 24 lying just above the cutoff. Fung finds the same points at first instance, but after applying his adding-back algorithm he concludes that point 16 is the only outlier. The projection method shows too many outliers, while the Kosinski method misses points 23 and 24.

5.2.7 HBK data

In the case of the HBK data the projection method and the Kosinski method agree completely. Both indicate points 1-14 to be outliers. This is also in agreement with the results of the original Kosinski method and of Egan, Hadi (1992, 1993), Rocke, Rousseeuw (1987, 1990), Fung and Walczak. It is remarked that some of these authors only find points 1-10 as outliers, but they use the "regression" definition of an outlier. The HBK data set is an artificial one, in which the good points lie along a regression plane. Points 1-10 are bad leverage points, i.e. they lie far away from the center of the good points and from the regression plane as well. Points 11-14 are good leverage points, i.e. although they lie far away from the bulk of the data they still lie close to the regression plane. If one considers the distance from the regression plane, points 11-14 are not outliers.

5.2.8 Factory data

The Factory data set is a new one.¹ It is given in Table 5.6. The outlyingnesses show a big discrepancy between the two methods. The projection outlyingnesses are much larger than the Kosinski ones, resulting in 18 versus 0 outliers. The outlyingnesses are so large due to the shape of the data: about half the data set is quite narrowly concentrated around the center of the data, while the other half forms a rather thick tail. Hence, in many projection directions the mad is very small, leading to large outlyingnesses for the points in the tail. It is remarked that the projection outliers compare well with the Kosinski outliers found with the cutoff for α=5% (see also section 3.3.6).

¹ The Factory data is a generated data set, originally used in an exercise on regression analysis in the CBS course "multivariate techniques with SPSS". It is interesting to note that the regression coefficients change radically if the points that are indicated to be outliers by the projection method and by the Kosinski method with the low cutoff are removed from the data set. In other words, the regression coefficients are mainly determined by the "outlying" points.
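The footnote's observation can be checked with an ordinary least-squares fit of the water consumption on the other variables, once with and once without the flagged months. The sketch below assumes the Factory data of Table 5.6 have been loaded as a NumPy array `factory` with columns x1-x5, and that `flagged` holds the indices of the points declared to be outliers; both names are hypothetical.

```python
import numpy as np

def ols_coefficients(X, y):
    """Ordinary least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def compare_fits(factory, flagged):
    """Fit x5 on x1..x4 with all 50 months and with the flagged months removed."""
    X, y = factory[:, :4], factory[:, 4]
    keep = np.setdiff1d(np.arange(len(y)), flagged)
    full = ols_coefficients(X, y)
    reduced = ols_coefficients(X[keep], y[keep])
    return full, reduced   # per the footnote, these should differ markedly
```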
  • 32. Robust multivariate outlier detection x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 1 14.9 7.107 21 129 11.609 26 12.3 12.616 20 192 11.478 2 8.4 6.373 22 141 10.704 27 4.1 14.019 20 177 14.261 3 21.6 6.796 22 153 10.942 28 6.8 16.631 23 185 15.300 4 25.2 9.208 20 166 11.332 29 6.2 14.521 19 216 10.181 5 26.3 14.792 25 193 11.665 30 13.7 13.689 22 188 13.475 6 27.2 14.564 23 189 14.754 31 18 14.525 21 192 14.155 7 22.2 11.964 20 175 13.255 32 22.8 14.523 21 183 15.401 8 17.7 13.526 23 186 11.582 33 26.5 18.473 22 205 14.891 9 12.5 12.656 20 190 12.154 34 26.1 15.718 22 200 15.459 10 4.2 14.119 20 187 12.438 35 14.8 7.008 21 124 10.768 11 6.9 16.691 22 195 13.407 36 18.7 6.274 21 145 12.435 12 6.4 14.571 19 206 11.828 37 21.2 6.711 22 153 9.655 13 13.3 13.619 22 198 11.438 38 25.1 9.257 22 169 10.445 14 18.2 14.575 22 192 11.060 39 26.3 14.832 25 191 13.150 15 22.8 14.556 21 191 14.951 40 27.5 14.521 24 177 14.067 16 26.1 18.573 21 200 16.987 41 17.6 13.533 24 186 12.184 17 26.3 15.618 22 200 12.472 42 12.4 12.618 21 194 12.427 18 14.8 7.003 22 130 9.920 43 4.3 14.178 20 181 14.863 19 18.2 6.368 22 144 10.773 44 6 16.612 21 192 14.274 20 21.3 6.722 21 123 15.088 45 6.6 14.513 20 213 10.706 21 25 9.258 20 157 13.510 46 13.1 13.656 22 192 13.191 22 26.1 14.762 24 183 13.047 47 18.2 14.525 21 191 12.956 23 27.4 14.464 23 177 15.745 48 22.8 14.486 21 189 13.690 24 22.4 11.864 21 175 12.725 49 26.2 18.527 22 200 17.551 25 17.9 13.576 23 167 12.119 50 26.1 15.578 22 204 13.530 Table 5.6. The Factory data (n=50, p=5). The average temperature (x1, in degrees Celsius), the production (x2, in 1000 pieces), the number of working days (x3), the number of employees (x4) and the water consumption (x5, in 1000 liters) at a factory in 50 successive months. 5.2.9 Bushfire data The outliers found by the adjusted Kosinski method (points 7-11, 31-38) agree perfectly with those found by the original algorithm of Kosinski and with the results by Rocke and Maronna. The projection method shows as additional outliers points 6, 12, 13, 15, 29 and 30. Due to the large contamination the projected median is shifted strongly, leading to relatively large outlyingnesses for the good points and, consequently, many swamped points. 5.2.10 Wood gravity data Rousseeuw (1984), Hadi (1993), Atkinson, Rocke and Egan declare points 4, 6, 8 and 19 to be outliers. The Kosinski method finds these outliers too, but outlier 7 is additional. The projection method shows strange results. Fourteen points have an outlyingness above the cutoff, which is 70% of the data set. This is of course not realistic. The reason is again the sparsity of the data set. Hence, it is rather surprising that the Kosinski method and the methods by other authors perform relatively well in this case. 5.2.11 Coleman data The Coleman data contain 8 outliers according to the projection method, 7 according to the Kosinski method. However, they agree only upon 5 points (2, 6, 10, 11, 15). 31