# Robust outlier detection

Published in: Education, Technology

### Robust outlier detection

Statistics Netherlands
Division Research and Development
Department of Statistical Methods

ROBUST MULTIVARIATE OUTLIER DETECTION

Peter de Boer and Vincent Feltkamp

Summary: Two robust multivariate outlier detection methods, based on the Mahalanobis distance, are reported: the projection method and the Kosinski method. The ability of those methods to detect outliers is exhaustively tested. A comparison is made between the two methods as well as a comparison with other robust outlier detection methods that are reported in the literature.

The opinions in this paper are those of the authors and do not necessarily reflect those of Statistics Netherlands.

Project number: RSM-80820
BPA number: 324-00-RSM/INTERN
Date: 19-jul-00
1. Introduction

The statistical process can be separated into three steps. The input phase involves the collection of data by means of surveys and registers. The throughput phase involves preparing the raw data for tabulation purposes, weighting and variance estimation. The output phase involves the publication of population totals, means, correlations, etc., which have come out of the throughput phase.

Data editing is one of the first steps in the throughput process. It is the procedure for detecting and adjusting individual errors in data. Editing also comprises the detection and treatment of correct but influential records, i.e. records that make a substantial contribution to the aggregates to be published.

The search for suspicious records, i.e. records that are possibly wrong or influential, can be done in basically two ways. The first way is by examining each record and looking for strange or wrong fields or combinations of fields. In this view a record includes all fields referring to a particular unit, be it a person, household or business unit, even if those fields are stored in separate files, like files containing survey data and files containing auxiliary data.

The second way is by comparing each record with the other records. Even if the fields of a particular record obey all the edit rules one has laid down, the record could be an outlier. An outlier is a record that does not follow the bulk of the records.

The data can be seen as a rectangular file, each row denoting a particular record and each column a particular variable. The first way of searching for suspicious data can be seen as searching in rows, the second way as searching in columns. It is remarked that some, and possibly many, errors can be detected in both ways.

Records could be outliers while their outlyingness is not apparent from examining the variables, or columns, one by one.
For instance, a company that has a relatively large turnover but has paid relatively little tax might not be an outlier in either one of the variables, but could be an outlier considering the combination. Outliers involving more than one variable are multivariate outliers.

In order to quantify how far a record lies from the bulk of the data, one needs a measure of distance. In the case of categorical data no useful distance measure exists, but in the case of continuous data the so-called Mahalanobis distance is often employed.

A distance measure should be robust against the presence of outliers. It is known that the classical Mahalanobis distance is not. This means that the outliers, which are to be detected, seriously hamper the detection of those outliers. Hence, a robust version of the Mahalanobis distance is needed.

In this report two robust multivariate outlier detection algorithms for continuous data, based on the Mahalanobis distance, are reported. In the next section the classical Mahalanobis distance is introduced and ways to robustify this distance measure are discussed. In sections 3 and 4 the two algorithms, successively the
Kosinski method and the projection method, are presented. In section 5 a comparison between the two algorithms is made, as well as a comparison with other algorithms reported in the outlier literature. A practical example, and the problems involved with it, is the subject of section 6. In section 7 some concluding remarks are made.

2. The Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point and the center of all points, with respect to the scale of the data and, in the multivariate case, with respect to the shape of the data as well. It is remarked that in regression analysis another distance measure is more convenient: instead of the distance between a point and the center of the data, the distance between the point and the regression plane (see also section 5).

Suppose we have a continuous data set $y_1, y_2, \ldots, y_n$. The vectors $y_i$ are p-dimensional, i.e. $y_i = (y_{i1}\ y_{i2}\ \ldots\ y_{ip})^t$, where $y_{iq}$ denotes a real number. The classical squared Mahalanobis distance is defined by

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

where $\bar{y}$ and $C$ denote the mean and the covariance matrix respectively:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

$$C = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^t$$

In the case of one-dimensional data the covariance matrix reduces to the variance and the Mahalanobis distance to $MD_i = |y_i - \bar{y}| / \sigma$, where $\sigma$ denotes the standard deviation.

Another point of view results from noting that the Mahalanobis distance is the solution of a maximization problem. The maximization problem is defined as follows. The data points $y_i$ can be projected on a projection vector $a$. The outlyingness of the point $y_i$ is the squared projected distance $(a^t (y_i - \bar{y}))^2$, taken relative to the projected variance $a^t C a$. Assuming that the covariance matrix $C$ is positive definite, there exists a non-singular matrix $A$ such that $A^t C A = I$. Using the Cauchy-Schwarz inequality we have
$$
\begin{aligned}
\frac{(a^t (y_i - \bar{y}))^2}{a^t C a}
&= \frac{\left(a^t (A^t)^{-1} A^t (y_i - \bar{y})\right)^2}{a^t C a} \\
&\le \frac{(A^{-1} a)^t (A^{-1} a)\,(y_i - \bar{y})^t A A^t (y_i - \bar{y})}{a^t C a} \\
&= \frac{a^t (A A^t)^{-1} a\,(y_i - \bar{y})^t A A^t (y_i - \bar{y})}{a^t C a} \\
&= (y_i - \bar{y})^t C^{-1} (y_i - \bar{y}) = MD_i^2
\end{aligned}
$$

with equality if and only if $A^{-1} a = c A^t (y_i - \bar{y})$ for some constant $c$. The last two steps use the fact that $A^t C A = I$ implies $C = (A A^t)^{-1}$, so that $C^{-1} = A A^t$. Hence

$$MD_i^2 = \sup_{a^t a = 1} \frac{(a^t (y_i - \bar{y}))^2}{a^t C a}$$

i.e., the Mahalanobis distance is equal to the supremum of the outlyingness of $y_i$ over all possible projection vectors.

If the data set $y_i$ is multivariate normal, the squared Mahalanobis distances $MD_i^2$ follow the $\chi^2$ distribution with p degrees of freedom.

The classical Mahalanobis distance suffers, however, from the masking and swamping effect. Outliers seriously affect the mean and the covariance matrix in such a way that the Mahalanobis distance of outliers could be small (masking), while the Mahalanobis distance of points which are not outliers could be large (swamping).

Therefore, robust estimates of the center and the covariance matrix should be found in order to calculate a useful Mahalanobis distance. In the univariate case the most robust choice is the median (med) and the median of absolute deviations (mad), replacing the mean and the standard deviation respectively. The med and mad have a robustness of 50%. The robustness of a quantity is defined as the maximum percentage of data points that can be moved arbitrarily far away while the change in that quantity remains bounded.

It is not trivial to generalize the robust one-dimensional Mahalanobis distance to the multivariate case. Several robust estimators for the location and scale of multivariate data have been developed. We have tested two methods, the projection method and the Kosinski method.
Other methods for robust outlier detection will be discussed in section 5, where we will compare the different methods on their ability to detect outliers.

In the next two sections the Kosinski method and the projection method will be discussed in detail.
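
As an illustration of the masking effect described in section 2, the following is a minimal sketch, not part of the original report, that computes the classical squared Mahalanobis distance with NumPy. The function name `squared_mahalanobis`, the generated data (80 good points and a tight cluster of 20 outliers) and the random seed are assumptions chosen for the example. The cutoff uses the fact that for p = 2 the $\chi^2$ quantile at level $1-\alpha$ equals $-2\ln\alpha$.

```python
import numpy as np

def squared_mahalanobis(Y):
    """Classical squared Mahalanobis distance of each row of Y (n x p)."""
    Y = np.asarray(Y, dtype=float)
    center = Y.mean(axis=0)
    # Sample covariance with the 1/(n-1) normalisation used in section 2
    C = np.atleast_2d(np.cov(Y, rowvar=False))
    diff = Y - center
    # Solve C x = diff^t rather than forming the inverse explicitly
    return np.einsum("ij,ij->i", diff, np.linalg.solve(C, diff.T).T)

# Hypothetical data: 80 good points and a tight cluster of 20 outliers
rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(80, 2))
bad = rng.normal(8.0, 0.1, size=(20, 2))
Y = np.vstack([good, bad])

md2 = squared_mahalanobis(Y)
cutoff = -2.0 * np.log(0.01)  # chi-square 0.99 quantile, exact only for p = 2
n_flagged = int((md2[80:] > cutoff).sum())
print(n_flagged, "of 20 outliers exceed the cutoff")
```

Because the cluster of outliers pulls the mean towards itself and stretches the covariance matrix in its own direction, the outliers end up with modest distances and stay below the cutoff, which is the masking effect.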
This adjustment results in a spectacular gain in computer time, since the algorithm has to be run only once instead of more than once. Kosinski estimates the required number of random starting data sets in his own original algorithm to be approximately 35 in the case of 2-dimensional data sets, and up to 10000 in 10 dimensions.

The other adjustment is in the expansion of the good part. In the Kosinski paper the increment is always one point. We implemented an increment proportional to the good part already found, for instance 10%. This means that the good part is increased by 10% at each step. This speeds up the algorithm as well, especially in large data sets. The original algorithm with one-point increment scales with $n^2$, where n is the number of data points, while the algorithm with proportional increment scales with $n \ln n$. This adjustment was also tested and proved to work very well.

In the remainder of this report, “the Kosinski method” denotes the adjusted Kosinski method, unless otherwise noted.

3.2 The Kosinski algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$, where $y_i = (y_{i1}\ \ldots\ y_{ip})^t$.

Step 1. Choose an elemental partition

A good part of p+1 points is found as follows.

• Calculate the med and mad for each dimension q:

$$M_q = \operatorname{med}_k\, y_{kq}, \qquad S_q = \operatorname{med}_l\, |y_{lq} - M_q|$$

• Divide each component q of each data point i by the mad of the dimension concerned. The scaled data points are denoted by the superscript s:

$$y_{iq}^s = \frac{y_{iq}}{S_q}$$

• Declare a point to be a univariate outlier if at least one component of the data point is farther than 2.5 standard deviations away from the scaled median.
  The standard deviation is approximated by 1.484 times the mad (see section 4.1 for the background of the factor 1.484). So calculate for each component q of each point i:
$$u_{iq} = \frac{1}{1.484} \left| y_{iq}^s - \frac{M_q}{S_q} \right|$$

  If $u_{iq} > 2.5$ for any q, then point i is a univariate outlier.

• Calculate the mean of the data set, neglecting the univariate outliers:

$$\bar{y}^s = \frac{1}{n_0} \sum_{i:\ y_i \text{ is no outlier}} y_i^s$$

  where $n_0$ denotes the number of points that are no univariate outliers.

• Select the p+1 points that are closest to the mean. Define those points to be the good part of the data set. So calculate:

$$d_i = \left\| y_i^s - \bar{y}^s \right\|$$

  The g = p+1 points with the smallest $d_i$ form the good part, denoted by G.

Step 2. Iteratively increase the good part

The good part is increased until a certain stop criterion is fulfilled.

• Continue with the original data set $y_i$, not with the scaled data set $y_i^s$.

• Calculate the mean and the covariance matrix of the good part:

$$\bar{y} = \frac{1}{g} \sum_{i \in G} y_i, \qquad C = \frac{1}{g-1} \sum_{i \in G} (y_i - \bar{y})(y_i - \bar{y})^t$$

• Calculate the Mahalanobis distance of all the data points:

$$MD_i^2 = (y_i - \bar{y})^t C^{-1} (y_i - \bar{y})$$

• Calculate the number of points with a Mahalanobis distance smaller than a predefined cutoff value. A useful cutoff value is $\chi^2_{p,1-\alpha}$, with $\alpha$ = 1%.

• Increase the good part by a predefined percentage (a useful percentage is 20%) by selecting the points with the smallest Mahalanobis distances, but not beyond
  a) half the data set, if the good part is smaller than half the data set ($g < h = [\tfrac{1}{2}(n+p+1)]$);
  b) the number of points with a Mahalanobis distance smaller than the cutoff, if the good part is larger than half the data set.

• Stop the algorithm if the good part was already larger than half the data set and no more points were added in the last iteration.

Step 3. Out: outlyingnesses

The outlyingness of each point is now simply the Mahalanobis distance of the point, calculated with the mean and the covariance matrix of the good part of the data set.
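
The steps above can be sketched in Python with NumPy. This is an illustrative reading of the description, not the authors' Borland Pascal program. The function name `kosinski_outlyingness`, its parameters, the treatment of the boundary case g = h as "larger than half", the rounding of the proportional increment, and the generated test data are all assumptions. The sketch returns squared distances, so the cutoff must be a $\chi^2$ quantile; for p = 2 that quantile equals $-2\ln\alpha$. It also assumes that no mad is zero and that the covariance matrix of the good part is non-singular.

```python
import numpy as np

def kosinski_outlyingness(Y, cutoff, increment=0.2):
    """Sketch of the adjusted Kosinski forward search (squared distances)."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape

    # Step 1: elemental partition of p+1 points
    med = np.median(Y, axis=0)
    mad = np.median(np.abs(Y - med), axis=0)   # assumed non-zero
    Ys = Y / mad                               # scaled data
    u = np.abs(Ys - med / mad) / 1.484         # univariate outlyingness
    clean = (u <= 2.5).all(axis=1)             # outlier in none of the dimensions
    center = Ys[clean].mean(axis=0)
    d = np.linalg.norm(Ys - center, axis=1)
    good = np.argsort(d)[: p + 1]

    # Step 2: iteratively increase the good part, using the unscaled data
    h = (n + p + 1) // 2
    while True:
        g = len(good)
        mean = Y[good].mean(axis=0)
        C = np.atleast_2d(np.cov(Y[good], rowvar=False))
        diff = Y - mean
        md2 = np.einsum("ij,ij->i", diff, np.linalg.solve(C, diff.T).T)
        n_inside = int((md2 < cutoff).sum())
        # Proportional increment, but always at least one point (assumption)
        target = max(g + 1, int(np.ceil(g * (1.0 + increment))))
        limit = h if g < h else n_inside
        new_g = min(target, limit)
        if g >= h and new_g <= g:
            # Step 3: distances based on the final good part
            return md2
        good = np.argsort(md2)[:new_g]

# Hypothetical data: 90 good points and 10 outliers in two dimensions
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(10.0, 0.5, size=(10, 2))])
cutoff = -2.0 * np.log(0.01)  # chi-square 0.99 quantile, exact only for p = 2
md2_k = kosinski_outlyingness(Y, cutoff)
print(int((md2_k[90:] > cutoff).sum()), "of 10 outliers flagged")
```

The good part grows strictly at every iteration and never exceeds n, so the loop terminates. For other dimensions the cutoff would have to come from a $\chi^2$ quantile table or a statistics library.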
3.3 Test results

A prototype/test program was implemented in a Borland Pascal 7.0 environment. Documentation of the program is published elsewhere. We successively tested the choice of the elemental partition by means of the mean, the number of swamped observations in data sets containing no outliers, the number of masked and swamped observations in data sets containing outliers, the algorithm with proportional increment, and the time performance of the proportional increment of the good part compared to the one-point increment. Finally, we tested the sensitivity of the number of detected outliers to the cutoff value and the increment percentage in some known data sets.

3.3.1 Elemental partition

First of all, the choice of the elemental partition was tested with the generated data set published by Kosinski. The Kosinski data set is a kind of worst-case data set. It contains a large number of outliers (40% of the data) and the outliers are distributed with a variance much smaller than the variance of the good points.

Before using the mean, we calculated the coordinate-wise median as a robust estimator of the center of the data, and selected the three closest points. This strategy failed. Although the median has a 50% robustness, the 40% outliers strongly shift the median. Hence, one of the three selected points turned out to be an outlier. As a consequence, the forward search algorithm indicated all points to be good points, i.e. all the outliers were masked.

This was the reason we searched for another robust measure of the location of the data. One of the simplest ideas is to search for univariate outliers first, and to calculate the mean of the points that are outliers in none of the dimensions.

The selected points, the three points closest to the mean, all turned out to be good points.
Moreover, the forward search algorithm, applied with this elemental partition, successfully distinguished the outliers from the good points.

All following tests were performed using this “mean” to select the first p+1 points. For all tested data sets the selected p+1 points turned out to be good points, resulting in a successful forward search. It is possible, in principle, to construct a data set for which this selection algorithm still fails, for instance a data set with a large fraction of outliers which are univariately invisible and with no unambiguous dividing line between the group of outliers and the group of good points. This is, however, a very hypothetical situation.

3.3.2 Swamping

A simulation study was performed in order to determine the average fraction of swamped observations in normally distributed data sets. In large data sets a few points are almost always indicated to be outliers, even if the whole data set nicely follows a normal distribution. This is due to the cutoff value. If a cutoff value of $\chi^2_{p,1-\alpha}$ is used as discriminator between good points and outliers in a p-dimensional standard normal data set, a fraction $\alpha$ of the data points will have a squared Mahalanobis distance larger than the cutoff value.
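
The role of the cutoff value can be checked numerically. The sketch below is an illustration added here, not the simulation of the report: it uses the known center 0 and identity covariance of a standard normal data set instead of estimates, takes p = 2 so that the $\chi^2_{2,1-\alpha}$ quantile is exactly $-2\ln\alpha$, and uses a sample size and seed that are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.01
n, p = 200_000, 2

Y = rng.standard_normal((n, p))
# Squared Mahalanobis distance with the true center 0 and covariance I
md2 = (Y ** 2).sum(axis=1)
# Chi-square quantile at level 1 - alpha; this closed form holds only for p = 2
cutoff = -2.0 * np.log(alpha)

frac_flagged = float((md2 > cutoff).mean())
print(frac_flagged)  # close to alpha, up to sampling noise
```

Even without any true outlier, roughly a fraction $\alpha$ of the points lies beyond the cutoff, which is why swamped observations are unavoidable in large data sets.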
For each dimension p between 1 and 8 we generated 100 standard normal data sets of 100 points. The Kosinski algorithm was run twice on each data set, once with a cutoff value $\chi^2_{p,0.99}$, and once with $\chi^2_{p,0.95}$. Each point that is indicated to be an outlier is a swamped observation, since there are no true outliers by construction. We calculated the average fraction of swamped observations (i.e. the number of swamped observations of each data set divided by 100, the number of points in the data set, averaged over all 100 data sets). Results are shown in Table 3.1.

| α | p=1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 0.015 | 0.011 | 0.010 | 0.008 | 0.008 | 0.008 | 0.007 | 0.007 |
| 0.05 | 0.239 | 0.112 | 0.081 | 0.070 | 0.059 | 0.052 | 0.045 | 0.042 |

Table 3.1. The average fraction of swamped observations of the simulations of 100 generated p-dimensional data sets of 100 points for each p between 1 and 8, with cutoff value $\chi^2_{p,1-\alpha}$.

For $\alpha$ = 0.01 the fraction of swamped observations is very close to the value of $\alpha$ itself. These results are very similar to the results of the original Kosinski algorithm.

For $\alpha$ = 0.05, however, the average fraction of swamped observations is much larger than 0.05 for the lower dimensions, especially for p=1 and p=2. The reason for this is the following. Consider a one-dimensional standard normal data set. If the variance of all points is used, the outlyingness of a fraction $\alpha$ of the points will be larger than $\chi^2_{1,1-\alpha}$. However, the Kosinski algorithm calculates the variance of all points except at least that fraction $\alpha$ of points with the largest outlyingnesses. This variance is smaller than the variance of all points. Hence, the Mahalanobis distances are overestimated and too many points are indicated to be outliers. This is a self-magnifying effect: more outliers lead to a smaller variance, which leads to more points indicated to be outliers, etc.

The effect is strongest in one dimension.
In higher dimensions the points with a large Mahalanobis distance are “all around”. Therefore they have less influence on the variance in the separate directions.

Apparently, the effect is quite strong for $\alpha$ = 0.05, but almost negligible for $\alpha$ = 0.01. In the remaining tests $\alpha$ = 0.01 is used, unless otherwise stated.

3.3.3 Masking and swamping

The ability of the algorithm to detect outliers was tested in another simulation. We generated data sets in the same way as is done in the Kosinski paper in order to get a fair comparison between the original and our adjusted Kosinski algorithm. Thus we generated data sets of 100 points containing good points as well as outliers. Both the good points and the outliers were generated from a multivariate normal distribution, with $\sigma^2 = 40$ for the good points and $\sigma^2 = 1$ for the bad points. The distance between
the center of the good points and the bad points is denoted by d. The vector between the centers is along the vector of 1’s.

We varied the dimension (p = 2, 5), the fraction of outliers (0.10 to 0.45), and the distance (d = 20 to 60). We calculated the fraction of masked outliers (the number of masked outliers of each data set divided by the number of outliers) and the fraction of swamped points (the number of swamped points of each data set divided by the number of good points), both averaged over 100 simulation runs for each set of parameters p, d, and fraction of outliers. Results are shown in Table 3.2.

| Fraction of outliers | p=2: d | p=2: fraction of masked obs. | p=2: fraction of swamped obs. | p=5: d | p=5: fraction of masked obs. | p=5: fraction of swamped obs. |
|---|---|---|---|---|---|---|
| 0.10 | 20 | 0.81 | 0.009 | 25 | 0.90 | 0.008 |
| 0.20 | 20 | 0.89 | 0.014 | 25 | 0.91 | 0.021 |
| 0.30 | 20 | 0.88 | 0.022 | 25 | 0.93 | 0.146 |
| 0.40 | 20 | 0.86 | 0.146 | 25 | 0.97 | 0.551 |
| 0.45 | 20 | 0.88 | 0.350 | 25 | 1.00 | 0.855 |
| 0.10 | 30 | 0.03 | 0.011 | 40 | 0.00 | 0.008 |
| 0.20 | 30 | 0.00 | 0.011 | 40 | 0.04 | 0.008 |
| 0.30 | 30 | 0.01 | 0.010 | 40 | 0.03 | 0.022 |
| 0.40 | 30 | 0.05 | 0.043 | 40 | 0.02 | 0.020 |
| 0.45 | 30 | 0.01 | 0.019 | 40 | 0.01 | 0.014 |
| 0.10 | 40 | 0.00 | 0.011 | 60 | 0.00 | 0.008 |
| 0.20 | 40 | 0.00 | 0.011 | 60 | 0.00 | 0.007 |
| 0.30 | 40 | 0.00 | 0.011 | 60 | 0.00 | 0.009 |
| 0.40 | 40 | 0.00 | 0.009 | 60 | 0.00 | 0.010 |
| 0.45 | 40 | 0.00 | 0.010 | 60 | 0.00 | 0.008 |

Table 3.2. Average fraction of masked and swamped observations of 2- and 5-dimensional data sets over 100 simulation runs. Each data set consisted of 100 points with a certain fraction of outliers. The good (bad) points were generated from a multivariate normal distribution with $\sigma^2 = 40$ ($\sigma^2 = 1$) in each direction. The distance between the center of the good points and the bad points is denoted by d.

The following conclusions can be drawn from these results. The algorithm is said to be performing well if the fraction of masked outliers is close to zero and the fraction of swamped observations is close to $\alpha$ = 0.01.
The first conclusion is: the larger the distance between the good points and the bad points, the better the algorithm performs. This conclusion is not surprising and is in agreement with Kosinski’s results. Secondly, the higher the dimension, the worse the performance of the algorithm. In five dimensions the algorithm starts to perform well at d=40 and is close to perfect at d=60, while in two dimensions the performance is good at d=30 and perfect at d=40. The original algorithm did not show such a dependence on the dimension. It is remarked, however, that the paper by Kosinski does not give
enough details for a good comparison on this point. Third, for both two and five dimensions the adjusted algorithm performs worse than the original algorithm. The original algorithm is almost perfect at d=25 for both p=2 and p=5, while the adjusted algorithm is not perfect until d=40 or d=60. This is the price that is paid for the large gain in computer time. The fourth conclusion is: the performance of the algorithm hardly depends on the fraction of outliers, in agreement with Kosinski’s results. In some cases, the algorithm even seems to perform better for higher fractions. This is, however, due to the relatively small number of points (100) per data set. For very large data sets and a very large number of simulation runs this artifact will disappear.

| p | d | fr | inc | masked | swamped |
|---|---|---|---|---|---|
| 2 | 20 | 0.10 | 1p | 0.79 | 0.010 |
| 2 | 20 | 0.10 | 10% | 0.80 | 0.009 |
| 2 | 20 | 0.10 | 100% | 0.80 | 0.009 |
| 2 | 20 | 0.40 | 1p | 0.86 | 0.225 |
| 2 | 20 | 0.40 | 10% | 0.86 | 0.146 |
| 2 | 20 | 0.40 | 100% | 0.89 | 0.093 |
| 2 | 30 | 0.10 | 1p | 0.00 | 0.011 |
| 2 | 30 | 0.10 | 10% | 0.03 | 0.011 |
| 2 | 30 | 0.10 | 100% | 0.02 | 0.011 |
| 2 | 30 | 0.40 | 1p | 0.05 | 0.042 |
| 2 | 30 | 0.40 | 10% | 0.05 | 0.043 |
| 2 | 30 | 0.40 | 100% | 0.08 | 0.038 |
| 2 | 40 | 0.10 | 1p | 0.00 | 0.011 |
| 2 | 40 | 0.10 | 10% | 0.00 | 0.011 |
| 2 | 40 | 0.10 | 100% | 0.00 | 0.011 |
| 2 | 40 | 0.40 | 1p | 0.00 | 0.010 |
| 2 | 40 | 0.40 | 10% | 0.00 | 0.009 |
| 2 | 40 | 0.40 | 100% | 0.02 | 0.009 |
| 5 | 40 | 0.10 | 1p | 0.00 | 0.008 |
| 5 | 40 | 0.10 | 10% | 0.00 | 0.008 |
| 5 | 40 | 0.10 | 100% | 0.01 | 0.008 |
| 5 | 40 | 0.40 | 1p | 0.01 | 0.016 |
| 5 | 40 | 0.40 | 10% | 0.01 | 0.016 |
| 5 | 40 | 0.40 | 100% | 0.06 | 0.035 |

Table 3.3. Average fraction of masked and swamped observations for p-dimensional data sets with a fraction fr of outliers at a distance d from the good points (for more details about the data sets see Table 3.2), calculated with runs with either one-point increment (1p) or proportional increment (10% or 100% of the good part).

3.3.4 Proportional increment

Until now all tests have been performed using the one-point increment, i.e. at each step of the algorithm the size of the good part is increased by just one point.
In section 3.1 it was already mentioned that a gain in computer time is possible by increasing the size of the good part by more than one point per step. The simulations on the masked and swamped observations were repeated with the proportional increment algorithm. The increment with a certain percentage was
tested for percentages up to 100% (which means that the size of the good part is doubled at each step).

The results of Table 3.1, showing the average fraction of swamped observations in outlier-free data sets, did not change. Small changes showed up for large percentages in the presence of outliers. A summary of the results is shown in Table 3.3. In order to avoid an unnecessary profusion of data we only show the results for p=2 in some relevant cases and, as an illustration, in a few cases for p=5.

A general conclusion from the table is that for a wide range of percentages the proportional increment algorithm works satisfactorily. For a percentage of 100%, outliers are masked slightly more frequently than for lower percentages. The differences between 10% increment and one-point increment are negligible.

3.3.5 Time dependence

To illustrate the possible gain with the proportional increment we measured the time per run for p-dimensional data sets of n points, with p ranging from 1 to 8 and n from 50 to 400. The simulations were performed with outlier-free generated data sets so that the complete data sets had to be included in the good part. This was done in order to obtain useful information about the dependence of the simulation times on the number of points. Table 3.4 shows the results for the simulation runs with one-point increment. The results for the runs with a proportional increment of 10% are shown in Table 3.5.

| n | p=1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 50 | 0.09 | 0.18 | 0.29 | 0.45 | 0.64 | 0.84 | 1.08 | 1.35 |
| 100 | 0.36 | 0.68 | 1.05 | 1.75 | 2.5 | 3.3 | 4.3 | 5.5 |
| 200 | 1.46 | 2.8 | 4.6 | 7.0 | 10 | | | |
| 400 | 6.2 | 12 | | | | | | |

Table 3.4. Time (in seconds) per run on p-dimensional data sets of n points, using the one-point increment.

| n | p=1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 50 | 0.05 | 0.10 | 0.16 | 0.23 | 0.31 | 0.39 | 0.52 | 0.62 |
| 100 | 0.14 | 0.24 | 0.39 | 0.56 | 0.76 | 1.00 | 1.25 | 1.55 |
| 200 | 0.33 | 0.60 | 0.92 | 1.35 | 1.90 | | | |
| 400 | 0.80 | 1.40 | | | | | | |

Table 3.5.
Time (in seconds) per run on p-dimensional data sets of n points, using the proportional increment (perc=10%).

Let us denote the time per run as a function of n for fixed p by $t_p$, and the time per run as a function of p for fixed n by $t_n$. For the one-point increment simulations $t_p$ is approximately proportional to $n^2$. This is as expected, since there are O(n) steps with an increment of one point and at each step the Mahalanobis distance has to be calculated for each point (O(n)) and sorted (O(n ln n)). For the simulations with proportional increment $t_p$ is approximately O(n ln n), due to the fact that only O(ln n) steps are needed instead of O(n). As a consequence there is a substantial
gain in the time per run, ranging from a factor of 2 for 50 points up to a factor of 8 for 400 points.

The time per run for fixed n, $t_n$, is approximately proportional to $p^{1.5}$, for both one-point and proportional increment runs. The exponent 1.5 is just an empirical average over the range p=1..8 and is the result of several O(p) and O($p^2$) steps. Since the exponent is much smaller than 2, it is more efficient to search for outliers in one p-dimensional run than in ½p(p-1) 2-dimensional runs, one for each pair of dimensions, even if one is not interested in outliers in more than 2 dimensions. Consider for instance p=8, n=50. One run takes 0.62 seconds. However, a total of 2.8 seconds would be needed for the 28 runs in each pair of dimensions, each run taking 0.10 seconds (see Table 3.5).

3.3.6 Sensitivity to parameters

The Kosinski algorithm was tested on the twelve data sets described in section 5. A full description of the outliers and a comparison of the results with the results of the projection algorithm, as well as with other methods described in the literature, is given in that section. In the present section we restrict the discussion to the sensitivity of the number of outliers to the cutoff and the increment percentage.

The algorithm was run with a cutoff $\chi^2_{p,1-\alpha}$ for $\alpha$ = 1% as well as $\alpha$ = 5%. Furthermore, both one-point increment and proportional increment (in the range 0-40%) were used. The number of detected outliers of the twelve data sets is shown in Table 3.6.

It is clear that the number of outliers for a specific data set is not the same for each set of parameters. It is remarked that, in all cases, if different sets of parameters lead to the same number of outliers, the outliers are exactly the same points.
Moreover, if one set of parameters leads to more outliers than another set, all outliers detected by the latter are also detected by the former (these are empirical results).

Let us first discuss the differences between the detection with $\alpha$ = 1% and with $\alpha$ = 5%. It is obvious that in many cases $\alpha$ = 5% results in slightly more outliers than $\alpha$ = 1%. However, in two cases the differences are substantial, i.e. in the Stackloss data and in the Factory data.

In the Stackloss data five outliers are found for $\alpha$ = 5% using moderate increments, while $\alpha$ = 1% shows no outliers at all. The reason for this difference is the relatively small number of points compared to the dimension of the data set. It has been argued by Rousseeuw that the ratio n/p should be larger than 5 in order to be able to detect outliers reliably. If n/p is smaller than 5 one comes to a point where it is not useful to speak about outliers, since there is no real bulk of data.

With n=21 and p=4 the Stackloss data lie on the edge of meaningful outlier detection. Moreover, if the five points which are indicated as outliers with $\alpha$ = 5% are left out, only 16 good points remain, resulting in a ratio n/p=4. In such a case any outlier detection algorithm will presumably fail to find outliers consistently.
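| Data set | p | n | inc | α=5% | α=1% |
|---|---|---|---|---|---|
| 1. Kosinski | 2 | 100 | 1p | 42 | 40 |
| | | | ≤40% | 42 | 40 |
| 2. Brain mass | 2 | 28 | 1p | 5 | 3 |
| | | | ≤10% | 5 | 3 |
| | | | 15-20% | 4 | 3 |
| | | | 30-40% | 3 | 3 |
| 3. Hertzsprung-Russell | 2 | 47 | 1p | 7 | 6 |
| | | | ≤30% | 7 | 6 |
| | | | 40% | 6 | 6 |
| 4. Hadi | 3 | 25 | 1p | 3 | 3 |
| | | | ≤5% | 3 | 3 |
| | | | 10% | 3 | 0 |
| | | | 15-25% | 3 | 3 |
| | | | 30% | 3 | 0 |
| | | | 40% | 3 | 3 |
| 5. Stackloss | 4 | 21 | 1p | 5 | 0 |
| | | | ≤17% | 5 | 0 |
| | | | 18-24% | 4 | 0 |
| | | | 25-30% | 1 | 0 |
| | | | 40% | 0 | 0 |
| 6. Salinity | 4 | 28 | 1p | 4 | 2 |
| | | | ≤30% | 4 | 2 |
| | | | 40% | 2 | 2 |
| 7. HBK | 4 | 75 | 1p | 15 | 14 |
| | | | ≤30% | 15 | 14 |
| | | | 40% | 14 | 14 |
| 8. Factory | 5 | 50 | 1p | 20 | 0 |
| | | | ≤40% | 20 | 0 |
| 9. Bush fire | 5 | 38 | 1p | 16 | 13 |
| | | | ≤40% | 16 | 13 |
| 10. Wood gravity | 6 | 20 | 1p | 6 | 5 |
| | | | ≤20% | 6 | 5 |
| | | | 30% | 6 | 6 |
| | | | 40% | 6 | 5 |
| 11. Coleman | 6 | 20 | 1p | 7 | 7 |
| | | | ≤40% | 7 | 7 |
| 12. Milk | 8 | 85 | 1p | 20 | 17 |
| | | | ≤30% | 20 | 17 |
| | | | 40% | 18 | 15 |

Table 3.6. Number of outliers detected by the Kosinski algorithm with a cutoff of $\chi^2_{p,1-\alpha}$, for $\alpha$ = 5% and $\alpha$ = 1%, with either one-point (1p) or proportional increment in the range 0-40%.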
The Factory data is an interesting case. For $\alpha$ = 5% twenty outliers are detected, which is 40% of all points, while detection with $\alpha$ = 1% shows no outliers. Explorative data analysis shows that about half the data set is quite narrowly concentrated in a certain region, while the other half is distributed over a much larger space. There is, however, no clear distinction between these two parts. The more widely distributed part is rather a very thick tail of the other part. In such a case the tendency of the algorithm with $\alpha$ = 5% to detect too many outliers, which was explained in the discussion of Table 3.1, is very strong. It is questionable whether the indicated points should be considered as outliers.

Let us now discuss the sensitivity of the number of detected outliers to the increment. At low percentages the number of outliers is always the same as for the one-point increment; in fact, at very low percentages the proportional increment procedure leads to an increment of just one point per step, making the two algorithms equal. For most data sets the number of outliers is constant for a wide range of percentages and starts to differ slightly only at 30-40% or higher. Three of the twelve data sets behave differently: the Brain mass data, the Hadi data, and the Stackloss data.

The Brain mass data shows 5 outliers at low percentages for $\alpha$ = 5%. At percentages around 15% the number of outliers is only 4 and at 30% only 3. So the number of outliers changes earlier (at 15%) than in most other data sets (≥30%). For $\alpha$ = 1% the number of outliers is constant over the whole range. In fact, the three outliers which are found at 30-40% for $\alpha$ = 5% are exactly the same as the three outliers found for $\alpha$ = 1%. The two outliers which are missed at higher percentages for $\alpha$ = 5% both lie just above the cutoff value. Therefore it is disputable whether they are real outliers at all.

The Hadi data shows strange behavior.
At all percentages for $\alpha$ = 5% and at most percentages for $\alpha$ = 1% three outliers are found. However, near 10% and near 30% no outliers are detected for $\alpha$ = 1%. Again, the three outliers are disputable. All have a Mahalanobis distance just above the cutoff (see Table 5.2). Hence it is not strange that sometimes these three points are included in the good part (the three points lie close together; hence, the inclusion of one of them in the good part leads to low Mahalanobis distances for the other two as well). On the other hand, it is also not a big problem, since it is a matter of taste rather than a matter of science to call the three points outliers or good points.

The Stackloss data shows a decreasing number of outliers for $\alpha$ = 5% at relatively low percentages, as in the Brain mass data. Here, the sensitivity to the percentage is related to the low ratio n/p, as discussed previously.

In conclusion, for increments up to 30% the same outliers are found as with the one-point increment. In cases where this is not true, the supposed outliers always have an outlyingness slightly above or below the cutoff, so that missing such outliers has no big consequences. Furthermore, relatively low cutoff values could lead to disproportionate swamping.
4. The projection method

4.1 The principle of projection

The projection method is based on the idea that outliers in univariate data are easily recognized, visually as well as by computational means. In one dimension the Mahalanobis distance is simply $|y_i - \bar{y}| / \sigma$. A robust version of the univariate outlyingness is found by replacing the mean by the med and replacing the standard deviation by the mad. Denoting the robust outlyingness by $u_i$, this leads to

$$u_i = \frac{|y_i - M|}{S}$$

where M and S denote the med and the mad respectively:

$$M = \operatorname{med}_k \, y_k, \qquad S = \operatorname{med}_l \, |y_l - M|$$

In the case of multivariate data the idea is to "look" at the data set from all possible directions and to "see" whether a particular data point lies far away from the bulk of the data points. Looking in this context means projecting the data set on a projection vector a; seeing means calculating the outlyingness as is done in univariate data. The ultimate outlyingness of a point is just the maximum of the outlyingnesses over all projection directions.

The outlyingness defined in this way corresponds to the multivariate Mahalanobis distance, as is shown in section 2. Recall the expression for the Mahalanobis distance:

$$MD_i^2 = \sup_{a^t a = 1} \frac{\left(a^t (y_i - \bar{y})\right)^2}{a^t C a}$$

Robustifying the Mahalanobis distance leads to

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{S}$$

Now M and S are defined as follows:

$$M = \operatorname{med}_k \, a^t y_k, \qquad S = \operatorname{med}_l \, |a^t y_l - M|$$

It is remarked that $MD_i^2$ corresponds to $u_i^2$.

How is the maximum calculated? The outlyingness $|a^t y_i - M| / S$ as a function of a could possess several local maxima, making gradient search methods infeasible. Therefore the outlyingness is calculated on a grid of a finite number of projection vectors. The grid should be fine enough to calculate the maximum outlyingness with sufficient accuracy.

This robust measure of outlyingness was first developed by Stahel and Donoho. More recent work on this subject has been reported by Maronna and Yohai. These authors used the outlyingness in order to calculate a weighted mean and covariance matrix. Outliers were given small weights so that the Stahel-Donoho estimator of the mean was robust against the presence of outliers. It is of course possible to use the weighted mean and covariance matrix to calculate a weighted Mahalanobis distance. This is not done in the projection method discussed here.

The robust outlyingness $u_i$ was slightly adjusted for the following reason. The mad of univariate standard normal data, which has a standard deviation of 1 by definition, is 0.674 = 1/1.484. In order to ensure that, in the limiting case of an infinitely large multivariate normal data set, the outlyingness $u_i^2$ is equal to the squared Mahalanobis distance, the mad in the denominator is multiplied by 1.484:

$$u_i = \sup_{a^t a = 1} \frac{|a^t y_i - M|}{1.484 \, S}$$

4.2 The projection algorithm

The purpose of the algorithm is, given a set of n multivariate data points $y_1, y_2, \ldots, y_n$, to calculate the outlyingness $u_i$ for each point i. The algorithm can be summarized as follows.

Step 0. In: data set

The algorithm is started with a set of continuous p-dimensional data $y_1, y_2, \ldots, y_n$, with $y_i = (y_{i1} \; \ldots \; y_{ip})^t$.

Step 1. Define a grid

There are $\binom{p}{q}$ subsets of q dimensions in the total set of p dimensions. The "maximum search dimension" q is predefined.
Projection vectors a in a certain subset are parameterized by the angles $\theta_1, \theta_2, \ldots, \theta_{q-1}$:

$$a = \begin{pmatrix} \cos\theta_1 \\ \cos\theta_2 \sin\theta_1 \\ \cos\theta_3 \sin\theta_2 \sin\theta_1 \\ \vdots \\ \cos\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \\ \sin\theta_{q-1} \sin\theta_{q-2} \cdots \sin\theta_1 \end{pmatrix}$$
A certain predefined step size step (in degrees) is used to define the grid.

The first angle $\theta_1$ can take the values $i \cdot step_1$, with $step_1$ the largest angle smaller than or equal to step for which $180 / step_1$ is an integer value, and with $i = 1, 2, \ldots, 180 / step_1$.

The second angle can take the values $j \cdot step_2$, with $step_2$ the largest angle smaller than or equal to $step_1 / \cos\theta_1$ for which $180 / step_2$ is an integer value, and with $j = 1, 2, \ldots, 180 / step_2$.

The r-th angle can take the values $k \cdot step_r$, with $step_r$ the largest angle smaller than or equal to $step_{r-1} / \cos\theta_{r-1}$ for which $180 / step_r$ is an integer value, and with $k = 1, 2, \ldots, 180 / step_r$.

Such a grid is defined in each subset of q dimensions.

Step 2. Outlyingness for each grid point

For each grid point a, calculate the outlyingness for each data point $y_i$:

• Calculate the projections $a^t y_i$.
• Calculate the median $M_a = \operatorname{med}_k \, a^t y_k$.
• Calculate the mad $L_a = \operatorname{med}_l \, |a^t y_l - M_a|$.
• Calculate the outlyingness $u_i(a) = |a^t y_i - M_a| / (1.484 \, L_a)$.

Step 3. Out: outlyingness

The outlyingness $u_i$ is the maximum over the grid:

$$u_i = \sup_a u_i(a)$$

4.3 Test results

A prototype/test program was implemented in an Excel/Visual Basic environment. Documentation of the program is published elsewhere. We successively tested the number of swamped observations in data sets containing no outliers, the number of masked observations in data sets containing outliers, the dependence of the running time on the parameters step and q, and the sensitivity of the number of detected outliers to these parameters in some known data sets.
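Steps 0 to 3 of section 4.2 can be sketched in Python as follows. This is a minimal illustration, not the Excel/Visual Basic prototype mentioned above. The function names `direction` and `projection_outlyingness` are hypothetical, and the grid is a simplifying assumption: every angle uses the same adjusted step $step_1$, rather than the angle-dependent steps $step_r$ of Step 1.

```python
import itertools
import math
from statistics import median


def direction(angles):
    """Unit vector in q = len(angles) + 1 dimensions, built from the angles
    (in radians) with the parameterization of Step 1."""
    a, sin_prod = [], 1.0
    for theta in angles:
        a.append(math.cos(theta) * sin_prod)
        sin_prod *= math.sin(theta)
    a.append(sin_prod)
    return a


def projection_outlyingness(points, q, step_deg):
    """Outlyingness u_i of each point: the maximum over all grid directions of
    |a^t y_i - med| / (1.484 * mad).

    Assumption: one uniform step for every angle (the paper refines the step
    per angle); all subsets of q coordinates are searched.
    """
    n, p = len(points), len(points[0])
    k = math.ceil(180 / step_deg)   # 180/k is the largest step <= step_deg
    thetas = [math.radians(i * 180 / k) for i in range(1, k + 1)]
    u = [0.0] * n
    for dims in itertools.combinations(range(p), q):
        for angles in itertools.product(thetas, repeat=q - 1):
            a = direction(angles)
            proj = [sum(c * y[d] for c, d in zip(a, dims)) for y in points]
            m = median(proj)
            mad = median(abs(v - m) for v in proj)
            for i, v in enumerate(proj):
                dev = abs(v - m)
                if mad > 0:
                    u[i] = max(u[i], dev / (1.484 * mad))
                elif dev > 0:
                    # degenerate direction: at least half of the projections coincide
                    u[i] = math.inf
    return u
```

For example, `projection_outlyingness(points, q=2, step_deg=10)` returns one outlyingness per point, which can then be compared with a cutoff value. With q = 1 the grid reduces to the coordinate axes, so only outliers that are visible in a single variable are found.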
4.3.1 Swamping

A simulation study was performed in order to determine the average fraction of swamped observations in normally distributed data sets. See section 3.3.2 for more detailed remarks about the swamping effect and about generating the data sets. The results of the simulations are shown in Table 4.1.

| α | step | p=1 | p=2 | p=3 | p=4 | p=5 |
|---|---|---|---|---|---|---|
| 1% | 10 | 0.010 | 0.011 | 0.016 | 0.018 | 0.023 |
| 5% | 10 | 0.049 | 0.052 | 0.067 | 0.071 | 0.088 |
| 1% | 30 | 0.010 | 0.010 | 0.012 | 0.011 | 0.012 |
| 5% | 30 | 0.049 | 0.049 | 0.051 | 0.049 | 0.058 |

Table 4.1. The average fraction of swamped observations of the simulations on several generated p-dimensional data sets of 100 points, with cutoff value $\sqrt{\chi^2_{p,1-\alpha}}$ and step size step. The parameter q is equal to p.

For low dimensions the average fraction of swamped observations tends to be almost equal to α. The fraction increases, however, with increasing dimension. This is due to the decreasing ratio n/p. It is remarkable that for step size 30 the fraction of swamped observations seems to be much better than for step size 10. This is just a coincidence. The fact that more observations are declared to be outliers is compensated by the fact that outlyingnesses are usually smaller if large step sizes are used. In fact, the differences between step size 10 and 30 are so large for higher dimensions that this indicates that a step size of 30 could be too coarse to result in reliable outlyingnesses.

4.3.2 Masking and swamping

The ability of the projection algorithm to detect outliers was tested by generating data sets that contain good points as well as outliers. See section 3.3.3 for details on how the data sets were generated.

Results are shown in Table 4.2.

| Fraction of outliers | p=2, q=2: d | masked obs. | p=5, q=2: d | masked obs. | p=5, q=5: d | masked obs. |
|---|---|---|---|---|---|---|
| 0.12 | 20 | 0.83 | 30 | 1.00 | 30 | 0.22 |
| 0.23 | 20 | 1.00 | 30 | 1.00 | 30 | 0.54 |
| 0.34 | 20 | 1.00 | 30 | 1.00 | 30 | 1.00 |
| 0.45 | 20 | 1.00 | 30 | 1.00 | 30 | 1.00 |
| 0.12 | 40 | 0.00 | 50 | 0.00 | 50 | 0.00 |
| 0.23 | 40 | 0.00 | 50 | 0.67 | 50 | 0.00 |
| 0.34 | 40 | 0.62 | 50 | 1.00 | 50 | 0.65 |
| 0.45 | 40 | 1.00 | 50 | 1.00 | 50 | 1.00 |
| 0.12 | 50 | 0.00 | 80 | 0.00 | 60 | 0.00 |
| 0.23 | 50 | 0.00 | 80 | 0.00 | 60 | 0.00 |
| 0.34 | 50 | 0.00 | 80 | 0.00 | 60 | 0.00 |
| 0.45 | 50 | 1.00 | 80 | 1.00 | 60 | 1.00 |
| 0.12 | 90 | 0.00 | 140 | 0.00 | 120 | 0.00 |
| 0.23 | 90 | 0.00 | 140 | 0.00 | 120 | 0.00 |
| 0.34 | 90 | 0.00 | 140 | 0.00 | 120 | 0.00 |
| 0.45 | 90 | 0.00 | 140 | 0.00 | 120 | 0.00 |

Table 4.2. Average fraction of masked outliers of 2- and 5-dimensional generated data sets (see also section 3.3.3).

In all cases, the ability to detect the outliers depends strongly on the fraction of outliers in the data. If there are many outliers, they can only be detected if they lie very far away from the cloud of good points. This is due to the fact that, although the med and the mad have a robustness of 50%, a large concentrated fraction of outliers strongly shifts the med towards the cloud of outliers and enlarges the mad.

In higher dimensions it is more difficult to detect the outliers, as in the Kosinski method. The ability to detect the outliers also depends on the maximum search dimension q. If q is taken equal to p, fewer outliers are masked.

4.3.3 Time dependence

The dependence of the running time of the projection algorithm on the step size step and the maximum search dimension q is shown in Table 4.3.

| n | p | q | step | t |
|---|---|---|---|---|
| 400 | 2 | 2 | 36 | 13.0 |
| 400 | 2 | 2 | 18 | 21.0 |
| 400 | 2 | 2 | 9 | 32.7 |
| 400 | 2 | 2 | 4.5 | 56.8 |
| 400 | 3 | 3 | 36 | 28.1 |
| 400 | 3 | 3 | 18 | 68.6 |
| 400 | 3 | 3 | 9 | 209.1 |
| 400 | 3 | 3 | 4.5 | 719.3 |
| 50 | 5 | 2 | 9 | 26.3 |
| 100 | 5 | 2 | 9 | 50.1 |
| 200 | 5 | 2 | 9 | 107.7 |
| 400 | 5 | 2 | 9 | 202.9 |
| 100 | 2 | 2 | 9 | 8.0 |
| 100 | 3 | 2 | 9 | 19.3 |
| 100 | 4 | 2 | 9 | 33.5 |
| 100 | 5 | 2 | 9 | 50.1 |
| 100 | 6 | 2 | 9 | 71.4 |
| 100 | 7 | 2 | 9 | 98.9 |
| 100 | 8 | 2 | 9 | 128.0 |
| 100 | 5 | 1 | 9 | 5.9 |
| 100 | 5 | 2 | 9 | 50.1 |
| 100 | 5 | 3 | 9 | 479.8 |
| 100 | 5 | 4 | 9 | 2489.1 |
| 100 | 5 | 5 | 9 | 4692.1 |

Table 4.3. Time t (in seconds) per run on p-dimensional data sets of n points using maximum search dimension q and step size step (in degrees).

Asymptotically the time per run should be proportional to

$$(n \ln n) \binom{p}{q} \left(\frac{180}{step}\right)^{q-1},$$

since for each of the $\binom{p}{q}$ subsets a grid is defined with a number of grid points of the order of $(180/step)^{q-1}$, and at each grid point the median of the projected points has to be calculated (n ln n). The results in the table roughly confirm this theoretical estimate. The most important conclusion from the table is that the time per run strongly increases with the search dimension q. This makes the algorithm useful only for relatively low dimensions.

4.3.4 Sensitivity to parameters

The projection method was tested with the twelve data sets that are fully described in section 5, as was done for the Kosinski method (see section 3.3.6). The results are shown in Table 4.4.

Let us first discuss the differences between α = 5% and 1%. In almost all cases the number of outliers detected with α = 5% is larger than with α = 1%. This is completely due to stronger swamping. It is remarked that, unlike in the Kosinski method, there is no algorithmic dependence on the cutoff value. In the projection method a set of outlyingnesses is calculated, and only after the calculation is a certain cutoff value used to discriminate between good and bad points. Hence, a smaller cutoff value leads to more outliers, but all points still have the same outlyingness. In the Kosinski method the cutoff value is already used during the algorithm: the cutoff is used to decide whether more points should be added to the good part. A smaller cutoff leads not only to more outliers but also to a different set of outlyingnesses, since the mean and the covariance matrix are calculated with a different set of points. As a consequence, in cases where the Kosinski method may show a rather strong sensitivity to the cutoff value, this sensitivity is absent in the projection method.

Now let us discuss the dependence of the number of outliers on the maximum search dimension q. In the Hertzsprung-Russel data set and in the HBK data set the number of outliers found with q=1 is already as large as that found with higher values of q.
In the Brain mass data set and in the Milk data set, however, the number of outliers for q=1 is much smaller than for large values of q. In those cases, many outliers are truly multivariate.

In the Hadi data set, the Factory data set and the Bush fire data set there is also a rather large discrepancy between q=2 and q=3. It is remarked that the Hadi data set was constructed so that all outliers are invisible when looking at two dimensions only (see section 5.2.4). In the other two data sets it is also clear that many outliers can only be found by inspecting three or more dimensions at the same time.

If q is higher than three, only slightly more outliers are found than for q=3. Differences can be explained by the fact that searching in higher dimensions with the projection method leads to more outliers (see section 4.3.1).
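The trade-off between the search dimension q and the running time can be made concrete by evaluating the asymptotic estimate of section 4.3.3. The sketch below only evaluates that order-of-magnitude formula; the function names are hypothetical and the result is a relative cost, not a prediction of seconds.

```python
import math


def grid_directions(p, q, step_deg):
    """Order-of-magnitude number of projection directions:
    C(p, q) subsets times (180/step)^(q-1) grid points per subset."""
    return math.comb(p, q) * (180 / step_deg) ** (q - 1)


def relative_cost(n, p, q, step_deg):
    """Relative running time (n ln n) * C(p, q) * (180/step)^(q-1)."""
    return n * math.log(n) * grid_directions(p, q, step_deg)


# Number of directions for p = 5 and step = 9 degrees, as in Table 4.3
for q in range(1, 6):
    print(q, grid_directions(5, q, 9))
```

For p = 5 and step = 9 the estimate grows from 5 directions at q = 1 to 160000 at q = 5, which illustrates why small values of q are attractive when they already find most of the outliers.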
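| Data set | p | n | q | step | α=5% | α=1% |
|---|---|---|---|---|---|---|
| 1. Kosinski | 2 | 100 | 2 | 10 | 78 | 34 |
| | | | 2 | 20 | 77 | 34 |
| | | | 2 | 30 | 42 | 31 |
| 2. Brain mass | 2 | 28 | 2 | 5 | 9 | 6 |
| | | | 2 | 10 | 9 | 4 |
| | | | 2 | 30 | 8 | 4 |
| | | | 1 | n/a | 3 | 1 |
| 3. Hertzsprung-Russel | 2 | 47 | 2 | 1 | 7 | 6 |
| | | | 2 | 30 | 6 | 5 |
| | | | 2 | 90 | 6 | 5 |
| | | | 1 | n/a | 6 | 5 |
| 4. Hadi | 3 | 25 | 3 | 5 | 11 | 5 |
| | | | 3 | 10 | 8 | 0 |
| | | | 2 | 10 | 0 | 0 |
| 5. Stackloss | 4 | 21 | 4 | 5 | 14 | 9 |
| | | | 4 | 10 | 10 | 9 |
| | | | 4 | 15 | 8 | 6 |
| | | | 4 | 20 | 9 | 7 |
| | | | 4 | 30 | 6 | 6 |
| 6. Salinity | 4 | 28 | 4 | 10 | 12 | 8 |
| | | | 4 | 20 | 9 | 7 |
| | | | 3 | 30 | 6 | 4 |
| 7. HBK | 4 | 75 | 4 | 10 | 15 | 14 |
| | | | 4 | 20 | 14 | 14 |
| | | | 1 | n/a | 14 | 14 |
| 8. Factory | 5 | 50 | 5 | 10 | 24 | 18 |
| | | | 5 | 20 | 14 | 9 |
| | | | 4 | 10 | 24 | 17 |
| | | | 3 | 10 | 22 | 14 |
| | | | 2 | 10 | 9 | 9 |
| 9. Bush fire | 5 | 38 | 5 | 10 | 24 | 19 |
| | | | 5 | 20 | 19 | 17 |
| | | | 4 | 10 | 22 | 19 |
| | | | 3 | 10 | 21 | 17 |
| | | | 2 | 10 | 13 | 12 |
| 10. Wood gravity | 6 | 20 | 5 | 20 | 14 | 14 |
| | | | 5 | 30 | 12 | 11 |
| | | | 3 | 10 | 15 | 14 |
| 11. Coleman | 6 | 20 | 5 | 20 | 10 | 8 |
| | | | 5 | 30 | 4 | 4 |
| 12. Milk | 8 | 85 | 5 | 20 | 18 | 14 |
| | | | 5 | 30 | 15 | 13 |
| | | | 4 | 20 | 16 | 14 |
| | | | 4 | 30 | 15 | 13 |
| | | | 3 | 20 | 15 | 13 |
| | | | 3 | 30 | 15 | 12 |
| | | | 2 | 20 | 13 | 11 |
| | | | 2 | 30 | 12 | 7 |
| | | | 1 | n/a | 6 | 5 |

Table 4.4. Number of outliers detected by the projection algorithm with a cutoff of $\sqrt{\chi^2_{p,1-\alpha}}$, for α = 5% and α = 1% respectively, with maximum search dimension q and angular step size step (in degrees).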
The sensitivity to the step size is not large in most cases. In cases like the Hadi data, the Stackloss data, the Salinity data and the Coleman data, the sensitivity can be explained by the sparsity of the data sets. A step size near 10-20 seems to work well in most cases.

In conclusion, the number of outliers is not very sensitive to the parameters q and step. However, the sensitivity is not completely negligible. In most practical cases q=3 and step=10 work well enough.

5. Comparison of methods

In this section the projection method and the Kosinski method are compared with each other as well as with other robust outlier detection methods. In section 5.1 we briefly describe some other methods reported in the literature. The comparison is made by applying the projection method and the Kosinski method to data sets that have been analyzed by at least one of the other methods. Those data sets and the results of the said methods are described in section 5.2. In section 5.3 the results are discussed.

Unfortunately, most papers on outlier detection methods say very little about the efficiency of the methods, i.e. how fast the algorithms are and how their speed depends on the number of points and the dimension of the data set. Therefore we restrict the discussion to the ability to detect outliers.

5.1 Other methods

It is important to note that two different types of outliers are distinguished in the outlier literature. The first type, which is used in this report, is a point that lies far away from the bulk of the data. The second type is a point that lies far away from the regression plane formed by the bulk of the data. The two types will be denoted by bulk outliers and regression outliers respectively.

Of course, outliers are often outliers according to both points of view. That is why we compare the results of the projection method and the Kosinski method, which are both bulk outlier methods, also with regression outlier methods.
An outlier that is declared to be so by both methods is called a bad leverage point. A point that lies far away from the bulk of the points but close to the regression plane is called a good leverage point.

Rousseeuw (1987, 1990) developed the minimum volume ellipsoid (MVE) estimator in order to robustly detect bulk outliers. The principle is to search for the ellipsoid, covering at least half the data points, for which the volume is minimal. The mean and the covariance matrix of the points inside the ellipsoid are inserted in the expression for the Mahalanobis distance. This method is costly due to the complexity of the algorithm that searches for the minimum volume ellipsoid.

A related technique is based on the minimum covariance determinant (MCD) estimator. This technique is employed by Rocke. The aim of this technique is to search for the set of points, containing at least half the data, for which the determinant of the covariance matrix is minimal. Again, the mean and the covariance matrix determined by that set of points are inserted in the Mahalanobis distance expression. This method is also rather complex, although it has been substantially optimized by Rocke.

Hadi (1992) developed a bulk outlier method that is very similar to the Kosinski method. He also starts with a set of p+1 "good" points and increases the good set one point at a time. The difference lies in the choice of the first p+1 points. Hadi orders the n points using another robust measure of outlyingness. The question arises why that other outlyingness would not itself be appropriate for outlier detection. A reason could be that an arbitrary robust measure of outlyingness deviates relatively strongly from the "real" Mahalanobis distance.

Atkinson combines the MVE method of Rousseeuw and the forward search technique also employed by Kosinski. A few sets of p+1 randomly chosen points are used for a forward search. The set that results in the ellipsoid with minimal volume is used for the calculation of the Mahalanobis distances.

Maronna employed a projection-like method, but slightly more complicated. The outlyingnesses are calculated as in the projection method. Then weights are assigned to each point, with low weights for the outlying points, i.e. the influence of outliers is restricted. The mean and the covariance matrix are calculated using these weights. They form the Stahel-Donoho estimator for location and scatter. Finally, Maronna inserts this mean and this covariance matrix in the expression for the Mahalanobis distance.

Egan proposes resampling by the half-mean method (RHM) and the smallest half-volume method (SHV). In the RHM method several randomly selected portions of the data are generated. In each case the outlyingnesses are calculated. For each point it is counted how many times it has a large outlyingness. It is declared to be a true outlier if this happens often.
In the SHV method the distance between each pair of points is calculated and put in a matrix. The column with the smallest sum of the smallest n/2 distances is selected. The corresponding n/2 points form the smallest half-volume. The mean and the covariance of those points are inserted in the Mahalanobis distance expression.

The detection of regression outliers is mainly done with the least median of squares (LMS) method. The LMS method was developed by Rousseeuw (1984, 1987, 1990). Instead of minimizing the sum of the squares of the residuals, as in the least squares method (which should rather be called the least sum of squares method in this context), the median of the squares is minimized. Outliers are simply the points with large residuals as calculated with the regression coefficients determined with the LMS method.

Hadi (1993) uses a forward search to detect the regression outliers. The regression coefficients of a small good set are determined. The set is increased by successively adding the points with the smallest residuals and recalculating the regression coefficients until a certain stop criterion is fulfilled. A small good set has to be found beforehand.
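The selection step of the smallest half-volume (SHV) method described above can be sketched as follows. This is an illustration based only on the description in this section, not Egan's implementation. The function name `smallest_half` is hypothetical, the Euclidean distance is assumed, n/2 is rounded down for odd n, and the zero distance of a point to itself is counted among the smallest distances in its column.

```python
import math


def smallest_half(points):
    """Indices of the n/2 points that form the 'smallest half': the points
    closest to the point whose n/2 smallest distances have the smallest sum."""
    n = len(points)
    h = n // 2                      # assumption: n/2 rounded down
    best_sum, best_idx = math.inf, []
    for centre in points:
        # one column of the distance matrix, sorted from near to far
        column = sorted((math.dist(centre, y), i) for i, y in enumerate(points))
        total = sum(d for d, _ in column[:h])
        if total < best_sum:
            best_sum = total
            best_idx = sorted(i for _, i in column[:h])
    return best_idx
```

The mean and the covariance matrix of the returned points would then be inserted in the Mahalanobis distance expression, as stated above.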
Atkinson combines forward search and LMS. A few sets of p+1 randomly chosen points are used in a forward search. The set that results in the smallest LMS is used for the final determination of the regression residuals.

A completely different approach is the genetic algorithm for detection of regression outliers by Walczak. We will not describe this approach here since it lies beyond the scope of deterministic calculation of outlyingnesses.

Fung developed an adding-back algorithm for confirmation of regression outliers. Once points are declared to be outliers by any other robust method, the points are added back to the data set in a stepwise way. The extent to which the estimates of the regression coefficients are affected by the adding-back of a point is used as a diagnostic measure to decide whether that point is a real outlier. This method was developed because robust outlier methods tend to declare too many points to be outliers.

5.2 Known data sets

In this section the projection method and the Kosinski method are compared by running both algorithms on the twelve data sets given in Table 5.1. Most of these data sets are well described in the robust outlier detection literature. Hence, we are able to compare the results of the two algorithms with known results.

The outlyingnesses as calculated by the projection method and the Kosinski method are shown in Table 5.2, Table 5.4 and Table 5.5. In both methods the cutoff value for α = 1% is used. In the Kosinski method a proportional increment of 20% was used. The outlyingnesses of the projection method were calculated with q=p (if p<6; if p>5 then q=5) and the lowest step size that is shown in Table 4.4.

We will now discuss the data sets one by one.

| Data set | p | n | Source |
|---|---|---|---|
| 1. Kosinski | 2 | 100 | Ref. [1] |
| 2. Brain mass | 2 | 28 | Ref. [3] |
| 3. Hertzsprung-Russel | 2 | 47 | Ref. [3] |
| 4. Hadi | 3 | 25 | Ref. [4] |
| 5. Stackloss | 4 | 21 | Ref. [3] |
| 6. Salinity | 4 | 28 | Ref. [3] |
| 7. HBK | 4 | 75 | Ref. [3] |
| 8. Factory | 5 | 50 | This work |
| 9. Bush fire | 5 | 38 | Ref. [5] |
| 10. Wood gravity | 6 | 20 | Ref. [6] |
| 11. Coleman | 6 | 20 | Ref. [3] |
| 12. Milk | 8 | 85 | Ref. [7] |

Table 5.1. The name, the dimension p, the number of points n, and the source of the tested data sets.

5.2.1 Kosinski data

The Kosinski data form a data set that is difficult to handle from the point of view of robust outlier detection. The two-dimensional data set contains 100 points. Points 1-40 are generated from a bivariate normal distribution with $\mu_1 = 18$, $\mu_2 = -18$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = 0$, and are considered to be outliers. Points 41-100 are good points and are a sample from the bivariate normal distribution with $\mu_1 = 0$, $\mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 40$, $\rho = 0.7$.

The Kosinski method correctly identifies all outliers (see Table 5.2). The projection method identifies none of the outliers and declares many good points to be outliers. The reason for this failure is the large contamination and the small scatter of the outliers. Since there are so many outliers, they strongly shift the median towards the outliers. Hence, the outliers are not detected. Furthermore, since they are narrowly distributed, they almost completely determine the median of absolute deviations in the projection direction perpendicular to the vector pointing from the center of the good points to the center of the outliers. Hence, many points lying at the end points of the ellipsoid of good points have a large outlyingness.

It is remarked that this data set is not an arbitrarily chosen data set. It was generated by Kosinski in order to demonstrate the superiority of his own method over other methods.

5.2.2 Brain mass data

The Brain mass data contain three outliers according to the Kosinski method: points 6, 16 and 25. Those points are also indicated to be outliers by Rousseeuw (1990) and Hadi (1992). Those authors also declare point 14 to be an outlier, but with an outlyingness only slightly above the cutoff. The projection method declares points 6, 14, 16, 17, 20 and 25 to be outliers.

5.2.3 Hertzsprung-Russel data

The two methods produce almost the same outlyingnesses for all points. Both declare points 11, 20, 30 and 34 to be large outliers, in agreement with results by Rousseeuw (1987) and Hadi (1993). However, the projection method and the Kosinski method also declare points 7 and 14 to be outliers, and point 9 is an outlier according to the Kosinski method.
The outlyingness of these three points is relatively small. Visual inspection of the data (see page 28 in Rousseeuw (1987)) shows that these points are indeed moderately outlying.

5.2.4 Hadi data

The Hadi data set is an artificial one. The data set contains three variables $x_1$, $x_2$ and $y$. The two predictors were originally created as uniform (0,15) and were then transformed to have a correlation of 0.5. The target variable was then created by $y = x_1 + x_2 + \varepsilon$ with $\varepsilon \sim N(0,1)$. Finally, cases 1-3 were perturbed to have predictor values around (15,15) and to satisfy $y = x_1 + x_2 + 4$.

The Kosinski method finds the outliers, with a relatively small outlyingness. The projection method finds these outliers too, but also declares two good points to be outliers.
27. 27. Robust multivariate outlier detection A: Kosinski Brain mass Hertzsprung-Russel Hadi B: 3,035 3,035 3,035 3,368 C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos 1 2,59 7,45 51 4,37 1,01 1 1,79 0,75 1 0,80 1,20 1 4,75 3,47 2 2,80 7,96 52 1,53 0,98 2 1,05 1,13 2 1,39 1,46 2 4,75 3,47 3 2,46 7,14 53 2,22 1,05 3 0,37 0,16 3 1,41 1,83 3 4,76 3,46 4 2,87 8,21 54 4,69 1,32 4 0,65 0,13 4 1,39 1,46 4 2,86 1,84 5 2,78 7,97 55 3,97 1,50 5 1,99 0,92 5 1,42 1,90 5 0,96 0,70 6 2,59 7,48 56 3,47 1,44 6 8,40 6,19 6 0,80 1,04 6 3,43 1,57 7 2,84 8,09 57 4,59 2,55 7 2,08 1,27 7 5,55 6,35 7 2,21 0,91 8 2,75 7,89 58 2,27 0,37 8 0,66 0,55 8 1,44 1,38 8 0,46 0,36 9 2,51 7,22 59 2,96 0,51 9 0,94 0,91 9 2,59 3,26 9 0,99 0,35 10 2,45 7,12 60 2,22 0,54 10 1,93 0,99 10 0,61 0,93 10 1,74 1,34 11 2,69 7,71 61 4,94 1,83 11 1,23 0,51 11 11,01 12,67 11 2,50 1,65 12 2,84 8,12 62 5,07 1,29 12 0,96 0,90 12 0,91 1,21 12 1,54 1,13 13 2,77 7,95 63 4,66 1,13 13 0,64 0,60 13 0,79 0,88 13 2,81 1,25 14 2,68 7,72 64 1,68 1,17 14 3,87 2,21 14 3,04 3,51 14 0,98 0,68 15 2,37 6,95 65 3,32 1,03 15 2,22 1,44 15 1,55 1,22 15 2,65 1,37 16 2,46 7,17 66 2,25 1,03 16 7,54 5,63 16 1,23 0,99 16 0,97 0,84 17 2,64 7,59 67 2,59 1,13 17 3,18 1,83 17 2,17 1,80 17 3,31 1,64 18 2,40 6,96 68 3,89 1,04 18 0,90 0,92 18 2,17 2,04 18 3,17 1,39 19 2,46 7,11 69 1,82 0,88 19 3,00 1,43 19 1,77 1,54 19 2,78 1,49 20 2,45 7,15 70 5,96 1,59 20 3,59 1,71 20 11,26 13,01 20 2,94 1,37 21 2,70 7,71 71 2,29 0,70 21 1,54 0,66 21 1,35 1,07 21 0,90 0,66 22 2,62 7,54 72 3,91 0,86 22 0,50 0,25 22 1,62 1,28 22 1,61 1,27 23 2,82 8,11 73 2,15 1,30 23 0,66 0,74 23 1,60 1,41 23 3,89 1,39 24 2,68 7,67 74 6,76 2,00 24 2,18 1,11 24 1,21 1,10 24 2,80 1,22 25 2,37 6,88 75 6,20 2,01 25 8,97 6,75 25 0,34 0,58 25 2,04 1,12 26 2,75 7,86 76 3,37 0,77 26 2,61 1,24 26 1,04 0,78 27 2,67 7,70 77 2,67 0,49 27 2,59 1,41 27 0,88 1,07 28 2,85 8,14 78 1,83 0,50 28 1,13 1,17 28 0,36 0,33 29 2,78 7,98 79 4,19 2,45 29 1,43 1,60 30 2,78 8,00 80 2,71 0,46 30 11,61 13,48 31 
2,45 7,14 81 4,49 1,12 31 1,36 1,09 32 2,91 8,29 82 2,74 0,79 32 1,59 1,48 33 2,51 7,27 83 1,62 0,31 33 0,49 0,52 34 2,33 6,80 84 2,81 0,47 34 11,87 13,88 35 2,68 7,72 85 5,94 1,57 35 1,50 1,50 36 2,82 8,08 86 3,50 1,01 36 1,57 1,70 37 2,52 7,31 87 1,38 1,93 37 1,27 1,13 38 2,65 7,66 88 2,21 1,57 38 0,49 0,52 39 2,49 7,18 89 5,47 1,73 39 1,14 1,03 40 2,61 7,52 90 3,07 1,44 40 1,17 1,52 41 1,89 0,50 91 2,94 1,54 41 0,88 0,60 42 1,84 0,41 92 6,02 1,59 42 0,46 0,30 43 7,94 2,03 93 3,65 0,80 43 0,81 0,77 44 3,04 0,61 94 3,89 0,98 44 0,61 0,80 45 2,35 0,67 95 6,68 1,64 45 1,17 1,19 46 6,42 1,76 96 2,50 0,84 46 0,58 0,37 47 5,36 1,68 97 4,59 1,32 47 1,41 1,20 48 3,74 0,77 98 5,65 1,46 49 3,92 0,92 99 2,12 1,64 50 6,53 1,78 100 2,31 0,30Table 5.2. The outlyingness of each point of the Kosinski, the Brain mass, the Hertzsprung-Russel and the Hadi data. A: Name of data set. B: Cutoff value for • =1%; outlyingnesseshigher than the cutoff are shown in bold. C: Method (Proj: projection method; Kos: Kosinskimethod). 26
The projection method consistently finds larger outlyingnesses than the Kosinski method, by roughly a factor of 2 for most points. This is related to the sparsity of the data set. Consider for instance the extreme case of three points in two dimensions. Every point will have an infinitely large outlyingness according to the projection method. This can be understood by noting that the mad of the projected points is zero if two points coincide in the projection, i.e. if the projection vector is perpendicular to the line through them. The remaining point then has an infinite outlyingness. For data sets with more points the situation is less extreme. But as long as there are relatively few points, the projection outlyingnesses will be relatively large. In such a case the cutoff values based on the $\chi^2$-distribution are in fact too low, leading to the swamping effect.

5.2.5 Stackloss data

The Stackloss data outlyingnesses show large differences between the two methods. One of the reasons is the sensitivity of the Kosinski results to the cutoff value in this case, as discussed in section 3. If a cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$ is used instead of $\sqrt{\chi^2_{4,0.99}} = 3.644$, the Kosinski method shows outlyingnesses as in Table 5.3.

| Point | outl. | Point | outl. | Point | outl. |
|---|---|---|---|---|---|
| 1 | ***4.73*** | 8 | 0.98 | 15 | 1.07 |
| 2 | **3.30** | 9 | 0.76 | 16 | 0.87 |
| 3 | ***4.42*** | 10 | 0.98 | 17 | 1.14 |
| 4 | ***4.19*** | 11 | 0.83 | 18 | 0.71 |
| 5 | 0.63 | 12 | 0.93 | 19 | 0.80 |
| 6 | 0.76 | 13 | 1.24 | 20 | 1.04 |
| 7 | 0.87 | 14 | 1.04 | 21 | ***3.80*** |

Table 5.3. The outlyingnesses of the Stackloss data, calculated with the Kosinski method with cutoff value $\sqrt{\chi^2_{4,0.95}} = 3.080$. Outlyingnesses above this value are shown in bold; outlyingnesses that are even higher than $\sqrt{\chi^2_{4,0.99}} = 3.644$ are shown in bold italic.

Here 5 points have an outlyingness exceeding the cutoff value for α = 5%, four of them (points 1, 3, 4 and 21) even above the value for α = 1%. Even in this case the differences with the projection method are large.
The projection outlyingnesses are up to 5 times larger than the Kosinski ones.

For comparison, Walczak and Atkinson declared points 1, 3, 4 and 21 to be outliers, Rocke also indicated point 2 as an outlier, while points 1, 2, 3 and 21 are outliers according to Hadi (1992). These results are comparable with the results of the Kosinski method with α = 5%. Hence, considering the results in Table 5.4, the Kosinski method yields too few outliers and the projection method too many. In both cases the origin lies in the low n/p ratio.
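The two cutoff values quoted for the Stackloss data, 3.080 and 3.644, agree numerically with the square root of the 0.95 and 0.99 quantiles of the chi-square distribution with p = 4 degrees of freedom. The sketch below reproduces such cutoffs with the Python standard library only. The function names are hypothetical, and the series expansion plus bisection is merely one convenient way to obtain the quantile; it is not taken from the paper.

```python
import math


def chi2_cdf(x, p):
    """Chi-square CDF with p degrees of freedom, via the series expansion
    of the regularised lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = p / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def cutoff(p, alpha):
    """Square root of the (1 - alpha) chi-square quantile, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, p) < 1 - alpha:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, p) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt((lo + hi) / 2)


print(round(cutoff(4, 0.05), 3), round(cutoff(4, 0.01), 3))
```

The same function gives values that round to the cutoffs listed in row B of Tables 5.2, 5.4 and 5.5 for α = 1%, for example `cutoff(2, 0.01)` for the two-dimensional data sets.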
29. 29. Robust multivariate outlier detection A: Stackloss Salinity HBK Factory B: 3,644 3,644 3,644 3,884 C: Proj Kos Proj Kos Proj Kos Proj Kos Proj Kos 1 8,42 1,62 1 2,67 1,29 1 30,38 32,34 51 1,99 1,64 1 5,23 2,12 2 6,92 1,53 2 2,58 1,46 2 31,36 33,36 52 2,20 2,06 2 5,66 1,67 3 8,14 1,45 3 4,65 1,84 3 32,81 34,90 53 3,18 2,80 3 5,55 1,91 4 9,00 1,51 4 3,54 1,63 4 32,60 34,97 54 2,13 1,96 4 4,57 2,05 5 1,74 0,41 5 6,06 4,06 5 32,71 34,92 55 1,57 1,22 5 3,28 2,34 6 2,33 0,82 6 3,12 1,41 6 31,42 33,49 56 1,78 1,46 6 2,19 1,48 7 3,45 1,31 7 2,62 1,25 7 32,34 34,33 57 1,81 1,61 7 2,27 1,49 8 3,45 1,24 8 2,87 1,59 8 31,35 33,24 58 1,67 1,55 8 1,85 1,23 9 2,15 1,11 9 3,31 1,90 9 32,13 34,35 59 0,89 1,13 9 2,15 1,17 10 4,26 1,16 10 2,08 0,91 10 31,84 33,86 60 2,08 2,05 10 3,56 1,70 11 3,01 1,11 11 2,76 1,24 11 28,95 32,68 61 1,78 1,99 11 3,64 1,87 12 3,30 1,34 12 0,77 0,43 12 29,42 33,82 62 2,29 2,00 12 3,67 1,99 13 3,25 1,01 13 2,36 1,28 13 29,42 33,82 63 1,70 1,70 13 2,24 1,43 14 3,75 1,15 14 2,52 1,24 14 33,97 36,63 64 1,62 1,75 14 2,13 1,79 15 3,90 1,20 15 3,71 2,16 15 1,99 1,89 65 1,90 1,85 15 1,84 1,29 16 2,88 0,85 16 14,83 8,08 16 2,33 2,03 66 1,78 1,87 16 3,52 2,34 17 7,09 1,78 17 3,68 1,60 17 1,65 1,74 67 1,34 1,20 17 2,42 1,79 18 3,56 0,98 18 1,84 0,82 18 0,86 0,70 68 2,93 2,20 18 5,55 2,49 19 3,07 1,04 19 2,93 1,79 19 1,54 1,18 69 1,97 1,56 19 5,65 1,76 20 2,48 0,61 20 2,00 1,22 20 1,67 1,95 70 1,59 1,93 20 5,91 2,83 21 8,85 2,11 21 2,50 0,95 21 1,57 1,76 71 0,75 1,01 21 4,35 1,90 22 3,34 1,23 22 1,90 1,70 72 1,00 0,83 22 2,20 1,63 23 5,20 2,07 23 1,72 1,72 73 1,70 1,53 23 2,77 1,62 24 4,62 1,90 24 1,70 1,56 74 1,77 1,80 24 2,14 0,90 25 0,77 0,42 25 2,06 1,83 75 2,44 1,98 25 3,11 2,13 26 1,80 0,87 26 1,73 1,80 26 2,27 1,31 27 2,85 1,11 27 2,17 2,01 27 4,88 2,02 28 3,72 1,48 28 1,41 1,13 28 5,08 2,67 29 1,33 1,13 29 4,49 2,59 30 2,04 1,86 30 1,91 1,27 31 1,61 1,53 31 1,13 0,83 32 1,78 1,70 32 2,00 1,34 33 1,55 1,45 33 3,13 2,05 34 2,10 2,07 34 2,43 1,70 35 
1,41 1,80 35 5,96 2,82 36 1,63 1,61 36 5,78 2,25 37 1,75 1,87 37 5,75 1,83 38 2,01 1,86 38 4,14 1,62 39 2,16 1,93 39 3,16 2,19 40 1,25 1,17 40 2,77 1,62 41 1,65 1,81 41 2,75 1,86 42 1,91 1,72 42 2,56 1,67 43 2,50 2,17 43 4,54 2,15 44 2,04 1,91 44 4,25 1,89 45 2,07 1,86 45 3,91 2,14 46 2,04 1,91 46 2,10 1,52 47 2,92 2,56 47 1,06 0,84 48 1,40 1,70 48 1,47 1,10 49 1,73 2,01 49 3,34 2,16 50 1,05 1,36 50 2,51 1,39Table 5.4. The outlyingness of each point of the Stackloss, the Salinity, the HBKand the Factory data. A, B, C: see Table 5.2. 28
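| A: | Bush fire |  |  | Wood gravity |  |  | Coleman |  |  | Milk |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B: | 3.884 |  |  | 4.100 |  |  | 4.100 |  |  | 4.482 |  |  |  |  |
| C: | Proj | Kos |  | Proj | Kos |  | Proj | Kos |  | Proj | Kos |  | Proj | Kos |
| 1 | 3.48 | 1.38 | 1 | 4.72 | 2.65 | 1 | 3.56 | 2.84 | 1 | 9.06 | 9.46 | 51 | 2.62 | 1.98 |
| 2 | 3.27 | 1.04 | 2 | 2.71 | 1.20 | 2 | 4.92 | 6.37 | 2 | 10.57 | 10.81 | 52 | 3.64 | 2.98 |
| 3 | 2.76 | 1.11 | 3 | 3.68 | 2.19 | 3 | 6.76 | 2.94 | 3 | 4.04 | 5.09 | 53 | 2.38 | 2.22 |
| 4 | 2.84 | 1.02 | 4 | 14.45 | 33.75 | 4 | 2.99 | 1.53 | 4 | 3.86 | 2.83 | 54 | 1.22 | 1.16 |
| 5 | 3.85 | 1.40 | 5 | 3.02 | 2.80 | 5 | 2.70 | 1.43 | 5 | 2.23 | 2.52 | 55 | 1.68 | 1.69 |
| 6 | 4.92 | 1.90 | 6 | 16.19 | 38.83 | 6 | 5.74 | 10.43 | 6 | 2.97 | 2.84 | 56 | 1.10 | 1.01 |
| 7 | 11.79 | 4.37 | 7 | 7.90 | 5.00 | 7 | 3.11 | 2.23 | 7 | 2.36 | 2.35 | 57 | 1.96 | 2.19 |
| 8 | 17.96 | 11.87 | 8 | 15.85 | 37.88 | 8 | 1.48 | 1.83 | 8 | 2.32 | 2.08 | 58 | 2.05 | 1.95 |
| 9 | 18.36 | 12.18 | 9 | 6.12 | 2.72 | 9 | 2.49 | 5.95 | 9 | 2.58 | 2.49 | 59 | 1.47 | 2.21 |
| 10 | 14.75 | 7.64 | 10 | 8.59 | 2.37 | 10 | 5.71 | 12.04 | 10 | 2.20 | 1.98 | 60 | 2.04 | 1.76 |
| 11 | 12.31 | 6.76 | 11 | 5.38 | 3.04 | 11 | 5.07 | 7.70 | 11 | 5.28 | 4.60 | 61 | 1.48 | 1.42 |
| 12 | 6.17 | 2.38 | 12 | 6.79 | 2.65 | 12 | 4.31 | 2.77 | 12 | 6.65 | 6.05 | 62 | 2.64 | 2.07 |
| 13 | 5.83 | 1.77 | 13 | 7.14 | 1.98 | 13 | 3.49 | 2.92 | 13 | 5.63 | 5.38 | 63 | 2.33 | 2.60 |
| 14 | 2.30 | 1.59 | 14 | 2.38 | 2.09 | 14 | 1.95 | 2.16 | 14 | 6.17 | 5.48 | 64 | 2.58 | 1.90 |
| 15 | 4.70 | 1.55 | 15 | 2.40 | 1.47 | 15 | 6.11 | 6.56 | 15 | 5.47 | 5.73 | 65 | 1.85 | 1.56 |
| 16 | 3.43 | 1.38 | 16 | 4.74 | 2.86 | 16 | 2.18 | 2.30 | 16 | 3.84 | 4.56 | 66 | 2.01 | 1.64 |
| 17 | 3.06 | 0.92 | 17 | 6.07 | 2.12 | 17 | 3.78 | 5.95 | 17 | 3.59 | 4.76 | 67 | 3.28 | 2.59 |
| 18 | 2.75 | 1.41 | 18 | 3.28 | 2.49 | 18 | 7.86 | 3.09 | 18 | 3.74 | 3.30 | 68 | 2.41 | 2.33 |
| 19 | 2.82 | 1.38 | 19 | 18.33 | 44.49 | 19 | 3.48 | 2.11 | 19 | 2.43 | 2.85 | 69 | 46.45 | 44.61 |
| 20 | 2.89 | 1.20 | 20 | 7.16 | 2.07 | 20 | 2.80 | 1.56 | 20 | 4.14 | 3.44 | 70 | 1.99 | 1.87 |
| 21 | 2.47 | 1.13 |  |  |  |  |  |  | 21 | 2.26 | 2.08 | 71 | 2.19 | 2.27 |
| 22 | 2.44 | 1.73 |  |  |  |  |  |  | 22 | 1.69 | 1.59 | 72 | 3.24 | 3.02 |
| 23 | 2.46 | 1.04 |  |  |  |  |  |  | 23 | 1.81 | 2.04 | 73 | 6.89 | 6.99 |
| 24 | 3.44 | 1.04 |  |  |  |  |  |  | 24 | 2.28 | 2.05 | 74 | 5.01 | 4.90 |
| 25 | 1.90 | 0.91 |  |  |  |  |  |  | 25 | 2.81 | 2.83 | 75 | 2.02 | 2.03 |
| 26 | 1.69 | 0.97 |  |  |  |  |  |  | 26 | 1.83 | 2.09 | 76 | 4.77 | 4.51 |
| 27 | 2.27 | 0.99 |  |  |  |  |  |  | 27 | 4.24 | 3.71 | 77 | 1.35 | 1.43 |
| 28 | 3.31 | 1.35 |  |  |  |  |  |  | 28 | 3.29 | 3.04 | 78 | 1.49 | 1.87 |
| 29 | 4.82 | 1.83 |  |  |  |  |  |  | 29 | 3.19 | 2.57 | 79 | 2.93 | 2.66 |
| 30 | 5.06 | 2.18 |  |  |  |  |  |  | 30 | 1.47 | 1.39 | 80 | 1.40 | 1.38 |
| 31 | 6.00 | 5.66 |  |  |  |  |  |  | 31 | 2.87 | 2.29 | 81 | 2.59 | 2.34 |
| 32 | 13.48 | 14.08 |  |  |  |  |  |  | 32 | 2.37 | 2.66 | 82 | 2.14 | 2.42 |
| 33 | 15.34 | 16.35 |  |  |  |  |  |  | 33 | 1.78 | 1.33 | 83 | 3.00 | 2.56 |
| 34 | 15.10 | 16.11 |  |  |  |  |  |  | 34 | 2.09 | 1.96 | 84 | 3.88 | 3.06 |
| 35 | 15.33 | 16.43 |  |  |  |  |  |  | 35 | 2.73 | 2.10 | 85 | 2.19 | 2.36 |
| 36 | 15.02 | 16.04 |  |  |  |  |  |  | 36 | 2.66 | 2.32 |  |  |  |
| 37 | 15.17 | 16.30 |  |  |  |  |  |  | 37 | 2.61 | 2.23 |  |  |  |
| 38 | 15.25 | 16.41 |  |  |  |  |  |  | 38 | 2.23 | 2.07 |  |  |  |
|  |  |  |  |  |  |  |  |  | 39 | 2.27 | 2.07 |  |  |  |
|  |  |  |  |  |  |  |  |  | 40 | 3.31 | 2.89 |  |  |  |
|  |  |  |  |  |  |  |  |  | 41 | 10.63 | 10.11 |  |  |  |
|  |  |  |  |  |  |  |  |  | 42 | 3.69 | 3.04 |  |  |  |
|  |  |  |  |  |  |  |  |  | 43 | 3.20 | 2.85 |  |  |  |
|  |  |  |  |  |  |  |  |  | 44 | 7.67 | 6.08 |  |  |  |
|  |  |  |  |  |  |  |  |  | 45 | 1.99 | 2.28 |  |  |  |
|  |  |  |  |  |  |  |  |  | 46 | 1.78 | 2.41 |  |  |  |
|  |  |  |  |  |  |  |  |  | 47 | 5.19 | 5.35 |  |  |  |
|  |  |  |  |  |  |  |  |  | 48 | 2.92 | 2.58 |  |  |  |
|  |  |  |  |  |  |  |  |  | 49 | 3.43 | 2.70 |  |  |  |
|  |  |  |  |  |  |  |  |  | 50 | 3.96 | 2.69 |  |  |  |

Table 5.5. The outlyingness of each point of the Bush fire, the Wood gravity, the Coleman, and the Milk data. A, B, C: see Table 5.2.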
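The values in row B are consistent with a cutoff equal to the square root of the 99% quantile of a chi-square distribution with p degrees of freedom. Row B is defined in Table 5.2, which is not repeated here, so this reading is an assumption, as are the dimensions p = 5, 6, 6 and 8 used for the four data sets. The sketch below only checks that the assumption reproduces the printed numbers; the helper names `chi2_cdf`, `chi2_quantile` and `cutoff` are illustrative.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF for integer df, via the incomplete-gamma recurrence."""
    z = x / 2.0
    if df % 2:
        # Odd df: start from P(1/2, z) = erf(sqrt(z)).
        a, p = 0.5, math.erf(math.sqrt(z))
    else:
        # Even df: start from P(1, z) = 1 - exp(-z).
        a, p = 1.0, 1.0 - math.exp(-z)
    while a < df / 2.0:
        # P(a + 1, z) = P(a, z) - z^a * exp(-z) / Gamma(a + 1)
        p -= z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return p

def chi2_quantile(prob, df, lo=0.0, hi=200.0):
    """Invert the CDF by bisection on [lo, hi]."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cutoff(p, alpha=0.01):
    """Assumed cutoff: sqrt of the (1 - alpha) chi-square quantile, p d.o.f."""
    return math.sqrt(chi2_quantile(1.0 - alpha, p))

# Assumed dimensions of the four data sets in Table 5.5.
for name, p in [("Bush fire", 5), ("Wood gravity", 6), ("Coleman", 6), ("Milk", 8)]:
    print(f"{name}: p = {p}, cutoff = {cutoff(p):.3f}")
```

Under these assumptions the computed cutoffs agree with row B to three decimals. Passing `alpha=0.05` gives the lower cutoff that corresponds to a 5% level.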
5.2.6 Salinity data

The outlyingnesses of the Salinity data are roughly twice as large for the projection method as for the Kosinski method. As a consequence, the latter shows just 2 outliers (points 5 and 16), the former 8. Rousseeuw (1987) and Walczak agree that points 5, 16, 23 and 24 are outliers, with points 23 and 24 lying just above the cutoff. Fung initially finds the same points, but after applying his adding-back algorithm he concludes that point 16 is the only outlier. The projection method shows too many outliers, while the Kosinski method misses points 23 and 24.

5.2.7 HBK data

In the case of the HBK data the projection method and the Kosinski method agree completely. Both indicate points 1-14 to be outliers. This is also in agreement with the results of the original Kosinski method and of Egan, Hadi (1992, 1993), Rocke, Rousseeuw (1987, 1990), Fung and Walczak. Some of these authors find only points 1-10 as outliers, but they use the “regression” definition of an outlier. The HBK data set is an artificial data set in which the good points lie along a regression plane. Points 1-10 are bad leverage points, i.e. they lie far away both from the center of the good points and from the regression plane. Points 11-14 are good leverage points, i.e. although they lie far away from the bulk of the data, they still lie close to the regression plane. If one considers the distance from the regression plane, points 11-14 are not outliers.

5.2.8 Factory data

The Factory data set is a new one.¹ It is given in Table 5.6.

The outlyingnesses show a big discrepancy between the two methods. The projection outlyingnesses are much larger than the Kosinski ones, resulting in 18 versus 0 outliers. The outlyingnesses are so large due to the shape of the data. About half the data set is quite narrowly concentrated around the center of the data, while the other half forms a rather thick tail.
Hence, in many projection directions the mad is very small, leading to large outlyingnesses for the points in the tail. The projection outliers are comparable to the Kosinski outliers found with a cutoff for α = 5% (see also section 3.3.6).

¹ The Factory data is a generated data set, originally used in an exercise on regression analysis in the CBS course “multivariate technics with SPSS”. It is interesting to note that the regression coefficients change radically if the points that are indicated to be outliers by the projection method and by the Kosinski method with low cutoff are removed from the data set. In other words, the regression coefficients are mainly determined by the “outlying” points.
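The effect of a small mad on the outlyingness of tail points can be illustrated in one dimension. The sketch below assumes that the outlyingness along a single projection direction has the form |x − median| / mad, as in Stahel-Donoho type estimators; the paper's exact definition, including any consistency factor, is given in an earlier section and may differ. The sample values are invented for illustration and are not the Factory data.

```python
from statistics import median

def outlyingness_1d(values):
    """Return |x - median| / mad for each value of a 1-D projection."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [abs(v - med) / mad for v in values]

# Invented sample: a narrow core plus a thick tail on one side.
core = [9.9, 9.95, 10.0, 10.0, 10.05, 10.1]
tail = [12.0, 14.0, 16.0, 18.0]

scores = outlyingness_1d(core + tail)
core_scores = scores[:len(core)]
tail_scores = scores[len(core):]

print("largest core outlyingness :", round(max(core_scores), 2))
print("smallest tail outlyingness:", round(min(tail_scores), 2))
```

Because more than half of the sample sits in the narrow core, the mad is set by the spread of the core alone. Every tail point is then many mad units away from the median, even though the tail makes up a large share of the data.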
|  | x1 | x2 | x3 | x4 | x5 |  | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.9 | 7.107 | 21 | 129 | 11.609 | 26 | 12.3 | 12.616 | 20 | 192 | 11.478 |
| 2 | 8.4 | 6.373 | 22 | 141 | 10.704 | 27 | 4.1 | 14.019 | 20 | 177 | 14.261 |
| 3 | 21.6 | 6.796 | 22 | 153 | 10.942 | 28 | 6.8 | 16.631 | 23 | 185 | 15.300 |
| 4 | 25.2 | 9.208 | 20 | 166 | 11.332 | 29 | 6.2 | 14.521 | 19 | 216 | 10.181 |
| 5 | 26.3 | 14.792 | 25 | 193 | 11.665 | 30 | 13.7 | 13.689 | 22 | 188 | 13.475 |
| 6 | 27.2 | 14.564 | 23 | 189 | 14.754 | 31 | 18 | 14.525 | 21 | 192 | 14.155 |
| 7 | 22.2 | 11.964 | 20 | 175 | 13.255 | 32 | 22.8 | 14.523 | 21 | 183 | 15.401 |
| 8 | 17.7 | 13.526 | 23 | 186 | 11.582 | 33 | 26.5 | 18.473 | 22 | 205 | 14.891 |
| 9 | 12.5 | 12.656 | 20 | 190 | 12.154 | 34 | 26.1 | 15.718 | 22 | 200 | 15.459 |
| 10 | 4.2 | 14.119 | 20 | 187 | 12.438 | 35 | 14.8 | 7.008 | 21 | 124 | 10.768 |
| 11 | 6.9 | 16.691 | 22 | 195 | 13.407 | 36 | 18.7 | 6.274 | 21 | 145 | 12.435 |
| 12 | 6.4 | 14.571 | 19 | 206 | 11.828 | 37 | 21.2 | 6.711 | 22 | 153 | 9.655 |
| 13 | 13.3 | 13.619 | 22 | 198 | 11.438 | 38 | 25.1 | 9.257 | 22 | 169 | 10.445 |
| 14 | 18.2 | 14.575 | 22 | 192 | 11.060 | 39 | 26.3 | 14.832 | 25 | 191 | 13.150 |
| 15 | 22.8 | 14.556 | 21 | 191 | 14.951 | 40 | 27.5 | 14.521 | 24 | 177 | 14.067 |
| 16 | 26.1 | 18.573 | 21 | 200 | 16.987 | 41 | 17.6 | 13.533 | 24 | 186 | 12.184 |
| 17 | 26.3 | 15.618 | 22 | 200 | 12.472 | 42 | 12.4 | 12.618 | 21 | 194 | 12.427 |
| 18 | 14.8 | 7.003 | 22 | 130 | 9.920 | 43 | 4.3 | 14.178 | 20 | 181 | 14.863 |
| 19 | 18.2 | 6.368 | 22 | 144 | 10.773 | 44 | 6 | 16.612 | 21 | 192 | 14.274 |
| 20 | 21.3 | 6.722 | 21 | 123 | 15.088 | 45 | 6.6 | 14.513 | 20 | 213 | 10.706 |
| 21 | 25 | 9.258 | 20 | 157 | 13.510 | 46 | 13.1 | 13.656 | 22 | 192 | 13.191 |
| 22 | 26.1 | 14.762 | 24 | 183 | 13.047 | 47 | 18.2 | 14.525 | 21 | 191 | 12.956 |
| 23 | 27.4 | 14.464 | 23 | 177 | 15.745 | 48 | 22.8 | 14.486 | 21 | 189 | 13.690 |
| 24 | 22.4 | 11.864 | 21 | 175 | 12.725 | 49 | 26.2 | 18.527 | 22 | 200 | 17.551 |
| 25 | 17.9 | 13.576 | 23 | 167 | 12.119 | 50 | 26.1 | 15.578 | 22 | 204 | 13.530 |

Table 5.6. The Factory data (n=50, p=5). The average temperature (x1, in degrees Celsius), the production (x2, in 1000 pieces), the number of working days (x3), the number of employees (x4) and the water consumption (x5, in 1000 liters) at a factory in 50 successive months.

5.2.9 Bushfire data

The outliers found by the adjusted Kosinski method (points 7-11, 31-38) agree perfectly with those found by the original algorithm of Kosinski and with the results by Rocke and Maronna.
The projection method shows points 6, 12, 13, 15, 29 and 30 as additional outliers. Due to the large contamination the projected median is shifted strongly, leading to relatively large outlyingnesses for the good points and, consequently, many swamped points.

5.2.10 Wood gravity data

Rousseeuw (1984), Hadi (1993), Atkinson, Rocke and Egan declare points 4, 6, 8 and 19 to be outliers. The Kosinski method finds these outliers too, but additionally flags point 7. The projection method shows strange results. Fourteen points have an outlyingness above the cutoff, which is 70% of the data set. This is of course not realistic. The reason is again the sparsity of the data set. Hence, it is rather surprising that the Kosinski method and the methods by other authors perform relatively well in this case.

5.2.11 Coleman data

The Coleman data contain 8 outliers according to the projection method, 7 according to the Kosinski method. However, they agree only upon 5 points (2, 6, 10, 11, 15).
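The counts quoted for the Coleman data can be checked directly against Table 5.5. The sketch below copies the Coleman columns of that table and flags every point whose outlyingness exceeds the cutoff in row B (4.100). The function name `flagged` is illustrative and not part of either method.

```python
CUTOFF = 4.100  # row B of Table 5.5 for the Coleman data

# Outlyingness of Coleman points 1..20, copied from Table 5.5.
proj = [3.56, 4.92, 6.76, 2.99, 2.70, 5.74, 3.11, 1.48, 2.49, 5.71,
        5.07, 4.31, 3.49, 1.95, 6.11, 2.18, 3.78, 7.86, 3.48, 2.80]
kos = [2.84, 6.37, 2.94, 1.53, 1.43, 10.43, 2.23, 1.83, 5.95, 12.04,
       7.70, 2.77, 2.92, 2.16, 6.56, 2.30, 5.95, 3.09, 2.11, 1.56]

def flagged(scores, cutoff):
    """Return the 1-based indices of points with outlyingness above cutoff."""
    return {i for i, s in enumerate(scores, start=1) if s > cutoff}

proj_out = flagged(proj, CUTOFF)
kos_out = flagged(kos, CUTOFF)

print("projection only:", sorted(proj_out - kos_out))
print("Kosinski only  :", sorted(kos_out - proj_out))
print("both methods   :", sorted(proj_out & kos_out))
```

The same comparison applies to the other data sets in Table 5.5 once their columns and cutoffs are substituted.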