Robust Statistical Procedures
by
Tokelo Khalema (2008060978)
Supervisor: Sean van der Merwe
University of the Free State
Submitted in partial fulfilment of the requirements for the degree:
B.Sc. Hons. Mathematical Statistics
November 2013
Declaration
I hereby declare that this work, submitted for the degree B.Sc. Hons. (Mathematical Statis-
tics), at the University of the Free State, is my own original work and has not previously
been submitted, for degree purposes or otherwise, to any other institution of higher learning.
I further declare that all sources cited or quoted are indicated and acknowledged by means
of a comprehensive list of references. Copyright hereby cedes to the University of the Free
State.
Tokelo Khalema
Abstract
The Gaussian linear model, two Bayesian Student-t regression models, and the method
of least absolute deviations are compared in a Monte Carlo study. Their relative perfor-
mances under conditions favourable to the Gaussian linear model (or OLS regression)
are first investigated and then a few violations of assumptions that underly the OLS re-
gression model are made and the models compared again. Our object is twofold. First
we want to see how soon and how severely the least squares regression model starts
to lose optimality against three of its common alternatives mentioned above. Second,
we want to see how the Bayesian Student-t models fare relative to the method of least
absolute deviations in cases where OLS would normally not be applied. In addition to
opening with a review of influence diagnosis in least squares, we treat at some length the
two Bayesian Student-t regression models and the method of least absolute deviations, and
finally we briefly review M-estimation and its generalisation, i.e. M-regression.
Acknowledgement
That the work reported here finally came to fruition was not achieved single-handedly.
Continuous guidance, support and insightful comments provided by Sean van der Merwe are
gratefully acknowledged.
Problem Statement
When Normal regression model assumptions are grossly violated, the practitioner is likely
to be faced with the problem of having to choose among a plethora of robust alternatives. It
would therefore be useful to know which alternative method yields more “stable” results
than the others, and under exactly which scenarios of violated model assumptions it does so.
For instance, suppose one of the models considered in this study performed best (according
to some prespecified criteria) for sample sizes of at most 100 comprising up to 80% outliers,
but worst otherwise. A practitioner with this knowledge would then apply this model only
to very heavily contaminated data with small to moderate sample sizes. Our study
is not as exhaustive as a comprehensive investigation would otherwise be; we will base our
judgement on some of the most commonly violated model assumptions, namely, skewness,
non-Normality and the presence of outliers.
Overview
This paper is divided into five sections. The first section sheds some light on the concept of
robustness and reviews some of the most common location estimators. This is followed by a
generalisation of location estimation to regression in the literature study section. Regression
methods reviewed in the literature study are ordinary least squares, Bayesian Student-t
regression, and the method of least absolute deviations. The third section presents our
research methodology, and the penultimate section discusses results from our Monte Carlo
study. Finally we close the note with a section on conclusions drawn from our study.
Contents

Declaration
1 Introduction
1.1 Introduction
1.2 The notion of robustness
1.3 Alternative estimators of location and scale
1.4 The need for robust methods
1.5 Conclusion
2 Literature Study
2.1 Introduction
2.1.1 Least squares estimation
2.2 Robust Regression
2.2.1 Introduction to Robust Regression
2.2.2 The Independent Student-t Regression Model
2.2.3 Objective treatment of the Student-t linear model
2.2.4 Least Absolute Deviations
2.2.5 Methods based on M-estimators
2.3 Conclusion
3 Methodology
3.1 Introduction
3.2 Research Design
3.3 Research Objectives
3.4 Model performance criteria
3.5 Conclusion
4 Results/Applications
4.1 Introduction
4.2 Scenario one
4.3 Scenarios two and three
4.4 Scenario four
4.5 Scenario five
4.6 Scenario six
4.7 Effects on inference
4.8 Conclusion
5 Closing remarks
5.1 Introduction
5.2 Summary of key results
5.3 Suggestions for further research
5.4 Conclusion
A Table of results
B Extra graphs
List of Figures

1 Regression diagnostic plot for the OLS model with all observations.
2 Regression diagnostic plot for the OLS model with all but the first, third, fourth, and twenty-first observations.
3 Jackknifed parameter estimates for the stackloss data.
4 Plot of the OLS and LAD regressions of stopping distance on velocity.
5 Plots of intervals for explanatory variables on which the LAD fit is resistant.
6 Objective functions of Huber's and Tukey's Bisquare estimators of location and their corresponding influence functions.
7 Effects of 5% contamination on location.
8 Plots of observed coverage probability against nominally stated probability for OLS under scenario one.
9 Simulated OLS confidence intervals for the parameters under standard conditions.
10 Simulated OLS confidence intervals for β2 under the 5%-contamination scenario.
List of Tables

1 Michelson's supplementary determinations of the velocity of light in air.
2 Some location estimates and their respective bootstrap variances based on 10 000 samples.
3 Operational data of a plant for the oxidation of ammonia to nitric acid.
4 Summary table for influence diagnostics from the regression on the stackloss data.
5 Data on stopping distance as a function of velocity.
6 Summary of the calculations of the LAD regression of stopping distance on velocity.
7 Heart catheterisation data recorded on 12 patients.
8 Admissible intervals for the values of the response variable.
9 Table of grouped and reordered observations from the catheterisation data.
10 Table of admissible intervals of explanatory variables for non-defining observations.
11 Results for the sixth scenario from a Monte Carlo simulation study with i = 1000 iterations.
12 Results for scenarios one to five from a Monte Carlo simulation study with i = 1000 iterations.
1 Introduction
1.1 Introduction
It is widely acknowledged that Normal theory has for hundreds of years played an unrivalled
role in all forms of inference. About fifty years after the method of least absolute deviations
was introduced, Legendre introduced the method of least squares and the notion of a linear
model in his Nouvelles méthodes pour la détermination des orbites des comètes (Stigler 1977;
Birkes & Dodge 1993, pp. 29, 57; Wang & Chow 1994, pp. 5-7). Later Gauss assumed that
the random errors in Legendre’s linear model followed a Normal distribution and proved
some important properties of his estimates (Wang & Chow 1994, p. 6). Gauss also derived
many important properties of the Normal distribution which then came to be known also
as the Gaussian distribution. The linear model which assumes Normality of the random
errors came to be called the Normal (or Gaussian) linear model. Also instrumental in the
development of the theory of the Gaussian linear model were Fisher and Markov (Wang &
Chow 1994, p. 6). It was not too long until the method of least absolute deviations was
overshadowed by that of least squares (Birkes & Dodge 1993, p. 57).
The linear model as we know it today, with uncorrelated homoscedastic random errors, is
sometimes referred to as the Gauss-Markov linear model (Wang & Chow 1994, p. 147), or
simply and more commonly, as the Gaussian (or Normal) linear model. It is the object of
this note to review the Normal linear model and to show the vulnerability of least squares
estimates to gross errors. The fitting process of the Normal linear model will also be shown
to be quite laborious. Then we will investigate alternative methods that have already been
proposed in the literature and compare them in a simulation study which includes a variety
of scenarios. In particular, each model will be studied under conditions that satisfy the as-
sumptions of the least squares regression model, and then under increasingly unfavourable
conditions in which the assumptions are violated in a variety of ways. Methods investigated
herein are the method of Least Absolute Deviations (or Minimum Sum of Absolute Devia-
tions), and the Bayesian Student-t regression model. Two different implementations of the
latter regression model will be considered and compared.
1.2 The notion of robustness
All models, statistical or otherwise, are based on a set of assumptions. Sometimes it is
desirable to have a procedure whose output is not heavily reliant on the validity of the
assumptions. The reader will recall that the assumptions on which the Gaussian linear
model is based are (see e.g. Rice 2007, p. 547):
1. Normality of the error distribution,
2. independence of the random errors and,
3. error variance homogeneity.
Deviation from the Normality assumption can come in a variety of ways — the error dis-
tribution might exhibit more skewness, fatter tails, or more outliers¹ than would otherwise
be expected if the underlying distribution were Normal. Although several formal test pro-
cedures have been proposed for testing Normality, examination of the residuals is more
common. The second assumption is more troublesome to test. What analysts usually do is
assume that only two types of dependence are possible, namely blocking and serial
correlation. Then the former could be dealt with by adding an extra
parameter to represent the block effect and the latter could be diagnosed using time-series
analysis tools (Ravishanker & Dey 2002, pp. 125, 291; Miller 1997, pp. 33-34 ).
Another difficulty associated with the independence assumption is that it underlies both
parametric and non-parametric statistical procedures (Rice 2007, p. 505). So should one
find that the independence assumption is suspect, they are not at liberty to use, say, the
Kruskal-Wallis test in place of the parametric one-way ANOVA. Most formal test procedures
for variance homogeneity are not robust to non-Normality (Stigler 2010). An infamous
example of these is Bartlett’s test of which the F-test of the equality of two population
variances is a special case (Sokal & Rohlf 1969, p. 375; Stigler 2010; Miller 1997, p. 264).
Other perhaps less famous tests are Hartley’s and Cochran’s tests (Rivest 1986; Miller 1997,
p. 264). Rivest (1986) proves that Bartlett’s, Cochran’s, and Hartley’s tests are non-robust
in small sample situations and concludes that they are all liberal when the underlying
distribution is long-tailed. Levene’s test (Levene 1960) is usually considered a robust test
of variance homogeneity (see e.g. Vorapongsathorn et al. 2004), but a less formal diagnostic
procedure commonly employed in practice is to create a plot of the residuals against the
fitted values.
The last paragraph left the word “robust” unexplained. What does it mean? It was G.E.P.
Box who first used the word in the context of data analysis (Stigler 2010). Although
finer definitions of the word exist, the following will suffice for our purposes: A procedure is
called robust if inferences drawn from it are not overly dependent on the assumptions upon
which it is predicated. For instance, the t-test for equality of population means has been
shown to be robust to departures from Normality (Miller 1997, p. 5). The same applies a
fortiori to the F-test of equality of several population means (Miller 1997, p. 80); but the
F-test of equality of population variances is not robust to non-Normality (Huber 2009, p.
297; Rice 2007, p. 464). By and large, F-tests on the equality of location parameters are
quite robust, while those on the equality of scale parameters are not (Miller 1997, p. 265).
The word “resistance” is also often encountered in robust statistics. Although we will refrain
from distinguishing between resistance and robustness, it is well to give a more widely
accepted definition of the word “resistance”. A statistic or procedure is resistant if it is not
unduly sensitive to ill-behaved data (Goodall 1983, p. 349). Contrast this with the notion
of robustness. For example, the sample median is a resistant measure of location, while the
sample mean is non-resistant. However, if one were to estimate the location parameter µ
for a N(µ, σ²) population², then the sample mean would be a more efficient estimator than
the sample median; for a Normal parent distribution, the asymptotic efficiency of the
sample mean relative to the median is π/2 (see e.g. Birkes & Dodge 1993, p. 192). Thus for
infinitely large Gaussian samples, the sample median is only about 64% as efficient as the
sample mean. We see then that in using a robust procedure, one should be prepared to lose
some efficiency if the underlying distribution turns out to be Normal.

¹Loosely speaking, an outlier is a gross error.
²Recall that the mean coincides with the median in a Normal population.
1.3 Alternative estimators of location and scale
The time-honoured sample mean is a non-robust estimator of location because it weighs
all observations equally. On the other hand, since the sample median depends only on
the centremost values, it is robust to tail observations. A measure of scale related to and
derived from the sample median is the median absolute deviation (MAD). Consider a sample,
x1, . . . , xn, from some distribution F. If we denote the median as,
\[
\tilde{x} = \operatorname{median}\{x_1, \ldots, x_n\},
\]
then the MAD is defined thus,
\[
\mathrm{MAD} \propto \operatorname{median}\{|x_1 - \tilde{x}|, \ldots, |x_n - \tilde{x}|\}.
\]
Unlike the conventional sample standard deviation that exhibits quadratic weighting, it is
very resistant to outliers (Rice 2007, p. 402).
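As an illustrative aside (not part of the original text), the MAD is straightforward to compute. The Python sketch below uses the conventional constant 1.4826, an assumption on our part that makes the MAD consistent for the standard deviation under Normality, whereas the definition above only requires proportionality.

```python
import numpy as np

def mad(x, scale=1.4826):
    """Median absolute deviation about the median.

    The factor 1.4826 rescales the MAD so that it estimates the standard
    deviation under a Normal model; the text above only requires
    proportionality, so any positive constant would do.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)                       # the sample median, x-tilde
    return scale * np.median(np.abs(x - med))

# A single wild observation barely moves the MAD, unlike the sample SD.
sample = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 35.0])
print(mad(sample), np.std(sample, ddof=1))
```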
Another robust estimator of location related to the median is the trimmed mean. A δ-
trimmed mean is calculated by ordering a sample Y1, . . . , Yn and discarding the nδ leftmost
and rightmost observations (see e.g. Rice 2007, p. 397; Rosenberger & Gasko 1983, pp. 307-
308). Hence, for an ordered sample, Y(1) ≤ · · · ≤ Y(n), the δ-trimmed mean may be written
as (Miller 1997, p. 29),
\[
\bar{Y}_T = \frac{Y_{(n\delta+1)} + Y_{(n\delta+2)} + \cdots + Y_{(n-n\delta)}}{(1-2\delta)n}.
\]
The δ-trimmed mean is often simply denoted $\bar{Y}_\delta$. The sample median is approximately
a 50%-trimmed mean, and of course a 0%-trimmed mean is the ordinary sample mean
(Rosenberger & Gasko 1983, p. 308).
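A δ-trimmed mean can likewise be computed in a few lines. The sketch below (with made-up data) applies the formula directly and, for comparison, calls the equivalent routine scipy.stats.trim_mean.

```python
import numpy as np
from scipy import stats

y = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 35.0])

# delta-trimmed mean as defined above: drop the n*delta smallest and
# largest order statistics and average the remaining (1 - 2*delta)n values.
delta = 0.2
y_sorted = np.sort(y)
k = int(np.floor(len(y) * delta))
by_hand = y_sorted[k:len(y) - k].mean()

# scipy provides the same estimator directly.
print(by_hand, stats.trim_mean(y, proportiontocut=delta))
```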
Somewhat related to trimming is the process of Winsorising (see e.g. Miller 1997, p. 29).
In fact, the Winsorised variance is an appropriate scale estimator used with the trimmed
mean (Miller 1997, p. 29). Winsorising consists in replacing lower tail observations with
larger order statistics and upper tail observations with smaller order statistics (Miller 1997,
pp. 29-30). Say we observe a sample Y1, . . . , Yn from some parent distribution. Then we
can obtain a Winsorised sample by defining³,
\[
y_{W(i)} =
\begin{cases}
y_{(n\delta+1)}, & i = 1, \ldots, n\delta, \\
y_{(i)}, & i = n\delta + 1, \ldots, n - n\delta, \\
y_{(n-n\delta)}, & i = n - n\delta + 1, \ldots, n,
\end{cases}
\tag{1.1}
\]
³Note that we use lower-case letters to denote observed values and upper-case letters to denote random quantities.
where the fraction δ is the same as for the trimmed mean (Miller 1997, p. 30). The δ-
Winsorised mean is then defined as (Miller 1997, p. 30),
\[
\bar{Y}_W = \frac{1}{n}\sum_{i=1}^{n} Y_{W(i)} \tag{1.2}
\]
\[
= \frac{1}{n}\left( n\delta\, Y_{(n\delta+1)} + \sum_{i=n\delta+1}^{n-n\delta} Y_{(i)} + n\delta\, Y_{(n-n\delta)} \right) \tag{1.3}
\]
\[
= \frac{n\delta\, Y_{(n\delta+1)} + Y_{(n\delta+1)} + Y_{(n\delta+2)} + \cdots + Y_{(n-n\delta)} + n\delta\, Y_{(n-n\delta)}}{n}, \tag{1.4}
\]
from which the δ-Winsorised variance is defined as (Miller 1997, p. 30),
\[
s_W^2 = \frac{1}{(1-2\delta)^2(n-1)}\sum_{i=1}^{n}\left(Y_{W(i)} - \bar{Y}_W\right)^2 \tag{1.5}
\]
\[
= \frac{1}{(1-2\delta)^2(n-1)}\left[ n\delta\left(Y_{(n\delta+1)} - \bar{Y}_W\right)^2 + \sum_{i=n\delta+1}^{n-n\delta}\left(Y_{(i)} - \bar{Y}_W\right)^2 + n\delta\left(Y_{(n-n\delta)} - \bar{Y}_W\right)^2 \right]. \tag{1.6, 1.7}
\]
A location estimator proposed by Huber (1964) that has good robustness and efficiency
properties has been related to Winsorising by the originator. This estimator will be discussed
again in later sections. If the above Winsorising process was performed “asymmetrically”
so that the g leftmost and h rightmost observations were Winsorised, where g and h need
not be equal, then define (see Huber 1964),
\[
T = \frac{g u + Y_{(g+1)} + Y_{(g+2)} + \cdots + Y_{(n-h)} + h v}{n}, \tag{1.8}
\]
where the numbers u, v satisfy Y(g) ≤ u ≤ Y(g+1) and Y(n−h) ≤ v ≤ Y(n−h+1), respectively.
Huber (1964) posits that the estimator T is asymptotically equivalent to Winsorising.
A close competitor to estimators of the type defined by Equation 1.8 is that proposed by
Hodges and Lehmann (1963) (see Huber 1964). It may be taken as defined by (Huber 1964),
\[
T = \operatorname{median}\left\{ (Y_i + Y_j)/2 \mid i < j \right\}.
\]
The Hodges-Lehmann estimator, as it is called, is associated with the signed-rank statistic
(Miller 1997, p. 24).
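The Hodges-Lehmann estimator is simple to compute by brute force, as in the sketch below (made-up data; an O(n²) enumeration of pairs, which is adequate for small samples).

```python
import numpy as np
from itertools import combinations

def hodges_lehmann(y):
    """Hodges-Lehmann estimator: median of pairwise averages (Y_i + Y_j)/2, i < j."""
    pair_means = [(a + b) / 2.0 for a, b in combinations(y, 2)]
    return np.median(pair_means)

print(hodges_lehmann([9.8, 10.1, 10.0, 9.9, 10.2, 35.0]))
```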
Other location estimators proposed for their robustness, although rarely met with in practice
are, the outmean, and the Edgeworth. The former can be written (Stigler 1977),
\[
\bar{Y}^{c}_{.25} = 2\bar{Y} - \bar{Y}_{.25},
\]
where ¯Y.25 denotes the 25%-trimmed mean. The latter estimator is a weighted average of
the lower quartile, the median, and the upper quartile, with weights in proportions 5 : 6 : 5
(Stigler 1977). The outmean is known to perform poorly for long-tailed distributions (Stigler
1977).
Consider the data given in Table 1 taken from Stigler (1977). The data reported here are
Michelson’s supplementary determinations of the velocity of light in air. We calculate the
sample mean of these data to be 756.2, and the sample median to be 774.0. Respective
bootstrap variances of these estimates are 475.4 and 402.4. Hence, one sees that the median
is a more stable location estimator than is the mean for these data. Values of other location
estimators are given in the fourth row of Table 2. The Hodges-Lehmann estimator has the
highest variance of 497.7. The fifth, sixth, seventh, and eighth rows of Table 2 give values of
location estimators and their respective bootstrap variances for other data sets not presented
in this note. The last row shows that the Hodges-Lehmann estimator for the last data set
has a value far more stable than the sample mean. The sample median still performs best
in this case. However, for the third data set, the sample median performs worst. Overall,
the trimmed means never seem to be too unstable.
Measures of velocity of light in air
883 711 578 696 851
816 611 796 573 809
778 599 774 748 723
796 1051 820 748
682 781 772 797
Table 1: Michelson’s supplementary determinations of the velocity of light in air.
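The bootstrap variances reported in Table 2 can be approximated along the following lines; this is a Python sketch using the Table 1 values and 10 000 nonparametric resamples, and the exact figures will vary with the random seed and resampling details.

```python
import numpy as np

# Michelson's supplementary determinations from Table 1.
speeds = np.array([883, 711, 578, 696, 851, 816, 611, 796, 573, 809,
                   778, 599, 774, 748, 723, 796, 1051, 820, 748,
                   682, 781, 772, 797], dtype=float)

rng = np.random.default_rng(0)
B = 10_000                        # number of bootstrap resamples, as in Table 2
boot = rng.choice(speeds, size=(B, speeds.size), replace=True)

# Bootstrap variances of the sample mean and the sample median.
print(np.var(boot.mean(axis=1), ddof=1),
      np.var(np.median(boot, axis=1), ddof=1))
```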
Several other robust estimators of location and scale have been proposed.
1.4 The need for robust methods
Sometimes the hope that Normality is satisfied is too fond to be entertained. It might be that
the data are too heavily contaminated, as would be the case if the underlying distribution
were the Student-t distribution or any of the other heavy-tailed distributions. Although
issues such as skewness and variance heterogeneity can sometimes be remedied by the use of
some transformation, the problem of fat tails is not as easy to get around. One will usually
have to adopt a different model altogether. For instance, the errors could be assumed to
follow a Student-t distribution and parameter estimates could be calculated. As will be
shown later, this is not an easy task.
When classical methods are applied to contaminated data problems, often the analyst will
first clean the data by making use of some outlier rejection method. Then he would continue
to apply the method to the remaining scores as if they constituted the whole sample. To
see the defect of such a procedure, consider a sample, {X1, . . . , Xn}, from some parent
distribution, say Normal. One method of outlier rejection would be to form the statistic
$(X_i - \bar{X})/s$, where $s^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2/(n-1)$ (Hawkins 1980, pp. 11-12). Now if c is
chosen such that the test has some prespecified experimentwise significance level α, then
any Xi satisfying |Xi − ¯X|/s > c would be identified as an outlier and thus rejected. The
presence of multiple outliers in the sample would increase the sample variance s², so that
the statistic (Xi − X̄)/s takes on very low values and hence the test will fail to reject
some otherwise significant observations (Hawkins 1980, p. 12). This effect is referred to as
masking (Hawkins 1980, p. 12).

Estimator:     Hodges-Lehmann        mean          10%-trimmed mean   20%-trimmed mean       median
             Estimate  Variance  Estimate Variance  Estimate Variance  Estimate Variance  Estimate Variance
               760.0     497.7    756.2    475.4     753.1    459.4     761.8    412.1     774.0    402.4
              5.4550     .0016   5.4479    .0016    5.4560    .0017    5.4574    .0017    5.4600    .0033
             28.5000    1.5096  28.5500   1.2306   28.5625   1.6287   28.2500   1.8045   28.0000   1.9407
              8.5700     .0303   8.6317    .0273    8.6063    .0286    8.5817    .0272    8.5000    .0282
              8.4275     .0191   8.3776    .0411    8.4527    .0362    8.4282    .0171    8.3600    .0097

Table 2: Some location estimates and their respective bootstrap variances based on 10 000 samples.
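Returning to the masking effect described above, the following sketch (with made-up values) shows how a second gross error inflates s and deflates the maximum studentised statistic, so that outliers which would be flagged individually may jointly escape detection.

```python
import numpy as np

def max_studentised(x):
    """Return max_i |x_i - xbar| / s, the usual single-outlier test statistic."""
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

clean = np.array([10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0])
one_outlier = np.append(clean, 20.0)            # a single gross error stands out
two_outliers = np.append(clean, [20.0, 20.5])   # a second one inflates s ...

# ... so the statistic for each outlier drops and both may escape rejection.
print(max_studentised(one_outlier), max_studentised(two_outliers))
```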
The outlier rejection procedure outlined above is quite primitive. More sophisticated meth-
ods have been proposed (see e.g. Hawkins 1980). Huber (2009), however, argues that the
best outlier rejection procedures do not reach the performance of the best robust proce-
dures, and that classical outlier rejection rules are unable to cope with multiple outliers.
Also, the most commonly used outlier rejection rule (namely, the maximum studentized
rule) will have trouble detecting one distant outlier out of 10 (Hampel 2001; Hampel 1985).
Although they remain in common use, the legitimacy of outlier rejection methods is brought
to question.
One common method of applying the classical Gaussian linear model to contaminated data
is that due to Hoaglin and Welsch (1978). They first fit an OLS to the complete sample to
obtain a preliminary estimate, then they trim observations that seem to be outlying and look
out for significant changes in the estimates. Any observation whose exclusion significantly
impacts the fit is rejected in the final fit. Accordingly, some authors have sometimes called
this method “resistant OLS”. We discuss it in full in the opening of Section 2 and illustrate
it on Brownlee’s stackloss data. This method, however, has shortcomings. Even if the
percent trimming is small, it can be very inefficient for Normally distributed data (Ruppert
& Carroll 1980).
In referring to the method of a preliminary estimate discussed above, Hampel (2001) wrote:
With “good” rejection rules which are able to find a sufficiently high fraction
of distant gross errors (which have a sufficiently high “breakdown point”, cf.
below), this is a viable possibility; but it typically loses at least 10-20 percent
efficiency compared with better robust methods for high-quality data (Hampel
1985). It is interesting to note that also subjective rejection has been investigated
empirically by means of a small Monte Carlo study with 5 subjectively rejecting
statisticians (Relles & Rogers 1977); the avoidable efficiency losses are again
about 10-20 percent (Hampel 1985). This seems ok for fairly high, but not for
highest standards.
Although it remains a bit fuzzy, another important concept in the study of robust methods
is that of efficiency. Efficiency in the context of point estimation is characterised by low
variance (or Mean Squared Error for biased estimators), and in interval estimation, shorter
confidence intervals can be considered as more efficient than broader ones (Hoaglin et al.
1983, p. 284). Huber (2009, p. 5) argues that a robust procedure should have reasonably good
efficiency at the assumed model. It is also important to have a procedure that is efficient
for alternative distributions. A study by Andrews et al. (1972) showed that the variance of
the 10% or 20% trimmed mean is never much larger than that of the sample mean even in
the case of the Normal distribution for which the mean is optimal and can be quite a lot
smaller when the underlying distribution is more heavy-tailed than the Normal distribution
(Rice 2007, p. 398).
Efficiency in testing hypotheses designates “good” power while the significance level remains
fixed (Hoaglin et al. 1983, p. 284). For instance, Miller (1980, p. 9) argues that although the
t-test is somewhat “robust for validity”, it is not “robust for efficiency”. This implies that as
much as the t-test will maintain the nominally stated significance level for small arbitrary
departures from model assumptions, there might well exist some specially designed tests
more powerful than the t-test when the underlying distribution is not Normal.
1.5 Conclusion
To sum up, the theory of robust estimation is not an idle one. The use of a robust technique
safeguards the investigator against being led astray in the case of unsatisfied assumptions.
Unfortunately, robustness usually comes at a cost, namely, that of compromised efficiency.
For example, the median will not be as badly affected by gross errors as the mean would,
although if the distribution turned out to be Normal and outlier-free, the use of the median
would lose efficiency against the mean at least asymptotically (see e.g. Birkes & Dodge
1993, p. 192). M-estimators introduced by Huber (1964), although not as easy to compute
as least squares estimates, say, counter this unfavourable trade-off between robustness to
outliers and efficiency at the assumed model, usually Normal (Rosenberger & Gasko 1983, p.
298). They are discussed last in the next section which opens with a review of Hoaglin and
Welsch's (1978) method of fitting an OLS model in the presence of outliers. Section 3 discusses
our research methodology, and Section 4, the penultimate section, presents and discusses the
results from a simulation study in which four regression models are compared in a variety of
scenarios.
2 Literature Study
2.1 Introduction
The ordinary least squares (OLS) regression model has always appealed, inter alia, for the
ease with which its parameters can be estimated, the ease with which standard errors of such
estimators can be estimated, and certain optimality properties that least squares estimators
possess when distributional assumptions are not grossly violated (e.g. Faraway 2002, p. 19).
For any error distribution, it has been shown that least squares estimators are best linear
unbiased estimators under the assumption of zero mean and constant variance of the error
distribution (Faraway 2002, p. 20; Wang & Chow 1994, p. 285). The literature on least squares
estimation is rich and the method is well understood.
As in any modelling exercise, in fitting an OLS regression model, validity of assumptions
should be assessed before any inferences are drawn. Diagnostic procedures are available to
identify any discrepancies. To assess the quality of fit, checks will be done on the residuals
with the hope of spotting anything untoward about their structure (see e.g. Goodall 1983);
and transformations can be employed to remedy some flawed assumptions. For example, a
single transformation might be found that repairs skewness and non-constant variance in a
data set (Kerns 2002, p. 258). Sometimes such a transformation will not be found, but least
squares estimates are somewhat robust to non-constant error variance and distributional
discrepancies, especially if the data are not saliently skewed (Miller 1997, pp. 6-7, 199, 208).
Unfortunately, the same cannot be said about outliers; a single outlying response can have
detrimental effects on least squares estimates, especially the slope (Miller 1997, p. 199).
Worse yet, usually no transformation will be found that repairs outliers, and the commonly
used Box-Cox class of transformations has been shown to be sensitive to outliers (Miller
1997, pp. 18, 201; Andrews 1971).
One way around this is trimming influential observations (see e.g. Hoaglin & Welsch 1978)
— after an initial ordinary least squares analysis is carried out, influential observations
will be identified using such criteria as Cook’s distance (see e.g. Chatterjee & Hadi 1988).
Then any identified influential observations can be rejected and inference based only on the
remaining scores. This approach, however, has been shown to have drawbacks. It has been
proved to be very inefficient if the error distribution is Gaussian (or close to Gaussian), or
unduly contaminated (Miller 1997; Ruppert & Carroll 1980). What is more, the process
of model fitting can be an involved exercise — it might take more than a handful of steps
before a satisfactory model is found. The labour of carrying out diagnostic checks, building
and rebuilding models, has motivated the development of robust regression models.
The area of robust regression is a new arrival — it was first introduced in the 1970’s as a
generalisation of robust estimation of a location parameter (Li 1985, p. 281). As already
pointed out in the introduction, we will call a procedure robust if inferences drawn from it do
not change substantially when the underlying assumptions are compromised. Specifically, we
will consider distributional robustness and robustness to outliers and influential observations.
In fact, the bulk of robust inference, at least in frequentist analysis, has been with regard to
outliers (Gelman et al. 2004). In what follows, we will use the words “robust” and “resistant”
interchangeably. But we will distinguish between an outlier and an influential observation.
The latter refers to an observation whose inclusion or exclusion has marked influence on
the fitted regression model (Kerns 2002, p. 259). An outlier will not always be influential
and vice versa (Kerns 2002, p. 259). In the ensuing subsections, we demonstrate the non-
robustness of ordinary least squares regression, in particular, to influential and discordant
or outlying observations.
2.1.1 Least squares estimation
We present here an ordinary least squares (multiple) regression model to recall some of its
properties. In particular, we attempt to reveal its sensitivity to outlying response variables
and high-leverage predictor variables. Also, we adopt a slightly different approach to the
parameter estimation process in an attempt to bridge the gap between OLS regression and
its robust counterparts. The multiple regression model can be written compactly in matrix
form as follows,
\[
y = X\beta + e, \tag{2.9}
\]
where the matrix X : n × p is of full rank⁴ and is defined as,
\[
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{pmatrix}. \tag{2.10}
\]
Throughout, we will assume that the design matrix X is non-stochastic⁵. The vector of
random errors e is given by e = (ε1, ε2, . . . , εn)ᵀ, and the vector of parameters β by
β = (β0, β1, . . . , βp−1)ᵀ. We will refer to the entries of the design matrix X as carriers,
and call the p-dimensional space X^p in which the row vectors of X lie the carrier space.
Outliers in the carrier space are said to give rise to high leverage, a notion we are yet to
formalise. The objective in least squares is to minimise the residual sum of squares
\[
Q(\beta) = \|y - X\beta\|^2 = \sum_{i=1}^{n} \rho(\varepsilon_i), \tag{2.11}
\]
with respect to β, where ‖ · ‖ denotes the Euclidean norm and ρ(εi) = εi². Stated
mathematically, the OLS solution will have to satisfy,
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2.
\]
⁴That the matrix X be of full rank guarantees that XᵀX is positive definite, or equivalently, that (XᵀX)⁻¹ exists.
⁵The design matrix will be stochastic if, like the dependent variables, the independent variables are measured with error.
The formulation of the minimisation problem in Equation 2.11 will be important when
we discuss robust alternatives to ordinary least squares estimation where the function ρ(·),
called an objective function, will be defined differently. We call the derivative of the objective
function, ψ(εi) = ρ′(εi), an influence function. On differentiating Equation 2.11 with respect
to β we obtain (up to a constant factor),
\[
\partial Q(\beta)/\partial \beta = \sum_{i=1}^{n} \psi(\varepsilon_i)\, x_i^{T} = \sum_{i=1}^{n} (y_i - x_i\beta)\, x_i^{T} = 0, \tag{2.12}
\]
where xi is the ith row vector of the design matrix X.
Equation 2.12 above is a disguised form of the p simultaneous equations which yield the
solution,
\[
\hat{\beta} = (X^{T}X)^{-1}X^{T}y.
\]
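As a minimal numerical sketch (simulated data, not from the study), the normal-equations solution can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# (X'X)^{-1} X'y, solved without forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```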
It can be shown that the least squares estimate of β coincides with the corresponding MLE
(or maximum likelihood estimate) if the error distribution is assumed to be Gaussian with
mean 0 and variance σ² (Gentle 2013, p. 484). To this end, we write down the likelihood of
β given the error variance,
\[
L(\beta \mid \sigma^2, y, X) = (2\pi\sigma^2)^{-n/2} \exp\{-(y - X\beta)^{T}(y - X\beta)/2\sigma^2\}. \tag{2.13}
\]
From the monotonicity of the log function, maximising Equation 2.13 is equivalent to
maximising the log-likelihood, which is given by,
\[
l_L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)^{T}(y - X\beta). \tag{2.14}
\]
It then becomes immediately apparent that maximising the expression in Equation 2.14
with respect to β is equivalent to minimising the expression (Gentle 2013, p. 484),
\[
Q(\beta) = (y - X\beta)^{T}(y - X\beta), \tag{2.15}
\]
which is the same expression we minimised in least squares estimation (see Equation 2.11).
In simple linear regression (i.e. the case of p = 2), the detection of outliers can simply
be done by visual inspection since the space in which an outlier can be located is at most
2-dimensional. For p > 2 however, this technique does not work. This is due in part to
the sparseness of data in p-dimensional space, so that if p > q, then an outlier in $\mathbb{R}^p$
is not necessarily one in $\mathbb{R}^q$. This has motivated the development of more sophisticated
methods of outlier detection. The discrepancy of a data point can be the result of an outlying
explanatory variable, an outlying dependent variable, or both. Therefore a satisfactory
influence diagnosis should examine both the carriers and the yi.
The matrix,
\[
H = X(X^{T}X)^{-1}X^{T}, \tag{2.16}
\]
is known in the literature as the hat matrix⁶ because “it puts a hat on y” as in the following
equation (Kerns 2010, p. 272),
\[
\hat{y} = Hy. \tag{2.17}
\]
The equation above expresses each fitted value ŷi as a linear combination of the observed
y values. If we denote the ijth entry of the hat matrix by hij, then Equation 2.17 can
be viewed as a compact formulation of the following set of equations (see Hoaglin & Welsch
1978),
\[
\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j, \quad i = 1, 2, \ldots, n. \tag{2.18}
\]
From Equation 2.18 it appears that hii, and hii alone, represents the amount of leverage
the observed value yi applies on the fitted value ŷi. Since the hat matrix depends only
on the explanatory variables and not on the dependent variable, this amount of leverage is
independent of the observed value yi. We also see that hij quantifies the amount of leverage
the yj (for j ≠ i) exert on ŷi. In fact, one can readily obtain,
\[
\frac{\partial \hat{y}_i}{\partial y_i} = h_{ii}, \qquad \text{and} \qquad \frac{\partial \hat{y}_i}{\partial y_j} = h_{ij},
\]
from Equation 2.18 (see Ravishanker & Dey 2002). It has been left to the reader to show
that the matrix H is both idempotent (i.e. H = H²) and symmetric. As a result, we can
express the diagonal entries of the hat matrix as,
\[
h_{ii} = \sum_{j=1}^{n} h_{ij}^2 = h_{ii}^2 + \sum_{j \ne i} h_{ij}^2, \tag{2.19}
\]
(see Hoaglin & Welsch 1978) from which we readily see that 0 ≤ hii ≤ 1, so that a value
of hii close to 1 should be flagged as high; one close to zero, on the other hand, should not
raise concern for the analyst. We also conclude from Equation 2.19 that whenever hii = 0
or hii = 1, then hij = 0 for all j ≠ i (see Hoaglin & Welsch 1978). It remains to explain
how large hii needs to be in order to be called “large”. There has not been much consensus
around this. Hoaglin and Welsch (1978) based their judgment on the average size of hii over
the data points in the regression, which can be shown to be p/n. Then from their experience
they suggested that any value of hii in excess of 2p/n should be indicative of high leverage
(Li 1985).
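A sketch of the leverage computation and the 2p/n rule follows (simulated data; the planted high-leverage row is our own illustration, not taken from the study).

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

rng = np.random.default_rng(2)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[0, 1:] = 8.0                      # push one row far out in the carrier space

h = leverages(X)
print(np.where(h > 2 * p / n)[0])   # cases flagged by the 2p/n rule of thumb
print(h.sum(), p)                   # trace(H) = p, so the average h_ii is p/n
```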
We adopt the same criterion because it turns out to be neither too liberal nor too conser-
vative; Huber (1981) suggested a rather liberal criterion, namely, that points with hii > 0.2
should be regarded as high-leverage points (Ravishanker & Dey 2002). In trying to investi-
gate the actual influence of, say, the ith case (xi, yi), on the parameter estimates, it is well to
consider the fit without that particular case and see by how much the estimated parameters
⁶This term was coined by John W. Tukey, who originated the idea of using the hat matrix as a diagnostic tool in regression problems.
change. In deference to common practice, vectors and matrices with the ith case deleted
will be subscripted with a parenthesised “i” so that, for example, the vector of parameter
estimators with the ith observation deleted will be denoted by ˆβ(i).
Below we give a heuristic derivation of the expression for the difference ˆβ − ˆβ(i). As a
prelude to such a derivation, we state without proof, a useful matrix identity known as the
Sherman-Morrison formula. For a non-singular matrix A and vectors u and v, we have
(Miller 1974),
\[
(A + uv^{T})^{-1} = A^{-1} - \frac{(A^{-1}u)(v^{T}A^{-1})}{1 + v^{T}A^{-1}u}.
\]
After a few substitutions we obtain,
\[
(X^{T}X - x_i^{T}x_i)^{-1} = (X^{T}X)^{-1} + \frac{(X^{T}X)^{-1}x_i^{T}x_i(X^{T}X)^{-1}}{1 - x_i(X^{T}X)^{-1}x_i^{T}}.
\]
Then by noting that,
\[
\hat{\beta}_{(i)} = (X^{T}X - x_i^{T}x_i)^{-1}(X^{T}y - x_i^{T}y_i),
\]
we get after some algebra,
\[
\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^{T}X)^{-1}x_i^{T}(y_i - x_i\hat{\beta})}{1 - x_i(X^{T}X)^{-1}x_i^{T}} = \frac{(X^{T}X)^{-1}x_i^{T}}{1 - h_{ii}}\, r_i. \tag{2.20}
\]
From Equation 2.20 above we see that an observation (xi, yi) will be influential if its leverage
hii is large, if its residual ri is large, or if both the leverage and the residual are large. We
have already made clear when a leverage value can be judged to be large — at least accord-
ing to Hoaglin and Welsch (1978). Now we need to discuss the issue of designating large
ri. Hoaglin and Welsch (1978) consider the so-called jackknifed or externally studentised
residuals, {ri* | i = 1, . . . , n}, instead of the ordinary residuals {ri | i = 1, . . . , n}.
The ith externally Studentised residual (Ravishanker & Dey 2002) is defined as
\[
r_i^{*} = \frac{r_i}{s_{(i)}\sqrt{1 - h_{ii}}}, \quad i = 1, 2, \ldots, n,
\]
where
\[
s_{(i)}^2 = \frac{\|y_{(i)} - \hat{y}_{(i)}\|^2}{n - p - 1}, \quad i = 1, 2, \ldots, n,
\]
can be shown to be an unbiased estimate of σ2
based on a sample of n − 1 observations if
the ordinary residuals ri are uncorrelated.
An observation with an externally studentised residual that is significant at a level of 10%
suggests that the observation be reckoned with (Hoaglin & Welsch 1978). So the investigator
will test for leverages in excess of 2p/n and any significantly large externally studentised
residuals. Hoaglin and Welsch (1978) suggest that neither criterion be used in isolation⁷.
Then Equation 2.20 will be applied to tell whether the observation did indeed turn out to
be influential. Any observation with undue influence will be excluded from the final model.
The example below illustrates this method.
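Before turning to the worked stackloss example, the following sketch (on simulated data) computes the leverages, the externally studentised residuals, and the change in the coefficients given by Equation 2.20 without refitting. The leave-one-out identity used for s²(i) below is a standard algebraic shortcut and is an assumption on our part rather than something derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[4] += 6.0                                    # plant one discordant response

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat                           # ordinary residuals
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverages h_ii

# Externally studentised residuals r_i* = r_i / (s_(i) * sqrt(1 - h_ii)),
# with s_(i)^2 obtained from the leave-one-out identity
# (n - p - 1) s_(i)^2 = (n - p) s^2 - r_i^2 / (1 - h_ii).
s2 = (r @ r) / (n - p)
s2_i = ((n - p) * s2 - r**2 / (1 - h)) / (n - p - 1)
r_star = r / np.sqrt(s2_i * (1 - h))

# Equation 2.20: change in beta-hat when case i is deleted, without refitting.
delta_beta = (XtX_inv @ X.T).T * (r / (1 - h))[:, None]

worst = np.argmax(np.abs(r_star))
print(worst, r_star[worst], delta_beta[worst])
```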
An illustrative Example Perhaps the most frequently used data set in the literature
of robust regression so far has been Brownlee’s stack loss data set. The data represents 21
days of operation of a plant that oxidises ammonia to nitric acid. The explanatory variables
x1, x2, and x3 and the response variable y (or the “stack loss”) are as follows (Li 1985):
x1 = airflow (which reflects the rate of operation of the plant),
x2 = temperature of the cooling water in the coils of the absorbing
tower for the nitric oxides,
x3 = concentration of nitric acid in the absorbing liquid [coded by
x3 = 10× (concentration in percent −50)],
and
y = the percent of the ingoing ammonia that is lost by escaping in the
unabsorbed nitric oxides (×10)
From the third column of Table 4 we see that only the 17th case has a leverage value
greater than 2p/n = .381. But before we discard it, we investigate the actual influence
it has on the estimated coefficients. That is, we calculate the difference, β̂ − β̂(17) =
(−5.6075, 0.0027, −0.0238, 0.0675)ᵀ. The corresponding change in standard error units,
(0.4714, 0.0202, 0.0647, 0.4316)ᵀ, is not at all appreciable. We conclude therefore that this
observation is not influential and hence should be retained. On the basis of studentised
residuals, the 4th and 21st observations are significant at a 10% significance level. The
respective parameter changes in standard error units are (0.1117, 0.3805, 0.5676, 0.0249)ᵀ
and (0.3181, 1.2859, 1.3007, 0.2878)ᵀ. The fourth observation does not appear to be so dis-
cordant as to warrant its elimination. However, the changes in β̂1 and β̂2 that result from
omitting the 21st case should warn us against including it. Overall, we conclude that the
only truly discrepant case is the 21st. Nevertheless, it turns out that observations 1, 3,
and 4 were discarded as transient states (see Li 1985, p. 317). The fitted model without
observations 1, 3, 4, and 21 is,
\[
\hat{y} = -37.652 + 0.798X_1 + 0.577X_2 - 0.067X_3,
\]
where X1 denotes air flow, X2 water temperature, and X3 acid concentration. In Figure 1
observations 1, 4, and 21 are flagged as influential. The latter observation falls well within
the Cook's distance contour.
⁷This should make intuitive sense since an observation can be well-behaved in the carrier space while its response variable assumes an unequivocally discrepant value and vice versa.
Observation    Air Flow,    Cooling Water Inlet    Acid Concentration,    Stack Loss,
  Number          x1         Temperature, x2               x3                  y
1 80 27 89 42
2 80 27 88 37
3 75 25 90 37
4 62 24 87 28
5 62 22 87 18
6 62 23 87 18
7 62 24 93 19
8 62 24 93 20
9 58 23 87 15
10 58 18 80 14
11 58 18 89 14
12 58 17 88 13
13 58 18 82 11
14 58 19 93 12
15 50 18 89 8
16 50 18 86 7
17 50 19 72 8
18 50 19 79 8
19 50 20 80 9
20 56 20 82 15
21 70 20 91 15
Table 3: Operational data of a plant for the oxidation of ammonia to nitric acid.
Remember that this was the only observation we concluded to
be truly influential. Figure 2 shows influence diagnostics of the model without observations
1, 3, 4, and 21. Counter to what one might have anticipated, the plot indicates that
even after the removal of the initially influential observations, some other previously non-
discrepant observations turn out to be outlying. Note the high leverage of observation 2
(h22 > 2 × 4/20 = 0.4).
We hasten to point out that influence is by no means the only nuisance to be diagnosed in
a regression problem; all assumptions on which the regression model is predicated should be
checked. Since our main focus in this paper is on robustness to outliers, checks other than
those that diagnose influence are better treated elsewhere.
It is not impossible that one of the two measures of influence we have considered here, (i.e.
hii and ri*), suggests that an observation be deleted, while the other suggests otherwise. In
such a case, it would be helpful to have an overall measure of influence that simultaneously
takes both measures into account. Such a procedure has already been developed and used in
regression type problems. Cook’s distance is one of the most commonly met overall criteria
(see Cook 1977). It is defined as (Ravishanker & Dey 2002, pp. 330; Miller 1997, p. 201),
\[
C_i = \frac{(\hat{\beta} - \hat{\beta}_{(-i)})^{T}(X^{T}X)(\hat{\beta} - \hat{\beta}_{(-i)})}{p\hat{\sigma}^2} \tag{2.21}
\]
\[
= \frac{h_{ii}}{p(1 - h_{ii})}\, r_i^2, \tag{2.22}
\]
where σ̂² = yᵀ(I − H)y/(n − p). From the above equation it is clear that the size of Ci is
affected by the magnitudes of both the residual ri and of hii/(1 − hii) (Miller 1997, p. 201).
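A direct implementation of Equation 2.21 on simulated data might look as follows; each case is deleted and the model refitted, which is transparent if not the most economical route.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance via Equation 2.21 (explicit leave-one-out refits)."""
    n, p = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        d = beta_hat - beta_i
        D[i] = d @ XtX @ d / (p * sigma2)
    return D

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = 1 + 2 * X[:, 1] + rng.normal(scale=0.4, size=20)
y[0] += 5.0                           # plant one influential case
print(cooks_distance(X, y).round(2))
```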
Another powerful (yet very simple to implement) method of identifying influential obser-
vations is the Jackknife (see e.g. Crawley 2013, pp. 481-483). To use the jackknife, each
observation i is left out in turn, and a statistic of interest, say θ(−i), is calculated. The
collection of all n pseudo-values {θ(−1), . . . , θ(−n)} can then be plotted on a histogram and
any influential subsets of the sample will be immediately visible. Figure 3 displays four
such histograms that result from applying the jackknife to the stackloss data. For example,
the top-right panel of Figure 3 shows that there is one observation without which the esti-
mate ˆβ1 goes beyond 0.5. This can easily be identified, although not from the histogram,
as the 21st observation. The bottom-left panel also shows the presence of two potentially
“harmful” observations. These are observations 4 and 21.
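The jackknife itself is a one-loop computation; the sketch below returns the matrix of leave-one-out coefficient estimates whose columns can then be plotted as histograms in the manner of Figure 3 (the plotting step is omitted).

```python
import numpy as np

def jackknife_coefs(X, y):
    """Refit the OLS model n times, leaving out one observation each time."""
    n = len(y)
    out = []
    for i in range(n):
        keep = np.arange(n) != i
        out.append(np.linalg.lstsq(X[keep], y[keep], rcond=None)[0])
    return np.array(out)   # row i holds beta-hat computed without case i
```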
2.2 Robust Regression
2.2.1 Introduction to Robust Regression
Just as ordinary least squares regression is a generalisation of least squares estimation, so is
robust regression a generalisation of robust estimation. Several robust estimation procedures
have been proposed in the literature. L-, R-, and M-estimation are far and away the most
important classes of robust estimation methods and most common estimators fall in at least
one of these categories (see e.g. Miller 1997, p. 28). Only the latter category of robust
estimators is discussed in this note. For a discussion on the other two classes, the reader is
referred to Bickel (1973).
We saw earlier that the method of least squares regression is supplemented by several di-
agnostic tools that identify outliers. After influential observations have been identified, an
OLS regression would be fitted only to a “clean” subset of the original data set. For ease
of reference at a later stage, let us call the OLS fit thus obtained a “resistant” OLS fit.
In contrast to resistant OLS, the idea behind the use of robust regression procedures, besides
perhaps the method of least absolute deviations which is more vulnerable to high-leverage
points than OLS (Bellio & Ventura 2005), is to accommodate the possibility of any aber-
rations in the data. This is effected by giving disproportionately less weight to discrepant
observations (Chen & Box 1990). Procedures that automatically adapt themselves to the
Obs.      ri       hii     √(1 − hii)    s(i)      ri*
1 3.235 .302 .836 3.200 1.209
2 −1.917 .318 .826 3.292 −.705
3 4.556 .175 .909 3.099 1.618
4 5.698 .129 .934 2.975 2.052
5 −1.712 .052 .974 3.314 −.531
6 −3.007 .077 .960 3.250 −.963
7 −2.389 .219 .884 3.274 −.826
8 −1.389 .219 .884 3.320 −.474
9 −3.144 .140 .927 3.234 −1.049
10 1.267 .200 .894 3.324 .426
11 2.636 .155 .919 3.265 .878
12 2.779 .217 .885 3.250 .967
13 −1.429 .158 .918 3.320 −.469
14 −.050 .206 .891 3.343 −.017
15 2.361 .190 .900 3.278 .801
16 .905 .131 .932 3.334 .291
17 −1.520 .412 .767 3.306 −.600
18 −.455 .161 .916 3.341 −.149
19 −.598 .175 .909 3.339 −.197
20 1.412 .080 .959 3.323 .443
21 −7.238 .285 .846 2.569 −3.330
Table 4: Summary table for influence diagnostics from the regression on the stackloss data.
underlying distribution are called adaptive (Huber 2009). It is the purpose of the remainder
of this section to review some common alternatives to least squares estimation.
2.2.2 The Independent Student-t Regression Model
Since one of the ways in which an outlier can occur in a data set is if the underlying dis-
tribution is heavy-tailed, it would seem reasonable to, for starters, assume that the random
errors come from a fat-tailed distribution like the double exponential distribution, or the
Cauchy distribution, in lieu of the less kurtic Normal distribution. Then we could form a
likelihood function and estimate the values of the parameters. In the next two paragraphs
we assume that the disturbances arise from the Student-t distribution, of which the Cauchy
distribution is a special case. The degrees-of-freedom parameter, which we will denote by
ν, will reflect the excess “mass” under the tails of the distribution from which the random
errors arise (Gelman et al. 2004).
Usually in practice, it will not be an easy task to estimate the degrees of freedom parame-
ter; estimates from a heavy-tailed likelihood are generally intractable and computationally
demanding (Fonseca et al. 2008; Gelman et al. 2004). So, in assuming a kurtic error distri-
bution, we should be ready to lose mathematical convenience in exchange for more stable
parameter estimates.

Figure 1: Regression diagnostic plot for the OLS model with all observations.
The density of the Student-t distribution with location and scale parameters ξ and σ², and
ν degrees of freedom is given by
\[
p(y \mid \xi, \sigma^2, \nu) = \frac{\Gamma\{(\nu+1)/2\}}{\Gamma(\nu/2)\sqrt{\pi\nu\sigma^2}}\left\{1 + (y - \xi)^2/\nu\sigma^2\right\}^{-(\nu+1)/2}, \quad -\infty < y < \infty. \tag{2.23}
\]
If a random variable y has the density above, it is customary to write y ∼ tν(ξ, σ²). Consider
the linear model
\[
y_i = \beta_0 + \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i, \quad i = 1, \ldots, n, \tag{2.24}
\]
where the density of the (independent) random errors is given in Equation 2.23 with location
parameter ξ = 0, scale parameter σ, and degrees-of-freedom parameter ν. We assume the
following noninformative priors for the parameters σ and β (Geweke 1993),
\[
p_1(\beta) \propto \text{constant}, \qquad \text{and} \qquad p_2(\sigma) \propto \sigma^{-1}.
\]
Figure 2: Regression diagnostic plot for the OLS model with all but the first, third, fourth, and twenty-first observations.
Additionally, we assume that σ and β are independent a priori: that is, p1(β, σ) ∝ σ⁻¹. The
model in Equation 2.24 is referred to by Geweke (1993) as the independent Student-t linear
model. Geweke (1993) shows that the independent Student-t linear model is equivalent to
an appropriate scale mixture of Normals. Without attempting to prove the equivalence, this
Gaussian mixture is given next⁸. For i = 1, . . . , n, assume
\[
y_i = \beta_0 + \sum_{j=1}^{p} X_{ij}\beta_j + \varepsilon_i, \tag{2.25}
\]
where the random errors are independent and εi ∼ N(0, σ²ωi). Alternatively, in vector
notation, let
\[
y = X\beta + \varepsilon,
\]
where the covariance matrix of the random errors is var(ε) = σ²Ω and Ω = diag(ω1, . . . , ωn).
That is, for i = 1, . . . , n,
\[
p(y_i \mid \beta, \sigma, \omega) = \frac{1}{\sqrt{2\pi\sigma^2\omega_i}} \exp\left\{-\frac{1}{2\omega_i}\left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right\}, \quad -\infty < y_i < \infty. \tag{2.26}
\]
⁸Derivations herein are modified versions of Geweke's (1993): slightly different mathematical tools are used.
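The equivalence can be checked informally by simulation: drawing ωi from the scaled inverse chi-square implied by ν/ωi ∼ χ²ν, and then a Normal error with variance σ²ωi, should reproduce a tν(0, σ²) error distribution. The sketch below (with arbitrarily chosen ν and σ) compares quantiles of the two sampling schemes.

```python
import numpy as np

rng = np.random.default_rng(5)
nu, sigma, N = 5.0, 2.0, 200_000

# Latent scales: nu / omega_i ~ chi-square(nu); then a Normal error with
# variance sigma^2 * omega_i. Marginally this should be t_nu(0, sigma^2).
omega = nu / rng.chisquare(nu, size=N)
eps_mix = rng.normal(scale=sigma * np.sqrt(omega))

eps_t = sigma * rng.standard_t(nu, size=N)     # direct Student-t draws

# Compare a few quantiles of the two samples; they should agree closely.
qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print(np.quantile(eps_mix, qs).round(2), np.quantile(eps_t, qs).round(2))
```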
Figure 3: Jackknifed parameter estimates for the stackloss data (histograms of the estimates of β0, β1, β2, and β3).
Since the random errors εi are independent we get the following likelihood function,
\[
L(\beta, \sigma, \omega \mid y, X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2\omega_i}} \exp\left\{-\frac{1}{2\omega_i}\left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right\}.
\]
Up to a constant of proportionality,
\[
L(\beta, \sigma, \omega \mid y, X) \propto \sigma^{-n} \prod_{i=1}^{n} \omega_i^{-1/2} \exp\left\{-\frac{1}{2\omega_i}\left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right\}.
\]
Assume that a priori the parameters β, σ, and ω = (ω1, . . . , ωn)ᵀ are mutually independent
and that, as before, p1(β)p2(σ) ∝ σ⁻¹. As originally suggested by Lindley (1971), assume
that a priori the ωi are independent and that
\[
\nu/\omega_i \sim \chi^2_{\nu}.
\]
To obtain explicitly the density of the ωi, the reader will recall that the density of an inverted
χ² variate y with ν degrees of freedom is given by,
\[
p_Y(y) = \frac{1}{\Gamma(\nu/2)2^{\nu/2}}\left(\frac{1}{y}\right)^{\nu/2+1} e^{-1/2y}, \quad y > 0. \tag{2.27}
\]
From the transformation y = g⁻¹(ωi) = ωi/ν we obtain,
\[
p(\omega_i) = p_Y(g^{-1}(\omega_i))\left|\frac{d}{d\omega_i} g^{-1}(\omega_i)\right|
= \frac{1}{\Gamma(\nu/2)2^{\nu/2}}\,\frac{1}{\nu}\left(\frac{\nu}{\omega_i}\right)^{\nu/2+1} e^{-\nu/2\omega_i}.
\]
Then from the independence of the ωi the prior of the random vector ω is given by,
\[
p_3(\omega) = \prod_{i=1}^{n} \frac{1}{\Gamma(\nu/2)2^{\nu/2}}\,\frac{1}{\nu}\left(\frac{\nu}{\omega_i}\right)^{\nu/2+1} e^{-\nu/2\omega_i}
= (\nu/2)^{n\nu/2}\{\Gamma(\nu/2)\}^{-n} \prod_{i=1}^{n} \omega_i^{-(\nu+2)/2} e^{-\nu/2\omega_i}
\propto \prod_{i=1}^{n} \omega_i^{-(\nu+2)/2} e^{-\nu/2\omega_i}.
\]
On multiplying the likelihood and the priors, the joint posterior density of β, σ, and ω is
given by,
\[
p(\beta, \sigma, \omega \mid y, X) = p_1(\beta)\,p_2(\sigma)\,p_3(\omega) \times L(\beta, \sigma, \omega \mid y, X)
\propto \sigma^{-(n+1)} \prod_{i=1}^{n} \omega_i^{-(\nu+3)/2} \exp\left\{-\frac{1}{2\omega_i}\left[\nu + \left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right]\right\}. \tag{2.28}
\]
We will sample from the posterior distribution above using the Gibbs Sampler as outlined in
Geweke (1993). To this end, we will need to have an expression for the conditional posterior
distribution of each of the parameters. It will soon appear that these are quite tractable.
But the assumption of unknown degrees of freedom will occasion a slight complication. The
reader is urged to note that we obtained the posterior in Equation 2.28 by assuming, at least
tacitly, that the degrees of freedom parameter ν was known. Fonseca et al. (2008) state
that the robustness of the analysis using the Student-t distribution is directly related to the
number of degrees of freedom ν. They further emphasise the difficulty of approximating
the parameter ν. We will bring the assumption of unknown degrees of freedom back in at
a later step. When we finally do so, the expressions of the conditional distributions of the
other parameters will remain unchanged. We proceed to find the posterior conditionals.
If we rewrite Equation 2.28 as,
\[
p(\beta, \sigma, \omega \mid y, X) \propto \sigma^{-(n+1)} \prod_{i=1}^{n} \omega_i^{-(\nu+3)/2} e^{-\nu/2\omega_i} \times \exp\left\{-\sum_{i=1}^{n} \frac{1}{2\omega_i}\left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right\}, \tag{2.29}
\]
and note that,
\[
\exp\left\{-\sum_{i=1}^{n} \frac{1}{2\omega_i}\left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right\}
= \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^{T}\Omega^{-1}(y - X\beta)\right\}
\propto \exp\left\{-\frac{1}{2\sigma^2}(\beta - \Lambda)^{T}(X^{T}\Omega^{-1}X)(\beta - \Lambda)\right\}.
\]
Hence the posterior distribution of β conditional on the rest of the parameters σ, and ω is
(Geweke 1993),
\[
\beta \mid \sigma, \omega \sim N_p\{\Lambda,\ \sigma^2 (X^{T}\Omega^{-1}X)^{-1}\},
\]
where the mean vector is given by Λ = (XᵀΩ⁻¹X)⁻¹XᵀΩ⁻¹y.
In order to put the conditional posterior distribution of the variance σ2
in a form that
corresponds to a standard distribution, let ui = yi − xiβ and note that
\[
p(\sigma \mid \beta, \omega, y, X) \propto \sigma^{-(n+1)} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} u_i^2/\omega_i\right\}. \tag{2.30}
\]
By making use of a transformation, the variance parametrisation of the density in Equation 2.30
easily works out to,
\[
p(\sigma^2 \mid \beta, \omega, y, X) \propto \left(\frac{1}{\sigma^2}\right)^{(n+2)/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} u_i^2/\omega_i\right\}.
\]
Now let us make the following astute transformation,
\[
\phi = \frac{1}{\sigma^2}\sum_{i=1}^{n} u_i^2/\omega_i,
\]
the inverse of which is,
\[
g^{-1}(\phi) = \phi^{-1}\sum_{i=1}^{n} u_i^2/\omega_i.
\]
Then the absolute value of the Jacobian of the inverse transformation takes the following form,
\[
\left|\frac{d}{d\phi} g^{-1}(\phi)\right| = \phi^{-2}\sum_{i=1}^{n} u_i^2/\omega_i \propto \phi^{-2}.
\]
Finally, putting all components together yields the conditional posterior density of φ,
\[
p(\phi \mid \beta, \omega, y, X) = p(g^{-1}(\phi) \mid \beta, \omega, y, X)\left|\frac{d}{d\phi} g^{-1}(\phi)\right| \propto \phi^{n/2-1} e^{-\phi/2}. \tag{2.31}
\]
Expression 2.31 above is a kernel of a χ²ₙ random variable. We have thus proved that,
\[
\frac{1}{\sigma^2}\sum_{i=1}^{n}\left(u_i^2/\omega_i\right)\ \Big|\ \beta, \omega \ \sim\ \chi^2_{n}.
\]
Next let us consider the conditional posterior distribution of ω. Conditional on β and σ
(Geweke 1993), the ωi are independent with posterior density,
\[
p(\omega_i \mid \beta, \sigma, y, X) \propto \omega_i^{-(\nu+3)/2} \exp\left\{-\frac{1}{2\omega_i}\left[\nu + \left(\frac{y_i - x_i\beta}{\sigma}\right)^2\right]\right\} \tag{2.32}
\]
\[
= \omega_i^{-(\nu+3)/2} e^{-(\nu + u_i^2/\sigma^2)/2\omega_i}. \tag{2.33}
\]
Now let (see Geweke 1993),
\[
\psi = (\nu + u_i^2/\sigma^2)/\omega_i,
\]
and consider the inverse transformation,
\[
g^{-1}(\psi) = (\nu + u_i^2/\sigma^2)/\psi.
\]
It follows then that the conditional posterior density of ψ is,
\[
p(\psi \mid \beta, \sigma, y, X) = p(g^{-1}(\psi) \mid \beta, \sigma, y, X)\left|\frac{d}{d\psi} g^{-1}(\psi)\right|
\propto \left\{\psi/(\nu + u_i^2/\sigma^2)\right\}^{(\nu+3)/2}\, \psi^{-2}\, e^{-\psi/2}
\propto \psi^{(\nu+1)/2 - 1} e^{-\psi/2}.
\]
We recognise the last expression as a kernel of a χ²_{ν+1} density. Hence we have shown that, a posteriori,
\[
\psi \mid \beta, \sigma \sim \chi^2_{\nu+1}.
\]
Let us assume an exponential prior for ν, that is,

p(ν) = λ e^(−λν),   ν > 0.

Multiplying this prior by the ν-dependent factors of the joint posterior (Equation 2.28, with the normalising constant of p3(ω) restored, since it depends on ν) gives

p(ν|β, σ, ω, y, X) ∝ λ e^(−λν) (ν/2)^(nν/2) {Γ(ν/2)}^(−n) ∏_{j=1}^{n} ωj^(−(ν+3)/2) exp{ −∑_{i=1}^{n} ν ωi⁻¹/2 }
                  ∝ (ν/2)^(nν/2) {Γ(ν/2)}^(−n) e^(−ην),

where

η = (1/2) ∑_{i=1}^{n} (log ωi + ωi⁻¹) + λ.

For computational details and proofs of convergence of the Gibbs sampler, the reader is referred to Geweke (1993).
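Putting these pieces together, the Gibbs sampler cycles through the conditional draws derived above. The following is only a minimal illustrative sketch, not the code used in this study: it applies the χ²(n+2) draw for the scale given in Equation 2.31 and the χ²(ν+1) draw for each ωi, and, as a simplification, updates ν on a discrete grid rather than by the device used by Geweke (1993); all names, default values, and the grid are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def student_t_gibbs(y, X, n_iter=2000, lam=0.1, rng=None):
    """Gibbs sampler for the independent Student-t linear model (sketch)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS starting value
    sigma2 = np.var(y - X @ beta)
    omega = np.ones(n)
    nu = 10.0
    nu_grid = np.arange(1.0, 61.0)                   # assumed grid for nu
    draws = {"beta": [], "sigma2": [], "nu": []}
    for _ in range(n_iter):
        # beta | sigma, omega ~ N_p(Lambda, sigma^2 (X' Omega^-1 X)^-1)
        w = 1.0 / omega
        XtWX = X.T @ (w[:, None] * X)
        Lam = np.linalg.solve(XtWX, X.T @ (w * y))
        beta = rng.multivariate_normal(Lam, sigma2 * np.linalg.inv(XtWX))
        # sigma^2: (1/sigma^2) sum u_i^2/omega_i | beta, omega ~ chi^2_{n+2} (Eq. 2.31)
        u = y - X @ beta
        sigma2 = np.sum(u**2 / omega) / rng.chisquare(n + 2)
        # omega_i: (nu + u_i^2/sigma^2)/omega_i | beta, sigma ~ chi^2_{nu+1}
        omega = (nu + u**2 / sigma2) / rng.chisquare(nu + 1, size=n)
        # nu: discretised draw from p(nu|.) proportional to (nu/2)^{n nu/2} Gamma(nu/2)^{-n} e^{-eta nu}
        eta = 0.5 * np.sum(np.log(omega) + 1.0 / omega) + lam
        logp = (n * nu_grid / 2) * np.log(nu_grid / 2) - n * gammaln(nu_grid / 2) - eta * nu_grid
        prob = np.exp(logp - logp.max())
        nu = rng.choice(nu_grid, p=prob / prob.sum())
        draws["beta"].append(beta)
        draws["sigma2"].append(sigma2)
        draws["nu"].append(nu)
    return {k: np.array(v) for k, v in draws.items()}
```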
2.2.3 Objective treatment of the Student-t linear model
The last few paragraphs treated the independent Student-t linear model by assuming a
proper prior for the degrees of freedom. The appropriateness of the selection of that par-
ticular prior was not emphasised; it was merely chosen to obtain a proper posterior. It
turns out that this choice has shortcomings: Fonseca et al. (2008) show that it is too informative and has undue influence on the posterior inference. They further show that the analysis of the Student-t linear model based on the exponential prior g(ν) = λ e^(−λν) for the degrees of freedom is strongly dependent on the value of λ, which Geweke (1993) suggested should be chosen based on prior information about the problem at hand.
To counter this undesirable subjectivity, in lieu of assuming an exponential prior for the
degrees of freedom, the next few paragraphs present an objective treatment of the Student-
t linear model using the Jeffreys-rule and the independence Jeffreys priors as originally
proposed by Fonseca et al. (2008). We assume the same model given in Equation 2.24
with similar distributional assumptions on the random errors. That is the random errors
are independent and identically distributed according to the Student-t distribution with
location parameter zero, scale parameter σ and ν degrees of freedom. Then for the parameter
θ = (β, σ, ν) ∈ ℝᵖ × (0, ∞)², we form the following likelihood function:

L(β, σ, ν|y, X) = ∏_{i=1}^{n} [ Γ{(ν + 1)/2} / ( Γ(ν/2) √(πνσ²) ) ] { 1 + (yi − xiβ)²/(νσ²) }^(−(ν+1)/2)
                = [Γ((ν + 1)/2)]ⁿ / [ Γ(ν/2)ⁿ (πνσ²)^(n/2) ] ∏_{i=1}^{n} { 1 + (yi − xiβ)²/(νσ²) }^(−(ν+1)/2)   (2.34)
                = [Γ((ν + 1)/2)]ⁿ ν^(nν/2) / [ Γ(ν/2)ⁿ π^(n/2) σⁿ ] ∏_{i=1}^{n} { ν + ((yi − xiβ)/σ)² }^(−(ν+1)/2).
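As a small illustration (not part of the original text), the log of this likelihood is easy to evaluate numerically, which is useful, for example, when profiling over ν; gammaln is used for stability and all names are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def student_t_loglik(beta, sigma, nu, y, X):
    """Log of the likelihood in Equation 2.34 for given (beta, sigma, nu)."""
    n = len(y)
    u = y - X @ beta
    return (n * (gammaln((nu + 1) / 2) - gammaln(nu / 2))
            - (n / 2) * np.log(np.pi * nu * sigma**2)
            - ((nu + 1) / 2) * np.log1p(u**2 / (nu * sigma**2)).sum())
```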
For the vector of parameters θ = (θ1, θ2, θ3) = (β, σ, ν), the entry in the ith row and jth
column of the Fisher information matrix is given by
{F(θ)}_ij = E[ −∂² log L(θ|y, X) / (∂θi ∂θj) ],
where the expectation is to be taken with respect to the distribution of y. Derivations of
the independence Jeffreys and the Jeffreys prior for θ = (β, σ, ν) are given in Fonseca et al.
(2008). The authors show that both priors belong to a class of improper prior distributions
given by
π(θ) ∝ π(ν)/σᵃ,   (2.35)
where a ∈ R is a hyperparameter and π(ν) is the marginal prior of ν. They also prove that
the independence Jeffreys prior and the Jeffreys prior for θ, which they denote by πI
(β, σ, ν)
and πJ
(β, σ, ν), are of the form 2.35 with, respectively,
a = 1,   πI(ν) ∝ ( ν/(ν + 3) )^(1/2) { ψ′(ν/2) − ψ′((ν + 1)/2) − 2(ν + 3)/( ν(ν + 1)² ) }^(1/2),

and

a = p + 1,   πJ(ν) ∝ πI(ν) ( (ν + 1)/(ν + 3) )^(p/2),

where ψ′(·) denotes the trigamma function, the derivative of the digamma function ψ(·).
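For illustration (a sketch, not taken from the thesis), both priors can be evaluated up to proportionality with the trigamma function from SciPy; the function names are assumptions.

```python
import numpy as np
from scipy.special import polygamma

def trigamma(x):
    return polygamma(1, x)

def jeffreys_independence_nu(nu):
    """Unnormalised independence Jeffreys prior pi^I(nu), nu > 0 (Fonseca et al. 2008)."""
    nu = np.asarray(nu, dtype=float)
    inner = trigamma(nu / 2) - trigamma((nu + 1) / 2) - 2 * (nu + 3) / (nu * (nu + 1) ** 2)
    return np.sqrt(nu / (nu + 3)) * np.sqrt(inner)

def jeffreys_rule_nu(nu, p):
    """Unnormalised Jeffreys-rule prior pi^J(nu) for a model with p regression coefficients."""
    nu = np.asarray(nu, dtype=float)
    return jeffreys_independence_nu(nu) * ((nu + 1) / (nu + 3)) ** (p / 2)
```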
2.2.4 Least Absolute Deviations
If the Normality of the error distribution is suspect (e.g. if the error distribution is more
kurtic than the Normal distribution), then a common robust alternative to ordinary least
squares regression is the Least Absolute Deviations (LAD) regression (Narula & Wellington
1990, p. 130). The method of least absolute deviations is conceptually simple, but its computational aspect is not mathematically neat. Unlike in least squares estimation, where the
location parameter is estimated by the sample mean, in least absolute deviations estima-
tion, the sample median plays such a role. This will become clear shortly. Using the sample
median instead of the sample mean for location parameter estimation has the advantage of
robustness to outliers — in any sample, only one or two of the central values are used to
obtain the location parameter estimate (Rosenberger & Gasko 1983, p. 302). Furthermore,
the value of the sample median remains unchanged if the magnitude of a datum is changed
in such a way that it remains on the same side of the sample median (Narula & Wellington
1990, p. 130).
Narula and Wellington (1985) show that this desirable property is inherited by LAD re-
gression. Specifically, they show that as long as the response value of an observation lies
on the same side of the fitted LAD line (in simple regression), then the model will remain
unchanged. LAD regression is also known, among other names, as Minimum Sum of Ab-
solute Errors (MSAE), Least Absolute Values (LAV), or L1-regression. The latter name
implies that LAD regression is a special case of Lp-regression with p = 1, or that the LAD
regression criterion is to minimise the L1-norm (or the sum of the absolute values) of the
residuals,
Q(β) = ∑_{i=1}^{n} |yi − xiβ| = ∑_{i=1}^{n} ρ(εi),   (2.36)

where ρ(·) = | · |. Differentiating the expression in Equation 2.36 and setting the derivative equal to zero yields an equivalent formulation,

∂Q(β)/∂β = ∑_{i=1}^{n} ψ(yi − xiβ) = ∑_{i=1}^{n} sgn(yi − xiβ) = 0,   (2.37)
where

sgn(x) = +1 if x > 0,   0 if x = 0,   −1 if x < 0.
A value of β that satisfies Equation 2.37 is called the least absolute value estimate. We will
simply denote it β̂. Note that, since the absolute value function |x| is not differentiable at x = 0, Equation 2.37 is not strictly correct: it implicitly sets the "derivative" of |x| at x = 0 to zero. Such an inconsistency has been accepted as reasonable in the literature as it puts the LAD ψ-function in agreement with other ψ-functions (Goodall 1983, p. 343).
One common method for solving the minimisation problem presented in Equation 2.37 is to transform it into a linear programming problem (Pynnönen 1994). Another is to use generalised iterative least squares estimation. The problem always has a solution, but the solution is not always unique. In robustness studies, one can distinguish between two
broad categories of distributions (Green 1976, cited in Hawkins 1980, p. 1), namely, those
that have fat tails and those with thin tails. The former are referred to as outlier-prone
and include the double exponential (Laplace) distribution; the latter are referred to as
outlier-resistant. The Normal distribution, which has kurtosis γ2 = 3, is an example of
an outlier-resistant distribution. The double exponential distribution on the other hand
has γ2 = 6. It is interesting to see that if the disturbances are independent and identically
distributed according to the double exponential distribution, then the LAD estimator is also
the maximum likelihood estimator. In order to formulate the linear programming model for
the LAD estimation problem, define the following variables,
d⁺ᵢ = εi if εi > 0, and 0 if εi ≤ 0,

and

d⁻ᵢ = −εi if εi ≤ 0, and 0 if εi > 0.

Then the linear programming model to find the parameter estimates in the linear model minimises the function

∑_{i=1}^{n} d⁺ᵢ + ∑_{i=1}^{n} d⁻ᵢ

subject to

β0 + ∑_{j=1}^{p} Xij βj + d⁺ᵢ − d⁻ᵢ = yi,   i = 1, . . . , n,
d⁺ᵢ ≥ 0,   d⁻ᵢ ≥ 0,

while the βj are unrestricted in sign. An LAD regression will fit at least p observations
exactly (i.e. at least p residuals will be zero) where p is the number of parameters including
the intercept (Ravishanker & Dey 2002, p. 340). Hence in a simple regression problem one
need only determine two observations through which the LAD line passes in order to fit the
entire regression. In the theory of least absolute deviations, it is customary to refer to those
observations that have been fitted exactly as defining observations (Narula & Wellington
2002). An observation that is not defining is simply called non-defining.
The diagnosis of influence in LAD regression is usually carried out differently from the
way it is performed in least squares (see Narula & Wellington 2002). Instead of deleting
an observation to determine its influence on the fit, as is done in OLS regression, Narula
and Wellington (2002) find an interval on each value of the predictor variable that leaves
the fitted LAD regression unchanged. The interval for the value of the response variable
that will leave the fit unaltered is given by [ˆyi, ∞) or (−∞, ˆyi] according as the residual
corresponding to the ith fitted value ˆyi is positive or negative (Narula & Wellington 2002).
Solving the LAD/MSAE regression problem via linear programming methods is just one of
many methods present in the literature. Birkes and Dodge (1993) present and justify an
alternative algorithm to carry out (simple) least absolute deviations regression. The next
two paragraphs give an outline of the algorithm.
The reader will recall that an MSAE regression will fit at least p observations exactly. Hence,
in the special case of simple regression (i.e. when p = 2) the regression line will pass through
at least two observations. First choose an initial point, (x1, y1) say. Then, among all lines
that pass through it, we seek the best, according to some criterion yet to be defined. This
line will have to pass through another data point (x2, y2), say. Now we seek among all lines
that pass through (x2, y2), the best line, which in turn passes through another point, say
(x3, y3). The process is continued until two consecutive iterations yield the same line L. At
this point the algorithm has reached convergence and line L is the LAD/MSAE regression
line.
To find the best line that passes through a data point (x0, y0), we calculate the slope (yi − y0)/(xi − x0) of the line passing through (x0, y0) and (xi, yi). Points for which xi = x0 can be ignored. Then rearrange the observations so that (y1 − y0)/(x1 − x0) ≤ (y2 − y0)/(x2 − x0) ≤ · · · ≤ (yn − y0)/(xn − x0). Find the index k for which the cumulative sum ∑_{i=1}^{k} |xi − x0| first exceeds T/2, where T = ∑_{i=1}^{n} |xi − x0|. The best line passing through (x0, y0) then has slope and intercept estimates

β̂1 = (yk − y0)/(xk − x0)   and   β̂0 = y0 − β̂1 x0,

respectively.
Before we continue to a multiple-regression extension of the algorithm outlined above, we apply it, for illustrative purposes, to the data presented in Table 5. Originally from Brownlee (1960), Table 5 reports a study on the stopping distance of an automobile as a function of velocity on a certain road (see also Rice 2007, p. 599).

To start the algorithm, we choose a data point (x0, y0), say (20.5, 15.4), i.e. the first observation. Note that this choice is totally arbitrary. Then form the slopes (yi − 15.4)/(xi − 20.5) and arrange them in increasing order. These are given in the second column of Table 6. Next we calculate T = ∑_i |xi − 20.5| = 95.6 and look for the observation at which the cumulative sum of |xi − 20.5| first exceeds T/2 = 47.8. From Table 6 this is the fourth observation. We then do the same with this observation as we did with (20.5, 15.4). That is, we compute the slopes (yi − 73.1)/(xi − 40.5), rearrange them in increasing order, calculate the sum T = ∑_i |xi − 40.5| = 75.6, and look for the observation at which the cumulative sum of |xi − 40.5| first exceeds T/2 = 37.8. This points to the second observation. For each iteration of the algorithm, one forms a table similar to Table 6.

Obs.   Velocity (mi/h)   Stopping Distance (ft)
1      20.5              15.4
2      20.5              13.3
3      30.5              33.9
4      40.5              73.1
5      48.8              113.0
6      57.8              142.6
Table 5: Data on stopping distance as a function of velocity.

Obs.   (yi − 15.4)/(xi − 20.5)   |xi − 20.5|   Cumulative sum of |xi − 20.5|
6      1.331                     37.3          37.3
2      (xi = x0)                 0.0           37.3
3      1.850                     10.0          47.3
4      2.885                     20.0          67.3
5      3.449                     28.3          95.6
Table 6: Summary of the calculations of the LAD regression of stopping distance on velocity.
The next two iterations point to the sixth and the second observations, respectively. But
since the second observation was reached just a step before and just a step after the sixth
observation was reached, the algorithm converges. Hence the defining observations are the
sixth and the second observations. The estimated parameters are β̂1 = (142.6 − 13.3)/(57.8 − 20.5) = 3.47 and β̂0 = 13.3 − 3.47 × 20.5 = −57.76. Figure 4 depicts the fits of both the OLS regression and the LAD regression. We hasten to point out that this example was included only for illustrative purposes; it should not be presumed that the LAD model we just fitted is in any way superior to the OLS model.
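As a cross-check, the linear programming formulation of the LAD problem given earlier can be solved with an off-the-shelf LP routine. The sketch below uses scipy.optimize.linprog (a tooling assumption, not the method used in the thesis) and, applied to the Table 5 data, reproduces the fit just obtained up to the usual non-uniqueness of LAD solutions.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """LAD/MSAE estimates via the LP model: minimise sum(d+) + sum(d-)."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])    # cost on d+ and d- only
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])         # X beta + d+ - d- = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)  # beta free, d+ and d- nonnegative
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Stopping-distance example (Table 5); the intercept is handled by a column of ones.
velocity = np.array([20.5, 20.5, 30.5, 40.5, 48.8, 57.8])
distance = np.array([15.4, 13.3, 33.9, 73.1, 113.0, 142.6])
X = np.column_stack([np.ones_like(velocity), velocity])
print(lad_fit(X, distance))   # approximately (-57.8, 3.47)
```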
Birkes and Dodge (1993) present a modified version of Barrodale and Roberts’ (1974) al-
gorithm for fitting a multiple least absolute deviations regression based on the simplex
algorithm. For reasons of space, we do not discuss it here but we will apply it. Table 7
reports results from a small study in which the relationship between catheter length and
2 other variables, namely, height and weight was investigated (Rice 2007, pp. 581). In the
table catheter length is represented by distance to pulmonary artery. See the reference cited
for more details. Consider fitting a linear multiple regression model,
yi = β0 + β1x1i + β2x2i + εi, i = 1, . . . , 12,
to the data using the least absolute deviations method. The resulting fit is, ˆyi = 31.591 −
.178x1i + .326x2i, i = 1, . . . , 12. But how resistant is this fit? We consider this sensitivity
question next.
[Figure 4: Plot of the OLS and LAD regressions of stopping distance on velocity. The plot shows stopping distance (ft) against velocity (miles per hour) with the fitted OLS and LAD lines overlaid.]

Where there is measurement, there is certain to be some error one way or another. Suppose
some observation in the catheterisation data was recorded with error. Let us assume that
the actual value of x41 was 35.9, say, but was incorrectly recorded as 39.5. Would this bring
our fit into question? Or stated more generally, what range of values can x41 take (in the
vicinity of 39.5) without affecting the parameter estimates? It can be shown that as long as
x41 assumes any value in the closed interval [31.72, 39.77], the LAD fit remains unaffected.
This is quite a desirable property the method of least absolute deviations inherits from the
median. Intervals of values of explanatory variables for non-defining observations on which
the LAD or MSAE fit on the catheterisation data we obtained earlier does not change are
depicted in Figure 5. The next few paragraphs give only a vague idea of how they were
calculated, with the hope that the interested reader will refer to Narula and Wellington
(2002).
In treating this example we will closely mimic the style of presentation of the originators
of the procedure, namely, Narula and Wellington (2002), and the reader interested in the
full details of the method should see the reference just cited since not all calculations will
be presented in this note. We would like to find intervals about predictor variables (and
also the response variable) for non-defining observations on which our fit is resistant. The
case of the response variable is trivial — the LAD fit will remain unchanged, it will be
recalled, as long as the new value of the response variable is greater than or less than the
fitted value according as the corresponding observation has a positive or negative residual.
Table 8 presents these intervals.
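The rule for the response variable is simple enough to express directly in code; the sketch below (an illustration, not the thesis's implementation) returns each observation's admissible interval from its fitted value and residual.

```python
def response_intervals(fitted, residuals, tol=1e-8):
    """Admissible interval for each response value, following the rule of Narula & Wellington (2002)."""
    intervals = []
    for f, r in zip(fitted, residuals):
        if abs(r) < tol:
            intervals.append("defining observation")
        elif r > 0:
            intervals.append((f, float("inf")))     # [y_hat_i, infinity)
        else:
            intervals.append((float("-inf"), f))    # (-infinity, y_hat_i]
    return intervals
```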
Obs. number i   Height (in.)   Weight (lb)   Distance to Pulmonary Artery (cm)
1               42.8           40.0          37.0
2               63.5           93.5          49.5
3               37.5           35.5          34.4
4               39.5           30.0          36.0
5               45.5           52.0          43.0
6               38.5           17.0          28.0
7               43.0           38.5          37.0
8               22.5           8.5           20.0
9               37.0           33.0          33.5
10              23.5           9.5           30.5
11              33.0           21.0          38.5
12              58.0           79.0          47.0
Table 7: Heart catheterisation data recorded on 12 patients.
The intervals for the values of the predictor variables that leave the parameter estimates
unchanged, however, are not as easy to derive. We consider them next. First we reorder our
observations according to the type of residual each has. Observations with zero residuals are
first grouped together and given new indices in the order of their original appearance. Those
with positive residuals come next and similarly rearranged and last come the observations
with negative residuals also reordered in a similar manner. Hence for example, observation 1
retains its index because it has the smallest index of all defining observations and observation
9 gets a 12 as its new index because it has the greatest index of all observations with negative
residuals. The complete enumeration of the grouped and reordered observations is reported
in Table 9.
According to Narula and Wellington (2002), one then forms the design matrix with its
entries rearranged as in Table 9, and after some involved calculations obtains the intervals
presented in Table 10.
2.2.5 Methods based on M-estimators
M-estimators provide a delicate compromise between robustness and efficiency. The class
of M-estimators was introduced in Huber (1964) and generalised to regression problems
in Huber (1973). Although other classes of robust estimators exist in the literature, M-
estimators are by far the most flexible and they give the best performance (see e.g. Li 1985).
Additionally, M-estimation generalises to regression models more readily than L- and R-
estimation (Li 1985; Huber 1972). Here we review very briefly the theory of M-estimation
and show how it embraces the methods of ordinary least squares and least absolute deviations
estimation as special cases.
[Figure 5: Plots of intervals for explanatory variables on which the LAD fit is resistant. The two panels show, against observation number, the admissible interval for x1 (height) and for x2 (weight).]

Consider a sample, x1, . . . , xn, from a Normal distribution with mean µ and variance σ². A
maximum likelihood estimator of the location parameter µ minimises,
∑_{i=1}^{n} ((Xi − µ)/σ)².   (2.38)
It is known that the above expression is minimised by the conventional sample mean, X̄ = n⁻¹ ∑_{i=1}^{n} Xi. Suppose now that the sample x1, . . . , xn comes from the Laplace distribution with mean µ and variance 2σ², that is,

f(xi|µ) = (2σ)⁻¹ exp( −|xi − µ|/σ ),   for i = 1, . . . , n.
In this case it is evident that a maximum likelihood estimator of the location parameter µ
minimises the expression,
∑_{i=1}^{n} |Xi − µ|/σ.   (2.39)
It can be shown that the sample median is the corresponding estimator. Equations 2.38
and 2.39 above have more in common than might be obvious at first sight. Both can be
rewritten,
∑_{i=1}^{n} ρ( (Xi − µ)/σ ),

where ρ(t) is a continuous (usually convex) real-valued function. It is obvious that the objective function is ρ(t) = t² for Equation 2.38 and ρ(t) = |t| for Equation 2.39.

Obs. No.   Lower bound   y      Upper bound
1          Defining observation
2          −∞            49.5   50.74
3          −∞            34.4   36.48
4          34.33         36.0   ∞
5          40.43         43.0   ∞
6          −∞            28.0   30.27
7          36.48         37.0   ∞
8          −∞            20.0   30.35
9          −∞            33.5   35.75
10         Defining observation
11         32.55         38.5   ∞
12         Defining observation
Table 8: Admissible intervals for the values of the response variable.
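Returning for a moment to Equations 2.38 and 2.39, a quick numerical check with made-up data (an illustration, not part of the study) confirms that the sample mean minimises the sum of squared deviations while the sample median minimises the sum of absolute deviations, and shows how strongly a single gross outlier pulls the mean.

```python
import numpy as np

x = np.array([1.2, 0.7, 3.1, 2.4, 100.0])       # made-up data with one gross outlier
grid = np.linspace(x.min(), x.max(), 100001)
sse = ((x[:, None] - grid) ** 2).sum(axis=0)    # sum of squared deviations at each grid point
sae = np.abs(x[:, None] - grid).sum(axis=0)     # sum of absolute deviations at each grid point
print(grid[sse.argmin()], x.mean())             # both approximately 21.48
print(grid[sae.argmin()], np.median(x))         # both approximately 2.4
```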
Huber (1964) defines an M-estimate (or a maximum likelihood type estimate) Tn of a loca-
tion parameter as any estimate that minimises an expression of the form
∑_{i=1}^{n} ρ(xi; Tn),

or equivalently, that satisfies

∑_{i=1}^{n} ψ(xi; Tn) = 0,

where ρ(·) is an arbitrary function and ψ(x; θ) = (∂/∂θ) ρ(x; θ) (see also Huber 2009).
From this definition we already see that the sample mean and median are examples of M-
estimators. More refined estimators than the mean and the median fall under this broad
category of estimators. It has already been pointed out that the sample median is resistant
to outliers, but has the drawback of inefficiency at the Normal distribution. The sample mean, by contrast, is more efficient than the sample median at the Normal distribution but has the disadvantage of gross sensitivity to outliers. The estimator corresponding to Huber’s
objective function,
ρ(t) = t²/2   for |t| ≤ k,
ρ(t) = k|t| − k²/2   otherwise,   (2.40)

was designed to inherit the resistance of the sample median and the efficiency of the sample mean. This can be seen by noting that for small values of t it behaves like the objective function of least squares estimation, and otherwise like that of least absolute deviations estimation. The location parameter estimator given explicitly by Equation 2.40 is referred to by some authors as a Huber (e.g. Goodall 1983, pp. 369-371). The Huber will resist outliers in the response variable (i.e. in the y-direction) but will perform very poorly in the face of leverage points (Bellio & Ventura 2005). The tuning constant k > 0 is chosen to strike a balance between efficiency and resistance. One will select a small or large value of the tuning constant depending on whether the distribution has a large or small proportion of outliers (Birkes & Dodge 1993, pp. 99-100).

Obs. number i   xi1    xi2    Intercept   yi     ri        Remarks
1               42.8   40.0   1           37.0   .000      Defining observation
10              23.5   9.5    1           30.5   .000      Defining observation
12              58.0   79.0   1           47.0   .000      Defining observation
4               39.5   30.0   1           36.0   1.671     Positive residual
5               45.5   52.0   1           43.0   2.571     Positive residual
7               43.0   38.5   1           37.0   .524      Positive residual
11              33.0   21.0   1           38.5   5.945     Positive residual
2               63.5   93.5   1           49.5   −1.245    Negative residual
3               37.5   35.5   1           34.5   −1.978    Negative residual
6               38.5   17.0   1           28.0   −2.272    Negative residual
8               22.5   8.5    1           20.0   −10.352   Negative residual
9               37.0   33.0   1           33.5   −2.252    Negative residual
Table 9: Table of grouped and reordered observations from the catheterisation data.
To give another favourable property possessed by Huber's ρ function, we make the following definition. A distribution of the form

F = (1 − ε)Φ + εH,

where Φ(x) = ∫_{−∞}^{x} (2π)^(−1/2) exp(−t²/2) dt is the standard Normal cumulative distribution function (CDF), H is the CDF of a contaminating distribution, and 0 ≤ ε < 1, is called an ε-contaminated Normal distribution (Goodall 1983, p. 372; Rosenberger & Gasko 1983, p. 317; Miller 1997, p. 10). For arbitrary choices of the ε-contamination, Huber’s estimators are the most efficient
10). For arbitrary choices of the -contamination, Huber’s estimators are the most efficient
(Goodall 1983, p. 372). However, it appears that the density corresponding to the above
contaminated Normal is not heavy-tailed enough to account for outliers sometimes encoun-
tered (Goodall 1983, p. 374). This poses a threat to the robustness of Huber’s M-estimator
to unduly discrepant values.
A class of M-estimators called redescending M-estimators counters this weakness (Bellio
& Ventura 2005). Two examples of redescending M-estimators — so called because their
influence functions return to zero at large absolute values of their arguments — are Tukey’s
biweight and Andrews’ estimator. Their respective objective functions are given by (see e.g.
Goodall 1983, pp. 348-349),
ρ(u) = (1/6)[1 − (1 − u²)³]   for |u| ≤ 1,   and   ρ(u) = 1/6   otherwise,

and

ρ(u) = (1/π)(1 − cos πu)   for |u| ≤ 1,   and   ρ(u) = 2/π   otherwise.

Obs. No.   L.B.      x1     U.B.      L.B.      x2     U.B.
1          (defining observation)
2          63.226    63.5   70.488    89.679    93.5   94.204
3          37.226    37.5   45.278    29.430    35.5   36.204
4          31.722    39.5   39.774    29.296    30.0   35.127
5          37.722    45.5   45.774    51.296    52.0   59.890
6          38.226    38.5   46.278    10.028    17.0   17.704
7          40.056    43.0   43.274    37.796    38.5   40.109
8          22.226    22.5   80.614    −7.170    8.5    9.204
9          36.726    37.0   44.778    26.088    33.0   33.704
10         (defining observation)
11         25.222    33.0   33.274    20.296    21.0   36.670
12         (defining observation)
Table 10: Table of admissible intervals of explanatory variables for non-defining observations.
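For completeness, a small sketch (not from the thesis) of these two redescending objective functions and their ψ-functions, obtained by differentiating the expressions above; the scaling with tuning constant 1 matches the displays.

```python
import numpy as np

def rho_biweight(u):
    """Tukey's biweight objective function (tuning constant 1)."""
    return np.where(np.abs(u) <= 1, (1 - (1 - u**2) ** 3) / 6, 1 / 6)

def psi_biweight(u):
    """Its derivative: redescends to zero for |u| > 1."""
    return np.where(np.abs(u) <= 1, u * (1 - u**2) ** 2, 0.0)

def rho_andrews(u):
    """Andrews' objective function (tuning constant 1)."""
    return np.where(np.abs(u) <= 1, (1 - np.cos(np.pi * u)) / np.pi, 2 / np.pi)

def psi_andrews(u):
    """Its derivative: a sine wave, also redescending to zero."""
    return np.where(np.abs(u) <= 1, np.sin(np.pi * u), 0.0)
```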
Figure 6 presents plots of, first the objective function, and then the influence function of
a Huber estimator and Tukey’s biweight/bisquare estimator. The tuning constant k for
Huber’s estimator is set to 1.5. From the graph in panel (a), Huber’s objective function can be seen to be quadratic in the region between the red lines and linear elsewhere.
top-right panel we see that Huber’s Ψ-function is monotone. What is more, for observations
beyond a certain point, the influence curve becomes constant. The bottom-right panel shows
that Tukey’s bisquare M-estimator is of a redescending type.
To generalise the robust estimation of the location parameter to regression, we need first
a way of estimating the error scale parameter σ. In the MSAE and OLS cases there is no
need for estimating scale because Equations 2.38 and 2.39 can be equivalently written as,
∑_{i=1}^{n} (Xi − µ)² = min   and   ∑_{i=1}^{n} |Xi − µ| = min,

respectively.

[Figure 6: Objective functions of Huber’s and Tukey’s bisquare estimators of location and their corresponding influence functions. Panels: (a) Huber’s ρ function, (b) Huber’s Ψ function, (c) bisquare ρ function, (d) bisquare Ψ function, each plotted against u.]

However, for other M-estimators the story is not the same; an estimate σ̂ of
scale is required because if scale is not taken into account, then the estimate ˆβ would not
respond correctly to a change in the units of y or to a change in the scale of the errors (Li
1985, p. 302). Two strategies for taking scale into account in regression are (see Li 1985,
pp. 302-303),
1. Estimate σ beforehand,
2. Estimate β and σ simultaneously.
In the first-mentioned method, an initial scale estimate ˆσ, commonly the median absolute
deviation, is calculated before each iterative step. Then considering scale as known, a
solution ˆβM to
∑_{i=1}^{n} ψ( (yi − xiβ)/σ̂ ) xiᵀ = 0   (2.41)
is determined. Assuming the MAD is used, the scale estimate σ̂_MAD is calculated as

σ̂_MAD = (1/0.6745) median_i | (yi − xi β̂(0)) − median_j (yj − xj β̂(0)) |,

where β̂(0) is a preliminary estimate of β and the factor 1/0.6745 ensures that σ̂_MAD
estimates σ when the distribution is Normal (Jacoby 2005). The MSAE estimate will
usually furnish such a preliminary estimate (Li 1985, p. 302). The method of estimating
scale and the parameters simultaneously is accomplished by solving simultaneously,
∑_{i=1}^{n} ψ( (yi − xiβ)/σ ) xiᵀ = 0   (2.42)

and

∑_{i=1}^{n} χ( (yi − xiβ)/σ ) = n a,   (2.43)

where χ(·) is a suitable bounded function and a is a suitable positive constant, often chosen as a = [(n − p)/n] E{χ(Z)}, where Z denotes a standard Normal random variable (Bellio & Ventura 2005; Li 1985, p. 303).
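A minimal sketch of the first strategy, with the scale re-estimated by the MAD before each step and Equation 2.41 solved by iteratively reweighted least squares, is given below. It uses an OLS preliminary fit rather than the MSAE fit and the tuning constant k = 1.345 discussed below; these choices, and all names, are assumptions for illustration.

```python
import numpy as np

def huber_psi(t, k=1.345):
    """Huber's psi-function: the identity in the middle, clipped at plus/minus k."""
    return np.clip(t, -k, k)

def m_regression(X, y, k=1.345, n_iter=50, tol=1e-8):
    """Huber M-regression by IRLS with MAD scale re-estimated at each step (sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]              # preliminary fit (OLS here)
    for _ in range(n_iter):
        u = y - X @ beta
        sigma = np.median(np.abs(u - np.median(u))) / 0.6745  # MAD scale estimate
        t = u / sigma
        t_safe = np.where(t == 0, 1.0, t)
        w = np.where(t == 0, 1.0, huber_psi(t, k) / t_safe)   # IRLS weights psi(t)/t
        WX = w[:, None] * X
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```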
Algorithms for computing the parameter estimates in M-regression are presented in Li (1985). Birkes and Dodge (1993) give a light-hearted exposition of M-regression, with practical examples and verification of the algorithms they use. Most, but not all, M-estimators occur as maximum likelihood estimators of some parameter for some error distribution. An example of an M-estimator that does not occur as a maximum likelihood estimator is Tukey's biweight (Birkes & Dodge 1993). Huber's M-estimators are maximum likelihood estimators for the least favourable ε-contaminated Normal distributions, whose densities, for a given value of k, are given by the expression (Goodall 1983, p. 373)

f(x) = (1 − ε)/√(2π) e^(−x²/2)   for |x| ≤ k,
f(x) = (1 − ε)/√(2π) e^(k²/2 − k|x|)   otherwise.   (2.44)
If the tuning constant is set to k = 1.345, then Huber's estimate will be 95% as efficient as the sample mean at the Normal distribution while still giving substantial resistance at alternative distributions (Jacoby 2005). Although Huber's estimates (sometimes called Huber-type estimates) are only robust to outliers in the y-direction and remain sensitive to outliers in the carriers, it appears that in some practical situations, such as bioassay experiments, only errors in the y-direction need to be considered, and outliers in the regressors can be ignored (Bellio & Ventura 2005). Methods that counter this problem (e.g. Tukey's biweight) are computationally more complicated because of multiple roots. In such cases it is important to choose a good starting point and iterate carefully, because iterative M-estimators are sensitive to the starting value when the ψ-function redescends (Bellio & Ventura 2005; Li 1985, p. 309).
2.3 Conclusion
We reviewed in the last chapter, (i) ordinary least squares regression, (ii) Bayesian Student-t
regression, (iii) least absolute deviations regression, and (iv) M-regression. Although least
squares regression has the disadvantage of being too sensitive to outliers, it is still widely used
in practice. Several diagnostic procedures have been proposed and successfully used with
OLS regression. Additionally, most packaged programs, e.g. R, have routines to compute many such diagnostic quantities as the hat matrix, residuals, residual plots, Cook’s distances, etc.
The analyst then sets his rule for rejecting influential observations. For instance, we saw
earlier, the “rejection” method proposed by Hoaglin and Welsch (1978).
Fitting a resistant OLS model in this way is thus seen to be laborious. Another disadvantage is that, though some authors call the resulting fit resistant, it is only resistant with respect to the particular observations that were eliminated, which are merely sample values; the population remains unknown. Methods (ii), (iii), and (iv) remedy this lack of resistance or robustness to ill-behaved observations.
The use of robust methods in general makes the model-fitting process more automatic.
These methods however, save perhaps for method (iv), are not without problems of their
own. If, say, we applied the method of least absolute deviations to well-behaved, Normally distributed data, a great deal of efficiency would be lost. That is, we would obtain estimates with variances larger than those from an OLS regression. How methods (i), (ii),
and (iii) actually compare in practice is the subject of the remainder of this note.
3 Methodology
3.1 Introduction
The previous section had its main focus on the method of least squares, the method of
least absolute deviations, and two implementations of the Bayesian Student-t linear model.
No attempt was made to show their relative performances. The remainder of this note is
devoted to comparing the four models on the basis of their respective abilities to handle
compromised model assumptions. To do this, first we fit all four of the models under
conditions favourable to the least squares regression model to see how robust alternatives
to OLS considered herein perform relative to OLS. Then we will break a few assumptions of
the Gaussian linear model one at a time and compare the models again. In order to make
our comparisons we will use standard model performance criteria discussed below.
3.2 Research Design
Each model will be fitted to a thousand simulated samples of sizes 100 and 400. Addition-
ally, the proportion of contamination will take on values 0%, 5%, and 25%. We will work
with the case of 2 explanatory variables. The idea is to fix the parameters βj and see how
close each model comes under different conditions often met with in practice. For instance,
since in practice deviation from symmetry usually comes in the form of positive skewness
(e.g. Miller 1997, p. 16), we will not consider negatively skewed data. The explanatory
variables Xij will be generated as independent standard Normal random variables. The random errors εi will be generated from the standard Normal distribution in the standard-assumptions scenario.
In order to contaminate the response variable y, we will use a variant of Kianifard and Swallow's (n.d.) method. That is, (1) we randomly select a proportion of the y's; (2) to each of the selected y-values, we add an appropriate δ in place of a random error ε. For instance, for a sample size of 100 with a proportion of contamination of 25%, we will randomly select 25 y's and, in place of ε, add a δ to each of the 25 selected y's. It will be recalled from the literature review that outliers in the explanatory variables give rise to high leverage points. Hence, to introduce high-leverage points, (1) we randomly select a proportion of the observations; (2) for each of the selected observations, we substitute an appropriate pair (δ₁*, δ₂*) for that observation's pair of explanatory variables, say (Xk1, Xk2).
Possible scenarios for a study of this kind are innumerable, but our scope will not be too
broad. The scenarios we will investigate are as follows:
1. all assumptions valid,
2. a sample with 5% contamination,
3. a sample with 25% contamination,
4. a sample consisting of 25% of positively skewed observations,
5. a sample consisting of 100% of positively skewed observations, and,
6. a sample with 5% contamination and 25% positively skewed observations.

[Figure 7: Effects of 5% contamination on location. Histograms of the simulated mean (horizontal axis roughly 1.6 to 2.8) and of the median, T1 and T2 (horizontal axes roughly −0.4 to 0.4); vertical axes show frequency.]
Exactly how to create the first scenario is straightforward. To simulate the second scenario, we will add 30 to a randomly selected 5% of the ei's, substitute 10 for 5% of the X1's, and 200 for 5% of the X2's. The third scenario will be simulated similarly but with 25% in place of the 5% in the second scenario. This same method of contamination will be used for scenario six. To introduce skewness we will sample the random errors from a Gamma distribution with α = 2 and λ = 0.5. The Gamma distribution is always positively skewed; its coefficient of skewness can be shown to be 2/√α (see e.g. Randles & Wolfe 1979, p. 415).
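A sketch of this data-generating step is given below. It is an illustration only: the true coefficients (3, 10, −5) are those reported in Section 4, the Gamma parameter λ is taken to be a rate, and treating the outlier and leverage contamination as affecting the same randomly chosen observations is an assumption.

```python
import numpy as np

def simulate(n, contam=0.0, skew_prop=0.0, beta=(3.0, 10.0, -5.0), rng=None):
    """Generate one sample (X, y) under the contamination/skewness recipes above (sketch)."""
    rng = np.random.default_rng(rng)
    b0, b1, b2 = beta
    X1, X2 = rng.standard_normal(n), rng.standard_normal(n)
    eps = rng.standard_normal(n)
    if skew_prop > 0:
        # Gamma(alpha = 2, lambda = 0.5) errors, centred at their mean (rate parametrisation assumed)
        m = round(skew_prop * n)
        idx = rng.choice(n, m, replace=False)
        eps[idx] = rng.gamma(shape=2.0, scale=2.0, size=m) - 4.0
    y = b0 + b1 * X1 + b2 * X2 + eps
    if contam > 0:
        m = round(contam * n)
        idx = rng.choice(n, m, replace=False)
        # delta = 30 replaces the random error for the selected responses (outliers) ...
        y[idx] = b0 + b1 * X1[idx] + b2 * X2[idx] + 30.0
        # ... and the same observations receive the leverage pair (10, 200) in the carriers
        X1[idx], X2[idx] = 10.0, 200.0
    return np.column_stack([np.ones(n), X1, X2]), y
```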
Let us consider the simpler problem of location under one of the contamination scenarios.
One realisation of the 5%-contamination scenario is depicted in Figure 7. Note how the distribution of the mean loses its symmetry.
It is worth noting that in each scenario involving contamination, not only the response variable but also the explanatory variables will be contaminated. Contaminated
values in the response variable will be referred to as outliers. Those in the explanatory
variables will be said to give rise to high leverage (see Birkes & Dodge 1993, p. 206).
3.3 Research Objectives
The objective of this study is to see how soon and how badly least squares regression loses
optimality to the other methods treated herein as its assumptions are violated in a myriad
of ways and to different extents.
Specifically, our main research objectives are:
1. to see how much better OLS fares than the alternative methods when its assumptions are fully met,
2. to see roughly at what point OLS starts to perform poorly relative to alternative
methods as its assumptions are violated,
3. to find out which one of the models under study performs best under what conditions,
4. to see the role played by sample size n.
3.4 Model performance criteria
For each of the four models, we will calculate the Root Mean Square Error (RMSE) and
its components, i.e. the bias and variance of the parameter estimates. Bias will give an
average measure of distance between the true parameter vector (β0, β1, β2) and its estimates
(ˆβ0, ˆβ1, ˆβ2). The RMSE can be viewed as a measure of accuracy (see e.g. Lohr 2010, p. 32).
It is defined as the square-root of the Mean Squared Error (MSE). Recall that the MSE of
an estimate ˆβj of βj is given by,
MSE(β̂j) = Var(β̂j) + Bias(β̂j)².   (3.45)
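These summaries are straightforward to compute from the matrix of simulated estimates; a small sketch with illustrative names:

```python
import numpy as np

def performance(estimates, true_beta):
    """Element-wise bias, RMSE and implied variance from an (n_sim x p) array of estimates."""
    estimates = np.asarray(estimates)
    true_beta = np.asarray(true_beta)
    bias = estimates.mean(axis=0) - true_beta
    rmse = np.sqrt(((estimates - true_beta) ** 2).mean(axis=0))
    variance = rmse ** 2 - bias ** 2      # Var = MSE - Bias^2, rearranging Eq. 3.45
    return bias, rmse, variance
```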
In order to see how the coverage probability is affected, we will check whether nominal 95% confidence intervals contain the true parameter roughly 95% of the time. The effect on the true coverage probability has very important implications for the robustness of validity of the associated t-tests (see e.g. Huber 2009; Miller 1997, p. 9). We elaborate more on this
later.
3.5 Conclusion
Although only the values of the RMSE and bias will be tabulated, it will be important to
also consider the variance. This will easily be obtained from rearranging Equation 3.45.
Also, in addition to the element-wise bias terms, it will be worthwhile to have an overall
measure of distance between the true parameter and its estimate. To this end, we will use
the Euclidean distance between the vector of true parameters and the vector of estimated
parameters. The reader is reminded that the Euclidean distance between vectors u and
v is defined as d(u, v) = ||u − v|| (see e.g. Anton & Rorres 2005). Sample sizes 100 and
400 were chosen with the hope of obtaining approximate asymptotic relative efficiencies of
the models. A look at the disparity between the true and the nominally stated coverage
probabilities will lead us to draw some important conclusions about the disparity between
the true significance level α* and the nominally stated one, α.
4 Results/Applications
4.1 Introduction
This part of our note reports results from a Monte Carlo study carried out in the ways discussed in the preceding section. For ease of reference, we denote the heteroscedastic Bayesian Student-t model by T1 and the homoscedastic one by T2. Table 12 (printed in the appendix) is partitioned into five sections, each of which summarises results from one of the first five
scenarios, and Table 11 summarises results from the final scenario. The bias and RMSE of
the parameter estimates are calculated and given element-wise to afford a closer look at the
performances of the models.
4.2 Scenario one
Bias under all models diminishes for the larger sample size of n = 400 where OLS estimates
seem to have been the most biased. To show that we need not be overly concerned about the
bias of our OLS estimates, note that for n = 100, the vector (3, 10, −5) was estimated under
OLS as (3.0053, 10.0023, −5.0007) on average. We conclude that, for the first scenario, all
models performed well as far as bias is concerned.
A measure of accuracy, the root mean square error (RMSE) can be considered next, as an
estimator can have sufficiently low bias but be severely unstable. Note that, for both sample
sizes, OLS estimates are the most accurate, with accuracy improving for the larger sample
size.
The variance of the parameter estimates can easily be calculated as

Var(β̂j) = RMSE(β̂j)² − Bias(β̂j)².   (4.46)

Let β̂j denote the least squares estimate of βj and β̃j the corresponding least absolute deviations estimate. If S²(β̂j) and S²(β̃j) denote the respective sample variances, then we observe that for n = 400,

S²(β̂0)/S²(β̃0) = (.0488² − .0002²)/(.0617² − .0008²) ≈ .63,   (4.47)
S²(β̂1)/S²(β̃1) = (.0493² − .0019²)/(.0621² − .0007²) ≈ .63,   (4.48)
S²(β̂2)/S²(β̃2) = (.0510² − .0027²)/(.0630² − .00288²) ≈ .65.   (4.49)
These ratios average out to .64, the asymptotic relative efficiency of the sample median to the sample mean for the Normal distribution. Similarly, the asymptotic efficiencies of the location estimators corresponding to the two Bayesian models relative to the sample mean were calculated as approximately .81 and .84 under the Normal model. We thus see
little loss in efficiency in using Student-t regression instead of OLS when the assumptions of
the Normal linear model are fully satisfied. Thus both Student-t models out-perform LAD
under conditions favourable to the Normal linear model based on efficiency.
4.3 Scenarios two and three
The second section of Table 12 summarises results from the second scenario under which
5% of the observations are outlying. Let us consider only the first two parameter estimates
since all models poorly estimated β2. We see that OLS estimates were the most stable in
the 5%-contamination scenario. The method of least absolute deviations did worse than
the other three models on the basis of stability — LAD estimates also seem to have been
the most inaccurate among the four estimates. To quantify the relative performances of the
models, consider the case of n = 100. One can then show that the efficiencies of LAD, T1, and T2 are approximately 66%, 85%, and 86%, respectively, relative to OLS. The seemingly
counter-intuitive result that LAD under-performed OLS in this scenario is due to the greater
susceptibility that LAD has to high leverage than OLS (see e.g. Birkes & Dodge 1993, p.
191; Jacoby 2005).
Under the third scenario, relative performances remained more or less the same as in the
second scenario. If once again we consider only the first two parameter estimates, it will
be apparent that under this scenario OLS had the least variances still. LAD also had the
greatest variances under this scenario. Efficiencies of LAD, T1, and T2 relative to OLS can be calculated as approximately 65%, 82%, and 84%, respectively. We conclude then that
OLS has done best under the second and third scenarios. LAD has done poorest, and the
Bayesian models did not perform too poorly.
4.4 Scenario 4
For a quarter of each sample, random errors were taken as variates from a Gamma distri-
bution, centred at the mean 0, and the rest from a standard Normal distribution under this
scenario. We see relatively low bias under OLS, particularly in the intercept. OLS estimates, however, have shown the greatest variances. LAD had slightly lower variances than
did OLS. The homoscedastic Student-t model had the least variance. The reader might
find it instructive to calculate the relative changes in the variances of parameter estimates
from the first scenario to the fourth. For n = 100, the variances of the OLS estimates,
LAD estimates, T1 estimates, and T2 estimates are about 2.6 times, 1.5 times, 1.5 times
and 1.45 times as large as in the first scenario. Hence we conclude that under this scenario,
OLS performs worst in terms of the variances of parameter estimates; the homoscedastic
Student-t model has done best.
More Related Content

What's hot

The analysis of doubly censored survival data
The analysis of doubly censored survival dataThe analysis of doubly censored survival data
The analysis of doubly censored survival dataLonghow Lam
 
Clustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory EClustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory EGabriele Pompa, PhD
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)Denis Zuev
 
Nelson_Rei_Bernardino_PhD_Thesis_2008
Nelson_Rei_Bernardino_PhD_Thesis_2008Nelson_Rei_Bernardino_PhD_Thesis_2008
Nelson_Rei_Bernardino_PhD_Thesis_2008Nelson Rei Bernardino
 
Mathematical formula handbook
Mathematical formula handbookMathematical formula handbook
Mathematical formula handbooktilya123
 

What's hot (13)

The analysis of doubly censored survival data
The analysis of doubly censored survival dataThe analysis of doubly censored survival data
The analysis of doubly censored survival data
 
thesis
thesisthesis
thesis
 
phd-thesis
phd-thesisphd-thesis
phd-thesis
 
Clustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory EClustering Financial Time Series and Evidences of Memory E
Clustering Financial Time Series and Evidences of Memory E
 
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
New_and_Improved_Robust_Portfolio_Selection_Models_ZUEV(dphil)
 
probabilidades.pdf
probabilidades.pdfprobabilidades.pdf
probabilidades.pdf
 
Thesispdf
ThesispdfThesispdf
Thesispdf
 
Barret templates
Barret templatesBarret templates
Barret templates
 
Nelson_Rei_Bernardino_PhD_Thesis_2008
Nelson_Rei_Bernardino_PhD_Thesis_2008Nelson_Rei_Bernardino_PhD_Thesis_2008
Nelson_Rei_Bernardino_PhD_Thesis_2008
 
Thesis
ThesisThesis
Thesis
 
Mathematical formula handbook
Mathematical formula handbookMathematical formula handbook
Mathematical formula handbook
 
Notes on probability 2
Notes on probability 2Notes on probability 2
Notes on probability 2
 
jmaruski_1
jmaruski_1jmaruski_1
jmaruski_1
 

Similar to HonsTokelo

Lecturenotesstatistics
LecturenotesstatisticsLecturenotesstatistics
LecturenotesstatisticsRekha Goel
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
Opinion Formation about Childhood Immunization and Disease Spread on Networks
Opinion Formation about Childhood Immunization and Disease Spread on NetworksOpinion Formation about Childhood Immunization and Disease Spread on Networks
Opinion Formation about Childhood Immunization and Disease Spread on NetworksZhao Shanshan
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverAkshat Srivastava
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyAimonJamali
 
The value at risk
The value at risk The value at risk
The value at risk Jibin Lin
 
Ric walter (auth.) numerical methods and optimization a consumer guide-sprin...
Ric walter (auth.) numerical methods and optimization  a consumer guide-sprin...Ric walter (auth.) numerical methods and optimization  a consumer guide-sprin...
Ric walter (auth.) numerical methods and optimization a consumer guide-sprin...valentincivil
 
Classification System for Impedance Spectra
Classification System for Impedance SpectraClassification System for Impedance Spectra
Classification System for Impedance SpectraCarl Sapp
 
Stochastic Processes and Simulations – A Machine Learning Perspective
Stochastic Processes and Simulations – A Machine Learning PerspectiveStochastic Processes and Simulations – A Machine Learning Perspective
Stochastic Processes and Simulations – A Machine Learning Perspectivee2wi67sy4816pahn
 
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdf
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdfNavarro & Foxcroft (2018). Learning statistics with jamovi (1).pdf
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdfTerimSura
 
Thesis Fabian Brull
Thesis Fabian BrullThesis Fabian Brull
Thesis Fabian BrullFabian Brull
 

Similar to HonsTokelo (20)

Lecturenotesstatistics
LecturenotesstatisticsLecturenotesstatistics
Lecturenotesstatistics
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
thesis
thesisthesis
thesis
 
Kretz dis
Kretz disKretz dis
Kretz dis
 
Non omniscience
Non omniscienceNon omniscience
Non omniscience
 
Opinion Formation about Childhood Immunization and Disease Spread on Networks
Opinion Formation about Childhood Immunization and Disease Spread on NetworksOpinion Formation about Childhood Immunization and Disease Spread on Networks
Opinion Formation about Childhood Immunization and Disease Spread on Networks
 
phd_unimi_R08725
phd_unimi_R08725phd_unimi_R08725
phd_unimi_R08725
 
MSci Report
MSci ReportMSci Report
MSci Report
 
thesis
thesisthesis
thesis
 
MSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land RoverMSc Thesis - Jaguar Land Rover
MSc Thesis - Jaguar Land Rover
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
The value at risk
The value at risk The value at risk
The value at risk
 
Ric walter (auth.) numerical methods and optimization a consumer guide-sprin...
Ric walter (auth.) numerical methods and optimization  a consumer guide-sprin...Ric walter (auth.) numerical methods and optimization  a consumer guide-sprin...
Ric walter (auth.) numerical methods and optimization a consumer guide-sprin...
 
Classification System for Impedance Spectra
Classification System for Impedance SpectraClassification System for Impedance Spectra
Classification System for Impedance Spectra
 
Sunidhi_MSc_F2015
Sunidhi_MSc_F2015Sunidhi_MSc_F2015
Sunidhi_MSc_F2015
 
MScThesis1
MScThesis1MScThesis1
MScThesis1
 
Stochastic Processes and Simulations – A Machine Learning Perspective
Stochastic Processes and Simulations – A Machine Learning PerspectiveStochastic Processes and Simulations – A Machine Learning Perspective
Stochastic Processes and Simulations – A Machine Learning Perspective
 
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdf
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdfNavarro & Foxcroft (2018). Learning statistics with jamovi (1).pdf
Navarro & Foxcroft (2018). Learning statistics with jamovi (1).pdf
 
Thesis Fabian Brull
Thesis Fabian BrullThesis Fabian Brull
Thesis Fabian Brull
 

HonsTokelo

  • 1. Robust Statistical Procedures by Tokelo Khalema (2008060978) Supervisor: Sean van der Merwe University of the Free State Submitted in partial fulfilment of the requirements for the degree: B.Sc. Hons. Mathematical Statistics November 2013
  • 2. Declaration I hereby declare that this work, submitted for the degree B.Sc. Hons. (Mathematical Statis- tics), at the University of the Free State, is my own original work and has not previously been submitted, for degree purposes or otherwise, to any other institution of higher learning. I further declare that all sources cited or quoted are indicated and acknowledged by means of a comprehensive list of references. Copyright hereby cedes to the University of the Free State. Tokelo Khalema i
  • 3. Abstract The Gaussian linear model, two Bayesian Student-t regression models, and the method of least absolute deviations are compared in a Monte Carlo study. Their relative perfor- mances under conditions favourable to the Gaussian linear model (or OLS regression) are first investigated and then a few violations of assumptions that underly the OLS re- gression model are made and the models compared again. Our object is twofold. First we want to see how soon and how severely the least squares regression model starts to lose optimality against three of its common alternatives mentioned above. Second, we want to see how the Bayesian Student-t models fare relative to the method of least absolute deviations in cases where OLS would normally not be applied. In addition to opening with a review of influence diagnosis in least squares, we treat at some length, the two Bayesian Student-t regression models, the method of least absolute deviations, and finally, we briefly review M-estimation and its generalisation, i.e. M-regression. ii
  • 4. Acknowledgement That the work reported here finally came to fruition was not achieved single-handedly. Continous guidance, support and insightful comments provided by Sean van der Merwe are gratefully acknowledged. iii
  • 5. Problem Statement When Normal regression model assumptions are grossly violated, the practitioner is likely to be faced with the problem of having to choose among a plethora of robust alternatives. It would therefore be preferable to know which alternative method would yield more “stable” results than the others under exactly what scenario of violated model assumptions. For instance, say one of the models considered in this study performed best (according to some prespecified criteria) for sample sizes less than or equal to 100 comprised of up to 80% of outliers but worst otherwise. Then a practitioner with this knowledge would only apply this model to very heavily contaminated data for small to moderate sample sizes. Our study is not as exhaustive as a comprehensive investigation would otherwise be; we will base our judgement on some of the most commonly violated model assumptions, namely, skewness, non-Normality and the presence of outliers. iv
  • 6. Overview This paper is divided into five sections. The first section sheds some light on the concept of robustness and reviews some of the most common location estimators. This is followed by a generalisation of location estimation to regression in the literature study section. Regression methods reviewed in the literature study are, ordinary least squares, Bayesian Student-t regression, and the method of least absolute deviations. The third section presents our research methodology, and the penultimate section discusses results from our Monte Carlo study. Finally we close the note with a section on conclusions drawn from our study. v
  • 7. Contents Declaration i 1 Introduction 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 The notion of robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Alternative estimators of location and scale . . . . . . . . . . . . . . . . . . 3 1.4 The need for robust methods . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Literature Study 9 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Least squares estimation . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Robust Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Introduction to Robust Regression . . . . . . . . . . . . . . . . . . . 16 2.2.2 The Independent Student-t Regression Model . . . . . . . . . . . . . 17 2.2.3 Objective treatment of the Student-t linear model . . . . . . . . . . 24 2.2.4 Least Absolute Deviations . . . . . . . . . . . . . . . . . . . . . . . . 25 2.2.5 Methods based on M-estimators . . . . . . . . . . . . . . . . . . . . 30 2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3 Methodology 38 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Research Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4 Model performance criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4 Results/Applications 42 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2 Scenario one . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3 Scenarios two and three . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Scenario 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5 Scenario five . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.6 Scenario six . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.7 Effects on inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5 Closing remarks 46 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2 Summary of key results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3 Suggestions for further research . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A Table of results i vi
  • 9. List of Figures 1 Regression diagnostic plot for the OLS model with all observations. . . . . . 18 2 Regression diagnostic plot for the OLS model with all but the first, third, fourth, and twenty first observations. . . . . . . . . . . . . . . . . . . . . . . 19 3 Jackknifed parameter estimates for the stackloss data. . . . . . . . . . . . . 20 4 Plot of the OLS and LAD regressions of stopping distance on velocity. . . . 29 5 Plots of intervals for explanatory variables on which the LAD fit is resistant. 31 6 Objective functions of Huber’s and Tukey’s Bisquare estimators of location and their corresponding influence functions. . . . . . . . . . . . . . . . . . . 35 7 Effects of 5% contamination on location. . . . . . . . . . . . . . . . . . . . . 39 8 Plots of observed coverage probability against nominally stated probability for OLS under scenario one. . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 9 Simulated OLS confidence intervals for the parameters under standard con- ditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii 10 Simulated OLS confidence intervals for β2 under the 5%-contamination sce- nario. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii viii
  • 10. List of Tables 1 Michelson’s supplementary determinations of the velocity of light in air. . . 5 2 Some location estimates and their respective bootstrap variances based on 10 000 samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 Operational data of a plant for the oxidation of ammonia to nitric acid. . . 15 4 Summary table for influence diagnostics from the regression on the stackloss data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5 Data on stopping distance as a function of velocity. . . . . . . . . . . . . . . 28 6 Summary of the calculations of the LAD regression of stopping distance on velocity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 7 Heart catheterisation data recorded on 12 patients. . . . . . . . . . . . . . . 30 8 Admissible intervals for the values of the response variable. . . . . . . . . . 32 9 Table of grouped and reordered observations from the catheterisation data. 33 10 Table of admissible intervals of explanatory variables for non-defining obser- vations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 11 Results for the sixth scenario from a Monte Carlo simulation study with i = 1000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 12 Results for scenarios one to five from a Monte Carlo simulation study with i = 1000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i ix
  • 11. 1 Introduction 1.1 Introduction It is widely acknowledged that Normal theory has for hundreds of years played an unrivalled role in all forms of inference. About fifty years after the method of least absolute deviations was introduced, Legendre introduced the method of least squares and the notion of a linear model in his Nouvelles m`ethodes pour la determination des orbites des comet`es (Stigler 1977; Birkes & Dodge 1993, pp. 29, 57; Wang & Chow 1994, pp. 5-7). Later Gauss assumed that the random errors in Legendre’s linear model followed a Normal distribution and proved some important properties of his estimates (Wang & Chow 1994, p. 6). Gauss also derived many important properties of the Normal distribution which then came to be known also as the Gaussian distribution. The linear model which assumes Normality of the random errors came to be called the Normal (or Gaussian) linear model. Also instrumental in the development of the theory of the Gaussian linear model were Fisher and Markov (Wang & Chow 1994, p. 6). It was not too long until the method of least absolute deviations was overshadowed by that of least squares (Birkes & Dodge 1993, p. 57). The linear model as we know it today, with uncorrelated homoscedastic random errors, is sometimes referred to as the Gauss-Markov linear model (Wang & Chow 1994, p. 147), or simply and more commonly, as the Gaussian (or Normal) linear model. It is the object of this note to review the Normal linear model and to show the vulrenability of least squares estimates to gross errors. The fitting process of the Normal linear model will also be shown to be quite laborious. Then we will investigate alternative methods that have already been proposed in the literature and compare them in a simulation study which includes a variety of scenarios. In particular, each model will be studied under conditions that satisfy the as- sumptions of the least squares regression model, and then under increasingly unfavourable conditions in which the assumptions are violated in a variety of ways. Methods investigated herein are the method of Least Absolute Deviations (or Minimum Sum of Absolute Devia- tions), and the Bayesian Student-t regression model. Two different implementations of the latter regression model will be considered and compared. 1.2 The notion of robustness All models, statistical or otherwise, are based on a set of assumptions. Sometimes it is desirable to have a procedure whose output is not heavily reliant on the validity of the assumptions. The reader will recall that the assumptions on which the Gaussian linear model is based are (see e.g. Rice 2007, p. 547): 1. Normality of the error distribution, 2. independence of the random errors and, 3. error variance homogeneity. 1
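In practice these assumptions are usually examined through the residuals of a fitted model. A minimal sketch in R (the data are simulated, and the particular checks shown are merely common choices, not the only ones available):

set.seed(123)
x <- runif(40)
y <- 1 + 2 * x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)

shapiro.test(resid(fit))        # a formal test of Normality of the residuals
plot(fitted(fit), resid(fit))   # residuals against fitted values: look for non-constant spread
acf(resid(fit))                 # serial correlation among the residuals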
  • 12. Deviation from the Normality assumption can come in a variety of ways — the error dis- tribution might exhibit more skewness, fatter tails, or more outliers1 than would otherwise be expected if the underlying distribution were Normal. Although several formal test pro- cedures have been proposed for testing Normality, examination of the residuals is more common. The second assumption is more troublesome to test. What analysts usually do is they assume that only two types of dependence are possible, namely, one of blocking and the other of serial correlation. Then the former could be dealt with by adding an extra parameter to represent the block effect and the latter could be diagnosed using time-series analysis tools (Ravishanker & Dey 2002, pp. 125, 291; Miller 1997, pp. 33-34 ). Another difficulty associated with the independence assumption is that it underlies both parametric and non-parametric statistical procedures (Rice 2007, p. 505). So should one find that the independence assumption is suspect, they are not at liberty to use, say, the Kruskal-Wallis test in place of the parametric one-way ANOVA. Most formal test procedures for variance homogeneity are not robust to non-Normality (Stigler 2010). An infamous example of these is Bartlett’s test of which the F-test of the equality of two population variances is a special case (Sokal & Rohlf 1969, p. 375; Stigler 2010; Miller 1997, p. 264). Other perhaps less famous tests are Hartley’s and Cochran’s tests (Rivest 1986; Miller 1997, p. 264). Rivest (1986) proves that Bartlett’s, Cochran’s, and Hartley’s tests are non-robust in small sample situations and concludes that they are all liberal when the underlying distribution is long-tailed. Levene’s test (Levene 1960) is usually considered a robust test of variance homogeneity (see e.g. Vorapongsathorn et al. 2004), but a less formal diagnostic procedure commonly employed in practice is to create a plot of the residuals against the fitted values. The last paragraph left the word “robust” unexplained. What does it mean? It was Box, G.E.P. who first used the word in the context of data analysis (Stigler 2010). Although finer definitions of the word exist, the following will suffice for our purposes: A procedure is called robust if inferences drawn from it are not overly dependent on the assumptions upon which it is predicated. For instance, the t-test for equality of population means has been shown to be robust to departures from Normality (Miller 1997, p. 5). The same applies a fortiori to the F-test of equality of several population means (Miller 1997, p. 80); but the F-test of equality of population variances is not robust to non-Normality (Huber 2009, p. 297; Rice 2007, p. 464). By and large, F-tests on the equality of location parameters are quite robust, while those on the equality of scale parameters are not (Miller 1997, p. 265). The word “resistance” is also often encountered in robust statistics. Although we will refrain from distinguishing between resistance and robustness, it is well to give a more widely accepted definition of the word “resistance”. A statistic or procedure is resistant if it is not unduly sensitive to ill-behaved data (Goodall 1983, p. 349). Contrast this with the notion of robustness. For example, the sample median is a resistant measure of location, while the sample mean is non-resistant. 
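A small illustration of this contrast, as a sketch in R (the sample and the size of the gross error are arbitrary):

# a well-behaved sample and the same sample with one gross error appended
x     <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7)
x_bad <- c(x, 55)

mean(x);   mean(x_bad)     # the mean is dragged towards the outlier
median(x); median(x_bad)   # the median barely moves

The single aberrant value shifts the sample mean by more than five units, while the median changes only in its second decimal place — precisely the sense in which the median is resistant and the mean is not.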
However, if one were to estimate the location parameter µ for a N(µ, σ2 ) population2 , then the sample mean would be a more efficient estimator than the sample median — for a Normal parent distribution, the asymptotic efficiency of the sample mean relative to the median is π/2 (see e.g. Birkes & Dodge 1993, p. 192). Thus for infinitely large Gaussian samples, the sample median is only about 64% as efficient as the 1Loosely speaking, an outlier is a gross error. 2Recall that the mean coincides with the median in a Normal population. 2
sample mean. We see then that in using a robust procedure, one should be prepared to lose some efficiency if the underlying distribution turns out to be Normal.

1.3 Alternative estimators of location and scale

The time-honoured sample mean is a non-robust estimator of location because it weighs all observations equally. On the other hand, since the sample median depends only on the centremost values, it is robust to tail observations. A measure of scale related to and derived from the sample median is the median absolute deviation (MAD). Consider a sample, x_1, \ldots, x_n, from some distribution F. If we denote the median by \tilde{x} = \mathrm{median}\{x_1, \ldots, x_n\}, then the MAD is defined thus,

  \mathrm{MAD} \propto \mathrm{median}\{|x_1 - \tilde{x}|, \ldots, |x_n - \tilde{x}|\}.

Unlike the conventional sample standard deviation, which weights observations quadratically, it is very resistant to outliers (Rice 2007, p. 402).

Another robust estimator of location related to the median is the trimmed mean. A δ-trimmed mean is calculated by ordering a sample Y_1, \ldots, Y_n and discarding the nδ leftmost and rightmost observations (see e.g. Rice 2007, p. 397; Rosenberger & Gasko 1983, pp. 307-308). Hence, for an ordered sample, Y_{(1)} \leq \cdots \leq Y_{(n)}, the δ-trimmed mean may be written as (Miller 1997, p. 29),

  \bar{Y}_T = \frac{Y_{(n\delta+1)} + Y_{(n\delta+2)} + \cdots + Y_{(n-n\delta)}}{(1 - 2\delta)n}.

The δ-trimmed mean is often simply denoted \bar{Y}_\delta. The sample median is approximately a 50%-trimmed mean, and of course a 0%-trimmed mean is the ordinary sample mean (Rosenberger & Gasko 1983, p. 308).

Somewhat related to trimming is the process of Winsorising (see e.g. Miller 1997, p. 29). In fact, the Winsorised variance is an appropriate scale estimator to use with the trimmed mean (Miller 1997, p. 29). Winsorising consists in replacing lower-tail observations with larger order statistics and upper-tail observations with smaller order statistics (Miller 1997, pp. 29-30). Say we observe a sample Y_1, \ldots, Y_n from some parent distribution. Then we can obtain a Winsorised sample by defining^3

  y_{W(i)} =
    \begin{cases}
      y_{(n\delta+1)},  & i = 1, \ldots, n\delta \\
      y_{(i)},          & i = n\delta + 1, \ldots, n - n\delta \\
      y_{(n-n\delta)},  & i = n - n\delta + 1, \ldots, n,
    \end{cases}
  \qquad (1.1)

^3 Note that we use lower-case letters to denote observed values and upper-case letters to denote random quantities.
where the fraction δ is the same as for the trimmed mean (Miller 1997, p. 30). The δ-Winsorised mean is then defined as (Miller 1997, p. 30),

  \bar{Y}_W = \frac{1}{n} \sum_{i=1}^{n} Y_{W(i)}                                                                        (1.2)
            = \frac{1}{n} \Big[ n\delta\, Y_{(n\delta+1)} + \sum_{i=n\delta+1}^{n-n\delta} Y_{(i)} + n\delta\, Y_{(n-n\delta)} \Big]   (1.3)
            = \frac{n\delta\, Y_{(n\delta+1)} + Y_{(n\delta+1)} + Y_{(n\delta+2)} + \cdots + Y_{(n-n\delta)} + n\delta\, Y_{(n-n\delta)}}{n},   (1.4)

from which the δ-Winsorised variance is defined as (Miller 1997, p. 30),

  s_W^2 = \frac{1}{(1-2\delta)^2(n-1)} \sum_{i=1}^{n} \big(Y_{W(i)} - \bar{Y}_W\big)^2                                   (1.5)
        = \frac{1}{(1-2\delta)^2(n-1)} \Big[ n\delta \big(Y_{(n\delta+1)} - \bar{Y}_W\big)^2                            (1.6)
          + \sum_{i=n\delta+1}^{n-n\delta} \big(Y_{(i)} - \bar{Y}_W\big)^2 + n\delta \big(Y_{(n-n\delta)} - \bar{Y}_W\big)^2 \Big].   (1.7)

A location estimator proposed by Huber (1964) that has good robustness and efficiency properties has been related to Winsorising by its originator. This estimator will be discussed again in later sections. If the above Winsorising process were performed "asymmetrically", so that the g leftmost and h rightmost observations were Winsorised, where g and h need not be equal, then define (see Huber 1964),

  T = \frac{g u + Y_{(g+1)} + Y_{(g+2)} + \cdots + Y_{(n-h)} + h v}{n},   (1.8)

where the numbers u, v satisfy Y_{(g)} \leq u \leq Y_{(g+1)} and Y_{(n-h)} \leq v \leq Y_{(n-h+1)}, respectively. Huber (1964) posits that the estimator T is asymptotically equivalent to Winsorising. A close competitor to estimators of the type defined by Equation 1.8 is that proposed by Hodges and Lehmann (1963) (see Huber 1964). It may be taken as defined by (Huber 1964),

  T = \mathrm{median}\{ (Y_i + Y_j)/2 \mid i < j \}.

The Hodges-Lehmann estimator, as it is called, is associated with the signed-rank statistic (Miller 1997, p. 24). Other location estimators proposed for their robustness, although rarely met with in practice, are the outmean and the Edgeworth. The former can be written (Stigler 1977),

  \bar{Y}^c_{.25} = 2\bar{Y} - \bar{Y}_{.25},

where \bar{Y}_{.25} denotes the 25%-trimmed mean. The latter estimator is a weighted average of the lower quartile, the median, and the upper quartile, with weights in proportions 5 : 6 : 5
(Stigler 1977). The outmean is known to perform poorly for long-tailed distributions (Stigler 1977).

Consider the data given in Table 1, taken from Stigler (1977). The data reported here are Michelson's supplementary determinations of the velocity of light in air. We calculate the sample mean of these data to be 756.2, and the sample median to be 774.0. Respective bootstrap variances of these estimates are 475.4 and 402.4. Hence, one sees that the median is a more stable location estimator than the mean for these data. Values of other location estimators are given in the fourth row of Table 2. The Hodges-Lehmann estimator has the highest variance of 497.7. The fifth, sixth, seventh, and eighth rows of Table 2 give values of location estimators and their respective bootstrap variances for other data sets not presented in this note. The last row shows that the Hodges-Lehmann estimator for the last data set has a value far more stable than the sample mean. The sample median still performs best in this case. However, for the third data set, the sample median performs worst. Overall, the trimmed means never seem to be too unstable.

    Measures of velocity of light in air
    883   711   578   696   851
    816   611   796   573   809
    778   599   774   748   723
    796  1051   820   748   682
    781   772   797

Table 1: Michelson's supplementary determinations of the velocity of light in air.

Several other robust estimators of location and scale have been proposed.

1.4 The need for robust methods

Sometimes the hope that Normality is satisfied is too fond to be entertained. It might be that the data are too heavily contaminated, as would be the case if the underlying distribution were the Student-t distribution or any of the other heavy-tailed distributions. Although issues such as skewness and variance heterogeneity can sometimes be remedied by the use of some transformation, the problem of fat tails is not as easy to get around. One will usually have to adopt a different model altogether. For instance, the errors could be assumed to follow a Student-t distribution and parameter estimates could be calculated. As will be shown later, this is not an easy task.

When classical methods are applied to contaminated data problems, often the analyst will first clean the data by making use of some outlier rejection method. Then he would continue to apply the method to the remaining scores as if they constituted the whole sample. To see the defect of such a procedure, consider a sample, {X_1, \ldots, X_n}, from some parent distribution, say Normal. One method of outlier rejection would be to form the statistic (X_i - \bar{X})/s, where s^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2/(n-1) (Hawkins 1980, pp. 11-12). Now if c is chosen such that the test has some prespecified experimentwise significance level α, then any X_i satisfying |X_i - \bar{X}|/s > c would be identified as an outlier and thus rejected. The
  • 17. presence of multiple outliers in the sample would increase the sample variance s2 so that the statistic (Xi − ¯X)/s takes on very low values and hence, the test will fail to reject some otherwise significant observations (Hawkins 1980, p. 12). This effect is referred to as masking (Hawkins 1980, p. 12). The outlier rejection procedure outlined above is quite primitive. More sophisticated meth- ods have been proposed (see e.g. Hawkins 1980). Huber (2009), however, argues that the best outlier rejection procedures do not reach the performance of the best robust proce- dures, and that classical outlier rejection rules are unable to cope with multiple outliers. Also, the most commonly used outlier rejection rule (namely, the maximum studentized rule) will have trouble detecting one distant outlier out of 10 (Hampel 2001; Hampel 1985). Although they remain in common use, the legitimacy of outlier rejection methods is brought to question. One common method of applying the classical Gaussian linear model to contaminated data is that due to Hoaglin and Welsch (1978). They first fit an OLS to the complete sample to obtain a preliminary estimate, then they trim observations that seem to be outlying and look out for significant changes in the estimates. Any observation whose exclusion significantly impacts the fit is rejected in the final fit. Accordingly, some authors have sometimes called this method “resistant OLS”. We discuss it in full in the opening of Section 2 and illustrate it on Brownlee’s stackloss data. This method, however, has shortcomings. Even if the percent trimming is small, it can be very inefficient for Normally distributed data (Ruppert & Carroll 1980). In referring to the method of a preliminary estimate discussed above, Hampel (2001) wrote: With “good” rejection rules which are able to find a sufficiently high fraction of distant gross errors (which have a sufficiently high “breakdown point”, cf. below), this is a viable possibility; but it typically loses at least 10-20 percent efficiency compared with better robust methods for high-quality data (Hampel 1985). It is interesting to note that also subjective rejection has been investigated empirically by means of a small Monte Carlo study with 5 subjectively rejecting statisticians (Relles & Rogers 1977); the avoidable efficiency losses are again about 10-20 percent (Hampel 1985). This seems ok for fairly high, but not for highest standards. Although it remains a bit fuzzy, another important concept in the study of robust methods is that of efficiency. Efficiency in the context of point estimation is characterised by low variance (or Mean Squared Error for biased estimators), and in interval estimation, shorter confidence intervals can be considered as more efficient than broader ones (Hoaglin et al. 1983, p. 284). Huber (2009, p. 5) agues that a robust procedure should have reasonably good efficiency at the assumed model. It is also important to have a procedure that is efficient for alternative distributions. A study by Andrews et al. (1972) showed that the variance of the 10% or 20% trimmed mean is never much larger than that of the sample mean even in the case of the Normal distribution for which the mean is optimal and can be quite a lot smaller when the underlying distribution is more heavy-tailed than the Normal distribution (Rice 2007, p. 398). 7
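A small simulation in the same spirit, sketched in R (the sample size, trimming fraction, and the two error distributions are our own arbitrary choices, not those of Andrews et al. 1972):

set.seed(1)
n <- 50
R <- 10000   # number of simulated samples

# sampling variances of the mean and the 20%-trimmed mean under two parents
var_normal <- c(mean   = var(replicate(R, mean(rnorm(n)))),
                trim20 = var(replicate(R, mean(rnorm(n), trim = 0.2))))
var_t3     <- c(mean   = var(replicate(R, mean(rt(n, df = 3)))),
                trim20 = var(replicate(R, mean(rt(n, df = 3), trim = 0.2))))

var_normal   # the trimmed mean is only slightly worse under Normal data
var_t3       # and clearly better under the heavy-tailed t3 distribution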
  • 18. Efficiency in testing hypotheses designates “good” power while the significance level remains fixed (Hoaglin et al. 1983, p. 284). For instance, Miller (1980, p. 9) argues that although the t-test is somewhat “robust for validity”, it is not “robust for efficiency”. This implies that as much as the t-test will maintain the nominally stated significance level for small arbitrary departures from model assumptions, there might well exist some specially designed tests more powerful than the t-test when the underlying distribution is not Normal. 1.5 Conclusion To sum up, the theory of robust estimation is not an idle one. The use of a robust technique safeguards the investigator against being led astray in the case of unsatisfied assumptions. Unfortunately, robustness usually comes at a cost, namely, that of compromised efficiency. For example, the median will not be as badly affected by gross errors as the mean would, although if the distribution turned out to be Normal and outlier-free, the use of the median would lose efficiency against the mean at least asymptotically (see e.g. Birkes & Dodge 1993, p. 192). M-estimators introduced by Huber (1964), although not as easy to compute as least squares estimates, say, counter this unfavourable trade-off between robustness to outliers and efficiency at the assumed model, usually Normal (Rosenberger & Gasko 1983, p. 298). They are discussed last in the next section which opens with a review of Hoaglin and Welch’s (1978) method of fitting an OLS in the presence of outliers. Section 3 discusses our research methodology, Section 4 presents the results from a simulation study in which four regression models are compared in a variety of scenarios. The penultimate section discusses the results from the simulation study. 8
  • 19. 2 Literature Study 2.1 Introduction The ordinary least squares (OLS) regression model has always appealed, inter alia, for the ease with which its parameters can be estimated, the ease with which standard errors of such estimators can be estimated, and certain optimality properties that least squares estimators possess when distributional assumptions are not grossly violated (e.g. Faraway 2002, p. 19). For any error distribution, it has been shown that least squares estimators are best linear unbiased estimators under the assumption of zero mean and constant variance of the error distribution (Faraway 2002, p. 20; Wang & Chow 1994, p. 285). Literature on least squares estimation is rich and well understood. As in any modelling exercise, in fitting an OLS regression model, validity of assumptions should be assessed before any inferences are drawn. Diagnostic procedures are available to identify any discrepancies. To assess the quality of fit, checks will be done on the residuals with the hope of spotting anything untoward about their structure (see e.g. Goodall 1983); and transformations can be employed to remedy some flawed assumptions. For example, a single transformation might be found that repairs skewness and non-constant variance in a data set (Kerns 2002, p. 258). Sometimes such a transformation will not be found, but least squares estimates are somewhat robust to non-constant error variance and distributional discrepancies, especially if the data are not saliently skewed (Miller 1997, pp. 6-7, 199, 208). Unfortunately, the same cannot be said about outliers; a single outlying response can have detrimental effects on least squares estimates, especially the slope (Miller 1997, p. 199). Worse yet, usually no transformation will be found that repairs outliers, and the commonly used Box-Cox class of transformations has been shown to be sensitive to outliers (Miller 1997, pp. 18, 201; Andrews 1971). One way around this is trimming influential observations (see e.g. Hoaglin & Welsch 1978) — after an initial ordinary least squares analysis is carried out, influential observations will be identified using such criteria as Cook’s distance (see e.g. Chatterjee & Hadi 1988). Then any identified influential observations can be rejected and inference based only on the remaining scores. This approach, however, has been shown to have drawbacks. It has been proved to be very ineffecient if the error distribution is Gaussian (or close to Gaussian), or unduly contaminated (Miller 1997; Ruppert & Carroll 1980). What is more, the process of model fitting can be an involved exercise — it might take more than a handful of steps before a satisfactory model is found. The labour of carrying out diagnostic checks, building and rebuilding models, has motivated the development of robust regression models. The area of robust regression is a new arrival — it was first introduced in the 1970’s as a generalisation of robust estimation of a location parameter (Li 1985, p. 281). As already pointed out in the introduction, we will call a procedure robust if inferences drawn from it do not change substantially when the underlying assumptions are compromised. Specifially, we will consider distributional robustness and robustness to outliers and influential observations. In fact, the bulk of robust inference, at least in frequentist analysis, has been with regard to outliers (Gelman et al. 2004). In what follows, we will use the words “robust” and “resistant” interchangebly. 
But we will distinguish between an outlier and an influential observation. 9
  • 20. The latter refers to an observation whose inclusion or exclusion has marked influence on the fitted regression model (Kerns 2002, p. 259). An outlier will not always be influential and vice versa (Kerns 2002, p. 259). In the ensuing subsections, we demonstrate the non- robustness of ordinary least squares regression, in particular, to influential and discordant or outlying observations. 2.1.1 Least squares estimation We present here an ordinary least squares (multiple) regression model to recall some of its properties. In particular, we attempt to reveal its sensitivity to outlying response variables and high-leverage predictor variables. Also, we adopt a slightly different approach to the parameter estimation process in an attempt to bridge the gap between OLS regression and its robust counterparts. The multiple regression model can be written compactly in matrix form as follows, y = Xβ + e, (2.9) where the matrix X : n × p is of full rank4 and is defined as, X = 1 x11 x12 . . . x1,p−1 1 x21 x22 . . . x2,p−1 ... ... ... ... ... 1 xn1 xn2 . . . xn,p−1 . (2.10) Throughout, we will assume that the design matrix X is non-stochastic5 . The vector of random errors e is given by e = (ε1, ε2, . . . , εn) , and the vector of parameters β by β = (β0, β1, . . . , βp−1) . We will refer to the entries of the design matrix X as carriers and the p-dimensional space Xp in which the row vectors of X lie the carrier space. Outliers in the carrier space are said to give rise to high leverage— a notion we are yet to formalise. The objective in least squares is to minimise the residual sum of squares Q(β) = ||y − Xβ||2 = n i=1 ρ(εi), (2.11) with respect to β, where || · || denotes the Euclidean norm and ρ(εi) = ε2 i . Stated mathe- matically, the OLS solution will have to satisfy, ˆβ = arg min β∈Rp ||y − Xβ||2 . 4That the matrix X be of full rank guarantees that XT X 0, or equivalently, that (XT X)−1 exists. 5The design matrix will be stochastic if, like the dependent variables, the independent variables are measured with error. 10
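For later comparison with the robust alternatives, the least squares solution is easy to compute directly. A minimal sketch in R on simulated data (the data and dimensions are arbitrary), minimising Equation 2.11 by solving the normal equations:

set.seed(42)
n <- 30
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))   # design matrix with an intercept column
beta <- c(2, 1, -0.5)
y <- drop(X %*% beta + rnorm(n))

beta_hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
drop(beta_hat)

coef(lm(y ~ X[, -1]))   # the built-in least squares fit agrees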
  • 21. The formulation of the minimisation problem in Equation 2.11 will be important when we discuss robust alternatives to ordinary least squares estimation where the function ρ(·), called an objective function, will be defined differently. We call the derivative of the objective function, ψ(εi) = ρ (εi) an influence function. On differentiating Equation 2.11 with respect to β we obtain, ∂Q(β)/∂β = n i=1 ψ(εi)xT i = n i=1 (yi − xiβ)xT i = 0, (2.12) where xi is the ith row vector of the design matrix X. Equation 2.12 above is a disguised form of the p simultaneous equations which yield the solution, ˆβ = (XT X)−1 XT y. It can be shown that the least sqaures estimate of β coincides with the corresponding MLE (or maximum likelihood estimate) if the error distribution is assumed to be Gaussian with mean 0 and variance σ2 (Gentle 2013, p. 484). To this end, we write down the likelihood of β given the error variance, L(β|σ2 , y, X) = (2πσ2 )−n/2 exp{−(y − Xβ)T (y − Xβ)/2σ2 }. (2.13) From the monotocity of the log function, maximising Equation 2.13 is equivalent to max- imising the log-likelihood, which is given by, lL = − n 2 log(2πσ2 ) − 1 2σ2 (y − Xβ)T (y − Xβ). (2.14) It then becomes immediately apparent that maximising the expression in Equation 2.14 with respect to β is equivalent to minimising the expression (Gentle 2013, p. 484), Q(β) = (y − Xβ)T (y − Xβ), (2.15) which is the same expression we minimised in least squares estimation (see Equation 2.11). In simple linear regression (i.e. the case of p = 2), the detection of outliers can simply be done by visual inspection since the space in which an outlier can be located is at most 2-dimensional. For p > 2 however, this technique does not work. This is due in part to the sparseness of data in p-dimensional space, so that if p > q, then an outlier in Rp is not necessarily as such in Rq . This has motivated the development of more sophisticated methods of outlier detection. The discrepance of a data point can be a result of an outlying explanatory variable, an outlying dependent variable, or both. Therefore a satisfactory influence diagnosis should examine both the carriers and the yi. The matrix, H = X(XT X)−1 XT , (2.16) 11
is known in the literature as the hat matrix^6 because "it puts a hat on the matrix y", as in the following equation (Kerns 2010, p. 272),

  \hat{y} = Hy.   (2.17)

^6 This term was coined by John W. Tukey, who originated the idea of using the hat matrix as a diagnostic tool in regression problems.

The equation above expresses each fitted value \hat{y}_i as a linear combination of the observed y values. If we denote the ijth entry of the hat matrix by h_{ij}, then Equation 2.17 can be viewed as a compact formulation of the following set of equations (see Hoaglin & Welsch 1978),

  \hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j, \quad \text{for } i = 1, 2, \ldots, n.   (2.18)

From Equation 2.18 it appears that h_{ii}, and h_{ii} alone, represents the amount of leverage the observed value y_i applies on the fitted value \hat{y}_i. Since the hat matrix depends only on the explanatory variables and not on the dependent variable, this amount of leverage is independent of the observed value y_i. We also see that h_{ij} quantifies the amount of leverage that y_j (for j ≠ i) exerts on \hat{y}_i. In fact, one can readily obtain,

  \partial \hat{y}_i / \partial y_i = h_{ii} \quad \text{and} \quad \partial \hat{y}_i / \partial y_j = h_{ij},

from Equation 2.18 (see Ravishanker & Dey 2002). It is left to the reader to show that the matrix H is both idempotent (i.e. H = H^2) and symmetric. As a result, we can express the diagonal entries of the hat matrix as,

  h_{ii} = \sum_{j=1}^{n} h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2,   (2.19)

(see Hoaglin & Welsch 1978), from which we readily see that 0 ≤ h_{ii} ≤ 1, so that a value of h_{ii} close to 1 should be flagged as high; one close to zero, on the other hand, should not raise concern for the analyst. We also conclude from Equation 2.19 that whenever h_{ii} = 0 or h_{ii} = 1, then h_{ij} = 0 for all j ≠ i (see Hoaglin & Welsch 1978). It remains to explain how large h_{ii} needs to be in order to be called "large". There has not been much consensus around this. Hoaglin and Welsch (1978) based their judgment on the average size of h_{ii} over the data points in the regression, which can be shown to be p/n. From their experience they then suggested that any value of h_{ii} in excess of 2p/n should be indicative of high leverage (Li 1985). We adopt the same criterion because it turns out to be neither too liberal nor too conservative; Huber (1981) suggested a rather liberal criterion, namely, that points with h_{ii} > 0.2 should be regarded as high-leverage points (Ravishanker & Dey 2002). In trying to investigate the actual influence of, say, the ith case (x_i, y_i) on the parameter estimates, it is well to consider the fit without that particular case and see by how much the estimated parameters
  • 23. change. In deference to common practice, vectors and matrices with the ith case deleted will be subscripted with a paranthesised “i” so that, for example, the vector of parameter estimators with the ith observation deleted will be denoted by ˆβ(i). Below we give a heuristic derivation of the expression for the difference ˆβ − ˆβ(i). As a prelude to such a derivation, we state without proof, a useful matrix identity known as the Sherman-Morrison formula. For a non-singular matrix A and vectors u and v, we have (Miller 1974), (A + uvT )−1 = A−1 − (A−1 u)(vT A−1 ) 1 + vT A−1 u . After a few substitutions we obtain, (XT X − xT i xi)−1 = (XT X)−1 + (XT X)−1 xT i xi(XT X)−1 1 − xi(XT X)−1 xT i . Then by noting that, ˆβ(i) = (XT X − xT i xi)−1 (XT y − xT i yi), we get after some algebra, ˆβ − ˆβ(i) = (XT X)−1 xT i (yi − xi ˆβ) 1 − xi(XT X)−1 xT i = (XT X)−1 xT i 1 − hii ri. (2.20) From Equation 2.20 above we see that an observation (xi, yi) will be influential if its leverage hii is large, if its residual ri is large, or if both the leverage and the residual are large. We have already made clear when a leverage value can be judged to be large — at least accord- ing to Hoaglin and Welsch (1978). Now we need to discuss the issue of designating large ri. Hoaglin and Welsch (1978) consider the so-called jackknifed or externally studentised residuals, {r∗ i |i = 1, . . . , n}, instead of the ordinary residuals {ri|i = 1, . . . , n}. The ith externally Studentised residual (Ravishanker & Dey 2002) is defined as r∗ i = ri s(i) √ 1 − hii i = 1, 2, . . . , n, where s2 (i) = ||y(i) − ˆy(i)||2 n − p − 1 , i = 1, 2, ..., n, can be shown to be an unbiased estimate of σ2 based on a sample of n − 1 observations if the ordinary residuals ri are uncorrelated. An observation with an externally studentised residual that is significant at a level of 10% suggests that the observation be reckoned with (Hoaglin & Welsch 1978). So the investigator 13
  • 24. will test for leverages in excess of 2p/n and any significantly large externally studentised residuals. Hoaglin and Welsch (1978) suggest that neither criterion be used in isolation7 . Then Equation 2.20 will be applied to tell whether the observation did indeed turn out to be influential. Any observation with undue influence will be excluded from the final model. The example below illustrates this method. An illustrative Example Perhaps the most frequently used data set in the literature of robust regression so far has been Brownlee’s stack loss data set. The data represents 21 days of operation of a plant that oxidises ammonia to nitric acid. The explanatory variables x1, x2, and x3 and the response variable y (or the “stack loss”) are as follows (Li 1985): x1 = airflow (which reflects the rate of operation of the plant), x2 = temperature of the cooling water in the coils of the absorbing tower for the nitric oxides, x3 = concentration of nitric acid in the absorbing liquid [coded by x3 = 10× (concentration in percent −50)], and y = the percent of the ingoing ammonia that is lost by escaping in the unabsorbed nitric oxides (×10) From the third column of Table 4 we see that only the 17th case has a leverage value greater that 2p/n = .381. But before we discard it, we investigate the actual influence it has on the estimated coefficients. That is, we calculate the difference, ˆβ − ˆβ(17) = (−5.6075, 0.0027, −0.0238, 0.0675)T . The corresponding change in standard error units, (0.4714, 0.0202, 0.0647, 0.4316)T , is not at all appreciable. We conclude therefore that this observation is not influential and hence should be retained. On the basis of studentised residuals, the 4th and 21st observations are significant at a 10% significance level. The respective parameter changes in standard error units are (0.1117, 0.3805, 0.5676, 0.0249)T and (0.3181, 1.2859, 1.3007, 0.2878)T . The fourth observation does not appear to be so dis- cordant as to warrant its elimination. However, the changes in ˆβ1 and ˆβ2 that result from omitting the 21st case should warn us against including it. Overall, we conclude that, the only truly discrepant case is the 21st. Nevertheless, it turns out that observations 1, 3, and 4 were discarded as transient states (see Li 1985, p. 317). The fitted model without observations 1, 3, 4, and 21 is, ˆy = −37.652 + 0.798X1 + 0.577X2 − 0.067X3, where X1 denotes air flow, X2 water temperature, and X3 acid concentration. In Figure 1 observations 1, 4, and 21 are flagged as influential. The latter observation falls well within 7This should make intuitive sense since an observation can be well-behaved in the carrier space while its response variable assumes an unequivocally discrepant value and vice versa. 14
the Cook's distance contour. Remember that this was the only observation we concluded to be truly influential.

    Observation   Air Flow,   Cooling Water Inlet   Acid Concentration,   Stack Loss,
    Number        x1          Temperature, x2       x3                    y
     1            80          27                    89                    42
     2            80          27                    88                    37
     3            75          25                    90                    37
     4            62          24                    87                    28
     5            62          22                    87                    18
     6            62          23                    87                    18
     7            62          24                    93                    19
     8            62          24                    93                    20
     9            58          23                    87                    15
    10            58          18                    80                    14
    11            58          18                    89                    14
    12            58          17                    88                    13
    13            58          18                    82                    11
    14            58          19                    93                    12
    15            50          18                    89                     8
    16            50          18                    86                     7
    17            50          19                    72                     8
    18            50          19                    79                     8
    19            50          20                    80                     9
    20            56          20                    82                    15
    21            70          20                    91                    15

Table 3: Operational data of a plant for the oxidation of ammonia to nitric acid.

Figure 2 shows influence diagnostics of the model without observations 1, 3, 4, and 21. Counter to what one might have anticipated, the plot indicates that even after the removal of the initially influential observations, some other previously non-discrepant observations turn out to be outlying. Note the high leverage of observation 2 (h22 > 2 × 4/20 = 0.4). We hasten to point out that influence is by no means the only nuisance to be diagnosed in a regression problem; all assumptions on which the regression model is predicated should be checked. Since our main focus in this paper is on robustness to outliers, checks other than those that diagnose influence are better treated elsewhere.

It is not impossible that one of the two measures of influence we have considered here (i.e. h_{ii} and r*_i) suggests that an observation be deleted, while the other suggests otherwise. In such a case, it would be helpful to have an overall measure of influence that simultaneously takes both measures into account. Such a procedure has already been developed and used in regression-type problems. Cook's distance is one of the most commonly met overall criteria
(see Cook 1977). It is defined as (Ravishanker & Dey 2002, p. 330; Miller 1997, p. 201),

  C_i = \frac{(\hat{\beta} - \hat{\beta}_{(-i)})^T (X^T X) (\hat{\beta} - \hat{\beta}_{(-i)})}{p \hat{\sigma}^2}   (2.21)
      = \frac{h_{ii}}{p(1 - h_{ii})}\, r_i^2,   (2.22)

where \hat{\sigma}^2 = y^T(I - H)y/(n - p). From the above equation it is clear that the size of C_i is affected by the magnitudes of both the residual r_i and the ratio h_{ii}/(1 - h_{ii}) (Miller 1997, p. 201).

Another powerful (yet very simple to implement) method of identifying influential observations is the jackknife (see e.g. Crawley 2013, pp. 481-483). To use the jackknife, each observation i is left out in turn, and a statistic of interest, say θ_{(−i)}, is calculated. The collection of all n pseudo-values {θ_{(−1)}, \ldots, θ_{(−n)}} can then be plotted on a histogram and any influential subsets of the sample will be immediately visible. Figure 3 displays four such histograms that result from applying the jackknife to the stackloss data. For example, the top-right panel of Figure 3 shows that there is one observation without which the estimate \hat{\beta}_1 goes beyond 0.5. This can easily be identified, although not from the histogram, as the 21st observation. The bottom-left panel also shows the presence of two potentially "harmful" observations. These are observations 4 and 21.

2.2 Robust Regression

2.2.1 Introduction to Robust Regression

Just as ordinary least squares regression is a generalisation of least squares estimation, so robust regression is a generalisation of robust estimation. Several robust estimation procedures have been proposed in the literature. L-, R-, and M-estimation are far and away the most important classes of robust estimation methods, and most common estimators fall in at least one of these categories (see e.g. Miller 1997, p. 28). Only the latter category of robust estimators is discussed in this note. For a discussion on the other two classes, the reader is referred to Bickel (1973).

We saw earlier that the method of least squares regression is supplemented by several diagnostic tools that identify outliers. After influential observations have been identified, an OLS regression would be fitted only to a "clean" subset of the original data set. For ease of reference at a later stage, let us call the OLS fit thus obtained a "resistant" OLS fit. In contrast to resistant OLS, the idea behind the use of robust regression procedures, besides perhaps the method of least absolute deviations which is more vulnerable to high-leverage points than OLS (Bellio & Ventura 2005), is to accommodate the possibility of any aberrations in the data. This is effected by giving disproportionately less weight to discrepant observations (Chen & Box 1990). Procedures that automatically adapt themselves to the underlying distribution are called adaptive (Huber 2009).
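The influence measures used above are readily computed in R; the stackloss data of Table 3 in fact ship with R as the data frame stackloss, so the quantities summarised in Table 4 below can be reproduced, up to rounding, with a few lines. A sketch:

fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

h  <- hatvalues(fit)        # leverages h_ii
rs <- rstudent(fit)         # externally studentised residuals r*_i
cd <- cooks.distance(fit)   # Cook's distances

p <- length(coef(fit)); n <- nrow(stackloss)
which(h > 2 * p / n)                        # high-leverage cases (only observation 17)
which(abs(rs) > qt(0.95, df = n - p - 1))   # residuals significant at the 10% level (4 and 21)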
    Obs.   r_i      h_ii   sqrt(1 - h_ii)   s_(i)   r*_i
     1      3.235   .302   .836             3.200    1.209
     2     -1.917   .318   .826             3.292    -.705
     3      4.556   .175   .909             3.099    1.618
     4      5.698   .129   .934             2.975    2.052
     5     -1.712   .052   .974             3.314    -.531
     6     -3.007   .077   .960             3.250    -.963
     7     -2.389   .219   .884             3.274    -.826
     8     -1.389   .219   .884             3.320    -.474
     9     -3.144   .140   .927             3.234   -1.049
    10      1.267   .200   .894             3.324     .426
    11      2.636   .155   .919             3.265     .878
    12      2.779   .217   .885             3.250     .967
    13     -1.429   .158   .918             3.320    -.469
    14      -.050   .206   .891             3.343    -.017
    15      2.361   .190   .900             3.278     .801
    16       .905   .131   .932             3.334     .291
    17     -1.520   .412   .767             3.306    -.600
    18      -.455   .161   .916             3.341    -.149
    19      -.598   .175   .909             3.339    -.197
    20      1.412   .080   .959             3.323     .443
    21     -7.238   .285   .846             2.569   -3.330

Table 4: Summary table for influence diagnostics from the regression on the stackloss data.

It is the purpose of the remainder of this section to review some common alternatives to least squares estimation.

2.2.2 The Independent Student-t Regression Model

Since one of the ways in which an outlier can occur in a data set is if the underlying distribution is heavy-tailed, it would seem reasonable to, for starters, assume that the random errors come from a fat-tailed distribution like the double exponential distribution, or the Cauchy distribution, in lieu of the less kurtic Normal distribution. Then we could form a likelihood function and estimate the values of the parameters. In the next two paragraphs we assume that the disturbances arise from the Student-t distribution, of which the Cauchy distribution is a special case. The degrees-of-freedom parameter, which we will denote by ν, will reflect the excess "mass" under the tails of the distribution from which the random errors arise (Gelman et al. 2004).

Usually in practice, it will not be an easy task to estimate the degrees of freedom parameter; estimates from a heavy-tailed likelihood are generally intractable and computationally demanding (Fonseca et al. 2008; Gelman et al. 2004). So, in assuming a kurtic error distribution, we should be ready to lose mathematical convenience in exchange for more stable parameter estimates.
[Figure 1: Regression diagnostic plot (standardised residuals against leverage, with Cook's distance contours) for the OLS model with all observations; observations 1, 4 and 21 are labelled.]

The density of the Student-t distribution with location and scale parameters, ξ and σ^2, and ν degrees of freedom is given by

  p(y | ξ, σ^2, ν) = \frac{\Gamma\{(\nu + 1)/2\}}{\Gamma(\nu/2)\sqrt{\pi\nu\sigma^2}} \{1 + (y - \xi)^2/\nu\sigma^2\}^{-(\nu+1)/2}, \quad -\infty < y < \infty.   (2.23)

If a random variable y has the density above, it is customary to write y ∼ t_ν(ξ, σ^2). Consider the linear model

  y_i = β_0 + \sum_{j=1}^{p} X_{ij} β_j + ε_i, \quad i = 1, \ldots, n,   (2.24)

where the density of the (independent) random errors is given in Equation 2.23 with location parameter ξ = 0, scale parameter σ, and degrees of freedom parameter ν. We assume the following noninformative priors for the parameters σ and β (Geweke 1993),

  p_1(β) ∝ constant, and p_2(σ) ∝ σ^{-1}.
[Figure 2: Regression diagnostic plot (standardised residuals against leverage, with Cook's distance contours) for the OLS model with all but the first, third, fourth, and twenty first observations; observations 2, 13 and 14 are labelled.]

Additionally, we assume that σ and β are independent a priori: that is p_1(β, σ) ∝ σ^{-1}. The model in Equation 2.24 is referred to by Geweke (1993) as the independent Student-t linear model. Geweke (1993) shows that the independent Student-t linear model is equivalent to an appropriate scale mixture of Normals. Without attempting to prove the equivalence, this Gaussian mixture is given next^8. For i = 1, \ldots, n, assume

  y_i = β_0 + \sum_{j=1}^{p} X_{ij} β_j + ε_i,   (2.25)

where the random errors are independent and ε_i ∼ N(0, σ^2 ω_i). Alternatively, in vector notation, let y = Xβ + ε, where the covariance matrix of the random errors is var(ε) = σ^2 Ω and Ω = diag(ω_1, \ldots, ω_n). That is, for i = 1, \ldots, n,

  p(y_i | β, σ, ω) = \frac{1}{\sqrt{2\pi\sigma^2\omega_i}} \exp\Big\{ -\frac{1}{2\omega_i} \Big( \frac{y_i - x_i\beta}{\sigma} \Big)^2 \Big\}, \quad -\infty < y_i < \infty.   (2.26)

^8 Derivations herein are modified versions of Geweke's (1993): slightly different mathematical tools are used.
[Figure 3: Jackknifed parameter estimates for the stackloss data — histograms of the jackknifed estimates of β_0, β_1, β_2 and β_3.]

Since the random errors ε_i are independent, we get the following likelihood function,

  L(β, σ, ω | y, X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2\omega_i}} \exp\Big\{ -\frac{1}{2\omega_i} \Big( \frac{y_i - x_i\beta}{\sigma} \Big)^2 \Big\},

or, written up to a constant of proportionality,

  L(β, σ, ω | y, X) ∝ σ^{-n} \prod_{i=1}^{n} \omega_i^{-1/2} \exp\Big\{ -\frac{1}{2\omega_i} \Big( \frac{y_i - x_i\beta}{\sigma} \Big)^2 \Big\}.

Assume that a priori the parameters β, σ and ω = (ω_1, \ldots, ω_n)^T are mutually independent and that, as before, p_1(β)p_2(σ) ∝ σ^{-1}. As originally suggested by Lindley (1971), assume that a priori the ω_i are independent and that ν/ω_i ∼ χ^2_ν. To obtain explicitly the density of the ω_i, the reader will recall that the density of an inverted χ^2 variate y with ν degrees of freedom is given by,

  p_Y(y) = \frac{1}{\Gamma(\nu/2)\, 2^{\nu/2}} \Big( \frac{1}{y} \Big)^{\nu/2+1} e^{-1/(2y)}, \quad y > 0.   (2.27)
  • 31. From the transformation y = g−1 (ωi) = ωi/ν we obtain, p(ωi) =pY (g−1 (ωi)) d dωi g−1 (ωi) = 1 Γ(ν/2)2ν/2 1 ν ν ωi ν/2+1 e−ν/2ωi Then from the independence of the ωi the prior of the random vector ω is given by, p3(ω) = n i=1 1 Γ(ν/2)2ν/2 1 ν ν ωi ν/2+1 e−ν/2ωi =(ν/2)nν/2 {Γ(ν/2)}−n n i=1 ω (ν+2)/2 i e−ν/2ωi ∝ n i=1 ω (ν+2)/2 i e−ν/2ωi On multiplying the likelihood and the priors, the joint posterior density of β, σ, and ω is given by, p(β, σ, ω|y, X) = p1(β)p2(σ)p3(ω) × L(β, σ, ω|y, X) ∝ σ−(n+1) n i=1 ω −(ν+3)/2 i exp − 1 2ωi ν + yi − xiβ σ 2 . (2.28) We will sample from the posterior distribution above using the Gibbs Sampler as outlined in Geweke (1993). To this end, we will need to have an expression for the conditional posterior distribution of each of the parameters. It will soon appear that these are quite tractable. But the assumption of unknown degrees of freedom will occasion a slight complication. The reader is urged to note that we obtained the posterior in Equation 2.28 by assuming, at least tacitly, that the degrees of freedom parameter ν was known. Fonseca et al. (2008) state that the robustness of the analysis using the Student-t distribution is directly related to the number of degrees of freedom ν. They further emphasise the difficulty of approximating the parameter ν. We will bring the assumption of unknown degrees of freedom back in at a later step. When we finally do so, the expressions of the conditional distributions of the other parameters will remain unchanged. We proceed to find the posterior conditionals. If we rewrite Equation 2.28 as, p(β, σ, ω|y, X) ∝ n i=1 ω −(ν+3)/2 i e−ν/2ωi × exp − n i=1 1 2ωi yi − xiβ σ 2 , (2.29) 21
  • 32. and note that, exp − n i=1 1 2ωi yi − xiβ σ 2 = exp − 1 2σ2 (y − Xβ)T Ω−1 (y − Xβ) ∝ exp − 1 2σ2 (β − Λ)T (XT Ω−1 X)−1 (β − Λ) . Hence the posterior distribution of β conditional on the rest of the parameters σ, and ω is (Geweke 1993), β|σ, ω ∼ Np{Λ, σ2 (XT Ω−1 X)−1 } where the mean vector is given by Λ = (XT Ω−1 X)−1 XT Ω−1 y. In order to put the conditional posterior distribution of the variance σ2 in a form that corresponds to a standard distribution, let ui = yi − xiβ and note that p(σ|β, ω, y, X) ∝ σ−(n+1) exp − 1 2σ2 n i=1 u2 i /ωi . (2.30) By making use of a transformation, the variance-parametrisation of the density in equa- tion 2.30 easily works out to, p(σ2 |β, ω, y, X) ∝ 1 σ2 (n+4)/2 exp − 1 2σ2 n i=1 u2 i /ωi . Now let us make the following astute transformation, φ = 1 σ2 n i=1 u2 i /ωi, the inverse of which is, g−1 (φ) = φ−1 n i=1 u2 i /ωi. Then the absolute value of the Jacobian of the inverse transformation will take the following form, d dφ g−1 (φ) = −φ−2 n i=1 u2 i /ωi ∝ φ−2 . Finally, putting all components together yields the conditional posterior density of φ, p(φ|β, ω, y, X) = p(g−1 (φ)|β, ω, y, X) d dφ g−1 (φ) ∝ φn/2 e−φ/2 . (2.31) 22
  • 33. Expression 2.31 above is a kernel of a χ2 n+2 random variable. We have thus proved that, 1 σ2 n i=1 (u2 i /ωi)|β, ω ∼ χ2 n+2. Next let us consider the conditional posterior distribution of ω. Conditional on β and σ (Geweke 1993), the ωi are independent with posterior density, p(ωi|β, σ, y, X) ∝ ω −(ν+3)/2 i exp − 1 2ωi ν + yi − xiβ σ 2 (2.32) = ω −(ν+3)/2 i e−(ν+u2 i /σ2 )/2ωi (2.33) Now let (see Geweke 1993), ψ = (σ−2 u−1 i + ν)/ωi, and consider the inverse transformation, g−1 (ψ) = (σ−2 u−1 i + ν)/ψ. It follows then that the conditional posterior density of ψ is, p(ψ|β, σ, y, X) = p(g−1 (φ)|β, σ, y, X) d dψ g−1 (ψ) ∝ {ψ/(σ−2 u−1 i + ν)}(ν+3)/2 ψ−2 e−ψ/2 ∝ ψν/2−1 e−ψ/2 . We recognise the last expression as a kernel of a χ2 ν+1. Hence we have shown that a posteriori, ψ|β, σ ∼ χ2 ν+1. Let us assume an exponential prior for ν. That is, p(ν) = λ e−λν . From Equation 2.28 it follows that, p(ν|β, σ, y, X) ∝ (ν/2)(nν/2) {Γ(ν/2)}−n n j=1 ω −(ν+3)/2 j exp − n i=1 νω−1 i /2 ∝ (ν/2)nν/2 {Γ(ν/2)}−n e−ην , where η = 1 2 n i=1 (log ωi + ω−1 i ) + λ. For computational details and proves of convergence of the Gibbs sampler, the reader is referred to Geweke (1993). 23
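A compact sketch of this Gibbs sampler in R follows (the data are simulated; the exponential-prior rate λ, the chain length, and the use of a discrete grid to draw ν from its conditional are implementation choices of ours, not prescriptions of Geweke 1993):

set.seed(1)
## simulated data from a Student-t linear model
n <- 100
X <- cbind(1, rnorm(n)); p <- ncol(X)
y <- drop(X %*% c(1, 2) + 0.5 * rt(n, df = 4))

lambda  <- 0.1                      # rate of the exponential prior on nu (our choice)
nu_grid <- seq(1, 60, by = 0.25)    # grid for the conditional draw of nu

n_iter <- 5000
draws  <- matrix(NA, n_iter, p + 2)
colnames(draws) <- c("beta0", "beta1", "sigma2", "nu")

beta   <- coef(lm(y ~ X - 1))       # starting values
sigma2 <- var(y - drop(X %*% beta))
nu     <- 5
omega  <- rep(1, n)

for (it in 1:n_iter) {
  ## beta | sigma2, omega ~ N(Lambda, sigma2 (X' Omega^-1 X)^-1)
  W      <- 1 / omega
  V      <- solve(crossprod(X * W, X))          # (X' Omega^-1 X)^-1
  Lambda <- V %*% crossprod(X * W, y)
  beta   <- drop(Lambda + t(chol(sigma2 * V)) %*% rnorm(p))
  u      <- drop(y - X %*% beta)

  ## sigma2 | beta, omega :  sum(u^2/omega)/sigma2 ~ chi^2_{n+2}
  sigma2 <- sum(u^2 / omega) / rchisq(1, df = n + 2)

  ## omega_i | beta, sigma2 :  (u_i^2/sigma2 + nu)/omega_i ~ chi^2_{nu+1}
  omega <- (u^2 / sigma2 + nu) / rchisq(n, df = nu + 1)

  ## nu | omega :  density proportional to (nu/2)^(n nu/2) Gamma(nu/2)^(-n) exp(-eta nu)
  eta  <- 0.5 * sum(log(omega) + 1 / omega) + lambda
  logp <- (n * nu_grid / 2) * log(nu_grid / 2) - n * lgamma(nu_grid / 2) - eta * nu_grid
  nu   <- sample(nu_grid, 1, prob = exp(logp - max(logp)))

  draws[it, ] <- c(beta, sigma2, nu)
}

colMeans(draws[-(1:1000), ])   # posterior means after a burn-in

The conditional for ν has no standard form, which is why it is drawn here from a discretised version of its density; other strategies, such as a Metropolis step, would serve equally well.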
  • 34. 2.2.3 Objective treatment of the Student-t linear model The last few paragraphs treated the independent Student-t linear model by assuming a proper prior for the degrees of freedom. The appropriateness of the selection of that par- ticular prior was not emphasised; it was merely chosen to obtain a proper posterior. It turns out that it has short-comings attached — Fonseca et al. (2008) show that it is too informative and has undue influence on the posterior inference. They further show that the analysis of the Student-t linear model based on the exponential prior, g(ν) = λ e−λν , of the degrees of freedom is strongly dependent on the value of λ which Geweke (1993) suggested should be chosen based on the prior information about the problem at hand. To counter this undesirable subjectivity, in lieu of assuming an exponential prior for the degrees of freedom, the next few paragraphs present an objective treatment of the Student- t linear model using the Jeffreys-rule and the independence Jeffreys priors as originally proposed by Fonseca et al. (2008). We assume the same model given in Equation 2.24 with similar distributional assumptions on the random errors. That is the random errors are independent and identically distributed according to the Student-t distribution with location parameter zero, scale parameter σ and ν degrees of freedom. Then for the parameter θ = (β, σ, ν) ∈ Rp × (0, ∞)2 , we form the following likelihood function, L(β, σ, ν|y, X) = n i=1 Γ{(ν + 1)/2} Γ(ν/2) √ πνσ2 {1 + (yi − xiβ)2 /νσ2 }−(ν+1)/2 = Γ ν+1 2 n Γ ν 2 (πνσ2)n/2 n i=1 {1 + (yi − xiβ)2 /νσ2 }−(ν+1)/2 (2.34) = Γ ν+1 2 n νnν/2 Γ ν 2 πn/2σn n i=1 ν + yi − xiβ σ 2 −(ν+1)/2 . For the vector of parameters θ = (θ1, θ2, θ3) = (β, σ, ν), the entry in the ith row and jth column of the Fisher information matrix is given by {F(θ)}ij = E − ∂2 ∂θi∂θj log L(θ|y, x) , where the expectation is to be taken with respect to the distribution of y. Derivations of the independence Jeffreys and the Jeffreys prior for θ = (β, σ, ν) are given in Fonseca et al. (2008). The authors show that both priors belong to a class of improper prior distributions given by π(θ) ∝ π(ν) σa , (2.35) where a ∈ R is a hyperparameter and π(ν) is the marginal prior of ν. They also prove that the independence Jeffreys prior and the Jeffreys prior for θ, which they denote by πI (β, σ, ν) and πJ (β, σ, ν), are of the form 2.35 with, respectively, 24
  • 35. a = 1, πI (ν) ∝ ν ν + 3 1/2 ψ ν 2 − ψ ν + 1 2 − 2(ν + 3) ν(ν + 1)2 1/2 and a = p + 1, πJ (ν) ∝ πI (ν) ν + 1 ν + 3 p/2 , where ψ(·) and ψ (·) are the digamma and trigamma functions, respectively. 2.2.4 Least Absolute Deviations If the Normality of the error distribution is suspect (e.g. if the error distribution is more kurtic than the Normal distribution), then a common robust alternative to ordinary least squares regression is the Least Absolute Deviations (LAD) regression (Narula & Wellington 1990, p. 130). The method of least absolute deviations is conceptually simple but, its com- putational aspect is not mathematically neat. Unlike in least squares estimation where the location parameter is estimated by the sample mean, in least absolute deviations estima- tion, the sample median plays such a role. This will become clear shortly. Using the sample median instead of the sample mean for location parameter estimation has the advantage of robustness to outliers — in any sample, only one or two of the central values are used to obtain the location parameter estimate (Rosenberger & Gasko 1983, p. 302). Furthermore, the value of the sample median remains unchanged if the magnitude of a datum is changed in such a way that it remains on the same side of the sample median (Narula & Wellington 1990, p. 130). Narula and Wellington (1985) show that this desirable property is inherited by LAD re- gression. Specifically, they show that as long as the response value of an observation lies on the same side of the fitted LAD line (in simple regression), then the model will remain unchanged. LAD regression is also known, among other names, as Minimum Sum of Ab- solute Errors (MSAE), Least Absolute Values (LAV), or L1-regression. The latter name implies that LAD regression is a special case of Lp-regression with p = 1, or that the LAD regression criterion is to minimise the L1-norm (or the sum of the absolute values) of the residuals, Q(β) = n i=1 |yi − xiβ| = n i=1 ρ(εi), (2.36) where ρ(·) = | · |. Differentiating the expression in Equation 2.36 and setting the derivative equal to zero yields an equivalent formulation, ∂Q(β)/∂β = n i=1 ψ(yi − xiβ) = n i=1 sgn(yi − xiβ) = 0, (2.37) where, sgn(x) =    +1, if x > 0 0, if x = 0 −1, if x < 0 25
A value of β that satisfies Equation 2.37 is called the least absolute value estimate. We will simply denote it \hat{\beta}. Note that since the absolute value function |x| is not differentiable at x = 0, Equation 2.37 is not strictly correct in setting the derivative of |x| at x = 0 to zero, but such an inconsistency has been accepted as reasonable in the literature as it puts the LAD ψ-function in agreement with other ψ-functions (Goodall 1983, p. 343). One common method for solving the minimisation problem presented in Equation 2.37 is to transform the problem into one of linear programming (Pynnönen 1994). Another is to use generalised iterative least squares estimation. This problem will always have a solution, but it will not always be unique.

In robustness studies, one can distinguish between two broad categories of distributions (Green 1976, cited in Hawkins 1980, p. 1), namely, those that have fat tails and those with thin tails. The former are referred to as outlier-prone and include the double exponential (Laplace) distribution; the latter are referred to as outlier-resistant. The Normal distribution, which has kurtosis γ_2 = 3, is an example of an outlier-resistant distribution. The double exponential distribution on the other hand has γ_2 = 6. It is interesting to see that if the disturbances are independent and identically distributed according to the double exponential distribution, then the LAD estimator is also the maximum likelihood estimator.

In order to formulate the linear programming model for the LAD estimation problem, define the following variables,

  d_i^+ = \begin{cases} \varepsilon_i & \text{if } \varepsilon_i > 0 \\ 0 & \text{if } \varepsilon_i \leq 0 \end{cases}
  \quad \text{and} \quad
  d_i^- = \begin{cases} 0 & \text{if } \varepsilon_i > 0 \\ -\varepsilon_i & \text{if } \varepsilon_i \leq 0. \end{cases}

Then the linear programming model to find the parameter estimates in the linear model minimises the function

  \sum_{i=1}^{n} d_i^+ + \sum_{i=1}^{n} d_i^-

subject to,

  β_0 + \sum_{j=1}^{p} X_{ij} β_j + d_i^+ - d_i^- = y_i, \quad i = 1, \ldots, n,
  d_i^+ \geq 0, \quad d_i^- \geq 0,

while the β_j are unrestricted in sign. An LAD regression will fit at least p observations exactly (i.e. at least p residuals will be zero), where p is the number of parameters including the intercept (Ravishanker & Dey 2002, p. 340). Hence in a simple regression problem one need only determine two observations through which the LAD line passes in order to fit the entire regression. In the theory of least absolute deviations, it is customary to refer to those observations that have been fitted exactly as defining observations (Narula & Wellington 2002). An observation that is not defining is simply called non-defining.
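This linear program can be handed to any LP solver, but in practice a convenient route in R is the quantreg package: median regression, rq() with tau = 0.5, minimises exactly the L1 criterion of Equation 2.36. A brief sketch on simulated data (the data are arbitrary and quantreg is assumed to be installed):

library(quantreg)

set.seed(7)
x <- runif(25, 0, 10)
y <- 2 + 3 * x + rcauchy(25)        # heavy-tailed errors

fit_lad <- rq(y ~ x, tau = 0.5)     # LAD (L1 / median) regression
fit_ols <- lm(y ~ x)
coef(fit_lad); coef(fit_ols)

# at least p = 2 residuals are exactly zero: the defining observations
which(abs(resid(fit_lad)) < 1e-8)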
The diagnosis of influence in LAD regression is usually carried out differently from the way it is performed in least squares (see Narula & Wellington 2002). Instead of deleting an observation to determine its influence on the fit, as is done in OLS regression, Narula and Wellington (2002) find an interval on each value of the predictor variable that leaves the fitted LAD regression unchanged. The interval for the value of the response variable that will leave the fit unaltered is given by [\hat{y}_i, ∞) or (−∞, \hat{y}_i] according as the residual corresponding to the ith fitted value \hat{y}_i is positive or negative (Narula & Wellington 2002).

Solving the LAD/MSAE regression problem via linear programming methods is just one of many methods present in the literature. Birkes and Dodge (1993) present and justify an alternative algorithm to carry out (simple) least absolute deviations regression. The next two paragraphs give an outline of the algorithm.

The reader will recall that an MSAE regression will fit at least p observations exactly. Hence, in the special case of simple regression (i.e. when p = 2) the regression line will pass through at least two observations. First choose an initial point, (x_1, y_1) say. Then, among all lines that pass through it, we seek the best, according to some criterion yet to be defined. This line will have to pass through another data point, (x_2, y_2) say. Now we seek, among all lines that pass through (x_2, y_2), the best line, which in turn passes through another point, say (x_3, y_3). The process is continued until two consecutive iterations yield the same line L. At this point the algorithm has reached convergence and line L is the LAD/MSAE regression line.

To find the best line that passes through a data point (x_0, y_0), we calculate the slope (y_i − y_0)/(x_i − x_0) of the line passing through (x_0, y_0) and (x_i, y_i). Points for which x_i = x_0 can be ignored. Then rearrange the observations such that

  (y_1 − y_0)/(x_1 − x_0) \leq (y_2 − y_0)/(x_2 − x_0) \leq \cdots \leq (y_n − y_0)/(x_n − x_0).

Find the index k for which the cumulative sum \sum_{i=1}^{k} |x_i − x_0| first exceeds T/2, where T = \sum_{i=1}^{n} |x_i − x_0|. Then the best line passing through (x_0, y_0) has slope and intercept estimates,

  \hat{\beta}_1 = \frac{y_k − y_0}{x_k − x_0} \quad \text{and} \quad \hat{\beta}_0 = y_0 − \hat{\beta}_1 x_0,

respectively.

Before we continue to give a multiple regression extension of the algorithm outlined above, for illustrative purposes, we apply it to the data presented in Table 5. Originally from Brownlee (1960), Table 5 reports a study on the stopping distance of an automobile as a function of velocity on a certain road (see also Rice 2007, p. 599). To start the algorithm, we choose a data point (x_0, y_0), say (20.5, 15.4). Note that this choice is totally arbitrary. Then form the slopes (y_i − 15.4)/(x_i − 20.5) and arrange them in increasing order. These are given in the second column of Table 6. Next we calculate T = \sum_i |x_i − 20.5| = 95.6 and look out for the observation for which the cumulative sum of |x_i − 20.5| first exceeds T/2 = 47.8. From Table 6 this is the fourth observation. We do the same with this observation as we did with (20.5, 15.4). That is, we compute the slopes (y_i − 73.1)/(x_i − 40.5), rearrange them in increasing order, calculate the sum T = \sum_i |x_i − 40.5| = 75.6 and look out for the observation for which the cumulative sum of
This points to the second observation. For each iteration of the algorithm one forms a table similar to Table 6. The next two iterations point to the sixth and the second observations, respectively. Since the second observation was reached one step before and one step after the sixth observation, the algorithm has converged, and the defining observations are the sixth and the second. The estimated parameters are ˆβ1 = (142.6 − 13.3)/(57.8 − 20.5) = 3.47 and ˆβ0 = 13.3 − 3.47 × 20.5 = −57.76. Figure 4 depicts the fits of both the OLS regression and the LAD regression. We hasten to point out that this example was included only for illustrative purposes; it should not be presumed that the LAD model we have just fitted is in any way superior to the OLS model.

[Figure 4: Plot of the OLS and LAD regressions of stopping distance on velocity.]
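As a cross-check on the hand calculations, the iteration can be coded directly. The base-R sketch below (the function name best_through is ours) implements the "best line through a point" step described above for the stopping-distance data of Table 5. It may visit the observations in a different order than the worked example, and it has no safeguard against cycling, but it should arrive at the same defining observations.

velocity <- c(20.5, 20.5, 30.5, 40.5, 48.8, 57.8)
distance <- c(15.4, 13.3, 33.9, 73.1, 113.0, 142.6)

# Index of the next defining observation for the best LAD line through point i0
best_through <- function(x, y, i0) {
  x0 <- x[i0]; y0 <- y[i0]
  cand  <- which(x != x0)                    # points with xi = x0 are ignored
  slope <- (y[cand] - y0) / (x[cand] - x0)
  ord   <- cand[order(slope)]                # arrange by increasing slope
  csum  <- cumsum(abs(x[ord] - x0))
  half  <- sum(abs(x[cand] - x0)) / 2
  ord[which(csum > half)[1]]                 # first cumulative sum exceeding T/2
}

i <- 1                                       # arbitrary starting observation
repeat {
  j <- best_through(velocity, distance, i)
  k <- best_through(velocity, distance, j)
  if (k == i) break                          # two consecutive iterations give the same line
  i <- j
}
b1 <- (distance[j] - distance[i]) / (velocity[j] - velocity[i])
b0 <- distance[i] - b1 * velocity[i]
c(b0, b1)                                    # observations i and j are the defining ones

# library(quantreg); coef(rq(distance ~ velocity, tau = 0.5))  # should essentially agree

The slope and intercept returned should essentially match the hand-computed values 3.47 and −57.76; because LAD solutions need not be unique, an off-the-shelf solver such as quantreg::rq may in principle report a slightly different but equally good fit.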
Birkes and Dodge (1993) present a modified version of Barrodale and Roberts' (1974) algorithm, based on the simplex method, for fitting a multiple least absolute deviations regression. For reasons of space we do not discuss it here, but we will apply it. Table 7 reports a small study in which the relationship between catheter length and two other variables, height and weight, was investigated (Rice 2007, p. 581); in the table, catheter length is represented by distance to pulmonary artery. See the reference cited for more details. Consider fitting the linear multiple regression model

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad i = 1, \dots, 12,

to these data using the least absolute deviations method. The resulting fit is

\hat{y}_i = 31.591 - .178 x_{1i} + .326 x_{2i}, \qquad i = 1, \dots, 12.

But how resistant is this fit? We consider this sensitivity question next.

Where there is measurement, there is certain to be some error one way or another. Suppose some observation in the catheterisation data was recorded with error. Let us assume that the actual value of x41 was 35.9, say, but that it was incorrectly recorded as 39.5. Would this bring our fit into question? Or, stated more generally, what range of values can x41 take (in the vicinity of 39.5) without affecting the parameter estimates? It can be shown that as long as x41 assumes any value in the closed interval [31.72, 39.77], the LAD fit remains unaffected. This is a desirable property that the method of least absolute deviations inherits from the median.

Intervals of values of the explanatory variables for the non-defining observations over which the LAD (or MSAE) fit to the catheterisation data obtained above does not change are depicted in Figure 5. The next few paragraphs give only a sketch of how they were calculated, in the hope that the interested reader will consult Narula and Wellington (2002). In treating this example we closely mimic the style of presentation of the originators of the procedure, Narula and Wellington (2002), and the reader interested in the full details of the method should see the reference just cited, since not all calculations are presented in this note.

[Figure 5: Plots of the intervals for the explanatory variables (x1, height; x2, weight) over which the LAD fit is resistant.]

We would like to find intervals about the predictor variables (and also the response variable) of the non-defining observations over which our fit is resistant. The case of the response variable is trivial: the LAD fit remains unchanged, it will be recalled, as long as the new value of the response variable is greater than or less than the fitted value according as the corresponding observation has a positive or negative residual. Table 8 presents these intervals.

Obs. No.   Lower bound   y      Upper bound
1          Defining observation
2          −∞            49.5   50.74
3          −∞            34.4   36.48
4          34.33         36.0   ∞
5          40.43         43.0   ∞
6          −∞            28.0   30.27
7          36.48         37.0   ∞
8          −∞            20.0   30.35
9          −∞            33.5   35.75
10         Defining observation
11         32.55         38.5   ∞
12         Defining observation

Table 8: Admissible intervals for the values of the response variable.
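The classification into defining and non-defining observations, and the trivial response-variable intervals of Table 8, can be reproduced directly from an LAD fit. The sketch below uses R with the quantreg package and the catheterisation data of Table 7; the zero-residual tolerance is our choice, and rq should essentially reproduce the fit quoted above (LAD solutions need not be unique, so small differences are possible).

library(quantreg)

height <- c(42.8, 63.5, 37.5, 39.5, 45.5, 38.5, 43.0, 22.5, 37.0, 23.5, 33.0, 58.0)
weight <- c(40.0, 93.5, 35.5, 30.0, 52.0, 17.0, 38.5,  8.5, 33.0,  9.5, 21.0, 79.0)
dpa    <- c(37.0, 49.5, 34.4, 36.0, 43.0, 28.0, 37.0, 20.0, 33.5, 30.5, 38.5, 47.0)

fit  <- rq(dpa ~ height + weight, tau = 0.5)   # LAD is quantile regression at tau = 0.5
r    <- resid(fit)
yhat <- dpa - r

# Observations with (numerically) zero residuals are the defining observations
defining <- abs(r) < 1e-7

# Response-variable intervals leaving the fit unchanged:
#   positive residual -> [yhat_i, Inf);  negative residual -> (-Inf, yhat_i]
lower <- ifelse(defining, NA, ifelse(r > 0, yhat, -Inf))
upper <- ifelse(defining, NA, ifelse(r > 0, Inf, yhat))
data.frame(obs = 1:12, defining, lower, upper)

The resulting data frame should reproduce Table 8 up to rounding.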
Obs. number   Height   Weight   Distance to pulmonary artery
i             (in.)    (lb)     (cm)
1             42.8     40.0     37.0
2             63.5     93.5     49.5
3             37.5     35.5     34.4
4             39.5     30.0     36.0
5             45.5     52.0     43.0
6             38.5     17.0     28.0
7             43.0     38.5     37.0
8             22.5      8.5     20.0
9             37.0     33.0     33.5
10            23.5      9.5     30.5
11            33.0     21.0     38.5
12            58.0     79.0     47.0

Table 7: Heart catheterisation data recorded on 12 patients.

The intervals for the values of the predictor variables that leave the parameter estimates unchanged, however, are not as easy to derive; we consider them next. First we reorder the observations according to the type of residual each has. Observations with zero residuals are grouped together first and given new indices in the order of their original appearance. Those with positive residuals come next, rearranged in the same way, and last come the observations with negative residuals, also reordered in this manner. Hence, for example, observation 1 retains its index because it has the smallest index of all defining observations, and observation 9 receives 12 as its new index because it has the greatest index of all observations with negative residuals. The complete enumeration of the grouped and reordered observations is reported in Table 9.

Obs. number i   xi1    xi2    Intercept   yi      ri        Remarks
1               42.8   40.0   1           37.0     .000     Defining observation
10              23.5    9.5   1           30.5     .000     Defining observation
12              58.0   79.0   1           47.0     .000     Defining observation
4               39.5   30.0   1           36.0    1.671     Positive residual
5               45.5   52.0   1           43.0    2.571     Positive residual
7               43.0   38.5   1           37.0     .524     Positive residual
11              33.0   21.0   1           38.5    5.945     Positive residual
2               63.5   93.5   1           49.5   −1.245     Negative residual
3               37.5   35.5   1           34.5   −1.978     Negative residual
6               38.5   17.0   1           28.0   −2.272     Negative residual
8               22.5    8.5   1           20.0   −10.352    Negative residual
9               37.0   33.0   1           33.5   −2.252     Negative residual

Table 9: Grouped and reordered observations from the catheterisation data.

According to Narula and Wellington (2002), one then forms the design matrix with its rows rearranged as in Table 9 and, after some involved calculations, obtains the intervals presented in Table 10.

Obs. No.   L. B.     x1     U. B.     L. B.     x2     U. B.
1
2          63.226    63.5   70.488    89.679    93.5   94.204
3          37.226    37.5   45.278    29.430    35.5   36.204
4          31.722    39.5   39.774    29.296    30.0   35.127
5          37.722    45.5   45.774    51.296    52.0   59.890
6          38.226    38.5   46.278    10.028    17.0   17.704
7          40.056    43.0   43.274    37.796    38.5   40.109
8          22.226    22.5   80.614    −7.170     8.5    9.204
9          36.726    37.0   44.778    26.088    33.0   33.704
10
11         25.222    33.0   33.274    20.296    21.0   36.670
12

Table 10: Admissible intervals of the explanatory variables for the non-defining observations.

2.2.5 Methods based on M-estimators

M-estimators provide a delicate compromise between robustness and efficiency. The class of M-estimators was introduced in Huber (1964) and generalised to regression problems in Huber (1973). Although other classes of robust estimators exist in the literature, M-estimators are by far the most flexible and they give the best performance (see e.g. Li 1985). Additionally, M-estimation generalises to regression models more readily than L- and R-estimation (Li 1985; Huber 1972). Here we review very briefly the theory of M-estimation and show how it embraces the methods of ordinary least squares and least absolute deviations estimation as special cases.
Consider a sample x1, . . . , xn from a Normal distribution with mean µ and variance σ². A maximum likelihood estimator of the location parameter µ minimises

\sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^{2}.   (2.38)

It is well known that this expression is minimised by the conventional sample mean, \bar{X} = n^{-1} \sum_{i=1}^{n} X_i. Suppose now that the sample x1, . . . , xn comes from the Laplace distribution with mean µ and variance 2σ², that is,

f(x_i \mid \mu) = (2\sigma)^{-1} \exp\left( -\frac{|x_i - \mu|}{\sigma} \right), \qquad i = 1, \dots, n.

In this case a maximum likelihood estimator of the location parameter µ minimises

\sum_{i=1}^{n} \left| \frac{X_i - \mu}{\sigma} \right|,   (2.39)

and it can be shown that the sample median is the corresponding estimator. Equations 2.38 and 2.39 have more in common than might be obvious at first sight. Both can be rewritten as

\sum_{i=1}^{n} \rho\left( \frac{X_i - \mu}{\sigma} \right),

where ρ(t) is a continuous (usually convex) real-valued function; evidently ρ(t) = t² for Equation 2.38 and ρ(t) = |t| for Equation 2.39.
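That the sample mean and the sample median minimise Equations 2.38 and 2.39 respectively is easy to check numerically. A toy base-R sketch (the data are simulated; σ is dropped since a fixed scale does not affect the minimiser):

set.seed(1)
x <- rnorm(50, mean = 3, sd = 2)

ols_crit <- function(mu) sum((x - mu)^2)    # criterion of Equation 2.38
lad_crit <- function(mu) sum(abs(x - mu))   # criterion of Equation 2.39

optimize(ols_crit, range(x))$minimum; mean(x)     # should coincide
optimize(lad_crit, range(x))$minimum; median(x)   # should coincide up to tolerance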
Huber (1964) defines an M-estimate (or maximum likelihood type estimate) Tn of a location parameter as any estimate that minimises an expression of the form

\sum_{i=1}^{n} \rho(x_i; T_n),

or, equivalently, that satisfies

\sum_{i=1}^{n} \psi(x_i; T_n) = 0,

where ρ(·) is an arbitrary function and ψ(x; θ) = (∂/∂θ) ρ(x; θ) (see also Huber 2009). From this definition we already see that the sample mean and median are examples of M-estimators, but more refined estimators also fall under this broad category. It has already been pointed out that the sample median is resistant to outliers but has the drawback of inefficiency at the Normal distribution, while the sample mean, though more efficient than the median at the Normal distribution, is grossly sensitive to outliers. The estimator corresponding to Huber's objective function,

\rho(t) = \begin{cases} \frac{1}{2} t^{2}, & |t| \le k \\ k|t| - \frac{1}{2} k^{2}, & \text{otherwise,} \end{cases}   (2.40)

was designed to inherit the resistance of the sample median and the efficiency of the sample mean. This can be seen by noting that for small values of t it behaves like the objective function of least squares estimation and otherwise like that of least absolute deviations estimation. The location estimator defined by Equation 2.40 is referred to by some authors simply as a Huber (e.g. Goodall 1983, pp. 369-371).
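A Huber location estimate with MAD scale can be computed by iteratively reweighted averaging. The sketch below is one way to do it in base R (the convergence tolerance, the simulated data, and the comparison with MASS::huber, which implements the same estimator, are our choices).

set.seed(2)
x <- c(rnorm(95), rnorm(5, mean = 10))   # 5% gross outliers

huber_loc <- function(x, k = 1.5, tol = 1e-8) {
  mu <- median(x)
  s  <- mad(x)                    # MAD, scaled to be consistent at the Normal
  repeat {
    u      <- (x - mu) / s
    w      <- pmin(1, k / abs(u)) # Huber weights psi(u)/u
    mu_new <- sum(w * x) / sum(w)
    if (abs(mu_new - mu) < tol) break
    mu <- mu_new
  }
  mu
}

mean(x); median(x); huber_loc(x)
# library(MASS); huber(x, k = 1.5)$mu   # should give essentially the same value

On data of this kind the Huber estimate sits much closer to the median than to the contaminated mean, illustrating the compromise described above.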
The Huber will resist outliers in the response variable (i.e. in the y-direction) but performs very poorly in the face of leverage points (Bellio & Ventura 2005). The tuning constant k > 0 is chosen to strike a balance between efficiency and resistance: one selects a small or a large value of the tuning constant according as the distribution has a large or a small proportion of outliers (Birkes & Dodge 1993, pp. 99-100).

To state another favourable property possessed by Huber's ρ function, we make the following definition. A distribution of the form

F = (1 - \varepsilon)\Phi + \varepsilon H,

where \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^{2}/2}\,dt is the standard Normal cumulative distribution function (CDF) and H is the CDF of a contaminating distribution, is called an ε-contaminated Normal distribution (Goodall 1983, p. 372; Rosenberger & Gasko 1983, p. 317; Miller 1997, p. 10). For arbitrary choices of the ε-contamination, Huber's estimators are the most efficient (Goodall 1983, p. 372). However, it appears that the density corresponding to the above contaminated Normal is not heavy-tailed enough to account for the outliers sometimes encountered (Goodall 1983, p. 374), and this poses a threat to the robustness of Huber's M-estimator against unduly discrepant values. A class of M-estimators called redescending M-estimators counters this weakness (Bellio & Ventura 2005). Two examples of redescending M-estimators (so called because their influence functions return to zero at large absolute values of their arguments) are Tukey's biweight and Andrews' estimator. Their respective objective functions are given by (see e.g. Goodall 1983, pp. 348-349)
\rho(u) = \begin{cases} \frac{1}{6}\left[1 - (1 - u^{2})^{3}\right], & |u| \le 1 \\ \frac{1}{6}, & \text{otherwise} \end{cases}

and

\rho(u) = \begin{cases} \frac{1}{\pi}(1 - \cos \pi u), & |u| \le 1 \\ \frac{2}{\pi}, & \text{otherwise.} \end{cases}

Figure 6 presents plots of the objective function and the influence function of a Huber estimator and of Tukey's biweight (bisquare) estimator; the tuning constant k for Huber's estimator is set to 1.5. From the graph in panel (a), Huber's objective function can be seen to be quadratic in the region between the red lines and linear elsewhere. From the top-right panel we see that Huber's ψ-function is monotone; what is more, beyond a certain point the influence curve becomes constant. The bottom-right panel shows that Tukey's bisquare M-estimator is of the redescending type.

[Figure 6: Objective functions of Huber's and Tukey's bisquare estimators of location and their corresponding influence functions: (a) Huber's ρ function, (b) Huber's ψ function, (c) bisquare ρ function, (d) bisquare ψ function.]
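The redescending behaviour is easy to verify by coding the ψ-functions directly; the base-R sketch below follows the scaling of the objective functions quoted above (the ψ-functions are simply their derivatives).

rho_bisq    <- function(u) ifelse(abs(u) <= 1, (1 - (1 - u^2)^3) / 6, 1 / 6)
psi_bisq    <- function(u) ifelse(abs(u) <= 1, u * (1 - u^2)^2, 0)      # derivative of rho_bisq
rho_andrews <- function(u) ifelse(abs(u) <= 1, (1 - cos(pi * u)) / pi, 2 / pi)
psi_andrews <- function(u) ifelse(abs(u) <= 1, sin(pi * u), 0)          # derivative of rho_andrews

u <- seq(-2, 2, length.out = 401)
plot(u, psi_bisq(u), type = "l", ylab = "psi(u)")   # returns to zero for |u| > 1
lines(u, psi_andrews(u), lty = 2)

Both curves vanish outside [−1, 1], which is exactly what makes these estimators insensitive to grossly discrepant observations.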
To generalise robust estimation of the location parameter to regression, we first need a way of estimating the error scale parameter σ. In the MSAE and OLS cases there is no need to estimate scale, because Equations 2.38 and 2.39 can equivalently be written as

\sum_{i=1}^{n} (X_i - \mu)^{2} = \min \qquad\text{and}\qquad \sum_{i=1}^{n} |X_i - \mu| = \min,

respectively. For other M-estimators, however, the story is different: an estimate ˆσ of scale is required, because if scale is not taken into account the estimate ˆβ will not respond correctly to a change in the units of y or to a change in the scale of the errors (Li 1985, p. 302). Two strategies for taking scale into account in regression are (see Li 1985, pp. 302-303):

1. estimate σ beforehand;
2. estimate β and σ simultaneously.

In the first method, an initial scale estimate ˆσ, commonly the median absolute deviation, is calculated before each iterative step. Then, treating scale as known, a solution ˆβM to
\sum_{i=1}^{n} \psi\left( \frac{y_i - x_i\beta}{\hat{\sigma}} \right) x_i^{T} = 0   (2.41)

is determined. Assuming the MAD is used, the scale estimate ˆσMAD is calculated as

\hat{\sigma}_{MAD} = \frac{1}{0.6745} \operatorname*{median}_{i \in \{1,\dots,n\}} \left| \left( y_i - x_i\hat{\beta}^{(0)} \right) - \operatorname*{median}_{j \in \{1,\dots,n\}} \left( y_j - x_j\hat{\beta}^{(0)} \right) \right|,

where ˆβ(0) is a preliminary estimate of β and the factor 1/0.6745 ensures that ˆσMAD estimates σ when the distribution is Normal (Jacoby 2005). The MSAE estimate will usually furnish such a preliminary estimate (Li 1985, p. 302). The second method, estimating scale and the regression parameters simultaneously, is accomplished by solving the system

\sum_{i=1}^{n} \psi\left( \frac{y_i - x_i\beta}{\sigma} \right) x_i^{T} = 0   (2.42)

and

\sum_{i=1}^{n} \chi\left( \frac{y_i - x_i\beta}{\sigma} \right) = na,   (2.43)

where χ(·) is a suitable bounded function and a is a suitable positive constant, often chosen as [(n − p)/n] E{χ(Z)} with Z a standard Normal random variable (Bellio & Ventura 2005; Li 1985, p. 303).

Algorithms for computing the parameter estimates in M-regression are presented in Li (1985). Birkes and Dodge (1993) give a light-hearted exposition of M-regression, providing practical examples and verification of the algorithms they use. Most, but not all, M-estimators arise as maximum likelihood estimators of some parameter under some error distribution; an example of an M-estimator that does not arise in this way is Tukey's biweight (Birkes & Dodge 1993). Huber's M-estimators are maximum likelihood estimators for the least favourable ε-contaminated Normal distributions, whose densities, for a given value of k, are given by (Goodall 1983, p. 373)

f(x) = \begin{cases} \dfrac{1 - \varepsilon}{\sqrt{2\pi}}\, e^{-x^{2}/2}, & |x| \le k, \\[1ex] \dfrac{1 - \varepsilon}{\sqrt{2\pi}}\, e^{k^{2}/2 - k|x|}, & \text{otherwise.} \end{cases}   (2.44)

If the tuning constant is set to k = 1.345, then Huber's estimate will be 95% as efficient as the sample mean at the Normal distribution while giving substantial resistance at alternative distributions (Jacoby 2005). Although Huber's estimates (sometimes called Huber-type estimates) are robust only to outliers in the y-direction and remain sensitive to outliers in the carriers, it appears that in some practical situations, such as bioassay experiments, only errors in the y-direction need to be considered and outliers in the regressors can be ignored (Bellio & Ventura 2005).
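In practice one rarely codes the iteration by hand. For instance, MASS::rlm fits Huber-type M-regressions by iterated reweighted least squares with MAD-based scaling; a hedged sketch, reusing the height, weight and dpa vectors defined in the earlier catheterisation sketch (the choice of maxit is ours):

library(MASS)

# Huber M-regression with tuning constant k = 1.345 (about 95% efficiency at the Normal)
m_fit <- rlm(dpa ~ height + weight, psi = psi.huber, k = 1.345, maxit = 50)
coef(m_fit)
m_fit$s     # the scale estimate used in the final iteration

# A redescending fit, e.g. psi = psi.bisquare, needs a good starting value
# (rlm offers init = "lts"), for the reasons discussed in the text.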
Methods that counter this problem (e.g. Tukey's biweight) are computationally more complicated because of multiple roots. In such cases it is important to choose a good starting point and to iterate carefully, because iterative M-estimators are sensitive to the starting value when the ψ-function redescends (Bellio & Ventura 2005; Li 1985, p. 309).

2.3 Conclusion

In this section we reviewed (i) ordinary least squares regression, (ii) Bayesian Student-t regression, (iii) least absolute deviations regression, and (iv) M-regression. Although least squares regression has the disadvantage of being very sensitive to outliers, it is still widely used in practice. Several diagnostic procedures have been proposed and successfully used with OLS regression, and most packaged programs, e.g. R, have routines for many such diagnostic tools: the hat matrix, residuals, residual plots, Cook's distances, and so on. The analyst then sets a rule for rejecting influential observations; for instance, we saw earlier the "rejection" method proposed by Hoaglin and Welsch (1978). Fitting a resistant OLS regression in this way is laborious. Another disadvantage is that, though some authors call the resulting fit resistant, it actually is not: it is only resistant to the eliminated observations, which are merely sample values, while the population remains unknown.

Methods (ii), (iii), and (iv) remedy this lack of resistance, or robustness, to ill-behaved observations, and the use of robust methods in general makes the model-fitting process more automatic. These methods, however, save perhaps for method (iv), are not without problems of their own. If we applied the method of least absolute deviations to well-behaved, Normally distributed data, a great deal of efficiency would be lost; that is, we would obtain estimates with variances larger than those from an OLS regression. How methods (i), (ii), and (iii) actually compare in practice is the subject of the remainder of this note.
3 Methodology

3.1 Introduction

The previous section focused on the method of least squares, the method of least absolute deviations, and two implementations of the Bayesian Student-t linear model; no attempt was made to compare their performances. The remainder of this note is devoted to comparing the four models on the basis of their respective abilities to handle violated model assumptions. To do this, we first fit all four models under conditions favourable to the least squares regression model, to see how the robust alternatives to OLS considered here perform relative to OLS. We then break a few assumptions of the Gaussian linear model, one at a time, and compare the models again. The comparisons are based on standard model performance criteria discussed below.

3.2 Research Design

Each model will be fitted to a thousand simulated samples of sizes 100 and 400. Additionally, the proportion of contamination will take the values 0%, 5%, and 25%. We work with the case of two explanatory variables. The idea is to fix the parameters βj and see how close each model comes to them under different conditions often met in practice. For instance, since in practice deviation from symmetry usually comes in the form of positive skewness (e.g. Miller 1997, p. 16), we do not consider negatively skewed data.

The explanatory variables Xij will be generated as independent standard Normal random variables. The random errors εi will be generated from the standard Normal distribution in the standard-assumptions scenario. In order to contaminate the response variable y, we use a variant of Kianifard and Swallow's (n.d.) method: (1) we randomly select a proportion of the y's; (2) to each of the selected y-values we add an appropriate δ in place of a random error ε. For instance, for a sample of size 100 with a contamination proportion of 25%, we randomly select 25 y's and, in place of ε, add a δ to each of them. It will be recalled from the literature review that outliers in the explanatory variables give rise to high leverage points. Hence, to introduce high-leverage points, (1) we randomly select a proportion of the observations; (2) for each selected observation we substitute an appropriate pair (δ∗1, δ∗2) for its pair of explanatory variables, say (Xk1, Xk2).

Possible scenarios for a study of this kind are numerous, but our scope will not be too broad. The scenarios we investigate are as follows:

1. all assumptions valid,
2. a sample with 5% contamination,
3. a sample with 25% contamination,
4. a sample consisting of 25% positively skewed observations,
5. a sample consisting of 100% positively skewed observations, and
6. a sample with 5% contamination and 25% positively skewed observations.

Exactly how to create the first scenario is straightforward. To simulate the second scenario, we add 30 to 5% of randomly selected ei's, substitute 10 for 5% of the X1's, and substitute 200 for 5% of the X2's. The third scenario is simulated in the same way but with 25% in place of 5%, and the same method of contamination is used for scenario six. To introduce skewness we sample the random errors from a Gamma distribution with α = 2 and λ = 0.5. The Gamma distribution is always positively skewed; its coefficient of skewness can be shown to be 2/√α (see e.g. Randles & Wolfe 1979, p. 415).

Let us consider the simpler problem of location under one of the contamination scenarios. One realisation of the 5%-contamination scenario is depicted in Figure 7. Note how the distribution of the mean loses symmetry.

[Figure 7: Effects of 5% contamination on location; histograms of the mean, the median, T1, and T2.]

It is worth noting that in each scenario involving contamination, not only the response variable but also the explanatory variables are contaminated.
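A sketch of one replicate of this data-generation scheme in R follows. The function name and the exact order of operations are ours; in particular, the source does not state whether the outlying and high-leverage observations coincide (here they are drawn independently) or whether the response is regenerated after the leverage substitution (here it is not). The true parameter vector (3, 10, −5) is the one used in the results section, and δ = 30, (δ∗1, δ∗2) = (10, 200) follow the rules above.

simulate_sample <- function(n, prop_out = 0, prop_lev = 0, skew = FALSE) {
  beta <- c(3, 10, -5)                          # true (beta0, beta1, beta2)
  X1 <- rnorm(n); X2 <- rnorm(n)
  e  <- if (skew) rgamma(n, shape = 2, rate = 0.5) - 4 else rnorm(n)  # centred Gamma(2, .5)
  out <- sample(n, round(prop_out * n))         # outliers: delta = 30 replaces the error
  e[out] <- 30
  y <- beta[1] + beta[2] * X1 + beta[3] * X2 + e
  lev <- sample(n, round(prop_lev * n))         # high leverage: substitute (10, 200)
  X1[lev] <- 10; X2[lev] <- 200
  data.frame(y, X1, X2)
}

dat <- simulate_sample(100, prop_out = 0.05, prop_lev = 0.05)   # one sample under scenario two

The skew = TRUE branch corresponds to the fully skewed scenario five; scenarios four and six would replace only a subset of the errors by centred Gamma variates.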
Contaminated values in the response variable will be referred to as outliers; those in the explanatory variables will be said to give rise to high leverage (see Birkes & Dodge 1993, p. 206).

3.3 Research Objectives

The objective of this study is to see how soon and how badly least squares regression loses optimality to the other methods treated here as its assumptions are violated in various ways and to various extents. Specifically, our main research objectives are:

1. to see how much better OLS fares than the alternative methods when its assumptions are fully met,
2. to see roughly at what point OLS starts to perform poorly relative to the alternative methods as its assumptions are violated,
3. to find out which of the models under study performs best under which conditions, and
4. to see the role played by the sample size n.

3.4 Model performance criteria

For each of the four models we calculate the Root Mean Square Error (RMSE) and its components, i.e. the bias and the variance of the parameter estimates. Bias gives an average measure of distance between the true parameter vector (β0, β1, β2) and its estimates (ˆβ0, ˆβ1, ˆβ2). The RMSE can be viewed as a measure of accuracy (see e.g. Lohr 2010, p. 32); it is defined as the square root of the Mean Squared Error (MSE). Recall that the MSE of an estimate ˆβj of βj is given by

MSE(\hat{\beta}_j) = Var(\hat{\beta}_j) + Bias(\hat{\beta}_j)^{2}.   (3.45)

In order to see how the coverage probability is affected, we also check whether nominal 95% confidence intervals contain the true parameter 95% of the time. The effect on the true coverage probability has important implications for the robustness of validity of the associated t-tests (see e.g. Huber 2009; Miller 1997, p. 9); we elaborate on this later.
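Given a matrix of estimates from the replicated fits, the criteria of this section reduce to a few lines of R. In the sketch below (the function and object names are ours) est is an R x 3 matrix of estimates across R replicates, and ci_lo, ci_hi optionally hold the corresponding 95% interval limits.

performance <- function(est, beta_true, ci_lo = NULL, ci_hi = NULL) {
  bias <- colMeans(est) - beta_true
  rmse <- sqrt(colMeans(sweep(est, 2, beta_true)^2))
  vars <- rmse^2 - bias^2                                  # Equation 3.45 rearranged
  out  <- rbind(bias = bias, rmse = rmse, variance = vars)
  if (!is.null(ci_lo)) {
    cover <- colMeans(sweep(ci_lo, 2, beta_true, "<=") &
                      sweep(ci_hi, 2, beta_true, ">="))    # empirical coverage
    out <- rbind(out, coverage = cover)
  }
  out
}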
3.5 Conclusion

Although only the values of the RMSE and the bias will be tabulated, it is important to also consider the variance, which is easily obtained by rearranging Equation 3.45. Also, in addition to the element-wise bias terms, it is worthwhile to have an overall measure of distance between the true parameter vector and its estimate. To this end we use the Euclidean distance between the vector of true parameters and the vector of estimated parameters; the reader is reminded that the Euclidean distance between vectors u and v is defined as d(u, v) = ||u − v|| (see e.g. Anton & Rorres 2005). Sample sizes of 100 and 400 were chosen in the hope of obtaining approximate asymptotic relative efficiencies of the models. A look at the disparity between the true and the nominally stated coverage probabilities will lead us to some important conclusions about the disparity between the true significance level α∗ and the nominally stated one, α.
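Putting the pieces of this section together, one pass of the study for the OLS and LAD fits might look as follows; the Bayesian Student-t fits are omitted from the sketch since their implementation is not shown here, and the helpers simulate_sample and performance are those sketched earlier.

library(quantreg)

replicate_once <- function(n, ...) {
  dat <- simulate_sample(n, ...)
  rbind(ols = coef(lm(y ~ X1 + X2, data = dat)),
        lad = coef(rq(y ~ X1 + X2, tau = 0.5, data = dat)))
}

set.seed(123)
est <- replicate(1000, replicate_once(100, prop_out = 0.05, prop_lev = 0.05)["ols", ])
performance(t(est), beta_true = c(3, 10, -5))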
4 Results/Applications

4.1 Introduction

This part of our note reports the results of the Monte Carlo study carried out as described in the preceding section. For ease of reference, we denote the heteroscedastic Bayesian Student-t model by T1 and the homoscedastic one by T2. Table 12 (printed in the appendix) is partitioned into five sections, each of which summarises the results of one of the first five scenarios; Table 11 summarises the results of the final scenario. The bias and RMSE of the parameter estimates are calculated and given element-wise to afford a closer look at the performances of the models.

4.2 Scenario one

Bias under all models diminishes for the larger sample size of n = 400; the OLS estimates appear to have been the most biased. To show that we need not be overly concerned about the bias of the OLS estimates, note that for n = 100 the vector (3, 10, −5) was estimated under OLS as (3.0053, 10.0023, −5.0007) on average. We conclude that, for the first scenario, all models performed well as far as bias is concerned.

A measure of accuracy, the root mean square error (RMSE), can be considered next, since an estimator can have low bias yet be severely unstable. For both sample sizes the OLS estimates are the most accurate, with accuracy improving for the larger sample size. The variance of the parameter estimates can easily be calculated as

Var(\hat{\beta}_j) = RMSE(\hat{\beta}_j)^{2} - Bias(\hat{\beta}_j)^{2}.   (4.46)

Let ˆβj denote the least squares estimate of βj and ˜βj the corresponding least absolute deviations estimate. If S²_{ˆβj} and S²_{˜βj} denote the respective sample variances, then we observe that for n = 400,

\frac{S^{2}_{\hat{\beta}_0}}{S^{2}_{\tilde{\beta}_0}} = \frac{.0488^{2} - .0002^{2}}{.0617^{2} - .0008^{2}} \approx .63,   (4.47)

\frac{S^{2}_{\hat{\beta}_1}}{S^{2}_{\tilde{\beta}_1}} = \frac{.0493^{2} - .0019^{2}}{.0621^{2} - .0007^{2}} \approx .63,   (4.48)

and

\frac{S^{2}_{\hat{\beta}_2}}{S^{2}_{\tilde{\beta}_2}} = \frac{.0510^{2} - .0027^{2}}{.0630^{2} - .00288^{2}} \approx .65.   (4.49)
These ratios average out to .64, the asymptotic relative efficiency of the sample median to the sample mean at the Normal distribution. Similarly, the asymptotic efficiencies of the location estimators corresponding to the two Bayesian models relative to the sample mean were calculated as approximately .81 and .84 under the Normal model. We thus see little loss of efficiency in using Student-t regression instead of OLS when the assumptions of the Normal linear model are fully satisfied, and, on the basis of efficiency, both Student-t models out-perform LAD under conditions favourable to the Normal linear model.

4.3 Scenarios two and three

The second section of Table 12 summarises the results of the second scenario, under which 5% of the observations are outlying. Let us consider only the first two parameter estimates, since all models estimated β2 poorly. We see that the OLS estimates were the most stable in the 5%-contamination scenario. The method of least absolute deviations did worse than the other three models on the basis of stability, and the LAD estimates also appear to have been the most inaccurate of the four. To quantify the relative performances of the models, consider the case of n = 100: one can show that the efficiencies of LAD, T1, and T2 relative to OLS are approximately 66%, 85%, and 86%, respectively. The seemingly counter-intuitive result that LAD under-performed OLS in this scenario is due to the greater susceptibility of LAD to high leverage compared with OLS (see e.g. Birkes & Dodge 1993, p. 191; Jacoby 2005).

Under the third scenario, relative performances remained much the same as in the second. If we once again consider only the first two parameter estimates, it is apparent that OLS still had the smallest variances under this scenario, and LAD again had the largest. The efficiencies of LAD, T1, and T2 relative to OLS work out to approximately 65%, 82%, and 84%, respectively. We conclude that OLS did best under the second and third scenarios, LAD did poorest, and the Bayesian models did not perform too poorly.

4.4 Scenario four

Under this scenario, for a quarter of each sample the random errors were taken as variates from a Gamma distribution centred at mean 0, and the rest were drawn from a standard Normal distribution. We see relatively low bias under OLS, particularly in the intercept; the OLS estimates, however, show the largest variances. LAD had slightly smaller variances than OLS, and the homoscedastic Student-t model had the smallest. The reader might find it instructive to calculate the relative changes in the variances of the parameter estimates from the first scenario to the fourth: for n = 100, the variances of the OLS, LAD, T1, and T2 estimates are about 2.6, 1.5, 1.5, and 1.45 times as large as in the first scenario. Hence we conclude that under this scenario OLS performs worst in terms of the variances of the parameter estimates, and the homoscedastic Student-t model has done best.