ROBUST REGRESSION METHOD
By,
SUMON JOSE
A Seminar Presentation
Under the Guidance of Dr. Jessy John
February 24, 2015
SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 1 / 69
CONTENTS
1 INTRODUCTION
2 REVIEW
3 ROBUSTNESS & RESISTANCE
4 APPROACH
5 STRENGTHS & WEAKNESSES
6 M-ESTIMATORS
7 DELIVERY TIME PROBLEM
8 ANALYSIS
9 PROPERTIES
10 SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
11 REFERENCE
INTRODUCTION
Performance Evaluation - Geethu Anna Jose
REVIEW
The classical linear regression model relates the dependent or response variables y_i to the independent explanatory variables x_i1, x_i2, ..., x_ip for i = 1, ..., n, such that

y_i = x_i^T β + ε_i    (1)

where x_i^T = (x_i1, x_i2, ..., x_ip), ε_i denotes the error term and β = (β_1, β_2, ..., β_p)^T.
REVIEW
The expected value of y_i, called the fitted value, is

ŷ_i = x_i^T β̂    (2)

and one can use this to calculate the residual for the i-th case,

r_i = y_i − ŷ_i    (3)

In the case of the simple linear regression model we may calculate the values of β̂_0 and β̂_1 using the following formulae:
REVIEW
β̂_1 = [ Σ_{i=1}^n y_i x_i − (Σ_{i=1}^n y_i)(Σ_{i=1}^n x_i)/n ] / [ Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n ]    (4)

β̂_0 = ȳ − β̂_1 x̄    (5)

The vector of fitted values ŷ corresponding to the observed values y may be expressed as follows:

ŷ = X β̂    (6)
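For concreteness, equations (4)-(6) can be sketched in a few lines (an illustrative numpy sketch with made-up data, not taken from the slides):

```python
import numpy as np

# Simple linear regression via equations (4) and (5), on invented data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Equation (4): slope from the raw sums.
beta1 = (np.sum(y * x) - np.sum(y) * np.sum(x) / n) / \
        (np.sum(x**2) - np.sum(x)**2 / n)
# Equation (5): intercept from the sample means.
beta0 = y.mean() - beta1 * x.mean()

# Fitted values and residuals, equations (2), (3) and (6).
X = np.column_stack([np.ones(n), x])
y_hat = X @ np.array([beta0, beta1])
residuals = y - y_hat
```

With an intercept in the model, the residuals of a least squares fit sum to zero, which is a quick sanity check on the computation.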
REVIEW
Limitations of the Least Squares Estimator
Extremely sensitive to deviations from the model assumptions (a normal distribution is assumed for the errors).
Drastically changed by the effect of outliers.
REVIEW
What About Deleting Outliers Before Analysis?
Not all outliers are erroneous data; they could be exceptional occurrences.
Some such outliers could be the result of factors not considered in the current study.
So in general, unusual observations are not always bad observations. Moreover, in large data sets it is often very difficult to spot the outlying data.
ROBUSTNESS AND RESISTANCE
Resistant Regression Estimators
Definition
Resistant regression estimators are primarily concerned with robustness of validity, meaning that their main concern is to prevent unusual observations from affecting the estimates produced.
ROBUSTNESS AND RESISTANCE
Robust Regression Estimators
Definition
They are concerned with both robustness of efficiency and
robustness of validity, meaning that they should also
maintain a small sampling variance, even when the data
does not fit the assumed distribution.
ROBUSTNESS AND RESISTANCE
⇒ In general, robust regression estimators aim to fit a model that describes the majority of a sample.
⇒ Their robustness is achieved by giving the data different weights,
⇒ whereas in least squares estimation all data are treated equally.
APPROACH
Robust estimation methods are powerful tools in the detection of outliers in complicated data sets.
But unless the data are very well behaved, different estimators will give different estimates.
On their own, they do not provide a final model.
A healthy approach is to employ both robust regression methods and the least squares method and compare the results.
STRENGTHS & WEAKNESSES
Finite Sample Breakdown Point
Definition
The breakdown point (BDP) is a measure of the resistance of an estimator. The BDP of a regression estimator is the smallest fraction of contamination that can cause the estimator to 'break down' and no longer represent the trend of the data.
STRENGTHS & WEAKNESSES
When an estimator breaks down, the estimate it produces from the contaminated data can be arbitrarily far from the estimate it would give when the data were uncontaminated.
STRENGTHS & WEAKNESSES
In order to describe the BDP mathematically, define T as a regression estimator, Z as a sample of n data points, and T(Z) = β̂. Let Z′ be the corrupted sample in which m of the original data points are replaced with arbitrary values. The maximum effect that could be caused by such contamination is

effect(m; T, Z) = sup_{Z′} |T(Z′) − T(Z)|    (7)
STRENGTHS & WEAKNESSES
When (7) is infinite, an outlier can have an arbitrarily large effect on T. The BDP of T at the sample Z is therefore defined as

BDP(T, Z) = min{ m/n : effect(m; T, Z) is infinite }    (8)
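A toy illustration of (7)-(8) for location estimators (not from the slides): corrupting a single observation (m = 1) can move the sample mean arbitrarily far, while the sample median, whose BDP is 50%, barely moves.

```python
import numpy as np

z = np.array([9.8, 10.1, 10.0, 9.9, 10.2])
z_bad = z.copy()
z_bad[0] = 1e6               # replace m = 1 observation with an arbitrary value

# The mean's effect(1; T, Z) grows without bound with the corruption...
mean_shift = abs(z_bad.mean() - z.mean())
# ...while the median moves only within the range of the clean data.
median_shift = abs(np.median(z_bad) - np.median(z))
```

Making the corrupted value larger makes mean_shift as large as we like, so the mean's BDP is 1/n; the median's shift stays bounded until more than half the sample is corrupted.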
STRENGTHS & WEAKNESSES
The least squares estimator, for example, has a breakdown point of 1/n, because just one leverage point can cause it to break down. As the number of data points increases, the breakdown point tends to 0, and so the least squares estimator is said to have a BDP of 0%.
STRENGTHS & WEAKNESSES
Remark
The highest breakdown point one can hope for is 50%, since if more than half the data are contaminated one cannot differentiate between 'good' and 'bad' data.
STRENGTHS & WEAKNESSES
Relative Efficiency of an Estimator
Definition
The efficiency of an estimator for a particular parameter is
defined as the ratio of its minimum possible variance to
its actual variance. Strictly, an estimator is considered
’efficient’ when this ratio is one.
STRENGTHS & WEAKNESSES
High efficiency is crucial for an estimator if the intention is to use an estimate from sample data to make inferences about the larger population from which the sample was drawn.
STRENGTHS & WEAKNESSES
Relative Efficiency
Relative efficiency compares the efficiency of an estimator to that of a well-known method.
In the context of regression, estimators are compared to the least squares estimator, which is the most efficient estimator when the errors are normally distributed.
STRENGTHS & WEAKNESSES
Given two estimators T1 and T2 for a population parameter β, where T1 is the most efficient estimator possible and T2 is less efficient, the relative efficiency of T2 is calculated as the ratio of the mean squared error of T1 to the mean squared error of T2:

Efficiency(T1, T2) = E[(T1 − β)²] / E[(T2 − β)²]    (9)
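Equation (9) can be illustrated with a small simulation (an illustrative sketch, not from the slides): under normal errors the sample median is a less efficient location estimator than the sample mean, with asymptotic relative efficiency 2/π ≈ 0.64.

```python
import numpy as np

# T1 = sample mean (the efficient estimator under normality),
# T2 = sample median; the true parameter beta is 0.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(20000, 51))

mean_est = samples.mean(axis=1)
median_est = np.median(samples, axis=1)

# Relative efficiency of T2 from equation (9): E[(T1 - 0)^2] / E[(T2 - 0)^2]
rel_eff = np.mean(mean_est**2) / np.mean(median_est**2)
```

rel_eff comes out near 2/π, in line with the asymptotic efficiency of the median relative to the mean under normality.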
M-ESTIMATORS
Introduction
1 First proposed by Huber (1973).
2 But the early ones had weaknesses in terms of one or more of the desired properties.
3 From them the modern methods developed.
M-ESTIMATORS
Maximum Likelihood Type Estimators
M-estimation is based on the idea that while we still want a maximum likelihood estimator, the errors might be better represented by a different, heavier-tailed distribution.
M-ESTIMATORS
If the probability density function of the errors is f(ε_i), then the maximum likelihood estimator for β is that which maximizes the likelihood function

∏_{i=1}^n f(ε_i) = ∏_{i=1}^n f(y_i − x_i^T β)    (10)
M-ESTIMATORS
This means it also maximizes the log-likelihood function

Σ_{i=1}^n ln f(ε_i) = Σ_{i=1}^n ln f(y_i − x_i^T β)    (11)

When the errors are normally distributed, it has been shown that this leads to minimising the sum of squared residuals, which is the ordinary least squares method.
M-ESTIMATORS
Assuming the errors are differently distributed leads to the maximum likelihood estimator minimising a different function. Using this idea, an M-estimator β̂ minimizes

Σ_{i=1}^n ρ(ε_i) = Σ_{i=1}^n ρ(y_i − x_i^T β)    (12)

where ρ(u) is a continuous, symmetric function called the objective function, with a unique minimum at 0.
M-ESTIMATORS
1 Knowing the appropriate ρ(u) to use requires knowledge of how the errors are really distributed.
2 Functions are usually chosen through consideration of how the resulting estimator down-weights the larger residuals.
3 A robust M-estimator achieves this by minimizing the sum of a less rapidly increasing objective function than the ρ(u) = u² of least squares.
M-ESTIMATORS
Constructing a Scale Equivariant Estimator
The M-estimator is not necessarily scale invariant, i.e. if the errors y_i − x_i^T β were multiplied by a constant, the new solution to the above equation might not be the same as the scaled version of the old one.
M-ESTIMATORS
To obtain a scale invariant version of this estimator we usually solve

Σ_{i=1}^n ρ(ε_i / s) = Σ_{i=1}^n ρ((y_i − x_i^T β) / s)    (13)

where s is a robust estimate of scale.
M-ESTIMATORS
A popular choice for s is the re-scaled median absolute deviation

s = 1.4826 × MAD    (14)

where MAD is the median absolute deviation

MAD = median|y_i − x_i^T β̂| = median|ε̂_i|    (15)
M-ESTIMATORS
s is highly resistant to outlying observations, with a BDP of 50%, as it is based on the median rather than the mean. The MAD is rescaled by the factor 1.4826 so that, when the sample is large and the ε_i are really distributed as N(0, σ²), s estimates the standard deviation σ.
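A minimal sketch of (14)-(15) (an assumed numpy implementation; the MAD here is taken about zero, following equation (15)):

```python
import numpy as np

def robust_scale(residuals):
    # Equation (15): MAD of the residuals (about zero, as in the slides),
    # then the 1.4826 rescaling of equation (14).
    mad = np.median(np.abs(residuals))
    return 1.4826 * mad

rng = np.random.default_rng(1)
eps = rng.normal(0.0, 2.0, size=100000)
eps[:100] = 1e3              # 0.1% gross outliers barely affect s
s = robust_scale(eps)        # close to the true sigma = 2.0
```

Even with gross contamination, s stays near the true σ, whereas the sample standard deviation of the same contaminated residuals would be enormous.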
M-ESTIMATORS
With a large sample and ε_i ∼ N(0, σ²):

P(|ε_i| < MAD) ≈ 0.5
⇒ P(|ε_i − 0|/σ < MAD/σ) ≈ 0.5
⇒ P(|Z| < MAD/σ) ≈ 0.5
⇒ MAD/σ ≈ Φ⁻¹(0.75)
M-ESTIMATORS
⇒ MAD/Φ⁻¹(0.75) ≈ σ
⇒ 1.4826 × MAD ≈ σ

Thus the tuning constant 1.4826 ≈ 1/Φ⁻¹(0.75) makes s an approximately unbiased estimator of σ if n is large and the error distribution is normal.
M-ESTIMATORS
Finding an M-Estimator
To obtain an M-estimate we solve

min_β Σ_{i=1}^n ρ(ε_i / s) = min_β Σ_{i=1}^n ρ((y_i − x_i^T β) / s)    (16)

For that we equate the first partial derivatives of ρ with respect to β_j (j = 0, 1, ..., k) to zero, yielding a necessary condition for a minimum.
M-ESTIMATORS
This gives a system of p = k + 1 equations

Σ_{i=1}^n x_ij ψ((y_i − x_i^T β)/s) = 0,  j = 0, 1, ..., k    (17)

where ψ = ρ′, x_ij is the i-th observation on the j-th regressor, and x_i0 = 1. In general ψ is a non-linear function, so equation (17) must be solved iteratively. The most widely used method is iteratively reweighted least squares.
M-ESTIMATORS
To use iteratively reweighted least squares, suppose that an initial estimate β̂_0 is available and that s is an estimate of scale. Then we write the p = k + 1 equations as

Σ_{i=1}^n x_ij ψ((y_i − x_i^T β)/s) = Σ_{i=1}^n x_ij { ψ[(y_i − x_i^T β)/s] / [(y_i − x_i^T β)/s] } (y_i − x_i^T β)/s = 0    (18)
M-ESTIMATORS
or, equivalently, as

Σ_{i=1}^n x_ij W_i^0 (y_i − x_i^T β) = 0,  j = 0, 1, 2, ..., k    (19)

where

W_i^0 = { ψ[(y_i − x_i^T β̂_0)/s] / [(y_i − x_i^T β̂_0)/s],  if y_i ≠ x_i^T β̂_0
        { 1,                                                 if y_i = x_i^T β̂_0    (20)
M-ESTIMATORS
We may write the above equations in matrix form as follows:

X^T W^0 X β = X^T W^0 y    (21)

where W^0 is an n × n diagonal matrix of weights whose diagonal elements W_i^0 are given by expression (20).
M-ESTIMATORS
From the matrix form we see that this is the same as the usual weighted least squares normal equations. Consequently the one-step estimator is

β̂_1 = (X^T W^0 X)⁻¹ X^T W^0 y    (23)

At the next step we recompute the weights from the equation for W_i, but using β̂_1 instead of β̂_0.
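The full iteration (16)-(23) can be sketched as follows (a minimal illustration with Huber's ψ on invented data, not the delivery time data; the iteration count and scale floor are arbitrary choices):

```python
import numpy as np

def huber_psi(u, t=2.0):
    # psi(u) = u for |u| <= t, t*sign(u) otherwise (Huber's t-function)
    return np.clip(u, -t, t)

def irls_m_estimate(X, y, t=2.0, n_iter=50):
    # beta_0: start from the ordinary least squares fit
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        s = max(1.4826 * np.median(np.abs(r)), 1e-8)  # robust scale, eqs (14)-(15)
        u = r / s
        w = np.ones_like(u)                           # weights, eq. (20)
        nz = u != 0
        w[nz] = huber_psi(u[nz], t) / u[nz]
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # one step, eq. (23)
    return beta

# A line y = 1 + 2x with noise and one gross outlier.
rng = np.random.default_rng(2)
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=10)
y[9] += 80.0                      # contaminate one observation
X = np.column_stack([np.ones_like(x), x])
beta_robust = irls_m_estimate(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The robust fit stays near (1, 2), while the ordinary least squares slope is dragged upward by the single outlier.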
M-ESTIMATORS
NOTE:
Usually only a few iterations are required to obtain convergence.
It can easily be implemented in a computer programme.
M-ESTIMATORS
Re-Descending Estimators
Re- descending M estimators are those which have
influence functions that are non decreasing near the origin
but decreasing towards zero far from the origin.
Their ψ can be chosen to redescend smoothly to zero, so
that they usually satisfy ψ(x) = 0 for all |x| > r where r
is referred to as the minimum rejection point.
SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 42 / 69
M-ESTIMATORS
(Figures, slides 43-45: plots of re-descending ψ functions.)
M-ESTIMATORS
Robust Criterion Functions

Criterion                 ρ(z)               ψ(z)         w(z)              Range
Least squares             z²/2               z            1.0               |z| < ∞
Huber's t-function        z²/2               z            1.0               |z| ≤ t
(t = 2)                   t|z| − t²/2        t·sign(z)    t/|z|             |z| > t
Andrew's wave function    a(1 − cos(z/a))    sin(z/a)     sin(z/a)/(z/a)    |z| ≤ aπ
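As an illustrative sketch (not part of the slides), the weight functions in the table can be coded directly; the tuning constants t = 2 and a = 1.48 match the values used in the delivery time example that follows:

```python
import numpy as np

def huber_weight(z, t=2.0):
    # w(z) = 1 for |z| <= t and t/|z| beyond, written as one expression:
    # t / max(|z|, t) equals 1 when |z| <= t and t/|z| otherwise.
    z = np.asarray(z, dtype=float)
    return t / np.maximum(np.abs(z), t)

def andrews_weight(z, a=1.48):
    # w(z) = sin(z/a)/(z/a) inside the rejection point |z| <= a*pi, 0 beyond.
    # np.sinc(x) = sin(pi*x)/(pi*x), so np.sinc(u/pi) = sin(u)/u.
    z = np.asarray(z, dtype=float)
    u = z / a
    return np.where(np.abs(z) <= a * np.pi, np.sinc(u / np.pi), 0.0)
```

The Huber weight never reaches zero (outliers keep some influence), while Andrew's wave weight vanishes entirely past aπ, which is what makes it a re-descending estimator.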
DELIVERY TIME PROBLEM
Problem
A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (y) are the number of cases of product stocked (x1) and the distance walked by the route driver (x2). The engineer has collected 25 observations on delivery time, which are shown in the following table. Fit a regression model to the data.
DELIVERY TIME PROBLEM
Table of Data
Observation Delivery time Number of cases Distance in feet
i (in minutes) y x1 x2
1 16.8 7 560
2 11.50 3 320
3 12.03 3 340
4 14.88 4 80
5 13.75 6 150
6 18.11 7 330
7 8 2 110
8 17.83 7 210
9 79.24 30 1460
10 21.50 5 605
11 40.33 16 688
12 21 10 215
13 13.50 4 255
DELIVERY TIME PROBLEM
Observation Delivery time Number of cases Distance in feet
(in minutes) y x1 x2
14 19.75 6 462
15 24.00 9 448
16 29.00 10 776
17 15.35 6 200
18 19.00 7 132
19 9.50 3 36
20 35.10 17 770
21 17.90 10 140
22 52.32 26 810
23 18.75 9 450
24 19.83 8 635
25 10.75 4 150
DELIVERY TIME PROBLEM
Least Squares Fit of the Delivery Time Data
Obs. yi ˆyi ei Weight
1 .166800E+02 .217081E+02 -.502808E+01 .100000E+01
2 .115000E+02 .103536E+02 .114639E+01 .100000E+01
3 .120300E+02 .120798E+02 -.497937E-01 .100000E+01
4 .148800E+02 .995565E+01 .492435E+01 .100000E+01
5 .137500E+02 .141944E+02 -.444398E+00 .100000E+01
6 .181100E+02 .183996E+02 -.289574E+00 .100000E+01
7 .800000E+01 .715538E+01 .844624E+00 .100000E+01
8 .178300E+02 .166734E+02 .115660E+01 .100000E+01
9 .792400E+02 .718203E+02 .741971E+01 .100000E+01
10 .215000E+02 .191236E+02 .237641E+01 .100000E+01
11 .403300E+02 .380925E+02 .223749E+01 .100000E+01
12 .210000E+02 .215930E+02 -.593041E+00 .100000E+01
13 .135000E+02 .124730E+02 .102701E+01 .100000E+01
DELIVERY TIME PROBLEM
Obs. yi ˆyi ei Weight
14 .197500E+02 .186825E+02 .106754E+01 .100000E+01
15 .240000E+02 .233288E+02 .671202E+00 .100000E+01
16 .290000E+02 .296629E+02 -.662928E+00 .100000E+01
17 .153500E+02 .149136E+02 .436360E+00 .100000E+01
18 .190000E+02 .155514E+02 .344862E+01 .100000E+01
19 .950000E+01 .770681E+01 .179319E+01 .100000E+01
20 .351000E+02 .408880E+02 -.578797E+01 .100000E+01
21 .179000E+02 .205142E+02 -.261418E+01 .100000E+01
22 .523200E+02 .560065E+02 -.368653E+01 .100000E+01
23 .187500E+02 .233576E+02 -.460757E+01 .100000E+01
24 .198300E+02 .244029E+02 -.457285E+01 .100000E+01
25 .107500E+02 .109626E+02 -.212584E+00 .100000E+01
DELIVERY TIME PROBLEM
Accordingly we have the following values for the parameters:
β̂0 = 2.3412
β̂1 = 1.6159
β̂2 = 0.014385
Thus we have the fitted regression equation:

ŷ = 2.3412 + 1.6159 x1 + 0.014385 x2    (24)
DELIVERY TIME PROBLEM
Huber’s t-Function, t=2
Obs. yi ˆyi ei Weight
1 .166800E+02 .217651E+02 -.508511E+01 .639744E+00
2 .115000E+02 .109809E+02 .519115E+00 .100000E+01
3 .120300E+02 .126296E+02 -.599594E+00 .100000E+01
4 .148800E+02 .105856E+02 .429439E+01 .757165E+00
5 .137500E+02 .146038E+02 -.853800E+00 .100000E+01
6 .181100E+02 .186051E+02 -.495085E+00 .100000E+01
7 .800000E+01 .794135E+01 .586521E-01 .100000E+01
8 .178300E+02 .169564E+02 .873625E+00 .100000E+01
9 .792400E+02 .692795E+02 .996050E+01 .327017E+00
10 .215000E+02 .193269E+02 .217307E+01 .100000E+01
11 .403300E+02 .372777E+02 .305228E+01 .100000E+01
12 .210000E+02 .216097E+02 -.609734E+00 .100000E+01
13 .135000E+02 .129900E+02 .510021E+00 .100000E+01
DELIVERY TIME PROBLEM
Obs. yi ŷi ei Weight
14 .197500E+02 .188904E+02 .859556E+00 .100000E+01
15 .240000E+02 .232828E+02 .717244E+00 .100000E+01
16 .290000E+02 .293174E+02 -.317449E+00 .100000E+01
17 .153500E+02 .152908E+02 .592377E-01 .100000E+01
18 .190000E+02 .158847E+02 .311529E+01 .100000E+01
19 .950000E+01 .845286E+01 .104714E+01 .100000E+01
20 .351000E+02 .399326E+02 -.483256E+01 .672828E+00
21 .179000E+02 .205793E+02 -.267929E+01 .100000E+01
22 .523200E+02 .542361E+02 -.191611E+01 .100000E+01
23 .187500E+02 .233102E+02 -.456023E+01 .713481E+00
24 .198300E+02 .243238E+02 .449377E+01 .723794E+00
25 .107500E+02 .115474E+02 -.797359E+00 .100000E+01
DELIVERY TIME PROBLEM
Accordingly we get the values of the parameters as follows:
β̂0 = 3.3736
β̂1 = 1.5282
β̂2 = 0.013739
Thus we get the fitted regression equation:

ŷ = 3.3736 + 1.5282 x1 + 0.013739 x2    (25)
DELIVERY TIME PROBLEM
Andrew’s Wave Function with a = 1.48
Obs. yi ŷi ei Weight
1 .166800E+02 .216430E+02 -.496300E+01 .427594E+00
2 .115000E+02 .116923E+02 -.192338E+00 .998944E+00
3 .120300E+02 .131457E+02 -.111570E+01 .964551E+00
4 .148800E+02 .114549E+02 .342506E+01 .694894E+00
5 .137500E+02 .152191E+02 -.146914E+01 .939284E+00
6 .181100E+02 .188574E+02 -.747381E+00 .984039E+00
7 .800000E+01 .890189E+01 -.901888E+00 .976864E+00
8 .178300E+02 .174040E+02 .425984E+00 .994747E+00
9 .792400E+02 .660818E+02 .131582E+02 .000000E+00
10 .215000E+02 .192716E+02 .222839E+01 .863633E+00
11 .403300E+02 .363170E+02 .401296E+01 .597491E+00
12 .210000E+02 .218392E+02 -.839167E+00 .980003E+00
13 .135000E+02 .135744E+02 -.744338E-01 .999843E+00
DELIVERY TIME PROBLEM
Obs. yi ŷi ei Weight
14 .197500E+02 .189979E+02 .752115E+00 .983877E+00
15 .240000E+02 .232029E+02 .797080E+00 .981854E+00
16 .290000E+02 .286336E+02 .366350E+00 .996228E+00
17 .153500E+02 .158247E+02 -.474704E+00 .993580E+00
18 .190000E+02 .164593E+02 .254067E+01 .824146E+00
19 .950000E+01 .946384E+01 .361558E-01 .999936E+00
20 .351000E+02 .387684E+02 -.366837E+01 .655336E+00
21 .179000E+02 .209308E+02 -.303081E+01 .756603E+00
22 .523200E+02 .523766E+02 -.566063E-01 .999908E+00
23 .187500E+02 .232271E+02 -.447714E+01 .515506E+00
24 .198300E+02 .240095E+02 -.417955E+01 .567792E+00
25 .107500E+02 .123027E+02 -.155274E+01 .932266E+00
DELIVERY TIME PROBLEM
Thus we have the estimates:
β̂0 = 4.6532
β̂1 = 1.4582
β̂2 = 0.012111
and the fitted regression equation:

ŷ = 4.6532 + 1.4582 x1 + 0.012111 x2    (26)
ANALYSIS
Computing M-Estimators
Robust regression methods are not a standard option in most statistical software.
SAS PROC NLIN, among others, can be used to implement the iteratively reweighted least squares procedure.
There are also robust procedures available in S-PLUS.
ANALYSIS
Robust Regression Methods...
Robust regression methods have much to offer a data analyst.
They are extremely helpful in locating outliers and highly influential observations.
Whenever a least squares analysis is performed, it would be useful to perform a robust fit as well.
ANALYSIS
If the results of both fits are in substantial agreement, the least squares procedure offers a good estimation of the parameters.
If the results of the two procedures are not in agreement, the reason for the difference should be identified and corrected.
Special attention needs to be given to observations that are down-weighted in the robust fit.
PROPERTIES
Breakdown Point
The finite sample breakdown point is the smallest fraction of anomalous data that can cause the estimator to be useless. The smallest possible breakdown point is 1/n, i.e. a single observation can distort the estimator so badly that it is of no practical use to the regression model builder. The breakdown point of OLS is 1/n.
PROPERTIES
M-estimators can be affected by x-space outliers in an identical manner to OLS.
Consequently, the breakdown point of the class of M-estimators is 1/n as well.
We would generally want the breakdown point of an estimator to exceed 10%.
This has led to the development of high breakdown point estimators.
PROPERTIES
Efficiency
M-estimators have high relative efficiency: they lose little efficiency compared with least squares when the errors are normal, while behaving far better when the error distribution is heavy-tailed.
SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
High Breakdown Point Estimators
Because both OLS and M-estimators suffer from a low breakdown point of 1/n, considerable effort has been devoted to finding estimators that perform better with respect to this property. Often a breakdown point of 50% is desirable.
SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
There are various other estimation procedures like
Least Median of Squares
Least Trimmed Sum of Squares
S Estimators
R and L Estimators
Robust Ridge Regression
MM Estimation etc.
ABSTRACT & CONCLUSION
Review ⇒ Robustness and Resistance ⇒
Our Approach ⇒ Strengths and Weaknesses
⇒ M-Estimators ⇒ Delivery time
problem ⇒ Analysis ⇒ Properties ⇒
Survey of other Robust Regression Estimators
REFERENCE
1 Draper, Norman R. & Smith, Harry. "Applied Regression Analysis", 3rd edn., John Wiley and Sons, New York, 1998.
2 Montgomery, Douglas C., Peck, Elizabeth A. & Vining, G. Geoffrey. "Introduction to Linear Regression Analysis", 3rd edn., Wiley India, 2003.
3 Brook, Richard J. "Applied Regression Analysis and Experimental Design", Chapman & Hall, London, 1985.
4 Rawlings, John O. "Applied Regression Analysis: A Research Tool", Springer, New York, 1989.
5 Pedhazur, Elazar J. "Multiple Regression in Behavioural Research: Explanation and Prediction", Wadsworth, Australia, 1997.
THANK YOU
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentation
 

Similar to Seminar on Robust Regression Methods

Chapter 07 - Autocorrelation.pptx
Chapter 07 - Autocorrelation.pptxChapter 07 - Autocorrelation.pptx
Chapter 07 - Autocorrelation.pptxFarah Amir
 
Applied Business Statistics ,ken black , ch 3 part 1
Applied Business Statistics ,ken black , ch 3 part 1Applied Business Statistics ,ken black , ch 3 part 1
Applied Business Statistics ,ken black , ch 3 part 1AbdelmonsifFadl
 
LESSON 04 - Descriptive Satatistics.pdf
LESSON 04 - Descriptive Satatistics.pdfLESSON 04 - Descriptive Satatistics.pdf
LESSON 04 - Descriptive Satatistics.pdfICOMICOM4
 
Identification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation StudyIdentification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation Studyiosrjce
 
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical Methods
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical MethodsJavier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical Methods
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical MethodsJ. García - Verdugo
 
Ch5_slides Qwertr12234543234433444344.ppt
Ch5_slides Qwertr12234543234433444344.pptCh5_slides Qwertr12234543234433444344.ppt
Ch5_slides Qwertr12234543234433444344.pptsadafshahbaz7777
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisHARISH Kumar H R
 
Media attention in Belgium: How much influence do citizens and politicians ha...
Media attention in Belgium: How much influence do citizens and politicians ha...Media attention in Belgium: How much influence do citizens and politicians ha...
Media attention in Belgium: How much influence do citizens and politicians ha...Mark Boukes (University of Amsterdam)
 
20150404 rm - autocorrelation
20150404   rm - autocorrelation20150404   rm - autocorrelation
20150404 rm - autocorrelationQatar University
 
Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...SYRTO Project
 
Desinging dsp (0, 1) acceptance sampling plans based on
Desinging dsp (0, 1) acceptance sampling plans based onDesinging dsp (0, 1) acceptance sampling plans based on
Desinging dsp (0, 1) acceptance sampling plans based oneSAT Publishing House
 

Similar to Seminar on Robust Regression Methods (20)

Chapter 07 - Autocorrelation.pptx
Chapter 07 - Autocorrelation.pptxChapter 07 - Autocorrelation.pptx
Chapter 07 - Autocorrelation.pptx
 
Applied Business Statistics ,ken black , ch 3 part 1
Applied Business Statistics ,ken black , ch 3 part 1Applied Business Statistics ,ken black , ch 3 part 1
Applied Business Statistics ,ken black , ch 3 part 1
 
LESSON 04 - Descriptive Satatistics.pdf
LESSON 04 - Descriptive Satatistics.pdfLESSON 04 - Descriptive Satatistics.pdf
LESSON 04 - Descriptive Satatistics.pdf
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
 
report
reportreport
report
 
Identification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation StudyIdentification of Outliersin Time Series Data via Simulation Study
Identification of Outliersin Time Series Data via Simulation Study
 
Biostatistics in Bioequivalence
Biostatistics in BioequivalenceBiostatistics in Bioequivalence
Biostatistics in Bioequivalence
 
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical Methods
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical MethodsJavier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical Methods
Javier Garcia - Verdugo Sanchez - Six Sigma Training - W1 Statistical Methods
 
Ch5_slides.ppt
Ch5_slides.pptCh5_slides.ppt
Ch5_slides.ppt
 
Ch5_slides Qwertr12234543234433444344.ppt
Ch5_slides Qwertr12234543234433444344.pptCh5_slides Qwertr12234543234433444344.ppt
Ch5_slides Qwertr12234543234433444344.ppt
 
Ch5_slides.ppt
Ch5_slides.pptCh5_slides.ppt
Ch5_slides.ppt
 
Ch5 slides
Ch5 slidesCh5 slides
Ch5 slides
 
Multinomial Logistic Regression Analysis
Multinomial Logistic Regression AnalysisMultinomial Logistic Regression Analysis
Multinomial Logistic Regression Analysis
 
FMI output gap
FMI output gapFMI output gap
FMI output gap
 
Wp13105
Wp13105Wp13105
Wp13105
 
Media attention in Belgium: How much influence do citizens and politicians ha...
Media attention in Belgium: How much influence do citizens and politicians ha...Media attention in Belgium: How much influence do citizens and politicians ha...
Media attention in Belgium: How much influence do citizens and politicians ha...
 
G0211056062
G0211056062G0211056062
G0211056062
 
20150404 rm - autocorrelation
20150404   rm - autocorrelation20150404   rm - autocorrelation
20150404 rm - autocorrelation
 
Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...Measuring credit risk in a large banking system: econometric modeling and emp...
Measuring credit risk in a large banking system: econometric modeling and emp...
 
Desinging dsp (0, 1) acceptance sampling plans based on
Desinging dsp (0, 1) acceptance sampling plans based onDesinging dsp (0, 1) acceptance sampling plans based on
Desinging dsp (0, 1) acceptance sampling plans based on
 

Seminar on Robust Regression Methods

  • 1. ROBUST REGRESSION METHOD By SUMON JOSE A Seminar Presentation Under the Guidance of Dr. Jessy John February 24, 2015 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 1 / 69
  • 2. CONTENTS 1 INTRODUCTION 2 REVIEW 3 ROBUSTNESS & RESISTANCE 4 APPROACH 5 STRENGTHS & WEAKNESSES 6 M- ESTIMATORS 7 DELIVERY TIME PROBLEM 8 ANALYSIS 9 PROPERTIES 10 SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS 11 REFERENCE SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 2 / 69
  • 3. INTRODUCTION Performance Evaluation - Geethu Anna Jose SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 3 / 69
  • 4. REVIEW The classical linear regression model relates the dependent or response variable yi to the independent explanatory variables xi1, xi2, ..., xip for i = 1, ..., n, such that yi = xi^T β + εi, (1) where xi^T = (xi1, xi2, ..., xip), εi denotes the error term and β = (β1, β2, ..., βp)^T SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 4 / 69
  • 5. REVIEW The expected value of yi, called the fitted value, is ŷi = xi^T β (2) and one can use this to calculate the residual for the ith case, ri = yi − ŷi (3) In the case of the simple linear regression model we may calculate the values of β0 and β1 using the following formulae: SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 5 / 69
  • 6. REVIEW β̂1 = [Σ yi xi − (Σ yi)(Σ xi)/n] / [Σ xi² − (Σ xi)²/n] (4) β̂0 = ȳ − β̂1 x̄ (5) The vector of fitted values ŷi corresponding to the observed values yi may be expressed as follows: ŷ = X β̂ (6) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 6 / 69
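As a numerical check on formulae (4) and (5), here is a minimal sketch in Python (NumPy assumed available; the function name `simple_ols` is my own, not from the slides):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form least-squares estimates for y = b0 + b1*x, Eqs. (4)-(5)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # Eq. (4): slope from sums of products and squares
    b1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
    # Eq. (5): intercept from the means
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Noise-free data generated from y = 1 + 2x, so the fit recovers it exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
b0, b1 = simple_ols(x, y)   # b0 -> 1.0, b1 -> 2.0
```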
  • 7. REVIEW Limitations of the Least Squares Estimator Extremely sensitive to deviations from the model assumptions (as a normal distribution is assumed for the errors). Drastically changed by the effect of outliers. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 7 / 69
  • 8. REVIEW What About Deleting Outliers Before Analysis? All the outliers need not be erroneous data; they could be exceptional occurrences. Some such outliers could be the result of factors not considered in the current study. So, in general, unusual observations are not always bad observations. Moreover, in large data sets it is often very difficult to spot the outlying data. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 8 / 69
  • 9. ROBUSTNESS AND RESISTANCE Resistant Regression Estimators Definition The resistant regression estimators are primarily concerned with robustness of validity: meaning that their main concern is to prevent unusual observations from affecting the estimates produced. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 9 / 69
  • 10. ROBUSTNESS AND RESISTANCE Robust Regression Estimators Definition They are concerned with both robustness of efficiency and robustness of validity, meaning that they should also maintain a small sampling variance, even when the data does not fit the assumed distribution. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 10 / 69
  • 11. ROBUSTNESS AND RESISTANCE ⇒ In general, robust regression estimators aim to fit a model that describes the majority of a sample. ⇒ Their robustness is achieved by giving the data points different weights, ⇒ whereas in least squares approximation all data points are treated equally. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 11 / 69
  • 12. APPROACH Robust estimation methods are powerful tools for the detection of outliers in complicated data sets. But unless the data is very well behaved, different estimators will give different estimates. On their own, they do not provide a final model. A healthy approach is to employ both robust regression methods and the least squares method and compare the results. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 12 / 69
  • 13. STRENGTHS & WEAKNESSES Finite Sample Breakdown Point Definition Breakdown Point (BDP) is the measure of the resistance of an estimator. The BDP of a regression estimator is the smallest fraction of contamination that can cause the estimator to ’break down’ and no longer represent the trend of the data. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 13 / 69
  • 14. STRENGTHS & WEAKNESSES When an estimator breaks down, the estimate it produces from the contaminated data can become arbitrarily far from the estimate it would give when the data was uncontaminated. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 14 / 69
  • 15. STRENGTHS & WEAKNESSES In order to describe the BDP mathematically, define T as a regression estimator, Z as a sample of n data points and T(Z) = β̂. Let Z′ be the corrupted sample where m of the original data points are replaced with arbitrary values. The maximum effect that could be caused by such contamination is effect(m; T, Z) = sup over Z′ of |T(Z′) − T(Z)| (7) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 15 / 69
  • 16. STRENGTHS & WEAKNESSES When (7) is infinite, an outlier can have an arbitrarily large effect on T. The BDP of T at the sample Z is therefore defined as: BDP(T, Z) = min{m/n : effect(m; T, Z) is infinite} (8) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 16 / 69
  • 17. STRENGTHS & WEAKNESSES The least squares estimator, for example, has a breakdown point of 1/n because just one leverage point can cause it to break down. As the number of data points increases, the breakdown point tends to 0, and so the least squares estimator is said to have BDP 0%. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 17 / 69
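The 1/n breakdown of least squares is easy to see numerically: corrupting a single response is enough to drag the fitted slope far from the truth. A small sketch (simulated data and variable names are my own, not from the slides):

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line through (x, y)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.5, size=50)   # true slope = 2

clean_slope = ols_slope(x, y)                 # close to 2
y_bad = y.copy()
y_bad[-1] = 500.0                             # corrupt just one of 50 observations
bad_slope = ols_slope(x, y_bad)               # pulled far away from 2
```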
  • 18. STRENGTHS & WEAKNESSES Remark The highest breakdown point one can hope for is 50%, since if more than half the data is contaminated one cannot differentiate between ’good’ and ’bad’ data. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 18 / 69
  • 19. STRENGTH & WEAKNESSES Relative Efficiency of an Estimator Definition The efficiency of an estimator for a particular parameter is defined as the ratio of its minimum possible variance to its actual variance. Strictly, an estimator is considered ’efficient’ when this ratio is one. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 19 / 69
  • 20. STRENGTHS & WEAKNESSES High efficiency is crucial for an estimator if the intention is to use an estimate from sample data to make inference about the larger population from which the sample was drawn. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 20 / 69
  • 21. STRENGTH & WEAKNESSES Relative Efficiency Relative efficiency compares the efficiency of an estimator to that of a well known method. In the context of regression, estimators are compared to the least squares estimator which is the most efficient estimator known. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 21 / 69
  • 22. STRENGTHS & WEAKNESSES Given two estimators T1 and T2 for a population parameter β, where T1 is the most efficient estimator possible and T2 is less efficient, the relative efficiency of T2 is calculated as the ratio of the mean squared error of T1 to its own mean squared error: Efficiency(T1, T2) = E[(T1 − β)²] / E[(T2 − β)²] (9) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 22 / 69
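Equation (9) can be illustrated by Monte Carlo, taking the sample mean as T1 and the sample median as T2 for a normal location parameter; for normal data the ratio comes out near 2/π ≈ 0.64 (a sketch under my own simulation setup, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 101, 20_000
samples = rng.normal(0.0, 1.0, size=(reps, n))   # true parameter beta = 0

mse_mean = np.mean(samples.mean(axis=1) ** 2)        # T1: the efficient estimator
mse_median = np.mean(np.median(samples, axis=1) ** 2)  # T2: robust but less efficient

rel_eff = mse_mean / mse_median   # close to 2/pi ~ 0.64 for normal data
```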
  • 23. M-ESTIMATORS Introduction 1 First proposed by Huber (1973) 2 But the early ones had weaknesses in terms of one or more of the desired properties 3 From them the modern methods were developed SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 23 / 69
  • 24. M-ESTIMATORS Maximum Likelihood Type Estimators M-estimation is based on the idea that while we still want a maximum likelihood estimator, the errors might be better represented by a different, heavier-tailed distribution. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 24 / 69
  • 25. M-ESTIMATORS If the probability density function of the errors is f(εi), then the maximum likelihood estimator for β is that which maximizes the likelihood function ∏i f(εi) = ∏i f(yi − xi^T β) (10) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 25 / 69
  • 26. M-ESTIMATORS This means it also maximizes the log-likelihood function Σi ln f(εi) = Σi ln f(yi − xi^T β) (11) When the errors are normally distributed it has been shown that this leads to minimising the sum of squared residuals, which is the ordinary least squares method. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 26 / 69
  • 27. M-ESTIMATORS Assuming that the errors are differently distributed leads to the maximum likelihood estimator minimising a different function. Using this idea, an M-estimator β̂ minimizes Σi ρ(εi) = Σi ρ(yi − xi^T β) (12) where ρ(u) is a continuous, symmetric function called the objective function with a unique minimum at 0. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 27 / 69
  • 28. M-ESTIMATORS 1 Knowing the appropriate ρ(u) to use requires knowledge of how the errors are really distributed. 2 Functions are usually chosen through consideration of how the resulting estimator down-weights the larger residuals 3 A robust M-estimator achieves this by minimizing the sum of a less rapidly increasing objective function than the ρ(u) = u² of least squares SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 28 / 69
  • 29. M-ESTIMATORS Constructing a Scale Equivariant Estimator The M-estimator is not necessarily scale invariant i.e. if the errors yi − xi^T β were multiplied by a constant, the new solution to the above equation might not be the same as the scaled version of the old one. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 29 / 69
  • 30. M-ESTIMATORS To obtain a scale invariant version of this estimator we usually solve, Σi ρ(εi/s) = Σi ρ((yi − xi^T β)/s) (13) where s is a robust estimate of scale. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 30 / 69
  • 31. M-ESTIMATORS A popular choice for s is the re-scaled median absolute deviation s = 1.4826 × MAD (14) where MAD is the Median Absolute Deviation MAD = Median|yi − xi^T β̂| = Median|εi| (15) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 31 / 69
  • 32. M-ESTIMATORS ’s’ is highly resistant to outlying observations, with BDP 50%, as it is based on the median rather than the mean. The estimator rescales MAD by the factor 1.4826 so that when the sample is large and εi is really distributed as N(0, σ²), s estimates the standard deviation. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 32 / 69
  • 33. M-ESTIMATORS With a large sample and εi ∼ N(0, σ²): P(|εi| < MAD) ≈ 0.5 ⇒ P(|εi − 0|/σ < MAD/σ) ≈ 0.5 ⇒ P(|Z| < MAD/σ) ≈ 0.5 ⇒ MAD/σ ≈ Φ⁻¹(0.75) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 33 / 69
  • 34. M-ESTIMATORS ⇒ MAD/Φ⁻¹(0.75) ≈ σ ⇒ 1.4826 × MAD ≈ σ Thus the tuning constant 1.4826 makes s an approximately unbiased estimator of σ if n is large and the error distribution is normal. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 34 / 69
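The scale estimate of equations (14)-(15) is a one-liner, and its two advertised properties — it approximates σ for large normal samples, and it barely moves under contamination — can be checked directly (a sketch; residuals are taken as roughly centred at zero, as on the slide):

```python
import numpy as np

def mad_scale(residuals):
    """s = 1.4826 * Median|e_i|, the re-scaled MAD of Eqs. (14)-(15)."""
    return 1.4826 * np.median(np.abs(np.asarray(residuals, float)))

rng = np.random.default_rng(2)
e = rng.normal(0.0, 2.0, size=200_000)   # large sample with sigma = 2
s = mad_scale(e)                         # close to 2.0

e_bad = e.copy()
e_bad[:1000] = 100.0                     # 0.5% gross outliers
s_bad = mad_scale(e_bad)                 # still close to 2.0 (BDP 50%)
```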
  • 35. M-ESTIMATORS Finding an M-Estimator To obtain an M-estimate we solve, Minimize over β: Σi ρ(εi/s) = Σi ρ((yi − xi^T β)/s) (16) For that we equate the first partial derivatives of ρ with respect to βj (j = 0, 1, 2, ..., k) to zero, yielding a necessary condition for a minimum. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 35 / 69
  • 36. M-ESTIMATORS This gives a system of p = k + 1 equations Σi xij ψ((yi − xi^T β)/s) = 0, j = 0, 1, 2, ..., k (17) where ψ = ρ′ and xij is the ith observation on the jth regressor, with xi0 = 1. In general ψ is a non-linear function and so equation (17) must be solved iteratively. The most widely used method is the method of iteratively reweighted least squares. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 36 / 69
  • 37. M-ESTIMATORS To use iteratively reweighted least squares suppose that an initial estimate β̂0 is available and that s is an estimate of scale. Then we write the p = k + 1 equations as: Σi xij ψ((yi − xi^T β)/s) = Σi xij {ψ[(yi − xi^T β)/s] / [(yi − xi^T β)/s]} (yi − xi^T β)/s = 0 (18) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 37 / 69
  • 38. M-ESTIMATORS as Σi xij Wi0 (yi − xi^T β) = 0, j = 0, 1, 2, ..., k (19) where Wi0 = ψ[(yi − xi^T β̂0)/s] / [(yi − xi^T β̂0)/s] if yi ≠ xi^T β̂0, and Wi0 = 1 if yi = xi^T β̂0 (20) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 38 / 69
  • 39. M-ESTIMATORS We may write the above equations in matrix form as follows: X′W0Xβ = X′W0y (21) where W0 is an n × n diagonal matrix of weights with diagonal elements given by the expression Wi0 = ψ[(yi − xi^T β̂0)/s] / [(yi − xi^T β̂0)/s] if yi ≠ xi^T β̂0, and Wi0 = 1 if yi = xi^T β̂0 (22) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 39 / 69
  • 40. M-ESTIMATORS From the matrix form we realize that the expression is the same as the usual weighted least squares normal equations. Consequently the one-step estimator is β̂1 = (X′W0X)⁻¹X′W0y (23) At the next step we recompute the weights from the equation for W, but using β̂1 and not β̂0 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 40 / 69
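The iteration of equations (18)-(23) can be sketched in a few lines, here with Huber's ψ (t = 2) and the re-scaled MAD as the scale estimate. The helper names are mine, and this is only an illustration of the scheme — a real analysis would use a vetted implementation such as statsmodels' RLM:

```python
import numpy as np

def huber_psi(u, t=2.0):
    """Huber's psi: psi(u) = u for |u| <= t, t*sign(u) otherwise."""
    return np.clip(u, -t, t)

def irls_huber(X, y, t=2.0, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares for a Huber M-estimate (cf. Eqs. 18-23).
    X must already contain a column of ones for the intercept."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # beta^0: start from OLS
    for _ in range(max_iter):
        r = y - X @ beta
        s = 1.4826 * np.median(np.abs(r))                # robust scale, Eq. (14)
        u = r / s
        # Weights of Eq. (20), with w = 1 where the residual is exactly zero
        w = np.where(u == 0.0, 1.0, huber_psi(u, t) / np.where(u == 0.0, 1.0, u))
        sw = np.sqrt(w)
        # Weighted least squares step, Eq. (23)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta

# Demo: line y = 1 + 2x with three gross outliers; the M-estimate resists them.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=40)
y[:3] += 30.0
X = np.column_stack([np.ones_like(x), x])
beta_huber = irls_huber(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

As the slides note, only a few reweighting passes are usually needed before the estimates stop changing.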
  • 41. M-ESTIMATORS NOTE: Usually only a few iterations are required to obtain convergence It can easily be implemented in a computer programme. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 41 / 69
  • 42. M-ESTIMATORS Re-Descending Estimators Re-descending M-estimators are those which have influence functions that are non-decreasing near the origin but decreasing towards zero far from the origin. Their ψ can be chosen to redescend smoothly to zero, so that they usually satisfy ψ(x) = 0 for all |x| > r, where r is referred to as the minimum rejection point. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 42 / 69
  • 43. M-ESTIMATORS SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 43 / 69
  • 44. M-ESTIMATORS SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 44 / 69
  • 45. M-ESTIMATORS SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 45 / 69
  • 46. M-ESTIMATORS Robust Criterion Functions
  Criterion | ρ(z) | ψ(z) | w(z) | range
  Least Squares | z²/2 | z | 1.0 | |z| < ∞
  Huber’s t-function (t = 2) | z²/2 | z | 1.0 | |z| ≤ t
  | t|z| − t²/2 | t·sign(z) | t/|z| | |z| > t
  Andrew’s wave function | a(1 − cos(z/a)) | sin(z/a) | sin(z/a)/(z/a) | |z| ≤ aπ
  SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 46 / 69
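The ψ and weight columns of the table can be written out directly; a sketch with t = 2 and a = 1.48 as in the table (function names are my own):

```python
import numpy as np

def psi_huber(z, t=2.0):
    """Huber's t-function: psi(z) = z for |z| <= t, t*sign(z) otherwise."""
    z = np.asarray(z, float)
    return np.where(np.abs(z) <= t, z, t * np.sign(z))

def psi_andrews(z, a=1.48):
    """Andrew's wave function: psi(z) = sin(z/a) for |z| <= a*pi, 0 outside
    (re-descending: gross outliers get influence exactly zero)."""
    z = np.asarray(z, float)
    return np.where(np.abs(z) <= a * np.pi, np.sin(z / a), 0.0)

def weight(psi, z):
    """IRLS weight w(z) = psi(z)/z, with w(0) = 1 by continuity."""
    z = np.asarray(z, float)
    return np.where(z == 0.0, 1.0, psi(z) / np.where(z == 0.0, 1.0, z))
```

For example, `weight(psi_huber, 4.0)` gives 0.5 (the residual is halved in influence), while `psi_andrews(10.0)` is 0: beyond the rejection point the observation is ignored entirely.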
  • 47. DELIVERY TIME PROBLEM Problem A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (y) are the number of cases of product stocked (x1) and the distance walked by the route driver (x2). The engineer has collected 25 observations on delivery time, which are shown in the following table. Fit a regression model to it. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 47 / 69
  • 48. DELIVERY TIME PROBLEM Table of Data Observation Delivery time Number of cases Distance in Feet i (in minutes) y x1 x2 1 16.68 7 560 2 11.50 3 220 3 12.03 3 340 4 14.88 4 80 5 13.75 6 150 6 18.11 7 330 7 8 2 110 8 17.83 7 210 9 79.24 30 1460 10 21.50 5 605 11 40.33 16 688 12 21 10 215 13 13.50 4 255 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 48 / 69
  • 49. DELIVERY TIME PROBLEM Observation Delivery time Number of cases Distance in Feet (in minutes) y x1 x2 14 19.75 6 462 15 24.00 9 448 16 29.00 10 776 17 15.35 6 200 18 19.00 7 132 19 9.50 3 36 20 35.10 17 770 21 17.90 10 140 22 52.32 26 810 23 18.75 9 450 24 19.83 8 635 25 10.75 4 150 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 49 / 69
  • 50. DELIVERY TIME PROBLEM Least Squares Fit of the Delivery Time Data Obs. yi ˆyi ei Weight 1 .166800E+02 .217081E+02 -.502808E+01 .100000E+01 2 .115000E+02 .103536E+02 .114639E+01 .100000E+01 3 .120300E+02 .120798E+02 -.497937E-01 .100000E+01 4 .148800E+02 .995565E+01 .492435E+01 .100000E+01 5 .137500E+02 .141944E+02 -.444398E+00 .100000E+01 6 .181100E+02 .183996E+02 -.289574E+00 .100000E+01 7 .800000E+01 .715538E+01 .844624E+00 .100000E+01 8 .178300E+02 .166734E+02 .115660E+01 .100000E+01 9 .792400E+02 .718203E+02 .741971E+01 .100000E+01 10 .215000E+02 .191236E+02 .237641E+01 .100000E+01 11 .403300E+02 .380925E+02 .223749E+01 .100000E+01 12 .210000E+02 .215930E+02 -.593041E+00 .100000E+01 13 .135000E+02 .124730E+02 .102701E+01 .100000E+01 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 50 / 69
  • 51. DELIVERY TIME PROBLEM Obs. yi ˆyi ei Weight 14 .197500E+02 .186825E+02 .106754E+01 .100000E+01 15 .240000E+02 .233288E+02 .671202E+00 .100000E+01 16 .290000E+02 .296629E+02 -.662928E+00 .100000E+01 17 .153500E+02 .149136E+02 .436360E+00 .100000E+01 18 .190000E+02 .155514E+02 .344862E+01 .100000E+01 19 .950000E+01 .770681E+01 .179319E+01 .100000E+01 20 .351000E+02 .408880E+02 -.578797E+01 .100000E+01 21 .179000E+02 .205142E+02 -.261418E+01 .100000E+01 22 .523200E+02 .560065E+02 -.368653E+01 .100000E+01 23 .187500E+02 .233576E+02 -.460757E+01 .100000E+01 24 .198300E+02 .244029E+02 -.457285E+01 .100000E+01 25 .107500E+02 .109626E+02 -.212584E+00 .100000E+01 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 51 / 69
  • 52. DELIVERY TIME PROBLEM Accordingly we have the following values for the parameters: ˆβ0 = 2.3412 ˆβ1 = 1.6159 ˆβ2 = 0.014385 Thus we have the regression line as follows: yi = 2.3412 + 1.6159x1 + 0.014385x2 (24) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 52 / 69
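The least squares fit above can be reproduced directly from the tabulated data (with observation 1 read as 16.68 minutes and observation 2's distance as 220 ft, consistent with the residual table). A sketch, not the original computation:

```python
import numpy as np

# Delivery time data: y in minutes, x1 = cases stocked, x2 = distance in feet
y = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83, 79.24,
              21.50, 40.33, 21.00, 13.50, 19.75, 24.00, 29.00, 15.35, 19.00,
              9.50, 35.10, 17.90, 52.32, 18.75, 19.83, 10.75])
x1 = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4, 6, 9, 10, 6, 7,
               3, 17, 10, 26, 9, 8, 4], dtype=float)
x2 = np.array([560, 220, 340, 80, 150, 330, 110, 210, 1460, 605, 688, 215, 255,
               462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150], dtype=float)

X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # approx. (2.3412, 1.6159, 0.014385)
resid = y - X @ beta                           # obs. 9 has the largest residual, as in the table
```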
  • 53. DELIVERY TIME PROBLEM Huber’s t-Function, t=2 Obs. yi ˆyi ei Weight 1 .166800E+02 .217651E+02 -.508511E+01 .639744E+00 2 .115000E+02 .109809E+02 .519115E+00 .100000E+01 3 .120300E+02 .126296E+02 -.599594E+00 .100000E+01 4 .148800E+02 .105856E+02 .429439E+01 .757165E+00 5 .137500E+02 .146038E+02 -.853800E+00 .100000E+01 6 .181100E+02 .186051E+02 -.495085E+00 .100000E+01 7 .800000E+01 .794135E+01 .586521E-01 .100000E+01 8 .178300E+02 .169564E+02 .873625E+00 .100000E+01 9 .792400E+02 .692795E+02 .996050E+01 .327017E+00 10 .215000E+02 .193269E+02 .217307E+01 .100000E+01 11 .403300E+02 .372777E+02 .305228E+01 .100000E+01 12 .210000E+02 .216097E+02 -.609734E+00 .100000E+01 13 .135000E+02 .129900E+02 .510021E+00 .100000E+01 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 53 / 69
  • 54. DELIVERY TIME PROBLEM Obs. yi ˆyi ei Weight i 14 .197500E+02 .188904E+02 .859556E+00 .100000E+01 15 .240000E+02 .232828E+02 .717244E+00 .100000E+01 16 .290000E+02 .293174E+02 -.317449E+00 .100000E+01 17 .153500E+02 .152908E+02 .592377E-01 .100000E+01 18 .190000E+02 .158847E+02 .311529E+01 .100000E+01 19 .950000E+01 .845286E+01 .104714E+01 .100000E+01 20 .351000E+02 .399326E+02 -.483256E+01 .672828E+00 21 .179000E+02 .205793E+02 -.267929E+01 .100000E+01 22 .523200E+02 .542361E+02 -.191611E+01 .100000E+01 23 .187500E+02 .233102E+02 -.456023E+01 .713481E+00 24 .198300E+02 .243238E+02 .449377E+01 .723794E+00 25 .107500E+02 .115474E+02 -.797359E+00 .100000E+01 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 54 / 69
  • 55. DELIVERY TIME PROBLEM Accordingly we get the values of the parameters as follows: ˆβ0 = 3.3736 ˆβ1 = 1.5282 ˆβ2 = 0.013739 Thus we get the regression line as follows: yi = 3.3736 + 1.5282x1 + 0.013739x2 (25) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 55 / 69
  • 56. DELIVERY TIME PROBLEM Andrew’s Wave Function with a = 1.48 Obs. yi ˆyi ei Weight i 1 .166800E+02 .216430E+02 -.496300E+01 .427594E+00 2 .115000E+02 .116923E+02 -.192338E+00 .998944E+00 3 .120300E+02 .131457E+02 -.111570E+01 .964551E+00 4 .148800E+02 .114549E+02 .342506E+01 .694894E+00 5 .137500E+02 .152191E+02 -.146914E+01 .939284E+00 6 .181100E+02 .188574E+02 -.747381E+00 .984039E+00 7 .800000E+01 .890189E+01 .901888E+00 .976864E+00 8 .178300E+02 .174040E+02 .425984E+00 .994747E+00 9 .792400E+02 .660818E+02 .131582E+02 .0 10 .215000E+02 .192716E+02 .222839E+01 .863633E+00 11 .403300E+02 .363170E+02 .401296E+01 .597491E+00 12 .210000E+02 .218392E+02 -.839167E+00 .980003E+00 13 .135000E+02 .135744E+02 -.744338E-01 .999843E+00 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 56 / 69
  • 57. DELIVERY TIME PROBLEM Obs. yi ˆyi ei Weight i 14 .197500E+02 .198979E+02 .752115E+00 .983877E+00 15 .240000E+02 .232029E+02 .797080E+00 .981854E+00 16 .290000E+02 .286336E+02 .366350E+00 .996228E+00 17 .153500E+02 .158247E+02 -.474704E+00 .993580E+00 18 .190000E+02 .164593E+02 .254067E+01 .824146E+00 19 .950000E+01 .946384E+01 .361558E-01 .999936E+00 20 .351000E+02 .387684E+02 -.366837E+01 .655336E+00 21 .179000E+02 .209308E+02 -.303081E+01 .756603E+00 22 .523200E+02 .523766E+02 -.566063E-01 .999908E+00 23 .187500E+02 .232271E+02 -.447714E+01 .515506E+00 24 .198300E+02 .240095E+02 -.417955E+01 .567792E+00 25 .107500E+02 .123027E+02 -.155274E+01 .932266E+00 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 57 / 69
  • 58. DELIVERY TIME PROBLEM Thus we have the estimates as follows: ˆβ0 = 4.6532 ˆβ1 = 1.4582 ˆβ2 = 0.012111 Thus we get the regression line as follows: yi = 4.6532 + 1.4582x1 + 0.012111x2 (26) SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 58 / 69
  • 59. ANALYSIS Computing M-Estimators Robust regression methods are not an option in most statistical software today. SAS PROC NLIN, among others, can be used to implement the iteratively reweighted least squares procedure. There are also robust procedures available in S-Plus. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 59 / 69
  • 60. ANALYSIS Robust Regression Methods... Robust regression methods have much to offer a data analyst. They will be extremely helpful in locating outliers and highly influential observations. Whenever a least squares analysis is performed it would be useful to perform a robust fit as well. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 60 / 69
  • 61. ANALYSIS If the results of both fits are in substantial agreement, the use of the least squares procedure offers a good estimation of the parameters. If the results of the two procedures are not in agreement, the reason for the difference should be identified and corrected. Special attention needs to be given to observations that are down-weighted in the robust fit. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 61 / 69
  • 62. PROPERTIES Breakdown Point The finite sample breakdown point is the smallest fraction of anomalous data that can cause the estimator to be useless. The smallest possible breakdown point is 1/n, i.e. a single observation can distort the estimator so badly that it is of no practical use to the regression model builder. The breakdown point of OLS is 1/n. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 62 / 69
  • 63. PROPERTIES M-estimators can be affected by x-space outliers in an identical manner to OLS. Consequently, the breakdown point of the class of M-estimators is 1/n as well. We would generally want the breakdown point of an estimator to exceed 10%. This has led to the development of high breakdown point estimators. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 63 / 69
  • 64. PROPERTIES Efficiency The M-estimators have higher efficiency than least squares when the errors are heavy-tailed, and they remain well behaved as the size of the sample increases to ∞. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 64 / 69
  • 65. SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS High Breakdown Point Estimators Because both the OLS and M-estimators suffer from a low breakdown point of 1/n, considerable effort has been devoted to finding estimators that perform better with respect to this property. Often a breakdown point of 50% is desirable. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 65 / 69
  • 66. SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS There are various other estimation procedures like Least Median of Squares Least Trimmed Sum of Squares S Estimators R and L Estimators Robust Ridge Regression MM Estimation etc. SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 66 / 69
  • 67. ABSTRACT & CONCLUSION Review ⇒ Robustness and Resistance ⇒ Our Approach ⇒ Strengths and Weaknesses ⇒ M-Estimators ⇒ Delivery time problem ⇒ Analysis ⇒ Properties ⇒ Survey of other Robust Regression Estimators SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 67 / 69
  • 68. REFERENCE 1 Draper, R Norman. & Smith, Harry. “Applied Regression Analysis”, 3rd edn., John Wiley and Sons, New York, 1998. 2 Montgomery, C Douglas. Peck, A Elizabeth. & Vining, Geoffrey G. “Introduction to Linear Regression Analysis”, 3rd edn., Wiley India, 2003. 3 Brook J, Richard. “Applied Regression Analysis and Experimental Design”, Chapman & Hall, London, 1985. 4 Rawlings O, John. “Applied Regression Analysis: A Research Tool”, Springer, New York, 1989. 5 Pedhazur, Elazar J. “Multiple Regression in Behavioural Research: Explanation and Prediction”, Wadsworth, Australia, 1997 SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 68 / 69
  • 69. THANK YOU SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 69 / 69