We present a robust solution to the classification and variable selection problem when the dimension of the data, or number of predictor variables, may greatly exceed the number of observations. When faced with the problem of classifying objects given many measured attributes of the objects, the goal is to build a model that makes the most accurate predictions using only the most meaningful subset of the available measurements. The introduction of L1 regularized model fitting has inspired many approaches that perform model fitting and variable selection simultaneously. If parametric models are employed, the standard approach is some form of regularized maximum likelihood estimation. This is an asymptotically efficient procedure under very general conditions, provided that the model is specified correctly. Correctly specifying a model, however, is not trivial. Even a few outliers in an otherwise clean sample can result in a very poor model. In contrast, minimizing the integrated squared error, while less efficient, proves to be robust to a fair amount of contamination. We propose to fit logistic models using this alternative criterion to address the possibility of model misspecification. The resulting method may be considered a robust variant of regularized maximum likelihood methods for high dimensional data.
Robust parametric classification and variable selection with minimum distance estimation
1. Robust parametric classification and variable selection with minimum distance estimation
Eric Chi (a, b, 1) with David W. Scott (a, 2)
(a) Department of Statistics, Rice University
(b) Baylor College of Medicine
June 17, 2010
(1) Supported by DOE DE-FG02-97ER25308
(2) Supported by NSF DMS-09-07491
2. Outline
The binary regression problem
The L2E Method
Estimation
Variable Selection
Simulations
Conclusion
3. Outline (next: The binary regression problem)
4. Logistic Regression
Suppose we wish to predict $y \in \{0, 1\}^n$ using $X \in \mathbb{R}^{n \times p}$.
The number of features $p$ could be very large.
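For concreteness, here are the model and the criterion that the MLE minimizes; these are the standard logistic regression definitions, stated here although the slide leaves them implicit ($F$ is defined later in the talk):
$$\Pr(y_i = 1 \mid x_i) = F(x_i^T \beta), \qquad F(u) = \frac{1}{1 + e^{-u}},$$
$$\hat{\beta}_{\mathrm{MLE}} = \operatorname*{argmin}_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log F(x_i^T \beta) + (1 - y_i) \log\!\left(1 - F(x_i^T \beta)\right) \right].$$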
7. MLE is sensitive to outliers
Likelihood-based choice: outlier or not, the MLE puts mass wherever data lie.
Cost: the MLE also puts mass over regions where there is no data.
8. MLE is sensitive to outliers
[Figure: fitted Pr(Y = 1) versus X for a one-dimensional sample with labels 0 and 1; the vertical axis runs from 0.0 to 1.0, the horizontal axis from −6 to 6.]
There are no ’ones’ between -4 and -2.
But P(Y = 1|X ∈ (−4, −2)) ↑.
There are no ’zeros’ between 4 and 6.
But P(Y = 0|X ∈ (4, 6)) ↑.
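A minimal numerical illustration of this sensitivity, using only numpy and scipy (the data here are synthetic and not the talk's; the two mislabeled points at the far left play the role of the outliers):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(beta, X, y):
    # Negative logistic log-likelihood; np.logaddexp(0, u) = log(1 + exp(u)).
    u = X @ beta
    return np.sum(np.logaddexp(0.0, u) - y * u)

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(-3, 0, 50), rng.uniform(0, 3, 50), [-4.5, -4.0]])
y = np.concatenate([np.zeros(50), np.ones(50), [1.0, 1.0]])  # two mislabeled outliers
X = np.column_stack([np.ones_like(x), x])  # intercept plus one feature

clean = minimize(neg_log_lik, np.zeros(2), args=(X[:-2], y[:-2]))
dirty = minimize(neg_log_lik, np.zeros(2), args=(X, y))
print(clean.x, dirty.x)  # the slope flattens once the outliers enter the fit
```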
9. Outline (next: The L2E Method)
10. The L2 distance as an alternative to the deviance loss.
$g$: the unknown true density.
$f_\theta$: a putative parametric density.
Find the $\theta$ that minimizes the ISE:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \int \left( f_\theta(x) - g(x) \right)^2 \, dx.$$
11. The L2E Method
The equivalent empirical criterion:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \int f_\theta(x)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} f_\theta(X_i) \right],$$
where $X_i \in \mathbb{R}^p$ is the covariate vector of the $i$th observation.
This is the L2 estimator, or L2E [Scott, 2001].
The criterion is a familiar quantity: it also appears in smoothing parameter selection for nonparametric density estimation.
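The step from the ISE to this criterion is a short expansion worth recording: the square separates into three terms, the $\int g^2$ term is constant in $\theta$, and the cross term is a mean under $g$, estimable by the sample mean:
$$\int (f_\theta - g)^2 \, dx = \int f_\theta^2 \, dx - 2 \operatorname{E}_g f_\theta(X) + \int g^2 \, dx, \qquad \operatorname{E}_g f_\theta(X) \approx \frac{1}{n} \sum_{i=1}^{n} f_\theta(X_i).$$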
12. Density-power divergence
The L2E and the MLE are empirical minimizers of two different points in a spectrum of divergence measures [Basu et al, 1998]:
$$d_\gamma(g, f_\theta) = \int \left[ f_\theta^{1+\gamma}(z) - \left( 1 + \frac{1}{\gamma} \right) g(z) f_\theta^{\gamma}(z) + \frac{1}{\gamma} g^{1+\gamma}(z) \right] dz,$$
where $\gamma > 0$ trades off efficiency for robustness:
$\gamma = 1 \implies$ L2 loss.
$\gamma \to 0 \implies$ Kullback–Leibler divergence.
13. Robustness of the L2 distance
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \int \left( f_\theta(x) - g(x) \right)^2 \, dx.$$
The L2 distance is zero-forcing: $g(x) = 0$ forces $f_\theta(x) = 0$.
This puts a premium on avoiding "false positives".
The L2E balances mass where data are present vs. no mass where data are absent.
14. Partial Densities: An extra degree of freedom
Expand the search space [Scott, 2001]:
$$\int \left( w f_\theta(x) - g(x) \right)^2 \, dx.$$
Fit a parametric model to only a fraction, $w$, of the data (hopefully the fraction described well by the parametric model!).
$$(\hat{\theta}, \hat{w}) = \operatorname*{argmin}_{\theta, w} \left[ w^2 \int f_\theta(x)^2 \, dx - \frac{2w}{n} \sum_{i=1}^{n} f_\theta(X_i) \right].$$
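Since the criterion is quadratic in $w$, the optimal weight for a given $\theta$ has a closed form (a quick consequence of the display above, not shown on the slide; in practice it would be clipped to $[0, 1]$):
$$\hat{w}(\theta) = \frac{\frac{1}{n} \sum_{i=1}^{n} f_\theta(X_i)}{\int f_\theta(x)^2 \, dx}.$$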
15. Logistic L2E loss
Let $F(u) = 1/(1 + \exp(-u))$, the logistic function. Then
$$(\hat{\beta}, \hat{w}) = \operatorname*{argmin}_{\beta,\, w \in [0,1]} \; \frac{w^2}{n} \sum_{i=1}^{n} \left[ F(x_i^T \beta)^2 + \left( 1 - F(x_i^T \beta) \right)^2 \right] - \frac{2w}{n} \sum_{i=1}^{n} \left[ y_i F(x_i^T \beta) + (1 - y_i) \left( 1 - F(x_i^T \beta) \right) \right].$$
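A minimal sketch of this criterion in code, evaluated at a fixed $(\beta, w)$; the function and variable names are mine, not from the talk:

```python
import numpy as np

def logistic_l2e_loss(beta, w, X, y):
    """Logistic L2E loss from the slide: per observation,
    w^2 * sum_{y in {0,1}} f(y|x)^2 - 2w * f(y_i|x_i), averaged over i."""
    F = 1.0 / (1.0 + np.exp(-X @ beta))     # F(x_i' beta)
    integral = F**2 + (1.0 - F)**2          # sum over y in {0, 1} of f(y|x)^2
    cross = y * F + (1.0 - y) * (1.0 - F)   # f(y_i | x_i)
    return np.mean(w**2 * integral - 2.0 * w * cross)
```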
16. Two dimensional example
[Figure: scatter plot of the sample in the (X1, X2) plane; X1 runs from −5 to 10, X2 from −4 to 4.]
n = 300 and p = 2.
Three clusters each of size 100
Two are labelled 0
One is labelled 1
18. [Figure: two panels of fitted decision boundaries in the (X1, X2) plane. Panel (c) L2E B: ŵ = 0.666; panel (d) L2E C: ŵ = 0.668.]
19. Outline (next: Estimation)
20. The optimization problem
Challenges
The L2E loss is not convex, and its Hessian can be indefinite.
Standard Newton-Raphson therefore fails.
Scalability and stability as p increases?
Solution
Majorization-Minimization
21. Majorization-Minimization
Strategy
Minimize a surrogate function, the majorization, instead of the objective.
Choose the surrogate such that:
↓ surrogate ⟹ ↓ objective, and
the surrogate is easier to minimize than the objective.
22. Majorization-Minimization
Definition
Given f and g , real-valued functions on Rp , g majorizes f at x if
1. g (x) = f (x) and
2. g (u) ≥ f (u) for all u.
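These two conditions give the descent property directly (the one-line argument the slides leave implicit): if $g$ majorizes $f$ at $x^{(m)}$ and $x^{(m+1)}$ minimizes $g$, then
$$f(x^{(m+1)}) \leq g(x^{(m+1)}) \leq g(x^{(m)}) = f(x^{(m)}),$$
so driving the surrogate down drives the objective down.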
23-32. [Animation, ten frames: "Lack of fit" (more to less) plotted over "The spectrum of logistic models" (very bad, optimal, less bad), illustrating successive majorization-minimization steps moving the iterate toward the optimum.]
33. Quadratic majorization of the logistic L2E loss
Fix $w$. The loss has bounded curvature with respect to $\beta$, so replacing the Hessian in the exact second-order Taylor expansion with an upper bound yields a quadratic majorization. The update is
$$\beta^{(m+1)} = \beta^{(m)} - \frac{1}{K} (X^T X)^{-1} X^T Z^{(m)},$$
where
$$K \geq \frac{1}{4} \max_{z \in [-1, 1]} \left( \frac{3}{2} w z^4 - z^3 - 2 w z^2 + z + \frac{w}{2} \right).$$
$K$ controls the step size; its lower bound is related to the maximum curvature of the loss.
$Z^{(m)}$ is a working response that depends on $Y$ and $X \beta^{(m)}$.
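A sketch of one MM step under these formulas. The talk does not spell out $Z^{(m)}$; the expression below is the per-observation gradient factor obtained by differentiating the logistic L2E loss, which is one natural choice, so its exact scaling should be treated as an assumption:

```python
import numpy as np

def l2e_mm_step(beta, w, X, y, K):
    """One majorization-minimization update for the logistic L2E loss
    with w held fixed: beta_new = beta - (1/K) (X'X)^{-1} X' Z."""
    n = X.shape[0]
    F = 1.0 / (1.0 + np.exp(-X @ beta))
    Fp = F * (1.0 - F)  # derivative of the logistic function
    # Working response: the gradient of the loss is X' Z with this Z.
    Z = (2.0 / n) * Fp * (w**2 * (2.0 * F - 1.0) - w * (2.0 * y - 1.0))
    return beta - np.linalg.solve(X.T @ X, X.T @ Z) / K
```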
34. Outline (next: Variable Selection)
35. Continuous variable selection with the LASSO
Minimize
$$\text{``L2E loss''} + \lambda \sum_{i=1}^{p} |\beta_i|.$$
A penalized majorization of the loss majorizes the penalized loss, so instead minimize
$$\text{``majorization of L2E loss''} + \lambda \sum_{i=1}^{p} |\beta_i|.$$
36. Coordinate Descent
Suppose $X$ is standardized. Then
$$\beta_k^{(m+1)} = S\!\left( \beta_k^{(m)} - \frac{1}{K} X_{(k)}^T Z^{(m)}, \; \lambda \right),$$
where $S$ is the soft-threshold function
$$S(x, \lambda) = \operatorname{sign}(x) \max(|x| - \lambda, 0).$$
Extension to elastic net is straightforward.
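The soft-threshold function and a single coordinate update take only a few lines; as above, $Z$ is the working response and the names are mine:

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lambda) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def coordinate_update(beta, k, X, Z, K, lam):
    """Update coordinate k per the slide, assuming X is standardized."""
    beta = beta.copy()
    beta[k] = soft_threshold(beta[k] - X[:, k] @ Z / K, lam)
    return beta
```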
37. Heuristic Model Selection
Regularization path: calculate the penalized regression coefficients over a range of λ values.
Information criterion: for each λ, compute the deviance loss at the L2E coefficients and add a correction term (AIC or BIC).
Select the model with the lowest AIC/BIC value.
Use the number of non-zero penalized regression coefficients as the degrees of freedom [Zou et al, 2007].
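A sketch of this heuristic, assuming the path of coefficient vectors has already been computed (the deviance here is the usual binomial one; which correction term is "correct" is exactly the open question raised at the end of the talk):

```python
import numpy as np

def deviance(beta, X, y):
    """Binomial deviance, i.e. twice the negative log-likelihood."""
    u = X @ beta
    return 2.0 * np.sum(np.logaddexp(0.0, u) - y * u)

def select_lambda(betas, lambdas, X, y, criterion="bic"):
    """Pick the lambda minimizing AIC/BIC, with df = #nonzero coefficients."""
    penalty = 2.0 if criterion == "aic" else np.log(X.shape[0])
    scores = [deviance(b, X, y) + penalty * np.count_nonzero(b) for b in betas]
    return lambdas[int(np.argmin(scores))]
```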
46. Simulations: Variable Selection
n = 200, p = 1000.
$X_i \mid \text{Group 1} \sim \text{i.i.d. } N(\mu, \sigma)$ and $X_i \mid \text{Group 2} \sim \text{i.i.d. } N(-\mu, \sigma)$.
$\beta = (1, 1, 1, 1, 0, \ldots, 0)$.
$Y_i \mid X_i \sim \text{independent } \operatorname{Bern}(F(X_i^T \beta))$.
1,000 replicates.
Single outlier: moved along a ray starting at the centroid of one group and pointing away along $(1, 1, 1, 1, 0, \ldots, 0)$.
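One way to generate a replicate of this design; the values of µ and σ are not given in the slides, so those below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000
mu, sigma = 1.0, 1.0                       # assumed; the talk does not specify
beta = np.zeros(p)
beta[:4] = 1.0                             # beta = (1, 1, 1, 1, 0, ..., 0)
group = rng.integers(0, 2, size=n)         # 0 -> Group 1, 1 -> Group 2
means = np.where(group[:, None] == 0, mu, -mu)
X = rng.normal(means, sigma, size=(n, p))  # features centered at +/- mu
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
```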
47. Average number of correct variables selected
[Figure: average number of correct variables selected (0 to 4) versus outlier relative position (0 to 9.5), in AIC and BIC panels; curves for Expectation, MLE, L2E with w = 1, and L2E with w = w_opt.]
48. Average number of incorrect variables selected
[Figure: average number of incorrect variables selected (0 to 140) versus outlier relative position (0 to 9.5), in AIC and BIC panels; curves for Expectation, MLE, L2E with w = 1, and L2E with w = w_opt.]
50. Outline (next: Conclusion)
51. Summary
MLE logistic regression is sensitive to implosion breakdown.
Both estimation and variable selection are affected: contaminants reduce the signal-to-noise ratio.
The L2E is robust because it is zero-forcing.
Majorization-Minimization plus coordinate descent make the optimization fast and stable.
52. Future work
Is w worth optimizing over?
What is the correct AIC or BIC formulation?
What are the degrees of freedom in the L2E loss model?
53. References
D.W. Scott. Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3):274–285, 2001.
A. Basu et al. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
H. Zou et al. On the "degrees of freedom" of the lasso. Annals of Statistics, 35(5):2173–2192, 2007.