We present a robust solution to the classification and variable selection problem when the dimension of the data, or number of predictor variables, may greatly exceed the number of observations. When faced with the problem of classifying objects given many measured attributes of the objects, the goal is to build a model that makes the most accurate predictions using only the most meaningful subset of the available measurements. The introduction of L1 regularized model fitting has inspired many approaches that perform model fitting and variable selection simultaneously. If parametric models are employed, the standard approach is some form of regularized maximum likelihood estimation. This is an asymptotically efficient procedure under very general conditions, provided that the model is specified correctly. Correctly specifying a model, however, is not trivial. Even a few outliers in an otherwise clean sample can result in a very poor model. In contrast, minimizing the integrated squared error, while less efficient, proves to be robust to a fair amount of contamination. We propose to fit logistic models using this alternative criterion to address the possibility of model misspecification. The resulting method may be considered a robust variant of regularized maximum likelihood methods for high dimensional data.
Robust parametric classification and variable selection with minimum distance estimation
1. Robust parametric classification and variable selection with minimum distance estimation
Eric Chi (a, b, 1) with David W. Scott (a, 2)
(a) Department of Statistics, Rice University
(b) Baylor College of Medicine
June 17, 2010
(1) Supported by DOE DE-FG02-97ER25308
(2) Supported by NSF DMS-09-07491
2. Outline
The binary regression problem
The L2E Method
Estimation
Variable Selection
Simulations
Conclusion
3. Outline (next: The binary regression problem)
4. Logistic Regression
Suppose we wish to predict $y \in \{0, 1\}^n$ using $X \in \mathbb{R}^{n \times p}$.
The number of features $p$ could be very large.
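For concreteness, here are the model and the criterion that the MLE minimizes; these are the standard logistic regression definitions, stated here although the slide leaves them implicit ($F$ is defined later in the talk):
$$\Pr(y_i = 1 \mid x_i) = F(x_i^T \beta), \qquad F(u) = \frac{1}{1 + e^{-u}},$$
$$\hat{\beta}_{\mathrm{MLE}} = \operatorname*{argmin}_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log F(x_i^T \beta) + (1 - y_i) \log\!\left(1 - F(x_i^T \beta)\right) \right].$$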
7. MLE is sensitive to outliers
Likelihood-based choice: outlier or not, the MLE puts mass wherever data lie.
Cost: the MLE also puts mass over regions where there is no data.
8. MLE is sensitive to outliers
[Figure: fitted Pr(Y = 1) versus X for a one-dimensional sample with labels 0 and 1; the vertical axis runs from 0.0 to 1.0, the horizontal axis from −6 to 6.]
There are no ’ones’ between -4 and -2.
But P(Y = 1|X ∈ (−4, −2)) ↑.
There are no ’zeros’ between 4 and 6.
But P(Y = 0|X ∈ (4, 6)) ↑.
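A minimal numerical illustration of this sensitivity, using only numpy and scipy (the data here are synthetic and not the talk's; the two mislabeled points at the far left play the role of the outliers):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(beta, X, y):
    # Negative logistic log-likelihood; np.logaddexp(0, u) = log(1 + exp(u)).
    u = X @ beta
    return np.sum(np.logaddexp(0.0, u) - y * u)

rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(-3, 0, 50), rng.uniform(0, 3, 50), [-4.5, -4.0]])
y = np.concatenate([np.zeros(50), np.ones(50), [1.0, 1.0]])  # two mislabeled outliers
X = np.column_stack([np.ones_like(x), x])  # intercept plus one feature

clean = minimize(neg_log_lik, np.zeros(2), args=(X[:-2], y[:-2]))
dirty = minimize(neg_log_lik, np.zeros(2), args=(X, y))
print(clean.x, dirty.x)  # the slope flattens once the outliers enter the fit
```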
9. Outline (next: The L2E Method)
10. The L2 distance as an alternative to the deviance loss.
$g$: the unknown true density.
$f_\theta$: a putative parametric density.
Find the $\theta$ that minimizes the ISE:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \int \left( f_\theta(x) - g(x) \right)^2 \, dx.$$
11. The L2E Method
The equivalent empirical criterion:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \left[ \int f_\theta(x)^2 \, dx - \frac{2}{n} \sum_{i=1}^{n} f_\theta(X_i) \right],$$
where $X_i \in \mathbb{R}^p$ is the covariate vector of the $i$th observation.
This is the L2 estimator, or L2E [Scott, 2001].
The criterion is a familiar quantity: it also appears in smoothing parameter selection for nonparametric density estimation.
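The step from the ISE to this criterion is a short expansion worth recording: the square separates into three terms, the $\int g^2$ term is constant in $\theta$, and the cross term is a mean under $g$, estimable by the sample mean:
$$\int (f_\theta - g)^2 \, dx = \int f_\theta^2 \, dx - 2 \operatorname{E}_g f_\theta(X) + \int g^2 \, dx, \qquad \operatorname{E}_g f_\theta(X) \approx \frac{1}{n} \sum_{i=1}^{n} f_\theta(X_i).$$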
12. Density-power divergence
The L2E and the MLE are empirical minimizers of two different points in a spectrum of divergence measures [Basu et al, 1998]:
$$d_\gamma(g, f_\theta) = \int \left[ f_\theta^{1+\gamma}(z) - \left( 1 + \frac{1}{\gamma} \right) g(z) f_\theta^{\gamma}(z) + \frac{1}{\gamma} g^{1+\gamma}(z) \right] dz,$$
where $\gamma > 0$ trades off efficiency for robustness:
$\gamma = 1 \implies$ L2 loss.
$\gamma \to 0 \implies$ Kullback–Leibler divergence.
13. Robustness of the L2 distance
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \int \left( f_\theta(x) - g(x) \right)^2 \, dx.$$
The L2 distance is zero-forcing: $g(x) = 0$ forces $f_\theta(x) = 0$.
This puts a premium on avoiding "false positives".
The L2E balances mass where data are present vs. no mass where data are absent.
14. Partial Densities: An extra degree of freedom
Expand the search space [Scott, 2001]:
$$\int \left( w f_\theta(x) - g(x) \right)^2 \, dx.$$
Fit a parametric model to only a fraction, $w$, of the data (hopefully the fraction described well by the parametric model!).
$$(\hat{\theta}, \hat{w}) = \operatorname*{argmin}_{\theta, w} \left[ w^2 \int f_\theta(x)^2 \, dx - \frac{2w}{n} \sum_{i=1}^{n} f_\theta(X_i) \right].$$
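Since the criterion is quadratic in $w$, the optimal weight for a given $\theta$ has a closed form (a quick consequence of the display above, not shown on the slide; in practice it would be clipped to $[0, 1]$):
$$\hat{w}(\theta) = \frac{\frac{1}{n} \sum_{i=1}^{n} f_\theta(X_i)}{\int f_\theta(x)^2 \, dx}.$$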
15. Logistic L2E loss
Let $F(u) = 1/(1 + \exp(-u))$, the logistic function. Then
$$(\hat{\beta}, \hat{w}) = \operatorname*{argmin}_{\beta,\, w \in [0,1]} \; \frac{w^2}{n} \sum_{i=1}^{n} \left[ F(x_i^T \beta)^2 + \left( 1 - F(x_i^T \beta) \right)^2 \right] - \frac{2w}{n} \sum_{i=1}^{n} \left[ y_i F(x_i^T \beta) + (1 - y_i) \left( 1 - F(x_i^T \beta) \right) \right].$$
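A minimal sketch of this criterion in code, evaluated at a fixed $(\beta, w)$; the function and variable names are mine, not from the talk:

```python
import numpy as np

def logistic_l2e_loss(beta, w, X, y):
    """Logistic L2E loss from the slide: per observation,
    w^2 * sum_{y in {0,1}} f(y|x)^2 - 2w * f(y_i|x_i), averaged over i."""
    F = 1.0 / (1.0 + np.exp(-X @ beta))     # F(x_i' beta)
    integral = F**2 + (1.0 - F)**2          # sum over y in {0, 1} of f(y|x)^2
    cross = y * F + (1.0 - y) * (1.0 - F)   # f(y_i | x_i)
    return np.mean(w**2 * integral - 2.0 * w * cross)
```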
16. Two dimensional example
[Figure: scatter plot of the sample in the (X1, X2) plane; X1 runs from −5 to 10, X2 from −4 to 4.]
n = 300 and p = 2.
Three clusters each of size 100
Two are labelled 0
One is labelled 1
18. [Figure: two panels of fitted decision boundaries in the (X1, X2) plane. Panel (c) L2E B: ŵ = 0.666; panel (d) L2E C: ŵ = 0.668.]
19. Outline (next: Estimation)
20. The optimization problem
Challenges
The L2E loss is not convex, and its Hessian can be indefinite.
Standard Newton-Raphson therefore fails.
Scalability and stability as p increases?
Solution
Majorization-Minimization
21. Majorization-Minimization
Strategy
Minimize a surrogate function, the majorization, instead of the objective.
Choose the surrogate such that:
↓ surrogate ⟹ ↓ objective, and
the surrogate is easier to minimize than the objective.
22. Majorization-Minimization
Definition
Given f and g , real-valued functions on Rp , g majorizes f at x if
1. g (x) = f (x) and
2. g (u) ≥ f (u) for all u.
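These two conditions give the descent property directly (the one-line argument the slides leave implicit): if $g$ majorizes $f$ at $x^{(m)}$ and $x^{(m+1)}$ minimizes $g$, then
$$f(x^{(m+1)}) \leq g(x^{(m+1)}) \leq g(x^{(m)}) = f(x^{(m)}),$$
so driving the surrogate down drives the objective down.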
23-32. [Animation, ten frames: "Lack of fit" (more to less) plotted over "The spectrum of logistic models" (very bad, optimal, less bad), illustrating successive majorization-minimization steps moving the iterate toward the optimum.]
33. Quadratic majorization of the logistic L2E loss
Fix $w$. The loss has bounded curvature with respect to $\beta$, so replacing the Hessian in the exact second-order Taylor expansion with an upper bound yields a quadratic majorization. The update is
$$\beta^{(m+1)} = \beta^{(m)} - \frac{1}{K} (X^T X)^{-1} X^T Z^{(m)},$$
where
$$K \geq \frac{1}{4} \max_{z \in [-1, 1]} \left( \frac{3}{2} w z^4 - z^3 - 2 w z^2 + z + \frac{w}{2} \right).$$
$K$ controls the step size; its lower bound is related to the maximum curvature of the loss.
$Z^{(m)}$ is a working response that depends on $Y$ and $X \beta^{(m)}$.
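A sketch of one MM step under these formulas. The talk does not spell out $Z^{(m)}$; the expression below is the per-observation gradient factor obtained by differentiating the logistic L2E loss, which is one natural choice, so its exact scaling should be treated as an assumption:

```python
import numpy as np

def l2e_mm_step(beta, w, X, y, K):
    """One majorization-minimization update for the logistic L2E loss
    with w held fixed: beta_new = beta - (1/K) (X'X)^{-1} X' Z."""
    n = X.shape[0]
    F = 1.0 / (1.0 + np.exp(-X @ beta))
    Fp = F * (1.0 - F)  # derivative of the logistic function
    # Working response: the gradient of the loss is X' Z with this Z.
    Z = (2.0 / n) * Fp * (w**2 * (2.0 * F - 1.0) - w * (2.0 * y - 1.0))
    return beta - np.linalg.solve(X.T @ X, X.T @ Z) / K
```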
34. Outline (next: Variable Selection)
35. Continuous variable selection with the LASSO
Minimize
$$\text{``L2E loss''} + \lambda \sum_{i=1}^{p} |\beta_i|.$$
A penalized majorization of the loss majorizes the penalized loss, so instead minimize
$$\text{``majorization of L2E loss''} + \lambda \sum_{i=1}^{p} |\beta_i|.$$
36. Coordinate Descent
Suppose $X$ is standardized. Then
$$\beta_k^{(m+1)} = S\!\left( \beta_k^{(m)} - \frac{1}{K} X_{(k)}^T Z^{(m)}, \; \lambda \right),$$
where $S$ is the soft-threshold function
$$S(x, \lambda) = \operatorname{sign}(x) \max(|x| - \lambda, 0).$$
Extension to elastic net is straightforward.
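The soft-threshold function and a single coordinate update take only a few lines; as above, $Z$ is the working response and the names are mine:

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lambda) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def coordinate_update(beta, k, X, Z, K, lam):
    """Update coordinate k per the slide, assuming X is standardized."""
    beta = beta.copy()
    beta[k] = soft_threshold(beta[k] - X[:, k] @ Z / K, lam)
    return beta
```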
37. Heuristic Model Selection
Regularization path: calculate the penalized regression coefficients over a range of λ values.
Information criterion: for each λ, compute the deviance loss at the L2E coefficients and add a correction term (AIC or BIC).
Select the model with the lowest AIC/BIC value.
Use the number of non-zero penalized regression coefficients as the degrees of freedom [Zou et al, 2007].
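A sketch of this heuristic, assuming the path of coefficient vectors has already been computed (the deviance here is the usual binomial one; which correction term is "correct" is exactly the open question raised at the end of the talk):

```python
import numpy as np

def deviance(beta, X, y):
    """Binomial deviance, i.e. twice the negative log-likelihood."""
    u = X @ beta
    return 2.0 * np.sum(np.logaddexp(0.0, u) - y * u)

def select_lambda(betas, lambdas, X, y, criterion="bic"):
    """Pick the lambda minimizing AIC/BIC, with df = #nonzero coefficients."""
    penalty = 2.0 if criterion == "aic" else np.log(X.shape[0])
    scores = [deviance(b, X, y) + penalty * np.count_nonzero(b) for b in betas]
    return lambdas[int(np.argmin(scores))]
```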
46. Simulations: Variable Selection
n = 200, p = 1000.
$X_i \mid \text{Group 1} \sim \text{i.i.d. } N(\mu, \sigma)$ and $X_i \mid \text{Group 2} \sim \text{i.i.d. } N(-\mu, \sigma)$.
$\beta = (1, 1, 1, 1, 0, \ldots, 0)$.
$Y_i \mid X_i \sim \text{independent } \operatorname{Bern}(F(X_i^T \beta))$.
1,000 replicates.
Single outlier: moved along a ray starting at the centroid of one group and pointing away along $(1, 1, 1, 1, 0, \ldots, 0)$.
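One way to generate a replicate of this design; the values of µ and σ are not given in the slides, so those below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000
mu, sigma = 1.0, 1.0                       # assumed; the talk does not specify
beta = np.zeros(p)
beta[:4] = 1.0                             # beta = (1, 1, 1, 1, 0, ..., 0)
group = rng.integers(0, 2, size=n)         # 0 -> Group 1, 1 -> Group 2
means = np.where(group[:, None] == 0, mu, -mu)
X = rng.normal(means, sigma, size=(n, p))  # features centered at +/- mu
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
```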
47. Average number of correct variables selected
[Figure: average number of correct variables selected (0 to 4) versus outlier relative position (0 to 9.5), in AIC and BIC panels; curves for Expectation, MLE, L2E with w = 1, and L2E with w = w_opt.]
48. Average number of incorrect variables selected
[Figure: average number of incorrect variables selected (0 to 140) versus outlier relative position (0 to 9.5), in AIC and BIC panels; curves for Expectation, MLE, L2E with w = 1, and L2E with w = w_opt.]
50. Outline (next: Conclusion)
51. Summary
MLE logistic regression is sensitive to implosion breakdown.
Both estimation and variable selection are affected: contaminants reduce the signal-to-noise ratio.
The L2E is robust because it is zero-forcing.
Majorization-Minimization plus coordinate descent make the optimization fast and stable.
52. Future work
Is w worth optimizing over?
What is the correct AIC or BIC formulation?
What are the degrees of freedom in the L2E loss model?
53. References
D.W. Scott. Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3):274–285, 2001.
A. Basu et al. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
H. Zou et al. On the "degrees of freedom" of the lasso. Annals of Statistics, 35(5):2173–2192, 2007.