Understanding SparseNet: Theory and Practice
Nicholas Dronen
July 2, 2013
Overview
• First half technical, then practical
• Theory is necessary for intuition of good practice
• Practice: examples using R
What is SparseNet?
• sparsenet: an R package (2012) based on the MC+ algorithm
• Related packages: lars (2003), glmnet (2008)
• More genealogy later
• SparseNet fits linear models
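The examples later in the talk assume these packages are installed; a minimal setup sketch (package names as published on CRAN at the time of the talk):

# One-time setup for the examples in this talk.
install.packages(c("lars", "glmnet", "sparsenet"))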
A linear model
[Figure: a single-predictor linear model, a line through points in the (x, y) plane with intercept b and slope m]
Linear models with many predictor variables
Y = Xβ + ε
• Y is a (n × 1) vector of response variables
• X is a (n × p) matrix of predictor variables
• β is a (p × 1) vector of coefficients
• ε is noise
Two tasks:
• Parameter estimation: estimate β
• Subset selection: find ≤ p variables to include in the model
Task 1: Parameter estimation: which β is best?
A common measure of the quality of a particular β is the residual sum of
squares (RSS). Let ŷᵢ = Σ_{j=1}^{p} xᵢⱼ βⱼ be the prediction for observation i. Then
RSS(β) = Σ_{i=1}^{n} (yᵢ − ŷᵢ)²
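As a quick illustration, RSS can be computed directly in R; a sketch with made-up X, y, and β (not from the slides):

# Hypothetical example: compute RSS for a candidate beta.
set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)    # 10 observations, 2 predictors
y <- drop(X %*% c(2, -1)) + rnorm(10, sd = 0.1)
beta <- c(1.8, -0.9)                           # a candidate coefficient vector
y_hat <- drop(X %*% beta)                      # predictions
sum((y - y_hat)^2)                             # RSS(beta)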
Residuals
[Figure: data points and a fitted regression line; the residual for observation i is |yᵢ − ŷᵢ|, the vertical distance between the observed yᵢ and the fitted ŷᵢ]
Task 1: Parameter estimation: minimizing RSS
β̂ = arg min_β RSS(β)
Is β̂ computed by an exhaustive search over βs? No
Task 1: Parameter estimation (cont.)
After some matrix calculus, estimating the parameters becomes a
linear algebra problem (an inversion and a few multiplications):
β̂ = (XᵀX)⁻¹ Xᵀ y
This solution is sometimes called "ordinary least squares"
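A minimal sketch of this closed form in R, checked against lm() (data simulated here purely for illustration):

# OLS via the normal equations, compared with lm().
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, 0, -2)) + rnorm(n)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)                    # (X'X)^{-1} X'y
cbind(normal_eq = drop(beta_hat), lm = coef(lm(y ~ X - 1)))  # same estimates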
Task 2: Subset selection
• Find ≤ p predictor variables to include in model
• Reasons for subset selection
• Principle of parsimony, Occam's razor
• Interpretation: each βᵢ should be meaningful
• Prediction: when p is too large, the model overfits the training data
Task 2: Subset selection; linear models
• Different methods, different eras
• Stepwise regression (circa Common Era)
• Leaps and bounds (Furnival, 1974)
• Ridge regression: constrain β (Hoerl and Kennard, 1970)
• Lasso, sparsenet: constrain β, shrink irrelevant βᵢ to 0 (Tibshirani, 1996; Mazumder et al., 2012)
• Ridge, lasso, and sparsenet are shrinkage methods
Task 2: Subset selection; leaps and bounds
• Intelligently enumerates all possible subsets - computationally expensive
• ∼ O(2ᵖ) time complexity - very high latency
• Prevents the modeler from iterating quickly
• Not feasible for wide matrices (p > 50) - a job can run for a day or more
Shrinkage methods
When computing β̂, add a penalty term J and a parameter λ
β̂(λ) = arg min_β [RSS(β) + λ J(β)]
Equivalently
β̂(t) = arg min_β RSS(β) subject to J(β) ≤ t
The correspondence between λ and t is one to one
When λ = 0, β̂(λ) is just the ordinary least squares solution
Norms
Most penalty terms are based on some norm of β:
• ℓ₂: ||β||₂ = √(Σᵢ βᵢ²) (Euclidean norm)
• ℓ₁: ||β||₁ = Σᵢ |βᵢ| (Manhattan/taxicab norm)
• ℓ₀: ||β||₀ = Σᵢ I(βᵢ ≠ 0) (number of non-zero entries)
Norms (concretely)
Let x be the vector [2, 1]
• ||x||₂ = √(2² + 1²) = √5 ≈ 2.24
• ||x||₁ = |2| + |1| = 3
• ||x||₀ = I(2 ≠ 0) + I(1 ≠ 0) = 2
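The same three quantities in R (x as on the slide):

x <- c(2, 1)
sqrt(sum(x^2))   # l2 (Euclidean) norm: sqrt(5), about 2.24
sum(abs(x))      # l1 norm: 3
sum(x != 0)      # l0 "norm": 2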
Visualizing norms
[Figure: the vector x = (2, 1) plotted relative to the origin (0, 0)]
Visualizing norms: ℓ2
[Figure: the straight-line (Euclidean) distance from (0, 0) to (2, 1): ||x||₂ = √(2² + 1²)]
Visualizing norms: ℓ1
[Figure: the taxicab path from (0, 0) to (2, 1), 2 units across and 1 unit up: ||x||₁ = |2| + |1|]
Visualizing norms: ℓ0
[Figure: counting the non-zero coordinates of (2, 1): ||x||₀ = I(2 ≠ 0) + I(1 ≠ 0)]
Penalty terms
• Ridge: J(β) = ||β||₂² = Σᵢ βᵢ²
• Lasso: J(β) = ||β||₁ = Σᵢ |βᵢ|
• The geometry of the penalty defines the region of possible solutions
Ridge regression: ℓ2 penalty
β̂(λ) = arg min_β [RSS(β) + λ Σ_{i=1}^{p} βᵢ²]
Equivalently
β̂(t) = arg min_β RSS(β)
subject to Σ_{i=1}^{p} βᵢ² ≤ t
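In R, a ridge fit can be obtained with glmnet by setting its elastic-net mixing parameter alpha to 0 (alpha = 1, the default, gives the lasso). A sketch, assuming a predictor matrix X and response y like those used in the examples later:

library(glmnet)
ridge_fit <- glmnet(X, y, alpha = 0, nlambda = 5)  # alpha = 0 selects the ridge (l2) penalty
coef(ridge_fit)   # coefficients shrink toward 0 as lambda grows, but are not set exactly to 0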
Region of an ℓ2 ball (λ = 1)
[Figure: points on the boundary of the unit ℓ₂ ball (β₁² + β₂² = 1), e.g. (β₁, β₂) = (1, 0), (0.67, 0.74), (0.33, 0.94), (0, 1)]
Lasso: ℓ1 penalty
β̂(λ) = arg min_β [RSS(β) + λ Σ_{i=1}^{p} |βᵢ|]
Equivalently
β̂(t) = arg min_β RSS(β)
subject to Σ_{i=1}^{p} |βᵢ| ≤ t
Region of an ℓ1 ball (λ = 1)
[Figure: points on the boundary of the unit ℓ₁ ball (|β₁| + |β₂| = 1), e.g. (β₁, β₂) = (1, 0), (0.67, 0.33), (0.33, 0.67), (0, 1)]
Sparsity and shrinkage
Unlike the ℓ₂ penalty (the disc-shaped region on the right of the figure), the ℓ₁ penalty (the diamond-shaped region on the left) produces sparse models: the RSS contours tend to first touch the ℓ₁ region at a corner, where some coefficients are exactly zero.
[Figure: elliptical RSS contours meeting the ℓ₁ (diamond, left) and ℓ₂ (disc, right) constraint regions; Hastie et al., The Elements of Statistical Learning, 2009]
Why (some) shrinkage methods are great
Lasso and sparsenet perform two tasks simultaneously:
• Estimate β
• Find ≤ p predictor variables to include in the model
Rest of talk
• Lasso (with examples)
• Model selection concepts
• SparseNet
Solving the lasso: original approach
• Estimating β̂ for the lasso is a convex optimization problem (specifically, quadratic programming [QP])
• Original approach - not terribly fast:
• Find λ₀ such that β̂ is 0
• Run a QP solver to estimate β̂(λᵢ) for each λᵢ in some subset of the range [0, λ₀]
• Use cross validation to choose the best λᵢ
Solving the lasso: newer approaches
• The fastest algorithms for obtaining the lasso solution are implemented as R packages:
• lars (Hastie and Efron, 2003)
• Least angle regression
• The lasso solution path is piecewise linear; a series of clever projections
• glmnet (Friedman et al., 2008)
• Very fast coordinate descent
• Compute β̂(λ) by iteratively updating each β̂ᵢ(λ) until convergence; akin to solving a sequence of univariate regressions
• These methods are roughly as fast as a single ordinary least squares solution
Lasso examples
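The X, y, and newX used in the examples below are not shown on the slides; a hypothetical setup with four predictors, roughly consistent with the fitted coefficients, might look like this:

# Hypothetical data setup for the examples that follow (not from the slides).
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("V", 1:4)))
y <- drop(4 + X %*% c(2.7, 6.8, 0.2, -10.5) + rnorm(n))
newX <- matrix(rnorm(4), 1, 4)   # a single new observation for predict()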
LARS
> library(lars)
> object <- lars(X, y)
> coef(object)
           V1       V2        V3        V4
[1,] 0.000000 0.000000 0.0000000   0.00000
[2,] 0.000000 4.059090 0.0000000   0.00000
[3,] 0.000000 5.597966 0.1454205   0.00000
[4,] 1.081084 5.712019 0.1590585   0.00000
[5,] 2.723949 6.789427 0.2128308 -10.53639
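The coefficient profile on the next slide is the standard lars plot; given the fit above, it can be reproduced with something like:

plot(object)   # piecewise-linear lasso coefficient paths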
Coefficient profile (LARS)
[Figure: LASSO coefficient profile from lars: standardized coefficients plotted against |beta|/max|beta|, with variables entering the model one at a time]
GLMNET
> library(glmnet)
> object <- glmnet(X, y, nlambda=5)
> coef(object)
              s0     s1     s2      s3      s4
(Intercept) 4.07 -2.193  3.068   3.593   3.644
V1             .  1.362  2.586   2.709   2.723
V2             .  5.895  6.700   6.781   6.789
V3             .  0.168  0.208   0.212   0.213
V4             . -1.792 -9.664 -10.450 -10.528
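Likewise, the glmnet profile on the next slide comes from the package's plot method, roughly:

plot(object, xvar = "norm", label = TRUE)   # coefficients against the L1 norm of beta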
Coefficient profile (GLMNET)
[Figure: glmnet coefficient profile: coefficients plotted against the L1 norm of the coefficient vector; the counts along the top (0 2 4 4 4) are the numbers of non-zero coefficients]
Theory of the lasso
[Figure: map of conditions used in the theory of the lasso (coherence, compatibility, restricted eigenvalue, minimal adaptive restricted eigenvalue, adaptive restricted regression, irrepresentable condition) and the theorems connecting them (Theorems 6.2, 6.4, 7.2, 7.3; Corollary 6.13; Lemma 6.11); Bühlmann and van de Geer, Statistics for High-Dimensional Data: Methods, Theory, and Applications, Springer-Verlag, 2011]
Some limitations of the lasso
• "The lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables" - Tibshirani
• Variable selection consistency holds only under the irrepresentable condition ("On Model Selection Consistency of Lasso", Zhao and Yu, 2006)
• Some theoretical conditions apply only in the extreme case p ≫ n
• For our purposes: does the learner yield models with high predictive accuracy?
Model selection concepts
Cross validation
• Leave-one-out (LOO) cross validation is unbiased but has high variance
• Bias: how accurately the estimate reflects performance on held-out data
• Variance: how sensitive the estimate is to variations in the training data
• K-fold cross validation is preferred (K = 5 … 20)
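A bare-bones K-fold CV loop for an ordinary linear model, just to make the procedure concrete; a sketch assuming the X and y from the earlier examples:

# Manual K-fold cross validation of an OLS fit (illustrative sketch).
k  <- 10
df <- data.frame(y = drop(y), X)
folds <- sample(rep(1:k, length.out = nrow(df)))    # random fold assignment
cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ ., data = df[folds != i, ])        # train on K-1 folds
  pred <- predict(fit, newdata = df[folds == i, ])  # predict the held-out fold
  mean((df$y[folds == i] - pred)^2)
})
mean(cv_mse)   # K-fold estimate of out-of-sample mean squared error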
Model selection
You are given a set of models and their K-fold CV performance
estimates. How do you choose the one that will perform well on unseen data?
Model selection: choose best
• Choose the model with the best CV performance estimate
• If you trust CV estimates, why not?
Model selection: choose using 1 standard-error rule
Choose the model that
• is the most parsimonious
• has CV performance close enough to the best one
Breiman's one standard-error rule (Classification and Regression Trees, Breiman et al., 1984, p. 80) defines "close enough" as an error no more than one standard error above the error of the best model
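In terms of the quantities reported by cv.glmnet (used on a later slide), the rule can be spelled out directly; here cvobj stands for a cv.glmnet fit, and cvm, cvsd, and lambda are fields that package returns:

# One standard-error rule, written out against cv.glmnet's output.
best       <- which.min(cvobj$cvm)                       # index of the lowest CV error
threshold  <- cvobj$cvm[best] + cvobj$cvsd[best]         # "close enough": within one SE of the best
lambda_1se <- max(cvobj$lambda[cvobj$cvm <= threshold])  # largest (most parsimonious) such lambda

This is essentially what cv.glmnet reports as lambda.1se.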
Model selection: best vs. 1 SE rule
[Figure: train and test MSE plotted against the number of variables (NVars), with markers for the model chosen by the "best" rule and by the one standard-error rule]
Model selection example with GLMNET
> library(glmnet)
> object <- cv.glmnet(X, y, nlambda=5)
> object$lambda.1se; object$lambda.min
[1] 0.182
[1] 0.00182
> predict(object, newX, s=object$lambda.1se)
        1
[1,] 5.21
> predict(object, newX, s=object$lambda.min)
        1
[1,] 5.36
SparseNet
SparseNet
• Based on Cun-Hui Zhang's (Rutgers statistics) MC+ algorithm
• "Nearly unbiased variable selection under minimax concave penalty", Zhang, 2010
• Uses the same coordinate descent approach as glmnet
• MC+ has two components
• The minimax concave penalty (MCP), more complex than the ℓ₁ penalty (written out below)
• The penalized linear unbiased selection (PLUS) algorithm
• Variable selection is consistent under more general conditions than for the lasso
• Zhang has an implementation on CRAN: plus
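For reference (a standard statement of the penalty from Zhang, 2010, not reproduced from the slides), the MCP applied to a single coefficient t is
P(t; λ, γ) = λ ∫₀^|t| (1 − x/(γλ))₊ dx, where (·)₊ is the positive part; this works out to λ|t| − t²/(2γ) when |t| ≤ γλ and to γλ²/2 when |t| > γλ.
As γ → ∞ the penalty approaches the lasso's λ|t|; as γ → 1 it approaches hard thresholding (best subset), which is the range the γ panels on the coefficient-profile slide below sweep through.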
SparseNet
> library(sparsenet)
> object <- sparsenet(X, y, nlambda=5, ngamma=4)
> coef(object)
$g1
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.22  3.03   3.58   3.61
V1                .  1.39  2.62   2.72   2.75
V2          2.1e-15  5.89  6.69   6.78   6.78
V3                .  0.17  0.21   0.21   0.21
V4                . -1.77 -9.64 -10.44 -10.51

$g2
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.05  3.30   3.67   3.66
V1                .  1.30  2.49   2.70   2.72
V2          4.2e-15  5.95  6.77   6.79   6.79
...
Coefficient profile (SparseNet)
[Figure: four coefficient profiles (coefficients against L1 norm), labeled Lasso, Gamma = 150, Gamma = 12.2, and Subset, showing the MC+ family moving from lasso-like solutions toward best-subset-like solutions as γ decreases]
Model selection example with SparseNet
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> object$parms.1se; object$parms.min
  gamma lambda
  8.563  0.081
  gamma  lambda
9.9e+35 8.1e-04
Model selection example with SparseNet (cont.)
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> predict(object, newX, which="parms.min")
       1
[1,] 5.4
> predict(object, newX, which="parms.1se")
       1
[1,] 5.2
Summary
• Refinement of the lasso (1996, Tibshirani)
• Strong theoretical guarantees, good empirical performance
• Best-in-class linear modeling algorithm