Sparsity by Worst-Case Quadratic Penalties: Presentation Transcript

    • Sparsity by Worst-Case Quadratic Penalties
      Joint work with Yves Grandvalet and Christophe Ambroise.
      Statistique et Génome, CNRS & Université d'Évry Val d'Essonne.
      SSB group, Évry, October 16th, 2012.
      arXiv preprint with Y. Grandvalet and C. Ambroise: http://arxiv.org/abs/1210.2077
      R package quadrupen, on CRAN.
    • Variable selection in high-dimensional problems
      Questionable solutions:
      1. Treat univariate problems and select effects via multiple testing; genomic data are often highly correlated...
      2. Combine multivariate analysis and model selection techniques:
         $\arg\min_{\beta \in \mathbb{R}^p} -\ell(\beta; y, X) + \lambda \|\beta\|_0$,
         unfeasible for p > 30 in general (NP-hard).
      Popular idea (questionable too!): use the ℓ1 norm as a convex relaxation of this problem, keeping its sparsity-inducing effect:
         $\arg\min_{\beta \in \mathbb{R}^p} -\ell(\beta; y, X) + \lambda \cdot \mathrm{pen}_{\ell_1}(\beta)$
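      For illustration only (not the algorithm of this talk), the ℓ1 relaxation can be tried
      with any off-the-shelf solver; the standalone R sketch below uses the glmnet package and
      simply shows that the ℓ1 penalty produces exact zeros.

          library(glmnet)                            # generic lasso solver, for illustration
          set.seed(1)
          n <- 50; p <- 100
          X <- matrix(rnorm(n * p), n, p)
          beta.star <- c(rep(2, 5), rep(0, p - 5))   # only 5 truly active coefficients
          y <- X %*% beta.star + rnorm(n)
          fit <- glmnet(X, y, alpha = 1)             # alpha = 1: pure l1 penalty
          b <- as.numeric(coef(fit, s = 0.5))[-1]    # coefficients at lambda = 0.5, intercept dropped
          sum(b != 0)                                # few nonzeros: most coefficients are exactly 0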
    • Another algorithm for the Lasso? (- -)zzz
      Well, not really...
      1. We suggest a unifying approach that might be useful, since it helps getting a global view on the Lasso zoology:
         reading every arXiv paper dealing with ℓ1 every day is out of reach;
         insights are still needed to understand high-dimensional problems.
      2. The associated algorithm is efficient and accurate up to medium-scale (1000s) problems:
         promising tools for pre-discarding irrelevant variables are emerging, so solving this class of problems may be enough;
         bootstrapping, cross-validation and subsampling are highly needed... and our method is well adapted to them.
    • Outline
      Geometrical insights on Sparsity
      Robustness Viewpoint
      Numerical experiments
      R package features
    • A geometric view of sparsity
      $\min_{\beta_1, \beta_2} \; -\ell(\beta_1, \beta_2) + \lambda\, \Omega(\beta_1, \beta_2)
       \quad\Longleftrightarrow\quad
       \max_{\beta_1, \beta_2} \; \ell(\beta_1, \beta_2) \ \text{ s.t. } \ \Omega(\beta_1, \beta_2) \le c$
      [Figure: level curves of the likelihood and the constraint region $\Omega(\beta_1,\beta_2) \le c$ in the $(\beta_1,\beta_2)$ plane]
    • Singularities induce sparsity
      Lasso (ℓ1) versus Ridge (ℓ2): with $B$ the unit ball of the penalty and $N_B(\hat\beta)$ the normal cone of $B$ at $\hat\beta$,
      $\hat\beta \in B$ is optimal if and only if $\nabla\ell(\hat\beta; X, y)$ defines a supporting hyperplane of $B$ at $\hat\beta$;
      equivalently, $-\nabla\ell(\hat\beta; X, y) \in N_B(\hat\beta)$.
      [Figure: the ℓ1 ball (singular at the axes) and the ℓ2 ball, with the normal cone $N_B(\hat\beta)$ at the solution]
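      A minimal, standalone R check of the corresponding subgradient (KKT) condition for the
      penalized lasso, in the special case of an orthonormal design where the solution is the
      closed-form soft-threshold (this does not use the quadrupen package):

          set.seed(1)
          n <- 20; p <- 5; lambda <- 1
          X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))         # orthonormal columns: t(X) %*% X = I
          y <- rnorm(n)

          soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
          beta.hat <- soft(drop(crossprod(X, y)), lambda)   # exact lasso solution for orthonormal X

          g <- drop(crossprod(X, y - X %*% beta.hat))       # minus the gradient of the squared loss
          all(abs(g) <= lambda + 1e-12)                     # |g_j| <= lambda everywhere
          act <- beta.hat != 0
          all(abs(g[act] - lambda * sign(beta.hat[act])) < 1e-12)  # g_j = lambda * sign(beta_j) on the support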
    • Outline
      Geometrical insights on Sparsity
      Robustness Viewpoint
      Numerical experiments
      R package features
    • Robust optimization framework
      Worst-case analysis: we wish to solve a regression problem where
      $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in D_\gamma} \|X\beta - y\|^2 + \lambda \|\beta - \gamma\|^2$,
      where $D_\gamma$ describes an uncertainty set for the parameters,
      $\gamma$ acts as a spurious adversary over the true $\beta$,
      and maximizing over $D_\gamma$ leads to the worst-case formulation.
    • Sparsity by worst-case formulation
      First mathematical breakthrough. Great, $(a - b)^2 = a^2 + b^2 - 2ab$, so
      $\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in D_\gamma} \|X\beta - y\|^2 + \lambda \|\beta - \gamma\|^2
       \;=\; \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda \|\beta\|^2
             + \lambda \max_{\gamma \in D_\gamma} \big\{ \|\gamma\|^2 - 2\,\gamma^\top\beta \big\}$.
      Look for a sparsity-inducing norm:
      choose $D_\gamma$ so as to recover your favourite ℓ1-type penalizer via the $\gamma^\top\beta$ term,
      and forget about $\|\gamma\|^2$, which does not change the minimization;
      this can be done 'systematically' by imposing regularity on $\beta$ and considering the dual adversarial assumption on $\gamma$.
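      A minimal numerical check of this mechanism, in base R: for the box uncertainty set
      $D_\gamma = \{\gamma : \|\gamma\|_\infty \le \eta\}$, the worst admissible adversary is
      $\gamma_j = -\eta\,\mathrm{sign}(\beta_j)$ and the worst-case quadratic penalty equals an
      elastic-net penalty up to an additive constant:

          set.seed(42)
          p <- 6; eta <- 0.3
          beta <- rnorm(p)

          gamma.worst <- -eta * ifelse(beta >= 0, 1, -1)    # coordinate-wise worst case
          worst.case  <- sum((beta - gamma.worst)^2)        # max over the box of ||beta - gamma||^2

          closed.form <- sum(beta^2) + 2 * eta * sum(abs(beta)) + p * eta^2
          all.equal(worst.case, closed.form)                # TRUE: l2 + l1 terms + constant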
    • Example: robust formulation of the elastic net
      'Lasso' regularity set for $\beta$: the ℓ1 norm must be controlled,
      $H_\beta^{\mathrm{Lasso}} = \{\beta \in \mathbb{R}^p : \|\beta\|_1 \le \eta_\beta\}$.
      Dual assumption on the adversary: the ℓ∞ norm of $\gamma$ should be controlled, say
      $D_\gamma^{\mathrm{Lasso}} = \{\gamma \in \mathbb{R}^p : \max_{\beta \in H_\beta^{\mathrm{Lasso}}} \gamma^\top\beta \le 1\}
       = \{\gamma \in \mathbb{R}^p : \|\gamma\|_\infty \le \eta_\gamma\}
       = \mathrm{conv}\,\{-\eta_\gamma, \eta_\gamma\}^p$,
      where $\eta_\gamma = 1/\eta_\beta$.
    • Example: robust formulation of the elastic net (continued)
      Now, the other way round:
      $\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda \|\beta\|^2
         + \lambda \max_{\gamma \in \{-\eta_\gamma, \eta_\gamma\}^p} \big\{ \|\gamma\|^2 - 2\,\gamma^\top\beta \big\}
       \;=\; \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 + \lambda \|\beta\|^2 + 2\lambda\eta_\gamma \|\beta\|_1 + c
       \;\Longleftrightarrow\;
       \min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|X\beta - y\|^2 + \lambda_1 \|\beta\|_1 + \tfrac{\lambda_2}{2}\|\beta\|^2$,
      for known $\lambda_1, \lambda_2$ (the constants being absorbed). We recognize the 'official' Elastic-net regularizer.
    • Example: robust formulation of the elastic net (continued)
      Geometrical argument with the constrained formulation:
      $\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \ \text{ s.t. } \ \|\beta\|_2^2 + 2\eta\|\beta\|_1 \le s
       \quad\Longleftrightarrow\quad
       \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \ \text{ s.t. } \ \max_{\gamma \in \{-\eta,\eta\}^p} \|\beta - \gamma\|^2 \le s + p\eta^2$
      [Figure: the elastic-net constraint region obtained as an intersection of Euclidean balls centred at the points of $\{-\eta, \eta\}^p$]
    • Example: robust formulation of the ℓ1/ℓ∞ group-Lasso
      Argument with one group, which generalizes by decomposability of the group norm.
      The regularity set for $\beta$, with the ℓ∞ norm controlled:
      $H_\beta^{\max} = \{\beta \in \mathbb{R}^p : \|\beta\|_\infty \le \eta_\beta\}$.
      Dual assumption on the adversary:
      $D_\gamma^{\max} = \{\gamma \in \mathbb{R}^p : \sup_{\beta \in H_\beta^{\max}} \gamma^\top\beta \le 1\}
       = \{\gamma \in \mathbb{R}^p : \|\gamma\|_1 \le \eta_\gamma\}
       = \mathrm{conv}\,\{\eta_\gamma e_1^p, \dots, \eta_\gamma e_p^p, -\eta_\gamma e_1^p, \dots, -\eta_\gamma e_p^p\}$,
      where $\eta_\gamma = 1/\eta_\beta$ and $e_j^p$ is the $j$th element of the canonical basis of $\mathbb{R}^p$,
      that is $(e_j^p)_{j'} = 1$ if $j' = j$ and $(e_j^p)_{j'} = 0$ otherwise.
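      The companion check for the group (ℓ∞) case, still in base R: with the ℓ1-ball
      uncertainty set $D_\gamma = \{\gamma : \|\gamma\|_1 \le \eta\}$, the maximum of
      $\|\beta - \gamma\|^2$ is attained at a vertex $-\eta\,\mathrm{sign}(\beta_{j^\star})\,e_{j^\star}$
      with $j^\star = \arg\max_j |\beta_j|$, which produces the ℓ∞ term:

          set.seed(42)
          p <- 6; eta <- 0.3
          beta <- rnorm(p)

          jstar <- which.max(abs(beta))
          gamma.worst <- numeric(p)
          gamma.worst[jstar] <- -eta * sign(beta[jstar])    # worst vertex of the l1 ball
          worst.case <- sum((beta - gamma.worst)^2)

          closed.form <- sum(beta^2) + 2 * eta * max(abs(beta)) + eta^2
          all.equal(worst.case, closed.form)                # TRUE: l2 + l_infinity terms + constant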
    • Example: robust formulation of the ℓ1/ℓ∞ group-Lasso
      Geometrical argument with the constrained formulation:
      $\min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \ \text{ s.t. } \ \|\beta\|_2^2 + 2\eta\|\beta\|_\infty \le s
       \quad\Longleftrightarrow\quad
       \min_{\beta \in \mathbb{R}^p} \|X\beta - y\|^2 \ \text{ s.t. } \ \max_{\gamma \in D_\gamma^{\max}} \|\beta - \gamma\|^2 \le s + \eta^2$
      [Figure: the ℓ1/ℓ∞ constraint region obtained as an intersection of Euclidean balls centred at the vertices $\pm\eta e_j^p$]
    • Generalize this principle to your favourite problem
      elastic-net (ℓ1 + ℓ2), ℓ∞ + ℓ2, structured elastic-net, fused-lasso + ℓ2, OSCAR + ℓ2
      [Figure: unit balls of the corresponding combined penalties]
    • Worst-Case Quadratic Penalty Active-Set Algorithm
      S0 Initialization:
         $\beta \leftarrow \beta^0$                                           // start with a feasible β
         $A \leftarrow \{j : \beta_j \ne 0\}$                                 // determine the active set
         $\gamma \leftarrow \arg\max_{g \in D_\gamma} \|\beta - g\|^2$        // pick a worst admissible γ
      S1 Update the active variables:
         $\beta_A \leftarrow (X_A^\top X_A + \lambda I_{|A|})^{-1} (X_A^\top y + \lambda \gamma_A)$   // subproblem resolution
      S2 Verify the coherence of $\gamma_A$ with the updated $\beta_A$:
         if $\|\beta_A - \gamma_A\|^2 < \max_{g \in D_\gamma} \|\beta_A - g_A\|^2$ then               // γ_A is no longer worst-case
            $\beta_A \leftarrow \beta_A^{\mathrm{old}} + \rho\,(\beta_A - \beta_A^{\mathrm{old}})$    // step back to the last γ_A-coherent solution
      S3 Update the active set A:
         $g_j \leftarrow \min_{\gamma \in D_\gamma} \big| x_j^\top (X_A \beta_A - y) + \lambda(\beta_j - \gamma_j) \big|$, $j = 1, \dots, p$   // worst-case gradient
         if $\exists\, j \in A : \beta_j = 0$ and $g_j = 0$ then $A \leftarrow A \setminus \{j\}$                     // downgrade j
         else if $\max_{j \in A^c} g_j \ne 0$ then $j \leftarrow \arg\max_{j \in A^c} g_j$, $A \leftarrow A \cup \{j\}$   // upgrade j
         else stop and return β, which is optimal
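      A minimal, illustrative R sketch of one outer iteration for the elastic-net specialization
      (box uncertainty set $D_\gamma = [-\eta, \eta]^p$). It exposes only the linear-algebra core of
      S1 and the upgrade test of S3; the backtracking of S2 and the downgrade rule are omitted, and
      this is not the quadrupen implementation:

          one_iteration <- function(X, y, A, beta, lambda, eta) {
            XA <- X[, A, drop = FALSE]
            ## worst admissible adversary on the active set: gamma_j = -eta * sign(beta_j)
            gammaA <- -eta * ifelse(beta[A] >= 0, 1, -1)
            ## S1: beta_A <- (X_A'X_A + lambda I)^{-1} (X_A'y + lambda gamma_A)
            betaA <- drop(solve(crossprod(XA) + lambda * diag(length(A)),
                                crossprod(XA, y) + lambda * gammaA))
            ## S3 (upgrade test only): worst-case gradient of the inactive variables;
            ## for beta_j = 0 it reduces to a soft-thresholded correlation with the residual
            r <- drop(XA %*% betaA - y)
            g <- pmax(abs(drop(crossprod(X, r))) - lambda * eta, 0)
            g[A] <- 0                                  # only inactive variables are candidates here
            list(betaA = betaA,
                 upgrade = if (max(g) > 0) which.max(g) else NA)
          }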
    • Algorithm complexity (see, e.g., Bach et al., 2011)
      Suppose that the algorithm stops at $\lambda_{\min}$ with $k$ activated variables and that no downgrade has been observed (we thus have $k$ iterations/steps).
      Complexity in favorable cases:
      1. computing $X_A^\top X_A + \lambda I_{|A|}$: $O(npk)$,
      2. maintaining $x_j^\top(X_A \beta_A - y)$ along the path: $O(pn + pk^2)$,
      3. Cholesky update of $(X_A^\top X_A + \lambda I_{|A|})^{-1}$: $O(k^3)$;
      a total of $O(npk + pk^2 + k^3)$. Hence $k$, defined by $\lambda_{\min}$, matters...
    • A bound to assess the distance to the optimum during optimization
      Proposition. For any $\eta_\gamma > 0$ and any vector norm $\|\cdot\|_*$, when $D_\gamma$ is defined as
      $D_\gamma = \{\gamma \in \mathbb{R}^p : \|\gamma\|_* \le \eta_\gamma\}$, then, for all $\gamma \in \mathbb{R}^p$ with $\|\gamma\|_* \ge \eta_\gamma$, we have
      $\min_{\beta \in \mathbb{R}^p} \max_{\gamma' \in D_\gamma} J_\lambda(\beta, \gamma')
        \;\ge\; \frac{\eta_\gamma}{\|\gamma\|_*}\, J_\lambda\big(\beta(\gamma), \gamma\big)
                - \frac{\lambda\,\eta_\gamma\,(\|\gamma\|_* - \eta_\gamma)}{\|\gamma\|_*^2}\,\|\gamma\|^2$,
      where $J_\lambda(\beta, \gamma) = \|X\beta - y\|^2 + \lambda\|\beta - \gamma\|^2$ and $\beta(\gamma) = \arg\min_{\beta \in \mathbb{R}^p} J_\lambda(\beta, \gamma)$.
      This proposition can be used to compute an optimality gap by picking a γ-value such that the current worst-case gradient is null (the current β-value then being the optimal β(γ)).
    • Bound: illustration on an Elastic-Net problem (n = 50, p = 200)
      [Figure: monitoring convergence; true optimality gap (solid black) versus our pessimistic bound (dashed blue) and Fenchel's duality gap (dotted red), computed at each iteration of the algorithm; optimality gap (log scale) versus number of iterations]
    • Full interpretation in a robust optimization framework
      Proposition. The robust regression problem
      $\min_{\beta \in \mathbb{R}^p} \max_{(\Delta_X, \varepsilon, \gamma) \in D_X \times D_\varepsilon \times D_\gamma}
        \|(X - \Delta_X)\beta + X\gamma + \varepsilon - y\|^2$,
      for a given form of the global uncertainty set $D_X \times D_\varepsilon \times D_\gamma$ on $(\Delta_X, \varepsilon, \gamma)$,
      is equivalent to the robust regularized regression problem
      $\min_{\beta \in \mathbb{R}^p} \max_{\gamma \in D_\gamma} \|X\beta - y\|^2 + \eta_X \|\beta - \gamma\|^2$.
      These assumptions entail the following relationship between X and y:
      $y = (X - \Delta_X)\beta + X\gamma + \varepsilon$.
      The observed responses are formed by summing the contributions of the unobserved clean inputs, the adversarial noise that maps the observed inputs to the responses, and the neutral noise.
    • Outline
      Geometrical insights on Sparsity
      Robustness Viewpoint
      Numerical experiments
      R package features
    • General objectives
      Assessing the efficiency of an algorithm:
      accuracy is the difference between the optimum of the objective function and its value at the solution returned by the algorithm;
      speed is the computing time required to return this solution;
      timings have to be compared at similar precision requirements.
      Mimicking post-genomic data attributes:
      optimization difficulties result from ill-conditioning, due to either high correlation between predictors or underdetermination (high-dimensional, or "large p, small n", setup).
      Remarks:
      1. With active-set strategies, bad conditioning is somewhat alleviated.
      2. The sparsity of the true parameter heavily impacts the running times.
    • Data generation
      Exploring those characteristics with linear regression: we generate samples of size n from the model
      $y = X\beta^\star + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$, with σ chosen so as to reach $R^2 \approx 0.8$,
      $X \sim N(0, \Sigma)$, with $\Sigma_{ij} = 1_{\{i=j\}} + \rho\, 1_{\{i \ne j\}}$,
      $\mathrm{sgn}(\beta^\star) = (\underbrace{1,\dots,1}_{s/2}, \underbrace{-1,\dots,-1}_{s/2}, \underbrace{0,\dots,0}_{p-s})$.
      Controlling the difficulty:
      $\rho \in \{0.1, 0.4, 0.8\}$ rules the conditioning,
      $s \in \{10\%, 30\%, 60\%\}$ controls the sparsity,
      the ratio $n/p \in \{2, 1, 0.5\}$ quantifies the well/ill-posedness.
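      A small data-generation sketch matching this simulation design (compound-symmetry
      correlation ρ, signed sparse β*); here σ is simply fixed by hand rather than calibrated
      to reach R² ≈ 0.8 as in the experiments:

          set.seed(1)
          n <- 100; p <- 40; rho <- 0.4; s <- 12; sigma <- 2

          beta.star <- c(rep(1, s / 2), rep(-1, s / 2), rep(0, p - s))
          Sigma <- matrix(rho, p, p); diag(Sigma) <- 1        # Sigma_ij = 1{i=j} + rho 1{i != j}
          X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)     # rows drawn from N(0, Sigma)
          y <- X %*% beta.star + rnorm(n, sd = sigma)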
    • Comparing optimization strategies
      We used our own code to avoid implementation biases for
      1. accelerated proximal methods (proximal),
      2. coordinate descent (coordinate),
      3. our quadratic solver (quadratic),
      all wrapped in the same active-set + warm-start routine.
      Timings are averaged over 100 runs to minimize
      $J^{\mathrm{enet}}_{\lambda_1, \lambda_2}(\beta) = \tfrac{1}{2}\|X\beta - y\|^2 + \lambda_1\|\beta\|_1 + \tfrac{\lambda_2}{2}\|\beta\|^2$,
      with halting condition
      $\max_{j \in \{1,\dots,p\}} \big| x_j^\top (y - X\hat\beta) - \lambda_2 \hat\beta_j \big| < \lambda_1 + \tau$,   (1)
      where the threshold is $\tau = 10^{-2}$, on a 50 × 50 grid of $(\lambda_1, \lambda_2)$.
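      One way to evaluate a halting criterion of this form for a candidate $\hat\beta$, following the
      elastic-net stationarity condition (illustrative; the threshold and grid are those quoted above):

          kkt_residual <- function(X, y, beta.hat, lambda2) {
            ## max_j | x_j'(y - X beta.hat) - lambda2 * beta.hat_j |
            max(abs(drop(crossprod(X, y - X %*% beta.hat)) - lambda2 * beta.hat))
          }
          ## stop when kkt_residual(X, y, beta.hat, lambda2) < lambda1 + 1e-2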
    • [Figure: log-ratio between the timing of each competitor (coordinate descent, proximal/FISTA) and quadratic, for p = 2n = 100, s = 30, under small (0.1), medium (0.4) and large (0.8) correlation, as a function of log10(λ1) and log10(λ2)]
    • Comparing stand-alone implementations
      We compare our method on a Lasso problem to popular R packages:
      1. accelerated proximal methods: SPAMs-FISTA (Mairal, Bach et al.),
      2. coordinate descent: glmnet (Friedman, Hastie, Tibshirani),
      3. homotopy/LARS algorithm: lars (Efron, Hastie) and SPAMs-LARS,
      4. quadratic solver: quadrupen.
      The distance D to the optimum is evaluated on $J^{\mathrm{lasso}}_\lambda(\beta) = J^{\mathrm{enet}}_{\lambda,0}(\beta)$ by
      $D(\mathrm{method}) = \Big( \frac{1}{|\Lambda|} \sum_{\lambda \in \Lambda}
         \big( J^{\mathrm{lasso}}_\lambda(\hat\beta{}^{\mathrm{lars}}_\lambda)
             - J^{\mathrm{lasso}}_\lambda(\hat\beta{}^{\mathrm{method}}_\lambda) \big)^2 \Big)^{1/2}$,
      where Λ is given by the first min(n, p) steps of lars.
      We vary $\{\rho, (p, n)\}$, fix $s = 0.25\,\min(n, p)$ and average over 50 runs.
    • [Figure: distance to the optimum D(method) (log10) versus CPU time (seconds, log10) for n = 100, p = 40, at low (0.1), medium (0.4) and high (0.8) correlation, comparing glmnet (CD, active set), SPAMs (FISTA, no active set), SPAMs (homotopy/LARS), lars (homotopy/LARS) and quadrupen (this paper)]
    • [Figure: same comparison for n = 200, p = 1000]
    • [Figure: same comparison for n = 400, p = 10000]
    • Link between accuracy and prediction performance
      The Lasso requires the rather restrictive 'irrepresentable condition' (or its avatars) on the design for sign consistency...
      Any unforeseen consequences of a lack of accuracy? Early stops of the algorithm are likely to prevent either the removal of all irrelevant coefficients, or the insertion of relevant ones.
      Illustration on mean square error and support recovery:
      generate 100 training data sets for linear regression with ρ = 0.8, R² = 0.8, p = 100, s = 30%, varying n;
      generate for each a large test set (say, 10n) for evaluation.

      method                     quadrupen   glmnet (low)   glmnet (med)   glmnet (high)
      timing (ms)                        8              7              8              64
      accuracy (dist. to opt.)     5.9e-14        7.2e+00       6.04e+00        1.47e-02
    • [Figure: test mean square error (top) and sign error (bottom) as functions of log10(λ1), for n/p ∈ {0.5, 1, 2}, comparing quadrupen with glmnet at low, medium and high precision settings]
    • Outline
      Geometrical insights on Sparsity
      Robustness Viewpoint
      Numerical experiments
      R package features
    • Learning features
      Problem solved:
      $\hat\beta = \arg\min_{\beta} \tfrac{1}{2}(y - X\beta)^\top W (y - X\beta)
                   + \lambda_1 \|\omega \circ \beta\|_1 + \lambda_2\, \beta^\top S \beta$,
      where
      $W = \mathrm{diag}(w)$, with $(w_i)_{i=1}^n \ge 0$ some observation weights,
      $\omega$, with $(\omega_j)_{j=1}^p > 0$, the ℓ1 penalty weights,
      $S$ a p × p positive definite matrix 'structuring' the ℓ2 penalty.
      Some corresponding estimators: (Adaptive) Lasso, (Structured) Elastic-net, Fused-Lasso signal approximator (inefficient), ...
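      A sketch of one possible structuring matrix S for the ℓ2 term, encoding smoothness between
      successive coefficients (S = εI + DᵀD with D the first-difference operator); how S is passed
      to the fitting function depends on the package interface, so only the construction is shown:

          p <- 100; eps <- 0.01
          D <- diff(diag(p))                 # (p-1) x p first-difference operator
          S <- eps * diag(p) + crossprod(D)  # positive definite by construction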
    • Technical features
      Written in R (with S4 classes) and C++.
      Dependencies:
      armadillo + RcppArmadillo, its accompanying interface to R; Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use, with a syntax deliberately similar to Matlab;
      Matrix, to handle sparse matrices;
      ggplot2, a plotting system based on the grammar of graphics;
      parallel (not available for Windows :-)), a built-in R package since 2.14 which allows parallel computation over the available CPU cores or clusters (very useful for cross-validation, for instance).
      Suited to solving small to medium scale problems.
    • Load the package and show dependencies
      R> library("quadrupen")
       [1] "quadrupen" "Matrix"    "lattice"   "ggplot2"   "stats"     "graphics"
       [7] "grDevices" "utils"     "datasets"  "methods"   "base"
      Generate a size-100 vector of parameters with labels
      R> ## VECTOR OF TRUE PARAMETERS: sparse, blockwise shaped
      R> beta <- rep(c(0, -2, 2), c(80, 10, 10))
      R> ## labels for true nonzeros
      R> labels <- rep("irrelevant", length(beta))
      R> labels[beta != 0] <- c("relevant")
      R> labels <- factor(labels, ordered = TRUE,
      +                   levels = c("relevant", "irrelevant"))
    • R> ## COVARIANCE STRUCTURE OF THE PREDICTORS
      R> ## Toeplitz correlation between irrelevant variables
      R> cor <- 0.8
      R> S11 <- toeplitz(cor^(0:(80 - 1)))
      R> ## block correlation between relevant variables
      R> S22 <- matrix(cor, 10, 10)
      R> diag(S22) <- 1
      R> ## correlation between relevant and irrelevant variables
      R> eps <- 0.25
      R> Sigma <- bdiag(S11, S22, S22) + eps
      [Figure: image plot of the 100 x 100 covariance matrix Sigma]
    • Generate n = 100 observations with a high level of noise (σ = 10)
      R> mu <- 3
      R> sigma <- 10
      R> n <- 100
      R> x <- as.matrix(matrix(rnorm(100 * n), n, 100) %*% chol(Sigma))
      R> y <- mu + x %*% beta + rnorm(n, 0, sigma)
      Give a try to raw Lasso and Elastic-net fits...
      R> start <- proc.time()
      R> lasso <- elastic.net(x, y, lambda2 = 0)
      R> e.net <- elastic.net(x, y, lambda2 = 1)
      R> print(proc.time() - start)
         user  system elapsed
        0.080   0.000   0.078
    • A print/show method is defined:
      R> print(e.net)
      Linear regression with elastic net penalizer, coefficients rescaled by (1 + lambda2).
      - number of coefficients: 100 + intercept
      - penalty parameter lambda1: 100 points from 171 to 0.856
      - penalty parameter lambda2: 1
      Also consider residuals, deviance, predict, and fitted methods...
      R> head(deviance(e.net))
      171.217 162.295 153.837 145.821 138.222 131.019
        82366   76185   68799   60415   52456   45139
    • R> plot(lasso, main = "Lasso", xvar = "fraction", labels = labels)
      [Figure: Lasso path, standardized coefficients versus the fraction |beta_lambda1|_1 / max_lambda1 |beta_lambda1|_1, relevant/irrelevant variables coloured]
    • R> plot(lasso, main = "Lasso", xvar = "lambda",
      +       log.scale = FALSE, reverse = TRUE, labels = labels)
      [Figure: Lasso path, standardized coefficients versus lambda1 (linear scale, reversed)]
    • R> plot(lasso, main = "Lasso", labels = labels)
      [Figure: Lasso path, standardized coefficients versus log10(lambda1)]
    • R> plot(e.net, main = "Elastic-net", labels = labels)
      [Figure: Elastic-net path, standardized coefficients versus log10(lambda1)]
    • R> system.time(
      +   cv.double <- crossval(x, y, lambda2 = 10^seq(1, -1.5, len = 50))
      + )
      DOUBLE CROSS-VALIDATION
      10-fold CV on the lambda1 grid for each lambda2
      10     8.892  7.906  7.03   6.251
      5.558  4.942  4.394  3.907  3.474
      3.089  2.746  2.442  2.171  1.931
      1.717  1.526  1.357  1.207  1.073
      0.954  0.848  0.754  0.671  0.596
      0.53   0.471  0.419  0.373  0.331
      0.295  0.262  0.233  0.207  0.184
      0.164  0.146  0.129  0.115  0.102
      0.091  0.081  0.072  0.064  0.057
      0.051  0.045  0.04   0.036  0.032
         user  system elapsed
       10.888   1.492   6.636
      R> plot(cv.double)
    • [Figure: cross-validation error (mean) as a heat map over the (log10(lambda1), log10(lambda2)) grid]
    • R> lambda2 <- slot(cv.double, "lambda2.min")
      [1] 0.8483
      R> system.time(
      +   cv.simple <- crossval(x, y, lambda2 = lambda2)
      + )
      SIMPLE CROSS-VALIDATION
      10-fold CV on the lambda1 grid, lambda2 is fixed.
         user  system elapsed
        0.312   0.052   0.266
      R> sum(sign(slot(cv.simple, "beta.min")) != sign(beta))
      [1] 0
      R> plot(cv.simple)
    • [Figure: cross-validation error (mean square error) versus log10(lambda1), with the lambda1 choices by the 1-SE rule and by minimum MSE marked]
    • R> marks <- log10(c(slot(cv.simple, "lambda1.min"),
      +                   slot(cv.simple, "lambda1.1se")))
      R> graph <- plot(elastic.net(x, y, lambda2 = lambda2),
      +               labels = labels, plot = FALSE)
      R> graph + geom_vline(xintercept = marks)
      [Figure: elastic-net path with vertical lines at log10(lambda1.min) and log10(lambda1.1se)]
    • Stability Selection
      Let I be a random subsample of size n/2:
      $\hat{S}^\lambda(I) = \{j : \hat\beta_j^\lambda(I) \ne 0\}$, the estimated support at λ;
      $\hat\Pi_j^\lambda = P\big(j \in \hat{S}^\lambda(I)\big)$, the estimated selection probabilities;
      $q_\Lambda = E\big(|\hat{S}^\Lambda(I)|\big)$, the average number of selected variables, where $\Lambda = [\lambda_{\max}, \lambda_{\min}]$.
      Definition. The set of stable variables on Λ with respect to a cutoff $\pi_{\mathrm{thr}}$ is
      $\hat{S}^{\mathrm{stable}} = \{j : \max_{\lambda \in \Lambda} \hat\Pi_j^\lambda \ge \pi_{\mathrm{thr}}\}$.
      Proposition. If the distribution of $1_{\{k \in \hat{S}^\lambda\}}$ is exchangeable for any $\lambda \in \Lambda$, then
      $\mathrm{FWER} \le \mathrm{PFER} = E(V) \le \frac{1}{2\pi_{\mathrm{thr}} - 1} \cdot \frac{q_\Lambda^2}{p}$.
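      An illustrative sketch of the selection-frequency estimate $\hat\Pi$ over B subsamples of size
      n/2; fit_support is a hypothetical helper returning the indices of the nonzero coefficients
      for one λ (it could wrap the elastic.net() fit shown earlier) and is not part of the package,
      whose own stability() function is demonstrated on the next slide:

          selection_frequencies <- function(x, y, fit_support, B = 100) {
            n <- nrow(x); p <- ncol(x)
            counts <- numeric(p)
            for (b in seq_len(B)) {
              I <- sample.int(n, floor(n / 2))                  # random subsample of size n/2
              sel <- fit_support(x[I, , drop = FALSE], y[I])    # indices selected on this subsample
              counts[sel] <- counts[sel] + 1
            }
            counts / B                                          # estimated selection probabilities
          }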
    • R> system.time(
      +   stab <- stability(x, y, subsamples = 400,
      +                     randomize = TRUE, weakness = 0.5)
      + )
      STABILITY SELECTION with randomization (weakness = 0.5)
      Fitting procedure: elastic.net with lambda2 = 0.01 and an 100-dimensional grid of lambda1.
      Running 2 jobs parallely (1 per core)
      Approx. 200 subsamplings for each job for a total of 400
         user  system elapsed
        5.273   0.144   5.440
      R> print(stab)
      Stability path for elastic.net penalizer, coefficients rescaled by (1 + lambda2).
      - penalty parameter lambda1: 100 points from 171 to 0.856
      - penalty parameter lambda2: 0.01
    • R> plot(stab, labels = labels, cutoff = 0.75, PFER = 1)
      [Figure: stability path of an elastic.net regularizer; selection probabilities versus average number of selected variables, cutoff pi_thr = 0.75, PFER <= 1, q_hat = 7.29; selected/unselected and relevant/irrelevant variables distinguished]
    • R> plot(stab, labels = labels, cutoff = 0.75, PFER = 2)
      [Figure: same stability path with PFER <= 2, q_hat = 10.03]
    • Concluding remarks
      What has been done:
      a unifying view of sparsity through the quadratic formulation,
      a robust regression interpretation,
      a competitive algorithm for small to medium scale problems,
      an accompanying R package,
      insights into the links between accuracy and prediction performance.
      What will be done (almost surely, and soon):
      1. Multivariate problems (Stéphane, Guillem/Pierre):
         $\mathrm{tr}\big((Y - XB)\,\Omega\,(Y - XB)^\top\big) + \mathrm{pen}_{\lambda_1, \lambda_2}(B)$.
      2. A real sparse handling of the design matrix (Stéphane).
      3. Group ℓ1/ℓ∞ penalty (prototyped in R) (Stéphane, Camille, Eric).
    • More perspectives
      Hopefully (help needed):
      1. Screening/early discarding of irrelevant features.
      2. Efficient implementation of iteratively reweighted least squares for logistic regression, the Cox model, etc. (Sarah? Marius?)
      3. Consider an implementation of OSCAR/group-OSCAR (for segmentation purposes). (Alia? Morgane? Pierre?)
      4. Control the precision when solving the subproblems (tortured M2 student): a dirty (yet controlled) resolution could even speed up the procedure; this can be done via (preconditioned) conjugate gradient/iterative methods. Some promising results during Aurore's M2.
      5. More tools for robust statistics...
      6. Integration with SIMoNe (if I don't know what to do/am at a loose end).
    • References
      F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.
      H. Xu, C. Caramanis, and S. Mannor. Robust regression and lasso. IEEE Transactions on Information Theory, 56(7):3561-3574, 2010.
      A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183-202, 2009.
      H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115-123, 2008.
      W. J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397-416, 1998.
      L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4):1035-1064, 1997.