
# Sparsity with sign-coherent groups of variables via the cooperative-Lasso


1. Sparsity with sign-coherent groups of variables via the cooperative-Lasso

    - Julien Chiquet (1), Yves Grandvalet (2), Camille Charbonnier (1)
    - (1) Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
    - (2) Heudiasyc, CNRS & Université de Technologie de Compiègne
    - SSB, 29 March 2011
    - arXiv preprint: http://arxiv.org/abs/1103.2697
    - R package scoop: http://stat.genopole.cnrs.fr/logiciels/scoop
2. Notations

    Let $Y$ be the output random variable and $X = (X^1, \dots, X^p)$ the input random variables, where $X^j$ is the $j$th predictor.

    The data: given a sample $\{(y_i, x_i),\ i = 1, \dots, n\}$ of i.i.d. realizations of $(Y, X)$, denote

    - $y = (y_1, \dots, y_n)^\top$ the response vector,
    - $x^j = (x^j_1, \dots, x^j_n)^\top$ the vector of data for the $j$th predictor,
    - $\mathbf{X}$ the $n \times p$ design matrix whose $j$th column is $x^j$,
    - $\mathcal{D} = \{i : (y_i, x_i) \in \text{training set}\}$,
    - $\mathcal{T} = \{i : (y_i, x_i) \in \text{test set}\}$.
3. Generalized linear models

    Suppose $Y$ depends linearly on $X$ through a function $g$:

    $$\mathbb{E}(Y) = g(X\beta^\star).$$

    We predict a response $y_i$ by $\hat y_i = g(x_i\hat\beta)$ for any $i \in \mathcal{T}$ by solving

    $$\hat\beta = \arg\max_{\beta} \ell_{\mathcal{D}}(\beta) = \arg\min_{\beta} \sum_{i \in \mathcal{D}} L_g(y_i, x_i\beta),$$

    where $L_g$ is a loss function depending on the function $g$. Typically,

    - if $Y$ is Gaussian and $g = \mathrm{Id}$ (OLS), $L_g(y, x\beta) = (y - x\beta)^2$,
    - if $Y$ is binary and $g : t \mapsto (1 + e^{-t})^{-1}$ (logistic regression), $L_g(y, x\beta) = -\left[\, y \cdot x\beta - \log\left(1 + e^{x\beta}\right) \right]$,
    - or any negative log-likelihood of an exponential family distribution.
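The two losses named on this slide can be written down directly. The following is a minimal sketch, not the authors' implementation: the function names (`squared_loss`, `logistic_loss`, `empirical_risk`) are hypothetical, and the linear predictor $x_i\beta$ is passed as `eta`.

```python
import math


def squared_loss(y, eta):
    """Gaussian case, g = identity: L_g(y, x.beta) = (y - x.beta)^2."""
    return (y - eta) ** 2


def logistic_loss(y, eta):
    """Binary case, y in {0, 1}: -(y * eta - log(1 + exp(eta)))."""
    # log(1 + exp(eta)) evaluated in a numerically stable way.
    softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return softplus - y * eta


def empirical_risk(loss, ys, xs, beta):
    """Sum of the losses over a training sample, with eta_i = x_i . beta."""
    total = 0.0
    for y, x in zip(ys, xs):
        eta = sum(xj * bj for xj, bj in zip(x, beta))
        total += loss(y, eta)
    return total
```

Minimizing `empirical_risk` over `beta` corresponds to the $\arg\min$ on the slide; the penalized estimators discussed later add a penalty term to this quantity.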
5. Estimation and selection at the group level

    1. Structure: the set $I = \{1, \dots, p\}$ splits into a known partition,

        $$I = \bigcup_{k=1}^{K} G_k, \quad \text{with } G_k \cap G_\ell = \emptyset,\ k \neq \ell.$$

    2. Sparsity: the support $S$ of $\beta^\star$ has few entries,

        $$S = \{i : \beta^\star_i \neq 0\}, \quad \text{such that } |S| \ll p.$$

    The group-Lasso estimator (Grandvalet and Canu '98, Bakin '99, Yuan and Lin '06):

    $$\hat\beta^{\text{group}} = \arg\min_{\beta \in \mathbb{R}^p} \; -\ell_{\mathcal{D}}(\beta) + \lambda \sum_{k=1}^{K} w_k \left\| \beta_{G_k} \right\|.$$

    - $\lambda \geq 0$ controls the overall amount of penalty,
    - $w_k > 0$ adapts the penalty between groups (dropped hereafter).
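As an illustration of the penalty term only, the weighted sum of group-wise Euclidean norms can be sketched as follows. The name `group_norm` and the 0-based index lists used to encode the partition are assumptions made for this example.

```python
import math


def group_norm(beta, groups, weights=None):
    """Group-Lasso penalty: sum over k of w_k * ||beta_{G_k}||_2.

    `groups` is a list of index lists forming a partition of range(len(beta)).
    With `weights=None` every w_k is set to 1, as on the slides.
    """
    total = 0.0
    for k, idx in enumerate(groups):
        w = 1.0 if weights is None else weights[k]
        total += w * math.sqrt(sum(beta[j] ** 2 for j in idx))
    return total
```

Because each group enters through an unsquared Euclidean norm, the penalty is non-differentiable exactly where a whole group vanishes, which is what produces group-level sparsity.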
7. Toy example: the prostate dataset

    Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients.

    - lcavol: log(cancer volume)
    - lweight: log(prostate weight)
    - age: age
    - lbph: log(benign prostatic hyperplasia amount)
    - svi: seminal vesicle invasion
    - lcp: log(capsular penetration)
    - gleason: Gleason score
    - pgg45: percentage Gleason scores 4 or 5

    Figure: Lasso, coefficients against lambda (log scale).

    Figure: hierarchical clustering of the 8 predictors (dendrogram, vertical axis: height).

    Figure: group-Lasso, coefficients against lambda (log scale).
11. Application to splice site detection

    Predict splice site status (0/1) by a sequence of 7 bases and their interactions.

    - order 0: 7 factors with 4 levels,
    - order 1: $\binom{7}{2}$ factors with $4^2$ levels,
    - order 2: $\binom{7}{3}$ factors with $4^3$ levels,
    - using dummy coding for factors, we form groups.

    Figure: information content against position in the sequence.

    Figure: estimated group coefficients, sorted by interaction order (order 0, order 1, order 2).

    L. Meier, S. van de Geer, P. Bühlmann, 2008. The group-Lasso for logistic regression, JRSS series B.
13. Group-Lasso limitations

    1. Not a single zero should belong to a group with non-zeros. Strong group sparsity (Huang and Zhang, '10 arXiv) establishes the conditions where the group-Lasso outperforms the Lasso, and conversely.
    2. No sign-coherence within group. Sign-coherence is required if groups gather consonant variables, e.g., groups defined by clusters of positively correlated variables.

    The cooperative-Lasso: a penalty which assumes a sign-coherent group structure, that is to say, groups which gather either non-positive, non-negative, or null parameters.
15. Motivation: multiple network inference

    Figure: three experiments (experiment 1, experiment 2, experiment 3), each followed by the inference of one network.

    A group is a set of corresponding edges across tasks (e.g., red or blue ones): sign-coherence matters!

    J. Chiquet, Y. Grandvalet, C. Ambroise, 2010. Inferring multiple graphical structures, Statistics and Computing.
16. Motivation: joint segmentation of aCGH profiles

    For a single profile:

    $$\min_{\beta \in \mathbb{R}^p} \|\beta - y\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{p} |\beta_i - \beta_{i-1}| < s,$$

    where $y$ is a vector in $\mathbb{R}^p$ and $\beta$ a vector in $\mathbb{R}^p$.

    For $n$ profiles segmented jointly:

    $$\min_{\beta \in \mathbb{R}^{n \times p}} \|\beta - Y\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{p} \|\beta_i - \beta_{i-1}\| < s,$$

    where

    - $Y$ is an $n \times p$ matrix with $n$ profiles of size $p$,
    - $\beta_i$ is a size-$n$ vector with the $i$th probes for the $n$ profiles,
    - a group gathers every position $i$ across profiles.

    Sign-coherence may avoid inconsistent variations across profiles.

    Figure: log-ratio (CNVs) against position on chromosome.

    K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
23. Outline

    - Definition
    - Resolution
    - Consistency
    - Model selection
    - Simulation studies
    - Sibling probe sets and gene selection
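25. The cooperative-Lasso estimator

    Definition:

    $$\hat\beta^{\text{coop}} = \arg\min_{\beta \in \mathbb{R}^p} J(\beta), \quad \text{with } J(\beta) = -\ell_{\mathcal{D}}(\beta) + \lambda \|\beta\|_{\text{coop}},$$

    where, for any $v \in \mathbb{R}^p$,

    $$\|v\|_{\text{coop}} = \|v^+\|_{\text{group}} + \|v^-\|_{\text{group}} = \sum_{k=1}^{K} \left( \left\| v^+_{G_k} \right\| + \left\| v^-_{G_k} \right\| \right),$$

    and

    - $v^+ = (v_1^+, \dots, v_p^+)$, $v_j^+ = \max(0, v_j)$,
    - $v^- = (v_1^-, \dots, v_p^-)$, $v_j^- = \max(0, -v_j)$.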
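The definition above translates into a few lines of code. This is a sketch under assumed names (`coop_norm`, `group_norm`, groups encoded as 0-based index lists), with all weights $w_k$ set to 1 as on the slides; it is meant to show how the two penalties differ on a sign-incoherent group.

```python
import math


def l2(values):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def group_norm(beta, groups):
    """Group-Lasso penalty with unit weights."""
    return sum(l2([beta[j] for j in idx]) for idx in groups)


def coop_norm(beta, groups):
    """Coop penalty: group norm of the positive part plus group norm of the negative part."""
    total = 0.0
    for idx in groups:
        positive_part = [max(0.0, beta[j]) for j in idx]
        negative_part = [max(0.0, -beta[j]) for j in idx]
        total += l2(positive_part) + l2(negative_part)
    return total
```

With groups `[[0, 1], [2, 3]]`, the sign-coherent vector `[3, 4, 0, 0]` has the same value 5 under both penalties, whereas the sign-incoherent vector `[3, -4, 0, 0]` costs 3 + 4 = 7 under the coop penalty and still 5 under the group penalty. Sign disagreement inside a group is therefore penalized more heavily.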
26. A geometric view of sparsity

    $$\min_{\beta_1, \beta_2} \; -\ell(\beta_1, \beta_2) + \lambda\,\Omega(\beta_1, \beta_2)$$

    or, equivalently,

    $$\max_{\beta_1, \beta_2} \; \ell(\beta_1, \beta_2) \quad \text{s.t.} \quad \Omega(\beta_1, \beta_2) \leq c.$$

    Figure: level curves of the likelihood and the admissible set in the $(\beta_1, \beta_2)$ plane.
28. Ball crafting: group-Lasso

    Admissible set: $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^\top$, with $G_1 = \{1, 2\}$ and $G_2 = \{3, 4\}$.

    Unit ball: $\|\beta\|_{\text{group}} \leq 1$.

    Figure: four cuts of the unit ball in the $(\beta_1, \beta_3)$ plane, both axes ranging from $-1$ to $1$, for $\beta_2 \in \{0, 0.3\}$ (rows) and $\beta_4 \in \{0, 0.3\}$ (columns).
32. Ball crafting: cooperative-Lasso

    Admissible set: $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^\top$, with $G_1 = \{1, 2\}$ and $G_2 = \{3, 4\}$.

    Unit ball: $\|\beta\|_{\text{coop}} \leq 1$.

    Figure: four cuts of the unit ball in the $(\beta_1, \beta_3)$ plane, both axes ranging from $-1$ to $1$, for $\beta_2 \in \{0, 0.3\}$ (rows) and $\beta_4 \in \{0, 0.3\}$ (columns).
37. Convex analysis: supporting hyperplane

    A hyperplane supports a set iff

    - the set is contained in one half-space,
    - the set has at least one point on the hyperplane.

    There are supporting hyperplanes at all boundary points of convex sets: they generalize tangents.

    Figure: examples of supporting hyperplanes in the $(\beta_1, \beta_2)$ plane.
42. Convex analysis: dual cone and subgradient

    Generalizes normals.

    - $g$ is a subgradient at $x$ if and only if the vector $(g, -1)$ is normal to the supporting hyperplane at this point.
    - The subdifferential at $x$ is the set of all subgradients at $x$.

    Figure: normal cones at several points of sets in the $(\beta_1, \beta_2)$ plane.
46. Optimality conditions

    Theorem: a necessary and sufficient condition for the optimality of $\beta$ is that the null vector $0$ belongs to the subdifferential of the convex function $J$:

    $$0 \in \partial_\beta J(\beta) = \{ v \in \mathbb{R}^p : v = -\nabla_\beta \ell(\beta) + \lambda\theta \},$$

    where $\theta \in \mathbb{R}^p$ belongs to the subdifferential of the coop-norm. Define $\varphi_j(v) = (\operatorname{sign}(v_j)\, v)^+$; then $\theta$ is such that

    $$\forall k \in \{1, \dots, K\},\ \forall j \in S_k(\beta), \quad \theta_j = \frac{\beta_j}{\left\| \varphi_j(\beta_{G_k}) \right\|},$$

    $$\forall k \in \{1, \dots, K\},\ \forall j \in S_k^c(\beta), \quad \left\| \varphi_j(\theta_{G_k}) \right\| \leq 1.$$

    We derive a subset algorithm to solve that problem (that you can enjoy in the paper and the package).
49. Linear regression with orthonormal design

    Consider

    $$\hat\beta = \arg\min_{\beta} \left\{ \frac{1}{2} \| y - \mathbf{X}\beta \|^2 + \lambda\,\Omega(\beta) \right\},$$

    with $\mathbf{X}^\top \mathbf{X} = I$. Hence $(x^j)^\top (\mathbf{X}\beta - y) = \beta_j - \hat\beta_j^{\text{ols}}$ and, up to a constant that does not depend on $\beta$,

    $$\hat\beta = \arg\min_{\beta} \left\{ \frac{1}{2} \left\| \beta - \hat\beta^{\text{ols}} \right\|^2 + \lambda\,\Omega(\beta) \right\}.$$

    We may find a closed form of $\hat\beta$ for, e.g.,

    1. $\Omega(\beta) = \|\beta\|_{\text{lasso}}$: for all $j \in \{1, \dots, p\}$,

        $$\hat\beta_j^{\text{lasso}} = \left( 1 - \frac{\lambda}{|\hat\beta_j^{\text{ols}}|} \right)^{+} \hat\beta_j^{\text{ols}}, \qquad |\hat\beta_j^{\text{lasso}}| = \left( |\hat\beta_j^{\text{ols}}| - \lambda \right)^{+}.$$

    2. $\Omega(\beta) = \|\beta\|_{\text{group}}$: for all $k \in \{1, \dots, K\}$ and $j \in G_k$,

        $$\hat\beta_j^{\text{group}} = \left( 1 - \frac{\lambda}{\| \hat\beta_{G_k}^{\text{ols}} \|} \right)^{+} \hat\beta_j^{\text{ols}}, \qquad \| \hat\beta_{G_k}^{\text{group}} \| = \left( \| \hat\beta_{G_k}^{\text{ols}} \| - \lambda \right)^{+}.$$

    3. $\Omega(\beta) = \|\beta\|_{\text{coop}}$: for all $k \in \{1, \dots, K\}$ and $j \in G_k$,

        $$\hat\beta_j^{\text{coop}} = \left( 1 - \frac{\lambda}{\| \varphi_j(\hat\beta_{G_k}^{\text{ols}}) \|} \right)^{+} \hat\beta_j^{\text{ols}}, \qquad \| \varphi_j(\hat\beta_{G_k}^{\text{coop}}) \| = \left( \| \varphi_j(\hat\beta_{G_k}^{\text{ols}}) \| - \lambda \right)^{+}.$$

    Figures: $\hat\beta_1^{\text{lasso}}$, $\hat\beta_1^{\text{group}}$ and $\hat\beta_1^{\text{coop}}$ as functions of the OLS coefficients $(\hat\beta_1^{\text{ols}}, \hat\beta_2^{\text{ols}})$.
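The three closed forms for the orthonormal design can be compared side by side. The sketch below is only valid under $\mathbf{X}^\top\mathbf{X} = I$ and is not the subset algorithm of the paper; the function names and the 0-based group encoding are assumptions of this example. For the coop-Lasso, $\|\varphi_j(\hat\beta_{G_k}^{\text{ols}})\|$ is the norm of the positive part of the group when $\hat\beta_j^{\text{ols}} > 0$ and the norm of its negative part otherwise.

```python
import math


def shrink(value, norm, lam):
    """Return (1 - lam / norm)_+ * value, with the convention 0 when norm == 0."""
    if norm == 0.0:
        return 0.0
    return max(0.0, 1.0 - lam / norm) * value


def lasso_threshold(b_ols, lam):
    """Coordinate-wise soft-thresholding of the OLS coefficients."""
    return [shrink(b, abs(b), lam) for b in b_ols]


def group_threshold(b_ols, groups, lam):
    """Each group is shrunk according to its full Euclidean norm."""
    out = [0.0] * len(b_ols)
    for idx in groups:
        norm = math.sqrt(sum(b_ols[j] ** 2 for j in idx))
        for j in idx:
            out[j] = shrink(b_ols[j], norm, lam)
    return out


def coop_threshold(b_ols, groups, lam):
    """Positive and negative parts of each group are shrunk separately."""
    out = [0.0] * len(b_ols)
    for idx in groups:
        pos_norm = math.sqrt(sum(max(0.0, b_ols[j]) ** 2 for j in idx))
        neg_norm = math.sqrt(sum(max(0.0, -b_ols[j]) ** 2 for j in idx))
        for j in idx:
            norm = pos_norm if b_ols[j] > 0 else neg_norm
            out[j] = shrink(b_ols[j], norm, lam)
    return out
```

For instance, with OLS coefficients (3, 4, 0.6, -0.8), groups {1, 2} and {3, 4}, and $\lambda = 0.5$, the group-Lasso returns approximately (2.7, 3.6, 0.3, -0.4) and the coop-Lasso approximately (2.7, 3.6, 0.1, -0.3). The two agree on the sign-coherent group, while on the sign-incoherent group the coop-Lasso shrinks each sign separately and here coincides with the Lasso.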
55. Linear regression setup

    Technical assumptions:

    - (A1) $X$ and $Y$ have finite fourth order moments: $\mathbb{E}\|X\|^4 < \infty$, $\mathbb{E}|Y|^4 < \infty$,
    - (A2) the covariance matrix $\Psi = \mathbb{E}\,XX^\top \in \mathbb{R}^{p \times p}$ is invertible,
    - (A3) for every $k = 1, \dots, K$, if $\|(\beta^\star_{G_k})^+\| > 0$ and $\|(\beta^\star_{G_k})^-\| > 0$, then $\beta^\star_j \neq 0$ for every $j \in G_k$. (All sign-incoherent groups are either included in or excluded from the true support.)
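56. Irrepresentability condition

    Define $S_k = S \cap G_k$ the support within a group and

    $$[D(\beta)]_{jj} = \left\| [\operatorname{sign}(\beta_j)\, \beta_{G_k}]^+ \right\|^{-1}.$$

    Assume there exists $\eta > 0$ such that

    - (A4) for every group $G_k$ including at least one null coefficient:

        $$\max\left( \left\| \left( \Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S \right)^+ \right\|,\ \left\| \left( \Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S \right)^- \right\| \right) \leq 1 - \eta,$$

    - (A5) for every group $G_k$ intersecting the support and including either positive or negative coefficients, let $\nu_k$ be the sign of these coefficients ($\nu_k = 1$ if $\|(\beta^\star_{G_k})^+\| > 0$ and $\nu_k = -1$ if $\|(\beta^\star_{G_k})^-\| > 0$):

        $$\nu_k\, \Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S \preceq 0,$$

        where $\preceq$ denotes componentwise inequality.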
57. Consistency results

    Theorem: if assumptions (A1-5) are satisfied for some $\eta > 0$, then for every sequence $\lambda_n$ such that $\lambda_n = \lambda_0 n^{-\gamma}$, $\gamma \in\ ]0, 1/2[$,

    $$\hat\beta^{\text{coop}} \xrightarrow{\ P\ } \beta^\star \quad \text{and} \quad \mathbb{P}\left( S(\hat\beta^{\text{coop}}) = S \right) \to 1.$$

    Asymptotically, the cooperative-Lasso is unbiased and enjoys exact support recovery (even when there are irrelevant variables within a group).
58. Sketch of the proof

    1. Construct an artificial estimator $\tilde\beta_S$ restricted to the true support $S$ and extend it with 0 coefficients on $S^c$.
    2. Consider the event $E_n$ on which $\tilde\beta$ satisfies the original optimality conditions. On $E_n$, $\tilde\beta_S = \hat\beta_S^{\text{coop}}$ and $\hat\beta_{S^c}^{\text{coop}} = 0$, by uniqueness.
    3. We need to prove that $\lim_{n \to \infty} \mathbb{P}(E_n) = 1$.
    4. Derive the asymptotic distribution of the derivative of the loss function, $\mathbf{X}^\top (y - \mathbf{X}\tilde\beta)$, from the central limit theorem on second order moments and from the optimality conditions on $\tilde\beta_S$. The right choice of $\lambda_n$ provides convergence in probability.
    5. Assumptions (A4-5) ensure that the limits in probability satisfy the optimality constraints with strict inequalities.
    6. As a result, the optimality conditions are satisfied (with non-strict inequalities) with probability tending to 1.
62. Illustration

    Generate data $y = \mathbf{X}\beta^\star + \sigma\varepsilon$, with

    - $\beta^\star = (1, 1, -1, -1, 0, 0, 0, 0)$,
    - $\mathcal{G} = \{\{1, 2\}, \{3, 4\}, \{5, 6\}, \{7, 8\}\}$,
    - $\sigma = 0.1$, $R^2 \approx 0.99$, $n = 20$,
    - irrepresentability conditions that hold for the coop-Lasso and do not hold for the group-Lasso,
    - average over 100 simulations.

    Figures (group-Lasso, then coop-Lasso): coefficients against $\log_{10}(\lambda)$, with 50% coverage intervals (upper / lower quartiles).
66. Optimism of the training error

    The training error:

    $$\text{err} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} L(y_i, x_i\hat\beta).$$

    The test error ("extra-sample" error):

    $$\text{Err}_{\text{ex}} = \mathbb{E}_{X,Y}\left[ L(Y, X\hat\beta) \mid \mathcal{D} \right].$$

    The "in-sample" error:

    $$\text{Err}_{\text{in}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \mathbb{E}_{Y}\left[ L(Y_i, x_i\hat\beta) \mid \mathcal{D} \right].$$

    Definition (Optimism): $\text{Err}_{\text{in}} = \text{err} + \text{"optimism"}$.
68. Cp statistics

    For squared-error loss (and some other losses),

    $$\text{Err}_{\text{in}} = \text{err} + \frac{2}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \operatorname{cov}(\hat y_i, y_i).$$

    The amount by which err underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater the covariance will be, thereby increasing the optimism (ESLII 5th print).

    Mallows' Cp statistic: for a linear regression fit $\hat y_i$ with $p$ inputs, $\sum_{i \in \mathcal{D}} \operatorname{cov}(\hat y_i, y_i) = p\sigma^2$, so that

    $$C_p = \text{err} + 2 \cdot \frac{\text{df}}{|\mathcal{D}|}\, \hat\sigma^2, \quad \text{with df} = p.$$
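The Cp correction is a one-line adjustment of the training error. The sketch below assumes squared-error loss and takes the noise variance estimate `sigma2_hat` as given; the function names are hypothetical.

```python
def training_error(y, y_hat):
    """Mean squared error over the training sample (the quantity err)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mallows_cp(y, y_hat, df, sigma2_hat):
    """Cp = err + 2 * df / |D| * sigma2_hat."""
    return training_error(y, y_hat) + 2.0 * df / len(y) * sigma2_hat
```

For an ordinary least squares fit, `df` is the number of inputs $p$; the following slides replace it by an estimate of the degrees of freedom of the penalized fit.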
70. Generalized degrees of freedom

    Let $\hat y(\lambda) = \mathbf{X}\hat\beta(\lambda)$ be the predicted values for a penalized estimator.

    Proposition (Efron ('04) + Stein's Lemma ('81)):

    $$\text{df}(\lambda) \doteq \frac{1}{\sigma^2} \sum_{i \in \mathcal{D}} \operatorname{cov}(\hat y_i(\lambda), y_i) = \mathbb{E}_y\left[ \operatorname{tr}\left( \frac{\partial \hat y_\lambda}{\partial y} \right) \right].$$

    For the Lasso, Zou et al. ('07) show that

    $$\widehat{\text{df}}_{\text{lasso}}(\lambda) = \left\| \hat\beta^{\text{lasso}}(\lambda) \right\|_0.$$

    Assuming $\mathbf{X}^\top\mathbf{X} = I$, Yuan and Lin ('06) show for the group-Lasso that the trace term equals

    $$\widehat{\text{df}}_{\text{group}}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\left( \left\| \hat\beta_{G_k}^{\text{group}}(\lambda) \right\| > 0 \right) \left( 1 + (p_k - 1)\, \frac{\left\| \hat\beta_{G_k}^{\text{group}}(\lambda) \right\|}{\left\| \hat\beta_{G_k}^{\text{ols}} \right\|} \right).$$
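Both estimates are cheap to evaluate once the fitted and OLS coefficients are available. The sketch below uses hypothetical function names and 0-based index lists for the groups; the group-Lasso formula is only justified under the orthonormal design assumption stated on the slide.

```python
import math


def df_lasso(beta_hat):
    """Number of non-zero coefficients of the Lasso estimate."""
    return sum(1 for b in beta_hat if b != 0.0)


def df_group(beta_hat, beta_ols, groups):
    """Sum over active groups of 1 + (p_k - 1) * ||beta_hat_Gk|| / ||beta_ols_Gk||."""
    total = 0.0
    for idx in groups:
        norm_hat = math.sqrt(sum(beta_hat[j] ** 2 for j in idx))
        norm_ols = math.sqrt(sum(beta_ols[j] ** 2 for j in idx))
        if norm_hat > 0.0:
            total += 1.0 + (len(idx) - 1) * norm_hat / norm_ols
    return total
```

An active group thus counts for between 1 and $p_k$ degrees of freedom, depending on how much it has been shrunk relative to the OLS fit.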
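73. Approximated degrees of freedom for the coop-Lasso

    Proposition: assuming that data are generated according to a linear regression model and that $\mathbf{X}$ is orthonormal, the following expression of $\widehat{\text{df}}_{\text{coop}}(\lambda)$ is an unbiased estimate of $\text{df}(\lambda)$:

    $$\widehat{\text{df}}_{\text{coop}}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}\left( \left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^+ \right\| > 0 \right) \left( 1 + (p_+^k - 1)\, \frac{\left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^+ \right\|}{\left\| (\hat\beta_{G_k}^{\text{ols}})^+ \right\|} \right) + \mathbf{1}\left( \left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^- \right\| > 0 \right) \left( 1 + (p_-^k - 1)\, \frac{\left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^- \right\|}{\left\| (\hat\beta_{G_k}^{\text{ols}})^- \right\|} \right) \right],$$

    where $p_+^k$ and $p_-^k$ are respectively the number of positive and negative entries in $\hat\beta_{G_k}^{\text{ols}}$.

    The slide then gives the variant based on a ridge estimate $\hat\beta^{\text{ridge}}(\gamma)$:

    $$\widehat{\text{df}}_{\text{coop}}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}\left( \left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^+ \right\| > 0 \right) \left( 1 + \frac{p_+^k - 1}{1 + \gamma}\, \frac{\left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^+ \right\|}{\left\| (\hat\beta_{G_k}^{\text{ridge}}(\gamma))^+ \right\|} \right) + \mathbf{1}\left( \left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^- \right\| > 0 \right) \left( 1 + \frac{p_-^k - 1}{1 + \gamma}\, \frac{\left\| (\hat\beta_{G_k}^{\text{coop}}(\lambda))^- \right\|}{\left\| (\hat\beta_{G_k}^{\text{ridge}}(\gamma))^- \right\|} \right) \right],$$

    where $p_+^k$ and $p_-^k$ are respectively the number of positive and negative entries in $\hat\beta_{G_k}^{\text{ridge}}(\gamma)$.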
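A sketch of the orthonormal-design version of this estimate follows; it treats the positive and the negative part of each group as two separate sub-groups. The names `df_coop` and `signed_part_norm` are hypothetical, groups are 0-based index lists, and the ridge-based variant is not covered here.

```python
import math


def signed_part_norm(values, sign):
    """Euclidean norm of the positive (sign=+1) or negative (sign=-1) part."""
    return math.sqrt(sum(max(0.0, sign * v) ** 2 for v in values))


def df_coop(beta_hat, beta_ols, groups):
    """Degrees-of-freedom estimate for the coop-Lasso, orthonormal design only."""
    total = 0.0
    for idx in groups:
        hat = [beta_hat[j] for j in idx]
        ols = [beta_ols[j] for j in idx]
        for sign in (1.0, -1.0):
            norm_hat = signed_part_norm(hat, sign)
            norm_ols = signed_part_norm(ols, sign)
            # p_+^k or p_-^k: number of OLS entries of this sign in the group.
            count = sum(1 for v in ols if sign * v > 0.0)
            if norm_hat > 0.0 and norm_ols > 0.0:
                total += 1.0 + (count - 1) * norm_hat / norm_ols
    return total
```

On a group whose coefficients all share one sign, this reduces to the group-Lasso expression of the previous slide.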
75. Approximated information criteria

    Following Zou et al., we extend the Cp statistic to an "approximated" AIC,

    $$\text{AIC}(\lambda) = \frac{\| y - \hat y(\lambda) \|^2}{\sigma^2} + 2\,\widetilde{\text{df}}(\lambda),$$

    and from the AIC, there is a (small) step to the BIC:

    $$\text{BIC}(\lambda) = \frac{\| y - \hat y(\lambda) \|^2}{\sigma^2} + \log(n)\,\widetilde{\text{df}}(\lambda).$$

    K-fold cross-validation works well but is computationally intensive. It is required when we do not meet the linear regression setup.
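Given a degrees-of-freedom estimate for each candidate $\lambda$, both criteria are straightforward to evaluate. The sketch below uses hypothetical names (`aic`, `bic`, `select_lambda`) and assumes the noise variance `sigma2` is known or estimated elsewhere.

```python
import math


def rss(y, y_hat):
    """Residual sum of squares ||y - y_hat||^2."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def aic(y, y_hat, df, sigma2):
    return rss(y, y_hat) / sigma2 + 2.0 * df


def bic(y, y_hat, df, sigma2):
    return rss(y, y_hat) / sigma2 + math.log(len(y)) * df


def select_lambda(criterion, y, fits, sigma2):
    """Pick the lambda minimizing the criterion.

    `fits` maps each candidate lambda to a pair (y_hat, df).
    """
    return min(fits, key=lambda lam: criterion(y, fits[lam][0], fits[lam][1], sigma2))
```

Since $\log(n) > 2$ as soon as $n \geq 8$, the BIC penalizes degrees of freedom more heavily than the AIC on all but very small samples and tends to select sparser models.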
77. Revisiting Elastic-Net experiments (1)

    Generate data $y = \mathbf{X}\beta^\star + \sigma\varepsilon$, with

    - $\beta^\star = (\underbrace{0, \dots, 0}_{10}, \underbrace{2, \dots, 2}_{10}, \underbrace{0, \dots, 0}_{10}, \underbrace{2, \dots, 2}_{10})$,
    - $G_1 = \{1, \dots, 10\}$, $G_2 = \{11, \dots, 20\}$, $G_3 = \{21, \dots, 30\}$, $G_4 = \{31, \dots, 40\}$,
    - $\sigma = 15$, $\operatorname{corr}(x_i, x_j) = 0.5$,
    - training/validation/test = 100/100/400,
    - average over 100 simulations.

    Figure: boxplots of the MSE for lasso, enet, group and coop.
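One way to simulate data of this kind is sketched below. The slide only states the pairwise correlation, so the construction is an assumption: unit-variance Gaussian predictors built from one shared factor, which gives correlation `corr` between any two distinct columns. The function names are hypothetical.

```python
import math
import random


def true_beta():
    """Four blocks of ten coefficients: zeros, twos, zeros, twos."""
    return [0.0] * 10 + [2.0] * 10 + [0.0] * 10 + [2.0] * 10


def equicorrelated_row(p, corr, rng):
    """One observation of p unit-variance Gaussians with pairwise correlation corr."""
    shared = rng.gauss(0.0, 1.0)
    a, b = math.sqrt(corr), math.sqrt(1.0 - corr)
    return [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(p)]


def simulate(n, sigma=15.0, corr=0.5, seed=0):
    """Return (X, y) with y = X beta + sigma * noise."""
    rng = random.Random(seed)
    beta = true_beta()
    X = [equicorrelated_row(len(beta), corr, rng) for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + sigma * rng.gauss(0.0, 1.0)
         for row in X]
    return X, y
```

Calling `simulate(100)`, `simulate(100, seed=1)` and `simulate(400, seed=2)` would give training, validation and test samples of the sizes quoted on the slide.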
78. Revisiting Elastic-Net experiments (2)

    Generate data $y = \mathbf{X}\beta^\star + \sigma\varepsilon$, with

    - $\beta^\star = (\underbrace{3, \dots, 3}_{15}, \underbrace{0, \dots, 0}_{25})$,
    - $\sigma = 15$,
    - $G_1 = \{1, \dots, 5\}$, $G_2 = \{6, \dots, 10\}$, $G_3 = \{11, \dots, 15\}$, $G_4 = \{16, \dots, 40\}$,
    - $x_j = Z_1 + \varepsilon$, $Z_1 \sim \mathcal{N}(0, 1)$, $\forall j \in G_1$,
    - $x_j = Z_2 + \varepsilon$, $Z_2 \sim \mathcal{N}(0, 1)$, $\forall j \in G_2$,
    - $x_j = Z_3 + \varepsilon$, $Z_3 \sim \mathcal{N}(0, 1)$, $\forall j \in G_3$,
    - $x_j \sim \mathcal{N}(0, 1)$, $\forall j \in G_4$,
    - training/validation/test = 50/50/400,
    - average over 100 simulations.

    Figure: boxplots of the MSE for lasso, enet, group and coop.
79. Breiman's setup

    Simulations setting: a wave-like vector of parameters $\beta^\star$.

    - $p = 90$ variables partitioned into $K = 10$ groups of size $p_k = 9$,
    - 3 (partially) active groups, 6 groups of zeros,
    - in active groups, $\beta^\star_j \propto (h - |5 - j|)^+$ with $h = 1, \dots, 5$.

    Figures: $\beta^\star$ against the variable index (0 to 80) for $h = 1, 2, 3, 4, 5$, giving respectively $|S_k| = 1, 3, 5, 7, 9$ non-zero coefficients in each active group.
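The wave shape is easy to reproduce. In the sketch below, the positions of the active groups are left to the caller because the slides do not state them; the function names and the 0-based group indices are assumptions of this example.

```python
def wave_group(h, size=9):
    """Coefficients (h - |c - j|)_+ for j = 1..size, c being the middle position."""
    center = (size + 1) // 2  # 5 for groups of size 9
    return [max(0, h - abs(center - j)) for j in range(1, size + 1)]


def wave_beta(h, active_groups, n_groups=10, size=9):
    """Concatenate wave-shaped active groups and all-zero inactive groups."""
    beta = []
    for k in range(n_groups):
        beta.extend(wave_group(h, size) if k in active_groups else [0] * size)
    return beta
```

Each active group then carries $2h - 1$ non-zero coefficients, which matches the values $|S_k| = 1, 3, 5, 7, 9$ quoted for $h = 1, \dots, 5$. The vector still has to be rescaled to reach the desired signal strength.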
84. Breiman's setup: example of path of solution and signal recovery with BIC choice

    The signal strength is generated so that $y = \mathbf{X}\beta^\star + \sigma\varepsilon$, with

    - $\sigma = 1$, $n = 30$ to $500$,
    - $X \sim \mathcal{N}(0, \Psi)$ with $\Psi_{ij} = \rho^{|i - j|}$ ($\rho = 0.4$ in the example),
    - magnitude in $\beta^\star$ chosen so that $R^2 \approx 0.75$.

    Remark: the covariance structure is purposely disconnected from the group structure. None of the support recovery conditions are fulfilled.

    One-shot sample with $n = 120$.

    Figures (Lasso, Group-Lasso, Coop-Lasso): on the left, the path of $\hat\beta$ against $\log_{10}(\lambda)$; on the right, the true signal and the estimated signal against the index $i$.
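A design with covariance $\Psi_{ij} = \rho^{|i-j|}$ can be drawn row by row with a stationary AR(1) recursion, and the magnitude of the coefficients can be rescaled to reach a target $R^2$. The sketch below is one possible reading of the setup: it takes $R^2$ to be the population ratio $\beta^\top \Psi \beta / (\beta^\top \Psi \beta + \sigma^2)$, which the slides do not spell out, and all function names are hypothetical.

```python
import math
import random


def ar1_row(p, rho, rng):
    """One draw of N(0, Psi) with Psi_ij = rho^|i-j|, via an AR(1) recursion."""
    row = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho ** 2)
    for _ in range(p - 1):
        row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
    return row


def signal_variance(beta, rho):
    """beta' Psi beta for Psi_ij = rho^|i-j|."""
    p = len(beta)
    return sum(beta[i] * beta[j] * rho ** abs(i - j)
               for i in range(p) for j in range(p))


def rescale_for_r2(beta, rho, sigma, r2):
    """Scale beta so that signal / (signal + sigma^2) equals r2."""
    target = sigma ** 2 * r2 / (1.0 - r2)
    factor = math.sqrt(target / signal_variance(beta, rho))
    return [factor * b for b in beta]


def simulate(beta, n, rho, sigma, seed=0):
    """Return (X, y) with y = X beta + sigma * noise."""
    rng = random.Random(seed)
    X = [ar1_row(len(beta), rho, rng) for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, beta)) + sigma * rng.gauss(0.0, 1.0)
         for row in X]
    return X, y
```

With $\sigma = 1$ and a target of 0.75, the rescaled coefficients satisfy $\beta^\top \Psi \beta = 3$.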