- 1. Sparsity with sign-coherent groups of variables via the cooperative-Lasso
  Julien Chiquet (1), Yves Grandvalet (2), Camille Charbonnier (1)
  (1) Statistique et Génome, CNRS & Université d'Evry Val d'Essonne
  (2) Heudiasyc, CNRS & Université de Technologie de Compiègne
  SSB, 29 March 2011
  arXiv preprint: http://arxiv.org/abs/1103.2697
  R package scoop: http://stat.genopole.cnrs.fr/logiciels/scoop
- 2. Notations
  Let $Y$ be the output random variable and $X = (X^1, \dots, X^p)$ the input random variables, where $X^j$ is the $j$th predictor.
  The data: given a sample $\{(y_i, x_i), i = 1, \dots, n\}$ of i.i.d. realizations of $(Y, X)$, denote by
  - $y = (y_1, \dots, y_n)^\top$ the response vector,
  - $x^j = (x^j_1, \dots, x^j_n)^\top$ the vector of data for the $j$th predictor,
  - $X$ the $n \times p$ design matrix of data whose $j$th column is $x^j$,
  - $D = \{i : (y_i, x_i) \in \text{training set}\}$,
  - $T = \{i : (y_i, x_i) \in \text{test set}\}$.
- 3. Generalized linear models
  Suppose $Y$ depends linearly on $X$ through a function $g$: $E(Y) = g(X\beta^\star)$.
  We predict a response $y_i$ by $\hat y_i = g(x_i \hat\beta)$ for any $i \in T$ by solving
  $\hat\beta = \arg\max_\beta \ell_D(\beta) = \arg\min_\beta \sum_{i \in D} L_g(y_i, x_i\beta)$,
  where $L_g$ is a loss function depending on the function $g$. Typically,
  - if $Y$ is Gaussian and $g = \mathrm{Id}$ (OLS), $L_g(y, x\beta) = (y - x\beta)^2$,
  - if $Y$ is binary and $g : t \mapsto g(t) = (1 + e^{-t})^{-1}$ (logistic regression), $L_g(y, x\beta) = -\left(y \cdot x\beta - \log(1 + e^{x\beta})\right)$,
  - or any negative log-likelihood of an exponential family distribution.
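
The two losses on this slide can be sketched in a few lines. This is only an illustration assuming NumPy; the function names `squared_loss`, `logistic_loss` and `empirical_risk` are hypothetical and do not come from the slides or from the scoop package.

```python
import numpy as np

def squared_loss(y, eta):
    """Gaussian case with g = identity: (y - x beta)^2, where eta = x beta."""
    return (y - eta) ** 2

def logistic_loss(y, eta):
    """Binary case: -(y * eta - log(1 + exp(eta))), where eta = x beta."""
    # np.logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow
    return -(y * eta - np.logaddexp(0.0, eta))

def empirical_risk(loss, y, X, beta):
    """Sum of the per-observation losses over the rows of X (the training set D)."""
    return float(np.sum(loss(y, X @ beta)))
```

Minimizing `empirical_risk` over `beta` corresponds to the arg min above; any generic optimizer could be plugged in for that step.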
- 5. Estimation and selection at the group level
  1. Structure: the set $I = \{1, \dots, p\}$ splits into a known partition, $I = \bigcup_{k=1}^{K} G_k$, with $G_k \cap G_\ell = \emptyset$ for $k \neq \ell$.
  2. Sparsity: the support $S$ of $\beta^\star$ has few entries, $S = \{i : \beta^\star_i \neq 0\}$, such that $|S| \ll p$.
  The group-Lasso estimator (Grandvalet and Canu '98, Bakin '99, Yuan and Lin '06):
  $\hat\beta^{group} = \arg\min_{\beta \in \mathbb{R}^p} -\ell_D(\beta) + \lambda \sum_{k=1}^{K} w_k \|\beta_{G_k}\|$,
  where
  - $\lambda \geq 0$ controls the overall amount of penalty,
  - $w_k > 0$ adapts the penalty between groups (dropped hereafter).
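
As a rough illustration of the penalty term, the sketch below evaluates $\sum_k w_k \|\beta_{G_k}\|$ for a partition given as lists of 0-based indices. It assumes NumPy and Euclidean norms within groups; the name `group_lasso_penalty` is hypothetical.

```python
import numpy as np

def group_lasso_penalty(beta, groups, weights=None):
    """Sum over groups of w_k times the Euclidean norm of beta restricted to G_k."""
    beta = np.asarray(beta, dtype=float)
    if weights is None:
        # weights dropped, as on the slide: w_k = 1 for every group
        weights = np.ones(len(groups))
    return float(sum(w * np.linalg.norm(beta[g]) for w, g in zip(weights, groups)))
```

For instance, with `beta = [3, 4, 0, 0]` and groups `[[0, 1], [2, 3]]`, only the first group contributes to the penalty.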
- 7. Toy example: the prostate dataset
  Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients.
  - lcavol: log(cancer volume)
  - lweight: log(prostate weight)
  - age: age
  - lbph: log(benign prostatic hyperplasia amount)
  - svi: seminal vesicle invasion
  - lcp: log(capsular penetration)
  - gleason: Gleason score
  - pgg45: percentage Gleason scores 4 or 5
  Figure: Lasso coefficient paths (coefficients against lambda, log scale).
- 8. Toy example: the prostate dataset
  Figure: hierarchical clustering of the 8 clinical measures (dendrogram, vertical axis: height).
- 9. Toy example: the prostate dataset
  Figure: group-Lasso coefficient paths (coefficients against lambda, log scale).
- 11. Application to splice site detection
  Predict splice site status (0/1) from a sequence of 7 bases and their interactions.
  - order 0: 7 factors with 4 levels,
  - order 1: $\binom{7}{2}$ factors with $4^2$ levels,
  - order 2: $\binom{7}{3}$ factors with $4^3$ levels,
  - using dummy coding for factors, we form groups.
  Figure: information content against position.
  L. Meier, S. van de Geer, P. Bühlmann, 2008. The group-Lasso for logistic regression, JRSS series B.
- 12. Application to splice site detection
  Figure: groups g4, g5, g18, g42, g44, g45, g49, g54 and g61 labelled, with a legend for order 0, order 1 and order 2.
- 13. Group-Lasso limitations
  1. Not a single zero should belong to a group with non-zeros. Strong group sparsity (Huang and Zhang, '10 arXiv) establishes the conditions under which the group-Lasso outperforms the Lasso, and conversely.
  2. No sign-coherence within groups. Sign-coherence is required if groups gather consonant variables, e.g., groups defined by clusters of positively correlated variables.
  The cooperative-Lasso: a penalty which assumes a sign-coherent group structure, that is to say, groups which gather either non-positive, non-negative, or null parameters.
- 15. Motivation: multiple network inference
  Figure: three experiments (experiment 1, experiment 2, experiment 3), each followed by an inference step.
  A group is a set of corresponding edges across tasks (e.g., red or blue ones): sign-coherence matters!
  J. Chiquet, Y. Grandvalet, C. Ambroise, 2010. Inferring multiple graphical structures, Statistics and Computing.
- 16. Motivation: joint segmentation of aCGH profiles
  $\min_{\beta \in \mathbb{R}^p} \|\beta - y\|^2$ s.t. $\sum_{i=1}^{p} |\beta_i - \beta_{i-1}| < s$,
  where $y$ is a vector in $\mathbb{R}^p$ and $\beta$ a vector in $\mathbb{R}^p$.
  Figure: log-ratio (CNVs) against position on chromosome.
  K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
- 17. Motivation: joint segmentation of aCGH profiles
  $\min_{\beta \in \mathbb{R}^{n \times p}} \|\beta - Y\|^2$ s.t. $\sum_{i=1}^{p} \|\beta_i - \beta_{i-1}\| < s$, where
  - $Y$ is an $n \times p$ matrix with $n$ profiles of size $p$,
  - $\beta_i$ is a size-$n$ vector with the $i$th probes for the $n$ profiles,
  - a group gathers every position $i$ across profiles.
  Sign-coherence may avoid inconsistent variations across profiles.
  Figure: log-ratio (CNVs) against position on chromosome.
  K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
- 23. Outline
  - Definition
  - Resolution
  - Consistency
  - Model selection
  - Simulation studies
  - Sibling probe sets and gene selection
- 25. The cooperative-Lasso estimator
  Definition: $\hat\beta^{coop} = \arg\min_{\beta \in \mathbb{R}^p} J(\beta)$, with $J(\beta) = -\ell_D(\beta) + \lambda \|\beta\|_{coop}$,
  where, for any $v \in \mathbb{R}^p$,
  $\|v\|_{coop} = \|v^+\|_{group} + \|v^-\|_{group} = \sum_{k=1}^{K} \left( \|v^+_{G_k}\| + \|v^-_{G_k}\| \right)$,
  and
  - $v^+ = (v_1^+, \dots, v_p^+)$, $v_j^+ = \max(0, v_j)$,
  - $v^- = (v_1^-, \dots, v_p^-)$, $v_j^- = \max(0, -v_j)$.
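
The coop-norm can be evaluated directly from this definition. The sketch below assumes NumPy, Euclidean norms within groups and groups given as lists of 0-based indices; the name `coop_norm` is hypothetical.

```python
import numpy as np

def coop_norm(v, groups):
    """Group norm of the positive part plus group norm of the negative part."""
    v = np.asarray(v, dtype=float)
    pos = np.maximum(v, 0.0)   # v^+
    neg = np.maximum(-v, 0.0)  # v^-
    return float(sum(np.linalg.norm(pos[g]) + np.linalg.norm(neg[g]) for g in groups))
```

A sign-coherent group such as `(3, 4)` gets the same value as under the group norm (5), whereas the sign-incoherent group `(3, -4)` is penalized like two separate groups (3 + 4 = 7).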
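- 26. A geometric view of sparsity
  $\min_{\beta_1, \beta_2} -\ell(\beta_1, \beta_2) + \lambda \Omega(\beta_1, \beta_2)$
  is equivalent to
  $\max_{\beta_1, \beta_2} \ell(\beta_1, \beta_2)$ s.t. $\Omega(\beta_1, \beta_2) \leq c$.
  Figure: the constraint set in the $(\beta_1, \beta_2)$ plane.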
- 28. Ball crafting: group-Lasso
  Admissible set: $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^\top$, $G_1 = \{1, 2\}$, $G_2 = \{3, 4\}$.
  Unit ball: $\|\beta\|_{group} \leq 1$.
  Figure: cuts of the unit ball in the $(\beta_1, \beta_3)$ plane, for $\beta_2 \in \{0, 0.3\}$ and $\beta_4 \in \{0, 0.3\}$.
- 32. Ball crafting: cooperative-Lasso
  Admissible set: $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^\top$, $G_1 = \{1, 2\}$, $G_2 = \{3, 4\}$.
  Unit ball: $\|\beta\|_{coop} \leq 1$.
  Figure: cuts of the unit ball in the $(\beta_1, \beta_3)$ plane, for $\beta_2 \in \{0, 0.3\}$ and $\beta_4 \in \{0, 0.3\}$.
- 41. Convex analysis: supporting hyperplane
  A hyperplane supports a set iff
  - the set is contained in one half-space,
  - the set has at least one point on the hyperplane.
  There are supporting hyperplanes at all boundary points of convex sets: they generalize tangents.
  Figure: three examples in the $(\beta_1, \beta_2)$ plane.
- 42. Convex analysis: dual cone and subgradient
  Generalizes normals.
  $g$ is a subgradient at $x$ $\Leftrightarrow$ the vector $(g, -1)$ is normal to the supporting hyperplane at this point.
  The subdifferential at $x$ is the set of all subgradients at $x$.
  Figure: three examples in the $(\beta_1, \beta_2)$ plane.
- 46. Optimality conditions
  Theorem: a necessary and sufficient condition for the optimality of $\beta$ is that the null vector $0$ belongs to the subdifferential of the convex function $J$:
  $0 \in \partial_\beta J(\beta) = \{v \in \mathbb{R}^p : v = -\nabla_\beta \ell(\beta) + \lambda\theta\}$,
  where $\theta \in \mathbb{R}^p$ belongs to the subdifferential of the coop-norm. Define $\varphi_j(v) = (\mathrm{sign}(v_j)\, v)^+$; then $\theta$ is such that
  - $\forall k \in \{1, \dots, K\}$, $\forall j \in S_k(\beta)$, $\theta_j = \beta_j / \|\varphi_j(\beta_{G_k})\|$,
  - $\forall k \in \{1, \dots, K\}$, $\forall j \in S_k^c(\beta)$, $\|\varphi_j(\theta_{G_k})\| \leq 1$.
  We derive a subset algorithm to solve that problem (that you can enjoy in the paper and the package).
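
The first condition can be illustrated numerically. The sketch below, assuming NumPy, fills in $\theta_j = \beta_j / \|\varphi_j(\beta_{G_k})\|$ for the non-zero coefficients only; entries outside the support are left at zero as a placeholder, since the second condition constrains them only through an inequality. The name `coop_subgradient_on_support` is hypothetical, and this is not the subset algorithm of the paper.

```python
import numpy as np

def coop_subgradient_on_support(beta, groups):
    """theta_j = beta_j / ||phi_j(beta_Gk)|| for every non-zero beta_j."""
    beta = np.asarray(beta, dtype=float)
    theta = np.zeros_like(beta)
    for g in groups:
        b = beta[g]
        pos_norm = np.linalg.norm(np.maximum(b, 0.0))   # ||phi_j|| when beta_j > 0
        neg_norm = np.linalg.norm(np.maximum(-b, 0.0))  # ||phi_j|| when beta_j < 0
        for idx, bj in zip(g, b):
            if bj > 0:
                theta[idx] = bj / pos_norm
            elif bj < 0:
                theta[idx] = bj / neg_norm
    return theta
```

In a sign-coherent group the result coincides with the gradient of the Euclidean norm; in a group mixing signs, positive and negative coefficients are normalized separately.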
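- 49. Linear regression with orthonormal design
  Consider $\hat\beta = \arg\min_\beta \left\{ \frac{1}{2} \|y - X\beta\|^2 + \lambda\Omega(\beta) \right\}$, with $X^\top X = I$.
  Hence $(x^j)^\top (X\beta - y) = \beta_j - \hat\beta_j^{ols}$ and
  $\hat\beta = \arg\min_\beta \left\{ \beta^\top \left( \frac{1}{2}\beta - \hat\beta^{ols} \right) + \lambda\Omega(\beta) \right\}$.
  We may find a closed form of $\hat\beta$ for, e.g.,
  1. $\Omega(\beta) = \|\beta\|_{lasso}$,
  2. $\Omega(\beta) = \|\beta\|_{group}$,
  3. $\Omega(\beta) = \|\beta\|_{coop}$.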
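- 51. Linear regression with orthonormal design: Lasso
  $\forall j \in \{1, \dots, p\}$,
  $\hat\beta_j^{lasso} = \left(1 - \frac{\lambda}{|\hat\beta_j^{ols}|}\right)^+ \hat\beta_j^{ols}$,
  $|\hat\beta_j^{lasso}| = \left(|\hat\beta_j^{ols}| - \lambda\right)^+$.
  Figure: Lasso estimate $\hat\beta_1^{lasso}$ as a function of the OLS coefficients $(\hat\beta_1^{ols}, \hat\beta_2^{ols})$.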
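
The Lasso closed form is the usual soft-thresholding rule. A minimal sketch, assuming NumPy and an orthonormal design so that the OLS coefficients are all that is needed; the name `lasso_orthonormal` is hypothetical.

```python
import numpy as np

def lasso_orthonormal(beta_ols, lam):
    """Soft-thresholding: shrink each OLS coefficient towards 0 by lam."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    # equivalent to (1 - lam / |b|)^+ * b, written to avoid dividing by zero
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
```

Each coefficient is treated on its own: any coefficient whose absolute OLS value is at most `lam` is set to zero.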
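- 52. Linear regression with orthonormal design: group-Lasso
  $\forall k \in \{1, \dots, K\}$, $\forall j \in G_k$,
  $\hat\beta_j^{group} = \left(1 - \frac{\lambda}{\|\hat\beta_{G_k}^{ols}\|}\right)^+ \hat\beta_j^{ols}$,
  $\|\hat\beta_{G_k}^{group}\| = \left(\|\hat\beta_{G_k}^{ols}\| - \lambda\right)^+$.
  Figure: group-Lasso estimate $\hat\beta_1^{group}$ as a function of the OLS coefficients $(\hat\beta_1^{ols}, \hat\beta_2^{ols})$.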
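
The group-wise rule shrinks the whole block at once. A minimal sketch, assuming NumPy, an orthonormal design, and groups given as lists of 0-based indices; the name `group_lasso_orthonormal` is hypothetical.

```python
import numpy as np

def group_lasso_orthonormal(beta_ols, groups, lam):
    """Scale each group of OLS coefficients by (1 - lam / ||beta_ols_Gk||)^+."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    out = np.zeros_like(beta_ols)
    for g in groups:
        norm = np.linalg.norm(beta_ols[g])
        if norm > lam:  # otherwise the whole group stays at zero
            out[g] = (1.0 - lam / norm) * beta_ols[g]
    return out
```

A group is either kept entirely, with all its coefficients shrunk by the same factor, or removed entirely; signs within the group play no role.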
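- 53. Linear regression with orthonormal design: coop-Lasso
  $\forall k \in \{1, \dots, K\}$, $\forall j \in G_k$,
  $\hat\beta_j^{coop} = \left(1 - \frac{\lambda}{\|\varphi_j(\hat\beta_{G_k}^{ols})\|}\right)^+ \hat\beta_j^{ols}$,
  $\|\varphi_j(\hat\beta_{G_k}^{coop})\| = \left(\|\varphi_j(\hat\beta_{G_k}^{ols})\| - \lambda\right)^+$.
  Figure: coop-Lasso estimate $\hat\beta_1^{coop}$ as a function of the OLS coefficients $(\hat\beta_1^{ols}, \hat\beta_2^{ols})$.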
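
The coop-Lasso rule applies the group shrinkage separately to the positive and negative parts of each group, since $\varphi_j$ keeps only the coefficients sharing the sign of coefficient $j$. A minimal sketch, assuming NumPy, an orthonormal design, and groups as lists of 0-based indices; the name `coop_lasso_orthonormal` is hypothetical.

```python
import numpy as np

def coop_lasso_orthonormal(beta_ols, groups, lam):
    """Group shrinkage applied separately to the positive and negative parts."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    out = np.zeros_like(beta_ols)
    for g in groups:
        b = beta_ols[g]
        # signed positive part and signed negative part of the group
        for part in (np.maximum(b, 0.0), -np.maximum(-b, 0.0)):
            norm = np.linalg.norm(part)
            if norm > lam:
                out[g] = out[g] + (1.0 - lam / norm) * part
    return out
```

With OLS coefficients `(3, 4, -0.5, 0)` in a single group and `lam = 1`, the small negative coefficient is removed while the positive ones are kept, which the group-Lasso rule would not do.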
- 55. Linear regression setup
  Technical assumptions:
  - (A1) $X$ and $Y$ have finite fourth order moments: $E\|X\|^4 < \infty$, $E|Y|^4 < \infty$,
  - (A2) the covariance matrix $\Psi = E XX^\top \in \mathbb{R}^{p \times p}$ is invertible,
  - (A3) for every $k = 1, \dots, K$, if $\|(\beta^\star_{G_k})^+\| > 0$ and $\|(\beta^\star_{G_k})^-\| > 0$, then $\beta^\star_j \neq 0$ for every $j \in G_k$. (All sign-incoherent groups are either fully included in or fully excluded from the true support.)
- 56. Irrepresentability condition
  Define $S_k = S \cap G_k$, the support within a group, and $[D(\beta)]_{jj} = \|[\mathrm{sign}(\beta_j)\, \beta_{G_k}]^+\|^{-1}$. Assume there exists $\eta > 0$ such that
  - (A4) for every group $G_k$ including at least one null coefficient:
    $\max\left( \|(\Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S)^+\|, \|(\Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S)^-\| \right) \leq 1 - \eta$,
  - (A5) for every group $G_k$ intersecting the support and including either positive or negative coefficients, let $\nu_k$ be the sign of these coefficients ($\nu_k = 1$ if $\|(\beta^\star_{G_k})^+\| > 0$ and $\nu_k = -1$ if $\|(\beta^\star_{G_k})^-\| > 0$):
    $\nu_k \Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta^\star_S) \beta^\star_S \preceq 0$,
    where $\preceq$ denotes componentwise inequality.
- 57. Consistency results
  Theorem: if assumptions (A1-5) are satisfied (with some $\eta > 0$), then for every sequence $\lambda_n$ such that $\lambda_n = \lambda_0 n^{-\gamma}$, $\gamma \in\, ]0, 1/2[$,
  $\hat\beta^{coop} \xrightarrow{P} \beta^\star$ and $P(S(\hat\beta^{coop}) = S) \to 1$.
  Asymptotically, the cooperative-Lasso is unbiased and enjoys exact support recovery (even when there are irrelevant variables within a group).
- 58. Sketch of the proof
  1. Construct an artificial estimator $\tilde\beta_S$ restricted to the true support $S$ and extend it with 0 coefficients on $S^c$.
  2. Consider the event $E_n$ on which $\tilde\beta$ satisfies the original optimality conditions. On $E_n$, $\tilde\beta_S = \hat\beta_S^{coop}$ and $\hat\beta_{S^c}^{coop} = 0$, by uniqueness.
  3. We need to prove that $\lim_{n \to \infty} P(E_n) = 1$.
  4. Derive the asymptotic distribution of the derivative of the loss function $X^\top (y - X\tilde\beta)$ from
     - the central limit theorem on second order moments,
     - the optimality conditions on $\tilde\beta_S$.
     The right choice of $\lambda_n$ provides convergence in probability.
  5. Assumptions (A4-5) ensure that the limits in probability satisfy the optimality constraints with strict inequalities.
  6. As a result, the optimality conditions are satisfied (with non-strict inequalities) with probability tending to 1.
- 62. Illustration
  Generate data $y = X\beta^\star + \sigma\varepsilon$, with
  - $\beta^\star = (1, 1, -1, -1, 0, 0, 0, 0)$,
  - $G = \{\{1, 2\}, \{3, 4\}, \{5, 6\}, \{7, 8\}\}$,
  - $\sigma = 0.1$, $R^2 \approx 0.99$, $n = 20$,
  - irrepresentability conditions that hold for the coop-Lasso but do not hold for the group-Lasso,
  - average over 100 simulations.
  Figures: coefficients against $\log_{10}(\lambda)$ for the group-Lasso and for the coop-Lasso, with 50% coverage intervals (upper / lower quartiles).
- 66. Optimism of the training error
  The training error: $\mathrm{err} = \frac{1}{|D|} \sum_{i \in D} L(y_i, x_i \hat\beta)$.
  The test error ("extra-sample" error): $\mathrm{Err}_{ex} = E_{X,Y}[L(Y, X\hat\beta) \mid D]$.
  The "in-sample" error: $\mathrm{Err}_{in} = \frac{1}{|D|} \sum_{i \in D} E_Y\left[L(Y_i, x_i \hat\beta) \mid D\right]$.
  Definition (optimism): $\mathrm{Err}_{in} = \mathrm{err} + \text{optimism}$.
- 68. Cp statistics
  For squared-error loss (and some other losses),
  $\mathrm{Err}_{in} = \mathrm{err} + \frac{2}{|D|} \sum_{i \in D} \mathrm{cov}(\hat y_i, y_i)$.
  The amount by which err underestimates the true error depends on how strongly $y_i$ affects its own prediction. The harder we fit the data, the greater the covariance will be, thereby increasing the optimism (ESLII 5th print).
  Mallows' $C_p$ statistic: for a linear regression fit $\hat y_i$ with $p$ inputs, $\sum_{i \in D} \mathrm{cov}(\hat y_i, y_i) = p\sigma^2$:
  $C_p = \mathrm{err} + 2 \cdot \frac{\mathrm{df}}{|D|} \hat\sigma^2$, with $\mathrm{df} = p$.
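
A small numerical sketch of the $C_p$ formula for an ordinary least-squares fit, assuming NumPy, squared-error loss, and a noise variance estimate `sigma2_hat` supplied by the caller (how it is estimated is left open here). The name `mallows_cp` is hypothetical.

```python
import numpy as np

def mallows_cp(y, X, sigma2_hat):
    """Training error plus 2 * df / |D| * sigma2_hat, with df = p for OLS."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    err = np.mean((y - X @ beta_hat) ** 2)            # training error
    return float(err + 2.0 * p / n * sigma2_hat)
```

The correction term grows with the number of inputs `p` and shrinks with the training-set size, matching the optimism term above.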
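- 70. Generalized degrees of freedom
  Let $\hat y(\lambda) = X\hat\beta(\lambda)$ be the predicted values for a penalized estimator.
  Proposition (Efron ('04) + Stein's Lemma ('81)):
  $\mathrm{df}(\lambda) = \frac{1}{\sigma^2} \sum_{i \in D} \mathrm{cov}(\hat y_i(\lambda), y_i) = E_y\left[\mathrm{tr}\left(\frac{\partial \hat y_\lambda}{\partial y}\right)\right]$.
  For the Lasso, Zou et al. ('07) show that $\mathrm{df}_{lasso}(\lambda) = \|\hat\beta^{lasso}(\lambda)\|_0$.
  Assuming $X^\top X = I$, Yuan and Lin ('06) show for the group-Lasso that the trace term equals
  $\mathrm{df}_{group}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\left(\|\hat\beta_{G_k}^{group}(\lambda)\| > 0\right) \left(1 + (p_k - 1) \frac{\|\hat\beta_{G_k}^{group}(\lambda)\|}{\|\hat\beta_{G_k}^{ols}\|}\right)$.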
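
Both degrees-of-freedom expressions are easy to evaluate once the estimates are available. A minimal sketch, assuming NumPy and groups given as lists of 0-based indices; the names `df_lasso` and `df_group` are hypothetical.

```python
import numpy as np

def df_lasso(beta_lasso):
    """Number of non-zero Lasso coefficients."""
    return int(np.count_nonzero(beta_lasso))

def df_group(beta_group, beta_ols, groups):
    """Orthonormal-design expression for the group-Lasso degrees of freedom."""
    beta_group = np.asarray(beta_group, dtype=float)
    beta_ols = np.asarray(beta_ols, dtype=float)
    df = 0.0
    for g in groups:
        norm_group = np.linalg.norm(beta_group[g])
        if norm_group > 0:  # only active groups contribute
            df += 1.0 + (len(g) - 1) * norm_group / np.linalg.norm(beta_ols[g])
    return df
```

An active group of size $p_k$ counts for between 1 and $p_k$ degrees of freedom, depending on how much it has been shrunk relative to OLS.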
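- 73. Approximated degrees of freedom for the coop-Lasso
  Proposition: assuming that data are generated according to a linear regression model and that $X$ is orthonormal, the following expression of $\mathrm{df}_{coop}(\lambda)$ is an unbiased estimate of $\mathrm{df}(\lambda)$:
  $\mathrm{df}_{coop}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}\left(\|(\hat\beta_{G_k}^{coop}(\lambda))^+\| > 0\right) \left(1 + (p_k^+ - 1) \frac{\|(\hat\beta_{G_k}^{coop}(\lambda))^+\|}{\|(\hat\beta_{G_k}^{ols})^+\|}\right) + \mathbf{1}\left(\|(\hat\beta_{G_k}^{coop}(\lambda))^-\| > 0\right) \left(1 + (p_k^- - 1) \frac{\|(\hat\beta_{G_k}^{coop}(\lambda))^-\|}{\|(\hat\beta_{G_k}^{ols})^-\|}\right) \right]$,
  where $p_k^+$ and $p_k^-$ are respectively the number of positive and negative entries in $\hat\beta_{G_k}^{ols}$.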
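
The orthonormal-design expression can be evaluated with a loop over groups and over the two signs. A minimal sketch, assuming NumPy and groups as lists of 0-based indices; the name `df_coop` is hypothetical and this is not the scoop implementation.

```python
import numpy as np

def df_coop(beta_coop, beta_ols, groups):
    """Sum, over groups and signs, of 1 + (p_k^{+/-} - 1) * norm ratio."""
    beta_coop = np.asarray(beta_coop, dtype=float)
    beta_ols = np.asarray(beta_ols, dtype=float)
    df = 0.0
    for g in groups:
        for sign in (1.0, -1.0):
            coop_part = np.maximum(sign * beta_coop[g], 0.0)
            ols_part = np.maximum(sign * beta_ols[g], 0.0)
            norm_coop = np.linalg.norm(coop_part)
            if norm_coop > 0:  # this signed part of the group is active
                p_signed = np.count_nonzero(ols_part)  # p_k^+ or p_k^-
                df += 1.0 + (p_signed - 1) * norm_coop / np.linalg.norm(ols_part)
    return df
```

Each group can therefore contribute up to two terms, one for its positive part and one for its negative part.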
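- 74. Approximated degrees of freedom for the coop-Lasso
  Proposition: assuming that data are generated according to a linear regression model and that $X$ is orthonormal, the following expression of $\mathrm{df}_{coop}(\lambda)$ is an unbiased estimate of $\mathrm{df}(\lambda)$:
  $\mathrm{df}_{coop}(\lambda) = \sum_{k=1}^{K} \left[ \mathbf{1}\left(\|(\hat\beta_{G_k}^{coop}(\lambda))^+\| > 0\right) \left(1 + \frac{p_k^+ - 1}{1 + \gamma} \frac{\|(\hat\beta_{G_k}^{coop}(\lambda))^+\|}{\|(\hat\beta_{G_k}^{ridge}(\gamma))^+\|}\right) + \mathbf{1}\left(\|(\hat\beta_{G_k}^{coop}(\lambda))^-\| > 0\right) \left(1 + \frac{p_k^- - 1}{1 + \gamma} \frac{\|(\hat\beta_{G_k}^{coop}(\lambda))^-\|}{\|(\hat\beta_{G_k}^{ridge}(\gamma))^-\|}\right) \right]$,
  where $p_k^+$ and $p_k^-$ are respectively the number of positive and negative entries in $\hat\beta_{G_k}^{ridge}(\gamma)$.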
- 75. Approximated information criteria
  Following Zou et al., we extend the $C_p$ statistic to an "approximated" AIC,
  $\mathrm{AIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + 2\, \tilde{\mathrm{df}}(\lambda)$,
  and from the AIC, there is a (small) step to BIC:
  $\mathrm{BIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + \log(n)\, \tilde{\mathrm{df}}(\lambda)$.
  K-fold cross-validation works well but is computationally intensive. It is required when we do not meet the linear regression setup.
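
The two criteria differ only in the weight given to the degrees of freedom. A minimal sketch, assuming NumPy, a squared residual norm in the numerator, and a known or estimated noise variance `sigma2`; the names `approx_aic` and `approx_bic` are hypothetical.

```python
import numpy as np

def approx_aic(y, y_hat, df, sigma2):
    """Residual sum of squares over sigma2, plus 2 * df."""
    rss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)
    return float(rss / sigma2 + 2.0 * df)

def approx_bic(y, y_hat, df, sigma2):
    """Residual sum of squares over sigma2, plus log(n) * df."""
    y = np.asarray(y)
    rss = np.sum((y - np.asarray(y_hat)) ** 2)
    return float(rss / sigma2 + np.log(len(y)) * df)
```

In practice one would evaluate either criterion along a grid of $\lambda$ values and keep the minimizer, using a degrees-of-freedom estimate for each fit.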
- 77. Revisiting Elastic-Net experiments (1)
  Generate data $y = X\beta^\star + \sigma\varepsilon$, with
  - $\beta^\star = (0, \dots, 0, 2, \dots, 2, 0, \dots, 0, 2, \dots, 2)$, made of four blocks of 10 entries,
  - $G_1 = \{1, \dots, 10\}$, $G_2 = \{11, \dots, 20\}$, $G_3 = \{21, \dots, 30\}$, $G_4 = \{31, \dots, 40\}$,
  - $\sigma = 15$, $\mathrm{corr}(x_i, x_j) = 0.5$,
  - training/validation/test = 100/100/400,
  - average over 100 simulations.
  Figure: boxplots of the MSE for lasso, enet, group and coop.
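
One way to generate data matching this description is sketched below. It assumes NumPy, Gaussian predictors with unit variance and pairwise correlation 0.5 between all distinct predictors, and standard Gaussian noise; these distributional choices and the name `simulate_enet_experiment1` are assumptions, not details given on the slide.

```python
import numpy as np

def simulate_enet_experiment1(n, sigma=15.0, rho=0.5, seed=0):
    """Draw (X, y) with 4 groups of 10 predictors; groups 2 and 4 are active."""
    rng = np.random.default_rng(seed)
    beta = np.concatenate([np.zeros(10), 2.0 * np.ones(10),
                           np.zeros(10), 2.0 * np.ones(10)])
    p = beta.size
    cov = np.full((p, p), rho)     # constant pairwise correlation (assumption)
    np.fill_diagonal(cov, 1.0)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    groups = [list(range(10 * k, 10 * (k + 1))) for k in range(4)]  # 0-based
    return X, y, beta, groups
```

Calling it with `n = 100`, `100` and `400` and different seeds would give training, validation and test samples of the sizes quoted above.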
- 78. Revisiting Elastic-Net experiments (2)
  Generate data $y = X\beta^\star + \sigma\varepsilon$, with
  - $\beta^\star = (3, \dots, 3, 0, \dots, 0)$, made of 15 entries equal to 3 followed by 25 zeros,
  - $\sigma = 15$,
  - $G_1 = \{1, \dots, 5\}$, $G_2 = \{6, \dots, 10\}$, $G_3 = \{11, \dots, 15\}$, $G_4 = \{16, \dots, 40\}$,
  - $x_j = Z_1 + \varepsilon$, $Z_1 \sim N(0, 1)$, $\forall j \in G_1$,
  - $x_j = Z_2 + \varepsilon$, $Z_2 \sim N(0, 1)$, $\forall j \in G_2$,
  - $x_j = Z_3 + \varepsilon$, $Z_3 \sim N(0, 1)$, $\forall j \in G_3$,
  - $x_j \sim N(0, 1)$, $\forall j \in G_4$,
  - training/validation/test = 50/50/400,
  - average over 100 simulations.
  Figure: boxplots of the MSE for lasso, enet, group and coop.
- 79. Breiman's setup
  Simulations setting: a wave-like vector of parameters $\beta^\star$,
  - $p = 90$ variables partitioned into $K = 10$ groups of size $p_k = 9$,
  - 3 (partially) active groups, 6 groups of zeros,
  - in active groups, $\beta^\star_j \propto (h - |5 - j|)^+$ with $h = 1, \dots, 5$.
  Figures: $\beta^\star$ for $h = 1, 2, 3, 4, 5$, with respectively $|S_k| = 1, 3, 5, 7, 9$ non-zero coefficients in each active group.
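
The wave-like vector can be sketched as follows, assuming NumPy and taking the positive part of $h - |5 - j|$ so that the number of non-zero coefficients per active group is $2h - 1$, as in the figure captions. Which groups are active and the overall scaling are not recoverable from the slide, so the default `active_groups` and the unnormalized amplitudes are assumptions; `wave_beta` is a hypothetical name.

```python
import numpy as np

def wave_beta(h, active_groups=(0, 1, 2), n_groups=10, group_size=9):
    """Unnormalized wave-like parameter vector with triangular bumps."""
    j = np.arange(1, group_size + 1)
    bump = np.maximum(h - np.abs(5 - j), 0).astype(float)  # (h - |5 - j|)^+
    beta = np.zeros(n_groups * group_size)
    for k in active_groups:  # positions of active groups are an assumption
        beta[k * group_size:(k + 1) * group_size] = bump
    return beta
```

With `h = 1` each active group has a single non-zero coefficient at its centre, and with `h = 5` all nine coefficients of an active group are non-zero.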
- 84. Breiman's setup
  Example of path of solution and signal recovery with BIC choice.
  Data are generated as $y = X\beta^\star + \sigma\varepsilon$, with
  - $\sigma = 1$, $n = 30$ to $500$,
  - $X \sim N(0, \Psi)$ with $\Psi_{ij} = \rho^{|i - j|}$ ($\rho = 0.4$ in the example),
  - the magnitude of $\beta^\star$ chosen so that $R^2 \approx 0.75$.
  Remark: the covariance structure is purposely disconnected from the group structure. None of the support recovery conditions are fulfilled.
  One-shot sample with $n = 120$.
  Figures, for the Lasso, the group-Lasso and the coop-Lasso: regularization path (coefficients against $\log_{10}(\lambda)$) and true signal against estimated signal (coefficient against index $i$).
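
The design with covariance $\Psi_{ij} = \rho^{|i-j|}$ can be drawn as sketched below, assuming NumPy; the default `p = 90` follows the simulation setting of this setup, and the name `ar1_design` is hypothetical.

```python
import numpy as np

def ar1_design(n, p=90, rho=0.4, seed=0):
    """Draw n rows from N(0, Psi) with Psi_ij = rho ** |i - j|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    psi = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), psi, size=n)
    return X, psi
```

Because the correlation decays with the distance between indices regardless of group boundaries, the covariance structure is unrelated to the partition into groups, as the remark on the slide points out.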
