# Multitask learning for GGM


1. **Inferring Multiple Graph Structures.** Julien Chiquet¹, Yves Grandvalet², Christophe Ambroise¹. ¹Statistique et Génome, CNRS & Université d'Évry Val d'Essonne; ²Heudiasyc, CNRS & Université de Technologie de Compiègne. NeMo, 21 June 2010. Chiquet, Grandvalet, Ambroise, arXiv preprint, *Inferring multiple Gaussian graphical structures*; Chiquet, Grasseau, Charbonnier and Ambroise, R package SIMoNe, http://stat.genopole.cnrs.fr/~jchiquet/fr/softwares/simone
2. **Problem.** Inference: few arrays ⇔ few examples; lots of genes ⇔ high dimension. Which interactions? Interactions ⇔ very high dimension. The main difficulty is the low-sample-size, high-dimensional setting; our main hope is to benefit from sparsity: few genes interact.
3. **Handling the scarcity of data.** Merge several experimental conditions: experiment 1, experiment 2, experiment 3.
4. **Handling the scarcity of data.** Inferring each graph independently does not help: for each experiment $t$, a separate inference is run on $(X_1^{(t)}, \dots, X_{n_t}^{(t)})$.
5. **Handling the scarcity of data.** By pooling all the available data: a single inference on $(X_1, \dots, X_n)$ with $n = n_1 + n_2 + n_3$.
7. **Handling the scarcity of data.** By breaking the separability: the inferences on the $T$ samples $(X_1^{(t)}, \dots, X_{n_t}^{(t)})$ are coupled.
9. **Outline.** Statistical model; Multi-task learning; Algorithms and methods; Model selection; Experiments.
10. **Outline.** Next section: Statistical model.
11. **Gaussian graphical modeling.** Let $X = (X_1, \dots, X_p) \sim \mathcal{N}(0_p, \Sigma)$, assume $n$ i.i.d. copies of $X$, let $\mathbf{X}$ be the $n \times p$ matrix whose $k$th row is $X_k$, and let $\Theta = (\theta_{ij})_{i,j \in \mathcal{P}} = \Sigma^{-1}$ be the concentration matrix. Graphical interpretation: since $\mathrm{corr}_{ij|\mathcal{P} \setminus \{i,j\}} = -\theta_{ij} / \sqrt{\theta_{ii}\theta_{jj}}$ for $i \neq j$, we have $\theta_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j \mid X_{\mathcal{P} \setminus \{i,j\}} \;\Leftrightarrow\;$ edge $(i,j) \notin$ network: the non-zeroes of $\Theta$ describe the graph structure.
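The correspondence between zeroes of $\Theta$ and missing edges can be checked numerically; a minimal sketch in Python (the 3×3 concentration matrix is an invented example, not data from the talk):

```python
import math

def partial_corr(theta, i, j):
    """Partial correlation of X_i and X_j given all other variables,
    read off the concentration matrix: -theta_ij / sqrt(theta_ii * theta_jj)."""
    return -theta[i][j] / math.sqrt(theta[i][i] * theta[j][j])

def edges(theta, tol=1e-12):
    """Edges of the conditional-independence graph: non-zero off-diagonal entries."""
    p = len(theta)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i][j]) > tol}

# Invented 3-variable example: X_1 and X_3 are conditionally independent.
Theta = [[2.0, 0.6, 0.0],
         [0.6, 2.0, 0.5],
         [0.0, 0.5, 2.0]]
print(edges(Theta))               # two edges, (0, 1) and (1, 2); no edge (0, 2)
print(partial_corr(Theta, 0, 1))  # -0.3
```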
12. **The model likelihood.** Let $S = n^{-1}\mathbf{X}^\top\mathbf{X}$ be the empirical variance-covariance matrix: $S$ is a sufficient statistic for $\mathbf{X}$, so $L(\Theta; \mathbf{X}) = L(\Theta; S)$. The log-likelihood is $L(\Theta; S) = \frac{n}{2}\log\det(\Theta) - \frac{n}{2}\operatorname{trace}(S\Theta) - \frac{n}{2}\log(2\pi)$. The MLE of $\Theta$ is $S^{-1}$: not defined for $n < p$, and not sparse, hence a fully connected graph.
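The log-likelihood (dropping the additive constant, which does not affect the optimization) can be sketched in code for $p = 2$; the 2×2 inputs below are invented for illustration:

```python
import math

def loglik_2x2(theta, S, n):
    """n/2 * (log det(Theta) - trace(S @ Theta)) for 2x2 matrices;
    the constant -(n/2) log(2*pi) term is dropped."""
    det = theta[0][0] * theta[1][1] - theta[0][1] * theta[1][0]
    trace = sum(S[i][k] * theta[k][i] for i in range(2) for k in range(2))
    return 0.5 * n * (math.log(det) - trace)

S = [[2.0, 0.0], [0.0, 2.0]]    # invented empirical covariance
mle = [[0.5, 0.0], [0.0, 0.5]]  # S^{-1}, the (dense) maximum-likelihood estimate
eye = [[1.0, 0.0], [0.0, 1.0]]
print(loglik_2x2(mle, S, n=10) > loglik_2x2(eye, S, n=10))  # True: the MLE wins
```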
13. **Penalized approaches.** *Penalized likelihood* (Banerjee et al., 2008): $\max_{\Theta \in \mathcal{S}_+} L(\Theta; S) - \lambda \|\Theta\|_1$; well defined for $n < p$; sparse, hence a sensible graph; an SDP of size $O(p^2)$ (solved by Friedman et al., 2007). *Neighborhood selection* (Meinshausen & Bühlmann, 2006): $\hat\beta = \operatorname{argmin}_{\beta \in \mathbb{R}^{p-1}} \frac{1}{n}\|\mathbf{X}_j - \mathbf{X}_{\setminus j}\beta\|_2^2 + \lambda\|\beta\|_1$, where $\mathbf{X}_j$ is the $j$th column of $\mathbf{X}$ and $\mathbf{X}_{\setminus j}$ is $\mathbf{X}$ deprived of $\mathbf{X}_j$; not symmetric, not positive-definite; $p$ independent LASSO problems of size $p - 1$.
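Neighborhood selection reduces to one LASSO per node; a self-contained coordinate-descent sketch, written for the standard objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (a rescaling of the one above; the toy data are invented):

```python
def soft(z, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return (z - lam) if z > lam else (z + lam) if z < -lam else 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft(rho, lam) / norm
    return beta

# Invented toy data: two orthogonal predictors, y depends only on the first.
X = [[1, 1], [-1, 1], [2, -1], [-2, -1], [0, 0]]
y = [2, -2, 4, -4, 0]           # y = 2 * x1 exactly
print(lasso_cd(X, y, lam=0.1))  # ~[1.95, 0.0]: x1 selected and shrunk, x2 dropped
```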
15. **Neighborhood vs. likelihood.** *Pseudo-likelihood* (Besag, 1975): $\mathbb{P}(X_1, \dots, X_p) \approx \prod_{j=1}^p \mathbb{P}(X_j \mid \{X_k\}_{k \neq j})$, which yields $\tilde L(\Theta; S) = \frac{n}{2}\log\det(\mathbf{D}) - \frac{n}{2}\operatorname{trace}(S\mathbf{D}^{-1}\Theta^2) - \frac{n}{2}\log(2\pi)$, to be compared with $L(\Theta; S) = \frac{n}{2}\log\det(\Theta) - \frac{n}{2}\operatorname{trace}(S\Theta) - \frac{n}{2}\log(2\pi)$, with $\mathbf{D} = \operatorname{diag}(\Theta)$. *Proposition* (Ambroise, Chiquet, Matias, 2008): neighborhood selection leads to the graph maximizing the penalized pseudo-log-likelihood. Proof: $\hat\beta_i = -\tilde\theta_{ij}/\tilde\theta_{jj}$, where $\tilde\Theta = \arg\max_\Theta \tilde L(\Theta; S) - \lambda\|\Theta\|_1$.
17. **Outline.** Next section: Multi-task learning.
18. **Multi-task learning.** We have $T$ samples (experimental conditions) of the same variables: $\mathbf{X}^{(t)}$ is the $t$th data matrix, $S^{(t)}$ its empirical covariance, and examples are assumed drawn from $\mathcal{N}(0, \Sigma^{(t)})$. Ignoring the relationships between the tasks leads to separable objectives: $\max_{\Theta^{(t)} \in \mathbb{R}^{p \times p}} L(\Theta^{(t)}; S^{(t)}) - \lambda\|\Theta^{(t)}\|_1$ for $t = 1, \dots, T$. Multi-task learning = solving the $T$ tasks jointly; we may couple the objectives through the fitting term, or through the penalty term.
20. **Coupling through the fitting term: intertwined LASSO.** $\max_{\Theta^{(t)}, t=1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; \tilde S^{(t)}) - \lambda\|\Theta^{(t)}\|_1$, where $\bar S = \frac{1}{n}\sum_{t=1}^T n_t S^{(t)}$ is the "pooled-tasks" covariance matrix and $\tilde S^{(t)} = \alpha S^{(t)} + (1 - \alpha)\bar S$ is a mixture between the task-specific and pooled covariance matrices. $\alpha = 0$ pools the data sets and infers a single graph; $\alpha = 1$ separates the data sets and infers $T$ graphs independently; $\alpha = 1/2$ in all our experiments.
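The intertwined covariance mixture is simple to compute; a sketch with 2×2 matrices stored as nested lists (the task covariances are invented):

```python
def pooled(S_list, n_list):
    """Pooled-tasks covariance: (1/n) * sum_t n_t * S^(t), with n = sum_t n_t."""
    n = sum(n_list)
    p = len(S_list[0])
    return [[sum(nt * S[i][j] for S, nt in zip(S_list, n_list)) / n
             for j in range(p)] for i in range(p)]

def intertwined(S_list, n_list, alpha=0.5):
    """tilde S^(t) = alpha * S^(t) + (1 - alpha) * pooled covariance."""
    Sbar = pooled(S_list, n_list)
    p = len(Sbar)
    return [[[alpha * S[i][j] + (1 - alpha) * Sbar[i][j]
              for j in range(p)] for i in range(p)] for S in S_list]

# Two invented tasks with equal sample sizes
S1 = [[1.0, 0.2], [0.2, 1.0]]
S2 = [[3.0, 0.0], [0.0, 3.0]]
tilde = intertwined([S1, S2], [10, 10], alpha=0.5)
print(tilde[0])  # each task's covariance is pulled halfway toward the pool
```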
21. **Coupling through penalties: graphical group-LASSO.** We group parameters by the sets of corresponding edges across graphs (e.g. edge $(X_1, X_2)$ in each of the $T$ graphs over $X_1, \dots, X_4$): $\max_{\Theta^{(t)}, t=1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; S^{(t)}) - \lambda \sum_{i \neq j} \Big(\sum_{t=1}^T \big(\theta_{ij}^{(t)}\big)^2\Big)^{1/2}$. The sparsity pattern is shared between graphs: identical graphs across tasks.
26. **Coupling through penalties: graphical cooperative-LASSO.** Same grouping, plus the bet that correlations are likely to be sign-consistent: gene interactions are either inhibitory or activating across assays. $\max_{\Theta^{(t)}, t=1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; S^{(t)}) - \lambda \sum_{i \neq j} \bigg[\Big(\sum_{t=1}^T \big[\theta_{ij}^{(t)}\big]_+^2\Big)^{1/2} + \Big(\sum_{t=1}^T \big[-\theta_{ij}^{(t)}\big]_+^2\Big)^{1/2}\bigg]$, where $[u]_+ = \max(0, u)$. Plausible in many other situations; the sparsity pattern is shared between graphs, which may nonetheless differ.
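The two penalties can be compared directly on a single edge's coefficient group across tasks; a sketch with invented toy values:

```python
import math

def group_pen(theta_group):
    """Group-LASSO contribution of one edge: l2 norm across the T tasks."""
    return math.sqrt(sum(t * t for t in theta_group))

def coop_pen(theta_group):
    """Cooperative-LASSO contribution: separate l2 norms of the positive
    and negative parts, so sign disagreements cost more."""
    pos = math.sqrt(sum(max(0.0, t) ** 2 for t in theta_group))
    neg = math.sqrt(sum(max(0.0, -t) ** 2 for t in theta_group))
    return pos + neg

print(group_pen([3.0, 4.0]), coop_pen([3.0, 4.0]))    # 5.0 5.0: equal when signs agree
print(group_pen([3.0, -4.0]), coop_pen([3.0, -4.0]))  # 5.0 7.0: coop penalizes the sign flip
```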
33. **Outline.** Next section: Algorithms and methods.
35. **A geometric view of sparsity: constrained optimization.** $\max_{\beta_1, \beta_2} L(\beta_1, \beta_2) - \lambda\,\Omega(\beta_1, \beta_2) \;\Leftrightarrow\; \max_{\beta_1, \beta_2} L(\beta_1, \beta_2)$ subject to $\Omega(\beta_1, \beta_2) \leq c$.
41. **A geometric view of sparsity: supporting hyperplanes.** A hyperplane supports a set iff the set is contained in one of the two half-spaces it defines and has at least one point on the hyperplane. There are supporting hyperplanes at all boundary points of a convex set: they generalize tangents.
45. **A geometric view of sparsity: dual cones.** The dual cone generalizes normals; the shape of the dual cones determines the sparsity pattern.
46. **Group-LASSO balls.** Admissible set for $T = 2$ tasks and $p = 2$ coefficients: the unit ball $\big\{\beta : \sum_{i=1}^2 \big(\sum_{t=1}^2 (\beta_i^{(t)})^2\big)^{1/2} \leq 1\big\}$. Figure: 2-D slices of the ball in the $(\beta_1^{(1)}, \beta_2^{(1)})$ plane for $\beta_1^{(2)}, \beta_2^{(2)} \in \{0, 0.3\}$.
50. **Cooperative-LASSO balls.** Admissible set for $T = 2$ tasks and $p = 2$ coefficients: the unit ball $\big\{\beta : \sum_{j=1}^2 \big[\big(\sum_{t=1}^2 [\beta_j^{(t)}]_+^2\big)^{1/2} + \big(\sum_{t=1}^2 [-\beta_j^{(t)}]_+^2\big)^{1/2}\big] \leq 1\big\}$. Figure: 2-D slices of the ball in the $(\beta_1^{(1)}, \beta_2^{(1)})$ plane for $\beta_1^{(2)}, \beta_2^{(2)} \in \{0, 0.3\}$.
56. **Decomposition strategy.** Estimating the $j$th neighborhood of the $T$ graphs, $\max_{\mathbf{K}^{(t)}, t=1,\dots,T} \sum_{t=1}^T \tilde L(\mathbf{K}^{(t)}; S^{(t)}) - \lambda\,\Omega(\mathbf{K}^{(t)})$, decomposes into $p$ convex optimization problems of size $T \times (p-1)$: $\hat\beta_j = \operatorname{argmin}_{\beta \in \mathbb{R}^{T \times (p-1)}} f_j(\beta) + \lambda\,\Omega(\beta)$, where $\hat\beta_j$ is a minimizer iff $0 \in \nabla_\beta f_j(\beta) + \lambda\,\partial_\beta \Omega(\beta)$. Group-LASSO: $\Omega(\beta) = \sum_{i=1}^{p-1} \|\beta_i^{[1:T]}\|_2$; coop-LASSO: $\Omega(\beta) = \sum_{i=1}^{p-1} \big\|[\beta_i^{[1:T]}]_+\big\|_2 + \big\|[-\beta_i^{[1:T]}]_+\big\|_2$, where $\beta_i^{[1:T]}$ is the vector of coefficients of edge $(i,j)$ across the $T$ graphs.
59. **Active set algorithm.** Presented in three refinements (yellow, orange and green belt); the complete version:

```
// 0. INITIALIZATION
β ← 0, A ← ∅
while 0 ∉ ∂β L(β) do
    // 1. MASTER PROBLEM: optimization with respect to β_A
    //    find a solution h to the smooth problem
    //    ∇_h f(β_A + h) + λ ∂_h Ω(β_A + h) = 0, where ∂_h Ω = {∇_h Ω}
    β_A ← β_A + h
    // 2. IDENTIFY NEWLY ZEROED VARIABLES
    while ∃ i ∈ A : β_i = 0 and min_{ν ∈ ∂_β Ω} |∂f(β)/∂β_i + λν| = 0 do
        A ← A \ {i}
    end
    // 3. IDENTIFY NEW NON-ZERO VARIABLES: select the candidate i ∈ A^c
    //    for which an infinitesimal change of β_i provides the highest
    //    reduction of L
    i ← argmax_{j ∈ A^c} v_j,  where v_j = min_{ν ∈ ∂_β Ω} |∂f(β)/∂β_j + λν|
    if v_i ≠ 0 then
        A ← A ∪ {i}
    else
        stop and return β, which is optimal
    end
end
```
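Step 3, selecting the inactive variable that most violates the optimality conditions, can be sketched for a plain $\ell_1$ penalty, where the subdifferential at zero is $[-1, 1]$ coordinate-wise (the gradient values are invented):

```python
def violation(grad_j, lam):
    """v_j = min over nu in [-1, 1] of |grad_j + lam * nu|:
    zero iff the KKT condition |grad_j| <= lam already holds at beta_j = 0."""
    return max(0.0, abs(grad_j) - lam)

def select_candidate(grads, inactive, lam):
    """Return the inactive index with the largest violation, or None if optimal."""
    best = max(inactive, key=lambda j: violation(grads[j], lam))
    return best if violation(grads[best], lam) > 0 else None

grads = [0.5, -2.0, 1.2, 0.9]   # invented gradient of the smooth fitting term
print(select_candidate(grads, inactive=[0, 1, 2, 3], lam=1.0))  # 1: largest violation
print(select_candidate(grads, inactive=[0, 3], lam=1.0))        # None: KKT satisfied
```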
62. **Outline.** Next section: Model selection.
63. **Tuning the penalty parameter: what does the literature say?** Theory-based penalty choices: (1) the optimal order of the penalty in the $p \gg n$ framework is $\sqrt{\log p / n}$ (Bunea et al. 2007, Bickel et al. 2009); (2) control of the probability of connecting two distinct connectivity sets (Meinshausen et al. 2006, Banerjee et al. 2008, Ambroise et al. 2009); in practice these are much too conservative. Cross-validation: optimal in terms of prediction, not in terms of selection, and problematic with small samples, since the sparsity constraint changes with the sample size.
64. **Tuning the penalty parameter: BIC / AIC.** Theorem (Zou et al. 2008): $\mathrm{df}(\hat\beta_\lambda^{\text{lasso}}) = \|\hat\beta_\lambda^{\text{lasso}}\|_0$. Straightforward extensions to the graphical framework: $\mathrm{BIC}(\lambda) = L(\hat\Theta_\lambda; \mathbf{X}) - \mathrm{df}(\hat\Theta_\lambda)\,\frac{\log n}{2}$ and $\mathrm{AIC}(\lambda) = L(\hat\Theta_\lambda; \mathbf{X}) - \mathrm{df}(\hat\Theta_\lambda)$. These rely on asymptotic approximations, but remain relevant for small data sets.
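With the Zou et al. degrees-of-freedom estimate, BIC and AIC are simple to evaluate along the path; a sketch (the fitted coefficients and log-likelihood value are invented):

```python
import math

def df(beta, tol=1e-12):
    """Estimated degrees of freedom of a lasso fit:
    the number of non-zero coefficients (Zou et al. 2008)."""
    return sum(1 for b in beta if abs(b) > tol)

def bic(loglik, beta, n):
    return loglik - df(beta) * math.log(n) / 2

def aic(loglik, beta):
    return loglik - df(beta)

beta_hat = [0.7, 0.0, -1.2, 0.0]    # invented fit: 2 active coefficients
print(df(beta_hat))                 # 2
print(bic(-10.0, beta_hat, n=100))  # -10 - log(100), roughly -14.6
```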
65. **Outline.** Next section: Experiments.
66. **Data generation.** We set the number of nodes $p$, the number of edges $K$, and the number of examples $n$. Process: (1) generate a random adjacency matrix with $2K$ off-diagonal terms; (2) compute the normalized Laplacian $\mathbf{L}$; (3) generate a symmetric matrix of random signs $\mathbf{R}$; (4) compute the concentration matrix $\mathbf{K}_{ij} = \mathbf{L}_{ij}\mathbf{R}_{ij}$; (5) compute $\Sigma$ by pseudo-inversion of $\mathbf{K}$; (6) generate correlated Gaussian data $\sim \mathcal{N}(0, \Sigma)$.
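The six-step recipe can be sketched with NumPy; the particular graph sampler and sign conventions below are our assumptions, not necessarily those used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(p=20, K=20, n=50):
    # 1. random adjacency matrix with 2K off-diagonal terms (K undirected edges)
    A = np.zeros((p, p))
    iu = np.triu_indices(p, k=1)
    idx = rng.choice(p * (p - 1) // 2, size=K, replace=False)
    A[iu[0][idx], iu[1][idx]] = 1
    A = A + A.T
    # 2. normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    d = np.maximum(A.sum(axis=1), 1e-12)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(p) - Dinv @ A @ Dinv
    # 3. symmetric matrix of random signs
    R = np.sign(rng.standard_normal((p, p)))
    R = np.triu(R, 1)
    R = R + R.T + np.eye(p)
    # 4. concentration matrix K_ij = L_ij * R_ij (element-wise)
    Kmat = L * R
    # 5. covariance by pseudo-inversion, symmetrized against round-off
    Sigma = np.linalg.pinv(Kmat)
    Sigma = (Sigma + Sigma.T) / 2
    # 6. correlated Gaussian data
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return Kmat, Sigma, X
```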
67. **Simulating related tasks.** Generate (1) an "ancestor" with $p = 20$ nodes and $K = 20$ edges; (2) $T = 4$ children by adding and deleting $\delta$ edges; (3) $T = 4$ Gaussian samples. Figure: ancestor and children with $\delta = 2$ perturbations.
70. **Simulation results.** Precision/recall curve: precision = TP/(TP+FP) versus recall = TP/P (power). ROC curve: recall (power) versus fallout = FP/N (type-I error).
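These metrics are computed from the true and estimated edge sets; a sketch with invented edges on a 4-node graph:

```python
def scores(true_edges, pred_edges, n_possible):
    """Precision, recall (power) and fallout (type-I error) for edge recovery."""
    tp = len(true_edges & pred_edges)
    fp = len(pred_edges - true_edges)
    P = len(true_edges)       # positives: edges present in the true graph
    N = n_possible - P        # negatives: absent edges
    precision = tp / len(pred_edges) if pred_edges else 1.0
    recall = tp / P
    fallout = fp / N
    return precision, recall, fallout

# Invented example: 4 nodes, hence 6 possible undirected edges
true_e = {(1, 2), (1, 3), (2, 4)}
pred_e = {(1, 2), (2, 4), (3, 4)}
print(scores(true_e, pred_e, n_possible=6))  # roughly (2/3, 2/3, 1/3)
```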
71. **Simulation results.** Figures: precision/recall and ROC curves, as the penalty decreases from $\lambda_{\max}$ to 0, for CoopLasso, GroupLasso, Intertwined, Independent and Pooled, over all combinations of sample size $n_t \in \{100, 50, 25\}$ (large, medium, small) and perturbation $\delta \in \{1, 3, 5\}$.
80. **Breast cancer.** Prediction of the outcome of preoperative chemotherapy. Two types of patients: the response is classified either as (1) pathologic complete response (PCR) or (2) residual disease (not PCR). Gene expression data: 133 patients (99 not PCR, 34 PCR), 26 genes identified by differential analysis.
81. **Package demo.** Cancer data, coop-LASSO.
82. **Conclusions.** To sum up: we clarified the links between neighborhood selection and the graphical LASSO; identified the relevance of multi-task learning in network inference; proposed the first methods for inferring multiple Gaussian graphical models; obtained consistent improvements upon the available baseline solutions; everything is available in the R package SIMoNe. Perspectives: explore model-selection capabilities; other applications of the cooperative-LASSO; theoretical analysis (uniqueness, selection consistency).