Multitask Learning for Gaussian Graphical Models (GGM)

  1. Inferring Multiple Graph Structures
     Julien Chiquet (1), Yves Grandvalet (2), Christophe Ambroise (1)
     (1) Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
     (2) Heudiasyc, CNRS & Université de Technologie de Compiègne
     NeMo – 21 June 2010
     Chiquet, Grandvalet, Ambroise, arXiv preprint: Inferring multiple Gaussian graphical structures.
     Chiquet, Grasseau, Charbonnier and Ambroise, R package SIMoNe: http://stat.genopole.cnrs.fr/~jchiquet/fr/softwares/simone
  2. Problem
     Inference: few arrays ⇔ few examples; lots of genes ⇔ high dimension.
     Which interactions? Interactions ⇔ very high dimension.
     The main difficulty is the low-sample-size, high-dimensional setting.
     Our main hope is to benefit from sparsity: few genes interact.
  3. Handling the scarcity of data
     Merge several experimental conditions (experiment 1, experiment 2, experiment 3).
  4. Handling the scarcity of data
     Inferring each graph independently does not help: each experiment $t$ yields its own sample $(X_1^{(t)}, \dots, X_{n_t}^{(t)})$ and its own inference.
  5. Handling the scarcity of data
     By pooling all the available data: $(X_1, \dots, X_n)$ with $n = n_1 + n_2 + n_3$, and a single inference.
  7. Handling the scarcity of data
     By breaking the separability: keep one sample $(X_1^{(t)}, \dots, X_{n_t}^{(t)})$ and one inference per experiment, but couple the three inferences.
  9. Outline
     Statistical model; Multi-task learning; Algorithms and methods; Model selection; Experiments.
  11. Gaussian graphical modeling
      Let $X = (X_1, \dots, X_p) \sim \mathcal{N}(0_p, \Sigma)$ and assume
      - $n$ i.i.d. copies of $X$, with $\mathbf{X}$ the $n \times p$ matrix whose $k$th row is $X_k$;
      - $\Theta = (\theta_{ij})_{i,j \in \mathcal{P}} = \Sigma^{-1}$ the concentration matrix.
      Graphical interpretation: since $\mathrm{cor}_{ij|\mathcal{P}\setminus\{i,j\}} = -\theta_{ij}/\sqrt{\theta_{ii}\theta_{jj}}$ for $i \neq j$,
      $\theta_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j \mid X_{\mathcal{P}\setminus\{i,j\}} \;\Leftrightarrow\;$ edge $(i,j) \notin$ network,
      so the nonzeros of $\Theta$ describe the graph structure (a small R sketch follows).
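As a reading aid, a minimal base-R sketch of this correspondence; the covariance values are made up for illustration only:

```r
## Toy illustration: read a GGM off the concentration matrix Theta.
p <- 4
Sigma <- diag(p); Sigma[1, 2] <- Sigma[2, 1] <- 0.5   # made-up covariance
Theta <- solve(Sigma)                                  # concentration matrix

## partial correlation: cor(i, j | rest) = -theta_ij / sqrt(theta_ii * theta_jj)
pcor <- -Theta / sqrt(diag(Theta) %o% diag(Theta))
diag(pcor) <- 1

## edge (i, j) is present  <=>  theta_ij != 0 (up to numerical noise)
adjacency <- abs(Theta) > 1e-10
diag(adjacency) <- FALSE
```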
  12. The model likelihood
      Let $S = n^{-1} \mathbf{X}^\top \mathbf{X}$ be the empirical variance-covariance matrix: $S$ is a sufficient statistic for $\mathbf{X}$, so $L(\Theta; \mathbf{X}) = L(\Theta; S)$.
      The log-likelihood:
      $L(\Theta; S) = \frac{n}{2} \log\det(\Theta) - \frac{n}{2} \mathrm{trace}(S\Theta) - \frac{np}{2}\log(2\pi).$
      The MLE of $\Theta$ is $S^{-1}$:
      - not defined for $n < p$ (checked numerically below);
      - not sparse ⇒ fully connected graph.
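A quick numerical check of the first point (base R only; the sizes n = 10, p = 20 are arbitrary):

```r
## When n < p the empirical covariance is singular, so the MLE solve(S)
## of the concentration matrix does not exist.
set.seed(1)
n <- 10; p <- 20
X <- matrix(rnorm(n * p), n, p)
S <- crossprod(X) / n        # rank(S) <= n < p
qr(S)$rank                   # < p: solve(S) would fail here
```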
  13. Penalized Approaches
      Penalized likelihood (Banerjee et al., 2008):
      $\max_{\Theta \in \mathcal{S}_+} L(\Theta; S) - \lambda \|\Theta\|_1$
      - well defined for $n < p$;
      - sparse ⇒ sensible graph;
      - an SDP of size $O(p^2)$ (solved by Friedman et al., 2007).
      Neighborhood selection (Meinshausen & Bühlmann, 2006):
      $\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p-1}} \frac{1}{n} \|X_j - \mathbf{X}_{\setminus j}\beta\|_2^2 + \lambda \|\beta\|_1,$
      where $X_j$ is the $j$th column of $\mathbf{X}$ and $\mathbf{X}_{\setminus j}$ is $\mathbf{X}$ deprived of $X_j$:
      - not symmetric, not positive-definite;
      - $p$ independent LASSO problems of size $p - 1$ (see the sketch after this slide).
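For the penalized-likelihood route, the CRAN package glasso of Friedman et al. implements the estimator above; a sketch with an arbitrary penalty level rho = 0.1 and simulated data:

```r
## Single-task graphical LASSO via the 'glasso' package.
library(glasso)
set.seed(2)
n <- 30; p <- 10
X <- matrix(rnorm(n * p), n, p)
S <- crossprod(scale(X, scale = FALSE)) / n
fit <- glasso(S, rho = 0.1)   # rho plays the role of lambda
Theta.hat <- fit$wi           # sparse estimate of the concentration matrix
A.hat <- abs(Theta.hat) > 1e-8; diag(A.hat) <- FALSE
```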
  15. Neighborhood vs. Likelihood
      Pseudo-likelihood (Besag, 1975): $\mathbb{P}(X_1, \dots, X_p) \approx \prod_{j=1}^p \mathbb{P}(X_j \mid \{X_k\}_{k \neq j})$, giving
      $\tilde L(\Theta; S) = \frac{n}{2} \log\det(D) - \frac{n}{2} \mathrm{trace}(S D^{-1} \Theta^2) - \frac{np}{2}\log(2\pi),$
      to be compared with the genuine log-likelihood
      $L(\Theta; S) = \frac{n}{2} \log\det(\Theta) - \frac{n}{2} \mathrm{trace}(S\Theta) - \frac{np}{2}\log(2\pi),$
      with $D = \mathrm{diag}(\Theta)$.
      Proposition (Ambroise, Chiquet, Matias, 2008): neighborhood selection leads to the graph maximizing the penalized pseudo-log-likelihood.
      Proof: $\hat\beta_i = -\tilde\theta_{ij}/\tilde\theta_{jj}$, where $\tilde\Theta = \operatorname*{argmax}_\Theta \tilde L(\Theta; S) - \lambda \|\Theta\|_1$.
  17. Outline (next section: Multi-task learning)
  18. Multi-task learning
      We have $T$ samples (experimental conditions) of the same variables:
      - $\mathbf{X}^{(t)}$ is the $t$th data matrix, $S^{(t)}$ its empirical covariance;
      - examples of task $t$ are assumed to be drawn from $\mathcal{N}(0, \Sigma^{(t)})$.
      Ignoring the relationships between the tasks leads to $T$ separable objectives:
      $\max_{\Theta^{(t)} \in \mathbb{R}^{p \times p}} L(\Theta^{(t)}; S^{(t)}) - \lambda \|\Theta^{(t)}\|_1, \quad t = 1, \dots, T.$
      Multi-task learning = solving the $T$ tasks jointly. We may couple the objectives through the fitting term, or through the penalty term.
  20. Coupling through the fitting term
      Intertwined LASSO:
      $\max_{\Theta^{(t)},\, t = 1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; \tilde S^{(t)}) - \lambda \|\Theta^{(t)}\|_1,$
      where $\bar S = \frac{1}{n} \sum_{t=1}^T n_t S^{(t)}$ is the "pooled-tasks" covariance matrix and
      $\tilde S^{(t)} = \alpha S^{(t)} + (1 - \alpha) \bar S$
      is a mixture between specific and pooled covariance matrices:
      - $\alpha = 0$ pools the data sets and infers a single graph;
      - $\alpha = 1$ separates the data sets and infers $T$ graphs independently;
      - $\alpha = 1/2$ in all our experiments (see the sketch below).
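A one-function sketch of the intertwined covariances, assuming the per-task covariance matrices are held in an R list (the names S.list and n.vec are ours, not SIMoNe's):

```r
## Intertwined covariances: mix each task covariance with the pooled one.
## S.list: list of T covariance matrices; n.vec: the T sample sizes.
intertwine <- function(S.list, n.vec, alpha = 0.5) {
  n <- sum(n.vec)
  S.bar <- Reduce(`+`, Map(`*`, S.list, n.vec / n))  # pooled-tasks covariance
  lapply(S.list, function(S) alpha * S + (1 - alpha) * S.bar)
}
```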
  21. Coupling through penalties: group-LASSO
      We group parameters by sets of corresponding edges across graphs: edge $(i, j)$ of the $T$ graphs forms one group (illustrated on a 4-node example $X_1, \dots, X_4$ in the slides).
      Graphical group-LASSO:
      $\max_{\Theta^{(t)},\, t = 1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; S^{(t)}) - \lambda \sum_{i \neq j} \Big( \sum_{t=1}^T (\theta_{ij}^{(t)})^2 \Big)^{1/2}$
      - sparsity pattern shared between graphs;
      - identical graphs across tasks (a sketch of the penalty follows).
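A literal transcription of the group penalty, assuming the $T$ concentration matrices are held in a list; counting each undirected edge once only rescales $\lambda$:

```r
## Graphical group-LASSO penalty: l2 norm over tasks, summed over edges.
group.penalty <- function(Theta.list) {
  p <- nrow(Theta.list[[1]])
  off <- upper.tri(matrix(0, p, p))               # each edge counted once
  Th <- do.call(cbind, lapply(Theta.list, function(M) M[off]))  # edges x T
  sum(sqrt(rowSums(Th^2)))
}
```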
  26. Coupling through penalties: cooperative-LASSO
      Same grouping, plus the bet that correlations are likely to be sign-consistent: gene interactions are either inhibitory or activating across assays.
      Graphical cooperative-LASSO:
      $\max_{\Theta^{(t)},\, t = 1,\dots,T} \sum_{t=1}^T L(\Theta^{(t)}; S^{(t)}) - \lambda \sum_{i \neq j} \Big[ \Big( \sum_{t=1}^T [\theta_{ij}^{(t)}]_+^2 \Big)^{1/2} + \Big( \sum_{t=1}^T [\theta_{ij}^{(t)}]_-^2 \Big)^{1/2} \Big],$
      where $[u]_+ = \max(0, u)$ and $[u]_- = \min(0, u)$.
      - plausible in many other situations;
      - sparsity pattern shared between graphs, which may differ (a sketch follows).
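The analogous transcription for the cooperative penalty, with pmax/pmin implementing $[u]_+$ and $[u]_-$:

```r
## Graphical cooperative-LASSO penalty: the positive and negative parts of
## each edge group are penalized separately.
coop.penalty <- function(Theta.list) {
  p <- nrow(Theta.list[[1]])
  off <- upper.tri(matrix(0, p, p))
  Th <- do.call(cbind, lapply(Theta.list, function(M) M[off]))  # edges x T
  sum(sqrt(rowSums(pmax(Th, 0)^2)) +   # [u]+ part of each group
      sqrt(rowSums(pmin(Th, 0)^2)))    # [u]- part of each group
}
```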
  33. Outline (next section: Algorithms and methods)
  34. A Geometric View of Sparsity: constrained optimization
      The penalized problem
      $\max_{\beta_1, \beta_2} L(\beta_1, \beta_2) - \lambda\, \Omega(\beta_1, \beta_2)$
      is equivalent to the constrained problem
      $\max_{\beta_1, \beta_2} L(\beta_1, \beta_2) \quad \text{s.t.} \quad \Omega(\beta_1, \beta_2) \leq c.$
      (Figure: level curves of $L$ and the admissible set in the $(\beta_1, \beta_2)$ plane.)
  37. A Geometric View of Sparsity: supporting hyperplane
      A hyperplane supports a set iff
      - the set is contained in one of the two half-spaces it defines;
      - the set has at least one point on the hyperplane.
      There are supporting hyperplanes at every boundary point of a convex set: they generalize tangents.
  42. A Geometric View of Sparsity: dual cone
      The dual cone generalizes normals.
      Shape of the dual cones ⇒ sparsity pattern: solutions land preferentially on the corners and edges of the admissible set, where whole groups of coefficients vanish.
  46. Group-LASSO balls
      Admissible set for 2 tasks ($T = 2$) and 2 coefficients ($p = 2$); unit ball:
      $\sum_{i=1}^2 \Big( \sum_{t=1}^2 (\beta_i^{(t)})^2 \Big)^{1/2} \leq 1.$
      (Figure: 2-D slices of the unit ball in the $(\beta_1^{(1)}, \beta_2^{(1)})$ plane, at $\beta_1^{(2)}, \beta_2^{(2)} \in \{0, 0.3\}$.)
  50. Cooperative-LASSO balls
      Admissible set for 2 tasks ($T = 2$) and 2 coefficients ($p = 2$); unit ball:
      $\sum_{j=1}^2 \Big[ \Big( \sum_{t=1}^2 [\beta_j^{(t)}]_+^2 \Big)^{1/2} + \Big( \sum_{t=1}^2 [-\beta_j^{(t)}]_+^2 \Big)^{1/2} \Big] \leq 1.$
      (Figure: 2-D slices of the unit ball in the $(\beta_1^{(1)}, \beta_2^{(1)})$ plane, at $\beta_1^{(2)}, \beta_2^{(2)} \in \{0, 0.3\}$.)
  56. Decomposition strategy
      Estimating the $j$th neighborhood of the $T$ graphs,
      $\max_{K^{(t)},\, t = 1,\dots,T} \sum_{t=1}^T \tilde L(K^{(t)}; S^{(t)}) - \lambda\, \Omega(K^{(t)}),$
      decomposes into $p$ convex optimization problems of size $T \times (p - 1)$:
      $\hat\beta_j = \operatorname*{argmin}_{\beta \in \mathbb{R}^{T \times (p-1)}} f_j(\beta) + \lambda\, \Omega(\beta),$
      where $\hat\beta_j$ is a minimizer iff $0 \in \nabla_\beta f_j(\beta) + \lambda\, \partial_\beta \Omega(\beta)$.
      Group-LASSO: $\Omega(\beta) = \sum_{i=1}^{p-1} \|\beta_i^{[1:T]}\|_2$;
      coop-LASSO: $\Omega(\beta) = \sum_{i=1}^{p-1} \big\|[\beta_i^{[1:T]}]_+\big\|_2 + \big\|[-\beta_i^{[1:T]}]_+\big\|_2,$
      where $\beta_i^{[1:T]}$ is the vector corresponding to edge $(i, j)$ across the $T$ graphs (a sketch of the optimality check follows).
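The subgradient condition gives a per-group test for whether a zeroed edge should enter the solution; a sketch for the group-LASSO case (the helper name group.violation is ours):

```r
## KKT violation of one zeroed group under the group-LASSO penalty.
## grad.i: gradient of f_j with respect to the T-vector of edge (i, j).
## A zero group is optimal iff ||grad.i||_2 <= lambda.
group.violation <- function(grad.i, lambda)
  max(sqrt(sum(grad.i^2)) - lambda, 0)
```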
  59. Active set algorithm: yellow belt
      // 0. INITIALIZATION
      β ← 0, A ← ∅
      while 0 ∉ ∂_β L(β) do
        // 1. MASTER PROBLEM: optimization with respect to β_A
        Find a solution h to the smooth problem
          ∇_h f(β_A + h) + λ ∂_h Ω(β_A + h) = 0, where ∂_h Ω = {∇_h Ω};
        β_A ← β_A + h
        // 2. IDENTIFY NEWLY ZEROED VARIABLES: A ← A \ {i}
        // 3. IDENTIFY NEW NON-ZERO VARIABLES: select a candidate i ∈ A^c
          i ← argmax_{j ∈ A^c} v_j, where v_j = min_{ν ∈ ∂_β Ω} |∂f(β)/∂β_j + λν|
      end
  60. Active set algorithm: orange belt
      Same loop, with an explicit stopping rule in step 3:
      // 3. IDENTIFY NEW NON-ZERO VARIABLES: select the candidate i ∈ A^c that most violates the optimality conditions
        i ← argmax_{j ∈ A^c} v_j, where v_j = min_{ν ∈ ∂_β Ω} |∂f(β)/∂β_j + λν|
      if such an i exists then A ← A ∪ {i}
      else stop and return β, which is optimal
  61. Active set algorithm: green belt
      Same loop, with step 2 also made explicit:
      // 2. IDENTIFY NEWLY ZEROED VARIABLES
      while ∃ i ∈ A : β_i = 0 and min_{ν ∈ ∂_β Ω} |∂f(β)/∂β_i + λν| = 0 do
        A ← A \ {i}
      end
      // 3. Select the candidate i ∈ A^c such that an infinitesimal change of β_i provides the highest reduction of L
        i ← argmax_{j ∈ A^c} v_j, with v_j as above
      if v_i ≠ 0 then A ← A ∪ {i}
      else stop and return β, which is optimal (an R transcription follows)
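To make the three steps concrete, here is an illustrative active-set solver for the plain single-task squared-loss LASSO; it follows the loop above but is a didactic sketch, not the SIMoNe implementation:

```r
## Active-set LASSO for min_beta 1/(2n)||y - X beta||^2 + lambda ||beta||_1.
active.set.lasso <- function(X, y, lambda, tol = 1e-8, maxit = 1000) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p); A <- integer(0)
  soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
  for (it in 1:maxit) {
    ## 1. master problem: coordinate descent restricted to A
    ##    (fixed number of sweeps, for simplicity of the sketch)
    if (length(A)) for (sweep in 1:100) {
      for (j in A) {
        r <- y - X %*% beta + X[, j] * beta[j]      # partial residual
        beta[j] <- soft(crossprod(X[, j], r) / n, lambda) /
                   (crossprod(X[, j]) / n)
      }
    }
    ## 2. identify newly zeroed variables
    A <- A[beta[A] != 0]
    ## 3. KKT scan: add the worst violator outside A, stop if none
    grad <- as.vector(crossprod(X, y - X %*% beta)) / n
    v <- pmax(abs(grad) - lambda, 0); v[A] <- 0
    if (max(v) < tol) break                          # beta is optimal
    A <- c(A, which.max(v))
  }
  beta
}
```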
  62. Outline (next section: Model selection)
  63. Tuning the penalty parameter
      What does the literature say?
      Theory-based penalty choices:
      1. optimal order of the penalty in the $p \gg n$ framework, of order $\sqrt{n \log p}$ (Bunea et al., 2007; Bickel et al., 2009);
      2. control of the probability of connecting two distinct connectivity sets (Meinshausen et al., 2006; Banerjee et al., 2008; Ambroise et al., 2009);
      but these choices are practically much too conservative.
      Cross-validation: optimal in terms of prediction, not in terms of selection; problematic with small samples, since it changes the sparsity constraint due to sample size.
  64. Tuning the penalty parameter: BIC / AIC
      Theorem (Zou et al., 2008): $\widehat{\mathrm{df}}(\hat\beta^{\mathrm{lasso}}_\lambda) = \|\hat\beta^{\mathrm{lasso}}_\lambda\|_0$.
      Straightforward extensions to the graphical framework:
      $\mathrm{BIC}(\lambda) = L(\hat\Theta_\lambda; \mathbf{X}) - \mathrm{df}(\hat\Theta_\lambda)\, \frac{\log n}{2}, \qquad \mathrm{AIC}(\lambda) = L(\hat\Theta_\lambda; \mathbf{X}) - \mathrm{df}(\hat\Theta_\lambda).$
      These rely on asymptotic approximations, but remain relevant for small data sets (sketch below).
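A sketch of BIC selection along a penalty path, reusing the single-task glasso fit from above; $\lambda$-independent constants of the log-likelihood are dropped:

```r
## BIC along a lambda grid for the single-task graphical LASSO; df is the
## number of nonzero upper-diagonal entries of Theta.hat (Zou et al., 2008).
bic.path <- function(X, lambdas) {
  n <- nrow(X)
  S <- crossprod(scale(X, scale = FALSE)) / n
  sapply(lambdas, function(lam) {
    Theta <- glasso::glasso(S, rho = lam)$wi
    loglik <- n / 2 * (as.numeric(determinant(Theta)$modulus) - sum(S * Theta))
    df <- sum(abs(Theta[upper.tri(Theta)]) > 1e-8)
    loglik - df * log(n) / 2
  })
}
## lambdas <- seq(0.01, 0.5, length.out = 25)
## lambda.hat <- lambdas[which.max(bic.path(X, lambdas))]
```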
  65. Outline (next section: Experiments)
  66. Data Generation
      We set the number of nodes $p$, the number of edges $K$, and the number of examples $n$.
      Process (an R transcription follows):
      1. generate a random adjacency matrix with $2K$ off-diagonal terms;
      2. compute the normalized Laplacian $L$;
      3. generate a symmetric matrix of random signs $R$;
      4. compute the concentration matrix $K_{ij} = L_{ij} R_{ij}$;
      5. compute $\Sigma$ by pseudo-inversion of $K$;
      6. generate correlated Gaussian data $\sim \mathcal{N}(0, \Sigma)$.
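A transcription of the six steps; it assumes the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$ and uses the standard MASS package for the pseudo-inverse and the Gaussian sampler:

```r
## Simulation recipe of the talk, transcribed step by step.
simulate.ggm <- function(p, K, n) {
  A <- matrix(0, p, p)                                # 1. random adjacency
  A[sample(which(upper.tri(A)), K)] <- 1
  A <- A + t(A)                                       #    2K off-diagonal terms
  d <- pmax(rowSums(A), 1)                            # 2. normalized Laplacian
  L <- diag(p) - diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))
  R <- matrix(sample(c(-1, 1), p * p, TRUE), p, p)    # 3. random signs
  R[lower.tri(R)] <- t(R)[lower.tri(R)]; diag(R) <- 1 #    symmetric, +1 diagonal
  Conc <- L * R                                       # 4. concentration matrix
  Sigma <- MASS::ginv(Conc)                           # 5. pseudo-inversion
  X <- MASS::mvrnorm(n, rep(0, p), Sigma)             # 6. Gaussian sample
  list(graph = A, Sigma = Sigma, X = X)
}
```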
  67. Simulating Related Tasks
      Generate:
      1. an "ancestor" with $p = 20$ nodes and $K = 20$ edges;
      2. $T = 4$ children by adding and deleting $\delta$ edges;
      3. $T = 4$ Gaussian samples.
      (Figure: ancestor and children with $\delta = 2$ perturbations; a perturbation sketch follows.)
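Step 2 can be sketched as a small symmetric perturbation of the ancestor's adjacency matrix (the helper name perturb is ours):

```r
## Child graph: delete 'delta' existing edges and add 'delta' new ones.
perturb <- function(A, delta) {
  up  <- upper.tri(A)
  on  <- which(up & A == 1)
  off <- which(up & A == 0)
  A[sample(on,  delta)] <- 0
  A[sample(off, delta)] <- 1
  A[lower.tri(A)] <- t(A)[lower.tri(A)]   # restore symmetry
  A
}
## children <- replicate(4, perturb(ancestor$graph, 2), simplify = FALSE)
```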
  70. Simulation results: criteria
      Precision/recall curve: precision = TP / (TP + FP) against recall = TP / P (power).
      ROC curve: recall = TP / P (power) against fallout = FP / N (type I error).
      (All criteria are computed edge-wise; a sketch follows.)
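A direct transcription of these edge-wise criteria:

```r
## Edge-wise precision, recall and fallout of an estimated adjacency
## matrix Ahat against the true adjacency A (upper triangles only).
edge.scores <- function(Ahat, A) {
  up <- upper.tri(A)
  e <- Ahat[up] != 0                      # estimated edges
  g <- A[up]    != 0                      # true edges
  c(precision = sum(e & g) / max(sum(e), 1),
    recall    = sum(e & g) / sum(g),
    fallout   = sum(e & !g) / sum(!g))
}
```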
  71-79. Simulation results (figure slides)
      Each slide shows a precision-recall curve (precision vs. recall) and a ROC curve (recall vs. fallout) along the penalty path $\lambda_{\max} \to 0$, comparing CoopLasso, GroupLasso, Intertwined, Independent and Pooled:
      - large sample size: $n_t = 100$, with $\delta = 1, 3, 5$;
      - medium sample size: $n_t = 50$, with $\delta = 1, 3, 5$;
      - small sample size: $n_t = 25$, with $\delta = 1, 3, 5$.
  80. Breast Cancer
      Prediction of the outcome of preoperative chemotherapy. Two types of patients: response is classified either as
      1. pathologic complete response (PCR), or
      2. residual disease (not PCR).
      Gene expression data: 133 patients (99 not PCR, 34 PCR); 26 genes identified by differential analysis.
  81. Package demo: cancer data with the Coop-LASSO (R package SIMoNe).
  82. Conclusions
      To sum up:
      - clarified the links between neighborhood selection and the graphical LASSO;
      - identified the relevance of multi-task learning in network inference;
      - first methods for inferring multiple Gaussian graphical models;
      - consistent improvements upon the available baseline solutions;
      - available in the R package SIMoNe.
      Perspectives: explore model-selection capabilities; other applications of the cooperative-LASSO; theoretical analysis (uniqueness, selection consistency).
