
- 1. An introduction to biological network inference via Gaussian Graphical Models. Christophe Ambroise, Julien Chiquet. Statistique et Génome, CNRS & Université d'Évry Val d'Essonne. São Paulo School of Advanced Science, October 2012. http://stat.genopole.cnrs.fr/~cambroise Network inference 1
- 2. Outline: Introduction (Motivations; Background on omics; Modeling issue). Modeling tools (Statistical dependence; Graphical models; Covariance selection and Gaussian vectors). Gaussian Graphical Models for genomic data (Steady-state data; Time-course data). Statistical inference (Penalized likelihood approach; Inducing sparsity and regularization; The Lasso). Application in post-genomics (Modeling time-course data; Illustrations; Multitask learning).
- 5. Real networks. Many scientific fields: the World Wide Web; biology, sociology, physics. Nature of the data under study: interactions between N objects, hence O(N²) possible interactions. Network topology: describes the way nodes interact (structure/function). Sample of 250 blogs (nodes) with their link relationships (edges) from the French political blogosphere.
- 6. What the reconstructed networks are expected to be (1): regulatory networks. E. coli regulatory network: relationships between genes and their products; inhibition/activation; impossible to recover at large scale; always incomplete (and presumably, in part, wrongly assumed).
- 7. What the reconstructed networks are expected to be (2): regulatory networks. Figure: regulatory network identified in mammalian cells: highly structured.
- 8. What the reconstructed networks are expected to be (3): protein-protein interaction networks. Figure: yeast PPI network: do not be misled by the representation, trust the statistics!
- 12. What are we looking at? Central dogma of molecular biology: DNA → mRNA (transcription) → proteins (translation), with DNA replication. Proteins are the building blocks of any cellular functionality, are encoded by genes, and interact (at the protein and gene level: regulations).
- 13. What questions in functional genomics? (1) Various levels/scales of study: genome (sequence analysis), transcriptome (gene expression levels), proteome (protein functions and interactions). Questions: 1. Biological understanding: mechanisms of diseases; gene/protein functions and interactions. 2. Medical/clinical care: diagnosis (type of disease), prognosis (survival analysis), treatment (prediction of response).
- 15. What questions in functional genomics? (2) Central dogma of molecular biology: DNA → mRNA (transcription) → proteins (translation), with DNA replication. Basic biostatistical issues: selecting some genes of interest (biomarkers); looking for interactions between them (pathway analysis).
- 16. How is this measured? (1) Microarray technology: parallel measurement of many biological features (signal processing, pretreatment). Matrix of features with n ≪ p: X is the n × p matrix whose entry x_i^j is the expression level of probe j for individual i; the expression levels of p probes are simultaneously monitored for n individuals.
- 17. How is this measured? (2) Next Generation Sequencing: parallel measurement of even more biological features (assembly, pretreatment). Matrix of features with n ≪ p: X is the n × p matrix whose entry k_i^j is the expression count of feature j for individual i, extracted from short repeated sequences and monitored for n individuals.
- 18. What questions are we dealing with? (1) Supervised canonical example at the gene level: differential analysis. Leukemia (Golub data, thanks to P. Neuvial): AML (acute myeloblastic leukemia), n1 = 11; ALL (acute lymphoblastic leukemia), n2 = 27; an (n1 + n2)-vector of outcomes giving each patient's tumor type. Supervised classification: find genes with significantly different expression levels between groups (biomarkers); prediction purpose.
- 19. What questions are we dealing with? (2) Unsupervised canonical example at the gene level: hierarchical clustering. Same kind of data; no outcome is considered. (Unsupervised) clustering: find groups of genes which show statistical dependencies/commonalities, hoping for biological interactions; exploratory purpose, functional understanding. Can we do better than that? And how do genes interact anyway?
- 22. The problem at hand. Inference from ≈ 10s/100s of microarray/sequencing experiments over ≈ 1000s of probes ("genes"). Modeling questions prior to inference: 1. What do the nodes represent? (the easiest one) 2. What is/should be the meaning of an edge? (the toughest one) Biologically? Statistically?
- 26. More questions/issues. Modelling: Is the network dynamic or static? How have the data been generated? (time-course/steady state) Are the edges oriented or not? (causality) What do the edges represent for my particular problem? Statistical challenges: (ultra) high dimensionality; noisy data, lack of reproducibility; heterogeneity of the data (many techniques, various signals).
- 29. Canonical model settings: biological microarrays in comparable conditions. Notations: 1. a set P = {1, ..., p} of p variables: these are typically the genes (could be proteins); 2. a sample N = {1, ..., n} of individuals associated with the variables: these are typically the microarrays (could be sequence counts). Basic statistical model. This can be viewed as: a random vector X in R^p, whose j-th entry is the j-th variable; an n-size sample (X_1, ..., X_n), such that X_i is the i-th microarray; the copies may be independent and identically distributed (steady-state data) or dependent in a certain way (time-course data); assume a parametric probability distribution for X (Gaussian).
- 31. Canonical model settings (continued). Stacking the observations (X_1, ..., X_n) row-wise, we meet the usual individual/variable table: X is the n × p matrix whose entry x_i^j is the value of variable j for individual i; inference proceeds from X.
- 33. Modeling relationships between variables (1): independence. Definition (independence of events): two events A and B are independent if and only if P(A, B) = P(A)P(B), denoted A ⊥ B. Equivalently: A ⊥ B ⇔ P(A|B) = P(A); A ⊥ B ⇔ P(A|B) = P(A|B^c). Example (class vs party). Joint probability table P(class, party): working: Labour 0.42, Tory 0.28; bourgeoisie: Labour 0.06, Tory 0.24. Conditional probability table P(party | class): working: Labour 0.60, Tory 0.40; bourgeoisie: Labour 0.20, Tory 0.80.
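The class/party tables above can be checked numerically; a minimal sketch with numpy, using the joint probabilities from the slide:

```python
import numpy as np

# Joint probability table from the slide: rows = class, columns = party.
joint = np.array([[0.42, 0.28],    # working:     Labour, Tory
                  [0.06, 0.24]])   # bourgeoisie: Labour, Tory

p_class = joint.sum(axis=1)        # P(class) = [0.70, 0.30]
p_party = joint.sum(axis=0)        # P(party) = [0.48, 0.52]

# Independence would require P(class, party) = P(class) * P(party):
print(np.allclose(joint, np.outer(p_class, p_party)))   # False: dependent

# Conditional table P(party | class): divide each row by P(class).
conditional = joint / p_class[:, None]
print(conditional)   # [[0.6, 0.4], [0.2, 0.8]]: the right-hand table of the slide
```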
- 35. Modeling relationships between variables (2): conditional independence. Generalizing independence to more than two events requires strong assumptions (mutual independence); conditional independence is easier to handle. Definition (conditional independence of events): two events A and B are conditionally independent given C if and only if P(A, B | C) = P(A|C)P(B|C), denoted A ⊥ B | C. Example (does IQ depend on weight?): consider the events A = "having a low IQ" and B = "having a low weight".
- 37. Modeling relationships between variables (2), continued. Naively estimating P(A, B), P(A) and P(B) in a sample would lead to P(A, B) ≠ P(A)P(B): low IQ and low weight appear dependent.
- 38. Modeling relationships between variables (2), continued. But in fact, introducing C = "having a given age", P(A, B | C) = P(A|C)P(B|C): given age, IQ and weight are conditionally independent.
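The IQ/weight example can be simulated; a minimal sketch assuming hypothetical linear effects of age on both the raw test score and weight (all coefficients invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(5, 15, n)                   # children's ages (hypothetical)
iq_raw = 2.0 * age + rng.normal(0, 3, n)      # raw test score grows with age
weight = 3.0 * age + rng.normal(0, 4, n)      # weight also grows with age

# Marginally, low score and low weight go together...
print(np.corrcoef(iq_raw, weight)[0, 1])      # strongly positive

# ...but the partial correlation given age is (about) zero:
r_iw = np.corrcoef(iq_raw, weight)[0, 1]
r_ia = np.corrcoef(iq_raw, age)[0, 1]
r_wa = np.corrcoef(weight, age)[0, 1]
partial = (r_iw - r_ia * r_wa) / np.sqrt((1 - r_ia**2) * (1 - r_wa**2))
print(partial)                                # close to 0
```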
- 39. Independence of random vectors (1): independence and conditional independence generalize naturally. Definition: consider three random vectors X, Y, Z with marginal densities f_X, f_Y, f_Z and joint densities f_XY, f_XYZ. Then X and Y are independent iff f_XY(x, y) = f_X(x) f_Y(y); X and Y are conditionally independent given Z (for every z with f_Z(z) > 0) iff f_XY|Z(x, y; z) = f_X|Z(x; z) f_Y|Z(y; z). Proposition (factorization criterion): X and Y are independent (resp. conditionally independent given Z) iff there exist functions g and h such that, for all x and y: 1. f_XY(x, y) = g(x) h(y); 2. f_XYZ(x, y, z) = g(x, z) h(y, z) for all z with f_Z(z) > 0.
- 41. Independence of random vectors (2): independence vs conditional independence. Figure: densities illustrating mutual independence (f = f_X f_Y f_Z), conditional independence (f with X ⊥ Y | Z, X ⊥ Z | Y, or Y ⊥ Z | X), and full dependence (unrestricted f_XYZ).
- 43. Definition. A graphical model gives a graphical (intuitive) representation of the dependence structure of a probability distribution: graphical structure ↔ random variables/random vector. It links: 1. a random vector (or a set of random variables) X = {X_1, ..., X_p} with distribution P; 2. a graph G = (P, E) where P = {1, ..., p} is the set of nodes, one per variable, and E is a set of edges describing the dependence relationships of X ∼ P.
- 45. Conditional independence graphs: definition. The conditional independence graph of a random vector X is the undirected graph G = (P, E) with node set P = {1, ..., p}, where (i, j) ∈ E ⇔ X_i is not conditionally independent of X_j given the remaining variables X_{P∖{i,j}}. Property: it satisfies the Markov property: any two subsets of variables separated by a third are independent conditionally on the variables in the third set.
- 47. Conditional independence graphs: an example. Let X_1, X_2, X_3, X_4 be four random variables with joint probability density function f_X(x) = exp(u + x_1 + x_1 x_2 + x_2 x_3 x_4), with u a given constant. Apply the factorization property: f_X(x) = exp(u) · exp(x_1 + x_1 x_2) · exp(x_2 x_3 x_4). Graphical representation: G = (P, E) with P = {1, 2, 3, 4} and, collecting an edge for every pair of variables sharing a factor, E = {(1, 2), (2, 3), (2, 4), (3, 4)}.
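The edge set can be read off mechanically from the factorization: each factor defines a clique, and the graph is the union of those cliques. A small sketch:

```python
from itertools import combinations

# f(x) ∝ exp(x1) · exp(x1 x2) · exp(x2 x3 x4): each factor couples a set of
# variables; the conditional independence graph connects every pair that
# appears together in some factor (a clique per factor).
factors = [{1}, {1, 2}, {2, 3, 4}]

edges = set()
for clique in factors:
    edges.update(combinations(sorted(clique), 2))

print(sorted(edges))   # [(1, 2), (2, 3), (2, 4), (3, 4)]
```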
- 51. Directed acyclic conditional independence graphs (DAGs): motivation. Limitation of undirected graphs: sometimes an ordering of the variables is known, which allows one to break the symmetry of the graphical representation and introduce, in some sense, "causality" into the modeling. Consequences: each element of E has to be directed, and there is no directed cycle in the graph. We thus deal with a directed acyclic graph (DAG).
- 52. DAGs: definition. Definition (ordering): an ordering ≺ between variables {1, ..., p} is a relation such that: i) for every couple (i, j), either i ≺ j or j ≺ i; ii) ≺ is transitive; iii) ≺ is not reflexive. A natural ordering is obtained when variables are observed across time; a natural conditioning set for a pair (i, j) with i ≺ j is the past of j, denoted P(j) = {k : k ≺ j}. Definition (DAG): the directed conditional dependence graph of X is the directed graph G = (P, E) where, for (i, j) such that i ≺ j, (i, j) ∈ E ⇔ X_j is not conditionally independent of X_i given X_{P(j)∖{i,j}}.
- 54. DAGs: factorization and Markov property. Another view uses parent/descendant relationships to deal with the ordering of the nodes. The factorization property: f_X(x) = ∏_{k=1}^p f_{X_k | pa_k}(x_k | pa_k), where pa_k denotes the parents of node k.
- 55. DAGs: an example. For the seven-node DAG with x_4 depending on x_1, x_2, x_3; x_5 on x_1, x_3; x_6 on x_4; and x_7 on x_4, x_5, the factorization reads f_X(x) = f_{X_1} · f_{X_2} · f_{X_3} · f_{X_4|X_1,X_2,X_3} · f_{X_5|X_1,X_3} · f_{X_6|X_4} · f_{X_7|X_4,X_5}.
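The DAG factorization can be sketched numerically: with hypothetical (randomly drawn) binary conditional probability tables for the seven-node example, the product of the conditionals is a valid joint distribution:

```python
import numpy as np

# Parent sets for the seven-node DAG of the slide (node -> parents).
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

# Hypothetical binary CPTs: p[k][parent_values] = P(X_k = 1 | parents).
rng = np.random.default_rng(1)
p = {k: {vals: rng.uniform(0.2, 0.8) for vals in np.ndindex(*(2,) * len(pa))}
     for k, pa in parents.items()}

def joint(x):
    """P(x_1, ..., x_7) as the product of the conditionals P(x_k | pa_k)."""
    prob = 1.0
    for k, pa in parents.items():
        pk1 = p[k][tuple(x[j - 1] for j in pa)]
        prob *= pk1 if x[k - 1] == 1 else 1.0 - pk1
    return prob

# The factorized joint is a proper distribution: it sums to 1 over all 2^7 states.
total = sum(joint(x) for x in np.ndindex(*(2,) * 7))
print(round(total, 10))   # 1.0
```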
- 63. DAGs: Markov property. Local Markov property: for any Y among the non-descendants of k (i.e. Y not in de_k, the descendants of k), X_k ⊥ Y | pa_k; that is, X_k is conditionally independent of its non-descendants given its parents.
- 64. Local Markov property: example. In the DAG where x_4 has parents {x_2, x_3} and x_5 has parent x_3, check that x_4 ⊥ x_5 | {x_2, x_3} using the factorization property: P(x_4 | x_5, x_2, x_3) = P(x_2, x_3, x_4, x_5) / P(x_2, x_3, x_5) = [P(x_2) P(x_3) P(x_4 | x_2, x_3) P(x_5 | x_3)] / [P(x_2) P(x_3) P(x_5 | x_3)] = P(x_4 | x_2, x_3).
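The same check can be run numerically on hypothetical binary conditional probability tables (all values invented for illustration): P(x4 | x2, x3, x5) does not depend on x5:

```python
import itertools

# Hypothetical CPTs for the slide's graph: x2, x3 are roots,
# x4 has parents {x2, x3}, x5 has parent {x3}.
p2, p3 = 0.6, 0.3
p4 = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.9}   # P(x4 = 1 | x2, x3)
p5 = {0: 0.2, 1: 0.8}                                       # P(x5 = 1 | x3)

def joint(x2, x3, x4, x5):
    return ((p2 if x2 else 1 - p2) * (p3 if x3 else 1 - p3)
            * (p4[(x2, x3)] if x4 else 1 - p4[(x2, x3)])
            * (p5[x3] if x5 else 1 - p5[x3]))

# P(x4 = 1 | x2, x3, x5) comes out identical for x5 = 0 and x5 = 1:
for x2, x3 in itertools.product((0, 1), repeat=2):
    conds = [joint(x2, x3, 1, x5) / (joint(x2, x3, 1, x5) + joint(x2, x3, 0, x5))
             for x5 in (0, 1)]
    print(x2, x3, conds)   # the two conditionals coincide: x4 ⊥ x5 given x2, x3
```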
- 67. Modeling the genomic data: Gaussian assumption. The data: X is the n × p matrix of expression levels x_i^j (individuals in rows, variables in columns), from which inference proceeds. Assuming f_X multivariate Gaussian greatly simplifies the inference: it naturally links independence and conditional independence to covariance and partial covariance, and gives a straightforward interpretation to the graphical modeling previously considered.
- 69. Start gently with the univariate Gaussian distribution. The Gaussian distribution is the natural model for the expression level of a gene (noisy data). We write X ∼ N(μ, σ²), so that E X = μ, Var X = σ², f_X(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}, and log f_X(x) = −log √(2πσ²) − (x − μ)²/(2σ²). This is useless for modeling the joint distribution of expression levels for a whole set of genes.
- 71. One step forward: the bivariate Gaussian distribution. We need the concepts of covariance and correlation. Let X, Y be two real random variables. Definitions: cov(X, Y) = E[(X − E X)(Y − E Y)] = E(XY) − E(X)E(Y); ρ_XY = cor(X, Y) = cov(X, Y) / √(Var(X) · Var(Y)). Proposition: cov(X, X) = Var(X) = E[(X − E X)²]; cov(X + Y, Z) = cov(X, Z) + cov(Y, Z); Var(X + Y) = Var(X) + Var(Y) + 2 cov(X, Y); X ⊥ Y ⇒ cov(X, Y) = 0; X ⊥ Y ⇔ cov(X, Y) = 0 when (X, Y) is jointly Gaussian.
- 73. The bivariate Gaussian distribution. f_XY(x, y) = (1/(2π √(det Σ))) exp{−½ (x − μ₁, y − μ₂) Σ⁻¹ (x − μ₁, y − μ₂)ᵀ}, where Σ is the variance/covariance matrix, symmetric and positive definite: Σ = [[Var(X), cov(X, Y)], [cov(X, Y), Var(Y)]]. If standardized, Σ = [[1, ρ_XY], [ρ_XY, 1]] and f_XY(x, y) = (1/(2π √(1 − ρ²_XY))) exp{−(x² + y² − 2ρ_XY xy)/(2(1 − ρ²_XY))}, where ρ_XY is the correlation between X and Y and describes the interaction between them.
- 75. The bivariate Gaussian distribution: the covariance matrix. Let X ∼ N(0, Σ) with unit variances. With ρ_XY = 0, Σ is the identity and the 2-D density is isotropic; with ρ_XY = 0.9, Σ = [[1, 0.9], [0.9, 1]] and the density concentrates along the diagonal. The shape of the 2-D distribution evolves accordingly.
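A quick empirical check of the role of Σ, sampling from the two covariance matrices above:

```python
import numpy as np

rng = np.random.default_rng(42)
for rho in (0.0, 0.9):
    sigma = np.array([[1.0, rho], [rho, 1.0]])   # unit variances, correlation rho
    sample = rng.multivariate_normal([0.0, 0.0], sigma, size=200_000)
    # The empirical correlation recovers the off-diagonal entry of Sigma.
    print(rho, round(np.corrcoef(sample.T)[0, 1], 3))
```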
- 77. Full generalization: the multivariate Gaussian vector. We now need partial covariance and partial correlation. Let X, Y, Z be real random variables. Definitions: cov(X, Y | Z) = cov(X, Y) − cov(X, Z) cov(Y, Z) / Var(Z); ρ_XY|Z = (ρ_XY − ρ_XZ ρ_YZ) / (√(1 − ρ²_XZ) √(1 − ρ²_YZ)). These give the interaction between X and Y once the effect of Z is removed. Proposition: when X, Y, Z are jointly Gaussian, cov(X, Y | Z) = 0 ⇔ cor(X, Y | Z) = 0 ⇔ X ⊥ Y | Z.
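The partial correlation formula can be cross-checked against the "regress Z out and correlate the residuals" view; a sketch with simulated data in which X and Y are conditionally independent given Z by construction (coefficients are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)    # x and y are both driven by z,
y = -0.5 * z + rng.normal(size=n)   # but conditionally independent given z
x, y, z = x - x.mean(), y - y.mean(), z - z.mean()

# Partial correlation from the slide's formula.
r = np.corrcoef([x, y, z])
rho_xy_z = (r[0, 1] - r[0, 2] * r[1, 2]) / np.sqrt((1 - r[0, 2]**2) * (1 - r[1, 2]**2))

# Same quantity via residuals: project z out of x and y, then correlate.
def residual(v):
    return v - z * (v @ z) / (z @ z)

rho_resid = np.corrcoef(residual(x), residual(y))[0, 1]
print(rho_xy_z, rho_resid)   # both close to 0, and equal to each other
```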
- 79. The multivariate Gaussian distribution. This allows modeling the expression levels of a whole set of genes P. Gaussian vector: let X ∼ N(μ, Σ), and consider the block decomposition induced by any partition {a, b} of P: Σ = [[Σ_aa, Σ_ab], [Σ_ba, Σ_bb]]. Then: 1. X_a is Gaussian with distribution N(μ_a, Σ_aa); 2. X_a | X_b = x is Gaussian with distribution N(μ_{a|b}, Σ_{a|b}), whose parameters are known in closed form.
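The conditional distribution N(μ_{a|b}, Σ_{a|b}) is given by the standard Schur-complement formulas μ_{a|b} = μ_a + Σ_ab Σ_bb⁻¹ (x_b − μ_b) and Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba; a sketch with a hypothetical 4-dimensional Σ:

```python
import numpy as np

mu = np.array([1.0, 2.0, 0.0, -1.0])                  # hypothetical mean
sigma = np.array([[2.0, 0.5, 0.3, 0.0],               # hypothetical covariance
                  [0.5, 1.0, 0.2, 0.1],               # (symmetric positive definite)
                  [0.3, 0.2, 1.5, 0.4],
                  [0.0, 0.1, 0.4, 1.0]])
a, b = [0, 1], [2, 3]                                 # partition {a, b} of P
x_b = np.array([0.5, -0.5])                           # observed value of X_b

S_aa = sigma[np.ix_(a, a)]
S_ab = sigma[np.ix_(a, b)]
S_bb = sigma[np.ix_(b, b)]

# Conditioning formulas (Schur complement of Sigma_bb):
mu_cond = mu[a] + S_ab @ np.linalg.solve(S_bb, x_b - mu[b])
sigma_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)
print(mu_cond)
print(sigma_cond)
```

As a consistency check, Σ_{a|b} equals the inverse of the a-block of the concentration matrix Θ = Σ⁻¹, which foreshadows the GGM interpretation of the next slides.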
- 82. Steady-state data: scheme. ≈ 10s of microarrays, ≈ 1000s of probes (“genes”) — which interactions? → inference.
- 83. Modeling the underlying distribution (1). Model for data generation: a microarray can be represented as a multivariate vector X = (X1, …, Xp) ∈ R^p; consider n biological replicates in the same condition, which form a usual n-sample (X1, …, Xn). Consequence — a Gaussian Graphical Model: X ~ N(μ, Σ), with X1, …, Xn i.i.d. copies of X, and Θ = (θij)i,j∈P = Σ⁻¹ is called the concentration matrix.
- 85. Modeling the underlying distribution (2): interpretation as a GGM. Multivariate Gaussian vector and covariance selection: −θij / √(θii θjj) = cor(Xi, Xj | XP\{i,j}) = ρij|P\{i,j}. Graphical interpretation: the matrix Θ = (θij)i,j∈P encodes the network G we are looking for — conditional dependency between Xi and Xj, or equivalently a non-null partial correlation between Xi and Xj, if and only if θij ≠ 0.
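This link between the concentration matrix and partial correlations can be sketched in a few lines of Python (the 3×3 covariance below is made up for illustration):

```python
import numpy as np

# Partial correlations read off the concentration matrix Theta = inv(Sigma):
#   rho_{ij|rest} = -theta_ij / sqrt(theta_ii * theta_jj).
# Illustrative covariance: a "chain" X1 - X2 - X3 with rho_13 = rho_12 * rho_23,
# so X1 and X3 are independent given X2 and theta_13 should be zero.
Sigma = np.array([[1.00, 0.50, 0.25],
                  [0.50, 1.00, 0.50],
                  [0.25, 0.50, 1.00]])
Theta = np.linalg.inv(Sigma)

d = np.sqrt(np.diag(Theta))
partial_corr = -Theta / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

print(np.round(partial_corr, 3))   # entry (1,3) is 0: no edge between X1 and X3
```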
- 88. Time-course data: scheme. Time points t0, t1, …, tn: ≈ 10s of microarrays over time, ≈ 1000s of probes (“genes”) — which interactions? → inference.
- 89. Modeling time-course data with DAGs. Collecting gene expression: 1. follow-up of one single experiment/individual; 2. time points close enough to ensure dependency between consecutive measurements and homogeneity of the Markov process. [Diagram: the network G over X1, …, X5 unfolds into a bipartite graph between the variables at time t and at time t + 1.]
- 91. Modeling time-course data with DAGs (continued). [Diagram: the same graph G repeated between consecutive time points of the series X¹ → X² → … → Xⁿ.]
- 92. DAG: remark. [Diagram: a graph with a cycle among X1, …, X5 versus its unrolled two-slice version.] “Argh, there is a cycle!” — yet the unrolled dynamic graph is indeed a DAG: this overcomes the rather restrictive acyclicity requirement.
- 93. Modeling the underlying distribution (1). Model for data generation: a microarray can be represented as a multivariate vector X = (X1, …, Xp) ∈ R^p, generated through a first-order vector autoregressive process VAR(1): X^t = Θ X^{t−1} + b + ε^t, t ∈ [1, n], where ε^t is a white noise ensuring the Markov property and X⁰ ~ N(0, Σ0). Consequence — a Gaussian Graphical Model: each X^t | X^{t−1} ~ N(ΘX^{t−1}, Σ); or, equivalently, Xj^t | X^{t−1} ~ N(Θj X^{t−1}, Σ), where Σ is known and Θj is the j-th row of Θ.
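A minimal simulation of this VAR(1) model in Python (the particular coefficient values, noise level and sample size are made up; the zero pattern of Θ mirrors the 3×3 example used in the slides):

```python
import numpy as np

# Simulate X^t = Theta @ X^{t-1} + b + eps_t, with a sparse Theta whose support
# (the non-zero theta_ij) is the directed network we would like to recover.
rng = np.random.default_rng(2)
p, n = 3, 500
Theta = np.array([[0.5,  0.3, 0.0],
                  [0.4,  0.0, 0.0],
                  [0.0, -0.3, 0.0]])
b = np.zeros(p)

X = np.zeros((n + 1, p))           # X[0] is the initial state X^0
for t in range(1, n + 1):
    X[t] = Theta @ X[t - 1] + b + 0.1 * rng.normal(size=p)
```

Θ here has spectral radius below 1, so the simulated chain is stable.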
- 95. Modeling the underlying distribution (2). In matrix form: (X^t_1, …, X^t_i, …, X^t_p)ᵀ = Θ (X^{t−1}_1, …, X^{t−1}_j, …, X^{t−1}_p)ᵀ + (b1, …, bi, …, bp)ᵀ + (ε^t_1, …, ε^t_i, …, ε^t_p)ᵀ, with Θ = (θij) the p × p matrix of autoregressive coefficients. Example: Θ = [θ11 θ12 0; θ21 0 0; 0 θ32 0].
- 96. Modeling the underlying distribution (3): interpretation as a GGM. The VAR(1) as a covariance selection model: θij = cov(Xi^t, Xj^{t−1} | X^{t−1}_{P\j}) / var(Xj^{t−1} | X^{t−1}_{P\j}). Graphical interpretation: the matrix Θ = (θij)i,j∈P encodes the network G we are looking for — conditional dependency between Xj^{t−1} and Xi^t, or equivalently a non-null partial correlation between Xj^{t−1} and Xi^t, if and only if θij ≠ 0.
- 100. The graphical models: a reminder (for goldfish-like memories). Assumption: a microarray can be represented as a multivariate Gaussian vector X. Collecting gene expression: 1. steady-state data lead to an i.i.d. sample; 2. time-course data give a time series. Graphical interpretation: an edge between i and j if and only if there is conditional dependency between X(i) and X(j), or equivalently a non-null partial correlation between X(i) and X(j) — encoded in an unknown matrix of parameters Θ.
- 102. The graphical models: a reminder (time-course case). Same assumption; for time-course data the interpretation becomes directed: an edge j → i if and only if there is conditional dependency between Xt(i) and Xt−1(j), or equivalently a non-null partial correlation between Xt(i) and Xt−1(j) — again encoded in the unknown matrix of parameters Θ.
- 103. The maximum likelihood estimator — the natural approach in parametric statistics. Let X be a random vector with distribution defined by fX(x; Θ), where Θ are the model parameters. Maximum likelihood estimator: Θ̂ = argmaxΘ L(Θ; X), where L is the log-likelihood, a function of the parameters: L(Θ; X) = Σ_{k=1}^n log fX(xk; Θ), with xk the k-th row of X. Remarks: this is a convex optimization problem, and we just need to detect the non-zero coefficients of Θ.
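For the Gaussian case this log-likelihood has a simple closed form in the concentration matrix, L(Θ; S) = (n/2)(log det Θ − trace(SΘ)) up to constants (a standard result, assumed here rather than taken from the slide), and its unpenalized maximizer is Θ̂ = S⁻¹. A quick numerical check in Python:

```python
import numpy as np

# Gaussian log-likelihood in the concentration matrix Theta, given the
# empirical covariance S (mean removed), up to an additive constant:
#   L(Theta; S) = (n/2) * (log det(Theta) - trace(S @ Theta)).
rng = np.random.default_rng(3)
n, p = 200, 4
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False, bias=True)

def loglik(Theta):
    _, logdet = np.linalg.slogdet(Theta)
    return n / 2 * (logdet - np.trace(S @ Theta))

Theta_hat = np.linalg.inv(S)
# Moving away from inv(S) decreases the log-likelihood (strict concavity).
assert loglik(Theta_hat) > loglik(Theta_hat + 0.1 * np.eye(p))
assert loglik(Theta_hat) > loglik(0.8 * Theta_hat)
```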
- 104. The penalized likelihood approach. Let Θ be the parameters to infer (the edges). A penalized likelihood approach: Θ̂ = argmaxΘ L(Θ; X) − λ · penℓ1(Θ), where L is the model log-likelihood and penℓ1 is a penalty function tuned by λ > 0. It performs: 1. regularization (needed when n ≪ p); 2. selection (sparsity induced by the ℓ1-norm).
- 107. A geometric view of sparsity: constrained optimization. We basically want to solve a problem of the form maximize_{β1,β2} f(β1, β2; X), where f is typically a concave likelihood function. This is strictly equivalent to solving minimize_{β1,β2} g(β1, β2; X), where g = −f is convex — for instance the squared loss in the OLS.
- 108. A geometric view of sparsity: constrained optimization (continued). maximize_{β1,β2} f(β1, β2; X), subject to Ω(β1, β2) ≤ c, where Ω defines a domain that constrains β; equivalently, in penalized form, maximize_{β1,β2} f(β1, β2; X) − λ Ω(β1, β2). How shall we define Ω to induce sparsity?
- 111. A geometric view of sparsity: supporting hyperplanes. A hyperplane supports a set iff the set is contained in one of its half-spaces and has at least one point on the hyperplane. There are supporting hyperplanes at all points of a convex set: they generalize tangents.
- 117. A geometric view of sparsity: dual cones. The dual cone generalizes normals. [Figures: dual cones at smooth points versus at corners of the constraint set.] Shape of the dual cones ⇒ sparsity pattern: the corners of the constraint set attract solutions, setting coefficients exactly to zero.
- 122. The LASSO. R. Tibshirani, 1996: the Lasso — Least Absolute Shrinkage and Selection Operator. S. Chen, D. Donoho, M. Saunders, 1995: Basis Pursuit. Weisberg, 1980: forward stagewise regression. minimize_{β∈R²} ‖y − Xβ‖₂², subject to ‖β‖₁ = |β1| + |β2| ≤ c; or, equivalently, minimize_{β∈R²} ½‖y − Xβ‖₂² + λ‖β‖₁. [Figure: comparison of the solutions of ℓ1- and ℓ2-regularized problems.]
- 123. Orthogonal case and link to the OLS: shrinkage of the OLS. The Lasso has no analytical solution except in the orthogonal case: when XᵀX = I (never true for real data), β̂j^lasso = sign(β̂j^ols) · max(0, |β̂j^ols| − λ). [Figure: the Lasso soft-thresholds the OLS coefficients.]
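The soft-thresholding rule above is a one-liner; a small Python sketch (the coefficient values are made up):

```python
import numpy as np

# Soft-thresholding: the lasso estimate under an orthogonal design.
def soft_threshold(beta_ols, lam):
    return np.sign(beta_ols) * np.maximum(0.0, np.abs(beta_ols) - lam)

beta_ols = np.array([3.0, -1.5, 0.4, 0.0])
beta_lasso = soft_threshold(beta_ols, lam=1.0)
# Large coefficients are shrunk by lam; small ones (|b| <= lam) are set to 0:
# beta_lasso = [2.0, -0.5, 0.0, 0.0]
print(beta_lasso)
```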
- 124. LARS: Least Angle Regression. B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, 2004: Least Angle Regression — an efficient algorithm to compute the Lasso solutions. The LARS solution consists of a curve giving the solution for each value of λ: it constructs a piecewise-linear path of solutions starting from the null vector and moving towards the OLS estimate, at (almost) the same cost as the OLS, and it is well adapted to cross-validation (which helps choosing λ).
- 125. Example: prostate cancer I. Lasso solution path with lars:
> library(lars)
> load("prostate.rda")
> x <- scale(as.matrix(x))
> out <- lars(x, y)
> plot(out)
- 126. Example: prostate cancer II. [Figure: Lasso solution path — standardized coefficients plotted against the ℓ1-norm fraction; the variables enter the model one at a time along the path.]
- 127. Choice of the tuning parameter I. Model selection criteria: BIC(λ) = ‖y − Xβ̂λ‖₂² + log(n) · df(β̂λ) and AIC(λ) = ‖y − Xβ̂λ‖₂² + 2 · df(β̂λ), where df(β̂λ) is the number of non-zero entries of β̂λ. Cross-validation: 1. split the data into K folds; 2. use each fold in turn as the test set; 3. compute the test error on that fold; 4. average to obtain the CV estimate of the test error. λ is chosen to minimize the CV test error.
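The cross-validation recipe above can be sketched end to end in Python. Everything here is illustrative: a tiny coordinate-descent lasso stands in for lars/cv.lars, and the toy data, λ grid and K = 5 are arbitrary choices.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual correlation for coordinate j, then soft-threshold
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / n
            b[j] = np.sign(rho) * max(0.0, abs(rho) - lam) / (X[:, j] @ X[:, j] / n)
    return b

def cv_lambda(X, y, lambdas, K=5, seed=0):
    """Return the lambda minimizing the K-fold cross-validated test MSE."""
    folds = np.random.default_rng(seed).permutation(len(y)) % K
    cv_err = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            tr, te = folds != k, folds == k          # train / test split
            b = lasso_cd(X[tr], y[tr], lam)
            errs.append(np.mean((y[te] - X[te] @ b) ** 2))
        cv_err.append(np.mean(errs))                  # average over the K folds
    return lambdas[int(np.argmin(cv_err))], cv_err

# Toy data: only the first two of ten predictors matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=100)
best_lam, cv_err = cv_lambda(X, y, np.array([0.01, 0.05, 0.1, 0.5, 1.0]))
print(best_lam)
```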
- 128. Choice of the tuning parameter II. CV choice for λ:
> cv.lars(x, y, K=10)
- 129. Choice of the tuning parameter III. [Figure: cross-validated MSE along the regularization path; λ is chosen at the minimum of the curve.]
- 130. Many variations. Group-Lasso: activates the variables by groups (given by the user). Adaptive/weighted Lasso: adjusts the penalty level of each variable, according to prior knowledge or with data-driven weights. BoLasso: a bootstrapped version that removes false positives and stabilizes the estimate. Etc. — plus many theoretical results.
- 133. Problem. Time points t0, t1, …, tn: ≈ 10s of microarrays over time, ≈ 1000s of probes (“genes”) — which interactions? The main statistical issue is the high-dimensional setting.
- 134. Handling the scarcity of the data by introducing some priors. Priors should be biologically grounded: 1. few genes effectively interact (sparsity); 2. networks are organized (latent clustering). [Diagram: a sparse 13-gene network G1–G13.]
- 136. Handling the scarcity of the data (continued). [Diagram: the same network with its nodes grouped into clusters A, B and C.]
- 137. Penalized log-likelihood. Banerjee et al., JMLR 2008: Θ̂ = argmaxΘ Liid(Θ; S) − λ‖Θ‖ℓ1, efficiently solved by the graphical Lasso of Friedman et al., 2008. Ambroise, Chiquet, Matias, EJS 2009: use adaptive penalty parameters for the different coefficients, maximizing L̃iid(Θ; S) − λ‖PZ ⋆ Θ‖ℓ1, where PZ is a matrix of weights depending on the underlying clustering Z; works with the pseudo log-likelihood L̃iid (computationally efficient).
- 139. Neighborhood selection (1). Let Xi be the i-th column of X and X\i be X deprived of Xi; then Xi = X\i β + ε, where βj = −θij / θii. Meinshausen and Bühlmann, 2006: since sign(ρij|P\{i,j}) = sign(βj), select the neighbors of i with argminβ (1/n)‖Xi − X\i β‖₂² + λ‖β‖ℓ1. The sign pattern of Θ is inferred after a symmetrization step.
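A hedged Python sketch of neighborhood selection (not the authors' implementation): one lasso regression per node, then a symmetrization step (here the OR rule); the three-variable chain below is built so that the edge 1–3 should be absent.

```python
import numpy as np

# Neighborhood selection (Meinshausen & Buhlmann, 2006), minimal sketch:
# lasso-regress each X_i on all other variables, keep the non-zero
# coefficients as neighbors, then symmetrize.
def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ (y - X @ b + X[:, j] * b[j]) / n
            b[j] = np.sign(rho) * max(0.0, abs(rho) - lam) / (X[:, j] @ X[:, j] / n)
    return b

def neighborhood_selection(X, lam):
    p = X.shape[1]
    A = np.zeros((p, p), dtype=bool)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        A[i, others] = lasso_cd(X[:, others], X[:, i], lam) != 0
    return A | A.T                     # OR-rule symmetrization

# Chain X1 - X2 - X3: X1 and X3 interact only through X2, so the recovered
# graph should have edges (1,2) and (2,3) but not (1,3).
rng = np.random.default_rng(5)
n = 2000
x2 = rng.normal(size=n)
X = np.column_stack([0.7 * x2 + rng.normal(size=n),
                     x2,
                     0.7 * x2 + rng.normal(size=n)])
A = neighborhood_selection(X, lam=0.1)
print(A.astype(int))
```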
- 140. Neighborhood selection (2). The pseudo log-likelihood of the i.i.d. Gaussian sample is L̃iid(Θ; S) = Σ_{i=1}^p Σ_{k=1}^n log P(Xk(i) | Xk(P\i); Θi) = (n/2) log det(D) − (n/2) Trace(D^{−1/2} Θ S Θ D^{−1/2}) − (n/2) log(2π), where D = diag(Θ). Proposition: Θ̂pseudo = argmaxΘ L̃iid(Θ; S) − λ‖Θ‖ℓ1, the penalty bearing on the off-diagonal entries θij (i ≠ j), has the same null entries as inferred by neighborhood selection.
- 141. Structured regularization: introduce prior knowledge. Building the weights: 1. build w from prior biological information — transcription factors vs. regulatees, number of potential binding sites, KEGG pathways, Gene Ontology, …; 2. build the weight matrix from a clustering algorithm — infer the network G⁰ with w = 1 for each node, apply a clustering algorithm to G⁰, then re-infer G with w built according to the clustering Z.