
Inferring networks from multiple samples with consensus LASSO


Published on

Séminaire Parisien de Statistique, Paris, France
September 15th, 2014

Published in: Science


  1. Joint network inference with the consensual LASSO. Nathalie Villa-Vialaneix. Joint work with Matthieu Vignes, Nathalie Viguerie and Magali San Cristobal. Séminaire de Statistique Parisien, Paris, September 15, 2014. http://www.nathalievilla.org nathalie.villa@toulouse.inra.fr
  2. Outline: 1. Short overview on biological background; 2. Network inference and GGM; 3. Inference with multiple samples; 4. Simulations.
  3. DNA: DeoxyriboNucleic Acid (DNA) is the molecule that encodes the genetic instructions used in the development and functioning of all known living organisms and many viruses. It is a double helix made with only four nucleotides (Adenine, Cytosine, Thymine, Guanine): A binds with T and C with G.
  4. Transcription of DNA: transcription is the process by which a particular segment of DNA (called a gene) is copied into a single-stranded RNA (called messenger RNA, or mRNA). A is transcribed into U, T into A, C into G and G into C.
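The base-pairing rule on this slide amounts to a character-by-character substitution. A minimal sketch in base R follows; the helper name `transcribe` and the example sequence are made up for illustration and do not come from the talk.

```r
# Hypothetical helper (not from the slides): map each base of a DNA
# template strand to its complementary RNA base,
# A -> U, T -> A, C -> G, G -> C.
transcribe <- function(dna) {
  # chartr() substitutes all characters simultaneously, so the T -> A
  # replacement is not re-processed by the A -> U replacement.
  chartr("ATCG", "UAGC", toupper(dna))
}

template <- "TACGGATTC"   # made-up template strand
mrna <- transcribe(template)
print(mrna)
```

Using `chartr()` rather than chained `gsub()` calls avoids the pitfall of sequential replacement, where a base produced by one substitution would be rewritten by the next.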
  5. What is mRNA used for? mRNA then moves out of the cell nucleus and is translated into proteins, which are made of 20 different amino acids (using an alphabet: 3 letters of mRNA → 1 amino acid). Proteins are used by the cell to function.
  7. Gene expression: in a given cell, at a given time, not all genes are translated, and those that are translated are not translated at the same level. Gene expression is the process by which a given gene is transcribed and/or translated. It depends on the type of cell, the environment, ... Gene expression can be measured either by “counting” the number of copies (mRNA) of a given gene in the cell (transcriptomic data) or by “counting” the quantity of a given protein in the cell (proteomic data). Even though a given mRNA is translated into a unique protein, there is no simple relationship between transcriptomic and proteomic data.
  8. Transcriptomic data: how are transcriptomic data obtained? DNA spots associated with target genes (probes) are attached to a solid surface (array).
  9. Transcriptomic data: how are transcriptomic data obtained? RNA material extracted from the cells of interest (blood, lipid tissue, muscle, urine...) is labeled fluorescently and applied to the array. When the binding between mRNA and the probes is good, the spot becomes fluorescent. Expression of a given gene is quantified by the intensity of the fluorescent signal (read by a scanner).
  12. Typical microarray data. Data: large-scale gene expression data, a matrix $X = (X_i^j)_{i=1,\ldots,n;\ j=1,\ldots,p}$ whose $n \simeq 30$ to $50$ rows are individuals and whose $p \simeq 10^{3}$ to $10^{4}$ columns are variables (gene expressions). Typical design: two (or more) conditions (treated/control, for instance). More and more complicated designs: crossed conditions (treated/control and different breeds), longitudinal data... Typical issue: find differentially expressed genes (i.e., genes whose expression is significantly different between the conditions).
  13. Outline: 1. Short overview on biological background; 2. Network inference and GGM; 3. Inference with multiple samples; 4. Simulations.
  14. Systems biology: instead of being used to produce proteins, some genes’ expressions activate or repress other genes’ expressions ⇒ understanding the whole cascade helps to comprehend the global functioning of living organisms. (Picture taken from: Abdollahi A et al., PNAS 2007, 104:12890-12895. © 2007 by National Academy of Sciences.)
  15. Model framework. Data: large-scale gene expression data, a matrix $X = (X_i^j)_{i=1,\ldots,n;\ j=1,\ldots,p}$ whose $n \simeq 30$ to $50$ rows are individuals and whose $p \simeq 10^{3}$ to $10^{4}$ columns are variables (gene expressions). What we want to obtain: a graph/network with nodes: (selected) genes; edges: strong links between gene expressions.
  18. Advantages of network inference: 1. over raw data: it focuses on the strongest direct relationships: irrelevant or indirect relations are removed (more robust) and the data are easier to visualize and understand (track transcription relations). Expression data are analyzed all together and not by pairs (systems model). 2. over bibliographic networks: it can handle interactions with yet unknown (not annotated) genes and deal with data collected in a particular condition.
  19. Using correlations: relevance network [Butte and Kohane, 1999]. First (naive) approach: calculate correlations between expressions for all pairs of genes, threshold the smallest ones and build the network. Correlations → Thresholding → Graph.
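The three steps of the relevance network (correlations, thresholding, graph) can be sketched in base R as below. The simulated data, the gene names and the threshold of 0.7 are arbitrary assumptions chosen for illustration; they are not taken from the talk or from Butte and Kohane.

```r
set.seed(42)
n <- 50   # individuals (assumed)
p <- 6    # genes (assumed, kept tiny for readability)

# Simulated expression matrix: rows are individuals, columns are genes.
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", 1:p)))
# Make gene2 follow gene1 and gene3 follow gene2 (a small cascade).
X[, 2] <- X[, 1] + rnorm(n, sd = 0.3)
X[, 3] <- X[, 2] + rnorm(n, sd = 0.3)

# Step 1: correlations between all pairs of genes.
C <- cor(X)

# Step 2: thresholding (0.7 is an arbitrary illustrative cut-off).
threshold <- 0.7
A <- (abs(C) > threshold) * 1
diag(A) <- 0              # no self-loops

# Step 3: graph, here as a list of edges (upper triangle only).
edges <- which(A == 1 & upper.tri(A), arr.ind = TRUE)
print(data.frame(from = rownames(A)[edges[, "row"]],
                 to   = colnames(A)[edges[, "col"]]))
```

In this toy cascade, gene1 and gene3 are linked only through gene2, yet their marginal correlation is high, so a plain threshold on correlations will typically connect them as well. That is the limitation the partial-correlation slides address.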
  22. Using partial correlations: x drives both y and z, so y and z show a strong indirect correlation.
      set.seed(2807); x <- rnorm(100)
      y <- 2*x+1+rnorm(100,0,0.1); cor(x,y)
      [1] 0.998826
      z <- 2*x+1+rnorm(100,0,0.1); cor(x,z)
      [1] 0.998751
      cor(y,z)
      [1] 0.9971105
      # Partial correlation
      cor(lm(x~z)$residuals,lm(y~z)$residuals)
      [1] 0.7801174
      cor(lm(x~y)$residuals,lm(z~y)$residuals)
      [1] 0.7639094
      cor(lm(y~x)$residuals,lm(z~x)$residuals)
      [1] -0.1933699
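The residual-based partial correlation on this slide can be cross-checked against the inverse of the sample covariance matrix. The sketch below relies on the standard identity that the partial correlation between two variables given all the others equals minus the corresponding entry of the inverse covariance matrix, rescaled by its diagonal. The variable names `pc_resid`, `pc_all` and `pc_prec` are introduced here for illustration only.

```r
# Same simulation as on the slide.
set.seed(2807)
x <- rnorm(100)
y <- 2 * x + 1 + rnorm(100, 0, 0.1)
z <- 2 * x + 1 + rnorm(100, 0, 0.1)

# Residual approach: correlate what is left of y and z once the
# linear effect of x has been removed from each.
pc_resid <- cor(lm(y ~ x)$residuals, lm(z ~ x)$residuals)

# Inverse-covariance approach: invert the sample covariance matrix,
# rescale it to unit diagonal and flip the sign off the diagonal.
S <- solve(cov(cbind(x, y, z)))
pc_all <- -cov2cor(S)
diag(pc_all) <- 1
pc_prec <- pc_all["y", "z"]

# The two computations should agree up to numerical error.
print(all.equal(pc_resid, pc_prec))
```

The matrix route returns every pairwise partial correlation at once, which matters as soon as there are more than a handful of variables.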
  24. Partial correlation and GGM: $(X_i)_{i=1,\ldots,n}$ are i.i.d. Gaussian random variables $\mathcal{N}(0, \Sigma)$ (gene expression); then $j \longleftrightarrow j'$ (genes $j$ and $j'$ are linked) $\Leftrightarrow$ $\mathrm{Cor}\left(X^j, X^{j'} \mid (X^k)_{k \neq j, j'}\right) \neq 0$. If (concentration matrix) $S =$
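The equivalence between edges and nonzero partial correlations can be illustrated numerically. The following sketch assumes a made-up chain-shaped concentration matrix, a sample size much larger than the number of variables, and an arbitrary threshold of 0.2; none of these come from the talk. It uses the standard identity linking partial correlations to the rescaled inverse covariance matrix, which is a textbook result and not text recovered from the slide.

```r
set.seed(123)
p <- 5      # variables (assumed)
n <- 2000   # individuals (assumed, deliberately large)

# Assumed "true" concentration matrix: a chain 1 - 2 - 3 - 4 - 5.
S_true <- diag(p)
for (j in 1:(p - 1)) {
  S_true[j, j + 1] <- 0.4
  S_true[j + 1, j] <- 0.4
}
Sigma <- solve(S_true)

# Draw n i.i.d. N(0, Sigma) rows: if Z has i.i.d. N(0, 1) entries and
# R = chol(Sigma), so that t(R) %*% R == Sigma, then Z %*% R has
# covariance Sigma.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)

# Estimated partial correlations from the inverse sample covariance.
S_hat <- solve(cov(X))
pcor_hat <- -cov2cor(S_hat)
diag(pcor_hat) <- 1

# Keep an edge when the estimated partial correlation is clearly
# nonzero (0.2 is an arbitrary illustrative threshold).
A_hat <- (abs(pcor_hat) > 0.2) * 1
diag(A_hat) <- 0

# True graph: nonzero off-diagonal entries of the concentration matrix.
A_true <- (S_true != 0) * 1
diag(A_true) <- 0

print(A_hat)
```

Inverting the sample covariance matrix only works when n exceeds p. In the microarray regime described earlier (tens of individuals, thousands of genes) the sample covariance matrix is singular, so this direct approach is not available there.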
