Marie-Laure Martin-Magniette - Mixture models for analysing transcriptome and ChIP-chip data

Mixture models are useful for identifying underlying structures. In such models, the density of the observations is modelled by a weighted sum of parametric densities (e.g. each component is a Gaussian distribution), and each component represents a subpopulation composed of observations sharing common characteristics. The first part of my talk will be dedicated to a presentation of mixture models. I will explain the concept and the outputs of a mixture-based analysis through easy examples. In the second part of my talk, I will show how mixture models can be applied to analyse transcriptomic data (co-expression analysis of Arabidopsis thaliana genes) and ChIP-chip data (detection of enriched regions and of differentially methylated regions).

First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/

Published in: Science

  1. Mixture models for analysing transcriptome and ChIP-chip data
     Marie-Laure Martin-Magniette
     French National Institute for Agricultural Research (INRA)
     Unit of Applied Mathematics and Informatics at AgroParisTech, Paris
     Unit of Plant Genomics Research (URGV), Evry
     M.L Martin-Magniette (INRA), Mixture models and genomic data, 7-11 July 2014
  2. Presentation outline
     1 Introduction
     2 Mixture model definition
     3 Genomic examples
     4 Conclusions
  3. Introduction
     - Observations described by 2 variables
     - The distribution of the observations seems easy to model with one Gaussian
  4. Introduction
     - Observations described by 2 variables
     - Data are scattered and subpopulations are observed
     - According to the experimental design, there is no external information about them
     - This is an underlying structure observed through the data
  5. Introduction: definition of a mixture model
     - It is a probabilistic model for representing the presence of subpopulations within an overall population.
     - A latent variable Z is introduced, indicating the subpopulation each observation comes from.
     - [Figure: three panels, "what we observe" (Z = ?), "the model", and "the expected results" (Z: 1 = •, 2 = •, 3 = •)]
     → It is an unsupervised classification method
  6. Functional annotation is the new challenge
     - It is now relatively easy to sequence an organism and to locate its genes
     - But between 20% and 40% of the genes have an unknown function
     - For Arabidopsis thaliana, 16% of the genes are orphan genes, i.e. without any information on their function
     → With high-throughput technologies, it is now possible to improve the functional annotation
  7. First genomic example: co-expression analysis
     - Co-expressed genes are good candidates to be involved in the same biological process (Eisen et al., 1998)
     - Pearson correlation values are often used to measure co-expression, but this is a local point of view
     - Co-expression analysis can be recast as a search for an underlying structure in a whole dataset
     Table: Examples of co-expression clusters of genes observed on 45 independent transcriptome experiments. Clusters are identified with a mixture.
  8. Second example: ChIP-chip analysis
     - These experiments aim at identifying interactions between a protein and DNA
     - Most methods look for peaks of log(IP/Input) along the genome
     - There is an underlying structure between the two samples
  9. Presentation outline
     1 Introduction
     2 Mixture model definition
     3 Genomic examples
     4 Conclusions
  10. Key ingredients of a mixture model
      [Figure: three panels, "what we observe" (Z = ?), "the model", and "the expected results" (Z: 1 = •, 2 = •, 3 = •)]
      Let $y = (y_1, \dots, y_n)$ denote $n$ observations with $y_i \in \mathbb{R}^Q$ and let $Z = (Z_1, \dots, Z_n)$ be the latent vector.
      1) Distribution of Z: the $\{Z_i\}$ are assumed to be independent with $P(Z_i = k) = \pi_k$ and $\sum_{k=1}^{K} \pi_k = 1$, where $K$ is the number of components of the mixture
         → $Z \sim \mathcal{M}(n; \pi_1, \dots, \pi_K)$
      2) Distribution of $(y_i \mid Z_i = k)$: a parametric distribution $f(\cdot; \gamma_k)$
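The two ingredients can be read as a two-step generative recipe. The sketch below is only an illustration under stated assumptions: the components are taken to be univariate Gaussians (so $\gamma_k$ is a mean and a standard deviation), the helper name `sample_mixture` and all parameter values are hypothetical, and labels are numbered from 0 rather than 1.

```python
import random

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n observations from a univariate Gaussian mixture (illustrative helper)."""
    rng = random.Random(seed)
    labels, values = [], []
    for _ in range(n):
        # Ingredient 1: latent label Z_i with P(Z_i = k) = pi_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Ingredient 2: y_i | Z_i = k follows f(.; gamma_k), here a Gaussian.
        labels.append(k)
        values.append(rng.gauss(means[k], sds[k]))
    return labels, values

# Hypothetical parameters for K = 3 components.
z, y = sample_mixture(1000, [0.5, 0.3, 0.2], [-4.0, 0.0, 5.0], [1.0, 1.0, 1.0])
```

In a real analysis only `y` would be observed; `z` is exactly the information the mixture model tries to recover.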
  11. Some properties:
      - The $\{Z_i\}$ are independent
      - The $\{Y_i\}$ are independent conditionally on $\{Z_i\}$
      - The couples $\{(Y_i, Z_i)\}$ are i.i.d.
      - The model is invariant under any permutation of the labels $\{1, \dots, K\}$ ⇒ the mixture model has $K!$ equivalent definitions.
      Distribution of Y:
      $$P(Y \mid K, \theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Y_i, Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(Z_i = k)\, P(Y_i \mid Z_i = k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k)$$
      → It is a weighted sum of parametric distributions known up to the parameter vector $\theta = (\pi_1, \dots, \pi_{K-1}, \gamma_1, \dots, \gamma_K)$
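As a small numerical check of the formula for $P(Y \mid K, \theta)$ and of the label-permutation invariance, here is a minimal sketch assuming univariate Gaussian components. The function names and the data values are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate Gaussian, playing the role of f(y; gamma_k)."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_log_likelihood(ys, weights, means, sds):
    """log P(Y | K, theta) = sum_i log sum_k pi_k f(y_i; gamma_k)."""
    total = 0.0
    for y in ys:
        g = sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(g)
    return total

ys = [-0.3, 0.1, 1.9, 2.2, 2.4]  # hypothetical observations
ll = mixture_log_likelihood(ys, [0.3, 0.7], [0.0, 2.0], [1.0, 1.0])
# Swapping the labels of the two components leaves the likelihood unchanged.
ll_swapped = mixture_log_likelihood(ys, [0.7, 0.3], [2.0, 0.0], [1.0, 1.0])
```

Both calls describe the same mixture, which is why the labels of the components carry no meaning by themselves.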
  12. Statistical inference of incomplete data models
      Maximum likelihood estimate:
      $$\hat{\theta} = \arg\max_{\theta} \log P(Y \mid K, \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k f(Y_i; \gamma_k) \right]$$
      → It is not always possible since this sum involves $K^n$ terms...
      Expectation-Maximization algorithm: iterative algorithm based on the expectation of the completed data conditionally on $\theta^{(l)}$
      $$\theta^{(l+1)} = \arg\max_{\theta} E\left[ \log P(Y, Z \mid K, \theta) \mid Y, \theta^{(l)} \right]$$
      → According to the theory, it implies that $\log P(Y \mid K, \theta)$ tends toward a local maximum.
  13. EM algorithm details
      Initialisation of $\theta^{(0)}$
      While the convergence criterion is not reached, iterate
      - E-step: calculation of the conditional probabilities
        $$\tau_{ik}^{(l)} = P(Z_i = k \mid y_i, \theta^{(l)}) = \frac{\pi_k^{(l)} f(y_i; \gamma_k^{(l)})}{\sum_{k'=1}^{K} \pi_{k'}^{(l)} f(y_i; \gamma_{k'}^{(l)})}$$
      - M-step: calculation of $\theta^{(l+1)}$ by maximising the complete likelihood where Z is replaced with the conditional probabilities
        $$\theta^{(l+1)} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(l)} \left[ \log \pi_k + \log f(y_i; \gamma_k) \right]$$
      → weighted version of the usual maximum likelihood estimates (MLE).
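The E-step and M-step can be sketched for a univariate Gaussian mixture as follows. This is a minimal illustration, not the implementation used in mclust or Rmixmod: the function names, the variance floor, the stopping rule and the toy data are all assumptions, and the code has no safeguard against empty components.

```python
import math

def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_gmm_1d(ys, weights, means, sds, max_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture, started from the given theta^(0)."""
    weights, means, sds = list(weights), list(means), list(sds)
    n, K = len(ys), len(weights)
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: tau_ik = pi_k f(y_i; gamma_k) / sum_k' pi_k' f(y_i; gamma_k')
        tau, ll = [], 0.0
        for y in ys:
            joint = [weights[k] * normal_pdf(y, means[k], sds[k]) for k in range(K)]
            g = sum(joint)
            ll += math.log(g)
            tau.append([j / g for j in joint])
        # M-step: weighted versions of the usual maximum likelihood estimates
        for k in range(K):
            nk = sum(t[k] for t in tau)
            weights[k] = nk / n
            means[k] = sum(t[k] * y for t, y in zip(tau, ys)) / nk
            var = sum(t[k] * (y - means[k]) ** 2 for t, y in zip(tau, ys)) / nk
            sds[k] = math.sqrt(max(var, 1e-6))  # assumed floor to avoid degenerate components
        # Stop when the log-likelihood no longer increases noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, sds, tau, ll

# Toy data with two visible groups and a hypothetical starting point.
ys = [-0.2, 0.1, 0.0, 0.3, -0.1, 4.9, 5.2, 5.0, 4.8, 5.1]
w_hat, m_hat, s_hat, tau_hat, ll = em_gmm_1d(ys, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

Running the function from several different starting points and keeping the fit with the highest log-likelihood is a common way to reduce the dependence on the initialisation.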
  14. EM algorithm properties
      - Convergence is always reached but not always toward a global maximum
      - The EM algorithm is sensitive to the initialisation step
      - The EM algorithm is implemented in most statistical software
      - In R, it is available in the mclust and Rmixmod packages. Rmixmod proposes the best strategy of initialisation
  15. Outputs of the model
      Distribution: $g(y_i) = \pi_1 f(y_i; \gamma_1) + \pi_2 f(y_i; \gamma_2) + \pi_3 f(y_i; \gamma_3)$
      Conditional probabilities: $\tau_{ik} = P(Z_i = k \mid y_i) = \dfrac{\pi_k f(y_i; \gamma_k)}{g(y_i)}$

      τik (%)   i = 1   i = 2   i = 3
      k = 1      65.8     0.7     0.0
      k = 2      34.2    47.8     0.0
      k = 3       0.0    51.5   100.0

      → These probabilities enable the classification of the observations into the subpopulations
  16. Outputs of the model
      Maximum A Posteriori rule: each observation is classified into the component for which its conditional probability $\tau_{ik}$ is the highest.
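The Maximum A Posteriori rule amounts to an argmax over each observation's conditional probabilities. A minimal sketch, using the probabilities given on the slide for observations i = 1 and i = 2 (expressed as fractions); the function name `map_classify` is hypothetical.

```python
def map_classify(tau):
    """Return, for each observation, the 1-based label k maximising tau_ik."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in tau]

# One row per observation, one column per component k = 1, 2, 3.
tau = [
    [0.658, 0.342, 0.000],  # i = 1
    [0.007, 0.478, 0.515],  # i = 2
]
labels = map_classify(tau)
```

Observation i = 2 illustrates an uncertain classification: it is assigned to component 3 although component 2 is almost as probable.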
  17. Model selection
      - The number of components of the mixture is often unknown
      - A collection of models where K varies between 2 and $K_{max}$ is considered
      - The best model is the one maximising a criterion
      Bayesian Information Criterion (BIC)
      - proxy of the integrated likelihood $P(Y \mid K) = \int P(Y \mid K, \theta)\, \pi(\theta \mid K)\, d\theta$
      - aims at finding a good number of components for a global fit of the data distribution
      $$BIC(K) = \log P(Y \mid K, \hat{\theta}) - \frac{\nu_K}{2} \log(n)$$
      where $\nu_K$ is the number of free parameters of the model and $P(Y \mid K, \hat{\theta})$ is the maximum likelihood under this model.
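A minimal sketch of BIC-based selection with the sign convention of the slide (larger is better). The maximised log-likelihoods below are made-up numbers, not results from the talk, and the parameter count assumes a univariate Gaussian mixture with one mean and one variance per component.

```python
import math

def bic(max_log_lik, n_free_params, n_obs):
    """BIC(K) = log P(Y | K, theta_hat) - (nu_K / 2) * log(n)."""
    return max_log_lik - 0.5 * n_free_params * math.log(n_obs)

def n_free_params_gmm_1d(k):
    """nu_K for a univariate Gaussian mixture: (K - 1) proportions, K means, K variances."""
    return (k - 1) + 2 * k

n_obs = 200
# Hypothetical maximised log-likelihoods for K = 2, 3, 4 (illustration only).
max_log_liks = {2: -420.0, 3: -395.0, 4: -393.0}
bic_values = {k: bic(ll, n_free_params_gmm_1d(k), n_obs) for k, ll in max_log_liks.items()}
best_k = max(bic_values, key=bic_values.get)
```

With these numbers, moving from K = 3 to K = 4 improves the log-likelihood by less than the extra penalty, so K = 3 is retained.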
  18. Model selection
      Integrated Completed Likelihood (ICL)
      - proxy of the integrated complete likelihood $P(Y, Z \mid K)$
      - dedicated to classification since it strongly penalizes models for which the classification is uncertain
      $$ICL(K) = BIC(K) + \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik} \log \tau_{ik}$$
      where $BIC(K)$ is the criterion defined on the previous slide and $\tau_{ik}$ are the conditional probabilities under the fitted model.
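The extra term in ICL is the sum of $\tau_{ik} \log \tau_{ik}$, which is zero for a perfectly certain classification and negative otherwise. A minimal sketch, with hypothetical function names and the usual convention $0 \log 0 = 0$:

```python
import math

def classification_penalty(tau):
    """sum_i sum_k tau_ik * log(tau_ik), with the convention 0 * log(0) = 0."""
    return sum(t * math.log(t) for row in tau for t in row if t > 0)

def icl(bic_value, tau):
    """ICL(K) = BIC(K) + sum_i sum_k tau_ik * log(tau_ik)."""
    return bic_value + classification_penalty(tau)

# Hypothetical conditional probabilities for two fitted models with the same BIC.
certain = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.6, 0.4]]
icl_certain = icl(-100.0, certain)
icl_uncertain = icl(-100.0, uncertain)
```

The model with the uncertain classification gets the lower ICL, which is the behaviour described above.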
  19. Conclusions on the model selection
      - BIC aims at finding a good number of components for a global fit of the data distribution. It tends to overestimate the number of components
      - ICL is dedicated to a classification purpose. It strongly penalizes models for which the classification is uncertain.
      - Whatever the criterion, it must be a convex function of the number of components
      [Figure: two criterion curves, "Bad behavior" and "Correct behavior"]
      → a non-convex function may indicate a modeling issue
  20. Presentation outline
      1 Introduction
      2 Mixture model definition
      3 Genomic examples
        - Mixtures for co-expression analysis
        - Mixtures for analysing ChIP-chip data
      4 Conclusions
  21. GEM2Net: From gene expression modeling to -omics network
      - Goal: explore the orphan gene space to identify new genes involved in defense and adaptation processes
      - Method: predict co-expression networks using mixture models
      - Data: an original resource generated by the transcriptomic platform of URGV
        - Homogeneous data generated with the CATMA microarray
        - 5,095 genes not present in the Affymetrix chip
        - High diversity of biological samples relative to stress conditions
  22. Workflow overview
      - Extraction from CATdb of 387 stress comparisons
      - 17,264 genes are differentially expressed in at least one of these comparisons (FWER controlled at 5% over all the tests)
      - Analyses performed with Gaussian mixture models
      - According to the BIC curve, the naive clustering on the whole dataset is not relevant
      - Gene co-expression depends on the stress categories
  23. Results of the co-expression analysis
      - 18 categories (9 biotic and 9 abiotic), identification of 681 clusters
      - Large overlap between biotic and abiotic clusters
      - 98% of clusters have a functional bias for a Gene Ontology term
      - 80% are associated with a stress term
      - 39% have a preferential sub-cellular localization in the plastid
      - 18% are enriched in transcription factors (TF), and for stifenia, no cluster is enriched in TF
  24. Focus on nematode stress
      - 7467 genes described by 10 expression differences
      - 29 co-expression clusters identified
      - 1519 genes with a conditional probability close to 1
      Example of Cluster 14
      - 49 genes repressed from 14 days after infection
      - 13 genes known to be involved in stress response
      - 10 orphan genes
      - Endoplasmic reticulum bias
  25. GEM2Net database
      http://urgv.evry.inra.fr/GEM2NET
      - Integration of various resources: gene ontology, genes involved in stress responses, gene families (transcription factors and hormones) and protein-protein interactions (experimental and predicted).
      - Original representation and interactive visualization, using pie charts to summarize the functional biases at first glance
  26. ChIP-chip experiments
      - The log-ratio is not tractable, whereas the couple (IP, Input) is
      - Development of a mixture of 2 linear regressions
  27. MultiChIPmix: Mixture of two linear regressions
      Let $Z_i$ be the status of probe $i$: $P(Z_i = 1) = \pi$
      The linear relation between IP and Input depends on the probe status
      $$IP_{ir} = \begin{cases} a_{0r} + b_{0r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 0 \text{ (normal)} \\ a_{1r} + b_{1r}\, Input_{ir} + E_{ir} & \text{if } Z_i = 1 \text{ (enriched)} \end{cases} \qquad V(IP_{ir}) = \sigma_r^2$$
      Martin-Magniette et al. (2008), Bioinformatics
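Once the parameters are estimated, the conditional probability that a probe is enriched follows from Bayes' rule applied to the two regression lines. The sketch below is an illustration only, not the MultiChIPmix code: it assumes a single replicate (the index r is dropped), Gaussian errors, and made-up parameter values; the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def enrichment_probability(ip, inp, pi, a0, b0, a1, b1, sigma):
    """P(Z_i = 1 | IP_i, Input_i) for a mixture of two linear regressions."""
    # Component 0 (normal): IP = a0 + b0 * Input + E, with weight 1 - pi.
    normal = (1 - pi) * normal_pdf(ip, a0 + b0 * inp, sigma)
    # Component 1 (enriched): IP = a1 + b1 * Input + E, with weight pi.
    enriched = pi * normal_pdf(ip, a1 + b1 * inp, sigma)
    return enriched / (normal + enriched)

# Hypothetical fitted parameters (illustration only).
params = dict(pi=0.1, a0=0.0, b0=1.0, a1=1.0, b1=1.2, sigma=0.5)
tau_on_enriched_line = enrichment_probability(13.0, 10.0, **params)
tau_on_normal_line = enrichment_probability(10.0, 10.0, **params)
```

Working with the couple (IP, Input) rather than the log-ratio lets the separation between the two populations depend on the Input level.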
  29. Used to
      - create the first epigenomic map of Arabidopsis thaliana: Roudier et al. (2011), EMBO Journal
      - study the additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids: Moghaddam et al. (2011), Plant Journal
  30. MultiChIPmixHMM for taking the spatial information into account
      - When probes are (almost) equally spaced along the genome, hybridisation signals tend to be clustered
      - Assuming that the probe statuses are (Markov-)dependent brings this information into the model:
        $\{Z_i\} \sim MC(\pi, \nu)$ with $\pi_{k\ell} = \Pr\{Z_i = k \mid Z_{i-1} = \ell\}$
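To illustrate how Markov dependence changes the conditional probabilities, here is a sketch of the standard forward-backward recursion for a hidden Markov model. It is not the MultiChIPmixHMM implementation: the emissions are simplified to univariate Gaussians instead of regressions, and the function names, the transition matrix (indexed here as `trans[previous][current]`) and the signal values are all hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def forward_backward(ys, init, trans, means, sds):
    """Posterior P(Z_i = k | y_1, ..., y_n) for a Gaussian-emission HMM."""
    n, K = len(ys), len(init)
    emis = [[normal_pdf(y, means[k], sds[k]) for k in range(K)] for y in ys]
    # Forward pass, rescaled at each position to avoid numerical underflow.
    first = [init[k] * emis[0][k] for k in range(K)]
    alpha = [[a / sum(first) for a in first]]
    for i in range(1, n):
        cur = [emis[i][k] * sum(alpha[i - 1][l] * trans[l][k] for l in range(K))
               for k in range(K)]
        alpha.append([c / sum(cur) for c in cur])
    # Backward pass, also rescaled.
    beta = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[k][l] * emis[i + 1][l] * beta[i + 1][l] for l in range(K))
               for k in range(K)]
        beta[i] = [c / sum(cur) for c in cur]
    # Combine both passes and normalise at each position.
    post = []
    for i in range(n):
        w = [alpha[i][k] * beta[i][k] for k in range(K)]
        post.append([x / sum(w) for x in w])
    return post

# State 0 = normal (mean 0), state 1 = enriched (mean 3); the third probe is ambiguous.
ys = [3.0, 3.1, 1.5, 2.9, 3.0]
post = forward_backward(ys, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [0.0, 3.0], [1.0, 1.0])
```

The third probe lies exactly halfway between the two means, so its own signal is uninformative; its enriched neighbours pull its posterior probability strongly toward the enriched state.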
  31. Table: Example of one known H3K27me3 target gene identified only with MultiChIPmixHMM.
      - MultiChIPmix and MultiChIPmixHMM are alternative methods to peak detection
      - Analysis of several replicates simultaneously + modelling the spatial dependency = more accurate conditional probabilities
      - MultiChIPmixHMM is available as an R package: Bérard et al. (2013), BMC Bioinformatics
  32. Presentation outline
      1 Introduction
      2 Mixture model definition
      3 Genomic examples
      4 Conclusions
  33. Conclusions
      - Mixtures reveal underlying structures
      - Key ingredients are P(Z) and P(Y|Z)
      - For genomic data, component distribution modeling is sometimes tricky, especially for RNA-Seq data
      - Applications on genomic data sometimes raise new methodological questions about the parameter inference and classification rules
      - Examples of R packages using mixtures: Mclust, Rmixmod, MultiChIPmixHMM, HTSDiff, HTSCluster, poisson.glm.mix
  34. Acknowledgements
      Statistics, Bioinformatics, Biology: S. Robin, V. Brunaud, J-P. Renou, T. Mary-Huard, J-P Tamby, E. Delannoy, C. Bérard, R. Zaag, S. Balzergue, G. Celeux, Z. Tariq, C. Maugis-Rabusseau, V. Colot, G. Rigaill, F. Roudier, A. Rau, P. Papastamoulis, M. Seifert
      Thank you for your attention!
