mixOmics and intro

Data integration

Time course data

Cross-platform

Integration of highly dimensional
biological data ...
mixOmics and intro

Data integration

Time course data

Cross-platform

In God we trust, all others must bring
data.(W. Ed...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Outline of this talk and research que...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Philosophy

mixOmics: philosophy
is o...
mixOmics and intro

Data integration

Time course data

Cross-platform

Framework

Framework

ExploraFon$

INPUT$
DATA$

O...
mixOmics and intro

Data integration

Time course data

Cross-platform

Web interface

Web interface

http://www.qfab.org/...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Introduction with PCA

Principal Comp...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Introduction with PCA

−0.2
−0.4

z2
...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Dealing with high dimensional data

P...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Variable selection

sparse Principal ...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Motivation

Integration of multiple d...
mixOmics and intro

Data integration

Example

Example
Kidney transplant study
Transcriptomics and proteomics study
of 40 ...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Example

New developments include a s...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Motivation

Integration of time-cours...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Modelling trajectories

Modelling tra...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Modelling trajectories

Step 1: use c...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Some results

Variable representation...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Example: sample representation

Sampl...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Motivation

Cross-platform comparison...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

The challenges

What is a ‘batch’ effe...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

The challenges

1""."."."."."."."."."...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

The challenges

Multi-group Analysis
...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

The challenges

Multi-group PLS-DA
mg...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

The challenges

Multi-group PLS-DA, m...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

sparse multi-group PLSDA
30

mg PLS−D...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Comparisons

Comparisons
(2)

(3)

q
...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Conclusions

Multivariate integrative...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Un GRAND MERCI à ...

Data providers
...
mixOmics and intro

Data integration

Time course data

Cross-platform

Conclusions

Un GRAND MERCI à ...

Questions?
Spri...
Upcoming SlideShare
Loading in …5
×

Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao

1,042 views

Published on

1. It is all about mixOmics?
2. What is the common information contained in different
experiments?
3. What are the correlated features across different time course experiments?
4. Can we combine similar experiments performed in different labs or/and on different platforms?

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,042
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
16
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao

  1. 1. mixOmics and intro Data integration Time course data Cross-platform Integration of highly dimensional biological data sets with multivariate analysis Kim-Anh Lê Cao Queensland Facility for Advanced Bioinformatics The University of Queensland Kim-Anh Lê Cao AMATA 2013 Conclusions
  2. 2. mixOmics and intro Data integration Time course data Cross-platform In God we trust, all others must bring data.(W. Edwards Deming) Kim-Anh Lê Cao AMATA 2013 Conclusions
  3. 3. mixOmics and intro Data integration Time course data Cross-platform Conclusions Outline of this talk and research questions 1 It is all about mixOmics 2 What is the common information contained in different experiments? 3 What are the correlated features across different time course experiments? 4 Can we combine similar experiments performed in different labs or/and on different platforms? Kim-Anh Lê Cao AMATA 2013
  4. 4. mixOmics and intro Data integration Time course data Cross-platform Conclusions Philosophy mixOmics: philosophy is originally an R package developed for the statistical exploration and integration of large biological data sets. Development of statistical multivariate approaches Variable selection included in the methodologies Graphical outputs User-friendly use: website, R package, web interface First R CRAN release in May 2009. Lê Cao et al. mixOmics: an R package to unravel relationships between two omics data sets, Bioinformatics, 25 (21). Kim-Anh Lê Cao AMATA 2013
  5. 5. mixOmics and intro Data integration Time course data Cross-platform Framework Framework ExploraFon$ INPUT$ DATA$ One$data$set$ PRE$PROCESSING$ Missing$ values$ GRAPHICAL$$ OUTPUTS$ PCA$ IPCA$ Sample$plots$ PLS/DA$ Time$Course$ Two$ matching$ data$sets$ Repeated$ measurements$ Smoothing$ splines$ PLS$ PLS$ MulFlevel$ rCCA$ $ IntegraFon$ NIPALS$ MULTIVARIATE$$ APPROACH$ More$than$ two$data$ sets$ Kim-Anh Lê Cao AMATA 2013 rGCCA$ Comparison$ across$$ plaIorms$ mulF/ group$ Variable$plots$ Conclusions
  6. 6. mixOmics and intro Data integration Time course data Cross-platform Web interface Web interface http://www.qfab.org/mixomics/ Kim-Anh Lê Cao AMATA 2013 Conclusions
  7. 7. mixOmics and intro Data integration Time course data Cross-platform Conclusions Introduction with PCA Principal Component Analysis: an introduction 0.2 x2 0.0 −0.2 −0.4 → principal components: artificial variables that are linear combinations of the original variables: 0.4 Seek the best directions in the data that account for most of the variability c1 = w1 x 1 + w2 x 2 + · · · + wp x p −1.0 −0.5 0.0 x1 where c1 is the first principal component with max. variance {w1 , . . . , wp } are the weights in the linear combination {x 1 , . . . , x p } are the gene expression profiles. All PCs are mutually orthogonal. (c1 , c2 , . . . ) Kim-Anh Lê Cao AMATA 2013 0.5 1.0
  8. 8. mixOmics and intro Data integration Time course data Cross-platform Conclusions Introduction with PCA −0.2 −0.4 z2 Project the data on these new axes to summarize the information related to the variance. 0.0 0.2 0.4 The new PCs form a a smaller subspace of dimension < p −1.0 −0.5 0.0 0.5 z1 → approximate representation of the data points in a lower dimensional space PCA should be a compulsory first step in exploratory data analysis to: Have a first understanding of the underlying data structure Identify bias, experimental errors, batch effects Kim-Anh Lê Cao AMATA 2013 1.0
  9. 9. mixOmics and intro Data integration Time course data Cross-platform Conclusions Dealing with high dimensional data Problem with PCA: interpretation can be difficult with very large number of (possibly) irrelevant variables. Remember that the principal components are linear combinations of the original variables: c = w1 x 1 + w2 x 2 + · · · + wp x p A clearer signal could be observed if some of the variable weights {w1 , . . . , wp } could be set to 0 for the irrelevant variables: c = 0 ∗ x 1 + w2 x 2 + · · · + 0 ∗ x p These variables weights are defined in the loading vectors. Important weights = important contribution to the PC. Similar weights = correlated variables. Kim-Anh Lê Cao AMATA 2013
  10. 10. mixOmics and intro Data integration Time course data Cross-platform Conclusions Variable selection sparse Principal Component Analysis: sPCA loading vectors (sPCA) 0.10 loading 1 0.00 40 60 80 100 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 −0.5 0.0 0.5 1.0 0.15 0.10 loading 2 0.05 0.15 0.10 0.05 0.00 −1.0 0.00 −0.4 loading 2 −0.2 20 0 x2 0 0.0 0.05 0.10 loading 1 0.00 0.2 0.05 0.4 0.15 0.15 sparse loading vectors (PCA) Principal components x1 The principal components are linear combinations of the original variables, variables weights are defined in the associated loading vectors. sparse PCA computes the sparse loading vectors to remove irrelevant variables using lasso penalizations (Shen & Huang 2008, J. Multivariate Analysis ). Kim-Anh Lê Cao AMATA 2013
  11. 11. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Integration of multiple data sets: what is the common information contained in different experiments? 1"".".".".".".".".".".".".".".""."."p" 1" 2" ." ." ." ." n" 1""."."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s" ."."." 0" 1" 0" 1" 1" ." ." 0" Statistical data integration, how to? Define the best components (linear combination of each data set variables) that are best correlated with each other (RGCCA: Regularized Generalised CCA). The same samples are measured across different experiments Select relevant biological entities which are correlated across the different data sets Emphasis on graphical outputs to ease results interpretation Kim-Anh Lê Cao AMATA 2013
  12. 12. mixOmics and intro Data integration Example Example Kidney transplant study Transcriptomics and proteomics study of 40 patients with kidney transplant, rejecting (n1 = 20) or not (n2 = 20) the transplant, PROOF Centre, UBC. Kim-Anh Lê Cao AMATA 2013 Time course data Cross-platform Conclusions
  13. 13. mixOmics and intro Data integration Time course data Cross-platform Conclusions Example New developments include a sparse RGGCA (sGCCA) to select correlated variables across the different platforms Relevance networks Gonzales, Lê Cao et al. (2012), J. Data Mining Kim-Anh Lê Cao AMATA 2013
  14. 14. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Integration of time-course data: what are the correlated features across different time course experiments? 1""."."."."."."."."."..".".""."."p" 1" 2" ." ." ." ." n" 1""."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s" 0" 1" 0" 1" 1" ." ." 0" Requires the same samples measured at several time point Variable selection across platforms and time Hypothesize that correlated biological entities belong to the same biological pathway Kim-Anh Lê Cao AMATA 2013
  15. 15. mixOmics and intro Data integration Time course data Cross-platform Conclusions Modelling trajectories Modelling trajectories Use cubic smoothing splines to summarize each profile Enables interpolation of missing values There exists a theoretical link between smoothing splines and linear mixed models (Verbyla et al. 1999). Example on one CAGE cluster 50 Appropriate for correlated repeated measures q 40 TPM Expr 35 30 q 25 q q q q d0 d6 d12 time Kim-Anh Lê Cao AMATA 2013 q q q q 20 Assesses the variability for each feature (technical , biological variability) tech rep 1 tech rep 2 45 q No parameters to tune Flexibility of the model q d18
  16. 16. mixOmics and intro Data integration Time course data Cross-platform Conclusions Modelling trajectories Step 1: use cubic smoothing splines to reduce one dimension (samples dimension) Step 2: apply sGCCA on the estimated splines to identify correlated profiles both within and between the two data sets → → → → modelisation of the trajectories filtering of the profiles integration of two types of data selection and clustering of the correlated time profiles Kidney transplant study Transcriptomics and proteomics study of 40 patients with kidney transplant, rejecting (n1 = 20) or not (n2 = 20) the transplant. Follow up on 5 time points (weeks), PROOF Centre, UBC. Kim-Anh Lê Cao AMATA 2013
  17. 17. mixOmics and intro Data integration Time course data Cross-platform Conclusions Some results Variable representation: time profile clusters are selected dimension 2 dimension 3 0 expression −1.5 −2 −2 −1.0 −1 −1 0 expression 0.0 −0.5 expression 0.5 1 1 1.0 2 2 1.5 dimension 1 2 4 6 time 8 10 2 4 6 time 8 10 2 4 6 8 time The approach selects both transcripts and proteins which are positively or negatively correlated across time Quality of clusters decreases with the number of PLS components (dimensions) as obvious patterns cannot be extracted anymore Kim-Anh Lê Cao AMATA 2013 10
  18. 18. mixOmics and intro Data integration Time course data Cross-platform Conclusions Example: sample representation Sample representation: how the approach can extract common information across these different platforms Project Grandiose Longitudinal study of cell reprogramming - 8 time points - 10 platforms: microarray, cell surface proteome, total proteome, RNA-seq, miRNA... See also keynote speaker Session 11 Nicole Cloonan Kim-Anh Lê Cao AMATA 2013
  19. 19. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Cross-platform comparison: can we combine similar experiments performed in different labs or with different platforms? Not the same samples measured across different experiments Experimental biases (batch effects) Well known fact: microarray experiments across studies bring different results! “But I would very much like to compare my gene signature from my experiment to my colleagues’ gene signatures from their experiments!" Aim: Identify genes diagnostic of the key characteristics of stem cells derived from independent studies See contributed paper Session 9 Celena Heazlewood and Poster 46, Florian Rohart Kim-Anh Lê Cao AMATA 2013
  20. 20. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges What is a ‘batch’ effect? ● Brennand Jia Maherali Nayler Zaehres 0 −50 1.2 Boxplots of the samples (data normalised Fibroblast hESC hiPSC −40 1.4 −20 PC2: 15.87 % 20 40 Classic PCA: Across platform 3 classes per experiment) 1 0.8 1.0 PCA highlights a ‘lab effect’. → Combining different experiments is a real challenge Kim-Anh Lê Cao AMATA 2013 0 50 100 PC1: 58.76 % 150 200
  21. 21. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges 1"".".".".".".".".".".""."."p" 1" 2" ." ." ." ." n1" 1"" 2" 1" 3" 3" 1" 3" 1" 2" ." n2" 1"" 3" 3" 2" 1" 2" ." ." ." n3" 1"" 2" 1" 1" 2" 3" 1" 2" ." ." ." n4" 1"" 2" 1" 3" 3" 1" ." ." ." Kim-Anh Lê Cao AMATA 2013 X = several microarray experiments, performed in different labs but studying the same biological conditions Not the same samples, but the same variables Group = experiment/platform/lab Y = biological conditions → Supervised problem We propose: multi-group analysis for the selection of common biomarkers across the data sets.
  22. 22. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group Analysis Individuals are a priori structured into several groups within group part: group structure effect is removed between-group part: group structure effect is taken into account Fig from A. Eslami Partial and global components and loading vectors → Understand and model the common stucture between the groups and within each group →Two block multi-group analysis by the means of Partial Least Squares regression (PLS) Kim-Anh Lê Cao AMATA 2013
  23. 23. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group PLS-DA mg PLS−DA, Across platform 3 classes 10 q q q q q q q q q q 0 0 comp 2 50 X−variate 2 20 100 30 PLS−DA, diff platforms 3 classes qq q q q q −50 −10 q q −50 0 50 X−variate 1 Fibroblast hESC hiPSC ● Kim-Anh Lê Cao AMATA 2013 Brennand Jia Maherali Nayler Zaehres −50 0 50 100 comp 1 Classical PLSDA (LHS) vs multi-group PLSDA (RHS), the latter gives better (visual) results. 150
  24. 24. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group PLS-DA, more data sets Multi group PLS−DA, same platform Multi group PLS−DA, MAQC 15 q q qq q q q q q 10 60 20 Multi group PLS−DA, diff platform 2 classes q q q q q q 5 0 40 q q q qq q −10 −20 0 Comp 2 q q q 0 100 Comp 1 200 300 −15 −40 q q −60 q −100 q −5 Comp 2 0 −20 q q q −60 q q q q q −40 Comp 2 20 q −40 −20 0 Comp 1 20 40 60 q q qq q q −10 0 10 20 30 Comp 1 Same Affymetrix platform; MAQC Illumina + Affymetrix; Illumina + Affymetrix → In order to select common biomarkers across different data sets, variable selection is needed Kim-Anh Lê Cao AMATA 2013
  25. 25. mixOmics and intro Data integration Time course data Cross-platform Conclusions sparse multi-group PLSDA 30 mg PLS−DA, Across platform 3 classes 10 20 sparse multi-group PLSDA and performance 0 comp 2 q q q q q q q q q q qq q q q q −10 q q −50 sparse mg PLS−DA, Across platform 3 classes 0 50 100 150 comp 1 10 Table: Classif. error rate (10-fold CV) diff. platform K=2 diff. platform K=3 #samples 53 70 82 #variables 22K 9K 9K #groups 5 5 5 multi grp 8.9% 15.3% 16.1% sparse multi grp 7.7% 4.5% 10.6% #selected markers 40 30 20 5 same platform K=3 comp 2 q q q q q q −5 0 q q q q q q q q q q q −5 0 5 10 comp 1 15 training set: circles, testing set: triangles with colors as predicted Kim-Anh Lê Cao AMATA 2013 20
  26. 26. mixOmics and intro Data integration Time course data Cross-platform Conclusions Comparisons Comparisons (2) (3) q q q q q q q q q q q q q q q q q q q q q q qq q q qq q −5 −0.1 q q q q −0.1 q q q q 0.1 q q q 0.0 q q q q q q q qq comp 2 comp 2 q q q 0 0.1 q q q q q 0.0 X−variate 2 0.2 0.2 5 0.3 0.3 0.4 10 (1) −0.1 0.0 0.1 0.2 0.3 −5 0 X−variate 1 3 sPLS 3 4 3 3 20 0.0 0.1 0.2 0.3 comp 1 smgPLS 3 1 Data centered per group 2 3 10 AMATA 2013 15 sPLS + ComBat 7 sPLS + ComBat Kim-Anh Lê Cao 10 comp 1 sPLS smgPLS 5 Multi group PLSDA ComBat Combating Batches of gene expression microarray data (Johnson and Li, 3 2007)
  27. 27. mixOmics and intro Data integration Time course data Cross-platform Conclusions Conclusions Multivariate integrative approaches are flexible and can answer various types of questions mixOmics implements 6 different methodologies plus their sparse variants Data integration, variable selection, graphical outputs Includes a web-associated interface (and soon through Galaxy too) Help generating new hypotheses, further statistical tests New additions! (still in development) Smoothing spline/LMM framework for the integration of time-course data Multi-group analysis to address a challenging current problem in molecular biology. Kim-Anh Lê Cao AMATA 2013
  28. 28. mixOmics and intro Data integration Time course data Cross-platform Conclusions Un GRAND MERCI à ... Data providers Kidney transplant study Methods development Multiple data integration Oliver Günther UBC Arthur Tenenhaus Scott Tebutt UBC Time course Grandiose project Andras Nagy Jasmin Straube Univ. Toronto Stemformatics team Supelec Paris QFAB Kathy Ruggiero Univ. Auckland Rowland Mosbergen INRA Toulouse UQ, Univ. Glasgow Cross-platform comparison UQ Stéphanie Bougeard ANSES Othmar Korn UQ Christine Wells Christèle Robert Sébastien Déjean Univ. Tlse Ignacio González Univ. Tlse Xin Yi Chua QFAB Kim-Anh Lê Cao AMATA 2013 ANSES, ONIRIS Florian Rohart mixOmics team Aida Eslami AIBN UQ
  29. 29. mixOmics and intro Data integration Time course data Cross-platform Conclusions Un GRAND MERCI à ... Questions? Springtime Bioinformatics Training Workshops by QFAB (University of Queensland) RNA seq analysis, 4 Nov. Introduction to statististics for biologists, 19-21 Nov. Variant detection, 22 Nov. Introduction to De Novo Genome Assembly, 9 Dec. training@qfab.org http://www.math.univ-toulouse.fr/∼biostat/mixOmics Kim-Anh Lê Cao AMATA 2013

×