Your SlideShare is downloading. ×
Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao

452
views

Published on

1. It is all about mixOmics? …

1. It is all about mixOmics?
2. What is the common information contained in different
experiments?
3. What are the correlated features across different time course experiments?
4. Can we combine similar experiments performed in different labs or/and on different platforms?

Published in: Technology

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
452
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. mixOmics and intro Data integration Time course data Cross-platform Integration of highly dimensional biological data sets with multivariate analysis Kim-Anh Lê Cao Queensland Facility for Advanced Bioinformatics The University of Queensland Kim-Anh Lê Cao AMATA 2013 Conclusions
  • 2. mixOmics and intro Data integration Time course data Cross-platform In God we trust, all others must bring data.(W. Edwards Deming) Kim-Anh Lê Cao AMATA 2013 Conclusions
  • 3. mixOmics and intro Data integration Time course data Cross-platform Conclusions Outline of this talk and research questions 1 It is all about mixOmics 2 What is the common information contained in different experiments? 3 What are the correlated features across different time course experiments? 4 Can we combine similar experiments performed in different labs or/and on different platforms? Kim-Anh Lê Cao AMATA 2013
  • 4. mixOmics and intro Data integration Time course data Cross-platform Conclusions Philosophy mixOmics: philosophy is originally an R package developed for the statistical exploration and integration of large biological data sets. Development of statistical multivariate approaches Variable selection included in the methodologies Graphical outputs User-friendly use: website, R package, web interface First R CRAN release in May 2009. Lê Cao et al. mixOmics: an R package to unravel relationships between two omics data sets, Bioinformatics, 25 (21). Kim-Anh Lê Cao AMATA 2013
  • 5. mixOmics and intro Data integration Time course data Cross-platform Framework Framework ExploraFon$ INPUT$ DATA$ One$data$set$ PRE$PROCESSING$ Missing$ values$ GRAPHICAL$$ OUTPUTS$ PCA$ IPCA$ Sample$plots$ PLS/DA$ Time$Course$ Two$ matching$ data$sets$ Repeated$ measurements$ Smoothing$ splines$ PLS$ PLS$ MulFlevel$ rCCA$ $ IntegraFon$ NIPALS$ MULTIVARIATE$$ APPROACH$ More$than$ two$data$ sets$ Kim-Anh Lê Cao AMATA 2013 rGCCA$ Comparison$ across$$ plaIorms$ mulF/ group$ Variable$plots$ Conclusions
  • 6. mixOmics and intro Data integration Time course data Cross-platform Web interface Web interface http://www.qfab.org/mixomics/ Kim-Anh Lê Cao AMATA 2013 Conclusions
  • 7. mixOmics and intro Data integration Time course data Cross-platform Conclusions Introduction with PCA Principal Component Analysis: an introduction 0.2 x2 0.0 −0.2 −0.4 → principal components: artificial variables that are linear combinations of the original variables: 0.4 Seek the best directions in the data that account for most of the variability c1 = w1 x 1 + w2 x 2 + · · · + wp x p −1.0 −0.5 0.0 x1 where c1 is the first principal component with max. variance {w1 , . . . , wp } are the weights in the linear combination {x 1 , . . . , x p } are the gene expression profiles. All PCs are mutually orthogonal. (c1 , c2 , . . . ) Kim-Anh Lê Cao AMATA 2013 0.5 1.0
  • 8. mixOmics and intro Data integration Time course data Cross-platform Conclusions Introduction with PCA −0.2 −0.4 z2 Project the data on these new axes to summarize the information related to the variance. 0.0 0.2 0.4 The new PCs form a a smaller subspace of dimension < p −1.0 −0.5 0.0 0.5 z1 → approximate representation of the data points in a lower dimensional space PCA should be a compulsory first step in exploratory data analysis to: Have a first understanding of the underlying data structure Identify bias, experimental errors, batch effects Kim-Anh Lê Cao AMATA 2013 1.0
  • 9. mixOmics and intro Data integration Time course data Cross-platform Conclusions Dealing with high dimensional data Problem with PCA: interpretation can be difficult with very large number of (possibly) irrelevant variables. Remember that the principal components are linear combinations of the original variables: c = w1 x 1 + w2 x 2 + · · · + wp x p A clearer signal could be observed if some of the variable weights {w1 , . . . , wp } could be set to 0 for the irrelevant variables: c = 0 ∗ x 1 + w2 x 2 + · · · + 0 ∗ x p These variables weights are defined in the loading vectors. Important weights = important contribution to the PC. Similar weights = correlated variables. Kim-Anh Lê Cao AMATA 2013
  • 10. mixOmics and intro Data integration Time course data Cross-platform Conclusions Variable selection sparse Principal Component Analysis: sPCA loading vectors (sPCA) 0.10 loading 1 0.00 40 60 80 100 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 −0.5 0.0 0.5 1.0 0.15 0.10 loading 2 0.05 0.15 0.10 0.05 0.00 −1.0 0.00 −0.4 loading 2 −0.2 20 0 x2 0 0.0 0.05 0.10 loading 1 0.00 0.2 0.05 0.4 0.15 0.15 sparse loading vectors (PCA) Principal components x1 The principal components are linear combinations of the original variables, variables weights are defined in the associated loading vectors. sparse PCA computes the sparse loading vectors to remove irrelevant variables using lasso penalizations (Shen & Huang 2008, J. Multivariate Analysis ). Kim-Anh Lê Cao AMATA 2013
  • 11. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Integration of multiple data sets: what is the common information contained in different experiments? 1"".".".".".".".".".".".".".".""."."p" 1" 2" ." ." ." ." n" 1""."."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s" ."."." 0" 1" 0" 1" 1" ." ." 0" Statistical data integration, how to? Define the best components (linear combination of each data set variables) that are best correlated with each other (RGCCA: Regularized Generalised CCA). The same samples are measured across different experiments Select relevant biological entities which are correlated across the different data sets Emphasis on graphical outputs to ease results interpretation Kim-Anh Lê Cao AMATA 2013
  • 12. mixOmics and intro Data integration Example Example Kidney transplant study Transcriptomics and proteomics study of 40 patients with kidney transplant, rejecting (n1 = 20) or not (n2 = 20) the transplant, PROOF Centre, UBC. Kim-Anh Lê Cao AMATA 2013 Time course data Cross-platform Conclusions
  • 13. mixOmics and intro Data integration Time course data Cross-platform Conclusions Example New developments include a sparse RGGCA (sGCCA) to select correlated variables across the different platforms Relevance networks Gonzales, Lê Cao et al. (2012), J. Data Mining Kim-Anh Lê Cao AMATA 2013
  • 14. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Integration of time-course data: what are the correlated features across different time course experiments? 1""."."."."."."."."."..".".""."."p" 1" 2" ." ." ." ." n" 1""."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s" 0" 1" 0" 1" 1" ." ." 0" Requires the same samples measured at several time point Variable selection across platforms and time Hypothesize that correlated biological entities belong to the same biological pathway Kim-Anh Lê Cao AMATA 2013
  • 15. mixOmics and intro Data integration Time course data Cross-platform Conclusions Modelling trajectories Modelling trajectories Use cubic smoothing splines to summarize each profile Enables interpolation of missing values There exists a theoretical link between smoothing splines and linear mixed models (Verbyla et al. 1999). Example on one CAGE cluster 50 Appropriate for correlated repeated measures q 40 TPM Expr 35 30 q 25 q q q q d0 d6 d12 time Kim-Anh Lê Cao AMATA 2013 q q q q 20 Assesses the variability for each feature (technical , biological variability) tech rep 1 tech rep 2 45 q No parameters to tune Flexibility of the model q d18
  • 16. mixOmics and intro Data integration Time course data Cross-platform Conclusions Modelling trajectories Step 1: use cubic smoothing splines to reduce one dimension (samples dimension) Step 2: apply sGCCA on the estimated splines to identify correlated profiles both within and between the two data sets → → → → modelisation of the trajectories filtering of the profiles integration of two types of data selection and clustering of the correlated time profiles Kidney transplant study Transcriptomics and proteomics study of 40 patients with kidney transplant, rejecting (n1 = 20) or not (n2 = 20) the transplant. Follow up on 5 time points (weeks), PROOF Centre, UBC. Kim-Anh Lê Cao AMATA 2013
  • 17. mixOmics and intro Data integration Time course data Cross-platform Conclusions Some results Variable representation: time profile clusters are selected dimension 2 dimension 3 0 expression −1.5 −2 −2 −1.0 −1 −1 0 expression 0.0 −0.5 expression 0.5 1 1 1.0 2 2 1.5 dimension 1 2 4 6 time 8 10 2 4 6 time 8 10 2 4 6 8 time The approach selects both transcripts and proteins which are positively or negatively correlated across time Quality of clusters decreases with the number of PLS components (dimensions) as obvious patterns cannot be extracted anymore Kim-Anh Lê Cao AMATA 2013 10
  • 18. mixOmics and intro Data integration Time course data Cross-platform Conclusions Example: sample representation Sample representation: how the approach can extract common information across these different platforms Project Grandiose Longitudinal study of cell reprogramming - 8 time points - 10 platforms: microarray, cell surface proteome, total proteome, RNA-seq, miRNA... See also keynote speaker Session 11 Nicole Cloonan Kim-Anh Lê Cao AMATA 2013
  • 19. mixOmics and intro Data integration Time course data Cross-platform Conclusions Motivation Cross-platform comparison: can we combine similar experiments performed in different labs or with different platforms? Not the same samples measured across different experiments Experimental biases (batch effects) Well known fact: microarray experiments across studies bring different results! “But I would very much like to compare my gene signature from my experiment to my colleagues’ gene signatures from their experiments!" Aim: Identify genes diagnostic of the key characteristics of stem cells derived from independent studies See contributed paper Session 9 Celena Heazlewood and Poster 46, Florian Rohart Kim-Anh Lê Cao AMATA 2013
  • 20. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges What is a ‘batch’ effect? ● Brennand Jia Maherali Nayler Zaehres 0 −50 1.2 Boxplots of the samples (data normalised Fibroblast hESC hiPSC −40 1.4 −20 PC2: 15.87 % 20 40 Classic PCA: Across platform 3 classes per experiment) 1 0.8 1.0 PCA highlights a ‘lab effect’. → Combining different experiments is a real challenge Kim-Anh Lê Cao AMATA 2013 0 50 100 PC1: 58.76 % 150 200
  • 21. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges 1"".".".".".".".".".".""."."p" 1" 2" ." ." ." ." n1" 1"" 2" 1" 3" 3" 1" 3" 1" 2" ." n2" 1"" 3" 3" 2" 1" 2" ." ." ." n3" 1"" 2" 1" 1" 2" 3" 1" 2" ." ." ." n4" 1"" 2" 1" 3" 3" 1" ." ." ." Kim-Anh Lê Cao AMATA 2013 X = several microarray experiments, performed in different labs but studying the same biological conditions Not the same samples, but the same variables Group = experiment/platform/lab Y = biological conditions → Supervised problem We propose: multi-group analysis for the selection of common biomarkers across the data sets.
  • 22. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group Analysis Individuals are a priori structured into several groups within group part: group structure effect is removed between-group part: group structure effect is taken into account Fig from A. Eslami Partial and global components and loading vectors → Understand and model the common stucture between the groups and within each group →Two block multi-group analysis by the means of Partial Least Squares regression (PLS) Kim-Anh Lê Cao AMATA 2013
  • 23. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group PLS-DA mg PLS−DA, Across platform 3 classes 10 q q q q q q q q q q 0 0 comp 2 50 X−variate 2 20 100 30 PLS−DA, diff platforms 3 classes qq q q q q −50 −10 q q −50 0 50 X−variate 1 Fibroblast hESC hiPSC ● Kim-Anh Lê Cao AMATA 2013 Brennand Jia Maherali Nayler Zaehres −50 0 50 100 comp 1 Classical PLSDA (LHS) vs multi-group PLSDA (RHS), the latter gives better (visual) results. 150
  • 24. mixOmics and intro Data integration Time course data Cross-platform Conclusions The challenges Multi-group PLS-DA, more data sets Multi group PLS−DA, same platform Multi group PLS−DA, MAQC 15 q q qq q q q q q 10 60 20 Multi group PLS−DA, diff platform 2 classes q q q q q q 5 0 40 q q q qq q −10 −20 0 Comp 2 q q q 0 100 Comp 1 200 300 −15 −40 q q −60 q −100 q −5 Comp 2 0 −20 q q q −60 q q q q q −40 Comp 2 20 q −40 −20 0 Comp 1 20 40 60 q q qq q q −10 0 10 20 30 Comp 1 Same Affymetrix platform; MAQC Illumina + Affymetrix; Illumina + Affymetrix → In order to select common biomarkers across different data sets, variable selection is needed Kim-Anh Lê Cao AMATA 2013
  • 25. mixOmics and intro Data integration Time course data Cross-platform Conclusions sparse multi-group PLSDA 30 mg PLS−DA, Across platform 3 classes 10 20 sparse multi-group PLSDA and performance 0 comp 2 q q q q q q q q q q qq q q q q −10 q q −50 sparse mg PLS−DA, Across platform 3 classes 0 50 100 150 comp 1 10 Table: Classif. error rate (10-fold CV) diff. platform K=2 diff. platform K=3 #samples 53 70 82 #variables 22K 9K 9K #groups 5 5 5 multi grp 8.9% 15.3% 16.1% sparse multi grp 7.7% 4.5% 10.6% #selected markers 40 30 20 5 same platform K=3 comp 2 q q q q q q −5 0 q q q q q q q q q q q −5 0 5 10 comp 1 15 training set: circles, testing set: triangles with colors as predicted Kim-Anh Lê Cao AMATA 2013 20
  • 26. mixOmics and intro Data integration Time course data Cross-platform Conclusions Comparisons Comparisons (2) (3) q q q q q q q q q q q q q q q q q q q q q q qq q q qq q −5 −0.1 q q q q −0.1 q q q q 0.1 q q q 0.0 q q q q q q q qq comp 2 comp 2 q q q 0 0.1 q q q q q 0.0 X−variate 2 0.2 0.2 5 0.3 0.3 0.4 10 (1) −0.1 0.0 0.1 0.2 0.3 −5 0 X−variate 1 3 sPLS 3 4 3 3 20 0.0 0.1 0.2 0.3 comp 1 smgPLS 3 1 Data centered per group 2 3 10 AMATA 2013 15 sPLS + ComBat 7 sPLS + ComBat Kim-Anh Lê Cao 10 comp 1 sPLS smgPLS 5 Multi group PLSDA ComBat Combating Batches of gene expression microarray data (Johnson and Li, 3 2007)
  • 27. mixOmics and intro Data integration Time course data Cross-platform Conclusions Conclusions Multivariate integrative approaches are flexible and can answer various types of questions mixOmics implements 6 different methodologies plus their sparse variants Data integration, variable selection, graphical outputs Includes a web-associated interface (and soon through Galaxy too) Help generating new hypotheses, further statistical tests New additions! (still in development) Smoothing spline/LMM framework for the integration of time-course data Multi-group analysis to address a challenging current problem in molecular biology. Kim-Anh Lê Cao AMATA 2013
  • 28. mixOmics and intro Data integration Time course data Cross-platform Conclusions Un GRAND MERCI à ... Data providers Kidney transplant study Methods development Multiple data integration Oliver Günther UBC Arthur Tenenhaus Scott Tebutt UBC Time course Grandiose project Andras Nagy Jasmin Straube Univ. Toronto Stemformatics team Supelec Paris QFAB Kathy Ruggiero Univ. Auckland Rowland Mosbergen INRA Toulouse UQ, Univ. Glasgow Cross-platform comparison UQ Stéphanie Bougeard ANSES Othmar Korn UQ Christine Wells Christèle Robert Sébastien Déjean Univ. Tlse Ignacio González Univ. Tlse Xin Yi Chua QFAB Kim-Anh Lê Cao AMATA 2013 ANSES, ONIRIS Florian Rohart mixOmics team Aida Eslami AIBN UQ
  • 29. mixOmics and intro Data integration Time course data Cross-platform Conclusions Un GRAND MERCI à ... Questions? Springtime Bioinformatics Training Workshops by QFAB (University of Queensland) RNA seq analysis, 4 Nov. Introduction to statististics for biologists, 19-21 Nov. Variant detection, 22 Nov. Introduction to De Novo Genome Assembly, 9 Dec. training@qfab.org http://www.math.univ-toulouse.fr/∼biostat/mixOmics Kim-Anh Lê Cao AMATA 2013