Kim-Anh Lê Cao - Multivariate models for dimension reduction and biomarker selection in omics data

Recent advances in high throughput ’omics’ technologies enable quantitative measurements of expression or abundance of biological molecules of a whole biological system. The transcriptome, proteome and metabolome are dynamic entities, with the presence, abundance and function of each transcript, protein and metabolite being
critically dependent on its temporal and spatial location.
Whilst single-omics analyses are commonly performed to detect between-group differences from either static or dynamic experiments, the integration or combination of multi-layer information is required to fully unravel the complexities of a biological system. Data integration relies on the currently accepted biological assumption that the functional levels are related to one another. Therefore, considering all the biological entities (transcripts, proteins, metabolites) as parts of a whole biological system is crucial to unravel the complexity of living organisms.
With many contributors and collaborators, we have further developed several multivariate approaches to project high dimensional data into a smaller space and select relevant biological features, while capturing the largest sources of variation in the data. These approaches are based on variants of Partial Least Squares regression and Canonical Correlation Analysis and enable the integration of several types of omics data.
In this presentation, I will illustrate how various techniques enable exploration, biomarker selection and visualisation for different types of analytical frameworks.

First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/


Transcript

  1. Multivariate models for dimension reduction and biomarker selection in omics data
     Kim-Anh Lê Cao, The University of Queensland Diamantina Institute, Brisbane, Australia
     2014 Winter School in Mathematical & Computational Biology
  2. Challenges
     - Close interaction between statisticians, bioinformaticians and molecular biologists
     - Understand the biological problem and try to answer complex biological questions
     - Irrelevant (noisy) variables
     - n << p, and n very small → limited statistical validation; biological validation needed
     - Keep up with new technologies and anticipate computational issues
  3. Multivariate analysis - Linear multivariate approaches
     Linear multivariate approaches use latent variables (i.e. variables that are not directly observed) to reduce the dimensionality of the data. A large number of observable variables are aggregated in a model to represent an underlying concept, making the data easier to understand.
     - Dimension reduction → project the data into a smaller subspace
     - To handle multicollinear, irrelevant or missing variables
     - To capture experimental and biological variation
  4. Outline of this talk and research questions
     1. Principal Component Analysis: the workhorse for linear multivariate statistical analysis; detection and correction of batch effects; variable selection.
     2. One data set: looking for a linear combination of multiple biomarkers from a single experiment; improving the prediction of the model.
     3. Integration of multiple data sets: looking for a linear combination of multiple 'omics biomarkers from multiple 'omics experiments; improving the prediction of the model.
  5. Principal Component Analysis
     PCA, the workhorse for linear multivariate statistical analysis, is an (almost) compulsory first step in exploratory data analysis to:
     - understand the underlying data structure
     - identify bias, experimental errors and batch effects
     It provides insightful graphical outputs:
     - to visualise the samples in a smaller subspace (component 'scores')
     - to visualise the relationships between variables ... and samples (correlation circles)
  6. Principal Component Analysis - Maximise the variance
     In PCA, we think of the variance as the 'information' contained in the data. We replace the original variables by artificial variables (the principal components) which explain as much of this information as possible.
     Toy example with data (n = 3 × p = 3):

         Cov(data) =   1.3437  -0.1601   0.1864
                      -0.1601   0.6192  -0.1266
                       0.1864  -0.1266   1.4855

     Total variance: 1.3437 + 0.6192 + 1.4855 = 3.4484.
     Replace the data by the principal components, which are orthogonal to each other (covariance = 0):

         Cov(pc1, pc2, pc3) =   1.6513  0       0
                                0       1.2202  0
                                0       0       0.5769

     Total variance: 1.6513 + 1.2202 + 0.5769 = 3.4484, i.e. the total variance is preserved.
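     This is easy to check numerically in R; a minimal sketch on a hypothetical 3 × 3 data matrix (made-up values, not those of the slide):

         # Toy data: 3 samples (rows) x 3 variables (columns); hypothetical values
         X <- matrix(c( 1.2, -0.5,  0.3,
                        0.1,  0.9, -1.1,
                       -0.8,  0.4,  1.6), nrow = 3, byrow = TRUE)
         sum(diag(cov(X)))                    # total variance of the original variables
         pcs <- prcomp(X, center = TRUE, scale. = FALSE)
         sum(pcs$sdev^2)                      # same total, redistributed over the PCs
         round(cov(pcs$x), 4)                 # diagonal matrix: the PCs are uncorrelated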
  7. Principal Component Analysis - PCA as an iterative algorithm
     1. Seek the first PC along which the variance is maximised
     2. Project the data points onto the first PC
     3. Seek the second PC, orthogonal to the first one
     4. Project the residual data points onto the second PC
     5. And so on, until we explain enough variance in the data
     [Figure: two scatterplots; left, the original variables x1 vs x2 with the direction of maximal variance; right, the data projected into the new subspace z1 vs z2.]
  8. Principal Component Analysis - A linear combination of variables
     Seek the best directions in the data, those that account for most of the variability. Objective function:

         max_{||v|| = 1} var(Xv)

     Each principal component u is a linear combination of the original variables:

         u = v1 x1 + v2 x2 + · · · + vp xp

     where X is an n × p data matrix with {x1, . . . , xp} the p variable profiles, u is the first principal component with maximal variance, and {v1, . . . , vp} are the weights of the linear combination.
  9. Principal Component Analysis - The data are projected into a smaller subspace
     The principal components are orthogonal to each other, which ensures that no redundant information is extracted; the new PCs form a smaller subspace of dimension << p. Each value in a principal component is a score for one sample, and this is how we project each sample into the new subspace spanned by the principal components:
     → an approximate representation of the data points in a lower-dimensional space
     → a summary of the information related to the variance
  10. Principal Component Analysis - Computing PCA
      Several ways of solving PCA:
      - Eigenvalue decomposition: the old way; does not work well in high dimension.

            S v = λ v,   u = X̃ v

        where S is the variance-covariance matrix (or the correlation matrix if X̃ is scaled).
      - NIPALS algorithm: slow to compute, but works with missing values and enables a least squares regression framework (useful for variable selection).
      - Singular Value Decomposition (SVD): the easiest and fastest, implemented in most software.
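      As a small R illustration of the SVD route (my sketch, on hypothetical random data): the scores obtained from svd() on the centred data match those of prcomp() up to sign.

          set.seed(1)
          X  <- matrix(rnorm(20 * 5), nrow = 20)   # hypothetical data: 20 samples, 5 variables
          Xc <- scale(X, center = TRUE, scale = FALSE)
          s  <- svd(Xc)
          U  <- s$u %*% diag(s$d)                  # principal components (scores)
          V  <- s$v                                # loading vectors
          max(abs(abs(U) - abs(prcomp(X)$x)))      # ~0: identical scores up to sign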
  11. Principal Component Analysis - PCA is simply a matrix decomposition
      Singular Value Decomposition:

          X = Ũ Λ Vᵀ

      where Λ is the diagonal matrix of the singular values √λ_h, U = Ũ Λ contains the principal components u_h, and V contains the loading vectors v_h, h = 1, . . . , H. The variance of the principal component u_1 equals its associated eigenvalue λ_1, and so on; the eigenvalues λ_h are decreasing.
  12. Principal Component Analysis - Tuning PCA
      How many principal components should we keep to summarise most of the information? (Note: we can have as many components as the rank of the matrix X.)
      - Look at the proportion of explained variance
      - Look at the screeplot of eigenvalues: any elbow?
      - Look at the sample plot: does it make sense?
      - Some statistical tests exist to estimate the 'intrinsic' dimension
      Proportion of explained variance for the first 8 principal components:

          PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
          0.5876   0.1587   0.0996   0.0848   0.0159   0.0098   0.0052   0.0051
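      Both diagnostics are one-liners in R; a minimal sketch, reusing the hypothetical matrix X from above:

          pcs  <- prcomp(X, center = TRUE, scale. = TRUE)
          expl <- pcs$sdev^2 / sum(pcs$sdev^2)   # proportion of explained variance per PC
          round(expl, 4)
          screeplot(pcs, type = "lines")         # look for an elbow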
  13. Visualise batch effects - PCA is useful to visualise batch effects
      'Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments, or if two different lots of reagents, chips or instruments were used.' (Leek et al. (2010), Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet.)
  14. Visualise batch effects - Cell lines case study (www.stemformatics.org, AIBN)
      Five microarray studies comprising n = 82 samples, p = 9K genes and K = 3 groups of cell lines: 12 fibroblasts, 22 hPSC and 48 hiPSC. This is a complex case of cross-platform comparison where we expect to see a batch effect (the study).
  15. Visualise batch effects - PCA in action
      What we would really like to see is a separation between the different types of cell lines. PCA is a good exploratory method to highlight the sources of variation in the data, including batch effects if present.
  16. Visualise batch effects - PCA to visualise batch effect correction
      Remove known batch effects (a sketch of a ComBat call follows this list):
      - ComBat: uses an empirical Bayes framework
      - linear (mixed) models
      Remove unknown batch effects:
      - sva: estimates surrogate variables for unknown sources of variation
      - YuGene: works directly on the distribution of each sample
      ComBat: Johnson WE et al. (2007), Biostatistics; sva: Leek JT and Storey JD (2007), PLoS Genetics; YuGene: Lê Cao KA, Rohart F et al. (2014), Genomics.
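      A hedged sketch of known-batch correction with the Bioconductor sva package, assuming a hypothetical genes × samples matrix expr and per-sample vectors batch and group (these names are mine, not from the talk):

          library(sva)   # Bioconductor package providing ComBat and sva

          mod <- model.matrix(~ group)    # preserve the biological grouping of interest
          expr.corrected <- ComBat(dat = expr, batch = batch, mod = mod)

          # Re-run PCA on the corrected data to check samples no longer cluster by batch
          pcs <- prcomp(t(expr.corrected), center = TRUE, scale. = TRUE)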
  17. Visualise batch effects - Summary on PCA
      - PCA is a matrix decomposition technique that allows dimension reduction.
      - Run a PCA first to understand the sources of variation in your data.
      - Note: I illustrated the batch effect on the very topical issue of meta-analysis, but batch effects happen very often! → What did Alec say? Experimental design is important!
      - If there is a clear batch effect: try to remove it or accommodate for it.
      - If there is no clear separation between your groups:
        → the biological question may not be related to the highest variance in the data; try ICA instead
        → this is mostly due to a large number of noisy variables that need to be removed
  18. Dealing with high dimensional data - The curse of dimensionality
      Fact: in high throughput experiments, many variables are noisy or irrelevant, and they make a PCA analysis difficult to interpret.
      → the signal is clearer if some of the variable weights {v1, . . . , vp} are set to 0 for the 'irrelevant' variables:

          u = 0 · x1 + v2 x2 + · · · + 0 · xp

      Important weights = important contribution to defining the principal components.
  19. sparse PCA - The curse of dimensionality
      Since the '90s, many statisticians have been trying to address the issue of high (and now 'ultra high') dimension. Let's face it: classical statistics cannot cope!
      - Univariate analysis: multiple testing corrections to address the number of false positives
      - Machine learning approaches
      - Parsimonious statistical models to make valid inferences → variable selection
  20. sparse PCA - sparse Principal Component Analysis
      Aim: sparse loading vectors, to remove the irrelevant variables which would otherwise determine the principal components. The idea is to apply the LASSO (least absolute shrinkage and selection operator, Tibshirani, R. (1996)) or a soft-thresholding approach within a PCA framework (a least squares regression framework exploiting the close link between the SVD and low-rank matrix approximations):
      → obtain sparse loading vectors with very few non-zero elements
      → perform internal variable selection
      Shen, H., Huang, J.Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation, J. Multivariate Analysis.
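      In the mixOmics package presented later in this talk, this is exposed as spca(); a hedged sketch, assuming a hypothetical samples × genes matrix X, where keepX sets how many variables may keep a non-zero loading on each component:

          library(mixOmics)

          res.spca <- spca(X, ncomp = 3, keepX = c(50, 50, 50))

          selectVar(res.spca, comp = 1)         # variables retained on PC1, with loadings
          plotIndiv(res.spca, comp = c(1, 2))   # sample plot on the sparse components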
  21. sparse PCA - Parameters to tune
      1. The number of principal components (as seen earlier)
      2. The number of variables to select → tricky!
      At this point we can already sense that separating hPSC vs hiPSC is going to be a challenge.
  22. sparse PCA - Correlation circles
      Correlation circles help us understand the relationships:
      - between variables (correlation structure)
      - between variables and samples
      [Figure: sample plot alongside the correlation circle of the variables.]
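      In mixOmics the correlation circle is drawn by plotVar(); a minimal sketch, reusing the hypothetical res.spca object above:

          # Variables close together are positively correlated; variables on
          # opposite sides of the origin are negatively correlated
          plotVar(res.spca, comp = c(1, 2))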
  23. Linear Discriminant analysis - Definition
      Linear Discriminant Analysis seeks a linear combination of features which characterises or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction prior to classification.
  24. Linear Discriminant analysis - Fisher's Linear Discriminant
      We can decompose the total variation as V = B + W (W the within-groups and B the between-groups variance). The problem to solve is:

          max_a (aᵀ B a) / (aᵀ W a)

      → maximise the between-groups variance
      → minimise the within-groups variance
      (Image source: http://pvermees.andropov.org/)
      Number of components to choose: ≤ min(p, K − 1), K = number of classes. Prediction of a new observation x is based on its discriminant score.
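      A quick R illustration (mine, not from the talk) with MASS::lda() on the built-in iris data: with K = 3 classes, at most K − 1 = 2 discriminant components are returned.

          library(MASS)

          fit <- lda(Species ~ ., data = iris)   # 4 predictors, K = 3 classes
          fit$scaling                            # the discriminant directions a
          pred <- predict(fit, iris)             # classes and discriminant scores
          table(Predicted = pred$class, True = iris$Species)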
  25. Linear Discriminant analysis - Pros and cons of LDA
      LDA has often been shown to produce very good classification results. Assumptions:
      - equal variance in each group
      - independent variables normally distributed
      - independent variables linearly related
      But with too many correlated predictors LDA does not work so well → regularise or introduce sparsity, using a cousin of LDA called Partial Least Squares Discriminant Analysis.
  26. PLS-Discriminant Analysis
      Partial Least Squares Discriminant Analysis seeks the PLS components of X which best explain the outcome vector Y. Objective function, based on the general PLS:

          max_{||u|| = 1, ||v|| = 1} cov(Xu, Yv)

      where Y is the qualitative response matrix (dummy block matrix).
  27. A linear combination of biomarkers - sparse PLS-DA
      Include soft-thresholding or a LASSO penalisation on the loading vectors (v1, . . . , vp), similar to sPCA. Parameters to tune (a usage sketch follows this list):
      - the number of components, often equal to K − 1
      - the number of variables to select on each dimension → tuned via classification error, sensitivity, specificity
      Lê Cao K-A., Boitard S. and Besse P. (2011). Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems, BMC Bioinformatics, 12:253.
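      A hedged mixOmics sketch, assuming a hypothetical samples × variables matrix X and a factor Y of class labels (the keepX values are placeholders):

          library(mixOmics)

          # K = 3 classes -> 2 components; keep 50 variables on each
          res.splsda <- splsda(X, Y, ncomp = 2, keepX = c(50, 50))

          plotIndiv(res.splsda, comp = c(1, 2), ind.names = FALSE)  # samples by class
          selectVar(res.splsda, comp = 1)                           # selected biomarkers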
  28. A linear combination of biomarkers - What is k-fold cross-validation?
      → estimate how accurately a predictive model will perform in practice.
      - Partition the data into k equal-size subsets
      - Train a classifier on k − 1 subsets
      - Test the classifier on the k-th remaining subset
      - Estimate the classification error rate
      - Average across the k folds
      - Repeat the process many times
      Be careful of selection bias when performing variable selection during the process! (See the sketch after this list.)
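      In mixOmics, cross-validation of an sPLS-DA model is wrapped in perf(), which redoes the variable selection inside each fold and so avoids the selection bias above. A hedged sketch, continuing from the hypothetical res.splsda; the argument names follow recent mixOmics releases:

          # 5-fold CV repeated 100 times, as in the kidney transplant example below
          cv <- perf(res.splsda, validation = "Mfold", folds = 5, nrepeat = 100)
          cv$error.rate   # classification error rate per component
          plot(cv)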
  29. Example: transcriptomics data - Kidney transplant study (PROOF Centre, UBC)
      Genomics assay (p = 27K) of 40 patients with a kidney transplant, with acute rejection (AR, n1 = 20) or no rejection (NR, n2 = 20) of the transplant.
      Training step: 5-fold cross-validation (repeated 100 times) on a training set of ntrain = 26 → selection of 90 probe sets.
  30. Example: transcriptomics data - Testing step
      - Training performed: model with 90 probes; genes mostly related to inflammation and innate immune responses
      - External test set of ntest = 14
      - Project the test samples onto the training-set subspace of sPLS-DA
      [Figure: training (AR, NR) and test (AR, NR) samples projected onto components 1 and 2 of the sPLS-DA genomics model.]

          Classifier   # probes    Error rate   Sensitivity   Specificity   AUC
          sPLS-DA      90          0.14         0.71          1             0.90
          PLS-DA       27K (all)   0.21         0.57          1             0.82
  31. Example: oesophageal cancer - Oesophageal cancer study (UQDI)
      Proteomics assay (p = 129) of 40 patients with benign Barrett's oesophagus (n1 = 20) or oesophageal adenocarcinoma (n2 = 20). Aim: develop blood tests for detection and personalised treatment.
      - sPLS-DA model and selection of 9 proteins
      - Individual protein biomarkers give a poor performance (AUC 0.46-0.79)
      - A combined signature from the sPLS-DA model improves the AUC up to 0.94
      [Figure: ROC curves (sensitivity vs specificity) for the combined markers and the individual proteins P1-P8.]
      Acknowledgements: Benoit Gautier
  32. Framework - Biomarker discovery when integrating multiple data sets
      - Model relationships between the different data sets
      - Select relevant biological entities which are correlated across the different data sets
      - Use multivariate integrative approaches: Partial Least Squares regression (PLS) and Canonical Correlation Analysis (CCA) variants
      - Same idea as before: maximise the covariance between the latent components associated with 2 data sets at a time
  33. sGCCA - sparse Generalised Canonical Correlation Analysis
      sGCCA generalises sPLS to more than 2 data sets, maximising the sum of the pairwise covariances between two components at a time:

          max_{a1, ..., aJ}  Σ_{j,k = 1; j ≠ k}^{J}  c_jk g(Cov(Xj aj, Xk ak))
          s.t.  τj ||aj||² + (1 − τj) Var(Xj aj) = 1,   j = 1, . . . , J,

      where the aj are the loading vectors associated with each block j and g(x) = |x|. In sGCCA, variable selection is performed on each data set.
      Tenenhaus, A., Philippe, C., Guillemot, V., Lê Cao K-A., Grill, J., Frouin, V. (2014). Variable selection for generalized canonical correlation analysis, Biostatistics.
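      A heavily hedged sketch of the mixOmics interface to sGCCA, with three hypothetical blocks measured on the same samples (as in the asthma study next); the argument names follow wrapper.sgcca() in recent mixOmics releases and may differ in older versions:

          library(mixOmics)

          blocks <- list(cells = cells, mRNA = mrna, metabolites = metab)
          design <- matrix(1, nrow = 3, ncol = 3) - diag(3)   # connect all block pairs

          # penalty in (0, 1] per block controls the sparsity of its loading vector
          res.sgcca <- wrapper.sgcca(X = blocks, design = design,
                                     penalty = c(0.3, 0.3, 0.3), ncomp = 2)

          selectVar(res.sgcca, comp = 1)   # variables selected in each block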
  34. Example: asthma study - Asthma study (UBC, Vancouver)
      Multiple-assay study on asthma patients pre- (n1 = 14) and post- (n2 = 14) challenge. Three data sets: complete blood count (p1 = 9), mRNA (p2 = 10,863) and metabolites (p3 = 146).
      Aims: identify a multi-omic biomarker signature to differentiate the asthmatic response. Can we improve the prediction with biomarkers of different types?
      Acknowledgements: Amrit Singh
  35. Example: asthma study - Method
      The sGCCA component scores predict the class of the samples per data set → use the average score to get an 'Ensemble' prediction.
      Results: the 'Ensemble' prediction outperforms the prediction of each individual data set.
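      A generic sketch of that ensemble idea (my illustration, not the talk's exact code): each block yields a predicted component score per sample, and averaging the scores across blocks before assigning a class gives the 'Ensemble' call. Here scores and centroids are hypothetical objects.

          # scores: samples x blocks matrix of predicted component scores
          ensemble.score <- rowMeans(scores)

          # Assign each sample to the class whose training centroid is closest
          # (centroids: named vector of mean training scores per class)
          predicted <- names(centroids)[
              apply(abs(outer(ensemble.score, centroids, "-")), 1, which.min)]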
  36. Example: asthma study - Results
      Selection of 4 cell variables, 30 genes and 16 metabolites, some of which are very relevant to asthma.
      Cells: relative neutrophils; relative eosinophils; T reg; T cells
      Genes:
  37. mixOmics - It is all about mixOmics
      mixOmics is an R package implementing multivariate methods such as PCA, ICA, PLS, PLS-DA, CCA (work in progress) and the sparse variants we have developed (and much more).
      - Website with tutorial
      - Web interface
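      Getting started is a one-liner; at the time of this talk mixOmics was distributed on CRAN (it has since moved to Bioconductor):

          install.packages("mixOmics")   # CRAN install, as of 2014
          library(mixOmics)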
  38. Conclusions - To put it in a nutshell
      Multivariate linear methods enable us to answer a wide range of biological questions:
      - data exploration
      - classification
      - integration of multiple data sets
      with a least squares framework for variable selection.
      Future of mixOmics: time-course modelling; meta-analysis / multi-group analysis. Several workshops coming up! (Toulouse 2014, Brisbane 2015, Auckland 2015)
  39. Acknowledgements
      mixOmics development: Sébastien Déjean (Univ. Toulouse); Ignacio González (Univ. Toulouse); Xin Yi Chua (QFAB); Benoit Gautier (UQDI)
      Data providers: Oliver Günther (Kidney PROOF, Vancouver); Michelle Hill (EAC, UQDI); Alok Shah (EAC, UQDI); Christine Wells (Stemformatics, AIBN / Univ. Glasgow)
      Methods development: Amrit Singh (UBC, Vancouver); Florian Rohart (AIBN, UQ); Jasmin Straube (QFAB); Kathy Ruggiero (Univ. Auckland); Christèle Robert (INRA Toulouse)
  40. Questions?
      Further information and registration at www.di.uq.edu.au/statistics. Please forward to colleagues who may be interested in attending.
      'Statistics for frightened bioresearchers': a short course in biostatistics organised by UQDI and IMB (IMB Career Advantage 2014). With a focus on real world examples, notes and resources you can take home, and quick exercises to test your understanding of the material, this short course will cover many of the essential elements of biostatistics and problems that commonly arise in analysis in the lab. Four presentations over four months will cover:
      1. Summarizing and presenting data
      2. Hypothesis testing and statistical inference
      3. Comparing two groups
      4. Comparing more than two groups
      10.30am-12pm, Monday 28 July, 1 September, 29 September, 3 November. Presentations will be given simultaneously live at UQDI-TRI (Room 2007, Translational Research Institute, 37 Kent Street, Princess Alexandra Hospital, Woolloongabba, QLD 4102) and remotely at IMB (Large Seminar Room, Institute for Molecular Bioscience, Queensland Biosciences Precinct, The University of Queensland, St Lucia Campus). Free registration: www.di.uq.edu.au/statistics
      Enquiries: Dr Kim-Anh Lê Cao (DI), Tel: 3343 7069, Email: k.lecao@uq.edu.au; Dr Nick Hamilton (IMB), Tel: 3346 2033, Email: n.hamilton@uq.edu.au
      k.lecao@uq.edu.au | http://www.math.univ-toulouse.fr/∼biostat/mixOmics
