# 0 introduction


1. Introduction to Metabolomic Data Analysis. Dmitry Grapov, PhD
2. Important
   - This is an introduction to a series of 8 tutorials for metabolomic data analysis.
   - Download all the required files and software here: https://sourceforge.net/projects/teachingdemos/files/Winter%202014%20LC-MS%20and%20Statistics%20Course/
   - Then follow the directions in software/startup.R to launch all accompanying software.
3. Goals?
4. Analysis at the Metabolomic Scale
5. Cycle of Scientific Discovery: hypothesis → data acquisition → data processing → data analysis → hypothesis generation, and back to a new hypothesis.
6. Univariate vs. Multivariate: univariate hypothesis testing (t-test, ANOVA, etc.) versus multivariate predictive modeling (PCA, O-/PLS-DA).
7. Univariate vs. Multivariate: comparing univariate/bivariate and multivariate views of the data can reveal outliers and mixed-up samples.
8. Data Analysis Goals
   - Exploration: are there any trends in my data (analytical sources, meta data/covariates)? Useful methods: matrix decomposition (PCA, ICA, NMF), cluster analysis.
   - Classification: differences/similarities between groups (discrimination, classification, significant changes)? Useful methods: analysis of variance (ANOVA), mixed-effects models, partial least squares discriminant analysis (O-/PLS-DA); others: random forest, CART, SVM, ANN.
   - Prediction: what is related to or predictive of my variable(s) of interest? Useful methods: regression, correlation, partial least squares (O-/PLS).
9. Data Complexity: data form a samples × variables matrix plus accompanying meta data. The experimental design sets the complexity; the number of variables sets the dimensionality (1-D, 2-D, …, m-D).
10. Univariate Qualities
    - length (sample size)
    - center (mean, median, geometric mean)
    - dispersion (variance, standard deviation)
    - range (min/max)
    - quantiles
    - shape (skewness, kurtosis, normality, etc.)
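The univariate qualities above can be computed in a few lines. The course labs use R; this NumPy/SciPy sketch only illustrates the same summaries, and the intensity vector is made up for the example.

```python
# Univariate summaries for one hypothetical metabolite's intensities.
import numpy as np
from scipy import stats

x = np.array([4.1, 5.0, 4.8, 6.2, 5.5, 4.9, 5.1, 7.3, 5.0, 4.7])

print("length (n):         ", x.size)
print("mean / median:      ", x.mean(), np.median(x))
print("geometric mean:     ", stats.gmean(x))
print("variance / sd:      ", x.var(ddof=1), x.std(ddof=1))
print("range (min, max):   ", x.min(), x.max())
print("quartiles:          ", np.percentile(x, [25, 50, 75]))
print("skewness / kurtosis:", stats.skew(x), stats.kurtosis(x))
print("normality (Shapiro p):", stats.shapiro(x).pvalue)
```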
11. Data Quality
    - metrics: precision, accuracy
    - remedies: normalization, outlier detection
    - *start lab 1: statistical analysis*
12. Univariate Analyses
    - identify differences in sample population means
    - sensitive to distribution shape; parametric = assumes normality
    - error in Y, not in X (Y = mX + error)
    - optimal for long data (more samples than variables); assumes independence
    - wide data (more variables than samples) requires false discovery rate (FDR) control
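As a minimal sketch of the univariate comparison of group means described above, here is a two-sample t-test in SciPy; the two groups are hypothetical intensities, not data from the course.

```python
import numpy as np
from scipy import stats

group1 = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])
group2 = np.array([6.0, 6.3, 5.8, 6.1, 5.9, 6.2])

# Welch's t-test (equal_var=False) drops the equal-variance assumption;
# approximate normality within each group is still assumed.
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(t_stat, p_value)
```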
13. False Discovery Rate (FDR)
    - Type I error: false positives; Type II error: false negatives
    - family-wise Type I risk = 1 − (1 − p.value)^m, where m = number of variables tested
    - FDR correction: p-value adjustment or an estimate of the FDR (Fdr, q-value)
    - see Bioinformatics (2008) 24(12):1461–1462
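The two ideas on this slide can be made concrete: the family-wise risk formula, and a Benjamini–Hochberg p-value adjustment written out in NumPy (equivalent in spirit to R's `p.adjust(..., method = "BH")`; the p-values below are invented for illustration).

```python
import numpy as np

# Testing m = 100 variables at p = 0.05 each makes at least one false
# positive almost certain somewhere in the family:
p, m = 0.05, 100
print(1 - (1 - p) ** m)  # close to 1

def bh_adjust(pvals):
    """Benjamini–Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adj
    return out

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
print(bh_adjust(pvals))
```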
14. Achieving "significance" is a function of:
    - significance level (α) and power (1 − β)
    - effect size (standardized difference in means)
    - sample size (n)
    - *finish lab 1: statistical analysis*
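The relationship between these four quantities can be sketched with the standard normal-approximation formula for a two-sample t-test, n per group ≈ 2·((z₁₋α⁄₂ + z₁₋β)/d)², where d is the standardized effect size (Cohen's d). This is a back-of-the-envelope sketch, not a replacement for a proper power analysis.

```python
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample t-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
    z_beta = norm.ppf(power)            # quantile corresponding to power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.8))  # large effect: about 25 per group
print(n_per_group(0.2))  # small effect: hundreds per group
```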
15. Clustering: identify patterns, group structure, and relationships; evaluate/refine hypotheses; reduce complexity. (Artist: Chuck Close)
16. Cluster Analysis: use the concept of similarity/dissimilarity to group a collection of samples or variables. Approaches:
    - hierarchical (HCA)
    - non-hierarchical (k-NN, k-means)
    - distribution (mixture models)
    - density (DBSCAN)
    - self-organizing maps (SOM)
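As a quick sketch of one non-hierarchical approach from the list above, here is k-means on two synthetic, well-separated groups of samples (the data are invented; SciPy's `kmeans2` is used for illustration).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.3, size=(10, 2))  # synthetic group 1
group_b = rng.normal(loc=5.0, scale=0.3, size=(10, 2))  # synthetic group 2
data = np.vstack([group_a, group_b])

# k-means with k = 2 and k-means++ initialization
centroids, labels = kmeans2(data, 2, minit='++', seed=1)
print(labels)
```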
17. Hierarchical Cluster Analysis: the similarity/dissimilarity measure defines "nearness", i.e. the distance (Euclidean, Manhattan, Mahalanobis, other non-Euclidean metrics).
18. Hierarchical Cluster Analysis: the agglomerative/linkage algorithm defines how points are grouped (single, complete, centroid, average).
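The two choices above, distance metric and linkage method, map directly onto SciPy's hierarchical clustering API; this sketch uses synthetic two-group data for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.3, (5, 2)),   # synthetic group 1
                  rng.normal(5.0, 0.3, (5, 2))])  # synthetic group 2

dist = pdist(data, metric='euclidean')   # choice of distance measure
tree = linkage(dist, method='average')   # choice of linkage algorithm
clusters = fcluster(tree, t=2, criterion='maxclust')  # cut into 2 clusters
print(clusters)
# scipy.cluster.hierarchy.dendrogram(tree) would draw the tree.
```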
19. Dendrograms: tree diagrams whose branch heights are read along a similarity axis.
20. Hierarchical Cluster Analysis: how well does my metadata match my data structure? Useful for both exploration and confirmation. *Finish lab 2: cluster analysis*
21. Projection of Data: the algorithm defines the position of the light source.
    - Principal Components Analysis (PCA): unsupervised; maximizes variance (X)
    - Partial Least Squares Projection to Latent Structures (PLS): supervised; maximizes covariance (Y ~ X)
    - (James X. Li, 2009, VisuMap Tech.)
22. Interpreting PCA Results: variance explained (eigenvalues); row (sample) scores and column (variable) loadings.
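The three PCA outputs named above can be computed directly from a singular value decomposition of the centered data matrix; this NumPy sketch uses a random synthetic matrix, not course data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 samples x 5 variables (synthetic)
Xc = X - X.mean(axis=0)        # center each variable

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = S**2 / (X.shape[0] - 1)           # variance along each PC
var_explained = eigenvalues / eigenvalues.sum() # fraction per component
scores = U * S                                  # row (sample) scores
loadings = Vt.T                                 # column (variable) loadings

print(var_explained)
```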