Strategies for Metabolomic
Data Analysis
Dmitry Grapov, PhD
Part of a lecture series for the International Summer Course in Metabolomics 2013 (http://metabolomics.ucdavis.edu/courses-and-seminars/courses). More material and information are available here (http://imdevsoftware.wordpress.com/2013/09/08/sessions-in-metabolomics-2013/).

Published in: Education, Technology
  1. Strategies for Metabolomic Data Analysis. Dmitry Grapov, PhD. Introduction
  2. Goals?
  3. Metabolomics
  4. Analysis at the Metabolomic Scale
  5. Can you spot the difference? Univariate (ANOVA), multivariate (PCA), predictive modeling (PLS)
  6. Analytical Dimensions: samples, variables
  7. Analyzing Metabolomic Data: pre-analysis; data properties; statistical approaches; multivariate approaches; systems approaches
  8. Pre-analysis. Data quality metrics: precision, accuracy. Remedies: normalization, outlier detection, missing value imputation
  9. Normalization. Sample-wise: sum, adjusted. Measurement-wise: transformation (normality), encoding (trigonometric, etc.). [Figure labels: mean, standard deviation]
  10. Outliers: single measurements (univariate); two compounds (bivariate)
  11. Outliers: univariate/bivariate vs. multivariate. Mixed-up samples: outliers?
  12. Transformation: logarithm (shifted); power (Box-Cox); inverse. Quantile-quantile (Q-Q) plots are useful for a visual overview of variable normality. [Figure labels: X, -0.5X]
  13. Missing Values Imputation. Why is it missing? Random; systematic; analytical; biological. Imputation methods: single value (mean, min, etc.); multiple; multivariate. [Figure labels: mean, PCA]
  14. Data Analysis Goals. Exploration: Are there any trends in my data (analytical sources, meta data/covariates)? Useful methods: matrix decomposition (PCA, ICA, NMF), cluster analysis. Classification: Differences/similarities between groups (discrimination, classification, significant changes)? Useful methods: analysis of variance (ANOVA), partial least squares discriminant analysis (PLS-DA), others (random forest, CART, SVM, ANN). Prediction: What is related to or predictive of my variable(s) of interest (regression)? Useful methods: correlation
  15. Data Structure. Univariate: a single variable (1-D). Bivariate: two variables (2-D). Multivariate: more than two variables (m-D). Data types: continuous, discrete, binary
  16. Data Complexity. [Figure labels: n, m; 1-D, 2-D, m-D; data, samples, variables, complexity; meta data, experimental design.] Variable # = dimensionality
  17. Univariate Analyses. Univariate properties: length; center (mean, median, geometric mean); dispersion (variance, standard deviation); range (min/max). [Figure labels: mean, standard deviation]
  18. Univariate Analyses: sensitive to distribution shape; parametric = assumes normality; error in Y, not in X (Y = mX + error); optimal for long data; assumed independence; false discovery rate. [Figure labels: long, wide, n-of-one]
  19. False Discovery Rate (FDR). Univariate approaches do not scale well. Type I error: false positives. Type II error: false negatives. Type I risk = 1 - (1 - p.value)^m, where m = number of variables tested
  20. FDR correction. Example design: 30 samples, 300 variables. Test: t-test. FDR method: Benjamini and Hochberg (fdr) correction at q = 0.05 (Bioinformatics (2008) 24 (12): 1461-1462). Results: FDR-adjusted p-values (fdr) or estimate of FDR (Fdr, q-value)
  21. Achieving “significance” is a function of: significance level (α) and power (1-β); effect size (standardized difference in means); sample size (n)
  22. Bivariate Data: relationship between two variables. Correlation (strength); regression (predictive). [Figure labels: correlation, regression]
  23. Correlation. Parametric (Pearson) or rank-order (Spearman, Kendall). Correlation is covariance scaled between -1 and 1
  24. Correlation vs. Regression. Regression describes the least-squares or best-fit line for the relationship (Y = m*X + b)
  25. Geyser Example. Goal: Don’t miss eruption! Data: time between eruptions, 70 ± 14 min; duration of eruption, 3.5 ± 1 min. Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
  26. Geyser Example (cont.). Two-cluster pattern for both duration and frequency. Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
  27. Geyser Example (cont.). Noted deviations from two-cluster pattern: outliers? covariates?
  28. Covariates. Trends in data which mask primary goals can be accounted for using covariate adjustment and appropriate modeling strategies
  29. Geyser Example (cont.). Noted deviations from two-cluster pattern can be explained by covariate: hydrofracking. Covariate adjustment is an integral aspect of statistical analyses (e.g. ANCOVA)
  30. Summary. Data exploration and pre-analysis: increase robustness of results; guard against spurious findings; can greatly improve primary analyses. Univariate statistics: are useful for identification of statistically significant changes or relationships; sub-optimal for wide data; best when combined with advanced multivariate techniques
  31. Resources. Web-based data analysis platforms: MetaboAnalyst (http://www.metaboanalyst.ca/MetaboAnalyst/faces/Home.jsp), MeltDB (https://meltdb.cebitec.uni-bielefeld.de/cgi-bin/login.cgi). Programming tools: The R Project for Statistical Computing (http://www.r-project.org/), Bioconductor (http://www.bioconductor.org/). GUI tools: imDEV (http://sourceforge.net/projects/imdev/?source=directory)
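
Worked sketch for the normalization slide (9). The slide names sample-wise sum normalization and shows mean and standard deviation labels for the measurement-wise case. The Python below is only an illustration: the intensity values are made up, and reading "mean, standard deviation" as autoscaling (center on the mean, divide by the standard deviation) is an assumption, not something the slide states.

```python
import statistics

def sum_normalize(sample):
    """Sample-wise: divide each measurement by the sample's total signal."""
    total = sum(sample)
    if total == 0:
        raise ValueError("cannot sum-normalize a sample with zero total signal")
    return [value / total for value in sample]

def autoscale(variable):
    """Measurement-wise (assumed): center on the mean, scale by the sample SD."""
    mean = statistics.fmean(variable)
    sd = statistics.stdev(variable)
    if sd == 0:
        raise ValueError("cannot autoscale a constant variable")
    return [(value - mean) / sd for value in variable]

# Hypothetical intensities: rows are samples, columns are metabolites.
data = [
    [10.0, 30.0, 60.0],
    [20.0, 20.0, 160.0],
    [5.0, 25.0, 70.0],
]
normalized = [sum_normalize(row) for row in data]
# Transpose so that each entry is one metabolite measured across samples.
scaled_columns = [autoscale(list(column)) for column in zip(*normalized)]
```

Normalizing rows before scaling columns keeps the two steps separate: the first removes differences in total signal between samples, the second puts metabolites on a comparable scale.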
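
Worked sketch for the transformation slide (12). The three transformations listed there can be written in a few lines. The default shift of 1.0 and the function names are illustrative assumptions; the Box-Cox form used is the standard one, (x^lambda - 1) / lambda, with the logarithm as the lambda = 0 case.

```python
import math

def shifted_log(values, shift=1.0):
    """log(x + shift); the shift (an assumed constant) keeps zeros finite."""
    return [math.log(value + shift) for value in values]

def box_cox(values, lam):
    """Box-Cox power transform; defined for strictly positive values only."""
    if any(value <= 0 for value in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(value) for value in values]
    return [(value ** lam - 1.0) / lam for value in values]

def inverse(values):
    """Inverse transform, 1 / x; undefined at zero."""
    return [1.0 / value for value in values]
```

Whether a transformation helped is usually judged afterwards, for example with the Q-Q plots mentioned on the slide.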
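
Worked sketch for the imputation slide (13). Only the single-value methods named on the slide (mean, min) are shown; multiple and multivariate imputation are out of scope here. Encoding a missing measurement as None is an assumption made for this example.

```python
import statistics

def impute_single_value(values, method="mean"):
    """Replace missing entries (None) with one value computed from the observed ones."""
    observed = [value for value in values if value is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if method == "mean":
        fill = statistics.fmean(observed)
    elif method == "min":
        fill = min(observed)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return [fill if value is None else value for value in values]
```

The choice of fill value should follow the slide's first question: a value missing because it fell below the detection limit argues for a low fill such as the minimum, while a value missing at random is less biased by the mean.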
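
Worked sketch for the Type I risk formula on slide 19. The function evaluates 1 - (1 - p.value)^m for a per-test threshold and m tests; like the formula itself, it treats the tests as independent. The function name is hypothetical.

```python
def type_one_risk(threshold, m):
    """Chance of at least one false positive across m independent tests."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a probability")
    if m < 0:
        raise ValueError("m must be non-negative")
    return 1.0 - (1.0 - threshold) ** m

# Risk at a 0.05 threshold for 1, 10, and 300 variables
# (300 matches the example design on slide 20).
for m in (1, 10, 300):
    print(m, type_one_risk(0.05, m))
```

With a single test the risk equals the threshold; with hundreds of variables it approaches 1, which is why the slide says univariate approaches do not scale well.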
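
Worked sketch for the FDR correction slide (20). This is a plain implementation of the standard Benjamini-Hochberg step-up adjustment, written for illustration; it is not the software used for the results on the slide, and the p-values in the usage line are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical p-values; a variable is called significant at q = 0.05
# when its adjusted p-value is at most 0.05.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [value <= 0.05 for value in adjusted]
```

For the slide's design of 300 variables, the same function would take the 300 t-test p-values as input.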
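
Worked sketch for slide 21. The relationship between significance level, power, effect size, and sample size can be made concrete with the usual normal approximation for a two-sided, two-group comparison of means: n per group is about 2 * ((z_(1-alpha/2) + z_(1-beta)) / d)^2. The approximation and the function name are assumptions chosen for illustration; an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-group comparison of means."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) / effect_size) ** 2)

# Smaller standardized effects need many more samples.
for d in (0.2, 0.5, 1.0):
    print(d, samples_per_group(d))
```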
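
Worked sketch for slides 23 and 24. Both statements, "correlation is covariance scaled between -1 and 1" and "regression describes the least-squares line Y = m*X + b", share the same building block, the sample covariance. The helper names and the example data are hypothetical.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Covariance scaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def least_squares_line(x, y):
    """Slope m and intercept b of the best-fit line Y = m*X + b."""
    slope = covariance(x, y) / covariance(x, x)
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return slope, intercept

# Hypothetical data lying exactly on Y = 2*X + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
r = pearson(x, y)
m, b = least_squares_line(x, y)
```

Correlation is symmetric in X and Y, while the regression line changes if the roles are swapped, because least squares places all error in Y (as noted on slide 18).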