0
Introduction

Introduction to
Metabolomic Data Analysis

Dmitry Grapov, PhD
Introduction

Important
•This is an introduction to a series
of 8 tutorials for metabolomic data
analysis
•Download all th...
Goals?
Analysis at the Metabolomic Scale
Cycle of Scientific Discovery
Hypothesis

Hypothesis Generation

Data Acquisition

Data Processing

Data Analysis

Data
Univariate vs. Multivariate
Multivariate

Predictive Modeling

Group 2

Group 1

Univariate

Hypothesis testing
(t-Test, A...
Univariate vs. Multivariate
univariate/bivariate


vs.
multivariate

outliers?
mixed up samples?
Data Analysis Goals
Exploration

Classification

• Are there any trends in my data?
– analytical sources
– meta data/covar...
Data Complexity
Meta
Data
m
n

variables

Experimental
Design =
complexity

samples

Data
m-D
1-D 2-D
Variable # = dimensi...
Univariate Qualities
•length (sample size)
•center (mean, median,
geometric mean)
•dispersion (variance,
standard deviatio...
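The univariate qualities listed on this slide can be computed with a few standard-library calls. The following is a minimal Python sketch, not the course's own code; the `describe` helper and the `intensities` values are made up for illustration, and the geometric mean assumes strictly positive data.

```python
import statistics


def describe(values):
    """Summarize one variable: length, center, dispersion, range, quantiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),                                   # length (sample size)
        "mean": statistics.mean(values),                    # center
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),
        "variance": statistics.variance(values),            # sample variance (n - 1)
        "sd": statistics.stdev(values),                     # sample standard deviation
        "min": min(values),                                 # range
        "max": max(values),
        "quartiles": (q1, q2, q3),
    }


# Hypothetical intensities for a single metabolite across five samples.
intensities = [2.0, 4.0, 4.0, 8.0, 16.0]
summary = describe(intensities)
print(summary)
```

For right-skewed data like this toy example, the mean sits above both the median and the geometric mean, which is one quick hint about distribution shape.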
Data Quality
Metrics
• Precision
• Accuracy
Remedies

• normalization
• outlier detection
*Start lab 1-statistical analys...
Univariate Analyses
•Identify differences in sample population
means
•sensitive to distribution shape
•parametric = assume...
False Discovery Rate (FDR)
•Type I Error: False Positives
•Type II Error: False Negatives
•Type I risk = $1 - (1 - p_{\text{value}})^m$
m ...
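To make the Type I risk formula concrete, the sketch below evaluates it for a given number of tests and then applies a p-value adjustment. The adjustment shown is the Benjamini-Hochberg procedure, chosen here as one common option and not necessarily the method of the reference cited on this slide; the function names and the example p-values are assumptions for illustration.

```python
def family_wise_risk(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


risk = family_wise_risk(0.05, 100)            # 100 variables tested at 0.05
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]  # made-up example values
q_values = benjamini_hochberg(p_values)
print(round(risk, 3), [round(q, 5) for q in q_values])
```

With 100 tests at a 0.05 threshold the risk of at least one false positive is above 99%, which is why some form of correction is needed when every metabolite is tested separately.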
Achieving “significance” is a function of:
significance level (α) and power (1-β )

effect size (standardized difference i...
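The dependence of power on significance level, effect size, and sample size can be sketched numerically. The example below uses a normal approximation for a two-sided, two-group comparison with equal group sizes; that approximation and the `approx_power` name are assumptions made for illustration, and an exact t-test power calculation would differ slightly for small n.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    effect_size is the standardized difference in means (Cohen's d).
    Uses a normal (z) approximation, so it is optimistic for small n.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * (n_per_group / 2.0) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (8, 16, 32, 64):
    print(n, round(approx_power(0.5, n), 3))
```

Holding α and the effect size fixed, power rises with n; with an effect size of zero the function returns α, the false positive rate.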
Clustering
Identify
•patterns
•group structure

•relationships
•Evaluate/refine hypothesis

•Reduce complexity

Artist: Ch...
Cluster Analysis
Use the concept similarity/dissimilarity
to group a collection of samples or
variables
Linkage
Approaches...
Hierarchical Cluster Analysis
• similarity/dissimilarity
defines “nearness” or
distance
euclidean manhattan Mahalanobis no...
Hierarchical Cluster Analysis
Agglomerative/linkage algorithm
defines how points are grouped

single

complete centroid av...
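The two choices named on these slides, a distance measure and a linkage rule, can be illustrated with a small agglomerative clustering sketch. This is a naive Python version written for readability, not an efficient or reference implementation; it covers Euclidean and Manhattan distance with single, complete, and average linkage only (Mahalanobis distance and centroid linkage are left out), and all names and data points are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Linkage: how pairwise point distances are combined into a cluster distance.
LINKAGES = {
    "single": min,                            # nearest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}


def agglomerate(points, n_clusters, distance=euclidean, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [distance(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


samples = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]  # toy 2-D samples
groups = agglomerate(samples, n_clusters=3, distance=manhattan, linkage="average")
print(groups)
```

Recording the distance at each merge, instead of stopping at a fixed number of clusters, would give the merge heights that a dendrogram draws.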
Dendrograms

[Figure: dendrogram; axis label: Similarity]
Hierarchical Cluster Analysis
How does my metadata
match my data structure?

Exploration

*finish lab 2-Cluster Analysis

...
Projection of Data

The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervi...
Interpreting PCA Results
Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings
How are scores and
loadings related?
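One way to see how scores and loadings relate: the loadings are the directions (eigenvectors of the covariance matrix) and each sample's score is its centered data projected onto that direction. The sketch below finds only the first principal component by power iteration in plain Python; the function names and the toy data are assumptions for illustration, and a real analysis would use a linear algebra library and extract several components.

```python
def center(rows):
    """Subtract each column (variable) mean from the rows (samples)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(rows, iterations=200):
    """Return (loadings, scores, fraction of variance explained) for PC1."""
    X = center(rows)
    n, p = len(X), len(X[0])
    # Sample covariance matrix of the variables (p x p).
    cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
            for j in range(p)] for i in range(p)]
    # Power iteration; the uneven start vector avoids an unlucky symmetric start.
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    # Scores: each centered sample projected onto the loading vector.
    scores = [sum(x * l for x, l in zip(row, v)) for row in X]
    return v, scores, eigenvalue / total_variance


# Toy data: 10 samples (rows) by 2 correlated variables (columns).
data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
loadings, scores, explained = first_component(data)
print([round(l, 3) for l in loadings], round(explained, 3))
```

Each sample gets one score per component and each variable gets one loading per component, so a variable with a large loading pulls samples with high values of it toward high scores.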
Centering and Scaling

PMID: 16762068

*finish lab 3-Principal Components Analysis
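Centering and scaling decide how much each variable contributes to a PCA. As a rough illustration, the sketch below applies three commonly used options to the columns of a small matrix: mean centering only, autoscaling (divide by the standard deviation), and Pareto scaling (divide by its square root). The `scale_columns` helper and the data are hypothetical, and constant columns (zero standard deviation) are not handled.

```python
import statistics


def scale_columns(rows, method="auto"):
    """Center every column, then scale it according to `method`.

    method: "center" (no scaling), "auto" (unit variance), or "pareto".
    """
    scaled_columns = []
    for col in zip(*rows):
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample standard deviation
        if method == "center":
            divisor = 1.0
        elif method == "auto":
            divisor = sd
        elif method == "pareto":
            divisor = sd ** 0.5
        else:
            raise ValueError(f"unknown method: {method}")
        scaled_columns.append([(x - mean) / divisor for x in col])
    # Transpose back to samples-by-variables.
    return [list(row) for row in zip(*scaled_columns)]


# Two variables on very different scales (made-up values).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
autoscaled = scale_columns(data, "auto")
pareto = scale_columns(data, "pareto")
print(autoscaled)
print(pareto)
```

After autoscaling both variables carry equal weight, while Pareto scaling shrinks the high-abundance variable only partway, keeping some of the original intensity structure.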
Use PLS to test a hypothesis
Partial Least Squares (PLS) is used to identify planes of maximum
correlation between X measu...
Modeling multifactorial
relationships
~two-way ANOVA

dynamic changes among groups
PLS Related Objects
Model
•dimensions, latent variables (LV)
•performance metrics (Q2, RMSEP, etc)
•validation (training/t...
“goodness” of the model is all about the
perspective

Determine in-sample (Q2) and out-of-sample error (RMSEP) and
compare ...
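The idea of comparing a model's error to that of a random model can be sketched with a permutation test. To keep the example self-contained, a one-predictor least-squares line stands in for a PLS model, which is a deliberate simplification; Q2 is computed from leave-one-out predictions as 1 - PRESS/TSS, and all function names and data values below are assumptions for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept (a stand-in for fitting a PLS model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_predictions(x, y):
    """Predict each sample from a model fitted without that sample."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds


def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


def q2(observed, predicted):
    """1 - PRESS/TSS, computed from held-out predictions."""
    mean = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    tss = sum((o - mean) ** 2 for o in observed)
    return 1.0 - press / tss


def permutation_p_value(x, y, n_permutations=200, seed=1):
    """Share of shuffled-Y models whose Q2 matches or beats the real model."""
    rng = random.Random(seed)
    actual = q2(y, loo_predictions(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)  # break the X-Y relationship
        if q2(y_perm, loo_predictions(x, y_perm)) >= actual:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]  # made-up response
preds = loo_predictions(x, y)
model_q2 = q2(y, preds)
model_rmsep = rmsep(y, preds)
p_value = permutation_p_value(x, y)
print(round(model_q2, 3), round(model_rmsep, 3), round(p_value, 3))
```

The key design point is that the model is refitted on every shuffled response, so the null distribution reflects what the whole fitting procedure can achieve by chance.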
Biological Interpretation
Projection or mapping of analysis results
into a biological context.
• Visualization
• Enrichmen...
Identification of alterations in
biochemical domains
Organism specific biochemical relationships and information
Multiple ...
Network Mapping
1. Generate
Connections

2. Calculate
Mappings

3. Create
Network

Grapov D., Fiehn O., Multivariate and n...
Connections and
Contexts
Biochemical (substrate/product)
•Database lookup
•Web query
Chemical (structural or
spectral simi...
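Besides biochemical and chemical connections, the full slide also lists empirical (dependency) connections based on correlation. A minimal sketch of that idea follows: compute pairwise Pearson correlations between metabolite profiles and keep pairs above a threshold as network edges. The metabolite names, values, and the 0.8 cutoff are hypothetical choices for illustration, and partial correlation is not covered.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def correlation_edges(profiles, threshold=0.8):
    """Return (node_a, node_b, r) for every pair with |r| >= threshold."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if abs(r) >= threshold:
                edges.append((a, b, round(r, 3)))
    return edges


# Made-up abundance profiles across five samples.
profiles = {
    "glucose": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate": [2.0, 4.0, 6.0, 8.0, 10.5],
    "alanine": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = correlation_edges(profiles)
print(edges)
```

The resulting edge list is the kind of table a network tool can import, with the correlation stored as an edge attribute for later mapping of analysis results.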
Mapping Analysis Results
Analysis results

Network Annotation

*finish lab 7-Network Mapping I

Mapped Network
Biochemical
Relationships

http://www.genome.jp/dbget-bin/www_bget?rn:R00975
Structural
Similarity

http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
Mass Spectral Connections

Watrous J et al. PNAS 2012;109:E1743-E1752

*finish lab 8-Network Mapping II
Transcript of "0 introduction"

  1. Introduction: Introduction to Metabolomic Data Analysis, Dmitry Grapov, PhD
  2. Introduction Important •This is an introduction to a series of 8 tutorials for metabolomic data analysis •Download all the required files and software here: https://sourceforge.net/projects/teachingdemos/files/Winter%202014%20LC-MS%20and%20Statistics%20Course/ •Then follow the directions in the software/startup.R to launch all accompanying software
  3. Goals?
  4. Analysis at the Metabolomic Scale
  5. Cycle of Scientific Discovery Hypothesis Hypothesis Generation Data Acquisition Data Processing Data Analysis Data
  6. Univariate vs. Multivariate Multivariate Predictive Modeling Group 2 Group 1 Univariate Hypothesis testing (t-Test, ANOVA, etc.) PCA O-/PLS/-DA
  7. Univariate vs. Multivariate univariate/bivariate vs. multivariate outliers? mixed up samples?
  8. Data Analysis Goals Exploration Classification • Are there any trends in my data? – analytical sources – meta data/covariates • Useful Methods – matrix decomposition (PCA, ICA, NMF) – cluster analysis • Differences/similarities between groups? – discrimination, classification, significant changes • Useful Methods – analysis of variance (ANOVA), mixed effects models – partial least squares discriminant analysis (O-/PLS-DA) – Others: random forest, CART, SVM, ANN • What is related or predictive of my variable(s) of interest? – Regression, correlation • Useful Methods – correlation – partial least squares (O-/PLS) Prediction
  9. Data Complexity Meta Data m n variables Experimental Design = complexity samples Data m-D 1-D 2-D Variable # = dimensionality
  10. Univariate Qualities •length (sample size) •center (mean, median, geometric mean) •dispersion (variance, standard deviation) •range (min / max) •quantiles •shape (skewness, kurtosis, normality, etc.) (figure labels: standard deviation, mean)
  11. Data Quality Metrics • Precision • Accuracy Remedies • normalization • outlier detection *Start lab 1-statistical analysis
  12. Univariate Analyses •Identify differences in sample population means •sensitive to distribution shape •parametric = assumes normality •error in Y, not in X (Y = mX + error) •optimal for long data •assumed independence •false discovery rate (FDR) (figure labels: wide, long, n-of-one)
  13. False Discovery Rate (FDR) •Type I Error: False Positives •Type II Error: False Negatives •Type I risk = $1 - (1 - p_{\text{value}})^m$, m = number of variables tested FDR correction • p-value adjustment or estimate of FDR (Fdr, q-value) Bioinformatics (2008) 24 (12):1461-1462
  14. Achieving “significance” is a function of: significance level (α) and power (1-β) effect size (standardized difference in means) sample size (n) *finish lab 1-statistical analysis
  15. Clustering Identify •patterns •group structure •relationships •Evaluate/refine hypothesis •Reduce complexity Artist: Chuck Close
  16. Cluster Analysis Use the concept of similarity/dissimilarity to group a collection of samples or variables Linkage Approaches •hierarchical (HCA) •non-hierarchical (k-NN, k-means) •distribution (mixture models) •density (DBSCAN) •self organizing maps (SOM) (figure labels: Distribution, k-means, Density)
  17. Hierarchical Cluster Analysis • similarity/dissimilarity defines “nearness” or distance euclidean manhattan Mahalanobis non-euclidean (figure panels with X and Y axes)
  18. Hierarchical Cluster Analysis Agglomerative/linkage algorithm defines how points are grouped single complete centroid average
  19. Dendrograms (figure; axis label: Similarity)
  20. Hierarchical Cluster Analysis How does my metadata match my data structure? Exploration *finish lab 2-Cluster Analysis Confirmation
  21. Projection of Data The algorithm defines the position of the light source Principal Components Analysis (PCA) • unsupervised • maximize variance (X) Partial Least Squares Projection to Latent Structures (PLS) • supervised • maximize covariance (Y ~ X) James X. Li, 2009, VisuMap Tech.
  22. Interpreting PCA Results Variance explained (eigenvalues) Row (sample) scores and column (variable) loadings
  23. How are scores and loadings related?
  24. Centering and Scaling PMID: 16762068 *finish lab 3-Principal Components Analysis
  25. Use PLS to test a hypothesis Partial Least Squares (PLS) is used to identify planes of maximum correlation between X measurements and Y (hypothesis) PLS PCA time = 0 120 min.
  26. Modeling multifactorial relationships ~two-way ANOVA dynamic changes among groups
  27. PLS Related Objects Model •dimensions, latent variables (LV) •performance metrics (Q2, RMSEP, etc.) •validation (training/testing, permutation, cross-validation) •orthogonal correction Samples •scores •predicted values •residuals Variables •Loadings •Coefficients, summary of loadings based on all LVs •VIP, variable importance in projection •Feature selection
  28. “goodness” of the model is all about the perspective Determine in-sample (Q2) and out-of-sample error (RMSEP) and compare to a random model •permutation tests •training/testing *finish lab 4-Partial Least Squares and lab 5-Data Analysis Case Study
  29. Biological Interpretation Projection or mapping of analysis results into a biological context. • Visualization • Enrichment • Networks – biochemical – structural – spectral – empirical
  30. Identification of alterations in biochemical domains Organism specific biochemical relationships and information Multiple organism DBs •KEGG •BioCyc •Reactome •Human •HMDB •SMPDB *finish lab 6-Metabolite Enrichment Analysis
  31. Network Mapping 1. Generate Connections 2. Calculate Mappings 3. Create Network Grapov D., Fiehn O., Multivariate and network tools for analysis and visualization of metabolomic data, ASMS, June 08, 2013, Minneapolis, MN
  32. Connections and Contexts Biochemical (substrate/product) •Database lookup •Web query Chemical (structural or spectral similarity) •fingerprint generation BMC Bioinformatics 2012, 13:99 doi:10.1186/1471-2105-13-99 Empirical (dependency) •correlation, partial-correlation
  33. Mapping Analysis Results Analysis results Network Annotation *finish lab 7-Network Mapping I Mapped Network
  34. Biochemical Relationships http://www.genome.jp/dbget-bin/www_bget?rn:R00975
  35. Structural Similarity http://pubchem.ncbi.nlm.nih.gov//score_matrix/score_matrix.cgi
  36. Mass Spectral Connections Watrous J et al. PNAS 2012;109:E1743-E1752 *finish lab 8-Network Mapping II