Metabolomic Data Analysis for the
Study of Diseases

Dmitry Grapov, PhD
State of the art facility producing massive
amounts of biological data…
>13,000 samples/yr
>160 studies
~32,000 data points/study
Goals?
Analysis at the Metabolomic Scale
Univariate vs. Multivariate
Univariate
•Hypothesis testing (t-Test, ANOVA, etc.)

Multivariate
•Predictive Modeling
•PCA
•O-/PLS/-DA

[Figure labels: Group 1, Group 2]
Univariate vs. Multivariate
univariate/bivariate vs. multivariate

outliers?
mixed-up samples?
Data Complexity
Data: n samples x m variables
Meta Data: Experimental Design = complexity
Variable # = dimensionality (1-D, 2-D, ... m-D)
Statistical Analysis
•Identify differences in sample population means
•sensitive to distribution shape
•parametric = assumes normality
•error in Y, not in X (Y = mX + error)
•optimal for long data
•assumed independence
•false discovery rate (FDR)

[Figure labels, data shapes: wide, long, n-of-one]
Achieving “significance” is a function of:
significance level (α) and power (1-β)

effect size (standardized difference in means)

sample size (n)
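
The link between effect size, α, power, and n can be sketched numerically. This is a minimal illustration, not a replacement for a proper power analysis: it assumes a two-sided, two-group comparison with equal group sizes and uses the normal approximation. The function names cohens_d and samples_per_group are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(group1, group2):
    """Standardized difference in means, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled


def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group (normal approximation, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)
```

Halving the effect size roughly quadruples the required n, which is why small standardized differences need large studies.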
False Discovery Rate (FDR)
•Type I Error: False Positives
•Type II Error: False Negatives
•Type I risk = 1 - (1 - p.value)^m
m = number of variables tested
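
A short sketch of the Type I risk formula above, assuming independent tests that all use the same per-test threshold. The function name familywise_error is hypothetical.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


# Risk grows quickly with the number of variables tested.
for m in (1, 10, 148):
    print(m, round(familywise_error(0.05, m), 4))
```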

FDR correction
• p-value adjustment or estimate of FDR (Fdr, q-value)
Bioinformatics (2008) 24 (12):1461-1462
FDR correction

Benjamini & Hochberg (1995) (“BH”)
•Accepted standard

Bonferroni
•Very conservative
•adjusted p-value = p-value * # of tests (e.g. 0.005 * 148 = 0.74)

[Figure axes: FDR adjusted p-value vs. p-value]
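
Both adjustments are simple enough to sketch in plain Python. This is an illustration of the standard step-up BH procedure and the Bonferroni multiplication, not the code behind the figure. The function names are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 148 tests, a raw p-value of 0.005 becomes 0.74 under Bonferroni, matching the example above; BH is less severe because each p-value is scaled by m divided by its rank rather than by m.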
Multivariate Analysis
Clustering
• Grouping based on similarity/dissimilarity

Principal Components Analysis (PCA)
• Identify modes of variance in the data

Partial Least Squares (PLS)
•Identify modes of variance in the data
correlated with a hypothesis
Cluster Analysis
Use similarity/dissimilarity to group a
collection of samples or variables
Linkage

Approaches
•hierarchical (HCA)
•non-hierarchical (k-NN, k-means)
•distribution (mixture models)
•density (DBSCAN)
•self-organizing maps (SOM)

[Figure panels: Distribution, k-means, Density]
Hierarchical Cluster Analysis
similarity/dissimilarity defines “nearness” or distance

objects are grouped based on linkage methods
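
A minimal sketch of agglomerative clustering, assuming Euclidean distance and single linkage (distance between the closest pair of members). Other distances and linkage methods (complete, average, Ward) follow the same loop with a different rule. The function name single_linkage is hypothetical.

```python
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two nearest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per object
    merges = []  # (members_a, members_b, distance), in merge order
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```

The merge distances recorded in merges are what a dendrogram draws as branch heights.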
Hierarchy of Similarity
How does my metadata
match my data structure?
Hierarchy of effect sizes

[Figure axis: Similarity]
Projection of Data
Raw data

PCA dimensions

http://www.scholarpedia.org/article/Eigenfaces

The algorithm defines the position of the light source
Principal Components Analysis (PCA)
• unsupervised
• maximize variance (X)
Partial Least Squares Projection to
Latent Structures (PLS)
• supervised
• maximize covariance (Y ~ X)

[Figure axes: PC1, PC2. Credit: James X. Li, 2009, VisuMap Tech.]
Interpreting PCA Results
Variance explained (eigenvalues)

Row (sample) scores and column (variable) loadings
How are scores and
loadings related?
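
One way to see the relationship: scores are the mean-centered data projected onto the loadings, so score x loading rebuilds the part of the data that the component explains. The sketch below extracts only the first component by power iteration on the covariance matrix. It assumes the start vector is not orthogonal to the leading component; real analyses would normally use a library SVD. The function name first_principal_component is hypothetical.

```python
from math import sqrt


def first_principal_component(rows, n_iter=200):
    """Return (scores, loadings, fraction of variance explained) for PC1."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Covariance matrix of the variables (m x m).
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(m)]
           for j in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    loadings = [1.0 / (j + 1) for j in range(m)]
    for _ in range(n_iter):
        w = [sum(cov[j][k] * loadings[k] for k in range(m)) for j in range(m)]
        norm = sqrt(sum(x * x for x in w))
        loadings = [x / norm for x in w]
    # Eigenvalue = variance of the data along the loading vector.
    eigenvalue = sum(loadings[j] * sum(cov[j][k] * loadings[k] for k in range(m))
                     for j in range(m))
    # Scores: each centered sample projected onto the loadings.
    scores = [sum(c[j] * loadings[j] for j in range(m)) for c in centered]
    explained = eigenvalue / sum(cov[j][j] for j in range(m))
    return scores, loadings, explained
```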
Centering and Scaling

PMID: 16762068
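
A minimal sketch of one common choice, autoscaling (mean-center each variable, then divide by its standard deviation), so that variables measured on very different scales contribute equally. Other scaling options exist and are not shown. The function name autoscale is hypothetical, and a constant column (standard deviation of zero) is not handled.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Mean-center each column and scale it to unit standard deviation."""
    columns = list(zip(*rows))  # samples in rows, variables in columns
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(value - mu) / sd for value, mu, sd in zip(row, means, sds)]
            for row in rows]
```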
Use PLS to test a hypothesis
Partial Least Squares (PLS) is used to identify planes of maximum
correlation between X measurements and Y (hypothesis)
[Figure panels: PLS, PCA. Labels: time = 0, 120 min.]
PLS model validation is critical

Determine in-sample (Q2) and out-of-sample error (RMSEP) and
compare to a random model
•permutation tests

•training/testing
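
The permutation idea can be sketched generically: shuffle the class labels, recompute the performance statistic, and ask how often the random result is at least as good as the real one. For a PLS model the statistic would be something like Q2; here a simple absolute difference in group means stands in for it, which is an assumption made to keep the example self-contained. All function names are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, statistic, n_permutations=999, seed=0):
    """Fraction of label shuffles scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def abs_mean_difference(values, labels):
    """Stand-in for a model performance metric (assumption, not PLS)."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(group1) - mean(group0))
```

A model whose real performance sits inside the permuted distribution is no better than a random model, however good its fit looks in-sample.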
Biochemical domain information
Databases for organism-specific biochemical information:

Multiple organisms
•KEGG
•BioCyc
•Reactome

Human
•HMDB
•SMPDB
Pathway Enrichment Analysis

[Figure axes: enrichment, topological importance]

http://www.metaboanalyst.ca/MetaboAnalyst/faces/UploadView.jsp
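
The enrichment part is often computed as an over-representation test. The sketch below assumes a hypergeometric null (one common choice; the slide does not name the test used) and ignores topology. The function name enrichment_p_value and its parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_pathway, pathway_size, total_hits, background_size):
    """P(at least hits_in_pathway altered metabolites fall in the pathway by chance)."""
    denominator = comb(background_size, total_hits)
    upper = min(pathway_size, total_hits)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, total_hits - k)
        for k in range(hits_in_pathway, upper + 1)
    )
    return tail / denominator
```

Because many pathways are tested at once, these p-values need the same multiple-testing adjustment as any other set of univariate tests.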
Network Mapping
•Biochemical
•Structural Similarity

doi:10.1186/1471-2105-13-99
Data visualization as a form of analysis
Dextromethorphan (DM)
Liver CYP2D6: DM → dextrorphan

= additives in DM
•high fructose corn syrup
•antioxidants
•flavor
Identification of relationships between
altered metabolites
•urea cycle
•protein glycosylation
•nucleotide synthesis
Identification of treatment effects
Analysis of differential metabolic responses

Treatment 1

Treatment 2
Resources
•DeviumWeb- Dynamic multivariate data analysis and
visualization platform
url: https://github.com/dgrapov/DeviumWeb

•imDEV- Microsoft Excel add-in for multivariate analysis
url: http://sourceforge.net/projects/imdev/

•MetaMapR: Network analysis tools for metabolomics
url: https://github.com/dgrapov/MetaMapR

•TeachingDemos- Tutorials and demonstrations
url: http://sourceforge.net/projects/teachingdemos/?source=directory
url: https://github.com/dgrapov/TeachingDemos

•CDS Blog- Data analysis case studies
url: http://imdevsoftware.wordpress.com/
dgrapov@ucdavis.edu
metabolomics.ucdavis.edu

This research was supported in part by NIH 1 U24 DK097154

High Dimensional Biological Data Analysis and Visualization