Metabolomics Data Analysis

Johan A. Westerhuis
Swammerdam Institute for Life Sciences, University of Amsterdam
Business Mathematics and Information,
North-West University, Potchefstroom, South Africa

SeqAhead, Barcelona February 2013

Metabolomics pipeline: Issues for biostatistics

Pipeline stages: Biological question, Experimental design, Data acquisition, Data pre-processing, Statistical data analysis, Metabolite identification, Biological interpretation

• Experimental design: Power analysis, Treatment design, QC strategy, Measurement design
• Data pre-processing: Normalisation, Quantification
• Statistical data analysis: Explorative, Predictive, Hypothetical biomarkers
• Metabolite identification: Spectral matching, De novo identification
• Biological interpretation: Network inference, MSEA, Pathway analysis

Data Analysis
special issue Metabolomics
• Data preprocessing methods (make samples
more comparable)
• How to treat non-detects
• Variable importance in multivariate models
• Metabolic network analysis
• Data fusion methods
• Individual responses
• Between metabolite ratios

Guest Editors
Jeroen J. Jansen
Johan A. Westerhuis

Multivariate metabolomics data
NONTARGETED PROFILING versus TARGETED ANALYSIS

[Figure: nontargeted profiling compared with targeted analysis. The targeted example is a table of quantified metabolites (hipp, fum, urea, allant, TMAO, citrat). Labels: technical correlations, biological correlations.]

Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters, structure / outliers in metabolites and in samples

• Supervised
– Discriminate two or more groups to make a predictive model and to find biomarkers.

• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference

• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion

Metabolomics Data preprocessing

• Optimize biological content of data

• Correct for incorrect sampling, sample
workup issues, batch effects
• What is the noise level in the data?
• High peaks more important than low peaks?
• Multivariate methods love large values!

Generalized log transform: variance stabilization.
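The generalized log transform named on this slide can be illustrated with a short sketch. It assumes the common form glog(x) = ln((x + sqrt(x^2 + lambda)) / 2), which the slides do not spell out; the function name `glog` and the tuning constant `lam` are illustrative choices, and the intensities are made up.

```python
import numpy as np

def glog(x, lam=1.0):
    """Generalized log transform (assumed form; lam is a hypothetical tuning constant).

    Behaves like ln(x) for large x and stays finite and smooth near zero,
    which damps the dominance of high peaks in multivariate models.
    """
    x = np.asarray(x, dtype=float)
    return np.log((x + np.sqrt(x**2 + lam)) / 2.0)

# Made-up peak intensities spanning several orders of magnitude
intensities = np.array([0.0, 1.0, 10.0, 1000.0, 100000.0])
transformed = glog(intensities, lam=4.0)
```

In practice the constant would be tuned to the noise level of the platform, for example from replicate measurements.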

Metabolic changes during E. coli culture growth using k-means clustering.

[Figure: metabolites versus time.]
(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of
respective sampling time points are marked in the curve. Time point 0 minutes
marks the application of the respective stress condition.

(B) Relative changes of metabolite pools normalized to time point 1. Fold change is
presented on a log10 scale. To reveal the main trends of metabolic changes,
10 k-means clusters are color coded.

Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10
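The clustering step described in the caption (log10 fold change relative to time point 1, then k-means) can be sketched as below. This is a plain NumPy version of Lloyd's algorithm on synthetic data, not the cited authors' code; the function names, the two-cluster toy data and the seeds are assumptions made for illustration.

```python
import numpy as np

def log10_fold_change(conc):
    """conc: metabolites x time points; log10 change relative to the first time point."""
    return np.log10(conc / conc[:, [0]])

def kmeans(data, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every profile to every centroid, then nearest-centroid assignment
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Synthetic example: 20 rising and 20 falling metabolite pools over 6 time points
rng = np.random.default_rng(1)
t = np.arange(6)
rising = 10.0 * 2.0 ** (0.5 * t)
falling = 10.0 * 2.0 ** (-0.5 * t)
conc = np.vstack([rising * rng.lognormal(0.0, 0.05, size=(20, 6)),
                  falling * rng.lognormal(0.0, 0.05, size=(20, 6))])
labels, centroids = kmeans(log10_fold_change(conc), k=2)
```

The slide uses 10 clusters; the toy data here only contain two trends, so k = 2 is used.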

Self Organising Map of Metabolites in serum

1H NMR spectra of 613 patients
with type I diabetes and a diverse
spread of complications

Nonlinear mapping method
for large number of samples.

Relate position on the map to
diagnostic responses.

Can be made supervised

1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death
VP Mäkinen et al, Molecular Systems Biology 4:167, 2008

Multivariate Metabolomics Data analysis
• Explorative
• Supervised (differentially expressed biomarkers)
• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
– Metabolic network inference
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion

Supervised Metabolomics Data analysis
Case – Control (PLSDA)

[Figure: score plot (PC1 versus PC2) of men and women; PLS regression coefficients b versus chemical shift (ppm).]

• Is there really a difference between the groups?
Statistical validation issues

• Which are the most important peaks for discrimination?
Variable importance
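One common way to approach the validation question (is there really a difference between the groups?) is a permutation test around a cross-validated PLS-DA model. The sketch below is an assumption about how that could look, not the procedure used for the slide: it uses a one-component PLS model, leave-one-out cross-validation and synthetic data, and all function names are illustrative. The vector `b` plays the role of the PLS regression coefficients plotted against chemical shift.

```python
import numpy as np

def pls1_fit(X, y):
    """One-component PLS regression of class labels y (-1/+1) on X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = Xc.T @ yc                       # covariance of each variable with y
    norm = np.linalg.norm(w)
    if norm == 0:                       # no covariance at all: predict the mean
        return np.zeros(X.shape[1]), x_mean, y_mean
    w /= norm
    t = Xc @ w                          # sample scores
    q = (t @ yc) / (t @ t)              # regression of y on the scores
    return w * q, x_mean, y_mean        # b: one regression coefficient per variable

def loo_errors(X, y):
    """Number of leave-one-out misclassifications."""
    errors = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        b, x_mean, y_mean = pls1_fit(X[keep], y[keep])
        y_hat = (X[i] - x_mean) @ b + y_mean
        errors += int(np.sign(y_hat) != y[i])
    return errors

def permutation_test(X, y, n_perm=200, seed=0):
    """Compare the observed error count with counts obtained for permuted labels."""
    rng = np.random.default_rng(seed)
    observed = loo_errors(X, y)
    null = np.array([loo_errors(X, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null <= observed)) / (1 + n_perm)
    return observed, p_value

# Synthetic example: 10 + 10 samples, 30 variables, 5 of them shifted in one group
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 10)
X = rng.normal(size=(20, 30))
X[y > 0, :5] += 3.0
errors, p_value = permutation_test(X, y)
```

A small p-value means the observed classification is better than what random group labels typically achieve with the same modelling procedure.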

• Explain the Psyhogios example using examples from the paper and MetaboAnalyst examples

Proton NMR spectra of the urine samples were obtained
on a 500MHz 1H NMR machine.


NMR spectra of urine samples


Nonsupervised

Supervised


Experimental Design Example

Experiment: Rats are given Bromobenzene that affects the liver

Measurements: NMR spectroscopy of urine

Experimental Design:
Time: 6, 24 and 48 hours
Groups: 3 doses of BB, Vehicle group, Control group
Animals: 3 rats per dose per time point

[Figure: NMR spectra of rat urine at 6, 24 and 48 hours versus chemical shift (ppm), with labeled peaks at 5.38, 3.7525, 3.675, 3.285, 3.0475, 3.0275, 2.93, 2.7175, 2.075 and 2.055 ppm.]

Different contributions

[Figure: metabolite concentration versus time, showing the contributions of the experimental design factors Time, Dose and Animal, and the resulting trajectories.]

ANOVA decomposition of each variable

$$x_{hk i_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hk i_{hk}}$$

MATRICES:
$$X = \mathbf{1}m^T + X_\alpha + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$

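ANOVA and PCA → ASCA

$$X = \mathbf{1}m^T + X_\alpha + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$

Each effect matrix gets its own component model with scores $T$ and loadings $P$:

$$X = \mathbf{1}m^T + T_\alpha P_\alpha^T + T_{\alpha\beta} P_{\alpha\beta}^T + T_{\alpha\beta\gamma} P_{\alpha\beta\gamma}^T + E$$

$E$: parts of the data not explained by the component models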
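The ASCA idea on this slide (split the data into ANOVA-style effect matrices, then run PCA on each) can be sketched in a few lines. This is a minimal illustration under assumptions: the time effect is the per-time-point mean, the time-by-dose term is the cell mean minus the time mean, and the remainder is the individual animal variation. The data are random numbers laid out like the bromobenzene design (3 time points, 5 groups, 3 rats), and the function names are hypothetical.

```python
import numpy as np

def asca_decompose(X, time, dose):
    """Split X into overall mean m, time effect, time-by-dose effect and individual part."""
    m = X.mean(axis=0)
    Xc = X - m
    X_a = np.zeros_like(Xc)
    X_ab = np.zeros_like(Xc)
    for k in np.unique(time):
        in_time = time == k
        time_mean = Xc[in_time].mean(axis=0)
        X_a[in_time] = time_mean                      # alpha: time effect
        for h in np.unique(dose[in_time]):
            cell = in_time & (dose == h)
            X_ab[cell] = Xc[cell].mean(axis=0) - time_mean   # alpha-beta: dose within time
    X_abg = Xc - X_a - X_ab                           # alpha-beta-gamma: individual animals
    return m, X_a, X_ab, X_abg

def pca(M, n_components=2):
    """PCA of an effect matrix via SVD; returns scores T and loadings P."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n_components] * s[:n_components], Vt[:n_components].T

# Synthetic design: 3 time points x 5 groups x 3 animals, 8 spectral variables
rng = np.random.default_rng(0)
time = np.repeat([6, 24, 48], 15)
dose = np.tile(np.repeat(np.arange(5), 3), 3)
X = rng.normal(size=(45, 8))

m, X_a, X_ab, X_abg = asca_decompose(X, time, dose)
T_ab, P_ab = pca(X_ab)          # submodel for the time-by-dose effect
```

Because the effect matrices are mutually orthogonal, each submodel can be interpreted separately, which is what the score and loading plots on the following slides do.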

Results

[Figure: αβ-scores versus Time (Hours: 6, 24, 48) for the control, vehicle, low, medium and high dose groups; submodels Xα, Xαβ and Xαβγ; label: 40 %.]

Results → biomarkers

[Figure: loadings of the α, αβ and αβγ submodels versus chemical shift (ppm), with labeled peaks.]

• Unique to the α submodel
• Differences between submodels
• Interesting for Biology
• Interesting for Statistics / Diagnostics

Multivariate Metabolomics Data analysis
• Explorative
• Supervised
– Method comparison
• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion

NONTARGETED
SELDI measurements of serum samples of
20 Gaucher patients and 20 healthy
controls.

Gaucher is a genetic disease in which a fatty
substance (lipid) accumulates in cells and
certain organs.

TARGETED
• Human urine and porcine cerebrospinal fluid
samples spiked with a range of peptides.
• Variation in number of samples, and in within-
and between-group variation

Feature selection methods RESULTS
• Complex nontargeted Gaucher profiling data with
highly variable background and varying difference
between case and control: Multivariate methods
perform best.

• Spiked LCMS targeted data with less variation in
effect size: univariate and semi-univariate methods
are best in selecting biomarkers.

Biomarkers:

A: Univariate
B: Multivariate
C: Change in group correlation

Between metabolite ratios (BMR) in a green tea intervention study
186 human subjects with abdominal obesity

Validation shows significant changes in BMR between placebo and green tea treatment
together with most important triacylglycerols TG28-29 and TG41-42.
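As an illustration of what a table of between metabolite ratios could look like, the sketch below computes all pairwise log ratios from a concentration table. The slides do not define how the BMR values were calculated, so the log-ratio form, the function name and the toy metabolite names are assumptions.

```python
import numpy as np
from itertools import combinations

def log_ratios(conc, names):
    """All pairwise between-metabolite log ratios; conc is samples x metabolites."""
    logc = np.log(conc)
    pairs = list(combinations(range(len(names)), 2))
    ratios = np.column_stack([logc[:, i] - logc[:, j] for i, j in pairs])
    labels = [f"{names[i]}/{names[j]}" for i, j in pairs]
    return ratios, labels

# Toy data: 2 samples, 3 hypothetical metabolites
conc = np.array([[2.0, 4.0, 8.0],
                 [1.0, 1.0, 1.0]])
ratios, labels = log_ratios(conc, ["m1", "m2", "m3"])
```

With p metabolites this gives p(p-1)/2 ratio variables, so any downstream test needs validation or multiple-testing control.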

Multivariate Metabolomics Data analysis
• Explorative
• Supervised
• Biological Interpretation
– Metabolite set enrichment, Pathway analysis
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion

Differences in blood metabolites due to aging

Aging biomarker metabolites in liver

Special topic: Metabolic networks
Biochemical Network vs Association Network

Figure 7 Marginal correlation network for a set of metabolites in
tomato. Volatiles in red, derivatized metabolites in yellow. Solid
lines represent positive correlations, dashed lines negative ones.
Thickness of line corresponds to magnitude of ...
Margriet M.W.B. Hendriks, Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 2011

Metabolomics, 2005

Data from
Potato tubers

Metabolic neighbors
Do not participate in common reactions
High correlation due to e.g. chemical equilibrium, mass conservation, ...

“a systematic relationship between observed correlation
networks and the underlying biochemical pathways.”
Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003

Metabolic Network Inference
Search for the link between metabolome data and underlying metabolic
networks.

[Figure: metabolites A to F, measured data on one side and the unknown (??) connecting network on the other.]

• As an example: can we distinguish healthy from diseased networks?

[Figure: HEALTHY and DISEASE networks of Glucose and metabolites A to G.]

From data to network

Goal: network topology? Directions?

Problems:
• Noise
• Missing metabolites
• Huge amount of possible network structures

Inference from static data

1. DATA COLLECTION
A. Enzymatic Variability
B. Intrinsic Variability
C. Environmental Variability

2. SIMILARITY SCORE CALCULATION (all possible pairwise interactions)
2a. Relevance Networks
– Pearson Correlation (PC) (linear)
– Mutual Information (MI) (non-linear)
2b. Conditioned Networks
– Partial Pearson Correlation (PPC) (linear)
– Conditional Mutual Information (CMI) (non-linear)

[Figure: simulated data for each source of variability and the resulting networks of metabolites A to F.]

ESTIMATION OF CORRELATION NETWORKS

Real pathway: 1. ASPP, 2. ASA, 3. HS, 4. HSP

[Figure: networks among ASPP, ASA, HS and HSP estimated with PC, MI, PPC1, CMI1 and PPCn (rows) under Vmax variability, intrinsic variability and environmental variability (columns). Edge legend: 100%, > 90%, 10% … 90%, < 10%.]

PC: Pearson Correlation (linear measure)
MI: Entropy-based Mutual Information (non-linear measure)
PPC: Partial Pearson Correlation (linear conditioning measure)
CMI: Conditional Mutual Information (nonlinear conditioning measure)

Cakir, Metabolomics 2009
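The difference between a relevance network (Pearson correlation) and a conditioned network (partial correlation) can be shown on a toy chain A → B → C. The sketch below is an illustration on simulated data, not the cited study's code; it uses full-order partial correlations obtained from the inverse covariance matrix, and the variable names and noise levels are assumptions.

```python
import numpy as np

def partial_correlation(data):
    """Full-order partial correlations from the inverse covariance (precision) matrix."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Simulated chain: A drives B, B drives C; A and C are only linked through B
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
data = np.column_stack([a, b, c])

pearson = np.corrcoef(data, rowvar=False)   # A-C shows a strong indirect correlation
partial = partial_correlation(data)         # A-C nearly vanishes once B is conditioned on
```

Pearson correlation would draw an A-C edge, while the partial correlation keeps only the direct A-B and B-C links, which is the motivation for the conditioned networks.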

Metabolomics data fusion
• Account for between-block difference in quality of
measurements to improve data fusion

• For example, multi-platform data fusion, with differences in
quantification, (non) targeted, error structure

Amino acids Lipids

Fused data

• How to quantify the quality of measurements with many
metabolites, and many samples?

Error model for 1 metabolite
QC sample -> RSD

• Error models:
- RSD using 1 QC sample
- 2-component using study samples
• Good error description
- sufficient # samples
- large intensity range

[Figure: standard deviation (St.D) versus mean intensity I, for a QC sample and for study samples.]
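The two error models on this slide can be sketched as follows. The RSD is computed from replicate measurements of one QC sample. For the 2-component model the slides give no formula, so the sketch assumes the usual additive-plus-proportional form sd(I)^2 = a^2 + (b * I)^2; the function names and all numbers are illustrative.

```python
import numpy as np

def rsd(qc_replicates):
    """Relative standard deviation of repeated measurements of one QC sample."""
    return np.std(qc_replicates, ddof=1) / np.mean(qc_replicates)

def fit_two_component(mean_intensity, sd):
    """Fit sd^2 = a^2 + (b * I)^2 by linear least squares on I^2 (assumed model form)."""
    A = np.column_stack([np.ones_like(mean_intensity), mean_intensity**2])
    coef, *_ = np.linalg.lstsq(A, sd**2, rcond=None)
    a2, b2 = np.clip(coef, 0.0, None)     # guard against small negative estimates
    return np.sqrt(a2), np.sqrt(b2)

# Made-up QC replicates and a noise-free illustration of the 2-component fit
qc_rsd = rsd(np.array([98.0, 102.0, 100.0, 101.0, 99.0]))

intensity = np.array([1.0, 5.0, 20.0, 100.0, 500.0])
true_a, true_b = 2.0, 0.1
sd = np.sqrt(true_a**2 + (true_b * intensity)**2)
a_hat, b_hat = fit_two_component(intensity, sd)
```

The fit only constrains both components when the samples cover a wide intensity range, which matches the requirements listed on the slide.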

Figure of merit for data from 1 platform

Median: F-50 = 0.1
90th-percentile: F-90 = 0.35

[Figure: St.D versus I for individual variables (Var. 15, Var. 365, Var. 118, Var. 213), and the distribution over the number of peaks with F-50 and F-90 marked.]

(Van Batenburg et al. Analytical Chemistry, 2011)
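A hedged sketch of how such platform-level summaries could be computed: given one error estimate per peak (for example an RSD), F-50 and F-90 are taken as the median and the 90th percentile. Which per-peak quantity is summarized is an assumption here, and the input values are invented.

```python
import numpy as np

def figures_of_merit(peak_errors):
    """Median (F-50) and 90th percentile (F-90) of per-peak error estimates."""
    f50 = np.percentile(peak_errors, 50)
    f90 = np.percentile(peak_errors, 90)
    return f50, f90

# Invented per-peak error estimates: 0.01, 0.02, ..., 1.01
peak_errors = np.arange(1, 102) / 100.0
f50, f90 = figures_of_merit(peak_errors)
```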

Two-step data fusion

GC/MS: J1 = 82 peaks
LC/MS: J2 = 49 peaks

• Step 1:
Compute figures of merit for each platform

Two-step data fusion: MB-MLPCA
• Step 2: Multi-block PCA with weighting by figures of merit

[Figure: blocks X1 (amino acids) and X2 (lipids) with the fused error covariance.]

• Method needs good estimation of error variance by
– Repeats
– QC samples
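The weighting idea in step 2 can be approximated by a much simpler block-scaling sketch: each platform block is centered, divided by its figure of merit, and the concatenated blocks are decomposed by PCA. This is a stand-in for illustration and not MB-MLPCA itself, which works with the full error covariance. The block sizes follow the slides (82 and 49 peaks); the data, the merit values and the function name are hypothetical.

```python
import numpy as np

def weighted_multiblock_pca(blocks, merits, n_components=2):
    """Scale each centered block by 1 / figure of merit, concatenate, then PCA via SVD."""
    scaled = [(X - X.mean(axis=0)) / f for X, f in zip(blocks, merits)]
    fused = np.hstack(scaled)
    U, s, Vt = np.linalg.svd(fused, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings

# Random stand-ins for the GC/MS (82 peaks) and LC/MS (49 peaks) blocks, 30 samples
rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 82))
X2 = rng.normal(size=(30, 49))
scores, loadings = weighted_multiblock_pca([X1, X2], merits=[0.1, 0.2])
```

A block with a larger figure of merit (noisier measurements) is down-weighted, so it contributes less to the fused components.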

Realistic simulations using GCMS and LCMS data

[Figure: error variance estimated from duplicates compared with the true error variance.]

• Estimating variance from duplicates is problematic.
• Use a mix of QC samples and repeats.
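Why duplicates alone are a weak basis for error variance can be seen in a small simulation. For a duplicate pair the variance estimate is d^2 / 2, with d the difference between the two measurements, which has a single degree of freedom. The sketch below uses invented numbers and only illustrates this statistical point; it is not the simulation shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n_samples = 5000

# Two measurements of each sample with the same true error variance
dup = rng.normal(loc=100.0, scale=np.sqrt(true_var), size=(n_samples, 2))
diff = dup[:, 0] - dup[:, 1]

per_sample_var = diff**2 / 2.0        # one estimate per duplicate pair (1 degree of freedom)
pooled_var = per_sample_var.mean()    # pooling many pairs recovers the true variance
typical_single = np.median(per_sample_var)   # a typical single pair underestimates it
```

Individual pairs scatter widely around the true value, so pooling over many repeats or QC samples is needed before the estimate is stable.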
