Statistical methods in Metabolomics

David Moriña (a, b)
david.morina@uab.cat
(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona
May 08 2014, Reus

Contents
1 Introduction
2 Basic statistics
3 Available tools
4 R basics
5 LC/MS example
6 Further reading

Introduction
Where does the data come from?
Metabolomics
• Metabolomics is the analysis and study of the set of metabolites in a cell,
organ, or tissue.
• To detect and quantify metabolites, separation techniques such as gas or
liquid chromatography, followed by quantification by mass spectrometry
(GC/MS or LC/MS), are often used.
• Nuclear magnetic resonance (NMR) spectroscopy is also frequently employed
and has some appealing properties:
  • It is non-destructive, in the sense that it does not "destroy" the samples
  during the analysis process.
  • It is useful when analyzing tissues or when sequential analysis of samples
  is required.

Basic statistics
What do people say about statistics?
• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)
• Statistics are like bikinis: What they reveal is suggestive, but what they
hide is vital. (A. Levenstein)
• About 93% of all statistics are made up. (Any newspaper)

Basic statistics
Distributions and hypothesis testing
Normal distribution

Central Limit Theorem
Under fairly mild conditions (essentially, finite variance), the distribution
of the sum of independent and identically distributed random variables
approaches a normal distribution as the number of observations grows. In
practice the approximation is often good when the number of observations is
not too small.

The following example shows the distribution of the sum of the scores obtained
when rolling 1, 2, 3, 5 and 10 dice:
[Figures: distribution of the sum of the scores for 1, 2, 3, 5 and 10 dice]
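The dice example can be reproduced by simulation. The following is a minimal sketch in base R; the helper name dice_sums, the number of repetitions and the seed are illustrative choices and are not taken from the slides.

```r
# Simulate the sum of the scores of n_dice fair dice, repeated n_rep times.
# dice_sums is a hypothetical helper introduced for this illustration.
dice_sums <- function(n_dice, n_rep = 10000) {
  rolls <- matrix(sample(1:6, n_dice * n_rep, replace = TRUE), nrow = n_rep)
  rowSums(rolls)
}

set.seed(1)
n_values <- c(1, 2, 3, 5, 10)
sums <- lapply(n_values, dice_sums)

# One bar plot per number of dice: the shape gets closer to a bell curve
op <- par(mfrow = c(2, 3))
for (k in seq_along(n_values)) {
  barplot(table(sums[[k]]) / length(sums[[k]]),
          main = paste(n_values[k], "dice"),
          xlab = "Sum of scores", ylab = "Relative frequency")
}
par(op)
```

With a single die the distribution is flat; with 10 dice it is already close to a normal curve centred at 35.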

Testing normality
There are several methods to test whether a random variable follows a normal
distribution. Some of them are
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (QQ-plot, ...)
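The methods listed above are all available in base R. The sketch below applies them to simulated data; the sample sizes, distributions and variable names are assumptions made for illustration only.

```r
set.seed(2)
x_norm <- rnorm(200, mean = 10, sd = 2)   # simulated normal sample
x_skew <- rexp(200, rate = 1)             # simulated right-skewed sample

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
# (estimating the parameters from the data makes the p-value approximate)
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_skewed      = ks_skew$p.value)

# QQ-plot: points close to the line suggest approximate normality
qqnorm(x_skew)
qqline(x_skew)
```

A small p-value is evidence against normality; a large one does not prove normality, which is why the graphical check is a useful complement.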

Testing normality
Some mathematical functions can be applied in order to stabilize the variance
of a random variable and bring its distribution closer to normal:
• Log transformation
• Logit transformation
• Square root transformation

[Figure: density of the original variable (X scale) and of the log-transformed variable (log(X) scale)]
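The effect of these transformations can be sketched on simulated data. The log-normal sample and the simulated proportions below are assumptions chosen for illustration; they are not the data behind the figure.

```r
set.seed(3)
x <- rlnorm(500, meanlog = 0, sdlog = 1)  # simulated right-skewed variable

log_x  <- log(x)    # log transformation (requires x > 0)
sqrt_x <- sqrt(x)   # square root transformation (requires x >= 0)

p <- runif(500, 0.01, 0.99)               # simulated proportions in (0, 1)
logit_p <- qlogis(p)                      # logit: log(p / (1 - p))

op <- par(mfrow = c(1, 2))
plot(density(x), main = "Original variable", xlab = "X scale")
plot(density(log_x), main = "log-transformed variable", xlab = "log(X) scale")
par(op)

# Shapiro-Wilk p-values before and after the log transformation
c(original = shapiro.test(x)$p.value, log = shapiro.test(log_x)$p.value)
```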

Significance
The interpretation of the p-value is well known... Are you sure? Which of the
following statements are correct?
• Probability of a difference at least as large as the one observed if H0 is
true (by chance)
• Probability of making a mistake when rejecting H0
• Evidence against H0 provided by the sample. If the p-value is small, the
observed sample differences are unlikely to arise by chance
• Probability that the observed differences are false
• 1 - p-value = probability that the observed differences are real
Only the first and third statements are correct.
• The p-value is computed under the assumption that H0 is true, and
therefore it cannot provide direct information about the probability that H0
is true
• Scientists should decide on H0 based on the evidence against it that the
sample provides, without knowing the truth
• If a significance level α (confidence level 1 − α) is fixed:
p < α → statistically significant differences
p ≥ α → no statistically significant differences

Comparing two populations
• Student's t test was designed to compare two means.
• A t-test can also be used to determine whether 2 clusters are different.
[Figure: scatter plot of Value against Time for two groups of observations]
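A two-sample comparison of means can be sketched with t.test() from base R. The two groups below are simulated; their means, standard deviation and sizes are assumptions for illustration.

```r
set.seed(5)
group_a <- rnorm(30, mean = 10, sd = 5)   # simulated values, first group
group_b <- rnorm(30, mean = 25, sd = 5)   # simulated values, second group

# Welch's t test (R's default, does not assume equal variances)
res_welch <- t.test(group_a, group_b)

# Classical Student's t test, assuming equal variances
res_student <- t.test(group_a, group_b, var.equal = TRUE)

res_student$p.value < 0.05   # TRUE suggests the two means differ
```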

Comparing two populations (non-Gaussian)
• If the distribution of the variable of interest is not Gaussian, we can still
compare two populations by means of the Mann-Whitney U test (for
independent samples) or the Wilcoxon signed-rank test (for paired samples).
• Formally, when both distributions have the same shape, these non-parametric
tests compare two medians.
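Both tests are provided by wilcox.test() in base R. The sketch below uses simulated skewed data; the distributions and the size of the shift are assumptions for illustration.

```r
set.seed(6)
x <- rexp(25, rate = 1)        # simulated skewed sample, first population
y <- rexp(25, rate = 1) + 2    # second population, shifted by 2

# Mann-Whitney U test (called Wilcoxon rank-sum test in R), independent samples
mw <- wilcox.test(x, y)

# Wilcoxon signed-rank test, paired samples (same subjects measured twice)
before <- rexp(25, rate = 1)
after  <- before + rnorm(25, mean = 1, sd = 0.5)
sr <- wilcox.test(before, after, paired = TRUE)

c(mann_whitney = mw$p.value, signed_rank = sr$p.value)
```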

Comparing three (or more) populations
• If we want to compare more than two groups we can use analysis of variance
(ANOVA).
• Essentially, it is a generalization of Student's t test.
• Intra-group variances should be similar.
• Normality is not crucial.
• It only tells us whether at least one of the compared groups differs from
the others, not which one.

• ANOVA can also be used to determine whether 3 or more clusters are
different.
[Figure: scatter plot of Value against Time for three groups of observations]
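A one-way ANOVA can be sketched with aov() from base R. The three simulated groups and the use of Bartlett's test to check the variances are illustrative assumptions, not part of the slides.

```r
set.seed(7)
dat <- data.frame(
  value = c(rnorm(30, 10, 5), rnorm(30, 25, 5), rnorm(30, 40, 5)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Check that the within-group variances are similar (Bartlett's test)
bartlett.test(value ~ group, data = dat)

# One-way ANOVA: H0 is that all group means are equal
fit <- aov(value ~ group, data = dat)
summary(fit)

# p-value of the F test
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]
p_anova
```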

If H0 can be rejected, which group is different?
• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error αT.

Multiple comparisons
There are several methods to control the type I error:
• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (comparisons against a control)
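Some of these corrections are available in base R. The sketch below uses simulated data in which only group C differs; the data, group labels and seed are assumptions for illustration. Scheffé and Dunnett procedures are not in base R and are not shown.

```r
set.seed(7)
dat <- data.frame(
  value = c(rnorm(30, 10, 5), rnorm(30, 10, 5), rnorm(30, 25, 5)),
  group = factor(rep(c("A", "B", "C"), each = 30))
)
fit <- aov(value ~ group, data = dat)

# Tukey's honest significant differences for every pair of means
tukey <- TukeyHSD(fit)
tukey$group

# Pairwise t tests with Bonferroni and Holm adjusted p-values
pw_bonf <- pairwise.t.test(dat$value, dat$group,
                           p.adjust.method = "bonferroni")
pw_holm <- pairwise.t.test(dat$value, dat$group,
                           p.adjust.method = "holm")
pw_holm$p.value
```

Holm's adjusted p-values are never larger than Bonferroni's, so Holm is at least as powerful while controlling the same error rate.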

False Discovery Rate
Suppose you performed 100 different t-tests, and found 20 results with a
p-value < 0.05.
• How many of these 20 results are likely to be false positives?
• If all 100 null hypotheses were true, we would expect 100 · 0.05 = 5 results
with p < 0.05 by chance alone
• To correct for this, a Bonferroni correction considers as significant only
the results with a p-value < 0.05/100, that is, p < 0.0005
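Corrections for multiple testing are available through p.adjust() in base R. The sketch below simulates 100 t-tests; the split into 80 true nulls and 20 real effects is an assumption for illustration, and the Benjamini-Hochberg method is shown as one common way to control the false discovery rate, although the slides do not name it.

```r
set.seed(8)
# Simulated p-values: 80 tests where H0 is true, 20 with a real effect
p_null   <- replicate(80, t.test(rnorm(10), rnorm(10))$p.value)
p_effect <- replicate(20, t.test(rnorm(10), rnorm(10, mean = 2))$p.value)
p_raw <- c(p_null, p_effect)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # family-wise error rate
p_bh   <- p.adjust(p_raw, method = "BH")          # false discovery rate

# Number of results declared significant at 0.05 under each approach
c(raw        = sum(p_raw < 0.05),
  bonferroni = sum(p_bonf < 0.05),
  BH         = sum(p_bh < 0.05))
```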

Correlation
• If there is some dependency between two variables, a relationship between
the predicted and the observed variable, or an effect of a treatment between
“before” and “after” measurements, a clear pattern can often be seen in the
scatter plot
• This pattern or relationship is called correlation

[Figure: three scatter plots: positive correlation (y against x), negative correlation (z against x) and no correlation (t against x)]

The correlation coefficient (Pearson coefficient) is computed by means of
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$
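The Pearson coefficient can be computed directly from its definition and compared with the built-in cor() function. The simulated data and variable names below are assumptions for illustration.

```r
set.seed(9)
x <- rnorm(100)
y <- 5 + 0.8 * x + rnorm(100, sd = 0.5)   # simulated, positively related to x

# Pearson coefficient computed directly from the formula
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Same quantity with the built-in function
r_builtin <- cor(x, y, method = "pearson")

all.equal(r_manual, r_builtin)
```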

Correlation and significance
[Figure: two scatter plots, one with r = 0.98 and one with r = 0.22]
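A large correlation coefficient computed from very few points carries little evidence. The sketch below uses made-up toy values (not the data from the figure) and cor.test() from base R to show that the same r can be non-significant with 3 points and highly significant with 30.

```r
# Three points: large r, but little evidence against "no correlation"
x3 <- c(1, 2, 3)
y3 <- c(1, 2.5, 3)
ct3 <- cor.test(x3, y3)

# The same pattern observed ten times: same r, much smaller p-value
x30 <- rep(x3, 10)
y30 <- rep(y3, 10)
ct30 <- cor.test(x30, y30)

rbind(n3  = c(r = unname(ct3$estimate),  p = ct3$p.value),
      n30 = c(r = unname(ct30$estimate), p = ct30$p.value))
```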

Clustering
[Figure: scatter plots of example data for clustering]

[Figure: CLUSPLOT(mydata), Component 2 against Component 1. These two components explain 100 % of the point variability.]

• Clustering is a process by which objects that are logically similar in
characteristics are grouped together
• It is a step prior to classification
• It requires a method to measure similarity (a similarity matrix) or
dissimilarity (a dissimilarity coefficient) between objects
• It uses a threshold value to decide whether an object belongs to a
cluster
• There are several clustering methods, differing in how they start the
clustering process

• K-means algorithm: divides a set of N objects into M non-overlapping
clusters. M must be specified by the analyst
• Hierarchical clustering: produces a set of nested clusters in which each
pair of objects is progressively nested into a larger cluster until only one
cluster remains

K-means algorithm
• Choose M initial centroids (for example, M objects picked at random)
• Assign each object to the cluster with the closest centroid
• Recompute each centroid as the mean of the objects assigned to it
• Return to step 2 and repeat until the assignments no longer change

K-means algorithm example
# Read data
> st1 <- read.table("Data/global_afegits.csv", sep=";",
dec=",", header=TRUE)
# st2.ado is assumed to be a subset of st1 prepared beforehand
# Determine number of clusters (rows with missing values are dropped)
> x <- na.omit(st2.ado[,2:8])
> n <- nrow(x)
> wss <- numeric(10)
> wss[1] <- (n-1)*sum(apply(x,2,var))
> for (i in 2:10)
{
wss[i] <- sum(kmeans(x, centers=i)$withinss)
}
> plot(1:10,wss,type="b",xlab="Number of groups",
ylab="Within groups sum of squares")

[Figure: within groups sum of squares against number of groups (1 to 10)]

If we choose 5 clusters, then
> fit <- kmeans(st2.ado, 5)
will classify the observations into 5 groups
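Since the st2.ado data are not available here, the same steps can be sketched on the built-in USArrests data set. This is a self-contained variant under stated assumptions: the standardization, nstart = 20, the seed and the choice of 4 clusters are illustrative and not taken from the slides. Note that kmeans() in R requires the number of clusters to be fixed in advance.

```r
data(USArrests)
x <- scale(USArrests)          # standardize so that no variable dominates

set.seed(10)
# Within-groups sum of squares for 1 to 10 clusters
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")

# Fit the chosen solution (4 clusters is an arbitrary choice here)
fit <- kmeans(x, centers = 4, nstart = 20)
table(fit$cluster)             # cluster sizes
fit$centers                    # cluster centroids (standardized units)
```

The number of groups is usually chosen at the "elbow" of the curve, where adding another cluster no longer reduces the within-groups sum of squares by much.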

Hierarchical clustering
• Find the two closest objects and merge them into a cluster
• Find and merge the next two closest objects (or an object and a cluster,
or two clusters) using some similarity measure and a predefined
threshold
• If more than one cluster remains, return to step 2 until finished

Hierarchical clustering example
# Ward Hierarchical Clustering
> d <- dist(st2.ado, method = "euclidean")
> fit <- hclust(d, method="ward.D")
> plot(fit) # display dendrogram
> groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
> rect.hclust(fit, k=5, border="red")

[Figure: cluster dendrogram, hclust (*, "ward.D") on d; vertical axis: Height]
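The hierarchical clustering steps can also be run on the built-in USArrests data, as a self-contained sketch. The standardization, the choice of 4 clusters and the use of method = "ward.D2" (Ward's criterion applied to unsquared Euclidean distances, available in recent versions of R) are assumptions of this example; the slides use "ward.D".

```r
data(USArrests)
d <- dist(scale(USArrests), method = "euclidean")   # dissimilarity matrix

fit <- hclust(d, method = "ward.D2")   # agglomerative clustering
plot(fit, cex = 0.6)                   # dendrogram
rect.hclust(fit, k = 4, border = "red")

groups <- cutree(fit, k = 4)           # cluster membership for 4 clusters
table(groups)
```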

Validating cluster solutions
There are several methods to compare different clustering solutions to the
same problem.
• Hubert's gamma coefficient
• Dunn index
• Corrected Rand index
Some of them are implemented in the R package fpc
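As a sketch of how such indices might be obtained, the example below compares a k-means and a hierarchical solution on the built-in USArrests data with cluster.stats() from fpc. It assumes that fpc is installed (the call is skipped otherwise); the data set, the number of clusters and the seed are illustrative choices.

```r
data(USArrests)
x <- scale(USArrests)
d <- dist(x)

set.seed(11)
cl_kmeans <- kmeans(x, centers = 4, nstart = 20)$cluster
cl_hclust <- cutree(hclust(d, method = "ward.D2"), k = 4)

# Cross-tabulation of the two solutions
table(kmeans = cl_kmeans, hclust = cl_hclust)

if (requireNamespace("fpc", quietly = TRUE)) {
  # Validation statistics for the k-means solution, compared with hclust
  stats <- fpc::cluster.stats(d, cl_kmeans, alt.clustering = cl_hclust)
  print(c(hubert_gamma   = stats$pearsongamma,
          dunn           = stats$dunn,
          corrected_rand = stats$corrected.rand))
}
```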

Multivariate statistics
• Multivariate statistics means dealing with several variables at the same
time
• Multivariate problems require more complex, multidimensional analyses
or dimension reduction methods
• Metabolomics experiments typically measure many metabolites at once;
in other words, the instruments measure multiple variables, so
metabolomic data are inherently multivariate
• The key trick in multivariate statistics is to find a way to effectively
reduce the multivariate data to univariate data
• Then we can apply the same univariate concepts, such as p-values,
t-tests and ANOVA, to the data

Principal Component Analysis
• PCA is a process that transforms a number of possibly correlated
variables into a smaller number of uncorrelated variables called principal
components
• PCA captures what should be visually detectable
• If you can’t see it, PCA probably won’t help

> data(USArrests)
> pc.cr <- princomp(USArrests, cor = TRUE)
> biplot(pc.cr)
[Figure: biplot of the first two principal components (Comp.1, Comp.2) of USArrests, showing the states and the variables Murder, Assault, UrbanPop and Rape]
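To see how much of the variability each component captures, the same analysis can be sketched with prcomp(), an alternative to princomp() in base R based on the singular value decomposition. Using prcomp() and standardizing the variables are choices of this example.

```r
data(USArrests)
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(cumsum(var_explained), 3)

head(pc$x[, 1:2])   # scores of the first states on the first two components
pc$rotation         # loadings of the original variables
```

If the first two components explain most of the variance, the biplot is a faithful two-dimensional summary of the four original variables.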

Other methods
There are several other multivariate methods with increasing usage in
metabolomics and related fields
• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling

Available tools
How to analyze data?
R
• R is a freely available language and environment for statistical computing
and graphics.
• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• It can be downloaded from http://cran.r-project.org.
Bioconductor
• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• It provides tools for the analysis and comprehension of high-throughput
genomic data.
• It has mailing lists and a very active community of users and developers.

[Figure: Bioconductor submitted packages]

Installation of Bioconductor packages
Bioconductor can be installed from within an R session with
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install()

R basics
Getting help
• ?mean
• help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)

R packages for metabolomics
Useful packages
There are a number of useful packages in Bioconductor for metabolomics
data analysis.
• flagme: Analysis of metabolomics GC/MS data
• xcms: Analysis of metabolomics LC/MS and GC/MS data

LC/MS example
xcms
• Can read data stored in several formats, such as NetCDF, mzXML, mzData and
mzML.
• Provides methods for feature detection, non-linear retention time
alignment, visualization, relative quantitation and statistics.
• Is capable of simultaneously preprocessing, analyzing, and visualizing
the raw data from hundreds of samples.
• Is available as an R package or as an online platform accessible through
https://xcmsonline.scripps.edu/.

[Figure: typical xcms workflow]

Reading the data
# Use BiocManager to install Bioconductor packages
> if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
# Install the xcms package
> BiocManager::install("xcms")
# Install dataset package used in this session
> BiocManager::install("faahKO")

The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type
and 6 FAAH knockout mice. The data are a subset of the original data, covering
200-600 m/z and 2500-4500 seconds. They were collected in positive ionization
mode.
# Load libraries
> library("xcms")
> library("faahKO")

> cdfpath <- system.file("cdf",package="faahKO")
> files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
> data <- xcmsSet(files)
Some important parameters
• scanrange=c(lower, upper): to scan part of the spectra
• fwhm=seconds: specify full width at half maximum (default 30 s)
based on the type of chromatography
• method="centWave": use wavelet algorithm for peak detection,
suitable for high resolution spectra

Peak alignment and retention time correction
> xsg <- group(data) # peak alignment
> xsg <- retcor(xsg) # retention time correction
> xsg <- group(xsg) # re-align
• Match peaks across samples
• Use the peak groups to correct retention time drift
• Re-do the alignment
• These steps can be performed iteratively until there is no further change

[Figure: retention time deviation against retention time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22, with the peak density (all peaks and those used for correction) shown below]

Filling in missing peaks
> xsg <- fillPeaks(xsg)
• A significant number of potential peaks can be missed during peak
detection
• Missing values are problematic for robust statistical analysis
• After alignment we have a better idea about where to expect real peaks and
their boundaries
• fillPeaks() re-scans the raw spectra and integrates peaks in the regions of
the missing peaks

Results of peak detection
> peaks(xsg)
The peaks() function gives a list of peaks with
• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)

Statistical analysis
> report <- diffreport(xsg, "WT", "KO")
• The diffreport() function computes Welch's two-sample t-statistic for
each analyte and ranks them by p-value.
• It returns a summary report.
• Multivariate analysis and visualization can be performed using
MetaboAnalyst.
• The report generated by diffreport() can be directly uploaded to
MetaboAnalyst.
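The kind of per-analyte comparison described above can be sketched in base R on a simulated intensity matrix. This is not the code of diffreport(); the matrix, the sample names, the 10 features with a true change and the added Benjamini-Hochberg adjustment are all assumptions made for illustration.

```r
set.seed(12)
n_features <- 200
# Simulated log-intensity matrix: rows are features, columns are samples
intens <- matrix(rnorm(n_features * 12, mean = 10, sd = 1),
                 nrow = n_features,
                 dimnames = list(paste0("feature", 1:n_features),
                                 c(paste0("wt", 1:6), paste0("ko", 1:6))))
grp <- factor(rep(c("WT", "KO"), each = 6))
# Add a true change in the KO samples for the first 10 features
intens[1:10, grp == "KO"] <- intens[1:10, grp == "KO"] + 3

# Welch's two-sample t test for each feature
welch <- t(apply(intens, 1, function(v) {
  tt <- t.test(v[grp == "KO"], v[grp == "WT"])
  c(tstat = unname(tt$statistic), pvalue = tt$p.value)
}))

# Rank features by p-value and adjust for multiple testing
report <- as.data.frame(welch[order(welch[, "pvalue"]), ])
report$padj_BH <- p.adjust(report$pvalue, method = "BH")
head(report, 10)
```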

Visualizing important peaks
# Select the first peak group with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
> gr <- groups(xsg)
> groupidx <- which(gr[,"rtmed"]>3300 &
gr[,"rtmed"]<3400 &
gr[,"npeaks"]>=8)[1]
> eiccor <- getEIC(xsg, groupidx=groupidx)
> plot(eiccor, col=as.numeric(phenoData(xsg)$class))
• When significant peaks are identified, it is critical to visualize these
peaks to assess quality
• This is done using the Extracted Ion Chromatogram (EIC)

[Figure: extracted ion chromatogram, 300.1 - 300.2 m/z; intensity against retention time (seconds)]

Further reading
Some references
• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding
false discoveries in metabolomics and related experiments.
Metabolomics, 2 (4), 171–196.
• Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics.
Current metabolomics, 1 (1), 92–107.
• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D.
(2009). Analytical and statistical approaches to metabolomics research.
Journal of separation science, 32, 2183–2199.
• Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R
package documentation.
• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in
metabolomics. Methods in molecular biology, 856 (Evolutionary
genomics), 381–413. Springer.

Centre for Research in Environmental Epidemiology
Parc de Recerca Biomèdica de Barcelona
Doctor Aiguader, 88
08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00
Fax (+34) 93 214 73 02
info@creal.cat
www.creal.cat
