Statistical methods in metabolomics
David Moriña (a, b)
david.morina@uab.cat
(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona
May 08 2014, Reus
Contents
1 Introduction
2 Basic statistics
3 Available tools
4 R basics
5 LC/MS example
6 Further reading
Introduction
Where does data come from?
Metabolomics
• Metabolomics is the analysis and study of the set of metabolites in a cell,
organ, or tissue.
• To detect and quantify metabolites, separation techniques such as gas or liquid
chromatography, followed by quantification by mass spectrometry
(GC/MS or LC/MS), are often used.
• Nuclear magnetic resonance (NMR) spectroscopy is also frequently employed
and has some appealing properties:
  • It is non-destructive, in the sense that it does not "destroy" the samples
  during the analysis process.
  • It is useful when analyzing tissues or when sequential analysis of samples is
  required.
Basic statistics
What do people say about statistics?
• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)
• Statistics are like bikinis: What they reveal is suggestive, but what they
hide is vital. (A. Levenstein)
• About 93% of all statistics are made up. (Any newspaper)
Distributions and hypothesis testing
Normal distribution
Central Limit Theorem
Under fairly mild conditions, the distribution of the sum of independent and
identically distributed random variables tends to a normal distribution,
provided the number of observations is not too small.
The following example shows the distribution of the sum of the scores obtained
when rolling 1, 2, 3, 5 and 10 dice:
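The dice example can be reproduced with a short simulation. This is a minimal sketch, not the code behind the original figure: the helper name roll_sums and the number of simulated rolls are assumptions made for illustration. A single fair die has mean 3.5 and variance 35/12, so the sum of k dice should have mean 3.5k and standard deviation sqrt(35k/12).

```r
set.seed(1)

# Sum of the scores of n_dice fair dice, repeated n_rolls times
# (roll_sums is a hypothetical helper, not part of any package)
roll_sums <- function(n_dice, n_rolls = 10000) {
  rolls <- matrix(sample(1:6, n_dice * n_rolls, replace = TRUE),
                  ncol = n_dice)
  rowSums(rolls)
}

# Compare simulated and theoretical mean and standard deviation
for (k in c(1, 2, 3, 5, 10)) {
  s <- roll_sums(k)
  cat(sprintf("%2d dice: mean = %.2f (theory %.1f), sd = %.2f (theory %.2f)\n",
              k, mean(s), 3.5 * k, sd(s), sqrt(k * 35 / 12)))
}

# Keep the 10-dice sums for plotting, e.g. hist(sums10, breaks = 9.5:60.5)
sums10 <- roll_sums(10)
```

A histogram of each set of sums shows the flat shape for one die turning into a bell shape as more dice are added.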
Testing normality
There are several methods to test whether a random variable follows a normal
distribution. Some of them are:
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (Q-Q plot, ...)
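The tests listed above are available in base R. The sketch below applies them to simulated data (the two samples are assumptions made for illustration): one sample drawn from a normal distribution and one from a strongly skewed log-normal distribution.

```r
set.seed(42)

# Two illustrative samples: one normal, one clearly non-normal
x_norm <- rnorm(200, mean = 10, sd = 2)
x_skew <- rlnorm(200, meanlog = 0, sdlog = 1)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal distribution.
# Estimating mean and sd from the same sample makes the test
# conservative, so treat this p-value as approximate.
ks_norm <- ks.test(x_norm, "pnorm", mean = mean(x_norm), sd = sd(x_norm))

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_normal      = ks_norm$p.value)
```

For the graphical check, qqnorm(x_skew) followed by qqline(x_skew) draws the Q-Q plot; points that bend away from the line indicate departure from normality.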
Some mathematical functions can be applied in order to stabilize the variance
of a random variable:
• Log transformation
• Logit transformation
• Square root transformation
[Figure: density of the original variable (X scale) and density of the log-transformed variable (log(X) scale).]
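The effect of a log transformation can be checked numerically. The sketch below uses simulated log-normal data, which is an assumption (the slides do not say which data were plotted), and a small hand-written skewness helper, since base R does not provide one.

```r
set.seed(7)

# Right-skewed illustrative data (assumed log-normal)
x <- rlnorm(500, meanlog = -1, sdlog = 1.2)

# Simple moment-based sample skewness (hypothetical helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)       # strongly positive: long right tail
skew_log <- skewness(log(x))  # close to zero: roughly symmetric

c(raw = skew_raw, log_transformed = skew_log)
```

Plotting density(x) and density(log(x)) side by side gives a picture comparable to the figure above.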
Significance
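The interpretation of the p-value is well known... or is it? Consider the following candidate statements:
• Probability of a difference at least as large as the one observed if H0 is true
(that is, by chance)
• Probability of making a mistake when rejecting H0
• Evidence against H0 provided by the sample. If the p-value is small, the
observed sample differences are unlikely to arise by chance
• Probability that the observed differences are false
• 1 - p-value = Probability that the observed differences are real
Only the first and the third statements are correct. The others treat the p-value as a probability about H0 itself, which it is not.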
• The p-value is computed under the assumption that H0 is true, and
therefore it cannot directly tell us how likely H0 is to be true
• The scientist has to decide about H0 based on the evidence against it that
the sample provides, without knowing the true state of reality
• If a confidence level 1 − α is fixed:
p < α → statistically significant differences
p ≥ α → no statistically significant differences
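A small simulation can make the decision rule concrete. In the sketch below (sample sizes and number of replicates are assumptions), both groups are drawn from the same distribution, so H0 is true in every experiment. The rule p < α should then declare a "significant" difference in roughly a fraction α of the experiments.

```r
set.seed(123)

alpha <- 0.05

# 2000 simulated experiments in which H0 is true:
# both samples come from the same standard normal distribution
p_null <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Proportion of experiments wrongly declared significant
false_positive_rate <- mean(p_null < alpha)
false_positive_rate
```

This is the type I error rate: it is controlled by α and says nothing about how likely H0 is in any single experiment.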
Comparing two populations
• Student’s t test was designed to compare two means.
• A t-test can also be used to determine whether 2 clusters are different.
[Figure: scatter plot of Value against Time.]
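A two-sample comparison of means can be run with t.test(). The two groups below are simulated and all their parameters are assumptions for illustration. Note that t.test() uses Welch's version (unequal variances) by default; the classical Student's t test needs var.equal = TRUE.

```r
set.seed(10)

# Two illustrative groups with different means
group_a <- rnorm(30, mean = 10, sd = 3)
group_b <- rnorm(30, mean = 20, sd = 3)

# Welch's t test (default) and classical Student's t test
res_welch   <- t.test(group_a, group_b)
res_student <- t.test(group_a, group_b, var.equal = TRUE)

c(welch_p = res_welch$p.value, student_p = res_student$p.value)
res_welch$estimate  # the two sample means being compared
```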
Comparing two populations (non-gaussian)
• If the distribution of the variable of interest is not Gaussian, we can still
compare two populations by means of the Mann-Whitney U test (for
independent samples) or the Wilcoxon signed-rank test (for paired samples).
• These non-parametric tests are usually interpreted as comparing two medians.
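In R both tests are provided by wilcox.test(). The sketch below uses simulated skewed data (all values are assumptions): the call without paired = TRUE is the Mann-Whitney U test for independent samples, and the call with paired = TRUE is the signed-rank test for paired samples.

```r
set.seed(5)

# Independent samples from a skewed (exponential) distribution;
# the treated group is shifted upwards by 2 units
ctrl <- rexp(25, rate = 1)
trt  <- rexp(25, rate = 1) + 2
mw <- wilcox.test(ctrl, trt)                    # Mann-Whitney U test

# Paired samples: each "after" value belongs to the same subject
before <- rexp(25, rate = 1)
after  <- before + runif(25, min = 0.5, max = 1.5)
sr <- wilcox.test(before, after, paired = TRUE) # signed-rank test

c(mann_whitney_p = mw$p.value, signed_rank_p = sr$p.value)
```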
Comparing three (or more) populations
• If we want to compare more than two groups, we can use analysis of variance
(ANOVA).
• Essentially, it is a generalization of Student’s t test.
• The within-group variances should be similar.
• Normality is not crucial.
• It only tells us whether at least one of the compared groups differs from the
others, not which one.
• ANOVA can also be used to determine whether 3 or more clusters are
different.
[Figure: scatter plot of Value against Time.]
If H0 is rejected, which group is the different one?
• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error rate α_T.
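The two stages (global ANOVA, then pairwise a posteriori comparisons) can be sketched in base R with aov() and TukeyHSD(). The three simulated groups below are assumptions for illustration: groups A and B share the same mean and group C has a higher one.

```r
set.seed(2014)

# Three illustrative groups; only C has a different mean
dat <- data.frame(
  value = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 15, 2)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

# Global test: is at least one group mean different?
fit <- aov(value ~ group, data = dat)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pairwise comparisons with Tukey's correction
tukey <- TukeyHSD(fit)

anova_p
tukey$group  # columns: diff, lwr, upr, p adj
```

The rows "C-A" and "C-B" of the Tukey table identify C as the group that differs.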
Multiple comparisons
There are several methods to control type I error:
• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (comparisons against a control group)
False Discovery Rate
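Suppose you performed 100 different t-tests and found 20 results with a
p-value < 0.05.
• How many of these 20 results are likely to be false positives?
• If all 100 null hypotheses were true, we would expect 100 · 0.05 = 5 false positives
• To correct for this, a Bonferroni-type adjustment divides the threshold by
the number of tests performed and considers as significant only the results
with a p-value < 0.05/100, that is, p < 0.0005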
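Rather than adjusting the threshold by hand, the p-values themselves can be adjusted with the base R function p.adjust(). The p-values below are made up for illustration. The sketch compares two methods that control the family-wise error rate (Bonferroni and Holm) with the Benjamini-Hochberg method ("BH"), which controls the false discovery rate.

```r
# Illustrative raw p-values (invented for this example)
p <- c(0.0004, 0.003, 0.012, 0.03, 0.04, 0.20, 0.65)

adj <- data.frame(
  raw        = p,
  bonferroni = p.adjust(p, method = "bonferroni"),
  holm       = p.adjust(p, method = "holm"),
  BH         = p.adjust(p, method = "BH")
)
adj

# Number of results still significant at 0.05 under each method
colSums(adj < 0.05)
```

Adjusted p-values are compared directly with the original α. The BH column is never larger than the Holm column, which is never larger than the Bonferroni column, so controlling the false discovery rate keeps more findings.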
Correlation
• If there is some dependency between two variables (for example, between a
predicted and an observed variable, or between "before" and "after"
measurements when a treatment has had some effect), a clear pattern can
usually be seen in the scatter plot
• This pattern or relationship is called correlation
[Figure: three scatter plots: positive correlation (y against x), negative correlation (z against x) and no correlation (t against x).]
The correlation coefficient (Pearson coefficient) is computed by means of
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
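The Pearson coefficient is the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The sketch below computes it by hand and compares it with R's built-in cor(). The two small vectors are invented for illustration.

```r
# Illustrative data (invented)
x <- c(1.2, 2.4, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 2.9, 4.2, 5.1, 6.8, 7.0)

# Pearson correlation computed term by term from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```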
Correlation and significance
[Figure: two scatter plots with few points, one with r = 0.98 and one with r = 0.22.]
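The details of the figure are not recoverable, but one point usually made under this heading can be sketched with cor.test(): the size of r alone does not determine significance, because the number of observations also matters. Both data sets below are assumptions made for illustration.

```r
# Three points with a very high correlation
small <- cor.test(c(1, 2, 3), c(1, 2.2, 2.8))

# Many points with a weak correlation
set.seed(3)
x <- rnorm(500)
y <- 0.25 * x + rnorm(500)
large <- cor.test(x, y)

data.frame(
  n = c(3, 500),
  r = c(small$estimate, large$estimate),
  p = c(small$p.value, large$p.value)
)
```

With only three points even r close to 1 is not significant at the 0.05 level, while a weak correlation becomes clearly significant with 500 points.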
Clustering
[Figure: scatter plot of the example data.]
[Figure: CLUSPLOT(mydata), Component 2 against Component 1. These two components explain 100 % of the point variability.]
• Clustering is a process by which objects with similar characteristics are
grouped together
• It is often a preliminary step before classification
• It requires a method to measure similarity (a similarity matrix) or
dissimilarity (a dissimilarity coefficient) between objects
• It uses a threshold value to decide whether an object belongs to a
cluster
• There are several clustering methods, differing in how they start the
clustering process
• K-means algorithm: divides a set of N objects into M clusters, with or
without overlap. M must be specified by the analyst
• Hierarchical clustering: produces a set of nested clusters in which each
pair of objects is progressively nested into a larger cluster until only one
cluster remains
K-means algorithm
1. Make the first object the centroid for the first cluster
2. For the next object, calculate the similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing
cluster and redetermine the centroid; otherwise use the object to start a new
cluster
4. Return to step 2 and repeat until done
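The steps above can be sketched directly in R. This is a minimal illustration under stated assumptions, not a library routine: the function name threshold_cluster and its arguments are hypothetical, and Euclidean distance is used as the inverse of similarity, so "similarity greater than a threshold" becomes "distance below a threshold". R's built-in kmeans() works differently: it takes a fixed number of centers and reassigns all objects iteratively.

```r
# Sequential, threshold-based clustering following the listed steps.
# threshold_cluster is a hypothetical helper written for this sketch.
threshold_cluster <- function(x, threshold) {
  x <- as.matrix(x)
  # Step 1: the first object is the centroid of the first cluster
  centroids <- x[1, , drop = FALSE]
  labels <- integer(nrow(x))
  labels[1] <- 1L
  for (i in seq_len(nrow(x))[-1]) {
    # Step 2: distance from object i to each existing centroid
    diffs <- centroids - matrix(x[i, ], nrow(centroids), ncol(x), byrow = TRUE)
    d <- sqrt(rowSums(diffs^2))
    k <- which.min(d)
    if (d[k] <= threshold) {
      # Step 3a: join the closest cluster and recompute its centroid
      labels[i] <- k
      centroids[k, ] <- colMeans(x[labels == k, , drop = FALSE])
    } else {
      # Step 3b: start a new cluster with this object
      centroids <- rbind(centroids, x[i, ])
      labels[i] <- nrow(centroids)
    }
  }
  list(cluster = labels, centers = centroids)
}

# Usage on two well-separated simulated groups
set.seed(99)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
res <- threshold_cluster(pts, threshold = 2)
table(res$cluster)
```

The result depends on the order of the objects and on the threshold, which is one reason iterative schemes with a fixed number of clusters are often preferred in practice.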
K-means algorithm example
# Read data
> st1 <- read.table("Data/global_afegits.csv", sep=";",
+                   dec=",", header=TRUE)
# st2.ado is assumed to be a subset of st1; its construction
# is not shown in the slides
# Determine number of clusters (complete cases only)
> dat <- na.omit(st2.ado[,2:8])
> n <- nrow(dat)
> wss <- numeric(10)
> wss[1] <- (n-1)*sum(apply(dat,2,var))
> for (i in 2:10)
+ {
+   wss[i] <- sum(kmeans(dat, centers=i)$withinss)
+ }
> plot(1:10,wss,type="b",xlab="Number of groups",
+      ylab="Within groups sum of squares")
[Figure: within groups sum of squares against number of groups (1 to 10).]
If we choose 5 clusters, then
> fit <- kmeans(st2.ado, 5)
will classify the observations into the 5 groups
Hierarchical clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster,
or two clusters) using some similarity measure and a predefined
threshold
3. If more than one cluster remains, return to step 2 until finished
Hierarchical clustering example
# Ward Hierarchical Clustering
> d <- dist(st2.ado, method = "euclidean")
> fit <- hclust(d, method="ward.D")
> plot(fit) # display dendrogram
> groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
> rect.hclust(fit, k=5, border="red")
[Figure: Cluster Dendrogram of d, hclust (*, "ward.D"); vertical axis: Height.]
Validating cluster solutions
There are several methods to compare different clustering solutions to the
same problem.
• Hubert’s gamma coefficient
• Dunn index
• Corrected Rand index
Some of them are implemented in the R package fpc.
Multivariate statistics
• Multivariate statistics means dealing with several variables at the same
time
• Multivariate problems require more complex, multidimensional analyses
or dimension reduction methods
• Metabolomics experiments typically measure many metabolites at once,
in other words the instruments are measuring multiple variables and so
metabolomic data are inherently multivariate data
• The key trick in multivariate statistics is to find a way that effectively
reduces the multivariate data into univariate data
• Then we can apply the same univariate concepts such as p-values,
t-tests and ANOVA tests to the data
Principal Component Analysis
• PCA is a process that transforms a number of possibly correlated
variables into a smaller number of uncorrelated variables called principal
components
• PCA captures what should be visually detectable
• If you can’t see it, PCA probably won’t help
> data(USArrests)
> pc.cr <- princomp(USArrests, cor = TRUE)
> biplot(pc.cr)
[Figure: biplot of the first two principal components (Comp.1, Comp.2) of USArrests, showing the states and the variables Murder, Assault, UrbanPop and Rape.]
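Besides the biplot, it is often useful to check how much of the total variance each principal component explains. The sketch below uses prcomp() on the same built-in USArrests data. This is a swapped-in alternative to the princomp() call in the example, with scale. = TRUE playing the role of cor = TRUE.

```r
# PCA on standardized variables (equivalent in spirit to cor = TRUE)
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
pve <- pca$sdev^2 / sum(pca$sdev^2)

round(pve, 3)          # per component
round(cumsum(pve), 3)  # cumulative

# Loadings: contribution of each original variable to each component
pca$rotation
```

If the first two components account for most of the variance, a two-dimensional plot such as the biplot loses little information.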
Other methods
There are several other multivariate methods, increasingly used in metabolomics
and related fields:
• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling
Available tools
How to analyze data?
R
• R is a freely available language and environment for statistical computing
and graphics.
• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• It can be downloaded from http://cran.r-project.org.
Bioconductor
• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• It provides tools for the analysis and comprehension of high-throughput
genomic data.
• It has mailing lists and a very active community of users and developers.
Bioconductor
[Figure: Bioconductor submitted packages.]
Installation of Bioconductor packages
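Bioconductor can be installed from within the R session. At the time of this talk:
source("http://bioconductor.org/biocLite.R")
biocLite()
Current Bioconductor releases have replaced biocLite() with the BiocManager package:
install.packages("BiocManager")
BiocManager::install()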
R basics
Getting help
• ?mean
• help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
R packages for metabolomics
Useful packages
There are a number of useful packages in Bioconductor for metabolomics
data analysis.
• flagme: Analysis of metabolomics GC/MS data
• xcms: Analysis of metabolomics LC/MS and GC/MS data
LC/MS example
xcms
• It can read data stored in several formats, such as NetCDF, mzXML, mzData
and mzML.
• It provides methods for feature detection, non-linear retention time
alignment, visualization, relative quantitation and statistics.
• It is capable of simultaneously preprocessing, analyzing, and visualizing
the raw data from hundreds of samples.
• It is available as an R package or as an online platform accessible through
https://xcmsonline.scripps.edu/.
[Figure: typical xcms workflow.]
Reading the data
# use biocLite to install a Bioconductor package
# (current Bioconductor releases use BiocManager::install() instead)
> source("http://bioconductor.org/biocLite.R")
# Install the xcms package
> biocLite("xcms")
# Install dataset package used in this session
> biocLite("faahKO")
The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type
and 6 FAAH knockout mice. The data are a subset of the original data, restricted
to 200-600 m/z and 2500-4500 seconds. They were collected in positive ionization
mode.
# Load libraries
> library("xcms")
> library("faahKO")
> cdfpath <- system.file("cdf",package="faahKO")
> files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
> data <- xcmsSet(files)
Some important parameters
• scanrange = c(lower, upper): scan only part of the spectra
• fwhm = seconds: specify the full width at half maximum (default 30 s),
based on the type of chromatography
• method = "centWave": use a wavelet-based algorithm for peak detection,
suitable for high-resolution spectra
Peak alignment and retention time correction
> xsg <- group(data) # peak alignment
> xsg <- retcor(xsg) # retention time correction
> xsg <- group(xsg) # re-align
• Matching peaks across samples
• Using the peak groups to correct drift
• Re-do the alignment
• Can be performed iteratively until no further change
[Figure: Retention Time Deviation vs. Retention Time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22 (upper panel), and Peak Density against Retention Time, 2500 to 4500 seconds, for all peaks and for those used in the correction (lower panel).]
Filling in missing peaks
> xsg <- fillPeaks(xsg)
• A significant number of potential peaks can be missed during peak
detection
• Missing values are problematic for robust statistical analysis
• We now have a better idea about where to expect real peaks and their
boundaries
• Re-scan the raw spectra and integrate peaks in the regions of the
missing peaks
Results of peak detection
> peaks(xsg)
The peaks() function gives a table of peaks with:
• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)
Statistical analysis
> report <- diffreport(xsg, "WT", "KO")
• The diffreport() function computes Welch’s two-sample t-statistic for
each analyte and ranks the analytes by p-value.
• It returns a summary report.
• Multivariate analysis and visualization can be performed using
MetaboAnalyst
• The report generated by diffreport() can be directly uploaded to
MetaboAnalyst
Visualizing important peaks
# Select peaks with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
> gr <- groups(xsg)
> groupidx <- which(gr[,"rtmed"]>3300 &
+                   gr[,"rtmed"]<3400 &
+                   gr[,"npeaks"]>=8)[1]
> eiccor <- getEIC(xsg, groupidx=groupidx)
> plot(eiccor, col=as.numeric(phenoData(xsg)$class))
• When significant peaks are identified, it is critical to visualize these
peaks to assess quality
• This is done using the Extracted Ion Chromatogram (EIC)
[Figure: Extracted Ion Chromatogram, 300.1 − 300.2 m/z; Intensity against Retention Time (seconds).]
Further reading
Some references
• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding
false discoveries in metabolomics and related experiments.
Metabolomics, 2 (4), 171–196.
• Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics.
Current metabolomics, 1 (1), 92–107.
• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D.
(2009). Analytical and statistical approaches to metabolomics research.
Journal of separation science, 32, 2183–2199.
• Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R
package documentation.
• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in
metabolomics. Methods in molecular biology, 856 (Evolutionary
genomics), 381–413. Springer.
Centre for Research
in Environmental
Epidemiology
Parc de Recerca Biomèdica de Barcelona
Doctor Aiguader, 88
08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00
Fax (+34) 93 214 73 02
info@creal.cat
www.creal.cat

Statistical methods in Metabolomics

  • 1. Statistical methods in metabolomics David Moriñaa,b david.morina@uab.cat a Centre for Research in Environmental Epidemiology (CREAL) b Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona May 08 2014, Reus
  • 2. Statistical methods in metabolomics Contents 1 Introduction 2 Basic statistics 3 Available tools 4 R basics 5 LC/MS example 6 Further reading 2 / 66
  • 3. Statistical methods in metabolomics Introduction Where does data come from? Metabolomics • Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue. • To detect and quantify metabolites, separation techniques like gas or li- quid chromatography, followed by quantification by mass spectrometry (GC/MS, or LC/MS) are often used. • Nuclear magnetic resonance spectroscopy (NMR) is also frequently em- ployed and has some appealing properties: 3 / 66
  • 4. Statistical methods in metabolomics Introduction Where does data come from? Metabolomics • Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue. • To detect and quantify metabolites, separation techniques like gas or li- quid chromatography, followed by quantification by mass spectrometry (GC/MS, or LC/MS) are often used. • Nuclear magnetic resonance spectroscopy (NMR) is also frequently em- ployed and has some appealing properties: • Is non-destructive, in the sense that it does not “destroy” the samples during the analysis process. 3 / 66
  • 5. Statistical methods in metabolomics Introduction Where does data come from? Metabolomics • Metabolomics is the analysis and study of the set of metabolites in a cell, organ, or tissue. • To detect and quantify metabolites, separation techniques like gas or li- quid chromatography, followed by quantification by mass spectrometry (GC/MS, or LC/MS) are often used. • Nuclear magnetic resonance spectroscopy (NMR) is also frequently em- ployed and has some appealing properties: • Is non-destructive, in the sense that it does not “destroy” the samples during the analysis process. • Is useful when analyzing tissues or when sequential analysis of samples is required. 3 / 66
  • 6. Statistical methods in metabolomics Basic statistics What does people say about statistics? • There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli) 4 / 66
  • 7. Statistical methods in metabolomics Basic statistics What does people say about statistics? • There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli) • Statistics are like bikinis: What they reveal is suggestive, but what they hide is vital. (A. Levenstein) 4 / 66
  • 8. Statistical methods in metabolomics Basic statistics What does people say about statistics? • There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli) • Statistics are like bikinis: What they reveal is suggestive, but what they hide is vital. (A. Levenstein) • About 93% of all statistics are made up. (Any newspaper) 4 / 66
  • 9. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Normal distribution 5 / 66
  • 10. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Normal distribution 6 / 66
  • 11. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Central Limit Theorem Under some conditions (not much demanding), the distribution of the sum of independent and identically distributed random variables tends to normal distribution if the number of observations is not too small. 7 / 66
  • 12. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Central Limit Theorem The following example shows the distribution of the sum of the scores obtained when rolling 1, 2, 3, 5 and 10 dices: 8 / 66
  • 13. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Central Limit Theorem The following example shows the distribution of the sum of the scores obtained when rolling 1, 2, 3, 5 and 10 dices: 8 / 66
  • 14. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Central Limit Theorem [Figure]
  • 16. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Testing normality There are several methods to check whether a random variable follows a normal distribution. Some of them are:
      • Kolmogorov-Smirnov test
      • Shapiro-Wilk test
      • Graphical methods (QQ-plot, ...)
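The three approaches listed above are available in base R. The following is a minimal sketch on simulated data (the sample sizes and distribution parameters are assumptions); note that when the mean and standard deviation are estimated from the same sample, the Kolmogorov-Smirnov p-value is only approximate.

```r
set.seed(42)                                    # arbitrary seed
x_norm <- rnorm(200, mean = 10, sd = 2)         # simulated normal data
x_skew <- rlnorm(200, meanlog = 0, sdlog = 1)   # simulated right-skewed data

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_skewed      = ks_skew$p.value)

# Graphical check: points far from the line suggest non-normality
qqnorm(x_skew)
qqline(x_skew)
```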
  • 18. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Testing normality There are some mathematical functions that can be applied in order to stabilize the variance of a random variable:
      • log transformation
      • logit transformation
      • Square root transformation
  • 19. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Testing normality [Figure: density of the original variable (X scale) and of the log-transformed variable (log(X) scale).]
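The effect of a log transformation can be checked on simulated data. This sketch assumes a log-normal variable (so that its logarithm is exactly normal) and uses a hypothetical helper `skew` to quantify asymmetry; it is an illustration, not the data behind the figure.

```r
set.seed(7)                                  # arbitrary seed
x     <- rlnorm(500, meanlog = -1, sdlog = 1)  # simulated right-skewed variable
log_x <- log(x)                              # log-transformed variable

# Hypothetical helper: sample skewness (0 for a symmetric distribution)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

c(original = skew(x), log_transformed = skew(log_x))

# Densities on both scales, side by side
op <- par(mfrow = c(1, 2))
plot(density(x),     main = "Original variable",        xlab = "X scale")
plot(density(log_x), main = "log-transformed variable", xlab = "log(X) scale")
par(op)
```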
  • 22. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Significance The interpretation of the p-value is well known... Sure?
      • Probability of a difference at least as large as the one observed if H0 is true (by chance)
      • Probability of making a mistake when rejecting H0
      • Evidence against H0 provided by the sample. If the p-value is small, it is unlikely that the sample differences were observed by chance
      • Probability that the observed differences are false
      • 1 - p-value = Probability that the observed differences are real
  • 23. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Significance
      • The p-value is computed under the assumption that H0 is true, and therefore it cannot provide direct information about whether H0 is true
      • The scientist has to decide about H0 based on the evidence against it that the sample provides, without knowing the truth
  • 24. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Significance
      • If a confidence level 1 − α is fixed:
          p < α → Statistically significant differences
          p ≥ α → No statistically significant differences
  • 25. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Significance [Figure]
  • 27. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Comparing two populations
      • Student’s t test was designed to compare two means.
      • A t-test can also be used to determine whether 2 clusters are different.
      [Figure: scatter plot of Value against Time.]
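A two-sample comparison and the decision rule p < α can be sketched in base R as follows. The two groups are simulated (means, standard deviation and sample sizes are assumptions); `t.test` applies Welch's version of the test by default, which does not assume equal variances.

```r
set.seed(3)                          # arbitrary seed
g1 <- rnorm(30, mean = 10, sd = 3)   # simulated values for the first group
g2 <- rnorm(30, mean = 20, sd = 3)   # simulated values for the second group

alpha <- 0.05                        # significance level (confidence level 0.95)
res <- t.test(g1, g2)                # Welch two-sample t-test (var.equal = FALSE)

res$estimate                         # the two sample means
res$p.value                          # p-value of the test
if (res$p.value < alpha) {
  message("Statistically significant difference between the two means")
} else {
  message("No statistically significant difference between the two means")
}
```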
  • 28. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Comparing two populations (non-Gaussian)
      • If the distribution of the variable of interest is not Gaussian, we can still compare two populations by means of the Mann-Whitney U test (for independent samples) or the Wilcoxon signed-rank test (for paired samples).
      • Under a location-shift assumption, these non-parametric tests compare two medians.
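Both non-parametric tests are provided by `wilcox.test` in base R. The sketch below uses simulated exponential (clearly non-Gaussian) data; all distribution parameters are assumptions chosen for illustration.

```r
set.seed(11)                          # arbitrary seed

# Independent samples: Mann-Whitney U test
a  <- rexp(25, rate = 1)              # simulated group A
b  <- rexp(25, rate = 0.25)           # simulated group B, larger on average
mw <- wilcox.test(a, b)

# Paired samples: Wilcoxon signed-rank test
before <- rexp(25, rate = 1)                      # simulated baseline values
after  <- before * 1.5 + rexp(25, rate = 2)       # same subjects, always higher
wp <- wilcox.test(before, after, paired = TRUE)

c(mann_whitney = mw$p.value, wilcoxon_paired = wp$p.value)
```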
  • 29. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Comparing three (or more) populations
      • If we want to compare more than two groups, we can use the ANOVA technique.
      • Essentially, it is a generalization of Student’s t test.
      • Intra-group variance should be similar.
      • Normality is not crucial.
      • It only tells whether at least one of the compared groups differs from the others.
  • 30. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Comparing three (or more) populations
      • ANOVA can also be used to determine whether 3 or more clusters are different.
      [Figure: scatter plot of Value against Time.]
  • 31. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Comparing three (or more) populations If H0 can be rejected, which group is the different one?
      • We need to perform a posteriori (post hoc) tests on the means.
      • They compare each pair of means.
      • They are more conservative, in order to control the overall type I error αT.
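An ANOVA followed by a posteriori pairwise comparisons can be sketched in base R with `aov` and `TukeyHSD`. The three groups are simulated (group labels, means and sizes are assumptions), with only group C shifted.

```r
set.seed(5)                                     # arbitrary seed
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, mean = 10, sd = 2),       # group A
            rnorm(20, mean = 10, sd = 2),       # group B, same mean as A
            rnorm(20, mean = 15, sd = 2))       # group C, shifted
)

# Global test: is at least one group mean different?
fit <- aov(value ~ group, data = d)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]
anova_p

# A posteriori pairwise comparisons with Tukey's correction
tuk <- TukeyHSD(fit)
tuk$group            # one row per pair: difference, interval, adjusted p-value
```

The rows of `tuk$group` are labelled `B-A`, `C-A` and `C-B`, so the adjusted p-values show which pairs drive the global result.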
  • 32. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Multiple comparisons There are several methods to control the type I error:
      • Bonferroni
      • Holm
      • Tukey
      • Scheffé
      • Dunnett (control)
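Two of the listed corrections, Bonferroni and Holm, are available through `p.adjust` in base R. The raw p-values below are made up for illustration only.

```r
# Made-up raw p-values from five hypothetical comparisons
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.200)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # multiply by the number of tests
p_holm <- p.adjust(p_raw, method = "holm")        # step-down version, less conservative

cbind(p_raw, p_bonf, p_holm)
```

Holm's adjusted p-values are never larger than Bonferroni's, so Holm rejects at least as many hypotheses while still controlling the overall type I error.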
  • 34. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing False Discovery Rate Suppose you performed 100 different t-tests, and found 20 results with a p-value < 0.05.
      • How many of these 20 results are likely false positives?
      • If all null hypotheses were true, about 100 · 0.05 = 5 of the 100 tests would be expected to give p < 0.05 by chance alone, so up to about 5 of the 20 results may be false positives.
      • To correct for this (Bonferroni), we can consider as significant only the results with a p-value < 0.05/100 = 0.0005.
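The number of false positives expected by chance can be illustrated by simulation. In this sketch all 100 null hypotheses are true by construction (both groups come from the same distribution; group sizes and seed are assumptions), and the raw p-values are compared with the Bonferroni and Benjamini-Hochberg (`"BH"`, a false discovery rate procedure) adjustments from `p.adjust`.

```r
set.seed(2014)                       # arbitrary seed
m <- 100                             # number of t-tests

# All null hypotheses are true: both groups come from the same distribution
p_null <- replicate(m, t.test(rnorm(10), rnorm(10))$p.value)

n_raw  <- sum(p_null < 0.05)                                   # false positives, uncorrected
n_bonf <- sum(p.adjust(p_null, method = "bonferroni") < 0.05)  # after Bonferroni
n_bh   <- sum(p.adjust(p_null, method = "BH") < 0.05)          # after Benjamini-Hochberg

c(uncorrected = n_raw, bonferroni = n_bonf, BH = n_bh)
```

Without correction, around 5 of the 100 tests are expected to fall below 0.05 even though no real difference exists.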
  • 35. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Correlation
      • If there is some dependency between two variables, a relationship between the predicted and the observed variable, or an effect of a treatment visible between “before” and “after” measurements, the scatter plot shows a clear pattern
      • This pattern or relationship is called correlation
  • 36. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Correlation [Figure: three scatter plots: positive correlation (y against x), negative correlation (z against x) and no correlation (t against x).]
  • 37. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Correlation The correlation coefficient (Pearson coefficient) is computed by means of
      $$r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}}$$
  • 38. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Correlation and significance [Figure: two scatter plots based on very few points, one with r = 0.98 and one with r = 0.22.]
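The Pearson coefficient and its significance can be computed in base R. The sketch below writes the formula out in a hypothetical helper `pearson_r`, checks it against `cor`, and uses `cor.test` on two made-up data sets to show that a high r from very few points need not be significant, while a modest r from many points can be.

```r
# Hypothetical helper: Pearson coefficient written out from its definition
pearson_r <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

# Three made-up points: very high correlation, very little evidence
x3  <- c(1, 2, 3)
y3  <- c(1.0, 2.5, 2.9)
r3  <- pearson_r(x3, y3)
ct3 <- cor.test(x3, y3)

# Many simulated points: modest correlation, strong evidence
set.seed(8)                          # arbitrary seed
xl  <- rnorm(500)
yl  <- 0.3 * xl + rnorm(500)
ctl <- cor.test(xl, yl)

rbind(few_points  = c(r = r3, p = ct3$p.value),
      many_points = c(r = unname(ctl$estimate), p = ctl$p.value))
```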
  • 39. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Clustering [Figure: scatter plot of the observations.]
  • 40. Statistical methods in metabolomics Basic statistics Distributions and hypothesis testing Clustering [Figure: CLUSPLOT(mydata), Component 2 against Component 1. These two components explain 100 % of the point variability.]
  • 41. Statistical methods in metabolomics Basic statistics Clustering Clustering
      • Clustering is a process by which objects with similar characteristics are grouped together
      • It is a step prior to classification
      • It requires a method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
      • It uses a threshold value to decide whether an object belongs to a cluster
      • There are several clustering methods, differing in how they start the clustering process
  • 42. Statistical methods in metabolomics Basic statistics Clustering Clustering
      • K-means algorithm: divides a set of N objects into M clusters (with or without overlap). M must be specified by the analyst
      • Hierarchical clustering: produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
  • 43. Statistical methods in metabolomics Basic statistics Clustering K-means algorithm
      1. Make the first object the centroid for the first cluster
      2. For the next object, calculate the similarity to each existing centroid
      3. If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster
      4. Return to step 2 and repeat until done
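The steps above can be sketched as a small R function. This is an illustration under stated assumptions, not a library routine: the function name `sequential_cluster` is hypothetical, and similarity is expressed through Euclidean distance, so "similarity greater than a threshold" becomes "distance not larger than a threshold". R's own `kmeans` function works differently: it takes the number of clusters as input and iterates until the assignments stabilize.

```r
# Hypothetical helper implementing the four steps with a distance threshold
sequential_cluster <- function(X, threshold) {
  X <- as.matrix(X)
  centroids <- X[1, , drop = FALSE]        # step 1: first object = first centroid
  members   <- list(1L)                    # row indices belonging to each cluster
  for (i in seq_len(nrow(X))[-1]) {
    # step 2: Euclidean distance from object i to every existing centroid
    obj <- matrix(X[i, ], nrow(centroids), ncol(X), byrow = TRUE)
    d   <- sqrt(rowSums((centroids - obj)^2))
    j   <- which.min(d)
    if (d[j] <= threshold) {
      # step 3a: close enough, join the cluster and recompute its centroid
      members[[j]]   <- c(members[[j]], i)
      centroids[j, ] <- colMeans(X[members[[j]], , drop = FALSE])
    } else {
      # step 3b: too far from every centroid, start a new cluster
      centroids <- rbind(centroids, X[i, ])
      members[[length(members) + 1]] <- i
    }
  }                                        # step 4: repeat for the next object
  cluster <- integer(nrow(X))
  for (j in seq_along(members)) cluster[members[[j]]] <- j
  list(cluster = cluster, centroids = centroids)
}

# Usage on four made-up points forming two obvious groups
X   <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.1, 5))
res <- sequential_cluster(X, threshold = 1)
res$cluster
```

The result of this procedure depends on the order of the objects and on the threshold, which is one reason to compare several clustering solutions.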
  • 44. Statistical methods in metabolomics Basic statistics Clustering K-means algorithm example
      # Read data
      > st1 <- read.table("Data/global_afegits.csv", sep=";", dec=",", header=TRUE)
      # st2.ado is a data frame derived from st1; its construction is not shown on the slides
      # Determine number of clusters
      > n <- nrow(st2.ado)
      > wss <- numeric(10)   # within-groups sum of squares for 1 to 10 groups
      > wss[1] <- (n-1)*sum(apply(st2.ado[,2:8], 2, var))
      > for (i in 2:10) {
          wss[i] <- sum(kmeans(na.omit(st2.ado[,2:8]), centers=i)$withinss)
        }
      > plot(1:10, wss, type="b", xlab="Number of groups", ylab="Within groups sum of squares")
  • 45. Statistical methods in metabolomics Basic statistics Clustering K-means algorithm example [Figure: within groups sum of squares against number of groups (1 to 10).]
  • 46. Statistical methods in metabolomics Basic statistics Clustering K-means algorithm example If we choose 5 clusters, then
      > fit <- kmeans(st2.ado, 5)
      classifies the observations into the 5 groups.
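Because the data file used on the slides is not available, the same workflow can be tried on a built-in data set. The sketch below is an assumption-laden stand-in: it uses the four numeric columns of `iris` (standardized), and the choice of 3 clusters and of `nstart = 20` random starts is made for illustration.

```r
set.seed(123)                            # k-means depends on random starts
X <- scale(iris[, 1:4])                  # stand-in data: standardized numeric columns
n <- nrow(X)

# Within-groups sum of squares for 1 to 10 groups
wss    <- numeric(10)
wss[1] <- (n - 1) * sum(apply(X, 2, var))
for (k in 2:10) {
  wss[k] <- sum(kmeans(X, centers = k, nstart = 20)$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")

# Fit the chosen solution and inspect the resulting groups
fit <- kmeans(X, centers = 3, nstart = 20)
table(cluster = fit$cluster, species = iris$Species)
```

The number of groups is usually chosen where the curve stops dropping sharply (the "elbow").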
  • 47. Statistical methods in metabolomics Basic statistics Clustering Hierarchical clustering
      1. Find the two closest objects and merge them into a cluster
      2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
      3. If more than one cluster remains, return to step 2 until finished
  • 48. Statistical methods in metabolomics Basic statistics Clustering Hierarchical clustering example
      # Ward Hierarchical Clustering
      > d <- dist(st2.ado, method = "euclidean")
      > fit <- hclust(d, method="ward.D")
      > plot(fit) # display dendrogram
      > groups <- cutree(fit, k=5) # cut tree into 5 clusters
      # draw dendrogram with red borders around the 5 clusters
      > rect.hclust(fit, k=5, border="red")
  • 49. Statistical methods in metabolomics Basic statistics Clustering Hierarchical clustering example [Figure: Cluster Dendrogram, hclust (*, "ward.D"); vertical axis: Height.]
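The hierarchical clustering example can be run end to end on a built-in data set. This sketch substitutes the standardized `USArrests` data for the unavailable `st2.ado` data frame and cuts the tree into 4 clusters; both choices are assumptions made for illustration.

```r
# Stand-in data: standardized USArrests (50 states, 4 variables)
d   <- dist(scale(USArrests), method = "euclidean")   # pairwise distances
fit <- hclust(d, method = "ward.D")                   # Ward hierarchical clustering

plot(fit)                                 # display the dendrogram
groups <- cutree(fit, k = 4)              # cut the tree into 4 clusters
rect.hclust(fit, k = 4, border = "red")   # red borders around the 4 clusters

table(groups)                             # cluster sizes
```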
  • 52. Statistical methods in metabolomics Basic statistics Clustering Validating cluster solutions There are several methods to compare different clustering solutions to the same problem:
      • Hubert’s gamma coefficient
      • Dunn index
      • Corrected Rand index
      Some of them are implemented in the R package fpc.
  • 53. Statistical methods in metabolomics Basic statistics Multivariate statistics Multivariate statistics
      • Multivariate statistics means dealing with several variables at the same time
      • Multivariate problems require more complex, multidimensional analyses or dimension reduction methods
      • Metabolomics experiments typically measure many metabolites at once; in other words, the instruments measure multiple variables, and so metabolomic data are inherently multivariate
      • The key trick in multivariate statistics is to find a way to effectively reduce the multivariate data to univariate data
      • Then we can apply the same univariate concepts, such as p-values, t-tests and ANOVA tests, to the data
  • 54. Statistical methods in metabolomics Basic statistics Multivariate statistics Principal Component Analysis
      • PCA is a process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
      • PCA captures what should be visually detectable
      • If you can’t see it, PCA probably won’t help
  • 55. Statistical methods in metabolomics Basic statistics Multivariate statistics Principal Component Analysis
      > data(USArrests)
      > pc.cr <- princomp(USArrests, cor = TRUE)
      > biplot(pc.cr)
  • 56. Statistical methods in metabolomics Basic statistics Multivariate statistics Principal Component Analysis [Figure: biplot of Comp.2 against Comp.1 for the 50 US states, with arrows for the variables Murder, Assault, UrbanPop and Rape.]
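To see how much information the first components retain, the proportion of variance explained can be computed directly. This sketch uses `prcomp` (an alternative to the `princomp` call on the slide, based on a singular value decomposition) with standardized variables, which corresponds to working on the correlation matrix.

```r
# PCA on standardized variables (equivalent to using the correlation matrix)
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of the total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
round(cumsum(var_explained), 3)     # cumulative proportion

# Scores of the first observations on the first two components
head(pca$x[, 1:2])
```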
  • 57. Statistical methods in metabolomics Basic statistics Multivariate statistics Other methods There are several other multivariate methods that are increasingly used in metabolomics and related fields:
      • Discriminant Analysis (DA, PLS-DA, OPLS-DA)
      • Factor Analysis
      • Structural Equation Modeling
  • 59. Statistical methods in metabolomics Available tools How to analyze data?
      R
      • R is a freely available language and environment for statistical computing and graphics.
      • It provides a wide variety of statistical and graphical techniques.
      • It is constantly expanding thanks to user-contributed packages.
      • It can be downloaded from http://cran.r-project.org.
      Bioconductor
      • Bioconductor is a repository of user-contributed R packages.
      • It is accessible from http://www.bioconductor.org.
      • It provides tools for the analysis and comprehension of high-throughput genomic data.
      • It has mailing lists and a very active community of users and developers.
  • 60. Statistical methods in metabolomics Available tools Bioconductor Bioconductor submitted packages [Figure]
  • 61. Statistical methods in metabolomics Available tools Bioconductor Installation of Bioconductor packages The installation of Bioconductor can be done within the R session by
      install.packages("BiocManager")   # BiocManager replaces the former biocLite.R script
      BiocManager::install()
  • 62. Statistical methods in metabolomics R basics Getting help Getting help
      • ?mean
      • help(mean)
      • help.search("mean")
      • apropos("mean")
      • example(mean)
  • 64. Statistical methods in metabolomics R basics R packages for metabolomics Useful packages There are a number of useful packages in Bioconductor for metabolomics data analysis.
      • flagme: Analysis of metabolomics GC/MS data
      • xcms: Analysis of metabolomics LC/MS data
  • 65. Statistical methods in metabolomics LC/MS example LC/MS example xcms
      • Can read data stored in several formats such as netCDF, mzXML, mzData and mzML.
      • Provides methods for feature detection, non-linear retention time alignment, visualization, relative quantitation and statistics.
      • Is capable of simultaneously preprocessing, analyzing, and visualizing the raw data from hundreds of samples.
      • It is available as an R package or as an online platform accessible through https://xcmsonline.scripps.edu/.
  • 66. Statistical methods in metabolomics LC/MS example LC/MS example Typical xcms workflow [Figure]
  • 67. Statistical methods in metabolomics LC/MS example LC/MS example Reading the data
      # Use BiocManager (the successor of biocLite) to install a Bioconductor package
      > install.packages("BiocManager")
      # Install the xcms package
      > BiocManager::install("xcms")
      # Install dataset package used in this session
      > BiocManager::install("faahKO")
  • 68. Statistical methods in metabolomics LC/MS example LC/MS example Reading the data The data in faahKO consist of LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH knockout mice. The data are a subset of the original data, from 200-600 m/z and 2500-4500 seconds. They were collected in positive ionization mode.
      # Load libraries
      > library("xcms")
      > library("faahKO")
  • 70. Statistical methods in metabolomics LC/MS example LC/MS example Reading the data
      > cdfpath <- system.file("cdf", package="faahKO")
      > files <- list.files(cdfpath, recursive=TRUE, full.names=TRUE)
      > data <- xcmsSet(files)
      Some important parameters
      • scanrange=c(lower, upper): to scan part of the spectra
      • fwhm=seconds: specify the full width at half maximum (default 30 s) based on the type of chromatography
      • method="centWave": use the wavelet algorithm for peak detection, suitable for high resolution spectra
  • 72. Statistical methods in metabolomics LC/MS example LC/MS example Peak alignment and retention time correction
      > xsg <- group(data) # peak alignment
      > xsg <- retcor(xsg) # retention time correction
      > xsg <- group(xsg) # re-align
      • Matching peaks across samples
      • Using the peak groups to correct drift
      • Re-doing the alignment
      • Can be performed iteratively until no further change
  • 73. Statistical methods in metabolomics LC/MS example LC/MS example Peak alignment and retention time correction [Figure: Retention Time Deviation vs. Retention Time for samples ko15, ko16, ko18, ko19, ko21, ko22, wt15, wt16, wt18, wt19, wt21 and wt22, with Peak Density against Retention Time (2500 to 4500) underneath.]
  • 75. Statistical methods in metabolomics LC/MS example LC/MS example Filling in missing peaks
      > xsg <- fillPeaks(xsg)
      • A significant number of potential peaks can be missed during peak detection
      • Missing values are problematic for robust statistical analysis
      • We now have a better idea about where to expect real peaks and their boundaries
      • Re-scan the raw spectra and integrate peaks in the regions of the missing peaks
  • 77. Statistical methods in metabolomics LC/MS example LC/MS example Results of peak detection
      > peaks(xsg)
      The peaks() function gives a list of peaks with
      • mz
      • mzmin
      • mzmax
      • rt
      • rtmin
      • rtmax
      • peak intensities/areas (raw data)
  • 79. Statistical methods in metabolomics LC/MS example LC/MS example Statistical analysis
      > report <- diffreport(xsg, "WT", "KO")
      • The diffreport() function computes Welch’s two-sample t-statistic for each analyte and ranks them by p-value.
      • It returns a summary report
      • Multivariate analysis and visualization can be performed using MetaboAnalyst
      • The report generated by diffreport() can be directly uploaded to MetaboAnalyst
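The idea of computing a Welch t-test per analyte and ranking by p-value can be illustrated in base R without xcms. This is a sketch of the principle on a simulated intensity matrix, not the implementation used by `diffreport()`: the number of features, the intensity values and the three artificially shifted features are all assumptions.

```r
set.seed(99)                                   # arbitrary seed
n_feat <- 50                                   # number of simulated features
cls    <- factor(rep(c("WT", "KO"), each = 6)) # 6 wild-type and 6 knockout samples

# Simulated intensity matrix: one row per feature, one column per sample
intens <- matrix(rnorm(n_feat * 12, mean = 1000, sd = 100), nrow = n_feat,
                 dimnames = list(paste0("feature", 1:n_feat), NULL))
# Make the first three features truly different between the two classes
intens[1:3, cls == "KO"] <- intens[1:3, cls == "KO"] + 600

# Welch two-sample t-test for each feature
pvals <- apply(intens, 1, function(v) {
  t.test(v[cls == "WT"], v[cls == "KO"])$p.value
})

# Report ranked by p-value, smallest first
report <- data.frame(feature = rownames(intens), pvalue = pvals)
report <- report[order(report$pvalue), ]
head(report)
```

With many features tested at once, the ranked p-values would normally be adjusted for multiple comparisons before being interpreted.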
  • 81. Statistical methods in metabolomics LC/MS example LC/MS example Visualizing important peaks
      # Select peaks with median retention time
      # between 3300 and 3400 and detected in
      # at least 8 samples
      > gr <- groups(xsg)
      > groupidx <- which(gr[,"rtmed"] > 3300 & gr[,"rtmed"] < 3400 & gr[,"npeaks"] >= 8)[1]
      > eiccor <- getEIC(xsg, groupidx=groupidx)
      > plot(eiccor, col=as.numeric(phenoData(xsg)$class))
      • When significant peaks are identified, it is critical to visualize these peaks to assess quality
      • This is done using the Extracted Ion Chromatogram (EIC)
  • 82. Statistical methods in metabolomics LC/MS example LC/MS example Visualizing important peaks [Figure: Extracted Ion Chromatogram: 300.1 − 300.2 m/z; Intensity against Retention Time (seconds).]
  • 83. Statistical methods in metabolomics Further reading Some references
      • Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2 (4), 171–196.
      • Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics. Current Metabolomics, 1 (1), 92–107.
      • Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D. (2009). Analytical and statistical approaches to metabolomics research. Journal of Separation Science, 32, 2183–2199.
      • Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R package documentation.
      • Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in metabolomics. Methods in Molecular Biology, 856 (Evolutionary genomics), 381–413. Springer.
  • 84. Centre for Research in Environmental Epidemiology Parc de Recerca Biomèdica de Barcelona Doctor Aiguader, 88 08003 Barcelona (Spain) Tel. (+34) 93 214 70 00 Fax (+34) 93 214 73 02 info@creal.cat www.creal.cat