Statistical methods in Metabolomics
1. Statistical methods in metabolomics
David Moriña (a, b)
david.morina@uab.cat
(a) Centre for Research in Environmental Epidemiology (CREAL)
(b) Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona
May 08 2014, Reus
Contents
1 Introduction
2 Basic statistics
3 Available tools
4 R basics
5 LC/MS example
6 Further reading
Introduction
Where does data come from?
Metabolomics
• Metabolomics is the analysis and study of the set of metabolites in a cell,
organ, or tissue.
• To detect and quantify metabolites, separation techniques such as gas or
liquid chromatography, followed by quantification by mass spectrometry
(GC/MS or LC/MS), are often used.
• Nuclear magnetic resonance (NMR) spectroscopy is also frequently
employed and has some appealing properties:
  • It is non-destructive, in the sense that it does not “destroy” the samples
  during the analysis process.
  • It is useful when analyzing tissues or when sequential analysis of samples
  is required.
Basic statistics
What do people say about statistics?
• There are three kinds of lies: lies, damned lies, and statistics. (B. Disraeli)
• Statistics are like bikinis: What they reveal is suggestive, but what they
hide is vital. (A. Levenstein)
• About 93% of all statistics are made up. (Any newspaper)
Distributions and hypothesis testing
Normal distribution
Central Limit Theorem
Under fairly mild conditions (for example, finite variance), the distribution
of the sum of independent and identically distributed random variables
approaches a normal distribution as the number of observations grows.
The following example shows the distribution of the sum of the scores obtained
when rolling 1, 2, 3, 5 and 10 dice:
[Figure: distribution of the sum of the scores for 1, 2, 3, 5 and 10 dice]
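A small simulation can stand in for the dice example. The sketch below assumes fair six-sided dice and 10,000 simulated throws per panel; both choices, the seed, and the helper name roll_sum are illustrative and not taken from the slides.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

# Sum of the scores of `n_dice` fair six-sided dice, repeated `n_rolls` times
roll_sum <- function(n_dice, n_rolls = 10000) {
  rolls <- matrix(sample(1:6, n_dice * n_rolls, replace = TRUE),
                  nrow = n_rolls)
  rowSums(rolls)
}

n_dice <- c(1, 2, 3, 5, 10)
sums <- lapply(n_dice, roll_sum)

# One bar plot per number of dice: the shape gets closer to a bell curve
op <- par(mfrow = c(2, 3))
for (k in seq_along(n_dice)) {
  barplot(table(sums[[k]]) / length(sums[[k]]),
          main = paste(n_dice[k], "dice"),
          xlab = "Sum of scores", ylab = "Relative frequency")
}
par(op)
```

With a single die the bars are roughly flat; with 10 dice they are already close to symmetric and bell-shaped.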
Testing normality
There are several methods to check whether a random variable follows a normal
distribution. Some of them are
• Kolmogorov-Smirnov test
• Shapiro-Wilk test
• Graphical methods (QQ-plot, ...)
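These checks are all available in base R. The sketch below applies them to simulated data; the sample sizes, distributions and variable names are assumptions made only for illustration.

```r
set.seed(1)
x_norm <- rnorm(100, mean = 10, sd = 2)   # simulated normal sample
x_skew <- rexp(100, rate = 1)             # simulated right-skewed sample

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
# (estimating the parameters from the same data makes the p-value approximate)
ks_skew <- ks.test(x_skew, "pnorm", mean = mean(x_skew), sd = sd(x_skew))

# QQ-plot: points close to the reference line suggest normality
qqnorm(x_skew)
qqline(x_skew)

c(shapiro_normal = sw_norm$p.value,
  shapiro_skewed = sw_skew$p.value,
  ks_skewed = ks_skew$p.value)
```

A small p-value is evidence against normality; a large one does not prove normality, which is why the graphical check is worth doing as well.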
Some mathematical transformations can be applied in order to stabilize
the variance of a random variable
• Log transformation
• Logit transformation
• Square root transformation
[Figure: density of the original variable (X scale) next to the density of the log-transformed variable (log(X) scale)]
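The effect of a log transformation can be reproduced with simulated data. In the sketch below the log-normal parameters and the small skewness helper are assumptions chosen for illustration, not values from the slides.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = -2, sdlog = 1.5)  # simulated right-skewed variable

op <- par(mfrow = c(1, 2))
plot(density(x), main = "Original variable", xlab = "X scale")
plot(density(log(x)), main = "log-transformed variable", xlab = "log(X) scale")
par(op)

# Sample skewness: close to 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(original = skewness(x), log_transformed = skewness(log(x)))
```

The original variable is strongly right-skewed, while its logarithm is roughly symmetric.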
Significance
The interpretation of the p-value is well known... or is it?
• Probability of a difference at least as large as the one observed if H0 is
true (by chance)
• Probability of making a mistake when rejecting H0
• Evidence against H0 provided by the sample. If the p-value is small, the
observed sample differences are unlikely to arise by chance
• Probability that the observed differences are false
• 1 − p-value = probability that the observed differences are real
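• The p-value is computed under the assumption that H0 is true, and
therefore it cannot provide direct information about whether H0 is true
• Consequently, of the statements listed above, only the first and the third
are valid readings of a p-value
• The scientist should decide on H0 based on the evidence against it that the
sample provides, without knowing the true state of reality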
• If a confidence level 1 − α is fixed:
p < α → statistically significant differences
p ≥ α → no statistically significant differences
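The decision rule can be illustrated by simulation. The sketch below assumes two-sample t-tests with 20 observations per group, 2000 repetitions and an effect of one standard deviation; these are all illustrative choices.

```r
set.seed(1)
alpha <- 0.05

# 2000 experiments in which H0 is true: both groups share the same mean
p_null <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# 2000 experiments in which H0 is false: the means differ by one sd
p_alt <- replicate(2000, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)

mean(p_null < alpha)  # share of false positives, expected to be close to alpha
mean(p_alt < alpha)   # share of real differences detected (power)
```

When H0 is true, p < α still occurs in roughly a fraction α of the experiments, which is exactly the type I error rate that the rule accepts.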
Comparing two populations (non-gaussian)
• If the distribution of the variable of interest is not Gaussian, we can still
compare two populations by means of the Mann-Whitney U test (for
independent samples) or the Wilcoxon signed-rank test (for paired samples).
• These non-parametric tests are commonly read as comparing two medians,
although that reading relies on additional assumptions about the shape of
the distributions.
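In R both tests are provided by wilcox.test(). The sketch below uses simulated skewed data; the group sizes, the shift between groups and the variable names are assumptions for illustration.

```r
set.seed(1)
# Simulated right-skewed measurements for two independent groups
group_a <- rexp(30, rate = 1)
group_b <- rexp(30, rate = 1) + 1   # same shape, shifted upwards by one unit

# Mann-Whitney U test (independent samples)
mw <- wilcox.test(group_a, group_b)

# Wilcoxon signed-rank test (paired samples): before/after on the same units
before <- rexp(30, rate = 1)
after  <- before + rnorm(30, mean = 0.5, sd = 0.2)
sr <- wilcox.test(before, after, paired = TRUE)

c(mann_whitney = mw$p.value, signed_rank = sr$p.value)
```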
Comparing three (or more) populations
• If we want to compare more than two groups we can use the ANOVA
technique.
• Essentially, it is a generalization of Student’s t test.
• Intra-group variance should be similar.
• Normality is not crucial.
• It only tells whether at least one of the compared groups differs from the
others.
If H0 can be rejected, which group is the different one?
• We need to perform a posteriori (post hoc) tests on the means.
• They compare each pair of means.
• They are more conservative, in order to control the overall type I error αT.
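A global ANOVA followed by a posteriori comparisons can be sketched with base R. The example assumes the built-in PlantGrowth data set (plant weights under a control and two treatments) and Tukey's method; neither appears in the original slides.

```r
# PlantGrowth ships with R: plant weights under a control and two treatments
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)             # global F test: is any group mean different?

# A posteriori comparison of every pair of means (Tukey's method)
tuk <- TukeyHSD(fit, conf.level = 0.95)
tuk$group                # difference, confidence interval and adjusted p-value
```

Each row of tuk$group corresponds to one pair of groups, so the table shows directly which pairs differ once the overall error rate is controlled.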
Multiple comparisons
There are several methods to control type I error:
• Bonferroni
• Holm
• Tukey
• Scheffé
• Dunnett (comparisons against a control group)
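Two of these corrections, Bonferroni and Holm, are available in base R through pairwise.t.test(). The sketch below again assumes the built-in PlantGrowth data as a stand-in; the Scheffé and Dunnett procedures are not covered by it.

```r
# Pairwise t tests between all groups with two different corrections
pw_bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                           p.adjust.method = "bonferroni")
pw_holm <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                           p.adjust.method = "holm")

pw_bonf$p.value   # adjusted p-values, one per pair of groups
pw_holm$p.value   # Holm is never more conservative than Bonferroni
```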
False Discovery Rate
Suppose you performed 100 different t-tests and found 20 results with a
p-value < 0.05.
• How many of these 20 results are likely to be false positives?
• If every null hypothesis were true, we would expect 100 · 0.05 = 5 results
with p < 0.05 by chance alone, so up to about 5 of the 20 could be false
positives
• A simple (and conservative) correction is Bonferroni's: consider as
significant only the results with a p-value < 0.05/100, that is, p < 0.0005
• False discovery rate procedures such as Benjamini-Hochberg are less
conservative: they control the expected proportion of false positives among
the results declared significant
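Adjusting a whole set of p-values is a one-line operation with p.adjust() in base R. The sketch below simulates 100 p-values; the split into 80 true null hypotheses and 20 real effects, and the beta distribution used for the latter, are assumptions made only to have data to work with.

```r
set.seed(1)
# 100 simulated tests: 80 true null hypotheses and 20 real effects
p_null <- runif(80)            # p-values are uniform when H0 is true
p_real <- rbeta(20, 0.1, 10)   # assumed: p-values concentrated near zero
p <- c(p_null, p_real)

n_raw  <- sum(p < 0.05)                                   # no correction
n_bonf <- sum(p.adjust(p, method = "bonferroni") < 0.05)  # family-wise error
n_bh   <- sum(p.adjust(p, method = "BH") < 0.05)          # false discovery rate
c(unadjusted = n_raw, bonferroni = n_bonf, benjamini_hochberg = n_bh)
```

The Benjamini-Hochberg count always lies between the Bonferroni count and the unadjusted count, which reflects the trade-off between missing real effects and accepting false positives.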
Correlation
• If there is some dependency between two variables, if there is a
relationship between the predicted and the observed variable, or if the
“before” and “after” measurements of a treatment show some effect, then
clear patterns can be seen in the scatter plot
• This pattern or relationship is called correlation
The correlation coefficient (Pearson coefficient) is computed by means of
\[
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}}
\]
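The Pearson coefficient can be checked numerically against R's built-in cor(). The simulated data and variable names below are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)     # simulated linear relationship plus noise

# Pearson coefficient computed directly from its definition:
# sum of cross-products over the square root of the product of sums of squares
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y, method = "pearson")
all.equal(r_manual, r_builtin)   # the two computations agree
plot(x, y)                       # scatter plot of the pair
```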
Clustering
• Clustering is a process by which objects with similar characteristics are
grouped together
• It is often a preliminary step before classification.
• It requires a method to measure similarity (a similarity matrix) or
dissimilarity (a dissimilarity coefficient) between objects
• It uses a threshold value to decide whether an object belongs to a
cluster
• There are several clustering methods, differing in how they start the
clustering process
• K-means algorithm: divides a set of N objects into M clusters, with or
without overlap. M must be specified by the analyst
• Hierarchical clustering: produces a set of nested clusters in which each
pair of objects is progressively nested into a larger cluster until only one
cluster remains
K-means algorithm
1. Make the first object the centroid for the first cluster
2. For the next object, calculate the similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing
cluster and recompute the centroid; otherwise use the object to start a new
cluster
4. Return to step 2 and repeat until done
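The steps listed above can be written down almost literally. The sketch below is a hypothetical helper (leader_cluster and its threshold argument are names introduced here) that measures similarity as Euclidean distance, so an object joins a cluster when its distance to the nearest centroid is below the threshold. Note that this sequential, threshold-based procedure is not what R's kmeans() function does: kmeans() takes a fixed number of centers and iteratively reassigns all objects.

```r
# Sequential threshold clustering following the steps listed above.
# Assumption: similarity is Euclidean distance to the centroid, and an
# object joins the nearest cluster when that distance is below `threshold`.
leader_cluster <- function(x, threshold) {
  x <- as.matrix(x)
  centroids <- x[1, , drop = FALSE]   # step 1: first object starts cluster 1
  members <- list(1L)
  labels <- integer(nrow(x))
  labels[1] <- 1L
  for (i in seq_len(nrow(x))[-1]) {
    # step 2: distance from object i to every existing centroid
    d <- sqrt(rowSums(sweep(centroids, 2, x[i, ])^2))
    k <- which.min(d)
    if (d[k] < threshold) {
      # step 3, first case: join the nearest cluster, recompute its centroid
      members[[k]] <- c(members[[k]], i)
      centroids[k, ] <- colMeans(x[members[[k]], , drop = FALSE])
    } else {
      # step 3, second case: the object starts a new cluster
      k <- nrow(centroids) + 1L
      centroids <- rbind(centroids, x[i, , drop = FALSE])
      members[[k]] <- i
    }
    labels[i] <- k                    # step 4: continue with the next object
  }
  list(cluster = labels, centers = centroids)
}

# Usage on four points forming two well-separated pairs
pts <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.1, 5))
res <- leader_cluster(pts, threshold = 1)
res$cluster
```

The result of this procedure depends on the order of the objects and on the threshold, which is one reason why the number of clusters is usually fixed in advance instead.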
K-means algorithm example
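# Read data
st1 <- read.table("Data/global_afegits.csv", sep = ";",
                  dec = ",", header = TRUE)
# NOTE: st2.ado is presumably a data frame derived from st1; the slides
# do not show how it was built.
# Determine the number of clusters from the within-groups sum of squares
x <- na.omit(st2.ado[, 2:8])   # complete cases of the clustering variables
n <- nrow(x)
wss <- numeric(10)
wss[1] <- (n - 1) * sum(apply(x, 2, var))   # one cluster: total sum of squares
for (i in 2:10) {
  wss[i] <- sum(kmeans(x, centers = i)$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")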
[Figure: within groups sum of squares plotted against the number of groups (1 to 10)]
If we choose 5 clusters, then
fit <- kmeans(st2.ado, 5)
will classify the observations into the 5 groups
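Since the data file used on the slides is not available, the same workflow can be tried on a data set that ships with R. The sketch below assumes the built-in USArrests data, standardized variables, nstart = 20 and a choice of 4 clusters; all of these are illustrative.

```r
set.seed(1)
x <- scale(USArrests)   # built-in data, standardized so no variable dominates
n <- nrow(x)

# Within-groups sum of squares for 1 to 10 clusters
wss <- numeric(10)
wss[1] <- (n - 1) * sum(apply(x, 2, var))
for (i in 2:10) {
  wss[i] <- sum(kmeans(x, centers = i, nstart = 20)$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of groups",
     ylab = "Within groups sum of squares")

fit <- kmeans(x, centers = 4, nstart = 20)   # 4 is an arbitrary choice here
table(fit$cluster)                           # cluster sizes
```

The number of groups is usually picked where the curve stops dropping sharply (the "elbow"); nstart restarts the algorithm from several random configurations to reduce the dependence on the starting centroids.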
Hierarchical clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster,
or two clusters) using some similarity measure and a predefined
threshold
3. If more than one cluster remains, return to step 2 until finished
Hierarchical clustering example
# Ward hierarchical clustering
d <- dist(st2.ado, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit)                      # display dendrogram
groups <- cutree(fit, k = 5)   # cut tree into 5 clusters
# Draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k = 5, border = "red")
[Figure: Cluster Dendrogram of d, hclust (*, "ward.D"); vertical axis: Height (0 to 300)]
Validating cluster solutions
There are several methods to compare different clustering solutions to the
same problem.
• Hubert’s gamma coefficient
• Dunn index
• Corrected Rand index
Some of them are implemented in the R package fpc
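As an illustration of one of these measures, the corrected (adjusted) Rand index can be computed from the contingency table of two partitions. The sketch below is a plain base R version with a hypothetical helper name (adjusted_rand), applied to the built-in USArrests data with 4 clusters as an assumed example; it is not the fpc implementation.

```r
# Corrected (adjusted) Rand index between two partitions of the same objects
adjusted_rand <- function(a, b) {
  tab <- table(a, b)                         # contingency table of labels
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_rows <- sum(choose(rowSums(tab), 2))   # pairs together in partition a
  sum_cols <- sum(choose(colSums(tab), 2))   # pairs together in partition b
  expected <- sum_rows * sum_cols / choose(n, 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

set.seed(1)
x <- scale(USArrests)
km <- kmeans(x, centers = 4, nstart = 20)$cluster
hc <- cutree(hclust(dist(x), method = "ward.D"), k = 4)
adjusted_rand(km, hc)   # 1: identical partitions; around 0: chance agreement
```

The index does not depend on how the clusters are numbered, so it can compare solutions produced by different algorithms.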
Multivariate statistics
• Multivariate statistics means dealing with several variables at the same
time
• Multivariate problems require more complex, multidimensional analyses
or dimension reduction methods
• Metabolomics experiments typically measure many metabolites at once,
in other words the instruments are measuring multiple variables and so
metabolomic data are inherently multivariate data
• The key trick in multivariate statistics is to find a way that effectively
reduces the multivariate data into univariate data
• Then we can apply the same univariate concepts such as p-values,
t-tests and ANOVA tests to the data
Principal Component Analysis
• PCA is a process that transforms a number of possibly correlated
variables into a smaller number of uncorrelated variables called principal
components
• PCA captures what should be visually detectable
• If you can’t see it, PCA probably won’t help
[Figure: PCA biplot of the US states on Comp.1 and Comp.2, with loadings for the variables Murder, Assault, UrbanPop and Rape]
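A biplot of this kind can be produced with base R. The sketch below assumes the built-in USArrests data, whose variables (Murder, Assault, UrbanPop, Rape) and state names match the labels of the figure, and princomp() on the correlation matrix; the exact call used for the slide is not known.

```r
pca <- princomp(USArrests, cor = TRUE)   # PCA on the correlation matrix
summary(pca)                             # variance explained by each component
head(pca$scores[, 1:2])                  # coordinates on Comp.1 and Comp.2
biplot(pca)                              # states and variable loadings together

# The principal components are uncorrelated with each other
round(cor(pca$scores), 10)
```

Using the correlation matrix puts the four variables on a common scale, which matters here because Assault has a much larger variance than the others.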
Other methods
There are several multivariate methods with increasing usage in
metabolomics and related fields
• Discriminant Analysis (DA, PLS-DA, OPLS-DA)
• Factor Analysis
• Structural Equation Modeling
Available tools
How to analyze data?
R
• R is a freely available language and environment for statistical computing
and graphics.
• It provides a wide variety of statistical and graphical techniques.
• It is constantly expanding thanks to user-contributed packages.
• Can be downloaded from http://cran.r-project.org.
Bioconductor
• Bioconductor is a repository of user-contributed R packages.
• It is accessible from http://www.bioconductor.org.
• Provides tools for the analysis and comprehension of high-throughput
genomic data.
• It has mailing lists and a very active users/developers community.
[Figure: Bioconductor submitted packages]
Installation of Bioconductor packages
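Bioconductor can be installed from within an R session. At the time of this
talk (2014) this was done with
source("http://bioconductor.org/biocLite.R")
biocLite()
Note: biocLite() has since been retired; current Bioconductor releases are
installed with the BiocManager package (install.packages("BiocManager"),
then BiocManager::install()).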
R basics
Getting help
• ?mean
• help(mean)
• help.search("mean")
• apropos("mean")
• example(mean)
R packages for metabolomics
Useful packages
There are a number of useful packages in Bioconductor for metabolomics
data analysis.
• flagme: Analysis of metabolomics GC/MS data
• xcms: Analysis of metabolomics LC/MS data
LC/MS example
xcms
• It can read data stored in several formats, such as NetCDF, mzXML, mzData
and mzML.
• It provides methods for feature detection, non-linear retention time
alignment, visualization, relative quantitation and statistics.
• It is capable of simultaneously preprocessing, analyzing, and visualizing
the raw data from hundreds of samples.
• It is available as an R package or as an online platform accessible through
https://xcmsonline.scripps.edu/.
[Figure: typical xcms workflow]
Reading the data
# Use biocLite to install Bioconductor packages
# (current Bioconductor releases use BiocManager::install() instead)
source("http://bioconductor.org/biocLite.R")
# Install the xcms package
biocLite("xcms")
# Install the dataset package used in this session
biocLite("faahKO")
The data in faahKO consists of LC/MS peaks from the spinal cords of 6 wild-
type and 6 FAAH knockout mice. The data is a subset of the original data from
200-600 m/z and 2500-4500 seconds. It was collected in positive ionization
mode.
# Load libraries
library("xcms")
library("faahKO")
cdfpath <- system.file("cdf", package = "faahKO")
files <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)
data <- xcmsSet(files)
Some important parameters
• scanrange = c(lower, upper): to scan part of the spectra
• fwhm = seconds: specify the full width at half maximum (default 30 s),
based on the type of chromatography
• method = "centWave": use a wavelet algorithm for peak detection,
suitable for high resolution spectra
Peak alignment and retention time correction
xsg <- group(data)   # peak alignment
xsg <- retcor(xsg)   # retention time correction
xsg <- group(xsg)    # re-align
• Match peaks across samples
• Use the peak groups to correct retention time drift
• Redo the alignment
• These steps can be repeated until no further change occurs
Filling in missing peaks
xsg <- fillPeaks(xsg)
• A significant number of potential peaks can be missed during peak
detection
• Missing values are problematic for robust statistical analysis
• We now have a better idea about where to expect real peaks and their
boundaries
• Re-scan the raw spectra and integrate peaks in the regions of the
missing peaks
Results of peak detection
peaks(xsg)
The peaks() function returns a table of peaks with
• mz
• mzmin
• mzmax
• rt
• rtmin
• rtmax
• peak intensities/areas (raw data)
Statistical analysis
report <- diffreport(xsg, "WT", "KO")
• The diffreport() function computes Welch’s two-sample t-statistic for
each analyte and ranks the analytes by p-value.
• It returns a summary report
• Multivariate analysis and visualization can be performed using
MetaboAnalyst
• The report generated by diffreport() can be directly uploaded to
MetaboAnalyst
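The idea behind this step, a Welch test per analyte followed by ranking on the p-value, can be sketched in base R. The code below is not the diffreport() implementation: it works on a simulated intensity matrix, and the matrix size, the group labels and the fourfold change added to the first ten features are all assumptions for illustration.

```r
set.seed(1)
# Simulated intensity matrix: 200 features (rows) x 12 samples (columns)
intens <- matrix(rlnorm(200 * 12, meanlog = 10, sdlog = 0.5), nrow = 200)
class_label <- rep(c("WT", "KO"), each = 6)
# Assumed effect: the first 10 features are 4 times higher in KO samples
intens[1:10, class_label == "KO"] <- intens[1:10, class_label == "KO"] * 4

# Welch's two-sample t test for each feature (var.equal = FALSE is the default)
welch <- t(apply(intens, 1, function(v) {
  tt <- t.test(v[class_label == "WT"], v[class_label == "KO"])
  c(tstat = unname(tt$statistic), pvalue = tt$p.value)
}))

# Rank the features by p-value, smallest first
ranked <- data.frame(feature = seq_len(nrow(intens)), welch)
ranked <- ranked[order(ranked$pvalue), ]
head(ranked)
```

With hundreds of features tested at once, the p-values in such a table should be adjusted for multiple comparisons before any feature is declared significant.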
Visualizing important peaks
# Select the first peak group with median retention time
# between 3300 and 3400 and detected in
# at least 8 samples
gr <- groups(xsg)
groupidx <- which(gr[, "rtmed"] > 3300 &
                  gr[, "rtmed"] < 3400 &
                  gr[, "npeaks"] >= 8)[1]
eiccor <- getEIC(xsg, groupidx = groupidx)
plot(eiccor, col = as.numeric(phenoData(xsg)$class))
• When significant peaks are identified, it is critical to visualize these
peaks to assess quality
• This is done using the Extracted Ion Chromatogram (EIC)
[Figure: Extracted Ion Chromatogram: 300.1 − 300.2 m/z; horizontal axis: Retention Time (seconds), vertical axis: Intensity]
Further reading
Some references
• Broadhurst, D. I., Kell, D. B. (2007). Statistical strategies for avoiding
false discoveries in metabolomics and related experiments.
Metabolomics, 2 (4), 171–196.
• Worley, B., Powers, R. (2013). Multivariate Analysis in Metabolomics.
Current metabolomics, 1 (1), 92–107.
• Issaq, H. J., Van, Q. N., Waybright, T. J., Muschik, G. M., Veenstra, T. D.
(2009). Analytical and statistical approaches to metabolomics research.
Journal of separation science, 32, 2183–2199.
• Smith, C. A. (2014). LC/MS Preprocessing and Analysis with xcms. R
package documentation.
• Korman, A., Oh, A., Raskind, A., Banks, D. (2012). Statistical methods in
metabolomics. Methods in molecular biology, 856 (Evolutionary
genomics), 381–413. Springer.
Centre for Research in Environmental Epidemiology
Parc de Recerca Biomèdica de Barcelona
Doctor Aiguader, 88
08003 Barcelona (Spain)
Tel. (+34) 93 214 70 00
Fax (+34) 93 214 73 02
info@creal.cat
www.creal.cat