2. Workflow diagram
STUDY DESIGN
SAMPLING
EXTRACTION
DATA ACQUISITION: Separation, Detection
DATA PROCESSING: File Conversion, Baseline Correction, Peak Detection, Deconvolution, Adduct Annotation, Alignment, Gap Filling
STATISTICS: Normalization, Univariate Analysis (Parametric, Nonparametric), Multivariate Analysis (Unsupervised, Supervised)
COMPOUND IDENTIFICATION: Molecular Formula ID, Structure ID, MS Library Search, Database Search, In silico Fragmentation
BIOLOGICAL INTERPRETATION: Pathway Mapping, Network Enrichment
VALIDATION
WCMC, UC Davis
3. Questions:
• How to group metabolites into sets?
• Which statistical method to use for set enrichment?
• Which sets are significantly different between two study groups?
4. How to group metabolites into sets?
Pathway maps
Pros:
• Well-known definitions, accepted by biologists
• Canonical maps
• Easy interpretation
Cons:
• Manual boundaries
• Poor coverage
• Overlapping maps
• Lack of consensus among databases
Chemical classes
Pros:
• Well-known classes, accepted by epidemiologists
• Good coverage
• Non-overlapping sets
Network modules
Pros:
• Study specific
• Non-overlapping
• All identified compounds are covered
Cons:
• Interpretation is difficult
Correlation modules
Pros:
• Study specific
• Non-overlapping
• Unknowns are included
Cons:
• Interpretation is difficult
5. Pathways are commonly used for metabolite set enrichment analysis
A typical pathway enrichment report (http://www.metaboanalyst.ca/)
What is the probability of having n metabolites of a pathway in the input list?
The hypergeometric test is often used.
6. Pathway maps as sets – limitation 1
Biochemical databases are incomplete for metabolomics
Example metabolomics dataset: non-obese diabetic mice
(http://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST000075)
385 identified primary metabolites, oxylipins, and complex lipids.
[Figure: overlap of the 385 identified compounds (All) with MeSH, NCBI BioSystems and KEGG annotations; counts shown: 173, 187, 135]
7. Pathway maps as sets – limitation 2
Pathway definitions are manual and vary across different databases
[Figure: bar chart of pathway count (axis 0 to 3000) for major pathway databases]
9. Pathway maps as sets – limitation 3
Pathway definitions are overlapping
[Figure: compounds from the NCBI BioSystems database; numbers show the count of pathway maps (1 to 96)]
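One way to quantify this overlap is to count, for every compound, how many pathway maps list it. The sketch below uses a made-up toy annotation (`pathway_maps` and its contents are hypothetical, not taken from NCBI BioSystems).

```python
from collections import Counter

# Hypothetical toy annotation: pathway map -> member compounds.
pathway_maps = {
    "map_A": {"glucose", "pyruvate", "lactate"},
    "map_B": {"pyruvate", "citrate"},
    "map_C": {"pyruvate", "lactate", "alanine"},
}

def maps_per_compound(maps):
    """Count how many pathway maps each compound belongs to."""
    counts = Counter()
    for members in maps.values():
        counts.update(members)
    return counts

def overlap_histogram(maps):
    """Map k -> number of compounds that occur in exactly k pathway maps."""
    return Counter(maps_per_compound(maps).values())

membership = maps_per_compound(pathway_maps)
histogram = overlap_histogram(pathway_maps)
print(dict(membership))
print(dict(histogram))
```

Any compound with a count above 1 contributes to more than one set, so the sets are not independent.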
10. Hypergeometric test is often used for pathway analyses
N: all compounds in HMDB with pathway annotations (~1600)
L: compounds in a pathway
K: altered compounds
M: altered compounds that belong to the pathway
pvalue = phyper(M-1, L, N-L, K, lower.tail=FALSE)
[Figure: pathway analysis output from the MetaboAnalyst software]
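As a rough illustration, the enrichment p-value can be computed as the upper tail of the hypergeometric distribution, P(X >= M), using the slide's symbols N, L, K and M. This is a minimal standard-library sketch; the function name `hypergeom_enrichment_p` and the example numbers are assumptions for illustration only.

```python
from math import comb

def hypergeom_enrichment_p(N, L, K, M):
    """Upper-tail probability P(X >= M) of the hypergeometric distribution.

    N: compounds in the background database
    L: compounds in the pathway
    K: altered compounds
    M: altered compounds that belong to the pathway
    """
    if not (0 <= L <= N and 0 <= K <= N):
        raise ValueError("L and K must lie between 0 and N")
    total = comb(N, K)
    # comb() returns 0 for impossible draws, so the sum is safe at the edges.
    tail = sum(
        comb(L, x) * comb(N - L, K - x)
        for x in range(max(M, 0), min(L, K) + 1)
    )
    return tail / total

# Hypothetical numbers: 1600 background compounds, a pathway of 20,
# 50 altered compounds, 5 of them in the pathway.
print(hypergeom_enrichment_p(1600, 20, 50, 5))
```

Note that the result depends directly on N, so the same input list gives a different p-value whenever the background database changes.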
12. How about p-value correction?
1. A significance cutoff of 0.05 for one statistical test means accepting a 5% chance of rejecting a null hypothesis that is actually true.
2. If we run 100 independent tests in which the null hypothesis is true, about 5 null hypotheses are expected to be incorrectly rejected. Those 5 are false positives (type I errors).
3. Number of pathway maps = number of hypergeometric tests.
4. A p-value correction using the false discovery rate (FDR) method limits the expected proportion of false positives among the pathway maps reported as significant.
5. The more pathways we test, the more type I errors we expect.
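The slide does not name a specific FDR procedure. A common choice is the Benjamini-Hochberg adjustment, sketched here with the standard library; the function name and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical pathway p-values
adj = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adj]
print(adj)
print(significant)
```

In this toy example four raw p-values are below 0.05, but only two pathways remain significant after adjustment.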
13. Pathway set analysis – limitation 4
The hypergeometric test or Fisher's exact test is inappropriate for metabolomics
15. Alternative A : MetaMapp clusters
http://metamapp.fiehnlab.ucdavis.edu/
Limitations
Cluster labels
Similarity cutoff
16. Alternative B : Chemical similarity clusters
The distance matrix is based on the Tanimoto coefficient
Limitation
Cluster labels
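For reference, the Tanimoto coefficient of two binary fingerprints is the number of shared on-bits divided by the number of bits set in either one. The sketch below assumes fingerprints are given as sets of on-bit indices and uses 1 - Tanimoto as the distance; both the representation and the toy fingerprints are assumptions, not real PubChem fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

def distance_matrix(fingerprints):
    """Pairwise distance (1 - Tanimoto) keyed by (name, name)."""
    names = list(fingerprints)
    return {
        (p, q): 1.0 - tanimoto(fingerprints[p], fingerprints[q])
        for p in names
        for q in names
    }

# Hypothetical on-bit indices for three compounds.
fps = {
    "cpd_1": {1, 4, 7, 9},
    "cpd_2": {1, 4, 7, 12},
    "cpd_3": {2, 3},
}
dist = distance_matrix(fps)
print(tanimoto(fps["cpd_1"], fps["cpd_2"]))
print(dist[("cpd_1", "cpd_3")])
```

A matrix like `dist` can then be passed to any hierarchical clustering routine to obtain the similarity clusters.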
17. Alternative C : Chemical Ontologies
Medical Subject Headings (MeSH) ontology
LIPID MAPS ontology
110K compounds with MeSH annotations
MeSH is linked to PubMed
Automated text mining on identified ontology groups
Limitation
Not every detected metabolite is covered
50K compounds
[Figure: overlap of the 385 identified compounds (All) with MeSH, NCBI BioSystems and KEGG annotations; counts shown: 173, 187, 135]
18. KS test is a better statistical method for metabolomics enrichment
Parameter                   Fisher exact   Hypergeometric   Binomial   K-S
Needs background database   Yes            Yes              No         No
Needs p-value cutoff        Yes            Yes              Yes        No
K-S: the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions. It can be used to compare a sample with a reference probability distribution (one-sample K–S test).
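A minimal sketch of a one-sample K-S test, assuming the sample is the list of compound-level p-values in one set and the reference is the Uniform(0, 1) distribution expected when nothing is altered. The p-value uses the asymptotic Kolmogorov series with a small-sample correction, so it is an approximation; function names and example values are hypothetical.

```python
import math

def ks_statistic_uniform(values):
    """Two-sided one-sample K-S statistic against Uniform(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    # Largest gap above and below the reference CDF F(x) = x.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here and the value is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical compound-level p-values for two sets.
unchanged_set = [0.1, 0.3, 0.5, 0.7, 0.9]
altered_set = [0.001, 0.002, 0.004, 0.01, 0.02, 0.03]

for name, pvals in (("unchanged", unchanged_set), ("altered", altered_set)):
    d = ks_statistic_uniform(pvals)
    print(name, d, ks_pvalue(d, len(pvals)))
```

No background database and no significance cutoff enter the calculation: only the p-values of the compounds in the set are used.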
19. ChemRICH combines MeSH, chemical similarity and KS test
Generation of the ChemRICH database
• MeSH and PubChem: Name, CID, SMILES, MeSH IDs
• PubChem fingerprint (rCDK package): Name, CID, SMILES, MeSH IDs, Fingerprint
• ChemRICH database (91,444 unique structures & 2768 MeSH classes)
ChemRICH analysis
• Metabolomics dataset statistics: Name, CID, SMILES, p-value, effect size
• Lookup and Tanimoto (>0.9): MeSH IDs, Classes
• New compounds: SMILES, Class
• Non-overlapping classes: Name, Class
• KS test: Class, P-value
• Enriched sets: ChemRICH impact plot
Other diagram labels: NC, HC, STR
Barupal, Dinesh Kumar, and Oliver Fiehn. "Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets." Scientific Reports 7.1 (2017): 14567.
21. ChemRICH app
Interactive cluster plot
compound level data table
cluster level data table
chemical similarity tree
Result downloads as xlsx, pptx, png, pdf
ChemRICH is available online
www.ChemRICH.us
23. ChemRICH : Data preparation
Example dataset available in the
TeachingMaterialDataSetsBioinformatics_Training_DataChemRICH folder
Null_ChemRIC_input.xlsx
24. ChemRICH input file errors:
• Duplicate PubChem CIDs
• Duplicate names
• Missing SMILES codes
• Missing p-value or fold-change
• Header mismatch
• > 1000 compounds
Always use the ChemRICH input template available at www.ChemRICH.us.
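The checks above can be run on a table before upload. This is a sketch only: the column names in `EXPECTED_HEADERS` are placeholders and must be replaced with the headers of the official template, and the example rows are made up.

```python
# Placeholder column names; the real ones come from the ChemRICH template.
EXPECTED_HEADERS = ["compound_name", "pubchem_cid", "smiles", "pvalue", "foldchange"]
MAX_COMPOUNDS = 1000

def _is_blank(value):
    return value is None or str(value).strip() == ""

def _duplicates(values):
    """Values that occur more than once, in order of first repeat."""
    seen, dupes = set(), []
    for v in values:
        if v in seen and v not in dupes:
            dupes.append(v)
        seen.add(v)
    return dupes

def validate_rows(headers, rows):
    """Return a list of problems; an empty list means none were found."""
    if list(headers) != EXPECTED_HEADERS:
        return ["header mismatch"]  # other checks rely on the column names
    problems = []
    if len(rows) > MAX_COMPOUNDS:
        problems.append(f"more than {MAX_COMPOUNDS} compounds")
    for column, label in (
        ("pubchem_cid", "duplicate PubChem CIDs"),
        ("compound_name", "duplicate names"),
    ):
        values = [r.get(column) for r in rows if not _is_blank(r.get(column))]
        dupes = _duplicates(values)
        if dupes:
            problems.append(f"{label}: {dupes}")
    for column in ("smiles", "pvalue", "foldchange"):
        missing = [i + 1 for i, r in enumerate(rows) if _is_blank(r.get(column))]
        if missing:
            problems.append(f"missing {column} in rows {missing}")
    return problems

# Made-up rows: the second row repeats a CID and has no SMILES code.
rows = [
    {"compound_name": "compound_A", "pubchem_cid": "111", "smiles": "CCO",
     "pvalue": "0.01", "foldchange": "1.5"},
    {"compound_name": "compound_B", "pubchem_cid": "111", "smiles": "",
     "pvalue": "0.20", "foldchange": "0.8"},
]
problems = validate_rows(EXPECTED_HEADERS, rows)
print(problems)
```

Rows are assumed to be dictionaries keyed by header, as produced for example by `csv.DictReader` on a CSV export of the spreadsheet.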
32. Major problems in pathway-based analysis
• Not all metabolites from a pathway map are present in a metabolomics dataset
• Not all detected metabolites have pathway annotations
• Pathway boundaries are arbitrary and overlapping
• Pathway maps vary across biochemical databases
• Background database size varies over time for a hypergeometric test
Better: a pathway-independent method that
• uses all identified metabolites
• uses non-overlapping set definitions
• does not depend on any background database
ChemRICH: Chemical Similarity Enrichment Analysis
33. Conclusions
Main advantages of the ChemRICH method
• Maps up to 95% of the known compounds in a metabolomics dataset.
• Uses non-overlapping clusters.
• Uses statistics that are independent of a background database.
• Can map compounds that are not yet in any database, such as in silico compounds.
• Utilizes existing knowledge from chemical ontologies to enable straightforward literature mining.
• Allows identification of new chemical clusters that are not yet covered in ontologies.
• The cluster impact plot visualizes the chemical diversity.
• Includes well-known chemical classes and leaves room for clustering of other chemical classes.
Publication
Barupal, Dinesh Kumar, and Oliver Fiehn. "Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets." Scientific Reports 7.1 (2017): 14567.