This document describes the ChemRICH method for metabolite set enrichment analysis of metabolomics data. ChemRICH uses chemical similarity clustering to group metabolites into non-overlapping sets, addressing limitations of pathway-based enrichment methods. It employs the Kolmogorov-Smirnov test for enrichment statistics without relying on background databases. The method was developed as open source software and has been applied to analyze a non-alcoholic fatty liver disease metabolomics study.
1. Unit 5.3 & 5.6
Metabolite set enrichment analysis (ChemRICH)
Dinesh Barupal
dinkumar@ucdavis.edu
2. [Figure: metabolomics workflow (WCMC, UC Davis). Stages and sub-steps:
• Study design
• Sampling
• Extraction
• Data acquisition: separation, detection
• Data processing: file conversion, baseline correction, peak detection, deconvolution, adduct annotation, alignment, gap filling
• Compound identification: molecular formula ID, structure ID, MS library search, database search, in silico fragmentation
• Statistics: normalization, univariate analysis (parametric, nonparametric), multivariate analysis (unsupervised, supervised)
• Biological interpretation: pathway mapping, network enrichment
• Validation]
3. Questions:
• How to group metabolites into sets?
• Which statistical method to use for set enrichment?
• Which sets are significantly different between two study groups?
4. High quality metabolomics data is a commodity
http://metabolomics.ucdavis.edu/
+ Raw LC/GC MS data files
+ Quality control reports
+ ~ 5000 high quality unknown metabolites
~800 known metabolites for $280 only !
By 2020, blood metabolomics datasets
will have 1500 identified compounds.
5. How to group metabolites into sets?
Pathway maps
Pros:
• Well-known definitions, accepted by biologists
• Canonical maps
• Easy interpretation
Cons:
• Manual boundaries
• Poor coverage
• Overlapping maps
• Lack of consensus among databases
Chemical classes
Pros:
• Well-known classes, accepted by epidemiologists
• Good coverage
• Non-overlapping sets
Network modules
Pros:
• Study specific
• Non-overlapping
• All identified compounds are covered
Cons:
• Interpretation is difficult
Correlation modules
Pros:
• Study specific
• Non-overlapping
• Unknowns are included
Cons:
• Interpretation is difficult
6. Argument 1 :
Biochemical databases are incomplete for metabolomics
Example metabolomics dataset: non-obese diabetic mice
(http://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST000075)
385 identified primary metabolites, oxylipins, complex lipids.
[Figure: coverage of all 385 identified compounds by MeSH, NCBI BioSystems and KEGG; counts shown in the figure: 173, 187, 135]
7. Argument 2:
Pathway definitions are manual and vary across different databases
[Figure: bar chart of pathway count (y-axis, 0 to 3000) for the major pathway databases]
9. Argument 3:
Pathway definitions are overlapping
[Figure: compounds from the NCBI BioSystems database; the number on each compound is the count of pathway maps that contain it, ranging from 1 to 96]
10. What is enrichment analysis ?
http://jura.wi.mit.edu/bio/education/hot_topics/
> 50,000 papers report use of enrichment or overrepresentation for
lists of genes, transcripts, proteins or metabolites.
A very hot area of research for building new bioinformatics software.
Many opportunities for development in the field of metabolomics.
11. http://www.metaboanalyst.ca/
A typical pathway enrichment report
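[Figure: hypergeometric test diagram]
N: all compounds (CPDs) in HMDB with pathway annotations (~1600)
L: compounds in a pathway
K: altered compounds
M: altered compounds that belong to the pathway
pvalue = phyper(M,L,N-L,K)
What is the probability of having M metabolites of a pathway in the input list?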
Pathways are often used for enrichment analysis
Why do we need another enrichment analysis approach?
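The hypergeometric calculation on this slide can be sketched in a few lines. This is a minimal illustration, not MetaboAnalyst's implementation: the function name and the example counts for L, K and M are hypothetical, and the sketch computes the upper tail P(X >= M), which in R corresponds to phyper(M-1, L, N-L, K, lower.tail=FALSE). Running it with two background sizes (~1600 annotated compounds versus ~110,000, the size of the entire HMDB) shows how strongly the p-value depends on the choice of N.

```python
from math import comb


def hypergeom_upper_tail(M, L, N, K):
    """P(X >= M): probability that at least M of the K altered compounds
    fall in a pathway of size L, given a background of N compounds."""
    total = comb(N, K)
    upper = min(L, K)
    hits = sum(comb(L, x) * comb(N - L, K - x) for x in range(M, upper + 1))
    return hits / total


# Hypothetical counts for illustration only.
L, K, M = 40, 100, 8          # pathway size, altered compounds, overlap

p_annotated = hypergeom_upper_tail(M, L, 1600, K)     # background ~1600
p_whole_hmdb = hypergeom_upper_tail(M, L, 110000, K)  # background ~110,000

print(f"N = 1600:   p = {p_annotated:.3g}")
print(f"N = 110000: p = {p_whole_hmdb:.3g}")
```

The same input list and the same pathway give very different p-values once the background changes, which is the dependency the following slides argue against.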
13. Argument 5:
Background database size is not defined for metabolomics
• expected compounds: entire HMDB (~110,000)
• compounds with pathway annotations: ~2000 for human
• compounds with reaction annotations: ~4000 for human
• compounds with literature annotations: ~15000 for human blood
• detected known compounds: varies between 500 and 1000
• all detected compounds: ~3000
14. Major problems in pathway based analysis
• Not all metabolites from a pathway map are present in a metabolomics dataset
• Not all detected metabolites have pathway annotations
• Pathway boundaries are arbitrary and overlapping
• Pathway maps vary across biochemical databases
• The background database size for a hypergeometric test varies over time
Better: a pathway-independent method that
• uses all identified metabolites
• uses non-overlapping set definitions
• does not depend on any background database
ChemRICH : Chemical Similarity Enrichment Analysis
16. Alternative A : MetaMapp clusters
http://metamapp.fiehnlab.ucdavis.edu/
Limitations
Cluster labels
Similarity cutoff
17. Alternative B : Chemical similarity clusters
Distance matrix is based on the Tanimoto coefficient
Limitation
Cluster labels
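As an illustration of the Tanimoto coefficient behind this distance matrix, here is a minimal sketch. It assumes each fingerprint is represented as a set of "on" bit positions; the compound names and bit sets are hypothetical, and real fingerprints (for example PubChem fingerprints) would come from a cheminformatics toolkit rather than being typed by hand.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by all on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical fingerprints, given as sets of on-bit positions.
fingerprints = {
    "compound A": {1, 2, 3, 4, 5},
    "compound B": {1, 2, 3, 4, 6},
    "compound C": {10, 11, 12},
}

names = sorted(fingerprints)
# Pairwise distance = 1 - similarity; this matrix is what a clustering
# step would take as input.
distance = {
    (x, y): 1.0 - tanimoto(fingerprints[x], fingerprints[y])
    for x in names
    for y in names
}

for x in names:
    print(x, [round(distance[(x, y)], 2) for y in names])
```

Compounds A and B share four of six on-bits and end up close together, while compound C shares none and sits at the maximum distance of 1.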
18. Alternative C : Chemical Ontologies
Medical Subject Headings (MeSH) ontology
Lipidmaps ontology
110K compounds with MeSH annotations
MeSH is linked to PubMed
Automated text mining on identified ontology groups.
Limitation
Not every detected metabolite is covered
50K compounds
[Figure: coverage of the 385 identified compounds by MeSH, NCBI BioSystems and KEGG, as on slide 6]
19. KS test is a better statistical method for metabolomics enrichment
Parameter            Fisher Exact  Hypergeometric  Binomial  K-S
Background database  Yes           Yes             No        No
p-value cutoff       Yes           Yes             Yes       No
K-S: the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test).
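A one-sample K-S test on the p-values of one metabolite set can be sketched as follows. This is an illustration under stated assumptions, not the ChemRICH code: it assumes the reference distribution is uniform on [0, 1], uses the one-sided statistic D+ (the sample piles up at small p-values), and approximates the set-level p-value with the asymptotic formula exp(-2 n D+^2), which is rough for small sets. The function name and both example sets are hypothetical.

```python
from math import exp


def ks_uniform_one_sided(pvalues):
    """One-sided one-sample K-S test against Uniform(0, 1).

    Returns (D+, approximate p-value). D+ is the largest amount by which
    the empirical CDF of the p-values exceeds the uniform CDF F(x) = x.
    """
    x = sorted(pvalues)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one p-value")
    d_plus = max((i + 1) / n - v for i, v in enumerate(x))
    # Asymptotic one-sided approximation; an exact method is preferable
    # for very small sets.
    return d_plus, exp(-2 * n * d_plus ** 2)


# Hypothetical metabolite-level p-values for two sets.
altered_set = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2]
unchanged_set = [0.05, 0.2, 0.4, 0.55, 0.7, 0.9]

d_alt, p_alt = ks_uniform_one_sided(altered_set)
d_unc, p_unc = ks_uniform_one_sided(unchanged_set)
print(f"altered set:   D+ = {d_alt:.3f}, p = {p_alt:.3g}")
print(f"unchanged set: D+ = {d_unc:.3f}, p = {p_unc:.3g}")
```

Every metabolite-level p-value in the set is used, so no significance cutoff and no background database enter the calculation, matching the two "No" entries in the K-S column above.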
20. ChemRICH combines MeSH, chemical similarity and KS test
[Figure: two-panel workflow (diagram labels: NC, HC, STR)
Generation of the ChemRICH database: MeSH and PubChem records (name, CID, SMILES, MeSH IDs) are combined with PubChem fingerprints computed with the rCDK package to give the ChemRICH database (91,444 unique structures and 2768 MeSH classes).
ChemRICH analysis: a metabolomics dataset with statistics (name, CID, SMILES, p-value, effect size) is matched to the database by lookup or by Tanimoto similarity (>0.9) to obtain MeSH IDs and classes; new compounds are classified from their SMILES; the resulting non-overlapping classes (name, class) go through the KS test to give class p-values, enriched sets and the ChemRICH impact plot.]
21. Precise steps in the ChemRICH analysis for a metabolomics dataset
[Flowchart, in order:
• Start: all metabolites (385)
• ChemRICH lookup: label found? (yes/no)
• Tanimoto similarity: TM score > 0.90? (yes/no)
• SMILES regex search: classes found? (yes/no)
• Detection of new clusters from the similarity matrix by hierarchical clustering (HCL): new cluster? TM score > 0.75? (yes/no)
• Remaining compounds are reported individually
• Generation of non-overlapping class annotation
• Set size > 2? Yes (325), No (55); 50 sets
• KS test on p-values, with effect sizes
• ChemRICH enrichment plot
• End
Branch counts shown in the flowchart: 68, 317, 151, 166, 147, 19, 0, 19, 5, 14]
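The first branches of this flow (lookup, then a similarity match, then a set-size filter before testing) can be sketched as below. This is a simplified illustration, not the ChemRICH source: the function names, the set-based fingerprints, the class labels and the example compounds are all hypothetical, and the SMILES regex search and the clustering of leftover compounds are left out.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_classes(compounds, lookup, reference, cutoff=0.90):
    """Assign one class per compound: direct lookup first, then the class
    of the most similar reference structure if its score exceeds cutoff.
    Compounds that match neither are left as None."""
    classes = {}
    for name, fp in compounds.items():
        if name in lookup:
            classes[name] = lookup[name]
            continue
        score, ref_class = max(
            ((tanimoto(fp, ref_fp), ref_cls) for ref_fp, ref_cls in reference),
            key=lambda pair: pair[0],
            default=(0.0, None),
        )
        classes[name] = ref_class if score > cutoff else None
    return classes


def sets_for_testing(classes, min_members=3):
    """Keep only classes with more than two members (set size > 2)."""
    groups = defaultdict(list)
    for name, cls in classes.items():
        if cls is not None:
            groups[cls].append(name)
    return {c: sorted(m) for c, m in groups.items() if len(m) >= min_members}


# Hypothetical data for illustration only.
hexose_bits = set(range(1, 11))
lookup = {"glucose": "Hexoses"}
reference = [(hexose_bits, "Hexoses"), ({20, 21, 22, 23}, "Fatty Acids")]
compounds = {
    "glucose": hexose_bits,
    "sugar A": hexose_bits | {11},   # Tanimoto 10/11 to the hexose reference
    "sugar B": set(hexose_bits),     # identical fingerprint
    "odd compound": {50, 51},        # no similar reference structure
}

classes = assign_classes(compounds, lookup, reference)
test_sets = sets_for_testing(classes)
print(classes)
print(test_sets)
```

Because each compound receives at most one class, the resulting sets cannot overlap; compounds left as None are the ones the flowchart sends on to the later steps.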
23. ChemRICH app
Interactive cluster plot
compound level data table
cluster level data table
chemical similarity tree
Result downloads as xlsx, pptx, png, pdf
ChemRICH is available online
www.ChemRICH.us
25. ChemRICH : Data preparation
Example dataset available in the chemrich example folder
spring_2018_metabolomics_course_chemrich_example
Use the PubChem Identifier Exchange Service to obtain identifiers, InChIKeys and SMILES for compound names.
35. Conclusions
Main advantages of the ChemRICH method
• maps up to 95% of the known compounds in a metabolomics dataset.
• non-overlapping clusters.
• background database independent statistics.
• can map compounds that are not yet in any database, such as in-silico compounds.
• utilizes existing knowledge from chemical ontologies to enable straightforward literature mining.
• allows identification of new chemical clusters that are not yet covered in ontologies.
• the cluster impact plot visualizes the chemical diversity.
• includes well-known chemical classes and leaves room for clustering of other chemical classes.
Publication
Barupal Dinesh & Fiehn Oliver. ChemRICH : Chemical Similarity Enrichment Analysis for metabolomics datasets. Scientific Reports (2017)