Metabolite Set Enrichment Analysis (ChemRICH)

Unit 5.3 & 5.6
Dinesh Barupal
dinkumar@ucdavis.edu

[Figure: the metabolomics workflow (WCMC, UC Davis). Stages shown: study design; sampling; extraction; data acquisition (separation, detection); data processing (file conversion, baseline correction, peak detection, deconvolution, adduct annotation, alignment, gap filling); compound identification (molecular formula ID, structure ID, MS library search, database search, in silico fragmentation); statistics (normalization; univariate analysis: parametric, nonparametric; multivariate analysis: unsupervised, supervised); biological interpretation (pathway mapping, network enrichment); validation.]

Questions:
• How to group metabolites into sets?
• Which statistical method to use for set enrichment?
• Which sets are significantly different between two study groups?

High-quality metabolomics data is a commodity
http://metabolomics.ucdavis.edu/
+ Raw LC/GC MS data files
+ Quality control reports
+ ~5000 high-quality unknown metabolites
~800 known metabolites for only $280!
By 2020, blood metabolomics datasets will have 1500 identified compounds.

How to group metabolites into sets?

Pathway maps
  Pros
  • Well-known definitions, accepted by biologists
  • Canonical maps
  • Easy interpretation
  Cons
  • Manual boundaries
  • Poor coverage
  • Overlapping maps
  • Lack of consensus among databases

Chemical classes
  Pros
  • Well-known classes, accepted by epidemiologists
  • Good coverage
  • Non-overlapping sets

Network modules
  Pros
  • Study specific
  • Non-overlapping
  • All identified compounds are covered
  Cons
  • Interpretation is difficult

Correlation modules
  Pros
  • Study specific
  • Non-overlapping
  • Unknowns are included
  Cons
  • Interpretation is difficult

Argument 1:
Biochemical databases are incomplete for metabolomics
Example metabolomics dataset: non-obese diabetic mice
(http://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST000075)
385 identified primary metabolites, oxylipins and complex lipids.
[Figure: coverage of the identified compounds by database. Set labels: All, MeSH, NCBI BioSystems, KEGG. Values shown in the figure: 385, 173, 187, 135.]

Argument 2:
Pathway definitions are manual and
vary across different databases
[Figure: Major pathway databases. Bar chart; y-axis: pathway count (0 to 3000).]

Which Krebs cycle definition?
KEGG
Reactome
SMPDB
MetaCyc

Argument 3:
Pathway definitions are overlapping
[Figure: compounds from the NCBI BioSystems database, grouped by the number of pathway maps that contain them (1 to 96).]

What is enrichment analysis?
http://jura.wi.mit.edu/bio/education/hot_topics/
More than 50,000 papers report the use of enrichment or overrepresentation analysis for lists of genes, transcripts, proteins or metabolites.
A very active area of research for building new bioinformatics software.
Many opportunities for development in the field of metabolomics.

Pathways are often used for enrichment analysis
Why do we need another enrichment analysis approach?
http://www.metaboanalyst.ca/
A typical pathway enrichment report
N = all compounds in HMDB with pathway annotations (~1600)
L = compounds in a pathway
K = altered compounds
M = altered compounds that belong to the pathway
pvalue = phyper(M - 1, L, N - L, K, lower.tail = FALSE)
What is the probability of having M or more metabolites of a pathway in the input list?
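The over-representation p-value on this slide can be sketched in Python with the standard library only. The function name and the example counts are illustrative assumptions, not values from the slides; N, L, K and M follow the slide's notation.

```python
from math import comb

def hypergeom_upper_tail(M, L, N, K):
    """P(X >= M) when K compounds are drawn from N, of which L are in the pathway."""
    upper = min(L, K)
    favourable = sum(comb(L, x) * comb(N - L, K - x) for x in range(M, upper + 1))
    return favourable / comb(N, K)

# Hypothetical counts: 1600 annotated compounds, a pathway of 20 compounds,
# 50 altered compounds, 5 of which belong to the pathway.
p = hypergeom_upper_tail(M=5, L=20, N=1600, K=50)
print(f"p = {p:.3g}")
```

The sum starts at M rather than M + 1 because the question is about observing M or more pathway members in the input list.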

Argument 4:
The hypergeometric test or Fisher's exact test is inappropriate for metabolomics

Argument 5:
Background database size is not defined for metabolomics
• Expected compounds: entire HMDB (~110,000)
• Compounds with pathway annotations: ~2000 for human
• Compounds with reaction annotations: ~4000 for human
• Compounds with literature annotations: ~15,000 for human blood
• Detected known compounds: varies between 500 and 1000
• All detected compounds: ~3000
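To illustrate why an undefined background matters, the same overlap can be scored against several of the background sizes listed above. This is a minimal sketch: the pathway size (20), the number of altered compounds (50) and the overlap (5) are hypothetical, and only the background sizes come from the slide.

```python
from math import comb

def enrichment_p(overlap, pathway_size, background, n_altered):
    """Upper-tail hypergeometric probability P(X >= overlap)."""
    upper = min(pathway_size, n_altered)
    favourable = sum(
        comb(pathway_size, x) * comb(background - pathway_size, n_altered - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(background, n_altered)

# Approximate background sizes quoted on the slide.
backgrounds = {
    "detected known compounds": 1000,
    "pathway annotations": 2000,
    "reaction annotations": 4000,
    "literature annotations": 15000,
    "entire HMDB": 110000,
}

# Same hypothetical overlap, scored against each background.
p_by_background = {
    label: enrichment_p(overlap=5, pathway_size=20, background=size, n_altered=50)
    for label, size in backgrounds.items()
}
for label, p in p_by_background.items():
    print(f"{label:26s} N = {backgrounds[label]:>6d}  p = {p:.2e}")
```

The p-value shrinks as the assumed background grows, so the reported significance depends on a choice that the data do not determine.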

Major problems in pathway-based analysis
• Not all metabolites from a pathway map are present in a metabolomics dataset
• Not all detected metabolites have pathway annotations
• Pathway boundaries are arbitrary and overlapping
• Pathway maps vary across biochemical databases
• Background database size for a hypergeometric test varies over time
Better: a pathway-independent method that
• uses all identified metabolites
• uses non-overlapping set definitions
• does not depend on any background database
ChemRICH: Chemical Similarity Enrichment Analysis

What are alternative set definitions and statistics?

Alternative A: MetaMapp clusters
http://metamapp.fiehnlab.ucdavis.edu/
Limitations: cluster labels, similarity cutoff

Alternative B: Chemical similarity clusters
The distance matrix is based on the Tanimoto coefficient
Limitation: cluster labels
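The Tanimoto coefficient behind such a distance matrix can be sketched as follows. Fingerprints are represented here as Python sets of "on" bit positions, and the three example fingerprints are invented for illustration; they are not real PubChem fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits present in either fingerprint."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical fingerprints given as sets of on-bit indices.
fingerprints = {
    "compound_1": {1, 4, 7, 9, 12},
    "compound_2": {1, 4, 7, 9, 15},
    "compound_3": {2, 3, 20},
}

# Pairwise distance = 1 - similarity, the usual input for hierarchical clustering.
names = sorted(fingerprints)
distance = {
    (a, b): 1.0 - tanimoto(fingerprints[a], fingerprints[b])
    for a in names
    for b in names
}
for (a, b), d in distance.items():
    if a < b:
        print(f"{a} vs {b}: distance = {d:.3f}")
```

Clustering on this matrix groups structurally similar compounds, but the resulting clusters still have no names, which is the limitation noted on the slide.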

Alternative C: Chemical ontologies
Medical Subject Headings (MeSH) ontology
LIPID MAPS ontology
110K compounds with MeSH annotations
MeSH is linked to PubMed, which allows automated text mining on identified ontology groups.
Limitation: not every detected metabolite is covered
50K compounds
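[Figure: coverage of the identified compounds by database. Set labels: All, MeSH, NCBI BioSystems, KEGG. Values shown in the figure: 385, 173, 187, 135.]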

The KS test is a better statistical method for metabolomics enrichment

Parameter             Fisher exact   Hypergeometric   Binomial   K-S
Background database   Yes            Yes              No         No
p-value cutoff        Yes            Yes              Yes        No

K-S: the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test).
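A one-sample K-S test of this kind can be sketched in pure Python. The sketch assumes that the reference distribution is uniform on [0, 1] (the distribution of p-values under the null hypothesis) and that the alternative is one-sided, with the set's p-values shifted toward zero. Both choices are assumptions made for illustration and are not confirmed by the slides. The tail probability uses the exact Birnbaum-Tingey formula for the one-sided statistic.

```python
from math import comb, floor

def ks_greater_uniform(pvalues):
    """One-sided one-sample K-S test against Uniform(0, 1).

    Returns (D_plus, p), where D_plus is the largest amount by which the
    empirical CDF exceeds the uniform CDF.
    """
    xs = sorted(pvalues)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    if d_plus <= 0:
        return d_plus, 1.0
    total = 0.0
    for j in range(floor(n * (1 - d_plus)) + 1):
        a = d_plus + j / n
        b = max(0.0, 1 - d_plus - j / n)  # guard against rounding below zero
        total += comb(n, j) * a ** (j - 1) * b ** (n - j)
    return d_plus, min(1.0, d_plus * total)

# Hypothetical compound-level p-values for two metabolite sets.
shifted_set = [0.001, 0.002, 0.004, 0.01, 0.02]   # mostly small p-values
spread_set = [0.1, 0.3, 0.5, 0.7, 0.9]            # evenly spread p-values

for label, values in (("shifted", shifted_set), ("spread", spread_set)):
    d, p = ks_greater_uniform(values)
    print(f"{label}: D+ = {d:.3f}, set p-value = {p:.3g}")
```

Because the test works on the full distribution of compound-level p-values within a set, it needs neither a background database nor a significance cutoff for individual compounds, which matches the two "No" entries in the table.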

ChemRICH combines MeSH, chemical similarity and the KS test
[Figure: two-panel workflow.
Panel 1, generation of the ChemRICH database: MeSH and PubChem records (Name, CID, SMILES, MeSH IDs) are combined and a PubChem fingerprint is added for each structure with the rCDK package (91,444 unique structures and 2768 MeSH classes).
Panel 2, ChemRICH analysis: metabolomics dataset statistics (Name, CID, SMILES, p-value, effect size); lookup in the ChemRICH database; Tanimoto similarity (> 0.9); MeSH IDs and classes (Name, Class); non-overlapping classes; KS test; class p-value; enriched sets; ChemRICH impact plot. Other labels in the figure: NC, HC, STR, SMILES, Class, New compounds.]

ChemRICH combines MeSH, chemical similarity and the KS test
Precise steps in the ChemRICH analysis for a metabolomics dataset
[Figure: flowchart. Boxes and decisions, in reading order: Start; all metabolites (385); ChemRICH lookup; label found? (Yes/No); Tanimoto similarity, TM score > 0.90 (Yes/No); detection of new clusters (SMILES regex search, similarity matrix, HCL); new cluster? (Yes/No), TM score > 0.75 (Yes/No); reported individually; generation of non-overlapping class annotation; classes found; set size > 2: Yes (325), No (55); KS test on p-values (50 sets), with effect sizes; ChemRICH enrichment plot; END. Other counts shown on the flowchart, in extraction order: 68, 317, 151, 166, 147, 19, 0, 19, 5, 14.]

[Figure A: ChemRICH impact plot. x-axis: cluster order on Tanimoto similarity tree; y-axis: -log(pvalue). Labeled clusters: disaccharides, hexose-phosphates, pentoses, hexoses, sugar alcohols, sugar acids, tricarboxylic acids, butyrates, hydroxybutyrates, amino acids (sulfur), amino acids (branched-chain), cholesterol esters, pyridines, amino acids (aromatic), indoles, sphingomyelins, unsaturated lysophosphatidylcholines, phosphatidylcholines, phosphatidylinositols, plasmalogens, phosphatidylethanolamines, DiHODE, oxo-ETE, HETrE, HETE, unsaturated triglycerides, saturated FA, saturated triglycerides, saturated lysophosphatidylcholines.]
Cluster name                  Cluster size  p-value   Adjusted p-value  Total changed  Increased  Decreased
UnSaturated PC                38            5.18E-10  2.54E-08          25             2          23
UnSaturated TG                35            7.38E-09  1.81E-07          22             21         1
UnSaturated SM                17            8.30E-06  0.000135          12             0          12
UnSaturated LPC               9             1.10E-05  0.000135          9              0          9
Butyrates                     7             9.14E-05  0.000896          7              6          1
Disaccharides                 8             0.00021   0.001712          7              6          1
PUFA TG                       12            0.000266  0.001862          8              8          0
Hexoses                       7             0.000597  0.003656          6              6          0
Sugar Acids                   10            0.001707  0.009296          6              6          0
PUFA PI                       4             0.002339  0.010419          4              0          4
Saturated TG                  4             0.002339  0.010419          4              4          0
OH-FA_20                      17            0.003475  0.014191          6              1          5
OH-FA_18                      10            0.004912  0.018513          5              0          5
PUFA PC                       11            0.005484  0.019193          5              0          5
Amino Acids, Branched-Chain   3             0.007153  0.019472          3              3          0
Pentoses                      3             0.007153  0.019472          3              3          0
PUFA LPC                      3             0.007153  0.019472          3              0          3
PUFA PE                       6             0.007153  0.019472          4              0          4
Sugar Alcohols                12            0.01423   0.036698          4              3          1
Amino Acids, Sulfur           3             0.041632  0.081599          2              0          2
Hexosephosphates              3             0.041632  0.081599          2              2          0
Indoles                       3             0.041632  0.081599          2              2          0
O=FA_20                       3             0.041632  0.081599          2              0          2
Pyridines                     3             0.041632  0.081599          2              2          0
Tricarboxylic Acids           3             0.041632  0.081599          2              2          0
Using the ontology/chemistry clusters
to compute p-values for significant metabolic differences
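The slides do not say how the adjusted p-values were computed. The tabulated values are, however, numerically consistent with a Benjamini-Hochberg correction over 49 tested sets: for the top-ranked cluster, 5.18E-10 × 49 / 1 ≈ 2.54E-08. The sketch below treats both the method and the number of tests as inferences from the numbers, not as stated facts; the function name is hypothetical.

```python
def bh_adjust(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    n_tests lets the caller supply the total number of tests when only a
    subset of the p-values (for example the top of a results table) is given.
    """
    n = n_tests if n_tests is not None else len(pvalues)
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# The four smallest cluster p-values from the table, with 49 tests assumed.
top_pvalues = [5.18e-10, 7.38e-09, 8.30e-06, 1.10e-05]
for raw, adj in zip(top_pvalues, bh_adjust(top_pvalues, n_tests=49)):
    print(f"{raw:.2e} -> {adj:.3g}")
```

Under this assumption the third and fourth clusters receive the same adjusted value, which agrees with the identical 0.000135 entries in the table.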

ChemRICH app
Interactive cluster plot
compound level data table
cluster level data table
chemical similarity tree
Result downloads as xlsx, pptx, png, pdf
ChemRICH is available online
www.ChemRICH.us

ChemRICH analysis for the NAFLD study

ChemRICH: Data preparation
Example dataset available in the chemrich example folder
spring_2018_metabolomics_course_chemrich_example
Use the PubChem Identifier Exchange Service to obtain identifiers, InChIKeys and SMILES for compound names.

Common ChemRICH input file errors:
• Duplicate PubChem CIDs
• Duplicate names
• Missing SMILES codes
• Missing p-value or fold-change
• Header mismatch
• More than 1000 compounds
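The errors listed above can be caught before submission with a small pre-flight check. This is a sketch under stated assumptions: the column names are hypothetical placeholders (the slides do not give the real headers), and the input is assumed to be tab-separated text.

```python
import csv
import io

# Hypothetical column names, not taken from the slides.
REQUIRED_COLUMNS = ["compound_name", "pubchem_cid", "smiles", "pvalue", "foldchange"]
MAX_COMPOUNDS = 1000

def check_input(text):
    """Return a list of problems found in tab-separated input text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    headers = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in headers]
    if missing:
        return ["header mismatch, missing: " + ", ".join(missing)]

    rows = list(reader)
    problems = []
    if len(rows) > MAX_COMPOUNDS:
        problems.append(f"{len(rows)} compounds, limit is {MAX_COMPOUNDS}")

    seen = {"pubchem_cid": set(), "compound_name": set()}
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        values = {col: (row.get(col) or "").strip() for col in REQUIRED_COLUMNS}
        for col in ("smiles", "pvalue", "foldchange"):
            if not values[col]:
                problems.append(f"line {line_no}: missing {col}")
        for col in ("pubchem_cid", "compound_name"):
            if values[col] in seen[col]:
                problems.append(f"line {line_no}: duplicate {col} {values[col]}")
            seen[col].add(values[col])
    return problems

# Illustrative input with a duplicated compound and two empty fields.
example = (
    "compound_name\tpubchem_cid\tsmiles\tpvalue\tfoldchange\n"
    "glucose\t5793\tC(C1C(C(C(C(O1)O)O)O)O)O\t0.01\t1.5\n"
    "glucose\t5793\t\t0.2\t\n"
)
for problem in check_input(example):
    print(problem)
```

Reporting every problem with its line number, instead of stopping at the first one, makes it easier to fix a spreadsheet in a single pass.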

Perform ChemRICH analysis
www.ChemRICH.us
Paste your data in this box

Explanation of results
Editable PowerPoint slide

Download/interact with results
[Figure: ChemRICH impact plot. x-axis: median XlogP of clusters; y-axis: -log(pvalue). Labeled clusters: Imino Acids, Saturated_Lysophosphatidylcholines, Lysophospholipids, Unsaturated_Lysophosphatidylcholines, NewCluster_32, Cholestenes, Phosphatidylethanolamines, NewCluster_14, Unsaturated_Phosphatidylcholines, Sphingomyelins, Diglycerides, Plasmalogens, Unsaturated_Ceramides, Galactosylceramides, Cholesterol Esters.]

User provided classes
http://chemrich.fiehnlab.ucdavis.edu/ocpu/library/ChemRICHTest3/www/class.html

Docker image and source code
GitHub: https://github.com/barupal/chemrich
Bitbucket: https://bitbucket.org/barupal/chemrich-docker
Docker: https://hub.docker.com/r/barupal/chemrich-docker/
docker pull barupal/chemrich-docker

Conclusions
Main advantages of the ChemRICH method
• Maps up to 95% of the known compounds in a metabolomics dataset.
• Uses non-overlapping clusters.
• Uses statistics that are independent of a background database.
• Can map compounds that are not yet in any database, such as in silico compounds.
• Uses existing knowledge from chemical ontologies to enable straightforward literature mining.
• Allows identification of new chemical clusters that are not yet covered in ontologies.
• The cluster impact plot visualizes the chemical diversity.
• Includes well-known chemical classes and leaves room for clustering of other chemical classes.
Publication
Barupal, Dinesh & Fiehn, Oliver. ChemRICH: Chemical Similarity Enrichment Analysis for metabolomics datasets. Scientific Reports (2017)
