Metabolic Set Enrichment Analysis - chemrich - 2019

Unit 4.6
Metabolite set enrichment analysis (ChemRICH)
Dinesh Barupal
dinkumar@ucdavis.edu

Figure: metabolomics workflow (WCMC, UC Davis)
STUDY DESIGN
SAMPLING
EXTRACTION
DATA ACQUISITION: Separation, Detection
DATA PROCESSING: File Conversion, Baseline Correction, Peak Detection, Deconvolution, Adduct Annotation, Alignment, Gap Filling
COMPOUND IDENTIFICATION: Molecular Formula ID, Structure ID, MS Library Search, Database Search, In silico Fragmentation
STATISTICS: Normalization, Univariate Analysis (Parametric, Nonparametric), Multivariate Analysis (Unsupervised, Supervised)
BIOLOGICAL INTERPRETATION: Pathway Mapping, Network Enrichment
VALIDATION

Questions:
• How to group metabolites into sets?
• Which statistical method to use for set enrichment?
• Which sets are significantly different between two study groups?

How to group metabolites into sets?

Pathway maps
Pros:
• Well-known definitions, accepted by biologists
• Canonical maps
• Easy interpretation
Cons:
• Manual boundaries
• Poor coverage
• Overlapping maps
• Lack of consensus among databases

Chemical classes
Pros:
• Well-known classes, accepted by epidemiologists
• Good coverage
• Non-overlapping sets

Network modules
Pros:
• Study specific
• Non-overlapping
• All identified compounds are covered
Cons:
• Interpretation is difficult

Correlation modules
Pros:
• Study specific
• Non-overlapping
• Unknowns are included
Cons:
• Interpretation is difficult

Pathways are commonly used for metabolite set enrichment analysis
http://www.metaboanalyst.ca/
Figure: a typical pathway enrichment report
What is the probability of having n metabolites of a pathway in the input list?
The hypergeometric test is often used.

Pathway maps as sets – limitation 1
Biochemical databases are incomplete for metabolomics
Example metabolomics dataset: non-obese diabetic mice
(http://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Study&StudyID=ST000075)
385 identified primary metabolites, oxylipins, and complex lipids.
Figure: Venn diagram of database coverage for the identified compounds (labels: All, MeSH, NCBI BioSystems, KEGG; values shown: 385, 173, 187, 135)

Pathway definitions are manual and vary across different databases
Figure: bar chart of pathway count (axis 0 to 3000) for the major pathway databases

Which Krebs cycle definition?
KEGG
Reactome
SMPDB
MetaCyc

Pathway definitions are overlapping
Figure: compounds from the NCBI BioSystems database, grouped by the number of pathway maps that contain each compound (counts range from 1 to 96)

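Hypergeometric test is often used for pathway analyses
N = all compounds in HMDB with pathway annotations (~1600)
L = compounds in a pathway
K = altered compounds
M = altered compounds that are in the pathway
pvalue = phyper(M-1, L, N-L, K, lower.tail=FALSE)
(upper tail: the probability of seeing M or more pathway compounds among the K altered compounds)
Figure: pathway analysis output from the MetaboAnalyst software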
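
The hypergeometric calculation on this slide can be sketched in plain Python, using the slide's symbols N, L, K and M. This is a minimal illustration, not MetaboAnalyst's implementation. It computes the upper tail P(X >= M), which corresponds to R's phyper(M-1, L, N-L, K, lower.tail=FALSE). The function name and the example numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(N, L, K, M):
    """Upper-tail hypergeometric p-value, P(X >= M).

    N: compounds in the background database
    L: compounds annotated to the pathway
    K: altered compounds in the input list
    M: altered compounds that belong to the pathway
    """
    upper = min(L, K)  # the overlap cannot exceed either set
    total = comb(N, K)  # number of ways to draw K compounds from N
    tail = sum(comb(L, x) * comb(N - L, K - x) for x in range(M, upper + 1))
    return tail / total


# Illustrative numbers only (not taken from the example dataset).
p = hypergeom_enrichment_p(N=1600, L=20, K=50, M=5)
print(f"enrichment p-value: {p:.3g}")
```

Note that the result depends directly on N, so the same input list gives a different p-value whenever the background database grows.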

What can go wrong in a statistical test?

How about p-value correction?
1. A p-value cutoff of 0.05 for one statistical test means that, when the null hypothesis is true, there is a 5% chance of rejecting it.
2. If we do 100 independent tests in which the null hypothesis is true, about 5 null hypotheses are expected to be incorrectly rejected. Those 5 are false positives (type 1 errors).
3. Number of pathway maps = number of hypergeometric tests.
4. A p-value correction using the false discovery rate (FDR) method limits the expected proportion of false positives among the pathway maps reported as significant.
5. The more pathways we test, the more type 1 errors we expect.
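
The FDR correction in point 4 is commonly done with the Benjamini-Hochberg procedure. The slides do not name a specific FDR method, so treating it as Benjamini-Hochberg is an assumption. A minimal sketch in plain Python, with a hypothetical function name:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values for four hypothetical pathway maps.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by (number of tests / rank), so the penalty grows with the number of pathway maps tested.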

Pathway set analysis – limitation 4
The hypergeometric or Fisher's exact test is inappropriate for metabolomics

What are alternative set definitions and statistics?

Alternative A : MetaMapp clusters
http://metamapp.fiehnlab.ucdavis.edu/
Limitations
Cluster labels
Similarity cutoff

Alternative B : Chemical similarity clusters
Distance matrix is based on the Tanimoto coefficient
Limitation
Cluster labels
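
The Tanimoto coefficient behind the distance matrix can be sketched as follows. Fingerprints are represented here as sets of "on" bit positions, which is an assumption made for illustration; the bit values and compound names in the example are made up, and 1 - similarity is used as the distance.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def distance_matrix(fingerprints):
    """Pairwise Tanimoto distances (1 - similarity) as a nested list."""
    return [[1.0 - tanimoto(x, y) for y in fingerprints] for x in fingerprints]


# Hypothetical fingerprints for three compounds.
fps = {
    "compound_1": {1, 2, 3, 4},
    "compound_2": {3, 4, 5, 6},
    "compound_3": {1, 2, 3, 4, 7},
}
for row in distance_matrix(list(fps.values())):
    print([round(d, 2) for d in row])
```

A hierarchical clustering of such a matrix gives the similarity clusters; the clusters still need labels, which is the limitation noted on the slide.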

Alternative C : Chemical Ontologies
Medical Subject Headings ontology
Lipidmaps ontology
110K compounds with MeSH annotations
MeSH is linked to PubMed
→ automated text mining on identified ontology groups.
Limitation
Not every detected metabolite is covered
50K compounds
Figure: Venn diagram of database coverage for the identified compounds (labels: All, MeSH, NCBI BioSystems, KEGG; values shown: 385, 173, 187, 135)

The KS test is a better statistical method for metabolomics enrichment

Parameter             Fisher exact   Hypergeometric   Binomial   K-S
Background database   Yes            Yes              No         No
p-value cutoff        Yes            Yes              Yes        No

K-S: the Kolmogorov–Smirnov test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test).
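
A one-sample K-S test on the p-values of one metabolite set can be sketched as below. The sketch assumes the reference distribution is uniform on [0, 1] (what p-values follow when nothing has changed) and uses a one-sided test with the large-sample approximation p ≈ exp(-2 n D²). Both choices are assumptions for illustration and not necessarily what ChemRICH does internally. It needs neither a background database nor a p-value cutoff.

```python
from math import exp


def ks_uniform_one_sided(pvalues):
    """One-sample, one-sided K-S test of p-values against Uniform(0, 1).

    Returns (D_plus, approximate p-value). D_plus is the largest amount by
    which the empirical CDF exceeds the uniform CDF, so it is large when the
    set's p-values pile up near zero.
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in [0, 1]")
    xs = sorted(pvalues)
    n = len(xs)
    d_plus = max(max((i + 1) / n - x for i, x in enumerate(xs)), 0.0)
    # Asymptotic one-sided approximation; rough for very small sets.
    return d_plus, exp(-2.0 * n * d_plus ** 2)


# Illustrative sets: one with mostly small p-values, one evenly spread.
print(ks_uniform_one_sided([0.001, 0.002, 0.01, 0.02, 0.03]))
print(ks_uniform_one_sided([0.1, 0.3, 0.5, 0.7, 0.9]))
```

Every compound in the set contributes its actual p-value, so no compound is discarded for missing an arbitrary significance threshold.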

ChemRICH combines MeSH, chemical similarity and the KS test

Generation of the ChemRICH database
• Sources: MeSH and PubChem
• Compound table: Name, CID, SMILES, MeSH IDs
• PubChem fingerprint added with the rCDK package: Name, CID, SMILES, MeSH IDs, Fingerprint
• ChemRICH database: 91,444 unique structures and 2768 MeSH classes

ChemRICH analysis
• Input, metabolomics dataset statistics: Name, CID, SMILES, p-value, effect size
• Lookup against the ChemRICH database by Tanimoto similarity (> 0.9) gives MeSH IDs and classes: Name, Class
• Non-overlapping classes
• KS test: Class, P-value
• Outputs: enriched sets and the ChemRICH impact plot
• Remaining diagram labels: NC, HC, STR, SMILES, Class, New compounds

Barupal, Dinesh Kumar, and Oliver Fiehn. "Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets." Scientific Reports 7.1 (2017): 14567.

Using the ontology/chemistry clusters to compute p-values for significant metabolic differences

Figure: ChemRICH impact plot. Axes: cluster order on Tanimoto similarity tree, -log(pvalue). Labeled clusters: disaccharides, hexose-phosphates, pentoses, hexoses, sugar alcohols, sugar acids, tricarboxylic acids, butyrates, hydroxybutyrates, amino acids (sulfur), amino acids (branched-chain), amino acids (aromatic), cholesterol esters, pyridines, indoles, sphingomyelins, Unsaturated_lysophosphatidylcholines, phosphatidylcholines, phosphatidylinositols, plasmalogens, phosphatidylethanolamines, DiHODE, oxo-ETE, HETrE, HETE, Unsaturated_triglycerides, Saturated FA, Saturated_triglycerides, Saturated_lysophosphatidylcholines.

Cluster name                  Cluster size  p-value   Adjusted p-value  Total changed  Increased  Decreased
UnSaturated PC                38            5.18E-10  2.54E-08          25             2          23
UnSaturated TG                35            7.38E-09  1.81E-07          22             21         1
UnSaturated SM                17            8.30E-06  0.000135          12             0          12
UnSaturated LPC               9             1.10E-05  0.000135          9              0          9
Butyrates                     7             9.14E-05  0.000896          7              6          1
Disaccharides                 8             0.00021   0.001712          7              6          1
PUFA TG                       12            0.000266  0.001862          8              8          0
Hexoses                       7             0.000597  0.003656          6              6          0
Sugar Acids                   10            0.001707  0.009296          6              6          0
PUFA PI                       4             0.002339  0.010419          4              0          4
Saturated TG                  4             0.002339  0.010419          4              4          0
OH-FA_20                      17            0.003475  0.014191          6              1          5
OH-FA_18                      10            0.004912  0.018513          5              0          5
PUFA PC                       11            0.005484  0.019193          5              0          5
Amino Acids, Branched-Chain   3             0.007153  0.019472          3              3          0
Pentoses                      3             0.007153  0.019472          3              3          0
PUFA LPC                      3             0.007153  0.019472          3              0          3
PUFA PE                       6             0.007153  0.019472          4              0          4
Sugar Alcohols                12            0.01423   0.036698          4              3          1
Amino Acids, Sulfur           3             0.041632  0.081599          2              0          2
Hexosephosphates              3             0.041632  0.081599          2              2          0
Indoles                       3             0.041632  0.081599          2              2          0
O=FA_20                       3             0.041632  0.081599          2              0          2
Pyridines                     3             0.041632  0.081599          2              2          0
Tricarboxylic Acids           3             0.041632  0.081599          2              2          0
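
The count columns of the cluster table (cluster size, total changed, increased, decreased) can be derived from compound-level statistics roughly as sketched here. The slides do not state the rule for "changed", so the cutoff of p < 0.05 and the use of fold change above or below 1 for direction are assumptions. The compound names and values are made up.

```python
def summarize_clusters(rows, alpha=0.05):
    """Count size, changed, increased and decreased compounds per cluster.

    rows: iterable of (name, cluster, pvalue, fold_change) tuples.
    alpha: assumed significance cutoff for calling a compound "changed".
    """
    summary = {}
    for _name, cluster, pvalue, fold_change in rows:
        counts = summary.setdefault(
            cluster, {"size": 0, "changed": 0, "increased": 0, "decreased": 0}
        )
        counts["size"] += 1
        if pvalue < alpha:
            counts["changed"] += 1
            if fold_change > 1:
                counts["increased"] += 1
            elif fold_change < 1:
                counts["decreased"] += 1
    return summary


# Hypothetical compound-level results.
rows = [
    ("lipid_a", "UnSaturated PC", 0.001, 0.6),
    ("lipid_b", "UnSaturated PC", 0.020, 0.7),
    ("lipid_c", "UnSaturated PC", 0.400, 1.1),
    ("sugar_a", "Hexoses", 0.003, 1.8),
]
for cluster, counts in summarize_clusters(rows).items():
    print(cluster, counts)
```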

ChemRICH app
Interactive cluster plot
compound level data table
cluster level data table
chemical similarity tree
Results download as xlsx, pptx, png, pdf
ChemRICH is available online
www.ChemRICH.us

ChemRICH : Data preparation
Example dataset available in the
TeachingMaterialDataSetsBioinformatics_Training_DataChemRICH folder
Null_ChemRIC_input.xlsx

ChemRICH input file errors
• Duplicate PubChem CIDs
• Duplicate names
• Missing SMILES codes
• Missing p-value or fold-change
• Header mismatch
• More than 1000 compounds
Always use the ChemRICH input template available on the ChemRICH.us website.
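
The errors listed above can be caught before upload with a small pre-check, sketched here in plain Python. The header names in EXPECTED_HEADERS are hypothetical placeholders, since the real names are defined by the ChemRICH input template, and the function is an illustration rather than part of ChemRICH.

```python
# Hypothetical header names; replace them with those in the official template.
EXPECTED_HEADERS = ["name", "pubchem_cid", "smiles", "pvalue", "foldchange"]
MAX_COMPOUNDS = 1000  # size limit taken from the error list above


def is_missing(value):
    """True for None or for a blank/whitespace-only value."""
    return value is None or str(value).strip() == ""


def check_chemrich_input(headers, rows):
    """Return a list of human-readable problems found in the input table."""
    if list(headers) != EXPECTED_HEADERS:
        return ["headers mismatch"]
    errors = []
    if len(rows) > MAX_COMPOUNDS:
        errors.append(f"more than {MAX_COMPOUNDS} compounds")
    seen_cids, seen_names = set(), set()
    for line_no, row in enumerate(rows, start=1):
        rec = dict(zip(headers, row))
        if rec["pubchem_cid"] in seen_cids:
            errors.append(f"row {line_no}: duplicate PubChem CID")
        seen_cids.add(rec["pubchem_cid"])
        if rec["name"] in seen_names:
            errors.append(f"row {line_no}: duplicate name")
        seen_names.add(rec["name"])
        if is_missing(rec["smiles"]):
            errors.append(f"row {line_no}: missing SMILES code")
        if is_missing(rec["pvalue"]) or is_missing(rec["foldchange"]):
            errors.append(f"row {line_no}: missing p-value or fold-change")
    return errors


# Made-up rows: the second row repeats the first and lacks several fields.
rows = [
    ["compound_a", 101, "CCO", 0.01, 1.5],
    ["compound_a", 101, "", 0.20, None],
]
for problem in check_chemrich_input(EXPECTED_HEADERS, rows):
    print(problem)
```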

Perform ChemRICH analysis
www.ChemRICH.us
Paste your data in this box

Explanation of results
Editable PowerPoint slide
Download these three files

Download/interact with results

User provided classes
http://chemrich.fiehnlab.ucdavis.edu/ocpu/library/ChemRICHTest3/www/class.html

Major problems in pathway-based analysis
• Not all metabolites from a pathway map are present in a metabolomics dataset
• Not all detected metabolites have pathway annotations
• Pathway boundaries are arbitrary and overlapping
• Pathway maps vary across biochemical databases
• Background database size varies over time for a hypergeometric test
Better: a pathway-independent method that
• uses all identified metabolites
• uses non-overlapping set definitions
• does not depend on any background database
ChemRICH: Chemical Similarity Enrichment Analysis
Conclusions
Main advantages of the ChemRICH method
• Mapping of up to 95% of the known compounds in a metabolomics dataset.
• Non-overlapping clusters.
• Background-database-independent statistics.
• Can map compounds that are not yet in any database, such as in-silico compounds.
• Utilizes existing knowledge from chemical ontologies to enable straightforward literature mining.
• Allows identification of new chemical clusters that are not yet covered in ontologies.
• The cluster impact plot visualizes the chemical diversity.
• Includes well-known chemical classes and leaves room for clustering of other chemical classes.
Publication
Barupal, Dinesh Kumar, and Oliver Fiehn. "Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets." Scientific Reports 7.1 (2017): 14567.
