IBSB tutorial

Computational methods for
metabolite identification from
tandem mass spectrometry
Dai Hai Nguyen
Kyoto University, Japan
20/07/2018 D. H. Nguyen, Kyoto University 1

Background of metabolite identification
Metabolites
 Intermediate or end products of metabolism
 Small molecules with important functions: energy transport, building
blocks of cells, etc.
 Many applications, e.g. drug discovery
 Identifying or profiling them is challenging

Tandem mass spectrometry (MS/MS)
 Breaks a compound into many fragments
 Each fragment corresponds to a peak in the spectrum
 Peaks interact: some peaks tend to co-occur

 Task: given a query spectrum, find the most similar molecules in a database.
 Approaches: MS library, in silico fragmentation, machine learning

I. Mass spectra library
 Compare the query spectrum with the spectra in a library
 The best-matching candidates are returned
 Drawback: the size of the library is limited
 E.g., Human Metabolome Database: ~2000 compounds
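
A minimal sketch of library matching, assuming spectra are lists of (m/z, intensity) peaks and are compared by cosine similarity on binned vectors. The bin width, the function names (`bin_spectrum`, `cosine`, `search_library`) and the toy library are illustrative assumptions, not part of the tutorial.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (sparse dict)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return binned

def cosine(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_library(query, library, top_k=3):
    """Rank library entries by similarity to the query spectrum."""
    q = bin_spectrum(query)
    scored = [(cosine(q, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    return sorted(scored, reverse=True)[:top_k]

# Toy library with made-up peaks for hypothetical compounds.
library = {
    "compound_A": [(85.0, 40.0), (127.1, 100.0), (163.1, 60.0)],
    "compound_B": [(60.0, 100.0), (88.0, 30.0)],
}
query = [(85.0, 35.0), (127.1, 90.0), (163.1, 70.0)]
ranking = search_library(query, library)  # list of (score, name), best first
```

A query whose compound is absent from the library can only return a wrong or low-scoring match, which is the limitation noted above.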

II. In silico fragmentation
 Mitigates the limited coverage of spectral libraries by taking advantage of structure databases.
 Methods can be divided into three groups:
1) rule-based
2) combinatorial
3) machine learning-based

II. In silico fragmentation (1)
1) Rule-based fragmentation, e.g., Mass Frontier
 Uses a set of fragmentation rules to predict spectra from compound structures.
 Rules are extracted from the literature.
 Not preferred in practice because:
 small changes in molecular structure can change the fragmentation process
 the number of rules is insufficient to identify fragments with high accuracy
 peak intensities are ignored

2) Combinatorial fragmentation, e.g. FiD
 From the molecular structure, generate a graph of all connected substructures.
 Find the most likely fragmentation tree, i.e. the one that best matches the spectrum.
 Drawbacks:
 computationally expensive, so applicable only to small molecules
 intensities of fragments are ignored
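
To illustrate why the combinatorial approach becomes expensive, here is a brute-force sketch that enumerates all connected substructures of a tiny molecular graph. It is not FiD's algorithm; the adjacency-list representation and the function names are assumptions made for this example.

```python
from itertools import combinations

def is_connected(nodes, adjacency):
    """Check whether the atoms in `nodes` form one connected piece."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        atom = stack.pop()
        for neighbour in adjacency[atom]:
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes

def connected_substructures(adjacency):
    """Enumerate every connected subset of atoms by brute force."""
    atoms = sorted(adjacency)
    result = []
    for size in range(1, len(atoms) + 1):
        for subset in combinations(atoms, size):
            if is_connected(subset, adjacency):
                result.append(subset)
    return result

# Heavy-atom graph of ethanol (C-C-O); atom indices are arbitrary labels.
ethanol = {0: [1], 1: [0, 2], 2: [1]}
fragments = connected_substructures(ethanol)
```

The loop visits all 2^n - 1 non-empty atom subsets, so the cost grows exponentially with the number of atoms, consistent with the restriction to small molecules.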

3) Machine learning-based fragmentation
 Uses ML to learn the fragmentation process from data.
 Peak intensities are considered and learned
 Few methods exist so far

Competitive Fragmentation Modeling (CFM)
Models fragmentation as a Markov process of state transitions between fragments, with two components:
1. Transition model
2. Observation model
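
A toy sketch of the two components, under strong simplifying assumptions: the fragment states, masses and transition probabilities are made up, and the Gaussian observation model is an assumption for illustration. CFM learns its transition probabilities from data, which is not shown here.

```python
import math

# Hypothetical fragment states: (name, mass in Da). State 0 is the precursor.
states = [("precursor", 180.06), ("fragment_1", 162.05), ("fragment_2", 85.03)]

# Transition model: row i holds P(next state = j | current state = i).
# The numbers are invented for this example.
transition = [
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]

def step(distribution, transition):
    """Advance the state distribution by one fragmentation step."""
    n = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n))
            for j in range(n)]

def state_distribution(transition, depth):
    """Distribution over fragments after `depth` steps from the precursor."""
    dist = [1.0] + [0.0] * (len(transition) - 1)
    for _ in range(depth):
        dist = step(dist, transition)
    return dist

def peak_likelihood(observed_mz, fragment_mass, sigma=0.01):
    """Observation model: Gaussian density of a peak around a fragment mass."""
    z = (observed_mz - fragment_mass) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

final = state_distribution(transition, depth=2)
```

The final state distribution plays the role of predicted relative peak intensities, while `peak_likelihood` scores how well an observed peak matches a fragment mass.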


III. Machine learning approach
a) Supervised ML for substructure prediction
b) Unsupervised ML for substructure annotation

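III. Machine learning approach
Supervised ML for substructure prediction
Step 1: fingerprint prediction (predict a molecular fingerprint from the spectrum)
Step 2: candidate retrieval (rank database molecules by how well their fingerprints match the prediction)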
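
A minimal sketch of Step 2, assuming the predicted fingerprint and the candidate fingerprints are binary vectors of equal length and that candidates are ranked by Tanimoto similarity. Tanimoto is a common choice swapped in for illustration; the published methods use their own scoring functions. All fingerprints and candidate names below are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def retrieve(predicted, candidates):
    """Return candidate names sorted from most to least similar."""
    return sorted(candidates,
                  key=lambda name: tanimoto(predicted, candidates[name]),
                  reverse=True)

# Fingerprint predicted from the query spectrum in Step 1 (hypothetical).
predicted = [1, 0, 1, 1, 0, 0]

# Fingerprints of database molecules (hypothetical).
candidates = {
    "candidate_A": [1, 0, 1, 0, 0, 0],
    "candidate_B": [0, 1, 0, 0, 1, 1],
    "candidate_C": [1, 0, 1, 1, 0, 1],
}
ranking = retrieve(predicted, candidates)
```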

Machine learning Approach
Supervised ML for substructure prediction
FingerID (Bioinformatics, 2012)
Kernel method
• Define probability product kernel (PPK) for spectra.
• Then, use SVM for classification.
 Drawback
 Peak interactions are ignored.
 Limited accuracy
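PPK between spectra $X$ and $Y$ (with $n_X$ and $n_Y$ peaks):
$p_X = \frac{1}{n_X} \sum_{k=1}^{n_X} p_{X(k)}, \qquad p_Y = \frac{1}{n_Y} \sum_{k=1}^{n_Y} p_{Y(k)}$
$K(X, Y) = \frac{1}{n_X n_Y} \sum_{i,j} p_{X(i)}\, p_{Y(j)}$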
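
A simplified sketch of this kernel, assuming each peak is a one-dimensional Gaussian over m/z with a shared standard deviation `sigma`, and reading the product of two peak densities as their integrated overlap. Peak intensities are left out to keep the example short, and the function names and the value of `sigma` are assumptions.

```python
import math

def gaussian_overlap(mu1, mu2, sigma):
    """Integral of the product of two 1-D Gaussians with equal std `sigma`."""
    var = 2.0 * sigma ** 2  # variance of the difference of the two means
    return math.exp(-(mu1 - mu2) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def ppk(mz_x, mz_y, sigma=0.01):
    """Average pairwise overlap between the peaks of two spectra."""
    total = sum(gaussian_overlap(a, b, sigma) for a in mz_x for b in mz_y)
    return total / (len(mz_x) * len(mz_y))

# m/z values of two toy spectra that share one peak (made-up numbers).
spectrum_x = [50.0, 100.0]
spectrum_y = [50.0, 120.0]

k_xy = ppk(spectrum_x, spectrum_y)
k_xx = ppk(spectrum_x, spectrum_x)
```

Each pair of peaks contributes independently to the sum, so co-occurrence of peaks within a spectrum plays no role in the kernel value.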

CSI:FingerID (Bioinformatics, 2014)
 Improved version of FingerID
 Spectra are compared with the PPK, as in FingerID
 Additional kernels are defined on fragmentation trees and combined with the PPK via multiple kernel learning (MKL).
 An SVM is then used for classification.

Fragmentation trees
 Model the fragmentation of a molecule in MS/MS
 Nodes ~ peaks ~ molecular formulas of fragments.
 Edges ~ losses ~ uncharged fragments that are not observed.
 Trees can be computed from spectra and provide structural information about them.
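
A minimal data-structure sketch of a fragmentation tree, assuming molecular formulas are stored as element-to-count dictionaries. The class name `FragmentNode`, the helper `neutral_loss` and the example formulas are illustrative assumptions. The loss on an edge is the formula difference between parent and child.

```python
from dataclasses import dataclass, field

@dataclass
class FragmentNode:
    """A tree node: the molecular formula explaining one peak."""
    formula: dict                      # element symbol -> atom count
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

def neutral_loss(parent, child):
    """Formula of the loss on the edge parent -> child."""
    if any(element not in parent.formula for element in child.formula):
        raise ValueError("child contains an element missing from the parent")
    loss = {}
    for element, count in parent.formula.items():
        diff = count - child.formula.get(element, 0)
        if diff < 0:
            raise ValueError("child has more atoms than its parent")
        if diff:
            loss[element] = diff
    return loss

# Hypothetical tree: C6H12O6 -> C6H10O5 -> C5H8O4
root = FragmentNode({"C": 6, "H": 12, "O": 6})
child = root.add_child(FragmentNode({"C": 6, "H": 10, "O": 5}))
grandchild = child.add_child(FragmentNode({"C": 5, "H": 8, "O": 4}))

first_loss = neutral_loss(root, child)         # loss of H2O
second_loss = neutral_loss(child, grandchild)  # loss of CH2O
```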

Pros & Cons
 Improved accuracy thanks to the additional structural information provided by trees
 Computationally expensive because trees must be computed from spectra
 Lack of interpretability

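SIMPLE (Bioinformatics, 2018)
• Idea: add an interaction term to the model (two-way interaction model)
• Prediction model:
$f(x) = b + w^T x + x^T W x, \qquad \hat{y}(x) = \mathrm{sgn}(f(x))$
where $w^T x$ models individual peaks and $x^T W x$ models peak interactions.
• Objective function:
$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_*$
The first term is the hinge loss, $\alpha \|w\|_1$ promotes sparsity, and $\beta \|W\|_*$ promotes a low-rank $W$.
• The objective is convex, so a globally optimal solution can be found.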
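
A small sketch of the prediction model and the hinge loss for one example, assuming `x` is a binned spectrum vector and that `b`, `w` and `W` have already been learned. The parameter values are made up, and the regularizers and the optimization itself are not shown.

```python
def decision_value(x, b, w, W):
    """f(x) = b + w^T x + x^T W x."""
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interaction = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return b + linear + interaction

def predict(x, b, w, W):
    """Sign of f(x); a value of exactly 0 is mapped to +1 by convention here."""
    return 1 if decision_value(x, b, w, W) >= 0 else -1

def hinge_loss(y, f):
    """[1 - y f]_+ for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * f)

# Made-up parameters: peak 0 has a small main effect,
# and peaks 0 and 1 interact (symmetric W).
b = -1.0
w = [0.5, 0.0, 0.0]
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

both_peaks = [1.0, 1.0, 0.0]   # peaks 0 and 1 co-occur
single_peak = [1.0, 0.0, 0.0]  # only peak 0 is present
```

With these values the property is predicted only when the two peaks co-occur, which a purely linear model with the same `w` could not express.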

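SIMPLE (Bioinformatics, 2018)
 Idea: use background knowledge (interactions from fragmentation trees) to regularize $W$.
 Laplacian regularization:
$x^T W x = \sum_{i,j} w_{ij} x_i x_j = \sum_{i,j} (v_i^T v_j) x_i x_j$
since $W$ can be decomposed as $V^T V$ (low-rank decomposition), where $v_i$ is the $i$-th column of $V$.
 Regularizer:
$R(V) = \sum_{i,j} A_{ij} \|v_i - v_j\|^2 = 2\,\mathrm{trace}(WL)$,
where $A$ is the adjacency matrix of known peak interactions, $L = D - A$ is its Laplacian matrix ($D$ is the diagonal degree matrix), and the constant factor can be absorbed into $\gamma$.
 New objective function:
$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_* + \gamma\,\mathrm{trace}(WL)$
 Still convex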
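
A numerical check of the Laplacian identity on a toy example, assuming the sum runs over all ordered pairs (i, j), A is symmetric, and L = D - A. Under these conventions the pairwise penalty equals 2 trace(WL); the constant can be absorbed into the regularization weight. The matrices below are made up.

```python
def laplacian(A):
    """L = D - A, with D the diagonal matrix of row sums of A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def gram(V):
    """W = V^T V, with V given as a list of column vectors v_i."""
    return [[sum(a * b for a, b in zip(vi, vj)) for vj in V] for vi in V]

def pairwise_penalty(A, V):
    """Sum over ordered pairs of A_ij * ||v_i - v_j||^2."""
    n = len(A)
    return sum(A[i][j] * sum((a - b) ** 2 for a, b in zip(V[i], V[j]))
               for i in range(n) for j in range(n))

def trace_product(W, L):
    """trace(W L) without forming the product matrix."""
    n = len(W)
    return sum(W[i][j] * L[j][i] for i in range(n) for j in range(n))

# Toy interaction graph over three peaks: 0-1 and 1-2 are linked.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
# Hypothetical embeddings v_0, v_1, v_2 (columns of V).
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

penalty = pairwise_penalty(A, V)
trace_form = 2.0 * trace_product(gram(V), laplacian(A))
```

Because the trace form is linear in W, adding it to the objective keeps the problem convex.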

Input Output Kernel Regression (IOKR) (Bioinformatics, 2017)
Idea: use IOKR to learn the mapping between spectra and molecular structures.
Two steps:
1. Estimate the output feature map by solving a regression problem
2. Solve the pre-image problem to map the predicted output features back to a molecular structure

Unsupervised ML for substructure annotation
 Metabolites/molecules may share common substructures, yielding similar fragments/peaks in their spectra.
 Such substructures relate to biochemical processes
 Metabolites can therefore be grouped by shared substructures
 This can improve the accuracy of metabolite identification

III. Machine learning approach
Unsupervised ML for substructure annotation
MS2LDA (Bioinformatics 2017)
 Automatically extracts relevant substructures of metabolites based on the co-occurrence of fragments and losses.
 Motivated by topic modeling for text, e.g. Latent Dirichlet Allocation (LDA)
 LDA for MS data (MS2LDA)
 Peaks ~ words
 Sets of peaks (substructures) ~ topics
 LDA decomposes a text into topics, while MS2LDA decomposes a molecule into substructures.
 Drawback: the extracted substructures must be annotated using expert knowledge, a complex and time-consuming process
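
A sketch of the text analogy, assuming each spectrum is given as a precursor m/z plus a list of (m/z, intensity) peaks. Each spectrum becomes a bag of fragment and loss "words" that a topic model could consume. The word format, rounding precision and toy spectra are assumptions, intensities are ignored for brevity, and the topic-model fitting itself is not shown.

```python
from collections import Counter

def spectrum_to_document(precursor_mz, peaks, decimals=2):
    """Turn one spectrum into a bag of fragment and loss 'words'."""
    words = Counter()
    for mz, _intensity in peaks:
        words[f"fragment_{mz:.{decimals}f}"] += 1
        loss = precursor_mz - mz  # mass difference to the precursor
        if loss > 0:
            words[f"loss_{loss:.{decimals}f}"] += 1
    return words

# Two toy spectra: name -> (precursor m/z, peaks). Values are made up.
spectra = {
    "spectrum_1": (180.06, [(162.05, 100.0), (85.03, 40.0)]),
    "spectrum_2": (150.05, [(132.04, 80.0), (85.03, 55.0)]),
}

corpus = {name: spectrum_to_document(mz, peaks)
          for name, (mz, peaks) in spectra.items()}

# Words shared by both spectra hint at a common substructure.
shared = set(corpus["spectrum_1"]) & set(corpus["spectrum_2"])
```

Here both spectra share a fragment and a loss word even though their precursors differ, which is the kind of co-occurrence pattern a topic model groups into one topic.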

Unsupervised ML for substructure annotation
Automated recommendation of substructures from MS/MS (Aida Mrzic et al., bioRxiv)
 Automatically extracts relevant substructures of molecules based on the co-occurrence of fragments and losses
 Applies frequent itemset mining to extract association rules.
 Given a query spectrum, recommends the substructures present in it by applying the extracted rules.
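
A brute-force sketch of rule extraction and recommendation, assuming each training spectrum is a set of fragment/loss tokens paired with a set of known substructure labels. Real frequent itemset miners (such as Apriori or FP-growth) are far more efficient; the thresholds, tokens and the label below are made up for illustration.

```python
from itertools import combinations

def mine_rules(records, min_support=0.5, min_confidence=0.8, max_size=2):
    """Find rules (itemset -> substructure) meeting support and confidence."""
    features = sorted(set().union(*(feats for feats, _ in records)))
    labels = sorted(set().union(*(subs for _, subs in records)))
    rules = []
    for size in range(1, max_size + 1):
        for itemset in combinations(features, size):
            items = set(itemset)
            covered = [subs for feats, subs in records if items <= feats]
            if not covered or len(covered) / len(records) < min_support:
                continue  # itemset is not frequent enough
            for label in labels:
                confidence = sum(label in subs for subs in covered) / len(covered)
                if confidence >= min_confidence:
                    rules.append((itemset, label, confidence))
    return rules

def recommend(query_features, rules):
    """Substructures whose rule antecedent is contained in the query."""
    return sorted({label for itemset, label, _ in rules
                   if set(itemset) <= query_features})

# Toy training data: (tokens in spectrum, annotated substructures).
records = [
    ({"loss_18.01", "fragment_85.03"}, {"hydroxyl"}),
    ({"loss_18.01", "fragment_60.02"}, {"hydroxyl"}),
    ({"loss_18.01", "fragment_85.03", "fragment_60.02"}, {"hydroxyl"}),
    ({"fragment_60.02"}, set()),
]
rules = mine_rules(records)
suggested = recommend({"loss_18.01", "fragment_99.00"}, rules)
```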

Conclusion
• Metabolite identification is an essential part of metabolomics for expanding our knowledge of biological systems.
• Many techniques and software tools have been proposed for this task; they can be categorized into three groups: MS libraries, in silico fragmentation, and machine learning.
• ML methods are the key to recent progress in metabolite identification
