Computational methods for
metabolite identification from
tandem mass spectrometry
Dai Hai Nguyen
Kyoto University, Japan
20/07/2018 D. H. Nguyen, Kyoto University 1
Background of metabolite identification
Metabolites
• Intermediate or end products of metabolism
• Small molecules with important functions: energy transport, building blocks of cells, etc.
• Many applications, e.g. drug discovery
• Identifying or profiling them is challenging
Background of metabolite identification
Tandem mass spectrometry (MS/MS)
• Breaks a compound into many fragments
• Each fragment corresponds to a peak in the spectrum
• Peaks interact: certain peaks tend to co-occur
Background of metabolite identification
• Task: given a query spectrum, find similar molecules in a database.
• Approaches: (I) MS library search, (II) in silico fragmentation, (III) machine learning
I. Mass spectra library
• Simply compare the query spectrum with the spectra in the library
• The best-matching candidates are returned
• Drawback: the size of the library is limited
  - E.g., Human Metabolome Database: ~2000 compounds
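A minimal sketch of library matching, assuming spectra are lists of (m/z, intensity) pairs, peaks are binned to unit m/z width, and similarity is the cosine score. The compound names, peak values, and helper names are hypothetical and only illustrate the idea.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra (dict: bin -> intensity)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_library(query, library, top_k=3):
    """Rank library entries by similarity to the query spectrum."""
    q = bin_spectrum(query)
    scored = [(cosine_similarity(q, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    return sorted(scored, reverse=True)[:top_k]

# Toy library: names and peaks are made up for illustration.
library = {
    "compound_A": [(50.1, 10.0), (77.2, 100.0), (105.3, 40.0)],
    "compound_B": [(41.0, 30.0), (69.4, 100.0), (91.1, 20.0)],
}
query = [(50.2, 12.0), (77.1, 95.0), (105.4, 35.0)]
library_ranking = search_library(query, library)
print(library_ranking)
```

A query whose compound is absent from the library can only return its nearest neighbours, which is the coverage drawback noted above.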
II. In silico fragmentation
• Mitigates the limited coverage of spectral libraries by taking advantage of structural databases.
• Methods fall into three groups:
1) rule-based
2) combinatorial
3) machine learning-based
II. In silico fragmentation (1)
1) Rule-based fragmentation, e.g., Mass Frontier
• Uses a set of fragmentation rules to predict spectra from compound structures.
• Rules are extracted from the literature.
• Not preferred in practice because:
  - small changes in molecular structure can change the fragmentation process
  - the rules are too few to identify fragments with high accuracy
  - peak intensities are ignored
II. In silico fragmentation (2)
2) Combinatorial fragmentation, e.g. FiD
• From the molecular structure, generate a graph of all connected substructures.
• Find the most likely fragmentation tree, i.e. the one that best matches the spectrum.
• Drawbacks:
  - computationally expensive, so applicable only to small molecules
  - fragment intensities are ignored
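To see why the combinatorial approach is expensive, the sketch below enumerates every connected substructure of a tiny molecular graph by brute force. It is not FiD's algorithm: the adjacency-list representation and function names are assumptions for illustration, and the subset enumeration grows exponentially with the number of atoms.

```python
from itertools import combinations

def is_connected(nodes, adjacency):
    """Check whether the induced subgraph on `nodes` is connected (DFS)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def connected_substructures(adjacency):
    """Enumerate all non-empty atom subsets that form a connected subgraph."""
    atoms = sorted(adjacency)
    found = []
    for size in range(1, len(atoms) + 1):
        for subset in combinations(atoms, size):
            if is_connected(subset, adjacency):
                found.append(subset)
    return found

# Hypothetical 4-atom chain: 0-1-2-3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
substructures = connected_substructures(chain)
print(len(substructures), substructures)
```

Each candidate substructure would then be matched against peak masses; the number of subsets tested doubles with every added atom.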
II. In silico fragmentation (3)
3) Machine learning-based fragmentation
• Uses ML to learn the fragmentation process from data.
• Peak intensities are considered and learned.
• Very few methods exist.
II. In silico fragmentation (3)
Competitive Fragmentation Modeling (CFM) models fragmentation as a Markov process of state transitions between fragments. It has two components:
1. Transition model
2. Observation model
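A toy illustration of the Markov view, assuming a fixed, hand-written transition matrix over three hypothetical fragment states. In CFM the transition probabilities are learned from data; here they are made-up numbers, and reading the final state distribution as relative peak intensities is only a crude stand-in for the observation model.

```python
def step(dist, transition):
    """One Markov step: new_dist[j] = sum_i dist[i] * transition[i][j]."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]

# Hypothetical fragment masses; state 0 is the intact precursor.
masses = [180.0, 162.0, 120.0]

# Row-stochastic transition matrix (each row sums to 1); values are invented.
transition = [
    [0.2, 0.5, 0.3],  # precursor stays intact or breaks into fragment 1 or 2
    [0.0, 0.6, 0.4],  # fragment 1 may break further into fragment 2
    [0.0, 0.0, 1.0],  # fragment 2 is terminal
]

dist = [1.0, 0.0, 0.0]  # start with all probability on the precursor
for _ in range(2):      # fragmentation depth of 2
    dist = step(dist, transition)

predicted_spectrum = dict(zip(masses, dist))
print(predicted_spectrum)
```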
Background of metabolite identification
• Task: given a query spectrum, find similar molecules in a database.
• Approaches: (I) MS library search, (II) in silico fragmentation, (III) machine learning
III. Machine learning Approach
a) Supervised ML for substructure prediction
b) Unsupervised ML for substructure annotation
III. Machine learning Approach
Supervised ML for substructure prediction
Step 1: Fingerprint prediction
Step 2: Candidate retrieval
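Step 2 can be sketched as follows, assuming that Step 1 has already produced a binary fingerprint and that candidates are ranked by Tanimoto similarity. The choice of Tanimoto scoring, the fingerprint length, and the candidate names are assumptions for illustration; the methods below use their own scoring schemes.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

def retrieve_candidates(predicted_fp, candidates):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, tanimoto(predicted_fp, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical output of Step 1: one bit per substructure.
predicted_fp = [1, 0, 1, 1, 0, 0]

# Hypothetical candidate structures from a molecular database.
candidates = {
    "cand_1": [1, 0, 1, 1, 0, 0],
    "cand_2": [1, 1, 0, 1, 0, 0],
    "cand_3": [0, 1, 0, 0, 1, 1],
}
fp_ranking = retrieve_candidates(predicted_fp, candidates)
print(fp_ranking)
```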
Machine learning Approach
Supervised ML for substructure prediction
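FingerID (Bioinformatics, 2012)
Kernel method
• Define a probability product kernel (PPK) for spectra.
• Then use an SVM for classification.
• Each spectrum is a mixture of its peak densities:
$$p_X = \frac{1}{n_X} \sum_{k=1}^{n_X} p_{X(k)}, \qquad p_Y = \frac{1}{n_Y} \sum_{k=1}^{n_Y} p_{Y(k)}$$
$$K(X, Y) = \frac{1}{n_X n_Y} \sum_{i,j} \int p_{X(i)}(z)\, p_{Y(j)}(z)\, dz$$
• Drawbacks:
  - Peak interactions are ignored.
  - Limited accuracy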
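A minimal sketch of the PPK, assuming each peak is a one-dimensional Gaussian over m/z with a shared width sigma (a simplification: intensities are left out here). Under that assumption the integral of the product of two Gaussians has the closed form used in `gaussian_overlap`. Function names and the value of sigma are hypothetical.

```python
import math

def gaussian_overlap(mu_a, mu_b, sigma):
    """Integral of N(z; mu_a, sigma^2) * N(z; mu_b, sigma^2) over z.

    Equals a Gaussian in (mu_a - mu_b) with variance 2 * sigma^2.
    """
    d = mu_a - mu_b
    return math.exp(-d * d / (4 * sigma ** 2)) / math.sqrt(4 * math.pi * sigma ** 2)

def ppk(masses_x, masses_y, sigma=0.1):
    """Probability product kernel between two spectra given as lists of m/z values."""
    total = sum(gaussian_overlap(a, b, sigma) for a in masses_x for b in masses_y)
    return total / (len(masses_x) * len(masses_y))

spectrum_x = [77.0, 105.0, 122.0]
spectrum_y = [77.05, 105.0, 150.0]   # two peaks close to X, one unrelated
spectrum_z = [40.0, 60.0, 200.0]     # no peaks near X

print(ppk(spectrum_x, spectrum_x), ppk(spectrum_x, spectrum_y), ppk(spectrum_x, spectrum_z))
```

Because the kernel sums over independent peak pairs, it cannot express that two peaks matter only when they occur together, which is the first drawback listed above.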
Machine learning Approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
• Improved version of FingerID
• Spectra are compared with the PPK.
• Additional kernels are defined on fragmentation trees and combined with the PPK via multiple kernel learning (MKL).
• Then an SVM is used for classification.
Machine learning Approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
Fragmentation trees
• Model the fragmentation of a molecule in MS/MS
• Nodes ~ peaks ~ molecular formulas of fragments
• Edges ~ losses ~ unobserved, uncharged fragments
• Trees can be predicted from spectra and provide structural information about them.
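The node and edge roles can be illustrated with a tiny hand-built tree. This is only a data-structure sketch: the formulas, the approximate monoisotopic masses, and the tree shape are chosen for illustration, and real fragmentation trees are computed from the spectrum rather than written by hand.

```python
# Nodes: molecular formulas of fragments with approximate monoisotopic masses.
node_mass = {
    "C6H12O6": 180.063,
    "C6H10O5": 162.053,
    "C5H8O4": 132.042,
}

# Edges: (parent, child) pairs; each edge stands for a loss.
edges = [
    ("C6H12O6", "C6H10O5"),
    ("C6H10O5", "C5H8O4"),
]

# The mass of a loss is the difference between parent and child masses.
loss_mass = {(p, c): round(node_mass[p] - node_mass[c], 3) for p, c in edges}
print(loss_mass)
```

Here the first edge corresponds to a loss of about 18.01 (consistent with H2O) and the second to about 30.01 (consistent with CH2O); the losses themselves never appear as peaks.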
Machine learning Approach
Supervised ML for substructure prediction
CSI:FingerID (Bioinformatics, 2014)
Pros & Cons
• Pro: improved accuracy, due to the additional structural information provided by trees
• Con: computationally expensive, because trees must be computed from spectra
• Con: lack of interpretability
Machine learning Approach
Supervised ML for substructure prediction
SIMPLE (Bioinformatics, 2018)
• Idea: add an interaction term to the model (two-way interaction model)
• Prediction model:
$$f(x) = b + w^T x + x^T W x, \qquad y(x) = \mathrm{sgn}(f(x))$$
  - $w^T x$: peaks
  - $x^T W x$: interactions
• Objective function:
$$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_*$$
  - $\sum_i [1 - y_i f(x_i)]_+$: hinge loss
  - $\alpha \|w\|_1$: sparsity
  - $\beta \|W\|_*$: low rank
• The objective is convex, so a globally optimal solution can be found.
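The prediction model and objective can be evaluated directly with NumPy. The sketch below only computes them for fixed parameters and does not implement the SIMPLE optimizer; the toy data, the parameter values, and the function names are assumptions chosen so that the interaction term matters.

```python
import numpy as np

def decision(x, b, w, W):
    """f(x) = b + w^T x + x^T W x."""
    return b + w @ x + x @ W @ x

def objective(X, y, b, w, W, alpha, beta):
    """Hinge loss + alpha * ||w||_1 + beta * ||W||_* (nuclear norm)."""
    scores = np.array([decision(x, b, w, W) for x in X])
    hinge = np.maximum(0.0, 1.0 - y * scores).sum()
    return hinge + alpha * np.abs(w).sum() + beta * np.linalg.norm(W, ord="nuc")

# Two binary peak features; the label flips when both peaks co-occur.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

b = 0.5
w = np.array([1.0, 1.0])
W = np.array([[0.0, -2.0], [-2.0, 0.0]])  # negative weight on the peak pair

predictions = np.sign([decision(x, b, w, W) for x in X])
value = objective(X, y, b, w, W, alpha=0.1, beta=0.1)
print(predictions, value)
```

With `W` set to zero the same `w` and `b` would score the third example positively, so it is the pairwise term that captures the co-occurrence effect.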
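SIMPLE (Bioinformatics, 2018)
• Idea: use background knowledge (interactions from trees) to regularize W.
• Laplacian regularization:
$$x^T W x = \sum_{i,j} w_{ij} x_i x_j = \sum_{i,j} (v_i^T v_j)\, x_i x_j$$
where $W$ is decomposed as $V^T V$ (low-rank decomposition) and $v_i$ is the $i$-th column of $V$.
$$R(V) = \sum_{i,j} A_{ij} \|v_i - v_j\|^2 = \mathrm{trace}(W L)$$
where $L$ is the Laplacian matrix of the graph with edge weights $A_{ij}$ (if each pair is counted in both orders, the equality holds up to a constant factor of 2).
• New objective function:
$$\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_* + \gamma\, \mathrm{trace}(W L)$$
• Still convex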
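The link between the pairwise penalty and the trace form can be checked numerically. The sketch assumes a symmetric adjacency matrix `A`, the Laplacian `L = D - A`, and a factor 1/2 on the double sum so that each pair is counted once; the graph and the random `V` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2                      # d peaks, embedding dimension k
V = rng.normal(size=(k, d))      # column V[:, i] is the vector v_i for peak i
W = V.T @ V                      # low-rank interaction matrix, w_ij = v_i^T v_j

# Hypothetical interaction graph between peaks (a chain 0-1-2-3).
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
L = np.diag(A.sum(axis=1)) - A   # graph Laplacian, L = D - A

# Pairwise form: 1/2 * sum_{i,j} A_ij * ||v_i - v_j||^2
pairwise = 0.5 * sum(
    A[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
    for i in range(d) for j in range(d)
)
trace_form = np.trace(W @ L)
print(pairwise, trace_form)
```

Because `trace(W L)` is linear in `W`, adding it to the objective keeps the problem convex.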
Machine learning Approach
Supervised ML for substructure prediction
Input Output Kernel Regression (IOKR) (Bioinformatics, 2017)
Idea: use IOKR to learn the mapping between spectra and molecular structures.
Two steps:
1. Estimation of the output feature map by solving
2. Solution of the pre-image problem
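A rough sketch of the two steps, under stated assumptions: step 1 is taken to be kernel ridge regression (the slide's own optimization problem is not reproduced here), both kernels are plain linear kernels, and the pre-image step scores each candidate by its inner product with the predicted output feature vector. All data, names, and the regularization value `lam` are hypothetical.

```python
import numpy as np

def fit(K_in, lam):
    """Step 1 (assumed ridge form): return (K_in + lam * I)^{-1}."""
    n = K_in.shape[0]
    return np.linalg.inv(K_in + lam * np.eye(n))

def score_candidates(inv_K, k_query, K_out):
    """Step 2: score(y) = k_out(y)^T (K_in + lam I)^{-1} k_in(x) for each candidate y."""
    return K_out.T @ (inv_K @ k_query)

# Toy training set: spectra features and structure fingerprints.
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_train = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

# Candidate structures for the query and the query spectrum itself.
cand_fps = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
x_query = np.array([1.0, 0.0])

inv_K = fit(X_train @ X_train.T, lam=1.0)
k_query = X_train @ x_query        # input kernel: training spectra vs query
K_out = Y_train @ cand_fps.T       # output kernel: training vs candidate structures
scores = score_candidates(inv_K, k_query, K_out)
best = int(np.argmax(scores))
print(scores, best)
```

Ranking by inner product agrees with a nearest-point pre-image only when all candidates have the same norm in the output feature space, so a normalized output kernel would be the safer choice in practice.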
Machine learning Approach
Unsupervised ML for substructure annotation
• Metabolites may share common substructures, which yield similar fragments/peaks in their spectra.
• Such substructures relate to biochemical processes.
• Shared substructures allow metabolites to be grouped.
• This can improve the accuracy of metabolite identification.
III. Machine learning Approach
Unsupervised ML for substructure annotation
MS2LDA (Bioinformatics 2017)
• Automatically extracts relevant substructures of metabolites based on the co-occurrence of fragments and losses.
• Motivated by topic modeling for text, e.g. Latent Dirichlet Allocation (LDA)
• LDA for MS data (MS2LDA)
  - Peaks ~ words
  - Sets of peaks (substructures) ~ topics
• LDA decomposes a text into topics, while MS2LDA decomposes a molecule into substructures.
• Drawback: the extracted substructures must be annotated using expert knowledge (a complex and time-consuming process)
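The "peaks ~ words" analogy can be made concrete by turning a spectrum into a bag of words, which is the input a topic model expects. The sketch covers only this preprocessing step, not the LDA inference; the word naming scheme, the rounding to one decimal, and the use of integer intensities as word counts are assumptions.

```python
from collections import Counter

def spectrum_to_words(precursor_mz, peaks, decimals=1):
    """Convert one MS/MS spectrum into word counts over fragments and losses."""
    words = Counter()
    for mz, intensity in peaks:
        count = int(intensity)  # intensity plays the role of word frequency
        words[f"fragment_{round(mz, decimals)}"] += count
        loss = precursor_mz - mz
        if loss > 0:
            words[f"loss_{round(loss, decimals)}"] += count
    return words

# Hypothetical spectrum: precursor m/z and (m/z, intensity) peaks.
doc = spectrum_to_words(181.1, [(163.1, 80.0), (121.0, 20.0)])
print(doc)
```

A collection of such documents, one per spectrum, could then be passed to any LDA implementation; each learned topic is a set of co-occurring fragment and loss words.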
Machine learning Approach
Unsupervised ML for substructure annotation
Automated recommendation of substructures from MS/MS (Aida Mrzic et al., bioRxiv)
• Automatically extracts relevant substructures of molecules based on the co-occurrence of fragments and losses
• Applies frequent itemset mining to extract association rules.
• Given a query spectrum, recommends the substructures present in it by applying the extracted rules.
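A small sketch of the rule-mining idea, restricted to rules between pairs of items and using brute-force counting rather than a real frequent itemset algorithm. The training transactions, the `sub_` prefix for substructure labels, and the support and confidence thresholds are all hypothetical.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) between single items."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            support_both = sum(1 for t in transactions if lhs in t and rhs in t) / n
            support_lhs = sum(1 for t in transactions if lhs in t) / n
            if support_both >= min_support and support_both / support_lhs >= min_confidence:
                rules.append((lhs, rhs, support_both, support_both / support_lhs))
    return rules

def recommend(query_features, rules, prefix="sub_"):
    """Substructures implied by rules whose left-hand side occurs in the query."""
    return {rhs for lhs, rhs, _, _ in rules
            if lhs in query_features and rhs.startswith(prefix)}

# Each training spectrum: its fragment/loss features plus known substructures.
transactions = [
    {"loss_18.0", "frag_85.0", "sub_hexose"},
    {"loss_18.0", "frag_85.0", "sub_hexose"},
    {"loss_18.0", "sub_hexose"},
    {"frag_91.1", "sub_benzyl"},
]
rules = mine_rules(transactions)
suggested = recommend({"loss_18.0"}, rules)
print(rules, suggested)
```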
Conclusion
• Metabolite identification is an essential part of metabolomics and expands our knowledge of biological systems.
• Many techniques and software tools have been proposed for this task; they fall into three groups: MS library search, in silico fragmentation, and machine learning.
• ML methods are the key to recent progress in metabolite identification.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 

IBSB tutorial

  • 1. Computational methods for metabolite identification from tandem mass spectrometry. Dai Hai Nguyen, Kyoto University, Japan. 20/07/2018.
  • 2. Background of metabolite identification: Metabolites
      • Intermediate or end products of metabolism
      • Small molecules with important functions: energy transport, building blocks of cells, etc.
      • Many applications, e.g. drug discovery
      • Identifying or profiling them is challenging
  • 3. Background of metabolite identification: Tandem mass spectrometry
      • Breaks a compound into many fragments
      • Each fragment corresponds to a peak
      • Peaks interact (co-occurrence of peaks)
      • [Figure: peak interaction]
  • 4. Background of metabolite identification
      • Task: given a query spectrum, find similar molecules in a database.
      • Approaches: MS library, in silico fragmentation, machine learning
  • 5. I. Mass spectra library
      • Simply compare the query spectrum with the spectra in the library
      • The best-matching candidates are returned
      • Drawback: the size of the library is limited
      • E.g., Human Metabolome Database ~ 2000 compounds
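Library matching as described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the slide does not name a similarity measure, so cosine similarity over fixed-width m/z bins is assumed here, and the compound names, peak lists and the `bin_width` parameter are hypothetical.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (assumed preprocessing)."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_library(query, library, top_k=3):
    """Return the top_k (score, name) pairs, best match first."""
    q = bin_spectrum(query)
    scored = [(cosine_similarity(q, bin_spectrum(peaks)), name)
              for name, peaks in library.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical toy library: name -> list of (m/z, intensity) peaks.
library = {
    "compound_A": [(50.1, 10.0), (77.0, 100.0), (105.2, 40.0)],
    "compound_B": [(41.0, 80.0), (69.3, 100.0)],
}
query = [(50.3, 12.0), (77.1, 95.0), (105.0, 38.0)]
hits = search_library(query, library)
```

The sketch also shows the drawback noted on the slide: a query whose compound is missing from `library` can only be matched to whatever happens to be stored.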
  • 6. II. In silico fragmentation
      • Compensates for the limited coverage of spectral libraries by taking advantage of structural databases.
      • Can be divided into three groups:
          1) rule-based
          2) combinatorial-based
          3) machine-learning-based
  • 7. II. In silico fragmentation (1): 1) Rule-based fragmentation, e.g., Mass Frontier
      • Uses a set of fragmentation rules to predict spectra from compound structures.
      • Rules are extracted from the literature.
      • Not preferred in practice because:
          • fragmentation can vary with small changes in molecular structure
          • the number of rules is insufficient to identify fragments with high accuracy
          • peak intensities are ignored
  • 8. II. In silico fragmentation (2): 2) Combinatorial-based fragmentation, e.g. FiD
      • From the molecular structure, generate a graph of all connected substructures.
      • Find the most likely fragmentation tree, i.e. the one that best matches the spectrum.
      • Drawbacks:
          • computationally expensive, so applied to small molecules
          • intensities of fragments are ignored
  • 9. II. In silico fragmentation (3): 3) Machine-learning-based fragmentation
      • Uses ML to learn the fragmentation process from data.
      • Peak intensities are considered and learned.
      • Very few works exist.
  • 10. II. In silico fragmentation (3): Competitive Fragmentation Modeling (CFM)
      • Models fragmentation as a Markov process of state transitions between fragments
          1. Transition model
          2. Observation model
  • 11. Background of metabolite identification
      • Task: given a query spectrum, find similar molecules in a database.
      • Approaches: MS library, in silico fragmentation, machine learning
  • 12. III. Machine learning approach
      a) Supervised ML for substructure prediction
      b) Unsupervised ML for substructure annotation
  • 13. III. Machine learning approach: Supervised ML for substructure prediction
      • Step 1: fingerprint prediction
      • Step 2: candidate retrieval
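Step 2 (candidate retrieval) can be illustrated with a minimal sketch. It assumes that step 1 has already produced a predicted binary fingerprint, represented here as the set of its "on" bit indices, and that candidates are ranked by Tanimoto similarity; the slide does not name the ranking score, and all candidate names and bit sets below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def rank_candidates(predicted_bits, candidates):
    """Sort candidate structures by similarity to the predicted fingerprint."""
    scored = [(tanimoto(predicted_bits, bits), name)
              for name, bits in candidates.items()]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored

# Hypothetical output of step 1 (fingerprint prediction from a spectrum).
predicted = {1, 4, 7, 9}

# Hypothetical candidate structures from a structural database.
candidates = {
    "cand_1": {1, 4, 7, 9, 12},
    "cand_2": {2, 4, 8},
    "cand_3": {1, 7},
}
ranking = rank_candidates(predicted, candidates)
```

Because retrieval only needs fingerprints of known structures, the candidate set can come from a structural database rather than from a spectral library.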
  • 14. Machine learning approach: Supervised ML for substructure prediction. FingerID (Bioinformatics, 2012)
      • Kernel method
          • Define a probability product kernel (PPK) for spectra:
              $p_X = \frac{1}{n_X} \sum_{k=1}^{n_X} p_{X(k)}$
              $p_Y = \frac{1}{n_Y} \sum_{k=1}^{n_Y} p_{Y(k)}$
              $K(X, Y) = \frac{1}{n_X n_Y} \sum_{i,j} p_{X(i)} p_{Y(j)}$
          • Then use an SVM for classification.
      • Drawbacks
          • Peak interactions are ignored.
          • Limited accuracy
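A minimal sketch of the probability product kernel follows. It rests on assumptions not stated on the slide: each peak is treated as a two-dimensional Gaussian over (m/z, intensity) with a single shared isotropic standard deviation `sigma`, and each product term in the kernel is read as the integral of the two peak densities, which has a closed form for Gaussians.

```python
import math

def gaussian_product_integral(a, b, sigma):
    """Integral of the product of two 2-D isotropic Gaussians centred at a and b.

    For a shared variance sigma**2 this equals
    exp(-||a - b||^2 / (4 sigma^2)) / (4 pi sigma^2).
    """
    dist2 = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return math.exp(-dist2 / (4.0 * sigma ** 2)) / (4.0 * math.pi * sigma ** 2)

def ppk(spectrum_x, spectrum_y, sigma=1.0):
    """Probability product kernel between two spectra given as (m/z, intensity) lists."""
    total = 0.0
    for peak_x in spectrum_x:
        for peak_y in spectrum_y:
            total += gaussian_product_integral(peak_x, peak_y, sigma)
    return total / (len(spectrum_x) * len(spectrum_y))

# Hypothetical toy spectra.
X = [(100.0, 1.0), (150.0, 0.5)]
Y = [(100.2, 0.9), (200.0, 0.3)]
k_xy = ppk(X, Y)
k_yx = ppk(Y, X)
k_xx = ppk(X, X)
```

Each peak contributes independently to the sum, which is one way to see the drawback listed on the slide: co-occurrence of peaks within a spectrum is not modelled.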
  • 15. Machine learning approach: Supervised ML for substructure prediction. CSI:FingerID (Bioinformatics, 2014)
      • Improved version of FingerID
      • Defines a kernel for spectra using the PPK
      • Kernels for fragmentation trees are defined and combined with the PPK via multiple kernel learning (MKL).
      • Then an SVM is used for classification.
  • 16. Machine learning approach: Supervised ML for substructure prediction. CSI:FingerID (Bioinformatics, 2014): Fragmentation trees
      • Models of the fragmentation of a molecule in MS/MS
      • Nodes ~ peaks ~ molecular formulas of fragments.
      • Edges ~ losses ~ uncaptured, uncharged fragments.
      • Trees can be predicted from spectra and provide structural information about them.
  • 17. Machine learning approach: Supervised ML for substructure prediction. CSI:FingerID (Bioinformatics, 2014): Pros & Cons
      • Improved accuracy due to the additional structural information provided by trees
      • Computationally expensive because spectra must be converted into trees
      • Lack of interpretability
  • 18. Machine learning approach: Supervised ML for substructure prediction. SIMPLE (Bioinformatics, 2018)
      • Idea: introduce an interaction term into the model (two-way interaction model)
      • Prediction model: $f(x) = b + w^T x + x^T W x$, $y(x) = \mathrm{sgn}(f(x))$, where $w^T x$ captures the peaks and $x^T W x$ captures the interactions
      • Objective function: $\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_*$, where the first term is the hinge loss, $\|w\|_1$ enforces sparsity and $\|W\|_*$ enforces low rank
      • Convexity guarantees that a globally optimal solution is found.
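The prediction model and objective on this slide can be evaluated with a short sketch. It only computes the objective value and does not implement the optimizer. One assumption is made to stay within the standard library: W is built as V^T V, so it is positive semidefinite and its nuclear norm equals its trace. All numbers and the helper names are hypothetical.

```python
def gram(V):
    """W = V^T V, where V is a k x d matrix whose columns are the vectors v_i."""
    d = len(V[0])
    return [[sum(row[i] * row[j] for row in V) for j in range(d)] for i in range(d)]

def predict_score(x, b, w, W):
    """f(x) = b + w^T x + x^T W x (peak term plus interaction term)."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interaction = sum(W[i][j] * x[i] * x[j]
                      for i in range(len(x)) for j in range(len(x)))
    return b + linear + interaction

def hinge(y, score):
    """Hinge loss [1 - y f(x)]_+ for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def objective(data, b, w, V, alpha, beta):
    """Hinge loss + alpha * ||w||_1 + beta * ||W||_*, with W = V^T V."""
    W = gram(V)
    loss = sum(hinge(y, predict_score(x, b, w, W)) for x, y in data)
    l1 = sum(abs(wi) for wi in w)
    nuclear = sum(W[i][i] for i in range(len(W)))  # trace = nuclear norm for PSD W
    return loss + alpha * l1 + beta * nuclear

# Hypothetical toy problem with three peak features.
V = [[1.0, 0.0, 1.0]]
w = [0.5, -1.0, 0.0]
b = -0.5
data = [([1, 0, 1], +1), ([0, 1, 0], -1), ([1, 1, 0], +1)]
value = objective(data, b, w, V, alpha=0.1, beta=0.01)
```

In this toy setting the first feature vector gets most of its score from the interaction between peaks 1 and 3, which a purely linear model could not express.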
  • 19. Machine learning approach: Supervised ML for substructure prediction. SIMPLE (Bioinformatics, 2018)
      • Idea: use background knowledge (interactions from trees) to regularize $W$.
      • Laplacian regularization:
          • $x^T W x = \sum_{i,j} w_{ij} x_i x_j = \sum_{i,j} (v_i^T v_j) x_i x_j$
          • $W$ can be decomposed as $V^T V$ (low-rank decomposition)
          • $R(V) = \sum_{i,j} A_{ij} \|v_i - v_j\|^2 = \mathrm{trace}(W L)$, where $L$ is the Laplacian matrix.
      • New objective function: $\min_{b, w, W} \sum_{i=1}^{n} [1 - y_i f(x_i)]_+ + \alpha \|w\|_1 + \beta \|W\|_* + \gamma \, \mathrm{trace}(W L)$
      • Still convex
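The link between the pairwise penalty and trace(WL) can be checked numerically. The sketch below assumes a symmetric adjacency matrix A with zero diagonal, the graph Laplacian L = D - A, and a sum taken over unordered pairs i < j; summing over all ordered pairs instead would give twice the trace. The matrices are hypothetical toy values.

```python
def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def gram(V):
    """W = V^T V, where V is a k x d matrix whose columns are the vectors v_i."""
    d = len(V[0])
    return [[sum(row[i] * row[j] for row in V) for j in range(d)] for i in range(d)]

def pairwise_penalty(A, V):
    """Sum over unordered pairs i < j of A_ij * ||v_i - v_j||^2."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dist2 = sum((row[i] - row[j]) ** 2 for row in V)
            total += A[i][j] * dist2
    return total

def trace_product(W, L):
    """trace(W L) computed without forming the matrix product."""
    n = len(W)
    return sum(W[i][j] * L[j][i] for i in range(n) for j in range(n))

# Hypothetical interaction graph: peak 1 - peak 2 - peak 3.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
# Hypothetical factors: column i of V is v_i.
V = [[1.0, 2.0, 0.0],
     [0.5, -1.0, 3.0]]

penalty = pairwise_penalty(A, V)
trace_value = trace_product(gram(V), laplacian(A))
```

Since trace(WL) is linear in W, adding it to the objective keeps the problem convex, as the slide states.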
  • 20. Machine learning approach: Supervised ML for substructure prediction. Input Output Kernel Regression (IOKR) (Bioinformatics, 2017)
      • Idea: use IOKR to learn the mapping between spectra and molecular structures.
      • Two steps:
          1. Estimation of the output feature map by solving …
          2. Computation of the pre-image problem
  • 21. Machine learning approach: Unsupervised ML for substructure annotation
      • Metabolites/molecules may have common substructures, yielding similar fragments/peaks in their spectra.
      • Such substructures relate to biochemical processes.
      • This allows metabolites to be grouped by shared substructures.
      • It can improve the accuracy of metabolite identification.
  • 22. III. Machine learning approach: Unsupervised ML for substructure annotation. MS2LDA (Bioinformatics, 2017)
      • Automatically extracts relevant substructures of metabolites based on the co-occurrence of fragments and losses.
      • Motivated by topic modeling for text, e.g. Latent Dirichlet Allocation (LDA)
      • LDA for MS data (MS2LDA):
          • peaks ~ words
          • sets of peaks (substructures) ~ topics
          • LDA decomposes a text into topics, while MS2LDA decomposes a molecule into substructures.
      • Drawback: extracted substructures need to be annotated using expert knowledge (a complex and time-consuming process)
  • 23. Machine learning approach: Unsupervised ML for substructure annotation. Automated recommendation of substructures from MS/MS (Aida Mrzic et al., bioRxiv)
      • Automatically extracts relevant substructures of molecules based on the co-occurrence of fragments and losses
      • Applies frequent itemset mining to extract association rules.
      • Given a query spectrum, recommends the substructures present in it by applying the extracted rules.
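The rule-extraction idea can be sketched in a heavily simplified form. The sketch is not the method of the cited work: it only mines rules with a single fragment or loss as antecedent and a single substructure as consequent, using support and confidence thresholds, whereas frequent itemset mining in general considers larger itemsets. All feature names, substructure labels and thresholds are hypothetical.

```python
from collections import Counter

def mine_rules(records, min_support=2, min_confidence=0.6):
    """Mine rules 'feature -> substructure' from (features, substructures) records."""
    feature_counts = Counter()
    pair_counts = Counter()
    for features, substructures in records:
        for feature in set(features):
            feature_counts[feature] += 1
            for sub in set(substructures):
                pair_counts[(feature, sub)] += 1
    rules = {}
    for (feature, sub), support in pair_counts.items():
        confidence = support / feature_counts[feature]
        if support >= min_support and confidence >= min_confidence:
            rules[(feature, sub)] = confidence
    return rules

def recommend(query_features, rules):
    """Recommend substructures for a query spectrum, best confidence first."""
    scores = {}
    for (feature, sub), confidence in rules.items():
        if feature in query_features:
            scores[sub] = max(scores.get(sub, 0.0), confidence)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical training records: spectral features and known substructures.
records = [
    ({"frag_91", "loss_18"}, {"benzyl", "hydroxyl"}),
    ({"frag_91", "frag_65"}, {"benzyl"}),
    ({"loss_18", "frag_43"}, {"hydroxyl"}),
    ({"frag_91", "loss_18"}, {"benzyl", "hydroxyl"}),
]
rules = mine_rules(records)
suggestions = recommend({"frag_91"}, rules)
```

Rules below the support threshold, such as the one for the feature seen only once, are dropped, which is the usual guard against recommending substructures from chance co-occurrences.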
  • 24. Conclusion
      • Metabolite identification is an essential part of metabolomics and expands our knowledge of biological systems.
      • Many techniques and software tools with different approaches have been proposed for this task; they can be categorized into three groups: MS library, in silico fragmentation, and machine learning.
      • ML methods are the key to recent progress in metabolite identification.
