Introduction to
Retrosynthesis Prediction
2020. 06
Wonjun Jeong
wonjun.jg@kaist.ac.kr
wonjun.email@gmail.com
Table of Contents
• Introduction
• Retrosynthesis prediction
• Dataset description
• Overview of general approaches: Template-based, Template-free, Selection-based
• Proposed methods
• Classical computer-aided methods
• Machine learning based methods
• Challenges
• Practice
• RDKit
• OpenNMT
• Related works
• Future directions
• Reference
• Appendix
Retrosynthesis prediction
• What is retrosynthesis prediction?
• Retrosynthesis or retrosynthetic pathway planning is the process of tracing back the
forward reaction, predicting which reactants are required to synthesize the target product.
Retrosynthesis prediction
• Retrosynthesis is a crucial step in discovering new materials and drugs.
[Figure: Desired properties → Candidate product → Candidate reactants → Test by chemist]
Retrosynthesis prediction
• Each step in discovering new materials and drugs has its own error, so the results must be verified by a chemist.
• Expensive
[Figure: Desired properties → Candidate product → Candidate reactants → Test by chemist; label: Retrosynthesis prediction]
Retrosynthesis prediction
• Retrosynthesis prediction has relied heavily on trial-and-error cycles carried out by experienced researchers with chemical expertise.
Retrosynthesis prediction
• If retrosynthesis prediction can be done with high accuracy …
• It could unlock a fully automated material/drug discovery pipeline.
[Figure: Desired properties → Candidate product → Candidate reactants → Test by robot; label: Retrosynthesis prediction]
Dataset description
• SMILES (Simplified Molecular-Input Line-Entry System) [1]
• SMILES is a specification in the form of a line notation for describing the structure of
chemical species [2].
• Generation of SMILES
• A SMILES string is obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph.
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
Dataset description
• SMILES in detail
• The carbon (C) label is omitted in the graph drawing.
• Hydrogen (H) is usually omitted (implicit) in SMILES.
• Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring-closure labels to show connectivity between non-adjacent atoms.
• Branches are described with parentheses.
• A bond is represented using one of the symbols: ., -, =, #, $, :, /, \
• “.” indicates that two parts are not bonded together.
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
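The depth-first generation and the branch notation can be sketched in a few lines. This is a toy illustration rather than the full SMILES algorithm: it assumes an acyclic molecular graph (so no ring-closure labels are needed), implicit hydrogens, and a hand-written adjacency list. The helper name `to_smiles` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# atom index -> element symbol
Atoms = Dict[int, str]
# atom index -> list of (neighbour index, bond symbol); "" means a single bond
Bonds = Dict[int, List[Tuple[int, str]]]


def to_smiles(atoms: Atoms, bonds: Bonds, start: int = 0) -> str:
    """Write a SMILES-like string by depth-first traversal of an acyclic graph."""
    visited = set()

    def visit(idx: int) -> str:
        visited.add(idx)
        out = atoms[idx]
        children = [(n, b) for n, b in bonds.get(idx, []) if n not in visited]
        for pos, (nbr, bond) in enumerate(children):
            piece = bond + visit(nbr)
            # Every child except the last one is a branch, written in parentheses.
            if pos < len(children) - 1:
                piece = "(" + piece + ")"
            out += piece
        return out

    return visit(start)


# Acetic acid: C-C(=O)-O, hydrogens implicit.
acetic_atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
acetic_bonds = {
    0: [(1, "")],
    1: [(0, ""), (2, "="), (3, "")],
    2: [(1, "=")],
    3: [(1, "")],
}
print(to_smiles(acetic_atoms, acetic_bonds))  # CC(=O)O
```

Starting the traversal from a different atom gives a different but equivalent string, which is why SMILES strings are not unique unless they are canonicalized.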
Dataset description
• Benchmark:
1. USPTO (United States Patent and Trademark Office)
• The USPTO benchmark contains SMILES representations of a single target product (input) and its reactants (target).
• Variants
• USPTO-50k
• USPTO-500K
• USPTO-MIT
2. Pistachio [32]
3. Reaxys [25]
[25] reaxys.com [32] Mayfield et al.
Overview of general approaches: Template-based
• Template-based approaches [2, 3, 4, 5, 14, 15, 16, 17] use known chemical reaction patterns, called reaction templates.
• A reaction template contains sub-graph patterns that describe how the reaction occurs between the reactants and the product.
• Pros
• High interpretability
• Cons
• Low generalizability to unseen templates
• Requires domain knowledge to extract the reaction templates
Overview of general approaches: Template-free
• Template-free approaches [6, 7, 8, 9, 10, 12] learn a mapping function f from a product to a set of reactants by extracting features directly from data.
• Seq2Seq framework
• [6, 7, 8, 12]
• Graph2Graph framework
• [9, 10]
• Pros
• Generalizability
• Does not require domain knowledge
• Cons
• Invalid/inaccessible predictions
• Low interpretability
Overview of general approaches: Selection-based
• Selection-based approaches [11] select reactants from a candidate set of purchasable molecules.
• The objective of [11] is to discover retrosynthetic routes from a given desired product to commercially available reactants.
• Pros
• Accessibility of the prediction
• Does not require domain knowledge
• Cons
• Novelty
[11] Guo et al.
[Figure: Rank := f(product; purchasable pool)]
Classical computer-aided methods
• Before deep learning, computer-aided retrosynthesis was mainly conducted using reaction templates [2, 3, 4, 15, 16, 17].
• These methods are mainly about how to use known reactions and extract meaningful reaction context.
• Characteristics
• Requires chemical expertise
• Heuristics
• Computationally expensive
• Chemical space is vast
• Subgraph isomorphism problem*1
• Not scalable
• Not generalizable
*1: Appendix-1
Classical computer-aided methods
• The first computer-aided retrosynthesis:
• [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
• The author won the Nobel Prize in Chemistry for his contributions to retrosynthetic analysis.
• [19] The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Molecules (Nobel lecture), 1991
[18, 19] Corey et al.
Classical computer-aided methods:
Recent work [3] 2017
[3] Coley et al.
• Key Idea
• It uses product similarity and reactant similarity to rank the templates of precedent reactions.
[3] Coley et al.
Classical computer-aided methods:
Recent work [3] 2017 – Key Idea
• How to measure molecular similarity*2?
• Molecular fingerprints are a way of encoding the structure of a molecule. They can be computed with the RDKit library.
• The most common measure is Tanimoto similarity, but there is no canonical definition of molecular similarity (subgraph isomorphism problem*1).
• x, y: molecular fingerprints
*1: Appendix-1, *2: Appendix-2
Img from [20]
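For binary fingerprints, Tanimoto similarity is the number of shared on-bits divided by the number of distinct on-bits, T(x, y) = |x ∩ y| / |x ∪ y|. The sketch below shows the arithmetic on hand-written bit sets; the bit indices are made up for illustration, and real fingerprints would normally come from RDKit, which also provides `DataStructs.TanimotoSimilarity` for its own fingerprint objects.

```python
from typing import Set


def tanimoto(x: Set[int], y: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not x and not y:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(x & y) / len(x | y)


fp_a = {1, 4, 7, 9}  # hypothetical on-bits of molecule A
fp_b = {1, 4, 8}     # hypothetical on-bits of molecule B
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```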
Classical computer-aided methods:
Recent work [3] 2017 – Method (Similarity)
• Example of using similarity in [3]
• Total similarity := product similarity × reactant (precursor) similarity
[Figure: precedents ranked by total similarity]
[3] Coley et al.
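The ranking rule can be illustrated with precomputed similarity scores. This is a minimal sketch of the product-times-precursor scoring only; the `Precedent` record and the numbers are hypothetical and are not taken from [3].

```python
from typing import List, NamedTuple


class Precedent(NamedTuple):
    name: str
    product_sim: float    # similarity between the target and the precedent product
    precursor_sim: float  # similarity between proposed and precedent reactants


def rank_precedents(precedents: List[Precedent]) -> List[Precedent]:
    """Sort precedents by total similarity (product sim * precursor sim), best first."""
    return sorted(
        precedents,
        key=lambda p: p.product_sim * p.precursor_sim,
        reverse=True,
    )


candidates = [
    Precedent("A", 0.90, 0.50),  # total 0.45
    Precedent("B", 0.70, 0.80),  # total 0.56
    Precedent("C", 0.95, 0.30),  # total 0.285
]
for p in rank_precedents(candidates):
    print(p.name, round(p.product_sim * p.precursor_sim, 3))
```

Multiplying the two scores means a precedent must be similar on both the product side and the reactant side to rank highly; precedent C has the most similar product but still ranks last.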
Classical computer-aided methods:
Recent work [3] 2017 – Method (Using similarity)
• Result of [3]
• [3] performs better than seq2seq. Note, however, that the seq2seq model in the table is template-free, whereas [3] is template-based.
• Contribution
• It mimics a chemist's retrosynthetic strategy by using molecular similarity, without the need to encode any chemical knowledge.
• Limitation
• It inherently disfavors creative retrosynthetic strategies because it relies on precedent reactions.
*3: Appendix-3
Classical computer-aided methods:
Recent work [3] 2017 - Results
Machine learning based methods
• Data-driven methods using machine learning and deep learning have been actively studied since the mid-2010s.
• They reduce the need for expertise.
• They are more scalable and generalizable.
• Representative proposed methods
• Template-based
• NeuralSim [14], Graph Logic Network (GLN) [5]
• Template-free
• Seq2Seq [21], Molecular Transformer (MT) [6, 7], Latent variable Transformer (LV-MT)
[8], Self-Corrected Transformer (SCROP) [22], Graph2Graph (G2G) [9], GraphRetro [10]
• Selection-based
• Bayesian-Retro [11]
Machine learning based methods
Template-based: NeuralSim [14] 2017
[14] Segler et al.
• Key Idea
• Given a target product, it uses a neural network to predict the most suitable rule among the reaction templates.
[14] Segler et al.
Machine learning based methods
Template-based: NeuralSim [14] 2017 – Key Idea
• Template-based: NeuralSim [14]
• It uses simple models such as an MLP and a Highway network [23].
• It formulates rule selection as multiclass classification.
• The molecular descriptor [24] is defined as a sum of molecular fingerprints (see [24]).
[14] Segler et al. [23] Srivastava et al. [24] pdf file
Machine learning based methods
Template-based: NeuralSim [14] 2017 - Method
• Template-based: NeuralSim [14]
• Experiments
• Dataset: Reaxys database [25]
• Number of classes: 8,720
• Contribution
• It shows that neural networks can learn in which molecular contexts particular rules can be applied.
• Limitation
• The performance is affected by the cardinality of the rule set.
• The larger the rule set, the lower the performance.
[14] Segler et al.
Machine learning based methods
Template-based: NeuralSim [14] 2017 - Results
• Template-based: Graph Logic Network (GLN) [5] (NeurIPS 2019)
[5] Dai et al.
Machine learning based methods
Template-based: Graph Logic Network [5] 2019
• Key Idea
• It models the joint distribution of reaction templates and reactants using logic variables.
• It learns when rules from reaction templates should be applied.
[5] Dai et al.
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 – Key Idea
• Retrosynthesis Template
• Applying a retrosynthesis template can be decomposed into a two-step logic:
• Match template
• Match reactants
[5] Dai et al.
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 - Background
• Match template
• Match reactants
• Uncertainty
• Template score function
• Reactants score function
[5] Dai et al.
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 - Method
• Final joint probability
[5] Dai et al. *4: Appendix-4
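The equations on these slides did not survive extraction. As a hedged reading of the two-step logic (match template, then match reactants), the joint distribution can be written hierarchically. The symbols here are assumptions chosen for illustration: O is the target product, T a reaction template, and R the set of reactants. The exact score functions used in [5] are not reproduced.

```latex
p(T, R \mid O) = p(T \mid O)\, p(R \mid T, O)
```

Under this factorization, the template score function parameterizes the first factor and the reactants score function parameterizes the second.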
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 - Method
Parameterizing by GNN (Graph Neural Network)*4
• MLE with Efficient Inference
• Gradient approximation
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 - Method
[5] Dai et al.
• Top-k results
• Contribution
• Interpretability: integration of probabilistic models and templates (chemical rules)
• Limitation
• It shares the limitations of template-based methods.
• Scalability
[5] Dai et al.
Machine learning based methods
Template-based: Graph Logic Network [5] 2019 - Results
[21] Liu et al.
Machine learning based methods
Template-free: Seq2Seq [21] 2017
• Template-free: Seq2Seq [21] (2017)
• It tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a bidirectional LSTM for the encoder and an LSTM for the decoder.
• It uses beam search to produce a set of candidate reactants.
[21] Liu et al.
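Tokenizing SMILES for a translation model is usually done with a regular expression that keeps multi-character atoms together. The pattern below is a simplified sketch in the spirit of commonly used SMILES tokenizers, not the exact pattern from [21]; the names `SMILES_TOKEN` and `tokenize` are hypothetical.

```python
import re
from typing import List

# Bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single digits, then bond/branch/other symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#$:/\\.\-+~@*>])"
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens and check that nothing was dropped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


print(tokenize("CC(=O)O"))       # ['C', 'C', '(', '=', 'O', ')', 'O']
print(tokenize("[NH4+].[Cl-]"))  # ['[NH4+]', '.', '[Cl-]']
```

The token sequence of the product is the source sentence and the token sequence of the reactants (joined by ".") is the target sentence.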
Machine learning based methods
Template-free: Seq2Seq [21] 2017 - Method
• Results
• It performs comparably to the rule-based expert system baseline.
• Contribution
• It shows that a fully data-driven seq2seq model can learn retrosynthetic pathways.
• Limitations
• It produces grammatically invalid SMILES and chemically implausible predictions.
• It is a naïve application of a seq2seq model.
• Predictions generated by a vanilla seq2seq model with beam search typically exhibit low diversity, with only minor differences in the suffix [8].
[21] Liu et al., [8] Chen et al.
Machine learning based methods
Template-free: Seq2Seq [21] 2017 – Results
• Grammatically invalid SMILES
• Grammatically valid but chemically implausible
[21] Liu et al.
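The first failure mode, grammatically invalid SMILES, can be partly detected without any chemistry. The sketch below is a deliberately cheap syntax check (balanced parentheses and brackets, paired ring-closure labels) written for illustration; the function name is hypothetical. It cannot detect the second failure mode, and in practice a parser such as RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse, is the more reliable validity check.

```python
def looks_syntactically_valid(smiles: str) -> bool:
    """Cheap syntax check: balanced (), closed [], and paired ring-closure labels."""
    if not smiles:
        return False
    depth = 0
    open_rings = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside [...] are isotopes, H counts or charges, not ring labels.
            if ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%" or ch.isdigit():
            if ch == "%":
                label = smiles[i + 1:i + 3]
                if len(label) != 2 or not label.isdigit():
                    return False
                i += 2
            else:
                label = ch
            # A label opens a ring the first time and closes it the second time.
            if label in open_rings:
                open_rings.remove(label)
            else:
                open_rings.add(label)
        i += 1
    return depth == 0 and not in_bracket and not open_rings


for s in ["c1ccccc1", "c1ccccc", "CC(=O", "[NH4+].[Cl-]"]:
    print(s, looks_syntactically_valid(s))
```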
Machine learning based methods
Template-free: Seq2Seq [21] 2017 – Results
[6] Schwaller et al., [7] Lee et al.
Machine learning based methods
Template-free: Molecular Transformer [6, 7] 2019
• Key Idea
• Like [21], it tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a Transformer instead of an LSTM.
• It performs better than seq2seq [21] but has the same limitations.
Machine learning based methods
Template-free: Molecular Transformer [6, 7] 2019 – Key Idea
[6] Schwaller et al., [7] Lee et al. [21] Liu et al.
• Template-free: Latent variable Transformer (LV-MT) [8] (arXiv 2019)
[8] Chen et al.
Machine learning based methods
Template-free: LV-MT [8] 2019
• It extends the Molecular Transformer (MT) to generalize better to rare reactions and to produce diverse predictions.
• Key Idea
• It proposes novel pretraining methods.
• Random bond cut
• Template-based bond cut
• It trains a mixture model with the online hard-EM algorithm.
[8] Chen et al.
Machine learning based methods
Template-free: LV-MT [8] 2019 – Key Idea
• Pretrain methods
• Random bond cut
• For each input target product, it generates new examples by selecting a random
bond to break.
• Template-based bond cut
• Instead of randomly breaking bonds, it uses the templates to break bonds.
• The model is pre-trained on these auxiliary examples, and then used as initialization
to be fine-tuned on the actual retrosynthesis data.
Machine learning based methods
Template-free: LV-MT [8] 2019 – Method (Pretrain)
[8] Chen et al
• Why are latent variables introduced?
• They tackle the problem of generating diverse predictions.
• The outputs of beam search tend to be similar to each other.
• Given a target SMILES string x and a reactants SMILES string y, a mixture model introduces a multinomial latent variable z ∈ {1, …, K} to capture different reaction types, and decomposes the marginal likelihood as:
$$p(y \mid x; \theta) = \sum_{z=1}^{K} p(y, z \mid x; \theta) = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid z, x; \theta)$$
Machine learning based methods
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
[8] Chen et al
• Hard-EM algorithm
1. Take a mini-batch of training examples $(x_i, y_i)$.
2. Enumerate all K values of z and compute the loss $-\log p(y_i \mid z, x_i; \theta)$ for each.
• Dropout should be turned off [26].
3. For each example $(x_i, y_i)$, select the value of z that yields the minimum loss: $z_i^{*} = \arg\min_{z} -\log p(y_i \mid z, x_i; \theta)$
• For p(y | z, x; θ), the encoder-decoder network is shared among the mixture components, and the embedding of z is fed as an input to the decoder so that y is conditioned on it.
4. Back-propagate through the selected component only, so that only one component receives gradients per example.
• Dropout should be turned back on [26].
[8] Chen et al., [26] Shen et al.
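The four steps can be sketched on a toy mixture. This is not the LV-MT training code: each component k here simply predicts a scalar `mus[k]` with squared-error loss, standing in for the shared Transformer conditioned on z, and the dropout switching is omitted because the toy model has no dropout. All names are hypothetical.

```python
from typing import List, Tuple


def hard_em_step(
    batch: List[float], mus: List[float], lr: float = 0.5
) -> Tuple[List[float], List[int]]:
    """One hard-EM update for a toy mixture whose component k predicts mus[k]."""
    grads = [0.0] * len(mus)
    counts = [0] * len(mus)
    assignments = []
    for y in batch:
        # Steps 2-3: evaluate every component and keep the lowest-loss one.
        losses = [(y - mu) ** 2 for mu in mus]
        z = min(range(len(mus)), key=losses.__getitem__)
        assignments.append(z)
        # Step 4: only the selected component receives a gradient.
        grads[z] += 2.0 * (mus[z] - y)
        counts[z] += 1
    new_mus = [
        mu - lr * g / c if c else mu
        for mu, g, c in zip(mus, grads, counts)
    ]
    return new_mus, assignments


batch = [0.9, 1.1, 4.8, 5.2]
mus, assignments = hard_em_step(batch, mus=[0.0, 6.0])
print(assignments)                 # which component was selected per example
print([round(m, 3) for m in mus])  # each component moves toward its own examples
```

Because each example updates only its best component, the components specialize on different parts of the data, which is the mechanism intended to make the decoded predictions more diverse.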
Machine learning based methods
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
• Results*5
*5: Better hyper-parameters and the corresponding results are reported in Appendix-5.
Machine learning based methods
Template-free: LV-MT [8] 2019 – Results
• Contributions
• It proposes novel pretraining methods for retrosynthesis.
• It uses a mixture-model Transformer for diverse predictions.
• Limitations
• The larger the number of latent classes K, the worse the top-1 performance.
• The latent variable does not appear to contain information about the reaction class.
Machine learning based methods
Template-free: LV-MT [8] 2019 – Results
[8] Chen et al
• Template-free: Self-Corrected Transformer (SCROP) [22] (2020)
[22] Zheng et al.
Machine learning based methods
Template-free: SCROP [22] 2020
• Key Idea
• It uses a Transformer to correct invalid predicted SMILES.
• It builds syntax-correction data from a trained Transformer by constructing pairs of invalid predictions and ground truths.
• It trains another Transformer as a syntax corrector on this data.
• At test time, it keeps the top-1 candidate produced by the syntax corrector and uses it to replace the original prediction.
[22] Zheng et al.
Machine learning based methods
Template-free: SCROP [22] 2020 – Key Idea
• Results
• Compared with the plain Transformer (SCROP-noSC), performance improves by 0.4~1.7%.
Machine learning based methods
Template-free: SCROP [22] 2020 – Results
[22] Zheng et al.
• Invalid SMILES rates
• Limitations
• It is unclear why a learned corrector is needed: invalid SMILES can be removed with RDKit, without a learned model.
[22] Zheng et al.
Machine learning based methods
Template-free: SCROP [22] 2020 – Results
• Template-free: Graph2Graph (G2G) [9] (ICML 2020)
[9] Shi et al.
Machine learning based methods
Template-free: G2G [9] 2020
• Key Idea
• It decomposes retrosynthesis into a two-step procedure:
• Breaking the target product
• Transforming the broken target product
• It trains a Reaction Center Identification (RCI) module that produces synthon(s) by breaking bonds in the product graph.
• It trains a Variational Graph Translation module that produces reactants via a series of graph transformations.
Machine learning based methods
Template-free: G2G [9] 2020 – Key Idea
[9] Shi et al.
• Reaction Center Identification (RCI)
• It uses an R-GCN [27] to learn graph representations.
• Overview
1. Given a chemical reaction, it derives a binary label matrix.
2. It computes node embeddings and a graph embedding.
3. To estimate the reactivity score of atom pair (i, j), the edge embedding is formed by concatenating several features.
4. The final reactivity score of the atom pair (i, j) is calculated from this edge embedding.
5. The RCI module is optimized by minimizing the cross-entropy loss on the binary labels.
Machine learning based methods
Template-free: G2G [9] 2020 – Method (RCI)
[9] Shi et al. [27] Schlichtkrull et al.
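The training objective in step 5 is a binary cross-entropy over atom pairs. A minimal sketch is given below, assuming the reactivity scores are already probabilities in (0, 1); the dictionaries keyed by atom pair and the function name are hypothetical, and the R-GCN that would produce the scores is not shown.

```python
import math
from typing import Dict, Tuple

Pair = Tuple[int, int]


def reaction_center_loss(scores: Dict[Pair, float], labels: Dict[Pair, int]) -> float:
    """Mean binary cross-entropy between reactivity scores and 0/1 labels."""
    eps = 1e-12  # keeps log() finite for scores of exactly 0 or 1
    total = 0.0
    for pair, s in scores.items():
        lam = labels[pair]
        s = min(max(s, eps), 1.0 - eps)
        total -= lam * math.log(s) + (1 - lam) * math.log(1.0 - s)
    return total / len(scores)


labels = {(0, 1): 1, (1, 2): 0}     # bond (0, 1) is the reaction center
good = {(0, 1): 0.9, (1, 2): 0.1}   # confident and correct
bad = {(0, 1): 0.1, (1, 2): 0.9}    # confident and wrong
print(reaction_center_loss(good, labels) < reaction_center_loss(bad, labels))  # True
```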
• Reactants generation via Variational Graph Translation (VGT).
1. It receives synthons from the RCI module and transforms them into reactants.
2. It generates a sequence of graph transformation actions and applies them to the initial synthon graph.
• It models graph generation as a Markov Decision Process (MDP).
Machine learning based methods
Template-free: G2G [9] 2020 – Method (VGT)
[9] Shi et al.
• Reactants generation via Variational Graph Translation (VGT).
• Overview
1. Once the transformation trajectory (the sequence of actions) is defined, the graph transformation is deterministic.
2. Consider the graph obtained after applying the first actions of the trajectory to the initial synthon graph.
3. Under the MDP assumption, each action depends only on the current graph.
4. Finally, the graph transformation can be factorized over the actions of the trajectory.
Machine learning based methods
Template-free: G2G [9] 2020 – Method (VGT)
[9] Shi et al.
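The equations on this slide were lost in extraction. One way to write the factorization described in the overview is shown below. The notation is an assumption made here for illustration and may differ from [9]: $a_t$ is the t-th action, $G_0$ the initial synthon graph, $G_{t-1}$ the graph after the first $t-1$ actions, and $z$ the latent variable of the variational model.

```latex
p(a_{1:T} \mid z, G_0)
  = \prod_{t=1}^{T} p(a_t \mid z, G_0, a_{1:t-1})
  = \prod_{t=1}^{T} p(a_t \mid z, G_{t-1})
```

The first equality is the chain rule; the second uses the MDP assumption that the current graph summarizes the initial graph and all earlier actions.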
• Reactants generation via Variational Graph Translation (VGT).
• Overview (cont’d)
5. Each action is a tuple.
6. The action distribution is decomposed into 3 parts:
i. Termination prediction
ii. Node selection
iii. Edge labeling
7. It uses variational inference by introducing an approximate posterior.
[9] Shi et al.
Machine learning based methods
Template-free: G2G [9] 2020 – Method (VGT)
• Top-k result
[Table: top-k accuracy, with panels "Reaction class is given" and "Reaction class is unknown"]
[9] Shi et al.
Machine learning based methods
Template-free: G2G [9] 2020 – Results
• Module performance
• Contribution
• It formulates retrosynthesis prediction as a graph-to-graphs translation task, which is a novel formulation.
• Limitation
• A well-tuned Molecular Transformer performs better.
Machine learning based methods
Template-free: G2G [9] 2020 – Results
[9] Shi et al.
• Template-free: GraphRetro [10] (arXiv 2020)
Machine learning based methods
Template-free: GraphRetro [10] 2020
[10] Somnath et al.
• Template-free: GraphRetro [10] (arXiv 2020)
• Key Idea
• It also uses the idea of breaking and modifying graphs, like G2G [9].
• G2G [9] modifies the graph at the level of atoms, whereas GraphRetro operates at the level of molecular fragments called leaving groups.
• G2G: Sequential generation
• GraphRetro: Leaving group selection
Machine learning based methods
Template-free: GraphRetro [10] 2020 – Key Idea
[10] Somnath et al.
• Top-k result
Machine learning based methods
Template-free: GraphRetro [10] 2020 - Results
[10] Somnath et al.
• Module performance
• Contribution
• Choosing a leaving group is a good idea for retrosynthesis problems
• Limitation
• Domain knowledge is required to create a leaving group vocabulary
Machine learning based methods
Template-free: GraphRetro [10] 2020 - Results
[10] Somnath et al.
Machine learning based
Selection-based: Bayesian Retrosynthesis [11]
[11] Guo et al.
Machine learning based
Selection-based: Bayesian Retrosynthesis [11]
Cont’d
[11] Guo et al.
Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Key Idea
• Key Idea
• It uses a pre-trained forward model as the likelihood in Bayes' theorem and approximates the posterior distribution over reactants.
• It uses Monte Carlo search to explore synthetic routes.
[11] Guo et al.
Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method
• Method
• The likelihood is a Boltzmann distribution with an inverse temperature.
• Energy function: Tanimoto distance between the target product and the product predicted by the forward model (Molecular Transformer)
• Approximate posterior
• Exact computation across all candidates is generally infeasible.
[11] Guo et al.
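The Boltzmann likelihood can be sketched on a tiny candidate pool. This is an illustration under stated assumptions, not the implementation of [11]: the forward model is replaced by a hypothetical dictionary of already-predicted product fingerprints, the prior over candidates is uniform, and the value of the inverse temperature `beta` is arbitrary.

```python
import math
from typing import Dict, Set


def tanimoto_distance(x: Set[int], y: Set[int]) -> float:
    """1 - Tanimoto similarity for fingerprints given as sets of on-bits."""
    union = x | y
    return 1.0 - (len(x & y) / len(union) if union else 1.0)


def posterior_weights(
    target_fp: Set[int], predicted_fps: Dict[str, Set[int]], beta: float = 10.0
) -> Dict[str, float]:
    """Normalized Boltzmann weights exp(-beta * distance) under a uniform prior."""
    unnorm = {
        name: math.exp(-beta * tanimoto_distance(target_fp, fp))
        for name, fp in predicted_fps.items()
    }
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}


target = {1, 2, 3, 4}
# Hypothetical forward-model outputs for three candidate reactant sets.
predicted = {
    "reactants_A": {1, 2, 3, 4},  # forward model reproduces the target
    "reactants_B": {1, 2, 3, 9},
    "reactants_C": {7, 8, 9},
}
weights = posterior_weights(target, predicted)
print(max(weights, key=weights.get))  # reactants_A
```

Normalizing over three candidates is trivial; over a realistic pool of purchasable compounds and their combinations it is not, which is why the posterior has to be approximated by sampling.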
Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method (SMC)
• Method (Cont’d)
• Sampling from the posterior
• Sequential Monte Carlo (SMC)
• Cons
• Particle impoverishment [38]
• Rapid loss of diversity
• Computational cost of using the forward model (Molecular Transformer)
[11] Guo et al. [38] Stavropoulos et al.
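Particle impoverishment can be seen in a generic resampling step. The sketch below is a textbook multinomial resampler with the effective sample size ESS = 1 / Σ w_i² (for normalized weights) as a diversity diagnostic; it is not the sampler of [11], and the particle names are placeholders.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = 1 / sum(w_i^2) for normalized weights; equals N for uniform weights."""
    total = sum(weights)
    normed = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normed)


def resample(particles: Sequence[T], weights: Sequence[float], rng: random.Random) -> List[T]:
    """Multinomial resampling: draw N particles with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)
particles = ["r1", "r2", "r3", "r4"]
print(effective_sample_size([1, 1, 1, 1]))              # 4.0: all particles contribute
print(effective_sample_size([1, 0, 0, 0]))              # 1.0: a single particle dominates
print(set(resample(particles, [1.0, 0.0, 0.0, 0.0], rng)))  # {'r1'}: diversity is lost
```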
Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Method
• Method (Cont’d)
• SMC is accelerated by a surrogate likelihood.
• It trains a gradient boosting regression tree that predicts the likelihood given by the Molecular Transformer.
[11] Guo et al.
Machine learning based
Selection-based: Bayesian Retrosynthesis [11] – Results
• Results
[11] Guo et al.
Challenges
Challenge 1. Balancing between template-free and template-based model
Challenge 2. Multi-Step retrosynthesis
Challenge 3. Extremely large space of synthesis routes
Challenge 4. Molecule decoding (Graph generation)
[3] Coley et al. [14] Segler et al.
Challenges:
1. Balancing between template-free and template-based model
• How about a hybrid model that uses uncertainty?
[Figure: template-based (pros: high interpretability; cons: low generalizability, requires domain knowledge) versus template-free (pros: generalizability; cons: invalid/inaccessible predictions, low interpretability)]
• Most molecules in the real world cannot be synthesized in a single step.
• A synthesis route can take up to 60 steps or even more.
• Error accumulation
• Extremely large search space
• The most recent work [13] uses neural-guided A* search.
[13] Chen et al.
Challenges:
2. Multi-Step retrosynthesis
• Each molecule could be synthesized from hundreds of different possible sets of reactants.
• How should the quality of a synthesis route be measured?
Challenges:
3. Extremely large space of synthesis routes
• Modeling complex distributions over graphs and then sampling from them efficiently is challenging!
• Why is it challenging?
• Non-unique representations
• High-dimensional nature of graphs
• Complex, non-local dependencies between nodes and edges
• Proposed methods
• Graph VAE [29] (ICANN 2018)
• Graph RNN [30] (ICML 2018)
• GRAN [31] (NeurIPS 2019)
• Junction tree VAE [35] (ICML 2018)
[29] Simonovsky et al. [30] You et al. [31] Liao et al. [35] Jin et al.
Challenges:
4. Molecule decoding (Graph generation)
Practice: RDKit
• Data pre-processing (RDKit)
• RDKit[20] is an open-source library for Cheminformatics.
• https://www.rdkit.org
• Why RDKit?
• Visualizing
• Substructure searching
• Calculate molecule similarity
• Validity check
• Various other functions for cheminformatics
• We uploaded an RDKit tutorial notebook:
• https://github.com/wonjun-dev/contrastive-retro
Practice: OpenNMT
• OpenNMT
• OpenNMT [28] is an open-source library for neural machine translation.
• https://opennmt.net
• Why OpenNMT?
• It supports various models for the encoder-decoder framework.
• Built-in functions
• Easy to engineer
• Cons
• Very large codebase
• Limited flexibility
• Disconnected workflow (training, inference, and performance check are separate steps)*7
[28] Klein et al., *7: We made a fully automated script.
Practice: OpenNMT – Where you should change
• OpenNMT
• Primary files in OpenNMT
• Data loader
• preprocess.py
• inputter.py (./onmt/inputters)
• Options
• opts.py (./onmt) => Options for training, translation, preprocessing, etc. You can add your own options here.
• Train
• train.py => Entry point of training
• train_single.py (./onmt) => Second entry point of training
• trainer.py (./onmt) => Main training loop
• loss.py (./onmt/utils) => Classes for loss functions
• Model
• model_builder.py (./onmt)
• model.py (./onmt/models) => Model class
• model_saver.py (./onmt/models)
• Translation
• translate.py => Entry point of translation
• translator.py (./onmt/translate) => Translator class
• Performance check
• parse_output.py (./parse) => Parses the predicted output and calculates accuracy via RDKit.
Practice: OpenNMT – Automated script
• OpenNMT
• We provide a fully automated script (from training to parsing).
• https://github.com/wonjun-dev/contrastive-retro @master branch
• run_experiment_mt.sh
• Train – Inference (Translate) – Performance check (Parse) – Averaging
• arg[0] : GPU id
• arg[1]: seed
• run_average.py
• The performance variation of MT and LV-MT is quite large depending on seed.
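Because results vary noticeably with the seed, it helps to report the mean and the spread over several runs. The sketch below shows one way to do the averaging; it is not the repository's `run_average.py`, and the accuracy values are placeholders invented purely to exercise the function.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize(runs: Dict[int, Dict[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Mean and sample standard deviation of each metric across seeds."""
    metrics = sorted(next(iter(runs.values())))
    summary = {}
    for m in metrics:
        values = [run[m] for run in runs.values()]
        summary[m] = (mean(values), stdev(values))
    return summary


# seed -> metrics; placeholder numbers, not measured results.
placeholder_runs = {
    0: {"top1": 0.40, "top10": 0.70},
    1: {"top1": 0.42, "top10": 0.74},
    2: {"top1": 0.44, "top10": 0.72},
}
for metric, (avg, sd) in summarize(placeholder_runs).items():
    print(f"{metric}: {avg:.3f} +/- {sd:.3f}")
```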
Related works
• Forward synthesis
• Given reactants and reagents, predict the products.
• [7, 34, 36, 37]
• Reaction center prediction
• The task of identifying the reaction center is related to the step of deriving the synthons
(intermediate outcomes) in retrosynthesis.
• [9, 10, 33, 34]
• Graph generation
• Generative models for real-world graphs, including social, chemical, and knowledge graphs
• [29, 30, 31, 35]
Future directions
• Training chemical language models like BERT
• Learning better chemical representation
• Atomic or molecular embedding considering chemical properties
• Robust to SMILES augmentation
• Contrastive learning
• Template-Generative Hybrid model
• Graph encoding – SMILES decoding
• Graph decoding is challenging
• Predictive model for subgraph isomorphism
• Subgraph isomorphism is an NP-complete problem, so exact matching is not scalable.
References
[1] Weininger et al. “A chemical language and information system. 1. introduction to methodology and encoding
rules.” Journal of Chemical Information and Modeling, 1988.
[2] Christ et al. “Mining electronic laboratory notebooks: Analysis, retrosynthesis, and reaction based
enumeration.” Journal of Chemical Information and Modeling, 2012.
[3] Coley et al. “Computer-assisted retrosynthesis based on molecular similarity.” ACS Central Science, 2017.
[4] Klucznik et al. “Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed
in the laboratory.” Chem, 2018.
[5] Dai et al. “Retrosynthesis prediction with conditional graph logic network”. NeurIPS, 2019.
[6] Schwaller et al. “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.” ACS
Central Science, 2019.
[7] Lee et al. “Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space.”
Chemical Communications, 2019.
[8] Chen et al. “Learning to make generalizable and diverse predictions for retrosynthesis.” arXiv preprint 2019.
[9] Shi et al. “A graph to graphs framework for retrosynthesis prediction.”, ICML, 2020
[10] Somnath et al. “Learning graph models for template-free retrosynthesis.”, arXiv, 2020
[11] Guo et al. “A Bayesian algorithm for retrosynthesis.”, arXiv, 2020
[12] Lin et al. “Automatic retrosynthetic route planning using template-free models.”, Chem. Sci., 2020
[13] Chen et al. “Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search”, ICML, 2020
[14] Segler et al., “Neural-Symbolic machine learning for retrosynthesis and reaction prediction.”, Chemistry-A European
Journal, 2017
[15] Satoh et al., “A novel approach to retrosynthetic analysis using knowledge bases derived from reaction databases.”,
Chem. Inf. Comput. Sci., 1999
[16] Law et al., “Route designer: A retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.”, Chem.
Inf., 2009
[17] Gasteiger et al., “A collection of computer methods for synthesis design and reaction prediction.”, Recl. Trav. Chim.
Pays-Bas, 1992
[18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
[19] Corey et al., “The logic of chemical synthesis: Multistep synthesis of complex carbogenic molecules. (Nobel lecture)”,
1991
[20] http://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
[21] Liu et al., “Retrosynthetic reaction prediction using neural sequence-to-sequence models.”, ACS Cent. Sci., 2017
[22] Zheng et al., “Predicting retrosynthetic reactions using self-corrected transformer neural networks.”, J. Chem. Inf.
Model., 2020
[23] Srivastava et al., “Highway networks”, NIPS, 2015
[24] https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201605499&file=chem201605499-sup-0001-misc_information.pdf
[25] http://www.reaxys.com, Reaxys is a registered trademark of RELX Intellectual Properties SA used under license.
[26] Shen et al., “Mixture models for diverse machine translation: Tricks of the trade.”, arXiv, 2019
[27] Schlichtkrull et al., “Modeling relational data with graph convolutional networks.”, In European
Semantic Web Conference, 2018
[28] Klein et al., “OpenNMT: Open-Source Toolkit for Neural Machine Translation.”, arXiv, 2017
[29] Simonovsky et al., “GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.”,
ICANN, 2018
[30] You et al., “GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models.”, ICML, 2018
[31] Liao et al., “Efficient Graph Generation with Graph Recurrent Attention Networks.”, NeurIPS, 2019
[32] Mayfield et al., “Pistachio 2.0 edn software.”, 2018
[33] Coley et al., “A graph-convolutional neural network model for the prediction of chemical reactivity.”,
Chemical Science 2019
[34] Coley et al., “Predicting organic reaction outcomes with Weisfeiler-Lehman Network.”, NeurIPS, 2017
[35] Jin et al., “Junction Tree Variational Autoencoder for molecular graph generation.”, ICML, 2018
[36] Bradshaw et al., “A generative model for electron paths.”, ICLR, 2019
[37] Do et al., “Graph transformation policy network for chemical reaction prediction.”, KDD, 2019
[38] Stavropoulos et al., “Sequential Monte Carlo method in practice.”, Springer, 2001
Appendix
1. Subgraph isomorphism problem
• It is a computational task in which two graphs G and H are given as input, and one must determine whether G contains a subgraph that is isomorphic to H.
• NP-complete
2. Molecular similarity metrics (x and y are molecular fingerprints)
3. Reaction class
• Meta-information about type of chemical reactions.
• In USPTO, there are 10 reaction classes
4. Parameterizing by GNN in [5]
• Graph embedding := Averaging node embedding
5. Better hyper-parameters for MT and the results
• Dropout p=0.25 is better than p=0.1.
• We can remove invalid and repeated SMILES via RDKit.
• Also, using 6 layers with a higher dropout rate is better than using 4 layers.
Model                          Top-1  Top-3  Top-5  Top-10
MT [8]                         0.420  0.570  0.619  0.657
MT (p=0.25, w/o inval/repeat)  0.432  0.645  0.709  0.771
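Removing invalid and repeated SMILES from a ranked list of predictions can be sketched as below. This is an illustration under stated assumptions, not the exact post-processing behind the table: `filter_predictions`, `rdkit_canonicalize` and `toy_canonicalize` are names introduced for the example. The RDKit variant relies on `Chem.MolFromSmiles` returning `None` for SMILES it cannot parse and on `Chem.MolToSmiles` producing a canonical string. The toy variant is only a stand-in so that the sketch runs without RDKit installed; it is not real canonicalization.

```python
def filter_predictions(candidates, canonicalize):
    """Drop invalid and repeated predictions while keeping the ranked order.

    `canonicalize` maps a SMILES string to a canonical form, or None if invalid.
    """
    seen = set()
    kept = []
    for smiles in candidates:
        canonical = canonicalize(smiles)
        if canonical is None or canonical in seen:
            continue  # skip invalid strings and duplicates of earlier candidates
        seen.add(canonical)
        kept.append(canonical)
    return kept


def rdkit_canonicalize(smiles):
    """Canonicalize with RDKit; returns None for SMILES that RDKit cannot parse."""
    from rdkit import Chem  # imported lazily so the sketch loads without RDKit
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def toy_canonicalize(smiles):
    """Stand-in for the demo: rejects unbalanced parentheses, sorts '.' fragments."""
    if not smiles or smiles.count("(") != smiles.count(")"):
        return None
    return ".".join(sorted(smiles.split(".")))


# Hypothetical beam-search output, best candidate first.
beam = ["CCO.CC(=O)O", "CC(=O)O.CCO", "CC(=O", "CCN"]
print(filter_predictions(beam, toy_canonicalize))
```

Each dropped candidate frees a slot for a lower-ranked one, which is one plausible reason the gain in the table grows from Top-1 to Top-10. With RDKit installed, `filter_predictions(beam, rdkit_canonicalize)` applies the same filter using real canonical SMILES.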
Thank you!
Any questions?

Retrosynthesis tutorial v2

  • 1. Introduction to Retrosynthesis Prediction 2020. 06 Wonjun Jeong wonjun.jg@kaist.ac.kr wonjun.email@gmail.com
  • 2. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 3. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 4. Retrosynthesis prediction • What is retrosynthesis prediction? • Retrosynthesis or retrosynthetic pathway planning is the process of tracing back the forward reaction, predicting which reactants are required to synthesize the target product. 4
  • 5. Retrosynthesis prediction • Retrosynthesis is crucial process of discovering new materials and drugs. 5 Desired properties Candidate Product Candidate Reactants Test by chemist Retrosynthesis prediction
  • 6. • Each process of discovering new materials and drug has own error, it should be verified by chemist. • Expensive 6 Desired properties Candidate Product Candidate Reactants Test by chemist Retrosynthesis prediction Retrosynthesis prediction
  • 7. Retrosynthesis prediction • Retrosynthesis prediction has highly depended on the trial-and-error cycles of experienced researchers of chemical expertise. 7
  • 8. Retrosynthesis prediction • If retrosynthesis prediction can be done with high accuracy … • Capable of unlocking future possibilities of a fully automated material/drug discovery pipeline. 8 Desired properties Candidate Product Candidate Reactants Test by robot Retrosynthesis prediction
  • 9. Dataset description • SMILES (Simplified Molecular-Input Line-Entry System) [1] • SMILES is a specification in the form of a line notation for describing the structure of chemical species [2]. • Generation of SMILES. • By printing symbol nodes encountered in a depth-first tree traversal of a chemical graph 9[1] Weininger et al .[2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
  • 10. Dataset description • SMILES in detail • Character of carbon(C) is omitted in the graph. • Hydrogen(H) is omitted in the SMILES. • Ring structures are written by breaking each ring at an arbitrary point to make an acyclic str ucture and adding numerical ring closure labels to show connectivity between non-adjacen t atoms. • Branches are described with parentheses. • A bond is represented using one of the symbols: ., -, =, #, $, :, /, • “.” indicates two parts are not bonded together 10[1] Weininger et al .[2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
  • 11. Dataset description • Benchmark: 1. USPTO (United States Patent and Trademark Office) • USPTO benchmark contains SMIELS representation of single target product (input) and reactants (target) • Variants • USPTO-50k • USTPO-500K • USPTO-MIT 2. Pistachio [32] 3. Reaxys [25] 11[25] reaxys.com [32] Mayfield et al.
  • 12. Overview of general approaches: Template-based • Template-based approaches [2, 3, 4, 5, 14, 15, 16, 17] use the known chemical reaction which is called reaction template. • Reaction template contains sub-graph reaction patterns that describing how the reaction occur between reactants and product. • Pros • High interpretability • Cons • Low generalizability to unseen templates • Require domain knowledge to extract the reaction templates 12
  • 13. Overview of general approaches: Template-free • Template-free approaches [6, 7, 8, 9, 10, 12] learn mapping function product to a set of reactants by extracting features directly from data. • Seq2Seq framework • [6, 7, 8, 12] • Graph2Grpah framework • [9, 10] • Pros • Generalizability • Not require domain knowledge • Cons • Invalid/Inaccessible predictions • Low interpretability 13 f
  • 14. Overview of general approaches: Selection-based • Selection-based approaches [11] select a candidate set of purchasable reactants. • The objective of [11] is to discover retrosynthetic routes from a given desired product to co mmercially available reactants • Pros • Accessibility of the prediction • Not require domain knowledge • Cons • Novelty 14[11] Guo et al. Rank := f(product; ) Purchasable pool
  • 15. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 16. Classical computer-aided methods • Before deep learning, computer-aided retrosynthesis were mainly conducted using reaction template. [2, 3, 4, 15, 16, 17] • They are mainly about how to use known reactions and extract meaningful reaction context. • Characteristics • It needs chemical expertise. • Heuristics • Computationally expensive • Chemical space is vast • Subgraph isomorphism problem*1. • Not scalable • Not generalizable 16*1: Appendix-1
  • 17. Classical computer-aided methods • The first computer-aided retrosynthesis: • [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985 • The author won the Nobel Prize in Chemistry for his contribution of retrosynthetic analysis. • [19] The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Mol ecules (Nobel lecture), 1991 17[18, 19] Corey et al.
  • 18. Classical computer-aided methods: Recent work [3] (Coley et al., 2017)
  • 19. Classical computer-aided methods: Recent work [3] 2017 – Key Idea • It ranks the templates of precedent reactions using product similarity and reactant similarity. • [3] Coley et al.
  • 20. Classical computer-aided methods: Recent work [3] 2017 – Method (Similarity) • How to measure molecular similarity (*2)? • Molecular fingerprints encode the structure of a molecule; the RDKit library can compute them. • The most common measure is Tanimoto similarity, but there is no canonical definition of molecular similarity (see the subgraph isomorphism problem, *1). • The two arguments of the similarity function are molecular fingerprints. • *1: Appendix-1, *2: Appendix-2. Image from [20].
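The Tanimoto similarity mentioned on the slide above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each fingerprint is represented as a set of on-bit indices; the bit indices below are made up and are not produced by RDKit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints.

    Each fingerprint is a set of on-bit indices (an assumed representation).
    """
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints with hypothetical bit indices.
fp_x = {1, 4, 7, 9}
fp_y = {1, 4, 8}
sim_xy = tanimoto(fp_x, fp_y)
print(sim_xy)
```

Two bits are shared and five bits are set in total, so the score is 2/5. Identical fingerprints score 1 and disjoint ones score 0.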
  • 21. Classical computer-aided methods: Recent work [3] 2017 – Method (Using similarity) • Example of using similarity in [3] • Total similarity := product similarity × reactant (precursor) similarity • Candidates are ranked by total similarity. • [3] Coley et al.
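The ranking rule on the slide above (total similarity as the product of the two similarities) can be sketched as follows. This is a simplified illustration, not the implementation of [3]: template extraction and application are skipped, and each precedent is assumed to already carry the fingerprint of the precursors that its template would propose for the target. The field names and fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 1.0


def rank_candidates(target_fp, precedents):
    """Score each precedent by product similarity times reactant similarity."""
    scored = []
    for p in precedents:
        product_sim = tanimoto(target_fp, p["product_fp"])
        # candidate_fp: precursors proposed for the target (assumed given).
        reactant_sim = tanimoto(p["candidate_fp"], p["reactants_fp"])
        scored.append((product_sim * reactant_sim, p["name"]))
    return sorted(scored, reverse=True)


target_fp = {1, 2, 3, 4}
precedents = [
    {"name": "A", "product_fp": {1, 2, 3, 5},
     "candidate_fp": {10, 11, 12}, "reactants_fp": {10, 11, 13}},
    {"name": "B", "product_fp": {1, 2, 3, 4, 6},
     "candidate_fp": {20, 21}, "reactants_fp": {20, 22, 23}},
]
ranking = rank_candidates(target_fp, precedents)
print(ranking)
```

In this toy case precedent B has the more similar product (0.8 versus 0.6), but A ranks first because its proposed precursors resemble the precedent reactants more closely (0.5 versus 0.25).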
  • 22. Classical computer-aided methods: Recent work [3] 2017 – Results • Result of [3] • [3] performs better than seq2seq. Note, however, that the seq2seq model in the table is template-free, whereas [3] is template-based. • Contribution • It mimics retrosynthetic strategy using molecular similarity, without the need to encode any chemical knowledge. • Limitation • It inherently disfavors creative retrosynthetic strategies because it relies on precedent reactions. • *3: Appendix-3
  • 24. Machine learning based methods • Data-driven methods using machine learning and deep learning have been actively studied since the mid-2010s. • They reduce the need for expertise. • They are more scalable and generalizable. • Representative proposed methods • Template-based • NeuralSym [14], Graph Logic Network (GLN) [5] • Template-free • Seq2Seq [21], Molecular Transformer (MT) [6, 7], Latent variable Transformer (LV-MT) [8], Self-Corrected Transformer (SCROP) [22], Graph2Graph (G2G) [9], GraphRetro [10] • Selection-based • Bayesian-Retro [11]
  • 25. Machine learning based methods (Template-based): NeuralSym [14] (Segler et al., 2017)
  • 26. Machine learning based methods (Template-based): NeuralSym [14] 2017 – Key Idea • Given a target product, a neural network predicts the most suitable rule from the set of reaction templates. • [14] Segler et al.
  • 27. Machine learning based methods (Template-based): NeuralSym [14] 2017 – Method • It uses simple models such as an MLP and a Highway network [23]. • It frames rule selection as multiclass classification. • The molecular descriptor [24] is defined as a sum of molecular fingerprints. • [14] Segler et al., [23] Srivastava et al., [24] supplementary PDF
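Framing rule selection as multiclass classification means the network outputs one score per rule, and the most probable rules are tried first. The sketch below assumes the per-rule logits are already available; the rule names and values are hypothetical, and the MLP or Highway network that would produce them is omitted.

```python
import math


def top_k_rules(logits, k=3):
    """Apply a softmax over rule logits and return the k most probable rules."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {rule: math.exp(v - m) for rule, v in logits.items()}
    total = sum(exps.values())
    probs = {rule: e / total for rule, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical logits for three reaction rules.
rule_logits = {"rule_a": 2.0, "rule_b": 0.5, "rule_c": 1.0}
top2 = top_k_rules(rule_logits, k=2)
print(top2)
```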
  • 28. Machine learning based methods (Template-based): NeuralSym [14] 2017 – Results • Experiments • Dataset: Reaxys database [25] • Number of classes (rules): 8720 • Contribution • It shows that neural networks can learn which molecular contexts particular rules can be applied to. • Limitation • Performance depends on the cardinality of the rule set: the larger the rule set, the lower the performance. • [14] Segler et al.
  • 29. Machine learning based methods (Template-based): Graph Logic Network (GLN) [5] (Dai et al., NeurIPS 2019)
  • 30. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Key Idea • It models the joint distribution of reaction templates and reactants using logic variables. • It learns when rules from reaction templates should be applied. • [5] Dai et al.
  • 31. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Background • Retrosynthesis template • Applying a retrosynthesis template can be decomposed into two logical steps: • Match template • Match reactants • [5] Dai et al.
  • 32. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Method • Match template • Match reactants • Uncertainty • Template score function • Reactants score function • [5] Dai et al.
  • 33. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Method • Final joint probability • Parameterized by a GNN (Graph Neural Network) (*4) • [5] Dai et al. • *4: Appendix-4
  • 34. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Method • MLE with efficient inference • Gradient approximation • [5] Dai et al.
  • 35. Machine learning based methods (Template-based): Graph Logic Network [5] 2019 – Results • Top-k results • Contribution • Interpretability: it integrates probabilistic models with templates (chemical rules). • Limitation • It shares the limitations of template-based methods. • Scalability • [5] Dai et al.
  • 36. Machine learning based methods (Template-free): Seq2Seq [21] (Liu et al., 2017)
  • 37. Machine learning based methods (Template-free): Seq2Seq [21] 2017 – Method • It tokenizes SMILES and treats retrosynthesis as machine translation. • It uses a bidirectional LSTM encoder and an LSTM decoder. • It uses beam search to produce a set of candidate reactants. • [21] Liu et al.
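Treating retrosynthesis as machine translation requires splitting each SMILES string into tokens first. The sketch below uses a simplified regular expression written for this illustration; it is an assumption, not the exact tokenizer of [21], and it covers bracket atoms, two-letter halogens, ring-closure digits, and common bond and branch symbols.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)\./\\:~@\*\$])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("untokenizable characters in: " + smiles)
    return tokens


acyl_chloride = tokenize("CC(=O)Cl")
print(" ".join(acyl_chloride))  # space-separated form, as fed to a seq2seq model
```

Matching `Cl` and `Br` before single letters keeps chlorine from being read as a carbon followed by a stray `l`.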
  • 38. Machine learning based methods (Template-free): Seq2Seq [21] 2017 – Results • Results • It performs comparably to the rule-based expert system baseline. • Contribution • It shows that a fully data-driven seq2seq model can learn retrosynthetic pathways. • Limitations • It produces grammatically invalid SMILES and chemically implausible predictions. • It is a naïve application of a seq2seq model. • Predictions generated by a vanilla seq2seq model with beam search typically exhibit low diversity, with only minor differences in the suffix [8]. • [21] Liu et al., [8] Chen et al.
  • 39. Machine learning based methods (Template-free): Seq2Seq [21] 2017 – Results • Grammatically invalid SMILES • Grammatically valid but chemically implausible predictions • [21] Liu et al.
  • 40. Machine learning based methods (Template-free): Molecular Transformer [6, 7] (Schwaller et al.; Lee et al., 2019)
  • 41. Machine learning based methods (Template-free): Molecular Transformer [6, 7] 2019 – Key Idea • Like [21], it tokenizes SMILES and treats retrosynthesis as machine translation. • It uses a Transformer instead of an LSTM. • It performs better than seq2seq [21] but has the same limitations. • [6] Schwaller et al., [7] Lee et al., [21] Liu et al.
  • 42. Machine learning based methods (Template-free): Latent variable Transformer (LV-MT) [8] (Chen et al., arXiv 2019)
  • 43. Machine learning based methods (Template-free): LV-MT [8] 2019 – Key Idea • It extends the Molecular Transformer (MT) to generalize better to rare reactions and to produce diverse pathways. • Key Idea • It proposes novel pretraining methods: • Random bond cut • Template-based bond cut • It trains a mixture model with the online hard-EM algorithm. • [8] Chen et al.
  • 44. Machine learning based methods (Template-free): LV-MT [8] 2019 – Method (Pretrain) • Pretraining methods • Random bond cut • For each input target product, it generates new examples by selecting a random bond to break. • Template-based bond cut • Instead of breaking bonds at random, it uses the templates to choose which bonds to break. • The model is pretrained on these auxiliary examples and then used as the initialization for fine-tuning on the actual retrosynthesis data. • [8] Chen et al.
  • 45. Machine learning based methods (Template-free): LV-MT [8] 2019 – Method (Latent Var.) • Why are latent variables introduced? • To tackle the problem of generating diverse predictions. • The outputs of beam search tend to be similar to each other. • Given a target SMILES string x and a reactants SMILES string y, a mixture model introduces a multinomial latent variable z ∈ {1, …, K} to capture different reaction types, and decomposes the marginal likelihood as: $p(y \mid x; \theta) = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid z, x; \theta)$ • [8] Chen et al.
  • 46. Machine learning based methods (Template-free): LV-MT [8] 2019 – Method (Latent Var.) • Hard-EM algorithm 1. Take a mini-batch of training examples. 2. Enumerate all K values of z and compute the loss for each. • Dropout should be turned off [26]. 3. For each example, select the value of z that yields the minimum loss. • For p(y | z, x; θ), the encoder-decoder network is shared among mixture components, and the embedding of z is fed as an input to the decoder so that y is conditioned on it. 4. Back-propagate through the selected component only, so that one component receives gradients per example. • Dropout should be turned back on [26]. • [8] Chen et al., [26] Shen et al.
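The four hard-EM steps above can be illustrated with a toy model. In this sketch each mixture component is a single scalar and the loss is a squared error, standing in for the shared Transformer and its sequence loss; dropout handling is omitted. All names and numbers are hypothetical.

```python
def hard_em_step(components, batch, lr=0.5):
    """One online hard-EM update on a mini-batch (toy scalar components)."""
    # Steps 2 and 3: enumerate all K components per example and keep the
    # index z with the minimum loss (hard assignment).
    assignments = []
    for y in batch:
        losses = [(y - mu) ** 2 for mu in components]
        assignments.append(losses.index(min(losses)))

    # Step 4: gradient of the mean loss; each example contributes a gradient
    # only to its selected component.
    grads = [0.0] * len(components)
    for y, z in zip(batch, assignments):
        grads[z] += 2.0 * (components[z] - y) / len(batch)
    for k, g in enumerate(grads):
        components[k] -= lr * g
    return assignments


components = [0.0, 10.0]   # K = 2 toy components
batch = [1.0, 9.0, 2.0]    # step 1: a mini-batch of toy targets
assignments = hard_em_step(components, batch)
print(assignments, components)
```

Each component moves only toward the examples assigned to it, which is what lets different components specialize.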
  • 47. Machine learning based methods (Template-free): LV-MT [8] 2019 – Results • Results (*5) • *5: Better hyper-parameters and the corresponding results are reported in Appendix-5.
  • 48. Machine learning based methods (Template-free): LV-MT [8] 2019 – Results • Contributions • It proposes novel pretraining methods for retrosynthesis. • It uses a mixture-model Transformer for diverse predictions. • Limitations • The more latent classes are used, the worse the top-1 performance. • The latent variable does not appear to carry information about the reaction class. • [8] Chen et al.
  • 49. Machine learning based methods (Template-free): Self-Corrected Transformer (SCROP) [22] (Zheng et al., 2020)
  • 50. Machine learning based methods (Template-free): SCROP [22] 2020 – Key Idea • It uses a Transformer to correct invalid predicted SMILES. • It builds syntax-correction data from a trained Transformer by collecting pairs of invalid predictions and ground truths. • It trains a second Transformer as a syntax corrector on this data. • At test time, it keeps the top-1 candidate produced by the syntax corrector and uses it to replace the original invalid prediction. • [22] Zheng et al.
  • 51. Machine learning based methods (Template-free): SCROP [22] 2020 – Results • Results • Compared to the Transformer without self-correction (SCROP-noSC), performance improves by 0.4 to 1.7%. • [22] Zheng et al.
  • 52. Machine learning based methods (Template-free): SCROP [22] 2020 – Results • Invalid SMILES rates • Limitations • Why is SCROP needed? Invalid SMILES can be removed with RDKit, without a learned model. • [22] Zheng et al.
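Filtering a beam without a learned corrector can be sketched as below. The validity check is passed in as a function so the example runs without RDKit; with RDKit available, a check such as `Chem.MolFromSmiles(s) is not None` could play that role, and `Chem.MolToSmiles` could serve as the canonicalizer. The parenthesis check used here is only a toy stand-in and is not a real SMILES parser.

```python
def clean_predictions(predictions, is_valid, canonicalize=lambda s: s):
    """Drop invalid and repeated candidates while preserving rank order."""
    seen = set()
    kept = []
    for smi in predictions:
        if not is_valid(smi):
            continue
        key = canonicalize(smi)
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


def balanced_parentheses(smi):
    """Toy validity check (hypothetical stand-in for a chemistry toolkit)."""
    depth = 0
    for ch in smi:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


beam = ["CC(=O)O", "CC(=O", "CC(=O)O", "CCO"]
cleaned = clean_predictions(beam, balanced_parentheses)
print(cleaned)
```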
  • 53. Machine learning based methods (Template-free): Graph2Graph (G2G) [9] (Shi et al., ICML 2020)
  • 54. Machine learning based methods (Template-free): G2G [9] 2020 – Key Idea • It decomposes retrosynthesis into a two-step procedure: • Breaking the target product • Transforming the broken target product • It trains a Reaction Center Identification (RCI) module that produces synthon(s) by breaking bonds in the product graph. • It trains a Variational Graph Translation (VGT) module that produces reactants through a series of graph transformations. • [9] Shi et al.
  • 55. Machine learning based methods (Template-free): G2G [9] 2020 – Method (RCI) • Reaction Center Identification (RCI) • It uses an R-GCN [27] to learn graph representations. • Overview 1. Given a chemical reaction, derive a binary label matrix. 2. Compute node embeddings and a graph embedding. 3. To estimate the reactivity score of atom pair (i, j), form the edge embedding by concatenating several features. 4. Compute the final reactivity score of atom pair (i, j) from the edge embedding. 5. Optimize the RCI module by minimizing the cross entropy with respect to the binary labels. • [9] Shi et al., [27] Schlichtkrull et al.
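One common way to write the score and objective described in steps 4 and 5 is given below. It assumes the reactivity score is a sigmoid of a learned function of the edge embedding and that the objective is the standard binary cross entropy. The symbols $s_{ij}$, $e_{ij}$, $y_{ij}$, and $m$ are notation chosen for this sketch rather than taken from [9].

```latex
s_{ij} = \sigma\big(m(e_{ij})\big), \qquad
\mathcal{L}_{\mathrm{RCI}} = -\sum_{i \neq j}
  \Big[ y_{ij} \log s_{ij} + (1 - y_{ij}) \log\big(1 - s_{ij}\big) \Big]
```

Here $y_{ij} \in \{0, 1\}$ is the entry of the binary label matrix for atom pair $(i, j)$, and training minimizes $\mathcal{L}_{\mathrm{RCI}}$.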
  • 56. Machine learning based methods (Template-free): G2G [9] 2020 – Method (VGT) • Reactants generation via Variational Graph Translation (VGT) 1. It receives synthons from the RCI module and transforms them into reactants. 2. It generates a sequence of graph transformation actions and applies them to the initial synthon graph. • It formulates graph generation as a Markov Decision Process (MDP). • [9] Shi et al.
  • 57. Machine learning based methods (Template-free): G2G [9] 2020 – Method (VGT) • Reactants generation via Variational Graph Translation (VGT) • Overview 1. Let the transformation trajectory be the sequence of actions; the graph transformation is deterministic once the trajectory is defined. 2. Let each intermediate graph denote the result of applying the actions so far to the initial synthon graph. 3. Under the MDP assumption, each action depends on the history only through the current intermediate graph. 4. Finally, the graph transformation can be factorized over the actions in the trajectory. • [9] Shi et al.
  • 58. Machine learning based methods (Template-free): G2G [9] 2020 – Method (VGT) • Reactants generation via Variational Graph Translation (VGT) • Overview (cont’d) 5. Each action is represented as a tuple. 6. The action distribution is decomposed into three parts: i. Termination prediction ii. Node selection iii. Edge labeling 7. It uses variational inference by introducing an approximate posterior. • [9] Shi et al.
  • 59. Machine learning based methods (Template-free): G2G [9] 2020 – Results • Top-k results, reported for two settings: reaction class given and reaction class unknown • [9] Shi et al.
  • 60. Machine learning based methods (Template-free): G2G [9] 2020 – Results • Module performance • Contribution • It formulates retrosynthesis prediction as a graph-to-graphs translation task, which is a novel formulation. • Limitation • A well-tuned Molecular Transformer performs better. • [9] Shi et al.
  • 61. Machine learning based methods (Template-free): GraphRetro [10] (Somnath et al., arXiv 2020)
  • 62. Machine learning based methods (Template-free): GraphRetro [10] 2020 – Key Idea • Like G2G [9], it breaks and then modifies graphs. • G2G [9] modifies the graph at the level of atoms, whereas GraphRetro operates at the level of molecular fragments called leaving groups. • G2G: sequential generation • GraphRetro: leaving group selection • [10] Somnath et al.
  • 63. Machine learning based methods (Template-free): GraphRetro [10] 2020 – Results • Top-k results • [10] Somnath et al.
  • 64. Machine learning based methods (Template-free): GraphRetro [10] 2020 – Results • Module performance • Contribution • It shows that selecting leaving groups is an effective formulation for retrosynthesis. • Limitation • Domain knowledge is required to build the leaving group vocabulary. • [10] Somnath et al.
  • 65. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] (Guo et al.)
  • 66. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] (cont’d)
  • 67. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] – Key Idea • It uses a pre-trained forward model as the likelihood in Bayes’ theorem and approximates the posterior distribution over reactants. • It uses Monte Carlo search to explore synthetic routes. • [11] Guo et al.
  • 68. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] – Method • The likelihood is a Boltzmann distribution with an inverse temperature parameter. • Energy function: Tanimoto distance between the target product and the product predicted by the forward model (Molecular Transformer) • Approximate posterior • Exact computation across all candidates is generally infeasible. • [11] Guo et al.
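The Boltzmann likelihood described above can be sketched on a pool small enough to normalize exactly. This is a toy illustration, assuming a uniform prior over candidates and assuming the forward model's predicted product is already available as a fingerprint (a set of on-bit indices). The candidate names, fingerprints, and the value of the inverse temperature `beta` are hypothetical.

```python
import math


def tanimoto_distance(a, b):
    """1 minus the Tanimoto similarity of two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return 1.0 - (shared / union if union else 1.0)


def posterior_over_pool(target_fp, predicted_fps, beta=5.0):
    """Normalized weights exp(-beta * distance), assuming a uniform prior."""
    weights = {
        name: math.exp(-beta * tanimoto_distance(target_fp, fp))
        for name, fp in predicted_fps.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


target_fp = {1, 2, 3}
# Fingerprints of the products that a forward model would predict for each
# candidate reactant set (hypothetical values).
predicted_fps = {"r1": {1, 2, 3}, "r2": {1, 2, 4}}
posterior = posterior_over_pool(target_fp, predicted_fps)
print(posterior)
```

A larger `beta` concentrates the mass on candidates whose predicted product matches the target. On a realistic pool the sum over all candidates is too expensive to compute exactly, so the posterior has to be approximated.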
  • 69. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] – Method (SMC) • Method (cont’d) • Sampling from the posterior • Sequential Monte Carlo (SMC) • Cons • Particle impoverishment [38]: rapid loss of diversity • Computational cost of running the forward model (Molecular Transformer) • [11] Guo et al., [38] Stavropoulos et al.
  • 70. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] – Method • Method (cont’d) • SMC is accelerated with a surrogate likelihood. • It trains a gradient boosting regression tree that predicts the likelihood given by the Molecular Transformer. • [11] Guo et al.
  • 71. Machine learning based methods (Selection-based): Bayesian Retrosynthesis [11] – Results • Results • [11] Guo et al.
  • 73. Challenges • Challenge 1. Balancing between template-free and template-based models • Challenge 2. Multi-step retrosynthesis • Challenge 3. Extremely large space of synthesis routes • Challenge 4. Molecule decoding (graph generation) • [3] Coley et al., [14] Segler et al.
  • 74. Challenges: 1. Balancing between template-free and template-based models • How about a hybrid model that uses uncertainty? • Template-based • Pros • High interpretability • Cons • Low generalizability • Requires domain knowledge • Template-free • Pros • Generalizability • Cons • Invalid or inaccessible predictions • Low interpretability
  • 75. Challenges: 2. Multi-step retrosynthesis • Most real-world molecules cannot be synthesized in a single step. • A route can take up to 60 steps or even more. • Error accumulation • Extremely large search space • The most recent work [13] uses neural-guided A* search. • [13] Chen et al.
  • 76. Challenges: 3. Extremely large space of synthesis routes • Each molecule could be synthesized from hundreds of different possible sets of reactants. • How should the quality of a synthesis route be measured?
  • 77. Challenges: 4. Molecule decoding (graph generation) • Modeling complex distributions over graphs and then sampling from them efficiently is challenging. • Why is it challenging? • Graph representations are non-unique • Graphs are high dimensional • Complex, non-local dependencies between nodes and edges • Proposed methods • Graph VAE [29] (ICANN 2018) • Graph RNN [30] (ICML 2018) • GRAN [31] (NeurIPS 2019) • Junction tree VAE [35] (ICML 2018) • [29] Simonovsky et al., [30] You et al., [31] Liao et al., [35] Jin et al.
  • 79. Practice: RDKit • Data pre-processing (RDKit) • RDKit [20] is an open-source library for cheminformatics. • https://www.rdkit.org • Why RDKit? • Visualization • Substructure searching • Molecular similarity calculation • Validity checking • Various other cheminformatics functions • An RDKit tutorial notebook is available at: • https://github.com/wonjun-dev/contrastive-retro
  • 80. Practice: OpenNMT • OpenNMT • OpenNMT [28] is an open-source library for neural machine translation. • https://opennmt.net • Why OpenNMT? • It supports various models for the encoder-decoder framework. • Built-in functions • Easy to engineer • Cons • Large codebase • Limited flexibility • Disconnected workflow: training, inference, and performance checking are separate steps (*7) • [28] Klein et al. • *7: We provide a fully automated script.
  • 81. Practice: OpenNMT – Where you should change • OpenNMT • Primary files in OpenNMT • Data loader • preprocess.py • inputter.py (./onmt/inputters) • Options • opts.py (./onmt) => Options for training, translation, preprocessing, etc. You can add your own options here. • Train • train.py => Entry point of training • train_single.py (./onmt) => Second entry point of training • trainer.py (./onmt) => Main training loop • loss.py (./onmt/utils) => Classes for loss functions • Model • model_builder (./onmt) • model.py (./onmt/models) => Model class • model_saver (./onmt/models) • Translation • translate.py => Entry point of translation • translator.py (./onmt/translate) => Translator class • Performance check • parse_output.py (./parse) => Parses the predicted output and calculates accuracy via RDKit.
  • 82. Practice: OpenNMT – Automated script • OpenNMT • We provide a fully automated script (from training to parsing). • https://github.com/wonjun-dev/contrastive-retro (master branch) • run_experiment_mt.sh • Train, Inference (Translate), Performance check (Parse), Averaging • arg[0]: GPU id • arg[1]: seed • run_average.py • The performance of MT and LV-MT varies considerably with the seed.
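Because results vary with the seed, it helps to report the mean and standard deviation over several runs. The sketch below shows one way to aggregate per-seed metrics; it is not the repository's `run_average.py`, and the function name, metric names, and numbers are placeholders rather than measured results.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return {metric: (mean, sample standard deviation)} across seeds.

    `runs` maps a seed to a dict of metric name -> value.
    """
    metrics = sorted(next(iter(runs.values())))
    summary = {}
    for metric in metrics:
        values = [run[metric] for run in runs.values()]
        summary[metric] = (mean(values), stdev(values))
    return summary


# Placeholder numbers for three seeds (not measured results).
runs = {
    0: {"top1": 0.40, "top5": 0.60},
    1: {"top1": 0.42, "top5": 0.62},
    2: {"top1": 0.44, "top5": 0.64},
}
summary = summarize(runs)
print(summary)
```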
  • 84. Related works • Forward synthesis • Given reactants and reagents, predict the products. • [7, 34, 36, 37] • Reaction center prediction • Identifying the reaction center is related to the step of deriving synthons (intermediate outcomes) in retrosynthesis. • [9, 10, 33, 34] • Graph generation • Generative models for real-world graphs, including social, chemical, and knowledge graphs • [29, 30, 31, 35]
  • 86. Future directions • Training chemical language models like BERT • Learning better chemical representations • Atomic or molecular embeddings that consider chemical properties • Robustness to SMILES augmentation • Contrastive learning • Template-generative hybrid models • Graph encoding with SMILES decoding • Graph decoding is challenging • Predictive models for subgraph isomorphism • Subgraph isomorphism is an NP-complete problem, so exact matching is not scalable.
  • 87. References
    [1] Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules.”, Journal of Chemical Information and Computer Sciences, 1988
    [2] Christ et al., “Mining electronic laboratory notebooks: Analysis, retrosynthesis, and reaction based enumeration.”, Journal of Chemical Information and Modeling, 2012
    [3] Coley et al., “Computer-assisted retrosynthesis based on molecular similarity.”, ACS Central Science, 2017
    [4] Klucznik et al., “Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory.”, Chem, 2018
    [5] Dai et al., “Retrosynthesis prediction with conditional graph logic network.”, NeurIPS, 2019
    [6] Schwaller et al., “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.”, ACS Central Science, 2019
    [7] Lee et al., “Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space.”, Chemical Communications, 2019
    [8] Chen et al., “Learning to make generalizable and diverse predictions for retrosynthesis.”, arXiv, 2019
    [9] Shi et al., “A graph to graphs framework for retrosynthesis prediction.”, ICML, 2020
    [10] Somnath et al., “Learning graph models for template-free retrosynthesis.”, arXiv, 2020
    [11] Guo et al., “A Bayesian algorithm for retrosynthesis.”, arXiv, 2020
    [12] Lin et al., “Automatic retrosynthetic route planning using template-free models.”, Chem. Sci., 2020
    [13] Chen et al., “Retro*: Learning retrosynthetic planning with neural guided A* search.”, ICML, 2020
  • 88. References
    [14] Segler et al., “Neural-symbolic machine learning for retrosynthesis and reaction prediction.”, Chemistry - A European Journal, 2017
    [15] Satoh et al., “A novel approach to retrosynthetic analysis using knowledge bases derived from reaction databases.”, J. Chem. Inf. Comput. Sci., 1999
    [16] Law et al., “Route designer: A retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.”, J. Chem. Inf. Model., 2009
    [17] Gasteiger et al., “A collection of computer methods for synthesis design and reaction prediction.”, Recl. Trav. Chim. Pays-Bas, 1992
    [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
    [19] Corey et al., “The logic of chemical synthesis: Multistep synthesis of complex carbogenic molecules (Nobel lecture).”, 1991
    [20] http://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
    [21] Liu et al., “Retrosynthetic reaction prediction using neural sequence-to-sequence models.”, ACS Cent. Sci., 2017
    [22] Zheng et al., “Predicting retrosynthetic reactions using self-corrected transformer neural networks.”, J. Chem. Inf. Model., 2020
    [23] Srivastava et al., “Highway networks.”, NIPS, 2015
    [24] https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201605499&file=chem201605499-sup-0001-misc_information.pdf
    [25] http://www.reaxys.com, Reaxys is a registered trademark of RELX Intellectual Properties SA used under license.
    [26] Shen et al., “Mixture models for diverse machine translation: Tricks of the trade.”, arXiv, 2019
  • 89. References
    [27] Schlichtkrull et al., “Modeling relational data with graph convolutional networks.”, European Semantic Web Conference, 2018
    [28] Klein et al., “OpenNMT: Open-source toolkit for neural machine translation.”, arXiv, 2017
    [29] Simonovsky et al., “GraphVAE: Towards generation of small graphs using variational autoencoders.”, ICANN, 2018
    [30] You et al., “GraphRNN: Generating realistic graphs with deep auto-regressive models.”, ICML, 2018
    [31] Liao et al., “Efficient graph generation with graph recurrent attention networks.”, NeurIPS, 2019
    [32] Mayfield et al., “Pistachio 2.0 edn software.”, 2018
    [33] Coley et al., “A graph-convolutional neural network model for the prediction of chemical reactivity.”, Chemical Science, 2019
    [34] Jin et al., “Predicting organic reaction outcomes with Weisfeiler-Lehman network.”, NeurIPS, 2017
    [35] Jin et al., “Junction tree variational autoencoder for molecular graph generation.”, ICML, 2018
    [36] Bradshaw et al., “A generative model for electron paths.”, ICLR, 2019
    [37] Do et al., “Graph transformation policy network for chemical reaction prediction.”, KDD, 2019
    [38] Stavropoulos et al., “Sequential Monte Carlo methods in practice.”, Springer, 2001
  • 90. Appendix 1. Subgraph isomorphism problem • A computational task in which two graphs G and H are given as input, and one must determine whether G contains a subgraph that is isomorphic to H. • NP-complete 2. Molecular similarity metrics (x and y are molecular fingerprints)
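Two widely used fingerprint similarity metrics are written out below, as a sketch of the kind of formulas this appendix item refers to. The set notation treats each fingerprint as the set of its on bits, which is an assumption about representation; the names $T$ and $D$ are chosen here.

```latex
T(x, y) = \frac{|x \cap y|}{|x| + |y| - |x \cap y|}
\qquad \text{(Tanimoto)}
\\[6pt]
D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}
\qquad \text{(Dice)}
```

Both range from 0 (no shared bits) to 1 (identical fingerprints), and Dice is never smaller than Tanimoto for the same pair.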
  • 91. Appendix 3. Reaction class • Meta-information about the type of a chemical reaction. • In USPTO, there are 10 reaction classes.
  • 92. Appendix 4. Parameterizing by GNN in [5] • Graph embedding := average of the node embeddings
  • 93. Appendix 5. Better hyper-parameters of MT and the results • Dropout p=0.25 is better than p=0.1. • Invalid and repeated SMILES can be removed via RDKit. • Also, using 6 layers and increasing the dropout rate is better than using 4 layers.
    Model                            Top 1   Top 3   Top 5   Top 10
    MT [8]                           0.420   0.570   0.619   0.657
    MT (p=0.25, w/o inval/repeat)    0.432   0.645   0.709   0.771
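Top-k accuracy of the kind reported in the table can be computed as sketched below. The function assumes predictions and ground truths have already been canonicalized so that exact string comparison is meaningful; the function name and the toy SMILES are illustrative only.

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of examples whose ground truth appears in the top-k candidates.

    `predictions[i]` is a ranked list of candidates for example i. Strings are
    assumed to be canonicalized beforehand.
    """
    hits = sum(1 for preds, truth in zip(predictions, targets) if truth in preds[:k])
    return hits / len(targets)


# Toy ranked predictions for three examples (illustrative strings only).
predictions = [
    ["CCO", "CCN", "CCC"],
    ["CCN", "CCO", "CCC"],
    ["CCC", "CCN", "CCO"],
]
targets = ["CCO", "CCO", "CCO"]
top1 = top_k_accuracy(predictions, targets, k=1)
top3 = top_k_accuracy(predictions, targets, k=3)
print(top1, top3)
```

Removing invalid and repeated candidates before truncating to k frees slots for further distinct candidates, which can only help the larger k values.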
  • 94. Thank you! Any questions?