Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
AI System Dept.
System & Design Management Unit
Kazuki Fujikawa
Figure 2: Overview of our approach. (1) we train a model to identify pairwise atom interactions
Figure 1: An example reaction where the reaction center is (27,28), (7,27), and (8,27), highlighted in
green. Here bond (27,28) is deleted and (7,27) and (8,27) are connected by aromatic bonds to form a
new ring. The corresponding reaction template consists of not only the reaction center, but nearby
functional groups that explicitly specify the context.
template involves graph matching and this makes examining large numbers of templates prohibitively
https://www.cc.gatech.edu/~lsong/teaching/8803ML/lecture22.pdf
http://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf
Weisfeiler-Lehman algorithm
Use multiset labels to capture neighborhood structure in a graph.
If two graphs are the same, then the multiset labels should be the same as well.

Weisfeiler-Lehman algorithm II
Relabel the graph and construct new multiset labels; check whether the new labels are the same.
If the new multiset labels are not the same, stop the algorithm and declare the two graphs are not the same.
Effectively it is unrolling the graph into trees rooted at each node and using multiset labels to identify these trees.

Comparing two graphs G and G' via their label histograms:
φ(G)  = [6, 3, 2, 1, 0, 1, 2, 2, 0, 1]
φ(G') = [6, 3, 2, 1, 2, 1, 0, 0, 2, 1]
K(G, G') = ⟨φ(G), φ(G')⟩ = 52
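To make the relabeling and kernel computation above concrete, here is a minimal Python sketch; the hash-based label compression, the adjacency-dict graph encoding, and the triangle/path toy example are illustrative choices, not part of the original algorithm description.

```python
# Minimal sketch of Weisfeiler-Lehman relabeling and the WL subtree kernel (illustrative only).
from collections import Counter

def wl_relabel(adj, labels):
    """One WL iteration: each node's new label = (old label, sorted multiset of neighbor labels)."""
    new_labels = {}
    for v in adj:
        multiset = tuple(sorted(labels[u] for u in adj[v]))
        new_labels[v] = hash((labels[v], multiset))  # compress the pair into a single label
    return new_labels

def wl_feature(adj, labels, iterations=2):
    """Histogram of all labels observed over the iterations: the phi(G) vector as a Counter."""
    phi = Counter(labels.values())
    for _ in range(iterations):
        labels = wl_relabel(adj, labels)
        phi.update(labels.values())
    return phi

def wl_kernel(phi_g, phi_h):
    """K(G, G') = <phi(G), phi(G')>, computed over the shared label set."""
    return sum(phi_g[k] * phi_h[k] for k in phi_g.keys() & phi_h.keys())

# Toy usage: a triangle vs. a path, both with identical initial labels.
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {0: 'C', 1: 'C', 2: 'C'}
print(wl_kernel(wl_feature(tri, init), wl_feature(path, init)))
```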
Extended-connectivity / hashed molecular fingerprints:
https://chembioinfo.com/2011/10/30/revisiting-molecular-hashed-fingerprints/
http://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf
Neural Message Passing for Quantum Chemistry [Gilmer+, ICML 2017]
Predicting quantum-mechanical targets (E, ω0, ...) of a molecule directly from its graph:
DFT: ~10^3 seconds per molecule → Message Passing Neural Net: ~10^-2 seconds
MPNN framework [Gilmer+, ICML 2017]
Message passing phase (T steps): the hidden state h_v^t of each node is updated from messages sent by its neighbors,
    m_v^{t+1} = Σ_{w ∈ N(v)} M_t(h_v^t, h_w^t, e_vw)    (1)
    h_v^{t+1} = U_t(h_v^t, m_v^{t+1})    (2)
Readout phase: a feature vector for the whole graph is computed with a readout function R,
    ŷ = R({h_v^T | v ∈ G})    (3)
The message functions M_t, vertex update functions U_t, and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. One could also learn edge features by introducing hidden states h_{e_vw}^t for all edges and updating them analogously to equations 1 and 2.

Convolutional Networks for Learning Molecular Fingerprints [Duvenaud+, NIPS 2015] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = CONCAT(h_w^t, e_vw)
• Update function: U_t(h_v^t, m_v^{t+1}) = σ(H_t^{deg(v)} m_v^{t+1}), where H_t^{deg(v)} is a learned matrix for each time step t and vertex degree deg(v)
[Figure: node v with neighbors u1, u2 (initial states h_v^(0), h_u1^(0), h_u2^(0)); the messages M_t(h_v^t, h_ui^t, e_vui) are summed (Σ) and passed to the update function U_t(h_v^t, m_v^{t+1}).]
Convolutional Networks for Learning Molecular Fingerprints [Duvenaud+, NIPS 2015] (readout):
• Readout function: R({h_v^T | v ∈ G}) = f(Σ_{v,t} softmax(W_t h_v^t)), where f is a neural network and W_t are learned per-step matrices
[Figure: the per-layer node states h_v^(0), ..., h_v^(T) of every atom contribute a softmax term; the terms are summed over atoms and layers to give the graph-level prediction ŷ = R({h_v^T | v ∈ G}).]
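The message/update/readout decomposition in equations (1)-(3) can be sketched in a few lines of NumPy. The linear message, sigmoid update, and plain sum readout below are stand-ins chosen for brevity (loosely following the Duvenaud-style concatenation message), not the exact functions of any of the cited models; all weights and the toy graph are random.

```python
# Illustrative NumPy sketch of MPNN message passing (eqs. 1-2) and a sum readout (eq. 3).
import numpy as np

rng = np.random.default_rng(0)
d, d_e = 4, 2                                              # node / edge feature sizes
h = {v: rng.normal(size=d) for v in range(3)}              # h_v^t
edges = {(0, 1): rng.normal(size=d_e), (1, 2): rng.normal(size=d_e)}
adj = {0: [1], 1: [0, 2], 2: [1]}
W_msg = rng.normal(size=(d, d + d_e))                      # message weights (linear, for brevity)
W_upd = rng.normal(size=(d, d))                            # update weights

def edge_feat(v, w):
    # undirected edges: look up (v, w) or (w, v)
    return edges[(v, w)] if (v, w) in edges else edges[(w, v)]

def message(h_w, e_vw):
    # M_t(h_v, h_w, e_vw): a linear map of CONCAT(h_w, e_vw)
    return W_msg @ np.concatenate([h_w, e_vw])

def update(h_v, m_v):
    # U_t(h_v, m_v^{t+1}): a sigmoid of a linear map of the aggregated message
    return 1.0 / (1.0 + np.exp(-(W_upd @ m_v)))

def mpnn_step(h):
    new_h = {}
    for v in adj:
        m_v = sum(message(h[w], edge_feat(v, w)) for w in adj[v])   # eq. (1)
        new_h[v] = update(h[v], m_v)                                # eq. (2)
    return new_h

for _ in range(3):                      # T message-passing steps
    h = mpnn_step(h)
y_hat = sum(h.values())                 # simplest permutation-invariant readout, eq. (3)
print(y_hat)
```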
Gated Graph Neural Networks (GGNN) [Li+, ICLR 2016] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = A_{e_vw} h_w^t, where A_{e_vw} is a learned matrix for each edge label (bond type)
• Update function: U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), with the same update weights shared across time steps
[Figure: the messages A_{e_vw} h_w^t from the neighbors u1, u2 are summed (Σ) and fed to the GRU update U_t(h_v^t, m_v^{t+1}).]
Gated Graph Neural Networks (GGNN) [Li+, ICLR 2016] (readout):
• Readout function: R({h_v^T | v ∈ G}) = tanh(Σ_v σ(i(h_v^(T), h_v^(0))) ⊙ tanh(j(h_v^(T), h_v^(0))))
• i and j are neural networks that take the concatenation of h_v^(T) and h_v^(0) as input
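A minimal sketch of the GGNN-style propagation step described above, assuming one learned matrix per bond type for the messages and a hand-rolled GRU cell for the update; the weights, bond types, and toy molecule are invented for illustration.

```python
# Sketch of a GGNN-style step: per-edge-type message matrices and a GRU update.
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = {"single": rng.normal(size=(d, d)), "double": rng.normal(size=(d, d))}   # A_{e_vw} per bond type
Wz, Uz, Wr, Ur, Wh, Uh = (rng.normal(size=(d, d)) for _ in range(6))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(h, m):
    """U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), with the aggregated message as the 'input'."""
    z = sigmoid(Wz @ m + Uz @ h)
    r = sigmoid(Wr @ m + Ur @ h)
    h_tilde = np.tanh(Wh @ m + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

# toy molecule: atoms 0-1 joined by a single bond, atoms 1-2 by a double bond
h = {v: rng.normal(size=d) for v in range(3)}
bonds = {(0, 1): "single", (1, 2): "double"}
adj = {0: [1], 1: [0, 2], 2: [1]}

def bond_type(v, w):
    return bonds[(v, w)] if (v, w) in bonds else bonds[(w, v)]

for _ in range(3):  # propagation steps; update weights are tied across steps, as in GGNN
    h = {v: gru_update(h[v], sum(A[bond_type(v, w)] @ h[w] for w in adj[v])) for v in adj}
print(h[0])
```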
Deep Tensor Neural Networks (DTNN) [Schütt+, 2017] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = tanh(W^fc((W^cf h_w^t + b1) ⊙ (W^df e_vw + b2)))
• W^fc, W^cf, W^df are learned matrices and b1, b2 learned bias vectors
• Update function: U_t(h_v^t, m_v^{t+1}) = h_v^t + m_v^{t+1}
[Figure: the gated messages from neighbors u1, u2 are summed (Σ) and added to the current node state by the update function.]
Deep Tensor Neural Networks (DTNN) [Schütt+, 2017] (readout):
• Readout function: R({h_v^T | v ∈ G}) = Σ_v NN(h_v^T), where NN is a small neural network applied independently to each final node state
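A small sketch of the DTNN-style interaction just described: the message gates the neighbor state by the edge (e.g. distance) features, and the update simply adds the aggregated message to the node state. Shapes and weights are arbitrary placeholders.

```python
# Sketch of a DTNN-style message and additive update (illustrative shapes and weights).
import numpy as np

rng = np.random.default_rng(2)
d, d_e = 4, 3
W_cf, b1 = rng.normal(size=(d, d)), rng.normal(size=d)
W_df, b2 = rng.normal(size=(d, d_e)), rng.normal(size=d)
W_fc = rng.normal(size=(d, d))

def message(h_w, e_vw):
    # M(h_v, h_w, e_vw) = tanh(W_fc ((W_cf h_w + b1) * (W_df e_vw + b2)))
    return np.tanh(W_fc @ ((W_cf @ h_w + b1) * (W_df @ e_vw + b2)))

def update(h_v, m_v):
    # U(h_v, m_v^{t+1}) = h_v + m_v^{t+1}
    return h_v + m_v

h_v, h_w = rng.normal(size=d), rng.normal(size=d)
e_vw = rng.normal(size=d_e)          # e.g. an interatomic distance expanded in radial basis functions
print(update(h_v, message(h_w, e_vw)))
```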
MPNN with edge networks ("enn-s2s") [Gilmer+, ICML 2017]:
• Message function: M_t(h_v^t, h_w^t, e_vw) = A(e_vw) h_w^t, where A(e_vw) is a neural network (edge network) that maps the edge feature vector e_vw to a d×d matrix
• Update function: U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), the Gated Recurrent Unit of Cho et al. (2014), with weights tied across time steps
[Figure: the messages A(e_vw) h_w^t from the neighbors u1, u2 are summed (Σ) and fed to the GRU update.]
MPNN with edge networks ("enn-s2s") [Gilmer+, ICML 2017] (readout):
• Readout function: R({h_v^T | v ∈ G}) = set2set({h_v^T | v ∈ G}), the order-invariant Read-Process-and-Write (set2set) module of Vinyals et al. (2015), which can also be seen as a special case of a Memory Network or a Neural Turing Machine.
Content-based attention over the node states (memories m_i):
    q_t = LSTM(q*_{t-1})    (3)
    e_{i,t} = f(m_i, q_t)    (4)
    a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t})    (5)
    r_t = Σ_i a_{i,t} m_i    (6)
    q*_t = [q_t, r_t]    (7)
where i indexes the memory vectors m_i (typically the cardinality of the input set X), q_t is a query vector used to read r_t from the memories, f computes a single scalar from m_i and q_t (e.g. a dot product), and LSTM is an LSTM that takes no inputs and evolves the state q*_t, formed by concatenating the query q_t with the attention readout r_t. Since the attention is content-based, the retrieved vector does not change if the memory is shuffled, which gives the required invariance to the ordering of the input set.
[Figure 1 (Vinyals et al.): The Read-Process-and-Write model.]
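One content-based attention step from the Read-Process-and-Write readout (eqs. (4)-(6) above) can be sketched as follows; for brevity the input-less LSTM that evolves the query q*_t is replaced by a fixed random query, so this is only the inner attention read, not the full set2set module.

```python
# Sketch of one content-based attention read over a set of memories (node states).
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 5
M = rng.normal(size=(n, d))          # memories m_i: e.g. the final node states h_v^T
q = rng.normal(size=d)               # query q_t (would come from an input-less LSTM)

e = M @ q                            # e_{i,t} = f(m_i, q_t), here a dot product
a = np.exp(e - e.max())
a = a / a.sum()                      # a_{i,t}: softmax over memory slots
r = a @ M                            # r_t = sum_i a_{i,t} m_i
q_star = np.concatenate([q, r])      # q_t* = [q_t, r_t], fed back to the LSTM on the next step
print(q_star)
```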
Neural Networks for the Prediction of Organic Chemistry Reactions [Wei+, ACS Cent. Sci. 2016]
• A reaction fingerprint (concatenated fingerprints of reactants and reagent) is classified into one of 17 reaction types; the SMARTS transformation of the most probable type is then applied to the reactants to predict the product.
Figure 1. An overview of our method for predicting reaction type and products. A reaction fingerprint, made from concatenating the fingerprints of
reactant and reagent molecules, is the input for a neural network that predicts the probability of 17 different reaction types, represented as a reaction
type probability vector. The algorithm then predicts a product by applying to the reactants a transformation that corresponds to the most probable
reaction type. In this work, we use a SMARTS transformation for the final step.
Figure 2. Cross validation results for (a) baseline fingerprint, (b) Morgan reaction fingerprint, and (c) neural reaction fingerprint. A confusion matrix
shows the average predicted probability for each reaction type. In these confusion matrices, the predicted reaction type is represented on the vertical
axis, and the correct reaction type is represented on the horizontal axis. These figures were generated on the basis of code from Schneider et al.43
large libraries of synthetically accessible compounds in the areas
of molecular discovery,44
metabolic networks,45
drug discov-
ery,46
and discovery of one-pot reactions.47
In our algorithm,
we use SMARTS transformation for targeted prediction of
product molecules from reactants. However, this method can
be replaced by any method that generates product molecule
graphs from reactant molecule graphs. An overview of our
method can be found in Figure 1 and is explained in further
detail in Prediction Methods.
We show the results of our prediction method on 16 basic
reactions of alkyl halides and alkenes, some of the first reactions
taught to organic chemistry students in many textbooks.48
The
training and validation reactions were generated by applying
simple SMARTS transformations to alkenes and alkyl halides.
While we limit our initial exploration to aliphatic, non-
stereospecific molecules, our method can easily be applied a
wider span of organic chemical space with enough example
reactions. The algorithm can also be expanded to include
experimental conditions such as reaction temperature and time.
With additional adjustments and a larger library of training data,
our algorithm will be able to predict multistep reactions and,
eventually, become a module in a larger machine-learning
system for suggesting retrosynthetic pathways for complex
molecules.
■ RESULTS AND DISCUSSION
Performance on Cross-Validation Set. We created a data
set of reactions of four alkyl halide reactions and 12 alkene
reactions; further details on the construction of the data set can
be found in Methods. Our training set consisted of 3400
reactions from this data set, and the test set consisted of 17,000
reactions; both the training set and the test set were balanced
across reaction types. During optimization on the training set,
k-fold cross-validation was used to help tune the parameters of
the neural net. Table 1 reports the cross-entropy score and the
accuracy of the baseline and fingerprinting methods on this test
set. Here the accuracy is defined by the percentage of matching
Table 1. Accuracy and Negative Log Likelihood (NLL) Error of Fingerprint and Baseline Methods
fingerprint method | fingerprint length | train NLL | train accuracy (%) | test NLL | test accuracy (%)
baseline | 51  | 0.2727 | 78.8 | 2.5573 | 24.7
Morgan   | 891 | 0.0971 | 86.0 | 0.1792 | 84.5
neural   | 181 | 0.0976 | 86.0 | 0.1340 | 85.7
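A rough sketch of the reaction-fingerprint input described above, assuming RDKit is available: Morgan fingerprints of the reactants and the reagent are concatenated into one vector, which would then be fed to a small classifier over reaction types. The molecules, fingerprint length, and the helper name fp are made up for illustration and need not match the paper's exact featurization.

```python
# Sketch of a concatenated reactant/reagent Morgan fingerprint as classifier input.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles, n_bits=256):
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)   # radius-2 Morgan fingerprint
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

reactant1, reactant2, reagent = "CC(C)CBr", "C=CC", "[Na+].[OH-]"      # made-up example reaction
x = np.concatenate([fp(reactant1), fp(reactant2), fp(reagent)])        # the reaction fingerprint
print(x.shape)  # this vector is what a small MLP would map to a reaction-type probability vector
```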
Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction [Segler & Waller, Chem. Eur. J. 2017]
• Neural networks rank which transformation rule (reaction template) should be applied to the input molecule(s), for both reaction prediction and retrosynthesis.
They first encoded molecules by unsupervised pre-training of self-organizing maps, a type of neural network,[8] which then serves as an input to a random forest classifier.[7] They conducted a study with a dataset of 278 reactions from the literature and 8 reaction classes. During the preparation of this manuscript, we became aware that Aspuru-Guzik and co-workers recently reported a neural network approach for reaction prediction, in which reactant and reagent molecules were encoded as neural[9] or extended connectivity fingerprints, which are concatenated.[10] They trained their models to predict 16 reaction types on an artificial dataset of 3200 reactions.
Herein, we propose a novel neural-symbolic model, which can be used for both reaction prediction and retrosynthesis. Our model uses neural networks in order to predict the most likely transformation rules to be applied to the input molecule(s) (see Figures 1 and 2). In short, the computer has to learn which named reaction was used to make a molecule (or under which rule the starting materials reacted). We trained neural networks by showing them millions of examples of known reactions and the corresponding correct reaction rule. The goal is to learn patterns in the molecules' functional groups that allow the machine to generalize to molecules it has not seen before. In this problem the model learns to predict the probability over all rules.[11] We hypothesize that the advantage of combining machine learning with symbolic rules is that we retain the familiar concept of rules, whereas the model learns to prioritize the rules and to estimate selectivity and compatibility from the provided training data, which are successfully performed experiments. We test our hypothesis in the first large-scale systematic investigation of machine learning and rule-based systems.
Several metrics are reported in Table 1 and Table 2 to evaluate the models. Accuracy shows how many reactions are correctly predicted when the rule with the highest predicted probability is evaluated. In top-n accuracy, we examine if the correct reaction rule is among the n highest ranked rules, similar to being on the first page of the results of a search engine. This means we allow the algorithm to propose more than one applicable rule, which is reasonable, since the same molecules can react differently, or several routes are possible to synthesize a target molecule in retrosynthesis.
Figure 1. The challenge in retrosynthesis and reaction prediction is to select the correct rule among possibly tens or hundreds of matching rules. In this example, both a Suzuki and a Kumada coupling (among others) formally match the biaryl moiety. However, the aldehyde in the molecular context would be in conflict with a Grignard reagent. Therefore, the Kumada coupling cannot be used to make target 1. This information has to be encoded by hand in expert systems. Our system simply learns it from data.
Figure 2. Overview of our neural-symbolic ansatz.
Table 1. Results for the study on 103 hand coded rules.
Task/Model | Acc | Top 3-Acc | MRR | W. Prec.
Reaction prediction
random | 0.03 | 0.12 | 0.04 | 0.03
expert system | 0.07 | 0.33 | 0.12 | 0.46
logistic regression | 0.86 | 0.97 | 0.91 | 0.86
highway network | 0.92 | 0.99 | 0.96 | 0.92
FC512 ELU | 0.92 | 0.99 | 0.96 | 0.92
Retrosynthesis
random | 0.03 | 0.12 | 0.04 | 0.03
expert system | 0.05 | 0.30 | 0.06 | 0.11
logistic regression | 0.64 | 0.95 | 0.77 | 0.62
highway network | 0.77 | 0.98 | 0.86 | 0.77
FC512 ELU | 0.78 | 0.98 | 0.87 | 0.78
Table 2. Results for the study on 8720 automatically extracted rules.
Task/Model | Acc | Top 10-Acc | MRR | W. Prec.
Reaction prediction
random | 0.00 | 0.00 | 0.00 | 0.00
expert system | 0.02 | 0.18 | 0.02 | 0.06
logistic regression | 0.41 | 0.65 | 0.49 | 0.31
highway network | 0.78 | 0.98 | 0.86 | 0.77
FC512 ELU | 0.77 | 0.97 | 0.85 | 0.76
Retrosynthesis
random | 0.00 | 0.00 | 0.00 | 0.00
expert system | 0.02 | 0.19 | 0.01 | 0.06
logistic regression | 0.31 | 0.59 | 0.41 | 0.23
highway network | 0.64 | 0.95 | 0.75 | 0.63
FC512 ELU | 0.62 | 0.94 | 0.74 | 0.62
Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies [Segler+, ICLR 2017 Workshop]
Workshop track - ICLR 2017
[Figure 1 panels: a) Retrosynthesis: analyse the target in reverse (chemical representation); b) applied transformations (productions) applied in each analysis step; c) Synthesis: go to the lab and actually conduct the reactions, following plan a); d) search tree representation; e) neural networks determine the actions.]
Figure 1: a) The synthesis of the target (ibuprofen) is planned with retro… analyzed until a set of building blocks is found that we can buy or we alre…
Retrosynthesis planning with Monte Carlo Tree Search and neural network policies [Segler+, ICLR 2017 Workshop]
[Figure 1, panels d) and e): d) search tree representation rooted at ibuprofen, with states S1 = {1}, S2 = {2, CO}, S3 = {3, CO, H2}, S4 = {4, CO, H2, Ac2O}; e) neural networks determine the actions: a molecular descriptor (ECFP4) of the current molecule x is fed to a deep neural network classifier that outputs p(r_n | x), and the most probable rules are applied to the molecule.]
Workshop track - ICLR 2017
2.3 MONTE CARLO TREE SEARCH (MCTS)
MCTS combines tree search with random sampling [Browne et al. (2012)]
work in three of the four MCTS steps: Selection, Expansion, and Rollout. W
employ Eq. (1), a variation of the UCT function similar to AlphaGo’s, to de
of a node v, where P(a) is the probability assigned to this action by the p
number of visits and Q(v) the accumulated reward at node v, and c the expl
    argmax_{v' ∈ children of v} [ Q(v') / N(v') + c · P(a) · √N(v) / (1 + N(v')) ]    (1)
The tree policy is applied until a leaf node is found, which is expanded
network. Our policy network has a top1 accuracy of 0.62 and a top50 (of 1
This allows to kill two birds with one stone: First, to reduce the branching
the top 50 most promising actions, and not all possible ones (≈ 200). Secon
actions, we have to solve the subgraph isomorphism problem, which determ
rule can be applied to a molecule and yields the next molecule(s), only 50
times for all rules. During rollout, we sample actions from the policy netw
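A small sketch of the selection rule in Eq. (1) above (an AlphaGo-style PUCT variant); the node representation as dicts with visit count N, accumulated reward Q, and policy prior P, and the numbers themselves, are invented for illustration.

```python
# Sketch of PUCT-style child selection: Q(v')/N(v') + c * P(a) * sqrt(N(v)) / (1 + N(v')).
import math

def select_child(children, N_parent, c=1.4):
    """children: list of dicts with visit count N, total reward Q, and prior P from the policy net."""
    def score(ch):
        exploit = ch["Q"] / ch["N"] if ch["N"] > 0 else 0.0
        explore = c * ch["P"] * math.sqrt(N_parent) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"name": "rule_a", "N": 10, "Q": 6.0, "P": 0.5},
    {"name": "rule_b", "N": 2,  "Q": 1.5, "P": 0.3},
    {"name": "rule_c", "N": 0,  "Q": 0.0, "P": 0.2},   # unvisited: purely prior-driven exploration
]
print(select_child(children, N_parent=12)["name"])
```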
Prediction of Organic Reaction Outcomes Using Machine Learning [Coley+, ACS Cent. Sci. 2017]
• Candidate products are enumerated by applying overgeneralized templates at the reaction core, and a neural network ranks the candidates using an edit-based representation of each candidate reaction.
representation, less fundamental changes (e.g., formal charge, association/disassociation of salts) are neglected. Loss or gain of a hydrogen is represented by 32 easy-to-compute features of that reactant atom alone, a_i ∈ R^32. Loss or gain of a bond is represented by a concatenation of the features of the atoms involved and four features of the bond, [a_i, b_ij, a_j] ∈ R^68.
Because edits occur at the reaction center by definition, the
overall representation of a candidate reaction depends only on
the atoms and bonds at the reaction core. There is no explicit
inclusion of other molecular features, e.g., adjacency to certain
functional groups. However, in our featurization, we include
rapidly calculable structural and electronic features of the
reactants’ atoms that reflect the local chemical environment first
and foremost, but also reflect the surrounding molecular
context.19−22
The chosen features can be found in Table 2 and
Table S2 and are discussed in more detail in the Supporting
Information.
The design of the neural network is motivated by the
likelihood of a reaction being a function of the atom/bond
changes that are required for it to occur. Individual edits are
first analyzed in isolation using a neural network unique to the
10%/20% training/validation/testing split and ceased training
once the validation loss did not improve for five epochs. The
edit-based model achieves an test accuracy of 68.5%, averaged
across all folds. In this context, accuracy refers to the percentage
of reaction examples where the recorded product was assigned
a rank of 1. The baseline model was similarly trained and tested
in a 5-fold CV, reaching an accuracy of 33.3%, suggesting that
the set of recorded products in the data set is fairly
homogeneous. The hybrid model, combining the edit-based
representation with the proposed products’ fingerprint
representations, achieves an accuracy of 71.8%. These results
are displayed in Table 1.
Figure 3. Edit-based model architecture for scoring candidate
reactions. Reactions are represented by four types of edits. Initial
atom- and bond-level attributes are converted into feature
representations, which are summed and used to calculate that
candidate reaction’s likelihood score.
Table 1. Comparison between Baseline, Edit-Based, and Hybrid Models in Terms of Categorical Crossentropy Loss and Accuracy(a)
model | loss | acc. (%) | top-3 (%) | top-5 (%) | top-10 (%)
random guess | 5.46 | 0.8 | 2.3 | 3.8 | 7.6
baseline | 3.28 | 33.3 | 48.2 | 55.8 | 65.9
edit-based | 1.34 | 68.5 | 84.8 | 89.4 | 93.6
hybrid | 1.21 | 71.8 | 86.7 | 90.8 | 94.6
(a) Top-n refers to the percentage of examples where the recorded product was ranked within the top n candidates.
scalability. Wei et al.10
describe the use of neural networks to
predict the outcome of reactions based on reactant fingerprints,
but limit their study to 16 types of reactions covering a very
narrow scope of possible alkyl halide and alkene reactions.
Given two reactants and one reagent, the model was trained to
identify which of 16 templates was most applicable. The data
set used for cross-validation comes from artificially generated
examples with limited chemical functionality, rather than
experimental data.
Quite recently, Segler and Waller describe two approaches to
forward synthesis prediction. The first is a knowledge-graph
approach that uses the concept of half reactions to generate
possible products given exactly two reactants by looking at the
known reactions in which each of those reactants participates.11
required; (3) a new reaction representation focused on the
fundamental transformation at the reaction site rather than
constituent reactant and product fingerprints; (4) the
implementation and validation of a neural network-based
model that learns when certain modes of reactivity are more or
less likely to occur than other potential modes. Despite the
literature bias toward reporting only high-yielding reactions, we
develop a successful workflow that can be performed without
any manual curation using actual reactions reported in the
USPTO literature.
■ APPROACH
Overview. Our model predicts the outcome of a chemical
reaction in a two-step manner: (1) applying overgeneralized
Figure 1. Model framework combining forward enumeration and candidate ranking. The primary aim of this work is the creation of the parametrized
scoring model, which is trained to maximize the probability assigned to the recorded experimental outcome.
Figure 2. Overall reaction prediction framework: (a) A user inputs the reactants and conditions. (b) We identify potential electron donors and
acceptors using coarse approximations of electron-filled and -unfilled MOs. (c) Highly sensitive reactive site classifiers are trained and used to filter out
"Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models [Schwaller+, 2018]
• Reactions are treated as text: reactant/reagent SMILES strings are translated into product SMILES with an attention-based sequence-to-sequence model.

Test set | reactions | BLEU [36] | ROUGE [37] | top-1
Jin's USPTO test set [17] | 38,648 | 95.9 | 96.0 | 83.2
Lowe's test set [26] | 50,258 | 90.3 | 90.9 | 65.4
6.2 Comparison with the state of the art
To the best of our knowledge, no previous work has attempted to predict reac
US patent dataset of Lowe [26]. Table 6 shows a comparison with the Weisfeil
networks (WLDN) from Jin et al. [17] on their USPTO test set. To make a fair
all the multiple product reactions in the test set as false predictions for our mod
only on the single product reactions. By achieving 80.3% top-1 accuracy, we o
by a margin of 6.3%, which is even higher than for their augmented setup. A
rank candidates, but was trained on accurately predicting the top-1 outcome, it
the WLDN beats our model in the top-3 and top-5 accuracy. The decoding o
test set reactions takes on average 25 ms per reaction, inferred with a beam se
therefore compete with the state of the art.
Table 6: Comparison with Jin et al. [17]. The 1,352 multiple product reactions are counted as false predictions for our model.
Jin's USPTO test set [17], accuracies in [%]
Method | top-1 | top-2 | top-3 | top-5
WLDN [17] | 74.0 | – | 86.7 | 89.5
Our model | 80.3 | 84.7 | 86.2 | 87.5
(a) Distribution of top-1 probabilities (b) Coverage / Accuracy plot
Figure 1: Top-1 prediction confidence plots for Lowe's test set inferred with a beam search of 10.
4 Attention
Attention is the key to take into account complex long-range dependencies between multiple tokens. Specific functional groups, solvents or catalysts have an impact on the outcome of a reaction, even if they are far from the reaction center in the molecular graph and therefore also in the SMILES string. Figure 2 shows how the network learned to focus first on the C[O-] molecule, to map the [O-] in the input correctly to the O in the target, and to ignore the Br, which is replaced in the target.
(a) Attention weights (b) Reaction plotted with RDKit [27]
Figure 2: Reaction 120 from Jin's USPTO test set. The atom mapping between reactants and product is highlighted. SMILES: Brc1cncc(Br)c1.C[O-]>CN(C)C=O.[Na+]>COc1cncc(Br)c1
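Sequence-to-sequence models like the one above operate on token sequences rather than raw characters, so SMILES strings are first tokenized. The regex below is a common community pattern for SMILES tokenization, shown only as an illustration; it is not necessarily the exact tokenizer used in the paper.

```python
# Minimal regex-based SMILES tokenizer sketch.
import re

SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\-\+\(\)\.\\/@0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

# Reactant string from the attention example above:
print(tokenize("Brc1cncc(Br)c1.C[O-]"))
```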
[Several figure-only slides step through the Weisfeiler-Lehman Difference Network (WLDN) pipeline on a small example graph (atoms u1–u5).]
Weisfeiler-Lehman Network (WLN) [Jin+, NIPS 2017] as an MPNN:
• Message function: M_t(h_v^t, h_u^t, e_vu) = τ(V CONCAT(h_u^t, e_vu)), where τ is a nonlinearity and V a learned matrix
• Update function: U_t(h_v^t, m_v^{t+1}) = τ(U1 h_v^t + U2 m_v^{t+1}), where U1 and U2 are learned matrices shared across layers
[Figure: the messages τ(V CONCAT(h_u^t, e_vu)) from the neighbors u1, u2 are summed (Σ) and fed to the update function.]
Weisfeiler-Lehman Network (WLN) [Jin+, NIPS 2017]: learning continuous atom representations
[Figure: node v with neighbors u1–u4; initial states h_v^(0), h_u^(0) and bond features f_uv are relabeled through L layers, and the final states are combined with matrices W(0), W(1), W(2) and summed (Σ).]
Two graphs are said to be isomorphic if their set representations are the same. The number of distinct labels grows exponentially with the number of iterations L.
WL Network: The discrete relabeling process does not directly generalize to continuous feature vectors. Instead, we appeal to neural networks to continuously embed the computations inherent in the WL test. Let r be the analogous continuous relabeling function. Then a node v ∈ G with neighbor nodes N(v), node features f_v, and edge features f_uv is "relabeled" according to
    r(v) = τ(U1 f_v + U2 Σ_{u ∈ N(v)} τ(V[f_u, f_uv]))    (1)
where τ(·) could be any non-linear function. We apply this relabeling operation iteratively to obtain context-dependent atom vectors
    h_v^(l) = τ(U1 h_v^(l-1) + U2 Σ_{u ∈ N(v)} τ(V[h_u^(l-1), f_uv]))    (1 ≤ l ≤ L)    (2)
where h_v^(0) = f_v and U1, U2, V are shared across layers. The final atom representations arise from mimicking the set comparison function in the WL isomorphism test, yielding
    c_v = Σ_{u ∈ N(v)} W(0) h_u^(L) ⊙ W(1) f_uv ⊙ W(2) h_v^(L)    (3)
The set comparison here is realized by matching each rank-1 edge tensor h_u^(L) ⊗ f_uv ⊗ h_v^(L) to a set of reference edges also cast as rank-1 tensors W(0)[k] ⊗ W(1)[k] ⊗ W(2)[k], where W[k] is the k-th row of matrix W. In other words, Eq. 3 above could be written as
    c_v[k] = Σ_{u ∈ N(v)} ⟨ W(0)[k] ⊗ W(1)[k] ⊗ W(2)[k], h_u^(L) ⊗ f_uv ⊗ h_v^(L) ⟩    (4)
The resulting c_v is a vector representation that captures the local chemical environment of the atom (through relabeling) and involves a comparison against a learned set of reference environments. The representation of the whole graph G is simply the sum over all the atom representations: c_G = Σ_v c_v.
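A minimal NumPy sketch of the WLN relabeling (eq. 2) and the local atom representation c_v (eq. 3) above, on a toy three-atom graph with random weights; τ is taken to be ReLU and all dimensions are arbitrary.

```python
# Sketch of WLN relabeling (eq. 2) and local atom representations c_v (eq. 3) on a toy graph.
import numpy as np

rng = np.random.default_rng(4)
d, d_f = 4, 3                                    # atom-vector size, bond-feature size
U1, U2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
V = rng.normal(size=(d, d + d_f))
W0, W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d_f)), rng.normal(size=(d, d))
tau = lambda x: np.maximum(x, 0.0)               # ReLU as the nonlinearity

adj = {0: [1], 1: [0, 2], 2: [1]}
f_atom = {v: rng.normal(size=d) for v in adj}    # initial atom features f_v = h_v^(0)
f_bond = {frozenset({0, 1}): rng.normal(size=d_f), frozenset({1, 2}): rng.normal(size=d_f)}

h = dict(f_atom)
for _ in range(2):                               # L relabeling iterations, weights shared across layers
    h = {v: tau(U1 @ h[v]
                + U2 @ sum(tau(V @ np.concatenate([h[u], f_bond[frozenset({u, v})]]))
                           for u in adj[v]))
         for v in adj}

c = {v: sum((W0 @ h[u]) * (W1 @ f_bond[frozenset({u, v})]) * (W2 @ h[v]) for u in adj[v])
     for v in adj}                               # local chemical environment of each atom
print(c[1])
```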
[Figure-only slide: the Weisfeiler-Lehman Difference Network (WLDN) example graph (atoms u1–u5) again.]
Finding reaction centers with the WLN [Jin+, NIPS 2017]
[Figure: the WLN atom representations c_u, c_v (from the final states h_u^(L), h_v^(L)) are scored pairwise to predict how likely each atom pair (u, v) is to change in the reaction.]
3.1.2 Finding Reaction Centers with WLN
We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations c_u and c_v in predicting label y_uv. The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents. We incorporate these distal effects into the global model through an attention mechanism.
Local Model: Let c_u, c_v be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network:
    s_uv = σ(u^T τ(M_a c_u + M_a c_v + M_b b_uv))    (5)
where σ(·) is the sigmoid function, and b_uv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them.
Global Model: Let α_uv be the attention score of atom v on atom u. The global context representation c̃_u of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module:
    c̃_u = Σ_v α_uv c_v;  α_uv = σ(u^T τ(P_a c_u + P_a c_v + P_b b_uv))    (6)
    s_uv = σ(u^T τ(M_a c̃_u + M_a c̃_v + M_b b_uv))    (7)
Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u.
Training: Both models are trained to minimize the following loss function:
    L(T) = -Σ_{R ∈ T} Σ_{u ≠ v ∈ R} y_uv log(s_uv) + (1 - y_uv) log(1 - s_uv)    (8)
Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N^2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance.
3.2 Candidate Generation
We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint. As we will show, the set of candidates arising from this procedure is more compact than those arising from templates without sacrificing coverage.
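A small sketch of the local reactivity score (eq. 5) and the pairwise cross-entropy loss (eq. 8) above; the atom vectors, pair features, and labels are random placeholders standing in for the WLN outputs c_u, c_v and the binary labels y_uv.

```python
# Sketch of the local reactivity score (eq. 5) and pairwise cross-entropy loss (eq. 8).
import numpy as np

rng = np.random.default_rng(5)
d, d_b = 4, 3
Ma, Mb = rng.normal(size=(d, d)), rng.normal(size=(d, d_b))
u_vec = rng.normal(size=d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tau = lambda x: np.maximum(x, 0.0)

def reactivity_score(c_u, c_v, b_uv):
    # s_uv = sigma(u^T tau(Ma c_u + Ma c_v + Mb b_uv))
    return sigmoid(u_vec @ tau(Ma @ c_u + Ma @ c_v + Mb @ b_uv))

def pair_loss(pairs):
    # L = -sum over pairs of [ y_uv log(s_uv) + (1 - y_uv) log(1 - s_uv) ]
    total = 0.0
    for c_u, c_v, b_uv, y in pairs:
        s = reactivity_score(c_u, c_v, b_uv)
        total -= y * np.log(s) + (1 - y) * np.log(1 - s)
    return total

pairs = [(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d_b), y) for y in (1, 0, 0)]
print(pair_loss(pairs))
```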
Global model with attention [Jin+, NIPS 2017]
[Figure-only slide: the attention weights α_uv combine the atom vectors c_v of all reactant atoms into a global context vector c̃_u for each atom u (eqs. 6–7), illustrated on the example graph.]
(Slide figure: global model, attention-weighted sum (Σ) of the WLN atom representations.)
(Slide figure: global model, attention over all reactant atoms u1-u4 yielding the context vectors c̃_u used in Eq. (7).)
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN), difference graph over the candidate product's atoms u1-u5.)
(Slide: candidate ranking, overview figure of the approach: "(1) we train a model to identify pairwise atom interactions ...".)
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN), continued.)
(Slide figure: candidate scoring with difference vectors d_v^(p_i) = c_v^(p_i) - c_v^(r) over atoms v1-v5, pooled by a sum (Σ) to score each candidate.)
(Slide figure: difference vectors d_v^(p_i) = c_v^(p_i) - c_v^(r) and bond labels y ∈ {0, 1, 2, 3} over the enumerated candidate products; the WLDN is applied on the resulting difference graph.)
Let c_v^{(p_i)} denote the learned atom representation of atom v in candidate product molecule p_i. We define the difference vector d_v^{(p_i)} pertaining to atom v as follows:

    d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)} ;    s(p_i) = u^T \tau( M \sum_{v \in p_i} d_v^{(p_i)} )    (9)

Recall that the reactants and products are atom-mapped so we can use v to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each (r, p_i) pair. This vector is then fed into another neural network to score the candidate product p_i.

Weisfeiler-Lehman Difference Network (WLDN) Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph D(r, p_i) is defined as a molecular graph which has the same atoms and bonds as p_i, with atom v's feature vector replaced by d_v^{(p_i)}. Operating on the difference graph has several benefits. First, in D(r, p_i), atom v's feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, D(r, p_i) explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector, by applying a separately parameterized WLN on top of D(r, p_i):

    h_v^{(p_i, l)} = \tau\Big( U_1 h_v^{(p_i, l-1)} + U_2 \sum_{u \in N(v)} \tau\big( V [ h_u^{(p_i, l-1)}, f_{uv} ] \big) \Big)    (1 \le l \le L)    (10)

    d_v^{(p_i, L)} = \sum_{u \in N(v)} W^{(0)} h_u^{(p_i, L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(p_i, L)}    (11)

where h_v^{(p_i, 0)} = d_v^{(p_i)}. The final score of p_i is s(p_i) = u^T \tau( M \sum_{v \in p_i} d_v^{(p_i, L)} ).

Training Both models are trained to minimize the softmax log-likelihood objective over the scores {s(p_0), s(p_1), ..., s(p_m)} where s(p_0) corresponds to the target.

4 Experiments

Data As a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes.

In addition, for comparison purposes we report the results on the subset of 15K reactions from this dataset (referred to as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing.

Setup for Reaction Center Identification The output of this component consists of K atom pairs with the highest reactivity scores. We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set.

The model features reflect basic chemical properties of atoms and bonds. Atom-level features include its elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring.

Both our local and global models are built upon a Weisfeiler-Lehman Network, with unrolled depth
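A minimal sketch of the difference-vector pooling (Eq. 9) and the WLDN refinement (Eqs. 10-11) follows. Shapes, the choice of τ as ReLU, and the masking scheme are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class WLDNScorer(nn.Module):
    """Scores a candidate product from difference vectors d_v = c_v(p_i) - c_v(r)."""
    def __init__(self, d, d_f, L=3):
        super().__init__()
        self.U1 = nn.Linear(d, d, bias=False)
        self.U2 = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d + d_f, d, bias=False)
        self.W0 = nn.Linear(d, d, bias=False)
        self.W1 = nn.Linear(d_f, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.M = nn.Linear(d, d, bias=False)
        self.u = nn.Linear(d, 1, bias=False)
        self.L = L

    def forward(self, d_vec, adj, f):
        """d_vec: (N, d) difference vectors; adj: (N, N) 0/1 adjacency of p_i;
        f: (N, N, d_f) bond features f_uv. Returns the scalar score s(p_i)."""
        tau = torch.relu
        h = d_vec                                          # h^(p_i, 0) = d^(p_i)
        for _ in range(self.L):                            # Eq. (10)
            msg = tau(self.V(torch.cat(
                [h[None, :, :].expand(h.size(0), -1, -1), f], dim=-1)))
            h = tau(self.U1(h) + self.U2((adj.unsqueeze(-1) * msg).sum(dim=1)))
        # Eq. (11): element-wise combination over neighbors
        pair = self.W0(h)[None, :, :] * self.W1(f) * self.W2(h)[:, None, :]
        d_L = (adj.unsqueeze(-1) * pair).sum(dim=1)        # (N, d)
        return self.u(tau(self.M(d_L.sum(dim=0))))         # s(p_i), pooling as in Eq. (9)

# Training: softmax cross-entropy over the candidate scores {s(p_0), ..., s(p_m)},
# with p_0 the recorded (true) product.
```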
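The atom- and bond-level features listed in the experimental setup can be extracted with RDKit; a sketch is shown below. The exact feature set and encoding used in the paper may differ.

```python
from rdkit import Chem

def atom_features(atom):
    """Atom features named in the paper: element, degree, #H, valence, aromaticity."""
    return [
        atom.GetAtomicNum(),          # elemental identity
        atom.GetDegree(),             # degree of connectivity
        atom.GetTotalNumHs(),         # number of attached hydrogens
        atom.GetImplicitValence(),    # implicit valence
        int(atom.GetIsAromatic()),    # aromaticity flag
    ]

def bond_features(bond):
    """Bond type (single/double/triple/aromatic), conjugation, ring membership."""
    bt = bond.GetBondType()
    return [
        int(bt == Chem.BondType.SINGLE),
        int(bt == Chem.BondType.DOUBLE),
        int(bt == Chem.BondType.TRIPLE),
        int(bt == Chem.BondType.AROMATIC),
        int(bond.GetIsConjugated()),
        int(bond.IsInRing()),
    ]

mol = Chem.MolFromSmiles("CC(=O)Nc1ccccc1")   # example reactant
atoms = [atom_features(a) for a in mol.GetAtoms()]
bonds = [bond_features(b) for b in mol.GetBonds()]
```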
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN) applied to the enumerated candidates with bond labels y ∈ {0, 1, 2, 3}.)
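The coverage and precision-at-rank numbers reported in Table 1 below can be computed as in the following sketch, assuming simple Python containers for the ranked predictions and the ground truth.

```python
def coverage_at_k(pred_pairs_per_reaction, true_centers, k):
    """Fraction of reactions whose true reaction center is fully inside the top-K pairs.

    pred_pairs_per_reaction: list of ranked atom-pair lists, one per reaction
    true_centers:            list of sets of atom pairs that actually change
    """
    hits = sum(
        true.issubset(set(pred[:k]))
        for pred, true in zip(pred_pairs_per_reaction, true_centers)
    )
    return hits / len(true_centers)

def precision_at_n(ranked_candidates, true_product, n):
    """1 if the recorded product appears among the top-n ranked candidates."""
    return int(true_product in ranked_candidates[:n])
```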
Figure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle). Reactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue circle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the local model does not even include it in the top ten predictions. The attention map of the global model shows that atoms 1, 25, and 26 were determinants of atom 4's predicted reactivity.

(a) Reaction Center Prediction Performance. Coverage is reported by picking the top K (K = 6, 8, 10) reactivity pairs. |θ| is the number of model parameters.

    USPTO-15K
    Method   |θ|      K=6    K=8    K=10
    Local    572K     80.1   85.0   87.7
    Local    1003K    81.6   86.1   89.1
    Global   756K     86.7   90.1   92.2

    USPTO
    Local    572K     83.0   87.2   89.6
    Local    1003K    82.4   86.7   89.1
    Global   756K     89.8   92.0   93.3

    Avg. Num. of Candidates (USPTO)
    Template   482.3 out of 5006
    Global     60.9   246.5   1076

(b) Candidate Ranking Performance. Precision at ranks 1, 3, 5 is reported. (*) denotes that the true product was added if not covered by the previous stage.

    USPTO-15K
    Method         Cov.    P@1    P@3    P@5
    Coley et al.   100.0   72.1   86.6   90.7
    WLN            90.1    74.9   84.6   86.3
    WLDN           90.1    76.7   85.6   86.8
    WLN (*)        100.0   81.4   92.5   94.8
    WLDN (*)       100.0   84.1   94.1   96.1

    USPTO
    WLN            92.0    73.5   86.1   89.0
    WLDN           92.0    74.0   86.7   89.5
    WLN (*)        100.0   76.7   91.0   94.6
    WLDN (*)       100.0   77.8   91.9   95.4

Table 1: Results on USPTO-15K and USPTO datasets.
(Slide: discussion of results. The model's accuracy is compared against the top-performing template-based approach, which employs frequency-based heuristics to construct reaction templates; Table 1 is shown again for reference.)
A Human Evaluation Setup

Here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups.

Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first 8 are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top-1 accuracy.

    Chemist     56.3   50.0   40.0   63.8   66.3   65.0   40.0   58.8   25.0   16.3
    Our Model   69.1

Figure 5: Details of human performance. (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a.
References

Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network
• Jin, Wengong, et al. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems, 2017.

Template-based
• Wei, Jennifer N., David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions. 2016.
• Segler, Marwin H. S., and Mark P. Waller. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. 2017.
• Segler, Marwin, Mike Preuss, and Mark Waller. Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies. 2017.
• Coley, Connor W., et al. Prediction of Organic Reaction Outcomes Using Machine Learning. 2017.

Template-free
• Kayala, Matthew A., et al. Learning to predict chemical reactions. 2011.
• Schwaller, Philippe, et al. "Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. arXiv preprint, 2017.
Graph representation
• Rogers, David, and Mathew Hahn. Extended-connectivity fingerprints. 2010.
• Shervashidze, Nino, et al. Weisfeiler-Lehman graph kernels. 2011.

Graph convolution
• Gilmer, Justin, et al. Neural message passing for quantum chemistry. ICML, 2017.
• Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints. 2015.
• Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks. 2016.
• Schütt, Kristof, et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 2017.
• Vinyals, Oriol, Bengio, Samy, and Kudlur, Manjunath. Order matters: Sequence to sequence for sets. ICLR, 2016.
DeNA copyright document summarizes neural message passing approaches

  • 10. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n , 2 2 7 2 , 0)7 : (,+ 1 R N) 2 ( wM]r )7 : [ o GP e L • , 2 2 7 2 + M • i h sL k gm ] • M]i h a d k gm uM] n + M C i h k gm + RI] • 2 2 , 2 2 7 R u Ni h p L l sL oP k gm tp ] ntum Chemistry Oriol Vinyals 3 George E. Dahl 1 DFT 103 seconds Message Passing Neural Net 10 2 seconds E,!0, ... Targets )7 : (,+
  • 11. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Message Passing Neural Networks (MPNN), Gilmer et al. (2017): a common abstraction for graph networks on molecules. During the message passing phase, hidden states $h_v^t$ at each node are updated for $T$ steps according to $m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw})$ and $h_v^{t+1} = U_t(h_v^t, m_v^{t+1})$, where $N(v)$ denotes the neighbors of $v$ in graph $G$. The readout phase then computes a feature vector for the whole graph, $\hat{y} = R(\{h_v^T \mid v \in G\})$; $M_t$, $U_t$, and $R$ are learned differentiable functions, and $R$ must be invariant to permutations of the node states. The neural fingerprint model of Duvenaud et al. (2015) is one instance of this framework, with message function $M_t(h_v^t, h_w^t, e_{vw}) = \mathrm{CONCAT}(h_w^t, e_{vw})$ and vertex update function $U_t(h_v^t, m_v^{t+1}) = \sigma(H_t^{\deg(v)} m_v^{t+1})$, where $H_t^{\deg(v)}$ is a learned matrix chosen by the degree of $v$ at step $t$.
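A minimal numpy sketch of the generic message / update / readout cycle described above. The toy 4-atom graph, the feature sizes, and the simple tanh-linear message and update functions are illustrative assumptions only, not the concrete models surveyed on these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy molecular graph: 4 atoms in a ring, each bond carrying a small edge-feature vector.
num_atoms, d_h, d_e, T = 4, 8, 3, 3
bonds = {(0, 1), (1, 2), (2, 3), (3, 0)}
edges = {e: rng.normal(size=d_e) for e in bonds}
h = rng.normal(size=(num_atoms, d_h))            # h_v^0: initial atom features

# Illustrative "learned" parameters (random placeholders) for linear message/update functions.
W_msg = rng.normal(size=(d_h + d_e, d_h)) * 0.1
W_upd = rng.normal(size=(2 * d_h, d_h)) * 0.1

def edge_feat(v, w):
    return edges[(v, w)] if (v, w) in edges else edges[(w, v)]

def neighbors(v):
    return [w for w in range(num_atoms) if (v, w) in edges or (w, v) in edges]

for t in range(T):
    # Message phase: m_v^{t+1} = sum_{w in N(v)} M_t(h_v^t, h_w^t, e_vw)
    m = np.zeros_like(h)
    for v in range(num_atoms):
        for w in neighbors(v):
            m[v] += np.tanh(np.concatenate([h[w], edge_feat(v, w)]) @ W_msg)
    # Update phase: h_v^{t+1} = U_t(h_v^t, m_v^{t+1})
    h = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)

# Readout phase: a permutation-invariant sum over final node states.
y_hat = h.sum(axis=0)
print(y_hat.shape)
```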
  • 12. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Readout function of the neural fingerprint model (Duvenaud et al., 2015): $\hat{y} = R(\{h_v^{(T)} \mid v \in G\}) = f\big(\sum_{v,t} \mathrm{softmax}(W_t h_v^{(t)})\big)$, where $f$ is a neural network and $W_t$ is a learned matrix for each time step; node states from every step are soft-hashed and pooled into the graph representation.
  • 13. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Gated Graph Neural Networks (GG-NN), Li et al. (2016), as an MPNN: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = A_{e_{vw}} h_w^t$, where $A_{e_{vw}}$ is a learned matrix, one for each discrete edge (bond) label; the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$, the Gated Recurrent Unit of Cho et al. (2014), with weights tied across time steps.
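A small numpy sketch of this GG-NN style step, assuming three discrete bond types and a hand-rolled GRU cell; all weights are random placeholders for what would normally be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
bond_types = ["single", "double", "aromatic"]
# One matrix A_e per discrete bond type, as in the GG-NN message function.
A = {b: rng.normal(size=(d, d)) * 0.1 for b in bond_types}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal GRU cell for the vertex update U_t(h_v, m_v) = GRU(h_v, m_v).
Wz, Uz = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wr, Ur = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wh, Uh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def gru_update(h_v, m_v):
    z = sigmoid(Wz @ m_v + Uz @ h_v)
    r = sigmoid(Wr @ m_v + Ur @ h_v)
    h_tilde = np.tanh(Wh @ m_v + Uh @ (r * h_v))
    return (1 - z) * h_v + z * h_tilde

# One atom v with two neighbours: message m_v = sum_w A_{e_vw} h_w, then GRU update.
h_v = rng.normal(size=d)
neigh = [("single", rng.normal(size=d)), ("aromatic", rng.normal(size=d))]
m_v = sum(A[b] @ h_w for b, h_w in neigh)
h_v_next = gru_update(h_v, m_v)
print(h_v_next.shape)
```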
  • 14. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. GG-NN readout (Li et al., 2016): $R = \tanh\big(\sum_{v \in V} \sigma(i(h_v^{(T)}, h_v^{(0)})) \odot \tanh(j(h_v^{(T)}))\big)$, where $i$ and $j$ are neural networks and $\odot$ denotes element-wise multiplication; the $\sigma(i(\cdot))$ term acts as a soft gate deciding which nodes contribute to the graph-level vector.
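A numpy sketch of that gated readout, with single linear layers standing in for the networks $i$ and $j$ (an assumption made here for brevity).

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_atoms, d, d_out = 5, 6, 4
h_T = rng.normal(size=(num_atoms, d))            # final atom states h_v^(T)
h_0 = rng.normal(size=(num_atoms, d))            # initial atom states h_v^(0)

# i(.) and j(.) stand in for the two small neural networks; single linear layers here.
Wi = rng.normal(size=(2 * d, d_out)) * 0.1
Wj = rng.normal(size=(d, d_out)) * 0.1

gate = sigmoid(np.concatenate([h_T, h_0], axis=1) @ Wi)   # sigma(i(h_v^T, h_v^0))
value = np.tanh(h_T @ Wj)                                 # tanh(j(h_v^T))
R = np.tanh((gate * value).sum(axis=0))                   # graph-level representation
print(R.shape)
```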
  • 15. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Deep Tensor Neural Networks (DTNN), Schütt et al., as an MPNN: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = \tanh\big(W^{fc}\big((W^{cf} h_w^t + b_1) \odot (W^{df} e_{vw} + b_2)\big)\big)$, with learned matrices $W^{fc}$, $W^{cf}$, $W^{df}$ and biases $b_1$, $b_2$; the vertex update function is simply additive, $U_t(h_v^t, m_v^{t+1}) = h_v^t + m_v^{t+1}$.
  • 16. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. DTNN readout: $R(\{h_v^{(T)} \mid v \in G\}) = \sum_v \mathrm{NN}(h_v^{(T)})$, i.e. a small neural network is applied to each final node state and the per-node outputs are summed into the graph-level prediction.
  • 17. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Edge-network MPNN (Gilmer et al., 2017): the message function is $M_t(h_v^t, h_w^t, e_{vw}) = A(e_{vw}) h_w^t$, where $A(e_{vw})$ is a neural network that maps the (possibly continuous) edge feature vector $e_{vw}$ to a $d \times d$ matrix; the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$, the same gated update as in GG-NN.
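A short numpy sketch of the edge network alone, assuming a two-layer perceptron for $A(e_{vw})$; the sizes and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_e = 6, 4

# A(e_vw): a small network mapping continuous edge features to a d x d matrix,
# so the message becomes M(h_v, h_w, e_vw) = A(e_vw) @ h_w.
W1 = rng.normal(size=(d_e, 16)) * 0.1
W2 = rng.normal(size=(16, d * d)) * 0.1

def edge_network(e_vw):
    hidden = np.tanh(e_vw @ W1)
    return (hidden @ W2).reshape(d, d)

h_w = rng.normal(size=d)
e_vw = rng.normal(size=d_e)
message = edge_network(e_vw) @ h_w        # contribution of neighbour w to m_v^{t+1}
print(message.shape)
```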
  • 18. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. set2set readout (Vinyals et al., 2016): $R(\{h_v^{(T)} \mid v \in G\}) = \mathrm{set2set}(\{h_v^{(T)} \mid v \in G\})$, a Read-Process-and-Write block that is invariant to the order of its inputs. Treating the final node states as memories $m_i$, the process block iterates $q_t = \mathrm{LSTM}(q^*_{t-1})$, $e_{i,t} = f(m_i, q_t)$, $a_{i,t} = \exp(e_{i,t}) / \sum_j \exp(e_{j,t})$, $r_t = \sum_i a_{i,t} m_i$, and $q^*_t = [q_t\, r_t]$, where $f$ computes a single scalar from $m_i$ and $q_t$ (e.g. a dot product) and the LSTM takes no inputs; the vector retrieved from memory does not change if the memories are shuffled, and the final state serves as the graph representation.
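A numpy sketch of that process block, with a plain tanh-linear recurrence standing in for the input-less LSTM (an assumption made to keep the example short).

```python
import numpy as np

rng = np.random.default_rng(4)
num_atoms, d, steps = 5, 6, 3
memories = rng.normal(size=(num_atoms, d))       # m_i: final atom states h_v^(T)

# A plain linear recurrence stands in for the input-less LSTM of the process block.
W_q = rng.normal(size=(2 * d, d)) * 0.1
q_star = np.zeros(2 * d)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

for t in range(steps):
    q = np.tanh(q_star @ W_q)                    # q_t = "LSTM"(q*_{t-1})
    scores = memories @ q                        # e_{i,t} = <m_i, q_t>
    attn = softmax(scores)                       # a_{i,t}
    r = attn @ memories                          # r_t = sum_i a_{i,t} m_i
    q_star = np.concatenate([q, r])              # q*_t = [q_t, r_t]

graph_embedding = q_star                         # order-invariant readout of the atom set
print(graph_embedding.shape)
```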
  • 19. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Neural reaction fingerprints, Wei et al. (ACS Central Science): a reaction fingerprint, made by concatenating the fingerprints of the reactant and reagent molecules, is fed to a neural network that predicts a probability vector over 17 reaction types; the product is then obtained by applying the SMARTS transformation corresponding to the most probable type to the reactants (their Figure 1). The data set covers 16 basic alkyl halide and alkene reactions generated by simple SMARTS transformations, with 3,400 training and 17,000 test reactions balanced across reaction types; accuracy is the percentage of test reactions whose recorded type is matched. Cross-validation results (their Table 1, accuracy and negative log likelihood):
  baseline fingerprint (length 51): train NLL 0.2727, train acc. 78.8%, test NLL 2.5573, test acc. 24.7%
  Morgan fingerprint (length 891): train NLL 0.0971, train acc. 86.0%, test NLL 0.1792, test acc. 84.5%
  neural fingerprint (length 181): train NLL 0.0976, train acc. 86.0%, test NLL 0.1340, test acc. 85.7%
  The learned neural fingerprint matches Morgan accuracy with a much shorter fingerprint.
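A toy numpy sketch of the classification step in this pipeline: concatenate two fingerprint vectors and score 17 reaction types with a small network. The random 0/1 vectors merely stand in for real Morgan fingerprints (normally computed with a cheminformatics toolkit such as RDKit), and all weights are untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
fp_len, n_reaction_types = 891, 17

# Stand-ins for Morgan fingerprints of the reactant and reagent molecules;
# the reaction fingerprint is their concatenation.
reactant_fp = rng.integers(0, 2, size=fp_len).astype(float)
reagent_fp = rng.integers(0, 2, size=fp_len).astype(float)
reaction_fp = np.concatenate([reactant_fp, reagent_fp])

W1 = rng.normal(size=(2 * fp_len, 128)) * 0.01
W2 = rng.normal(size=(128, n_reaction_types)) * 0.1

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

probs = softmax(np.tanh(reaction_fp @ W1) @ W2)   # probability over 17 reaction types
predicted_type = int(np.argmax(probs))            # type whose SMARTS transform would be applied
print(predicted_type, probs.sum())
```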
  • 20. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Neural-symbolic rule selection, Segler & Waller (Chem. Eur. J. 2017): a neural network trained on millions of known reactions predicts which transformation rule to apply to the input molecule(s), and the same model serves both reaction prediction and retrosynthesis. The difficulty is that tens or hundreds of rules can formally match a molecule (their Figure 1: both a Suzuki and a Kumada coupling match the biaryl moiety of target 1, but the aldehyde in the molecular context conflicts with a Grignard reagent, so the Kumada coupling cannot be used); expert systems must encode such context by hand, whereas the network learns to prioritize rules and estimate selectivity from data. Top-n accuracy checks whether the correct rule is among the n highest-ranked rules. Results:
  Table 1 (103 hand-coded rules), columns Acc / Top 3-Acc / MRR / W. Prec. Reaction prediction: random 0.03 / 0.12 / 0.04 / 0.03; expert system 0.07 / 0.33 / 0.12 / 0.46; logistic regression 0.86 / 0.97 / 0.91 / 0.86; highway network 0.92 / 0.99 / 0.96 / 0.92; FC512 ELU 0.92 / 0.99 / 0.96 / 0.92. Retrosynthesis: random 0.03 / 0.12 / 0.04 / 0.03; expert system 0.05 / 0.30 / 0.06 / 0.11; logistic regression 0.64 / 0.95 / 0.77 / 0.62; highway network 0.77 / 0.98 / 0.86 / 0.77; FC512 ELU 0.78 / 0.98 / 0.87 / 0.78.
  Table 2 (8,720 automatically extracted rules), columns Acc / Top 10-Acc / MRR / W. Prec. Reaction prediction: random 0.00 / 0.00 / 0.00 / 0.00; expert system 0.02 / 0.18 / 0.02 / 0.06; logistic regression 0.41 / 0.65 / 0.49 / 0.31; highway network 0.78 / 0.98 / 0.86 / 0.77; FC512 ELU 0.77 / 0.97 / 0.85 / 0.76. Retrosynthesis: random 0.00 / 0.00 / 0.00 / 0.00; expert system 0.02 / 0.19 / 0.01 / 0.06; logistic regression 0.31 / 0.59 / 0.41 / 0.23; highway network 0.64 / 0.95 / 0.75 / 0.63; FC512 ELU 0.62 / 0.94 / 0.74 / 0.62.
  • 21. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Retrosynthesis planning example (Segler et al., ICLR 2017 workshop track, Figure 1): (a) retrosynthesis analyses the target (ibuprofen) in reverse, in the chemical representation, until a set of building blocks is found that can be bought or is already available; (b) the transformations (productions) applied in each analysis step; (c) synthesis: conduct the reactions in the lab following plan (a); (d) search-tree representation; (e) neural networks determine the actions.
  • 22. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Retrosynthesis as tree search: each node of the search tree is a set of molecules still to be made (S1 = {1}, S2 = {2, CO}, S3 = {3, CO, H2}, S4 = {4, CO, H2, Ac2O}), a neural network over a molecular descriptor (ECFP4) predicts $p(r_n \mid x)$, the probability of each transformation rule given the molecule, and the most probable rules are applied to expand the node. Monte Carlo Tree Search (MCTS) combines tree search with random sampling, and the networks are used in three of the four MCTS steps (selection, expansion, rollout). Selection uses a variation of the UCT function similar to AlphaGo's,
  $\arg\max_{v' \in \mathrm{children}(v)} \frac{Q(v')}{N(v')} + c\, P(a)\, \frac{\sqrt{N(v)}}{1 + N(v')}$,
  where $P(a)$ is the probability the policy network assigns to the action, $N(\cdot)$ are visit counts, $Q(v)$ is the accumulated reward at node $v$, and $c$ controls exploration. The policy network (top-1 accuracy 0.62) also restricts the branching factor to the 50 most promising of roughly 200 possible rules, so the expensive subgraph-isomorphism check of rule applicability is only run for those 50; during rollout, actions are sampled from the policy network.
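A minimal Python sketch of that selection rule; the child statistics, rule names, and the approximation of the parent visit count by the sum of its children's visits are illustrative assumptions.

```python
import math

def select_child(children, c_puct=1.5):
    """Pick the child maximising Q(v')/N(v') + c * P(a) * sqrt(N(v)) / (1 + N(v')),
    the AlphaGo-style UCT variant quoted on this slide."""
    # Approximate the parent visit count N(v) by the sum of its children's visits.
    n_parent = sum(ch["N"] for ch in children)
    def score(ch):
        exploit = ch["Q"] / ch["N"] if ch["N"] > 0 else 0.0
        explore = c_puct * ch["P"] * math.sqrt(n_parent) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

# Each child stores accumulated reward Q, visit count N, and the policy-network
# prior P(a) for the retrosynthetic rule leading to it (hypothetical numbers).
children = [
    {"rule": "Suzuki coupling", "Q": 3.0, "N": 5, "P": 0.40},
    {"rule": "Kumada coupling", "Q": 0.5, "N": 2, "P": 0.10},
    {"rule": "amide formation", "Q": 0.0, "N": 0, "P": 0.05},
]
print(select_child(children)["rule"])
```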
  • 23. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Edit-based candidate ranking, Coley et al. (ACS Central Science): the framework combines forward enumeration of candidate products with a learned ranking model trained to maximize the probability assigned to the recorded experimental outcome (their Figure 1). A candidate reaction is represented by the edits it requires at the reaction center: loss or gain of a hydrogen is encoded by 32 easy-to-compute features of that reactant atom, $a_i \in \mathbb{R}^{32}$, and loss or gain of a bond by the concatenation $[a_i, b_{ij}, a_j] \in \mathbb{R}^{68}$ of the two atom feature vectors and four bond features; the atom features are rapidly calculable structural and electronic descriptors of the local chemical environment. Each edit is embedded by an edit-type-specific network, the edit embeddings are summed into a candidate representation, and the candidate scores are softmax-normalised (with $k_B T = 1$) into probabilities. On USPTO reactions with 5-fold cross-validation, results are (their Table 1; top-n is the percentage of examples where the recorded product ranks within the top n candidates):
  random guess: loss 5.46, acc. 0.8%, top-3 2.3%, top-5 3.8%, top-10 7.6%
  baseline (fingerprint-based): loss 3.28, acc. 33.3%, top-3 48.2%, top-5 55.8%, top-10 65.9%
  edit-based: loss 1.34, acc. 68.5%, top-3 84.8%, top-5 89.4%, top-10 93.6%
  hybrid (edit-based + product fingerprints): loss 1.21, acc. 71.8%, top-3 86.7%, top-5 90.8%, top-10 94.6%
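A small Python sketch of the final ranking step, softmax-normalising raw candidate scores into probabilities; the scores below are invented for illustration, not outputs of the paper's model.

```python
import numpy as np

def rank_candidates(scores):
    """Softmax-normalise raw candidate-outcome scores (treated like pseudo
    free energies with kT = 1) and return candidates ranked by probability."""
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    return order, probs[order]

# Hypothetical scores produced by a scoring network for 4 candidate products.
order, probs = rank_candidates([2.1, -0.3, 0.8, -1.5])
print(order, probs.round(3))
```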
  • 24. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n • • •
  • 25. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Mechanistic reaction prediction (Journal of Chemical Information and Modeling), overall framework (their Figure 2): (a) the user inputs the reactants and conditions; (b) potential electron donors and acceptors are identified using coarse approximations of electron-filled and -unfilled MOs; (c) highly sensitive reactive-site classifiers are trained and used to filter out …
  • 26. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n y Fx • R pt y • y CG R pt C FoKe • ptn cln N C R i E R i F s 1 • ptn cln NF s v y r ah • g + 1 aoKFr • 1 y
  • 27. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Sequence-to-sequence reaction prediction (SMILES-to-SMILES with attention): the model translates reactant/reagent SMILES strings into the product SMILES, and attention captures long-range dependencies between tokens, since functional groups, solvents, or catalysts far from the reaction center can still affect the outcome. For example, the network learns to focus on the C[O-] molecule, map its [O-] to the product oxygen, and ignore the Br that is replaced (reaction 120 from Jin's USPTO test set: Brc1cncc(Br)c1.C[O-]>CN(C)C=O.[Na+]>COc1cncc(Br)c1). Results:
  Jin's USPTO test set: 38,648 reactions, BLEU 95.9, ROUGE 96.0, top-1 accuracy 83.2%
  Lowe's test set: 50,258 reactions, BLEU 90.3, ROUGE 90.9, top-1 accuracy 65.4%
  Comparison with the Weisfeiler-Lehman difference networks (WLDN) of Jin et al. on their USPTO test set, counting the 1,352 multiple-product reactions as false predictions for the seq2seq model: WLDN top-1 74.0, top-3 86.7, top-5 89.5; seq2seq model top-1 80.3, top-2 84.7, top-3 86.2, top-5 87.5 (%). The seq2seq model outperforms WLDN on top-1 by 6.3 points but, being trained to predict only the single best outcome, is beaten on top-3 and top-5; decoding takes about 25 ms per reaction with a beam search of 10.
  • 28. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n C n / C n n
  • 29.–34. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Slides 29–34: step-by-step illustration of the Weisfeiler-Lehman Difference Network (WLDN) on a small example graph with atoms u1–u5.]
  • 35. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. The Weisfeiler-Lehman Network (WLN) expressed in the MPNN framework: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = \tau(V\,\mathrm{CONCAT}(h_w^t, e_{vw}))$ and the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \tau(U_1 h_v^t + U_2 m_v^{t+1})$, where $\tau$ is a non-linearity and $V$, $U_1$, $U_2$ are learned matrices shared across layers.
  • 36. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. WL Network (Jin et al.): the discrete WL relabeling process does not directly generalize to continuous feature vectors, so a neural network embeds the computations of the WL test. A node $v$ with neighbor nodes $N(v)$, node features $f_v$, and edge features $f_{uv}$ is "relabeled" according to $r(v) = \tau(U_1 f_v + U_2 \sum_{u \in N(v)} \tau(V[f_u, f_{uv}]))$, and this operation is applied iteratively to obtain context-dependent atom vectors $h_v^{(l)} = \tau\big(U_1 h_v^{(l-1)} + U_2 \sum_{u \in N(v)} \tau(V[h_u^{(l-1)}, f_{uv}])\big)$ for $1 \le l \le L$, with $h_v^{(0)} = f_v$ and $U_1$, $U_2$, $V$ shared across layers. The final atom representations mimic the set comparison of the WL isomorphism test, $c_v = \sum_{u \in N(v)} W^{(0)} h_u^{(L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(L)}$, which matches each rank-1 edge tensor $h_u^{(L)} \otimes f_{uv} \otimes h_v^{(L)}$ against a learned set of reference edges, i.e. $c_v[k] = \sum_{u \in N(v)} \langle W^{(0)}[k] \otimes W^{(1)}[k] \otimes W^{(2)}[k],\ h_u^{(L)} \otimes f_{uv} \otimes h_v^{(L)} \rangle$, where $W[k]$ is the $k$-th row of $W$. The resulting $c_v$ captures the local chemical environment of the atom against learned reference environments, and the whole-graph representation is simply $c_G = \sum_v c_v$.
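A compact numpy sketch of this continuous relabeling and the rank-1 set comparison, run on a random 4-atom chain; the dimensions and all weight matrices are illustrative placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
num_atoms, d, d_e, L = 4, 6, 3, 2
bonds = {(0, 1), (1, 2), (2, 3)}                 # a 4-atom chain
f_edge = {e: rng.normal(size=d_e) for e in bonds}
f_atom = rng.normal(size=(num_atoms, d))

U1 = rng.normal(size=(d, d)) * 0.1
U2 = rng.normal(size=(d, d)) * 0.1
V = rng.normal(size=(d + d_e, d)) * 0.1
W0, W2m = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W1m = rng.normal(size=(d_e, d)) * 0.1

def nbrs(v):
    return [u for u in range(num_atoms) if (u, v) in bonds or (v, u) in bonds]

def edge(u, v):
    return f_edge[(u, v)] if (u, v) in f_edge else f_edge[(v, u)]

tau = np.tanh
h = f_atom.copy()
for _ in range(L):                               # continuous WL "relabeling"
    new_h = np.zeros_like(h)
    for v in range(num_atoms):
        agg = sum(tau(np.concatenate([h[u], edge(u, v)]) @ V) for u in nbrs(v))
        new_h[v] = tau(U1 @ h[v] + U2 @ agg)
    h = new_h

# Final atom representation c_v: compare each edge against learned reference environments.
c = np.zeros((num_atoms, d))
for v in range(num_atoms):
    for u in nbrs(v):
        c[v] += (W0 @ h[u]) * (edge(u, v) @ W1m) * (W2m @ h[v])
print(c.shape)
```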
  • 37. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Figure: the Weisfeiler-Lehman Difference Network (WLDN) applied to the example graph with atoms u1–u5.]
  • 38. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Finding reaction centers with the WLN (Jin et al., Sec. 3.1.2): two models predict the pairwise reactivity label $y_{uv}$ from the WLN atom representations $c_u$, $c_v$. The local model scores each atom pair directly; the global model additionally incorporates distal chemical effects (atoms outside the reaction center, e.g. reagents, can be necessary for the reaction to occur) through an attention mechanism, as detailed on the following slides. Both models are trained to minimize $L(T) = -\sum_{R \in T} \sum_{u \neq v \in R} y_{uv} \log(s_{uv}) + (1 - y_{uv}) \log(1 - s_{uv})$, with each of the $O(N^2)$ pair labels predicted independently, since higher-order dependencies between pairs would be prohibitively expensive; independent prediction nevertheless works well. The top-$K$ atom pairs by predicted reactivity are designated the reaction center, and candidate products are enumerated from all bond-configuration changes within that set, pruned by valence and connectivity constraints, giving a more compact candidate set than template enumeration without sacrificing coverage. A small numpy sketch of the pairwise scoring appears after slide 41.
  • 39. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Local model: the reactivity score of an atom pair $(u, v)$ is computed from the WLN atom representations by another small network, $s_{uv} = \sigma\big(u^\top \tau(M_a c_u + M_a c_v + M_b b_{uv})\big)$, where $\sigma$ is the sigmoid function and $b_{uv}$ is an additional feature vector encoding auxiliary information about the pair, such as whether the two atoms are in different molecules or which type of bond connects them.
  • 40. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Diagram: the pairwise reactivity score $s_{uv}$ is computed for every atom pair of the reactants from the WLN atom representations $c_u$, $c_v$ and the pair feature vector $b_{uv}$.]
  • 41. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: global model, attention-weighted context vectors c̃_u over all reactant atoms feeding the reactivity score; the paper excerpt repeated on this slide is omitted]
  • 42. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: Weisfeiler-Lehman Difference Network (WLDN), candidate product graphs over atoms u1...u5 derived from the reactant graph]
  • 43. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: where candidate ranking sits in the pipeline; figure: overview of the approach, (1) a model is trained to identify pairwise atom interactions]
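As context for the ranking step, the sketch below illustrates the candidate-generation procedure described on slide 39 (take the top-K scored pairs as the reaction center, enumerate bond changes, prune with constraints). The pair scores, bond orders, and valence limits are toy assumptions, and the connectivity constraint is omitted for brevity.

```python
# Toy sketch of candidate generation (illustration only, not the paper's code).
from itertools import product

def enumerate_candidates(pair_scores, current_bonds, max_valence, K=6,
                         bond_orders=(0, 1, 2, 3)):
    """pair_scores: {(u, v): predicted reactivity score}
       current_bonds: {(u, v): bond order in the reactants}
       max_valence: {atom index: maximum allowed total bond order}"""
    center = sorted(pair_scores, key=pair_scores.get, reverse=True)[:K]
    candidates = []
    for orders in product(bond_orders, repeat=len(center)):
        edits = {p: o for p, o in zip(center, orders)
                 if o != current_bonds.get(p, 0)}          # keep only real changes
        if not edits:
            continue
        bonds = dict(current_bonds)
        bonds.update(dict(zip(center, orders)))
        # Valence constraint: total bond order per atom must not exceed its maximum.
        valence = {}
        for (u, v), o in bonds.items():
            valence[u] = valence.get(u, 0) + o
            valence[v] = valence.get(v, 0) + o
        if all(valence[a] <= max_valence.get(a, 4) for a in valence):
            candidates.append(edits)
    return candidates

scores = {(0, 1): 0.9, (1, 2): 0.7, (0, 2): 0.2}            # made-up reactivity scores
cands = enumerate_candidates(scores, current_bonds={(0, 1): 1},
                             max_valence={0: 4, 1: 4, 2: 4}, K=2)
print(len(cands), "candidate edit sets")
```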
  • 44. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: WLDN candidate product graphs, as on slide 42]
  • 45. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: difference vectors over atoms v1...v5 summed into a single candidate representation] For each candidate pair $(r, p_i)$, an atom-level difference vector is formed, $d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}$. The simpler ranking model pools these by summation and scores the candidate with one more layer, $s(p_i) = \mathbf{u}^T \tau(M \sum_{v \in p_i} d_v^{(p_i)})$ (9).
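A minimal sketch of this naive sum-pooling scorer, assuming $\tau$ = ReLU, random parameters, and made-up atom vectors; shapes and names are illustrative only.

```python
# Sketch of the sum-pooling ranking baseline (eq. (9)), not the authors' code.
import numpy as np

rng = np.random.default_rng(1)
n_atoms, d = 5, 8
c_reactant = rng.normal(size=(n_atoms, d))   # atom vectors of the reactants (atom-mapped)
c_candidate = rng.normal(size=(n_atoms, d))  # atom vectors of one candidate product

M = rng.normal(size=(d, d))
u_vec = rng.normal(size=d)

diff = c_candidate - c_reactant              # d_v^(p_i): nonzero mainly near the reaction center
score = u_vec @ np.maximum(M @ diff.sum(axis=0), 0.0)   # s(p_i) = u^T tau(M sum_v d_v)
print(float(score))
```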
  • 46. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: candidate products generated from the predicted reaction center, with per-pair bond labels y ∈ {0,1,2,3} and difference vectors d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}]
  • 47. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: difference graph D(r, p_i) with atom features replaced by difference vectors]
Candidate ranking: each candidate pair $(r, p_i)$ is scored by one of two models. The first model naively sums the difference vectors of all atom representations obtained from a WLN run on the candidate; the second, improved model, the WLDN, instead operates on a difference graph built from these vectors. We define the difference vector $d_v^{(p_i)}$ pertaining to atom $v$ as
$$d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}; \qquad s(p_i) = \mathbf{u}^T \tau\Big(M \sum_{v \in p_i} d_v^{(p_i)}\Big) \quad (9)$$
Recall that the reactants and products are atom-mapped, so we can use $v$ to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each $(r, p_i)$ pair. This vector is then fed into another neural network to score the candidate product $p_i$.
Weisfeiler-Lehman Difference Network (WLDN). Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph $D(r, p_i)$ is defined as a molecular graph which has the same atoms and bonds as $p_i$, with atom $v$'s feature vector replaced by $d_v^{(p_i)}$. Operating on the difference graph has several benefits. First, in $D(r, p_i)$, atom $v$'s feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, $D(r, p_i)$ explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector by applying a separately parameterized WLN on top of $D(r, p_i)$:
$$h_v^{(p_i, l)} = \tau\Big(U_1 h_v^{(p_i, l-1)} + U_2 \sum_{u \in N(v)} \tau\big(V[h_u^{(p_i, l-1)}, f_{uv}]\big)\Big), \quad 1 \le l \le L \quad (10)$$
$$d_v^{(p_i, L)} = \sum_{u \in N(v)} W^{(0)} h_u^{(p_i, L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(p_i, L)} \quad (11)$$
where $h_v^{(p_i, 0)} = d_v^{(p_i)}$. The final score of $p_i$ is $s(p_i) = \mathbf{u}^T \tau(M \sum_{v \in p_i} d_v^{(p_i, L)})$.
Training. Both models are trained to minimize the softmax log-likelihood objective over the scores $\{s(p_0), s(p_1), \dots, s(p_m)\}$, where $s(p_0)$ corresponds to the target.
4 Experiments. Data: as a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes. In addition, for comparison purposes we report results on the subset of 15K reactions from this dataset (referred to as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing.
Setup for Reaction Center Identification: the output of this component consists of the K atom pairs with the highest reactivity scores.
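The following is a rough NumPy sketch of one WLDN pass over the difference graph (eqs. (10)-(11)), assuming $\tau$ = ReLU, a toy chain molecule, and randomly initialised parameters; it illustrates the equations and is not the authors' implementation.

```python
# Toy WLDN forward pass over a difference graph D(r, p_i).
import numpy as np

rng = np.random.default_rng(2)
n, d, d_bond, L = 5, 8, 3, 2
diff = rng.normal(size=(n, d))               # h_v^(p_i, 0) = d_v^(p_i)
f = rng.normal(size=(n, n, d_bond))          # f_uv: bond features of the candidate
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # assumed chain molecule

U1 = rng.normal(size=(d, d)); U2 = rng.normal(size=(d, d))
V = rng.normal(size=(d, d + d_bond))
W0 = rng.normal(size=(d, d)); W1 = rng.normal(size=(d, d_bond)); W2 = rng.normal(size=(d, d))
M = rng.normal(size=(d, d)); u_vec = rng.normal(size=d)
relu = lambda x: np.maximum(x, 0.0)

h = diff.copy()
for _ in range(L):                           # eq. (10): message passing on the difference graph
    h_new = np.zeros_like(h)
    for v in range(n):
        msg = sum(relu(V @ np.concatenate([h[u], f[u, v]])) for u in neighbors[v])
        h_new[v] = relu(U1 @ h[v] + U2 @ msg)
    h = h_new

d_final = np.zeros_like(h)                   # eq. (11): elementwise product of three projections
for v in range(n):
    d_final[v] = sum((W0 @ h[u]) * (W1 @ f[u, v]) * (W2 @ h[v]) for u in neighbors[v])

score = u_vec @ relu(M @ d_final.sum(axis=0))   # s(p_i) = u^T tau(M sum_v d_v^(p_i, L))
print(float(score))
# Candidates p_0 ... p_m would be ranked by these scores, trained with a
# softmax log-likelihood objective where p_0 is the recorded product.
```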
We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set. The model features reflect basic chemical properties of atoms and bonds. Atom-level features include the elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include the bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring. Both our local and global models are built upon a Weisfeiler-Lehman Network with a fixed unrolled depth.
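A small sketch of how the coverage metric above could be computed, assuming the predicted top-K pairs and the true reaction-center pairs are available as sets of atom-index pairs; the data below is a toy assumption.

```python
# Coverage: fraction of reactions whose true reaction-center pairs are all
# contained in the model's top-K predicted pairs (illustrative sketch).
def coverage(predicted_topk_pairs, true_center_pairs):
    """Both arguments: one set of (u, v) atom pairs per reaction."""
    hits = sum(true <= pred for pred, true in zip(predicted_topk_pairs, true_center_pairs))
    return hits / len(true_center_pairs)

pred = [{(1, 2), (2, 3), (4, 5)}, {(1, 2), (7, 8)}]
true = [{(1, 2), (2, 3)},         {(3, 4)}]
print(coverage(pred, true))   # 0.5 for this toy example
```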
  • 48. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: WLDN applied to the difference graph of each candidate; per-pair bond labels y ∈ {0,1,2,3}]
  • 49. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 50. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 51. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 52. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Figure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle). Reactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue circle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the local model does not even include it in the top ten predictions. The attention map of the global model shows that atoms 1, 25, and 26 were determinants of atom 4's predicted reactivity.

Table 1: Results on USPTO-15K and USPTO datasets.
(a) Reaction center prediction performance. Coverage is reported by picking the top K (K = 6, 8, 10) reactivity pairs; |θ| is the number of model parameters.
  USPTO-15K   Method   |θ|     K=6    K=8    K=10
              Local    572K    80.1   85.0   87.7
              Local    1003K   81.6   86.1   89.1
              Global   756K    86.7   90.1   92.2
  USPTO       Local    572K    83.0   87.2   89.6
              Local    1003K   82.4   86.7   89.1
              Global   756K    89.8   92.0   93.3
  Avg. num. of candidates (USPTO)
              Template   -     482.3 out of 5006
              Global     -     60.9   246.5  1076

(b) Candidate ranking performance. Precision at ranks 1, 3, 5 is reported; (*) denotes that the true product was added if not covered by the previous stage.
  USPTO-15K   Method        Cov.    P@1    P@3    P@5
              Coley et al.  100.0   72.1   86.6   90.7
              WLN           90.1    74.9   84.6   86.3
              WLDN          90.1    76.7   85.6   86.8
              WLN (*)       100.0   81.4   92.5   94.8
              WLDN (*)      100.0   84.1   94.1   96.1
  USPTO       WLN           92.0    73.5   86.1   89.0
              WLDN          92.0    74.0   86.7   89.5
              WLN (*)       100.0   76.7   91.0   94.6
              WLDN (*)      100.0   77.8   91.9   95.4
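For reference, a small sketch of the precision-at-K measure reported in Table 1(b): a reaction counts as correct at rank K if the recorded product appears among the top-K ranked candidates. The ranked lists below are invented toy data.

```python
# Precision@K over ranked candidate products (illustrative sketch).
def precision_at_k(ranked_candidates, true_products, k):
    """ranked_candidates: one ranked list (best first) per reaction;
       true_products: the recorded product for each reaction."""
    correct = sum(truth in ranked[:k]
                  for ranked, truth in zip(ranked_candidates, true_products))
    return 100.0 * correct / len(true_products)

ranked = [["P1", "P2", "P3"], ["Q2", "Q1", "Q3"]]
truth = ["P1", "Q1"]
for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(ranked, truth, k):.1f}")
```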
  • 53. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: discussion of reaction center identification (Table 1a); accuracy is compared against the top-performing template-based approach, which employs frequency-based heuristics to construct reaction templates. The table excerpt repeated on this slide is omitted.]
  • 54. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 55. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: discussion of candidate ranking (Table 1b), WLN vs. WLDN precision at ranks 1, 3, 5 on USPTO-15K and USPTO. The table excerpt repeated on this slide is omitted.]
  • 56. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Human Evaluation Setup: here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups.
Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first eight are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top-1 accuracy.
  Chemists (top-1 accuracy, %): 56.3, 50.0, 40.0, 63.8, 66.3, 65.0, 40.0, 58.8, 25.0, 16.3
  Our Model: 69.1
Figure 5: Details of human performance. (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a.
  • 57. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 58. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
n Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network
• Jin, Wengong, et al. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems.
n Template-based
• Wei, Jennifer, David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions.
• Segler, Marwin H., and Mark Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction.
• Segler, Marwin, Mike Preuss, and Mark Waller. Towards "AlphaChem": chemical synthesis planning with tree search and deep neural network policies.
• Coley, Connor, et al. Prediction of organic reaction outcomes using machine learning.
n Template-free
• Kayala, Matthew A., et al. Learning to predict chemical reactions.
• Schwaller, Philippe, et al. "Found in Translation": predicting outcome of complex organic chemistry reactions using neural sequence-to-sequence models. arXiv preprint.
  • 59. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
n Graph representation
• Rogers, David, and Mathew Hahn. Extended-connectivity fingerprints.
• Shervashidze, Nino, et al. Weisfeiler-Lehman graph kernels.
n Graph NN
• Gilmer, Justin, et al. Neural message passing for quantum chemistry.
• Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints.
• Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks.
• Schütt, Kristof, et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications.
• Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. ICLR.