Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
AI System Dept.
System & Design Management Unit
Kazuki Fujikawa
Figure 2: Overview of our approach. (1) we train a model to identify pairwise atom interactions
Figure 1: An example reaction where the reaction center is (27,28), (7,27), and (8,27), highlighted in
green. Here bond (27,28) is deleted and (7,27) and (8,27) are connected by aromatic bonds to form a
new ring. The corresponding reaction template consists of not only the reaction center, but nearby
functional groups that explicitly specify the context.
template involves graph matching and this makes examining large numbers of templates prohibitively
https://www.cc.gatech.edu/~lsong/teaching/8803ML/lecture22.pdf
http://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf
Weisfeiler-Lehman algorithm
Use multiset labels to capture neighborhood structure in a graph.
If two graphs are the same, then the multiset labels should be the same as well.

Weisfeiler-Lehman algorithm II
Relabel the graph and construct new multiset labels; check whether the new labels are the same.
If the new multiset labels are not the same, stop the algorithm and declare the two graphs are not the same.
Effectively it is unrolling the graph into trees rooted at each node and using multiset labels to identify these trees.

Comparing two graphs G and G' via their label histograms:
φ(G)  = [6, 3, 2, 1, 0, 1, 2, 2, 0, 1]
φ(G') = [6, 3, 2, 1, 2, 1, 0, 0, 2, 1]
K(G, G') = ⟨φ(G), φ(G')⟩ = 52
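To make the relabeling and kernel computation above concrete, here is a minimal Python sketch; the hash-based label compression, the adjacency-dict graph encoding, and the triangle/path toy example are illustrative choices, not part of the original algorithm description.

```python
# Minimal sketch of Weisfeiler-Lehman relabeling and the WL subtree kernel (illustrative only).
from collections import Counter

def wl_relabel(adj, labels):
    """One WL iteration: each node's new label = (old label, sorted multiset of neighbor labels)."""
    new_labels = {}
    for v in adj:
        multiset = tuple(sorted(labels[u] for u in adj[v]))
        new_labels[v] = hash((labels[v], multiset))  # compress the pair into a single label
    return new_labels

def wl_feature(adj, labels, iterations=2):
    """Histogram of all labels observed over the iterations: the phi(G) vector as a Counter."""
    phi = Counter(labels.values())
    for _ in range(iterations):
        labels = wl_relabel(adj, labels)
        phi.update(labels.values())
    return phi

def wl_kernel(phi_g, phi_h):
    """K(G, G') = <phi(G), phi(G')>, computed over the shared label set."""
    return sum(phi_g[k] * phi_h[k] for k in phi_g.keys() & phi_h.keys())

# Toy usage: a triangle vs. a path, both with identical initial labels.
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
init = {0: 'C', 1: 'C', 2: 'C'}
print(wl_kernel(wl_feature(tri, init), wl_feature(path, init)))
```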
Extended-connectivity / hashed molecular fingerprints:
https://chembioinfo.com/2011/10/30/revisiting-molecular-hashed-fingerprints/
http://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf
Neural Message Passing for Quantum Chemistry [Gilmer+, ICML 2017]
Predicting quantum-mechanical targets (E, ω0, ...) of a molecule directly from its graph:
DFT: ~10^3 seconds per molecule → Message Passing Neural Net: ~10^-2 seconds
MPNN framework [Gilmer+, ICML 2017]
Message passing phase (T steps): the hidden state h_v^t of each node is updated from messages sent by its neighbors,
    m_v^{t+1} = Σ_{w ∈ N(v)} M_t(h_v^t, h_w^t, e_vw)    (1)
    h_v^{t+1} = U_t(h_v^t, m_v^{t+1})    (2)
Readout phase: a feature vector for the whole graph is computed with a readout function R,
    ŷ = R({h_v^T | v ∈ G})    (3)
The message functions M_t, vertex update functions U_t, and readout function R are all learned differentiable functions. R operates on the set of node states and must be invariant to permutations of the node states in order for the MPNN to be invariant to graph isomorphism. One could also learn edge features by introducing hidden states h_{e_vw}^t for all edges and updating them analogously to equations 1 and 2.

Convolutional Networks for Learning Molecular Fingerprints [Duvenaud+, NIPS 2015] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = CONCAT(h_w^t, e_vw)
• Update function: U_t(h_v^t, m_v^{t+1}) = σ(H_t^{deg(v)} m_v^{t+1}), where H_t^{deg(v)} is a learned matrix for each time step t and vertex degree deg(v)
[Figure: node v with neighbors u1, u2 (initial states h_v^(0), h_u1^(0), h_u2^(0)); the messages M_t(h_v^t, h_ui^t, e_vui) are summed (Σ) and passed to the update function U_t(h_v^t, m_v^{t+1}).]
Convolutional Networks for Learning Molecular Fingerprints [Duvenaud+, NIPS 2015] (readout):
• Readout function: R({h_v^T | v ∈ G}) = f(Σ_{v,t} softmax(W_t h_v^t)), where f is a neural network and W_t are learned per-step matrices
[Figure: the per-layer node states h_v^(0), ..., h_v^(T) of every atom contribute a softmax term; the terms are summed over atoms and layers to give the graph-level prediction ŷ = R({h_v^T | v ∈ G}).]
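The message/update/readout decomposition in equations (1)-(3) can be sketched in a few lines of NumPy. The linear message, sigmoid update, and plain sum readout below are stand-ins chosen for brevity (loosely following the Duvenaud-style concatenation message), not the exact functions of any of the cited models; all weights and the toy graph are random.

```python
# Illustrative NumPy sketch of MPNN message passing (eqs. 1-2) and a sum readout (eq. 3).
import numpy as np

rng = np.random.default_rng(0)
d, d_e = 4, 2                                              # node / edge feature sizes
h = {v: rng.normal(size=d) for v in range(3)}              # h_v^t
edges = {(0, 1): rng.normal(size=d_e), (1, 2): rng.normal(size=d_e)}
adj = {0: [1], 1: [0, 2], 2: [1]}
W_msg = rng.normal(size=(d, d + d_e))                      # message weights (linear, for brevity)
W_upd = rng.normal(size=(d, d))                            # update weights

def edge_feat(v, w):
    # undirected edges: look up (v, w) or (w, v)
    return edges[(v, w)] if (v, w) in edges else edges[(w, v)]

def message(h_w, e_vw):
    # M_t(h_v, h_w, e_vw): a linear map of CONCAT(h_w, e_vw)
    return W_msg @ np.concatenate([h_w, e_vw])

def update(h_v, m_v):
    # U_t(h_v, m_v^{t+1}): a sigmoid of a linear map of the aggregated message
    return 1.0 / (1.0 + np.exp(-(W_upd @ m_v)))

def mpnn_step(h):
    new_h = {}
    for v in adj:
        m_v = sum(message(h[w], edge_feat(v, w)) for w in adj[v])   # eq. (1)
        new_h[v] = update(h[v], m_v)                                # eq. (2)
    return new_h

for _ in range(3):                      # T message-passing steps
    h = mpnn_step(h)
y_hat = sum(h.values())                 # simplest permutation-invariant readout, eq. (3)
print(y_hat)
```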
Gated Graph Neural Networks (GGNN) [Li+, ICLR 2016] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = A_{e_vw} h_w^t, where A_{e_vw} is a learned matrix for each edge label (bond type)
• Update function: U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), with the same update weights shared across time steps
[Figure: the messages A_{e_vw} h_w^t from the neighbors u1, u2 are summed (Σ) and fed to the GRU update U_t(h_v^t, m_v^{t+1}).]
Gated Graph Neural Networks (GGNN) [Li+, ICLR 2016] (readout):
• Readout function: R({h_v^T | v ∈ G}) = tanh(Σ_v σ(i(h_v^(T), h_v^(0))) ⊙ tanh(j(h_v^(T), h_v^(0))))
• i and j are neural networks that take the concatenation of h_v^(T) and h_v^(0) as input
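A minimal sketch of the GGNN-style propagation step described above, assuming one learned matrix per bond type for the messages and a hand-rolled GRU cell for the update; the weights, bond types, and toy molecule are invented for illustration.

```python
# Sketch of a GGNN-style step: per-edge-type message matrices and a GRU update.
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = {"single": rng.normal(size=(d, d)), "double": rng.normal(size=(d, d))}   # A_{e_vw} per bond type
Wz, Uz, Wr, Ur, Wh, Uh = (rng.normal(size=(d, d)) for _ in range(6))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_update(h, m):
    """U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), with the aggregated message as the 'input'."""
    z = sigmoid(Wz @ m + Uz @ h)
    r = sigmoid(Wr @ m + Ur @ h)
    h_tilde = np.tanh(Wh @ m + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

# toy molecule: atoms 0-1 joined by a single bond, atoms 1-2 by a double bond
h = {v: rng.normal(size=d) for v in range(3)}
bonds = {(0, 1): "single", (1, 2): "double"}
adj = {0: [1], 1: [0, 2], 2: [1]}

def bond_type(v, w):
    return bonds[(v, w)] if (v, w) in bonds else bonds[(w, v)]

for _ in range(3):  # propagation steps; update weights are tied across steps, as in GGNN
    h = {v: gru_update(h[v], sum(A[bond_type(v, w)] @ h[w] for w in adj[v])) for v in adj}
print(h[0])
```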
Deep Tensor Neural Networks (DTNN) [Schütt+, 2017] as an MPNN:
• Message function: M_t(h_v^t, h_w^t, e_vw) = tanh(W^fc((W^cf h_w^t + b1) ⊙ (W^df e_vw + b2)))
• W^fc, W^cf, W^df are learned matrices and b1, b2 learned bias vectors
• Update function: U_t(h_v^t, m_v^{t+1}) = h_v^t + m_v^{t+1}
[Figure: the gated messages from neighbors u1, u2 are summed (Σ) and added to the current node state by the update function.]
Deep Tensor Neural Networks (DTNN) [Schütt+, 2017] (readout):
• Readout function: R({h_v^T | v ∈ G}) = Σ_v NN(h_v^T), where NN is a small neural network applied independently to each final node state
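A small sketch of the DTNN-style interaction just described: the message gates the neighbor state by the edge (e.g. distance) features, and the update simply adds the aggregated message to the node state. Shapes and weights are arbitrary placeholders.

```python
# Sketch of a DTNN-style message and additive update (illustrative shapes and weights).
import numpy as np

rng = np.random.default_rng(2)
d, d_e = 4, 3
W_cf, b1 = rng.normal(size=(d, d)), rng.normal(size=d)
W_df, b2 = rng.normal(size=(d, d_e)), rng.normal(size=d)
W_fc = rng.normal(size=(d, d))

def message(h_w, e_vw):
    # M(h_v, h_w, e_vw) = tanh(W_fc ((W_cf h_w + b1) * (W_df e_vw + b2)))
    return np.tanh(W_fc @ ((W_cf @ h_w + b1) * (W_df @ e_vw + b2)))

def update(h_v, m_v):
    # U(h_v, m_v^{t+1}) = h_v + m_v^{t+1}
    return h_v + m_v

h_v, h_w = rng.normal(size=d), rng.normal(size=d)
e_vw = rng.normal(size=d_e)          # e.g. an interatomic distance expanded in radial basis functions
print(update(h_v, message(h_w, e_vw)))
```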
MPNN with edge networks ("enn-s2s") [Gilmer+, ICML 2017]:
• Message function: M_t(h_v^t, h_w^t, e_vw) = A(e_vw) h_w^t, where A(e_vw) is a neural network (edge network) that maps the edge feature vector e_vw to a d×d matrix
• Update function: U_t(h_v^t, m_v^{t+1}) = GRU(h_v^t, m_v^{t+1}), the Gated Recurrent Unit of Cho et al. (2014), with weights tied across time steps
[Figure: the messages A(e_vw) h_w^t from the neighbors u1, u2 are summed (Σ) and fed to the GRU update.]
MPNN with edge networks ("enn-s2s") [Gilmer+, ICML 2017] (readout):
• Readout function: R({h_v^T | v ∈ G}) = set2set({h_v^T | v ∈ G}), the order-invariant Read-Process-and-Write (set2set) module of Vinyals et al. (2015), which can also be seen as a special case of a Memory Network or a Neural Turing Machine.
Content-based attention over the node states (memories m_i):
    q_t = LSTM(q*_{t-1})    (3)
    e_{i,t} = f(m_i, q_t)    (4)
    a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t})    (5)
    r_t = Σ_i a_{i,t} m_i    (6)
    q*_t = [q_t, r_t]    (7)
where i indexes the memory vectors m_i (typically the cardinality of the input set X), q_t is a query vector used to read r_t from the memories, f computes a single scalar from m_i and q_t (e.g. a dot product), and LSTM is an LSTM that takes no inputs and evolves the state q*_t, formed by concatenating the query q_t with the attention readout r_t. Since the attention is content-based, the retrieved vector does not change if the memory is shuffled, which gives the required invariance to the ordering of the input set.
[Figure 1 (Vinyals et al.): The Read-Process-and-Write model.]
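One content-based attention step from the Read-Process-and-Write readout (eqs. (4)-(6) above) can be sketched as follows; for brevity the input-less LSTM that evolves the query q*_t is replaced by a fixed random query, so this is only the inner attention read, not the full set2set module.

```python
# Sketch of one content-based attention read over a set of memories (node states).
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 5
M = rng.normal(size=(n, d))          # memories m_i: e.g. the final node states h_v^T
q = rng.normal(size=d)               # query q_t (would come from an input-less LSTM)

e = M @ q                            # e_{i,t} = f(m_i, q_t), here a dot product
a = np.exp(e - e.max())
a = a / a.sum()                      # a_{i,t}: softmax over memory slots
r = a @ M                            # r_t = sum_i a_{i,t} m_i
q_star = np.concatenate([q, r])      # q_t* = [q_t, r_t], fed back to the LSTM on the next step
print(q_star)
```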
Neural Networks for the Prediction of Organic Chemistry Reactions [Wei+, ACS Cent. Sci. 2016]
• A reaction fingerprint (concatenated fingerprints of reactants and reagent) is classified into one of 17 reaction types; the SMARTS transformation of the most probable type is then applied to the reactants to predict the product.
Figure 1. An overview of our method for predicting reaction type and products. A reaction fingerprint, made from concatenating the fingerprints of
reactant and reagent molecules, is the input for a neural network that predicts the probability of 17 different reaction types, represented as a reaction
type probability vector. The algorithm then predicts a product by applying to the reactants a transformation that corresponds to the most probable
reaction type. In this work, we use a SMARTS transformation for the final step.
Figure 2. Cross validation results for (a) baseline fingerprint, (b) Morgan reaction fingerprint, and (c) neural reaction fingerprint. A confusion matrix
shows the average predicted probability for each reaction type. In these confusion matrices, the predicted reaction type is represented on the vertical
axis, and the correct reaction type is represented on the horizontal axis. These figures were generated on the basis of code from Schneider et al.43
large libraries of synthetically accessible compounds in the areas
of molecular discovery,44
metabolic networks,45
drug discov-
ery,46
and discovery of one-pot reactions.47
In our algorithm,
we use SMARTS transformation for targeted prediction of
product molecules from reactants. However, this method can
be replaced by any method that generates product molecule
graphs from reactant molecule graphs. An overview of our
method can be found in Figure 1 and is explained in further
detail in Prediction Methods.
We show the results of our prediction method on 16 basic
reactions of alkyl halides and alkenes, some of the first reactions
taught to organic chemistry students in many textbooks.48
The
training and validation reactions were generated by applying
simple SMARTS transformations to alkenes and alkyl halides.
While we limit our initial exploration to aliphatic, non-
stereospecific molecules, our method can easily be applied a
wider span of organic chemical space with enough example
reactions. The algorithm can also be expanded to include
experimental conditions such as reaction temperature and time.
With additional adjustments and a larger library of training data,
our algorithm will be able to predict multistep reactions and,
eventually, become a module in a larger machine-learning
system for suggesting retrosynthetic pathways for complex
molecules.
■ RESULTS AND DISCUSSION
Performance on Cross-Validation Set. We created a data
set of reactions of four alkyl halide reactions and 12 alkene
reactions; further details on the construction of the data set can
be found in Methods. Our training set consisted of 3400
reactions from this data set, and the test set consisted of 17,000
reactions; both the training set and the test set were balanced
across reaction types. During optimization on the training set,
k-fold cross-validation was used to help tune the parameters of
the neural net. Table 1 reports the cross-entropy score and the
accuracy of the baseline and fingerprinting methods on this test
set. Here the accuracy is defined by the percentage of matching
Table 1. Accuracy and Negative Log Likelihood (NLL) Error of Fingerprint and Baseline Methods
fingerprint method | fingerprint length | train NLL | train accuracy (%) | test NLL | test accuracy (%)
baseline | 51  | 0.2727 | 78.8 | 2.5573 | 24.7
Morgan   | 891 | 0.0971 | 86.0 | 0.1792 | 84.5
neural   | 181 | 0.0976 | 86.0 | 0.1340 | 85.7
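A rough sketch of the reaction-fingerprint input described above, assuming RDKit is available: Morgan fingerprints of the reactants and the reagent are concatenated into one vector, which would then be fed to a small classifier over reaction types. The molecules, fingerprint length, and the helper name fp are made up for illustration and need not match the paper's exact featurization.

```python
# Sketch of a concatenated reactant/reagent Morgan fingerprint as classifier input.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles, n_bits=256):
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)   # radius-2 Morgan fingerprint
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

reactant1, reactant2, reagent = "CC(C)CBr", "C=CC", "[Na+].[OH-]"      # made-up example reaction
x = np.concatenate([fp(reactant1), fp(reactant2), fp(reagent)])        # the reaction fingerprint
print(x.shape)  # this vector is what a small MLP would map to a reaction-type probability vector
```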
Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction [Segler & Waller, Chem. Eur. J. 2017]
• Neural networks rank which transformation rule (reaction template) should be applied to the input molecule(s), for both reaction prediction and retrosynthesis.
They first encoded molecules by unsupervised pre-training of self-organizing maps, a type of neural network,[8] which then serves as an input to a random forest classifier.[7] They conducted a study with a dataset of 278 reactions from the literature and 8 reaction classes. During the preparation of this manuscript, we became aware that Aspuru-Guzik and co-workers recently reported a neural network approach for reaction prediction, in which reactant and reagent molecules were encoded as neural[9] or extended connectivity fingerprints, which are concatenated.[10] They trained their models to predict 16 reaction types on an artificial dataset of 3200 reactions.
Herein, we propose a novel neural-symbolic model, which can be used for both reaction prediction and retrosynthesis. Our model uses neural networks in order to predict the most likely transformation rules to be applied to the input molecule(s) (see Figures 1 and 2). In short, the computer has to learn which named reaction was used to make a molecule (or under which rule the starting materials reacted). We trained neural networks by showing them millions of examples of known reactions and the corresponding correct reaction rule. The goal is to learn patterns in the molecules' functional groups that allow the machine to generalize to molecules it has not seen before. In this problem the model learns to predict the probability over all rules.[11] We hypothesize that the advantage of combining machine learning with symbolic rules is that we retain the familiar concept of rules, whereas the model learns to prioritize the rules and to estimate selectivity and compatibility from the provided training data, which are successfully performed experiments. We test our hypothesis in the first large-scale systematic investigation of machine learning and rule-based systems.
Several metrics are reported in Table 1 and Table 2 to evaluate the models. Accuracy shows how many reactions are correctly predicted when the rule with the highest predicted probability is evaluated. In top-n accuracy, we examine if the correct reaction rule is among the n highest ranked rules, similar to being on the first page of the results of a search engine. This means we allow the algorithm to propose more than one applicable rule, which is reasonable, since the same molecules can react differently, or several routes are possible to synthesize a target molecule in retrosynthesis.
Figure 1. The challenge in retrosynthesis and reaction prediction is to select the correct rule among possibly tens or hundreds of matching rules. In this example, both a Suzuki and a Kumada coupling (among others) formally match the biaryl moiety. However, the aldehyde in the molecular context would be in conflict with a Grignard reagent. Therefore, the Kumada coupling cannot be used to make target 1. This information has to be encoded by hand in expert systems. Our system simply learns it from data.
Figure 2. Overview of our neural-symbolic ansatz.
Table 1. Results for the study on 103 hand coded rules.
Task/Model | Acc | Top 3-Acc | MRR | W. Prec.
Reaction prediction
random | 0.03 | 0.12 | 0.04 | 0.03
expert system | 0.07 | 0.33 | 0.12 | 0.46
logistic regression | 0.86 | 0.97 | 0.91 | 0.86
highway network | 0.92 | 0.99 | 0.96 | 0.92
FC512 ELU | 0.92 | 0.99 | 0.96 | 0.92
Retrosynthesis
random | 0.03 | 0.12 | 0.04 | 0.03
expert system | 0.05 | 0.30 | 0.06 | 0.11
logistic regression | 0.64 | 0.95 | 0.77 | 0.62
highway network | 0.77 | 0.98 | 0.86 | 0.77
FC512 ELU | 0.78 | 0.98 | 0.87 | 0.78
Table 2. Results for the study on 8720 automatically extracted rules.
Task/Model | Acc | Top 10-Acc | MRR | W. Prec.
Reaction prediction
random | 0.00 | 0.00 | 0.00 | 0.00
expert system | 0.02 | 0.18 | 0.02 | 0.06
logistic regression | 0.41 | 0.65 | 0.49 | 0.31
highway network | 0.78 | 0.98 | 0.86 | 0.77
FC512 ELU | 0.77 | 0.97 | 0.85 | 0.76
Retrosynthesis
random | 0.00 | 0.00 | 0.00 | 0.00
expert system | 0.02 | 0.19 | 0.01 | 0.06
logistic regression | 0.31 | 0.59 | 0.41 | 0.23
highway network | 0.64 | 0.95 | 0.75 | 0.63
FC512 ELU | 0.62 | 0.94 | 0.74 | 0.62
Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies [Segler+, ICLR 2017 Workshop]
Workshop track - ICLR 2017
[Figure 1 panels: a) Retrosynthesis: analyse the target in reverse (chemical representation); b) applied transformations (productions) applied in each analysis step; c) Synthesis: go to the lab and actually conduct the reactions, following plan a); d) search tree representation; e) neural networks determine the actions.]
Figure 1: a) The synthesis of the target (ibuprofen) is planned with retro… analyzed until a set of building blocks is found that we can buy or we alre…
Retrosynthesis planning with Monte Carlo Tree Search and neural network policies [Segler+, ICLR 2017 Workshop]
[Figure 1, panels d) and e): d) search tree representation rooted at ibuprofen, with states S1 = {1}, S2 = {2, CO}, S3 = {3, CO, H2}, S4 = {4, CO, H2, Ac2O}; e) neural networks determine the actions: a molecular descriptor (ECFP4) of the current molecule x is fed to a deep neural network classifier that outputs p(r_n | x), and the most probable rules are applied to the molecule.]
Workshop track - ICLR 2017
2.3 MONTE CARLO TREE SEARCH (MCTS)
MCTS combines tree search with random sampling [Browne et al. (2012)]
work in three of the four MCTS steps: Selection, Expansion, and Rollout. W
employ Eq. (1), a variation of the UCT function similar to AlphaGo’s, to de
of a node v, where P(a) is the probability assigned to this action by the p
number of visits and Q(v) the accumulated reward at node v, and c the expl
    argmax_{v' ∈ children of v} [ Q(v') / N(v') + c · P(a) · √N(v) / (1 + N(v')) ]    (1)
The tree policy is applied until a leaf node is found, which is expanded
network. Our policy network has a top1 accuracy of 0.62 and a top50 (of 1
This allows to kill two birds with one stone: First, to reduce the branching
the top 50 most promising actions, and not all possible ones (≈ 200). Secon
actions, we have to solve the subgraph isomorphism problem, which determ
rule can be applied to a molecule and yields the next molecule(s), only 50
times for all rules. During rollout, we sample actions from the policy netw
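A small sketch of the selection rule in Eq. (1) above (an AlphaGo-style PUCT variant); the node representation as dicts with visit count N, accumulated reward Q, and policy prior P, and the numbers themselves, are invented for illustration.

```python
# Sketch of PUCT-style child selection: Q(v')/N(v') + c * P(a) * sqrt(N(v)) / (1 + N(v')).
import math

def select_child(children, N_parent, c=1.4):
    """children: list of dicts with visit count N, total reward Q, and prior P from the policy net."""
    def score(ch):
        exploit = ch["Q"] / ch["N"] if ch["N"] > 0 else 0.0
        explore = c * ch["P"] * math.sqrt(N_parent) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"name": "rule_a", "N": 10, "Q": 6.0, "P": 0.5},
    {"name": "rule_b", "N": 2,  "Q": 1.5, "P": 0.3},
    {"name": "rule_c", "N": 0,  "Q": 0.0, "P": 0.2},   # unvisited: purely prior-driven exploration
]
print(select_child(children, N_parent=12)["name"])
```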
Prediction of Organic Reaction Outcomes Using Machine Learning [Coley+, ACS Cent. Sci. 2017]
• Candidate products are enumerated by applying overgeneralized templates at the reaction core, and a neural network ranks the candidates using an edit-based representation of each candidate reaction.
representation, less fundamental changes (e.g., formal charge, association/disassociation of salts) are neglected. Loss or gain of a hydrogen is represented by 32 easy-to-compute features of that reactant atom alone, a_i ∈ R^32. Loss or gain of a bond is represented by a concatenation of the features of the atoms involved and four features of the bond, [a_i, b_ij, a_j] ∈ R^68.
Because edits occur at the reaction center by definition, the
overall representation of a candidate reaction depends only on
the atoms and bonds at the reaction core. There is no explicit
inclusion of other molecular features, e.g., adjacency to certain
functional groups. However, in our featurization, we include
rapidly calculable structural and electronic features of the
reactants’ atoms that reflect the local chemical environment first
and foremost, but also reflect the surrounding molecular
context.19−22
The chosen features can be found in Table 2 and
Table S2 and are discussed in more detail in the Supporting
Information.
The design of the neural network is motivated by the
likelihood of a reaction being a function of the atom/bond
changes that are required for it to occur. Individual edits are
first analyzed in isolation using a neural network unique to the
10%/20% training/validation/testing split and ceased training
once the validation loss did not improve for five epochs. The
edit-based model achieves an test accuracy of 68.5%, averaged
across all folds. In this context, accuracy refers to the percentage
of reaction examples where the recorded product was assigned
a rank of 1. The baseline model was similarly trained and tested
in a 5-fold CV, reaching an accuracy of 33.3%, suggesting that
the set of recorded products in the data set is fairly
homogeneous. The hybrid model, combining the edit-based
representation with the proposed products’ fingerprint
representations, achieves an accuracy of 71.8%. These results
are displayed in Table 1.
Figure 3. Edit-based model architecture for scoring candidate
reactions. Reactions are represented by four types of edits. Initial
atom- and bond-level attributes are converted into feature
representations, which are summed and used to calculate that
candidate reaction’s likelihood score.
Table 1. Comparison between Baseline, Edit-Based, and Hybrid Models in Terms of Categorical Crossentropy Loss and Accuracy(a)
model | loss | acc. (%) | top-3 (%) | top-5 (%) | top-10 (%)
random guess | 5.46 | 0.8 | 2.3 | 3.8 | 7.6
baseline | 3.28 | 33.3 | 48.2 | 55.8 | 65.9
edit-based | 1.34 | 68.5 | 84.8 | 89.4 | 93.6
hybrid | 1.21 | 71.8 | 86.7 | 90.8 | 94.6
(a) Top-n refers to the percentage of examples where the recorded product was ranked within the top n candidates.
scalability. Wei et al.10
describe the use of neural networks to
predict the outcome of reactions based on reactant fingerprints,
but limit their study to 16 types of reactions covering a very
narrow scope of possible alkyl halide and alkene reactions.
Given two reactants and one reagent, the model was trained to
identify which of 16 templates was most applicable. The data
set used for cross-validation comes from artificially generated
examples with limited chemical functionality, rather than
experimental data.
Quite recently, Segler and Waller describe two approaches to
forward synthesis prediction. The first is a knowledge-graph
approach that uses the concept of half reactions to generate
possible products given exactly two reactants by looking at the
known reactions in which each of those reactants participates.11
required; (3) a new reaction representation focused on the
fundamental transformation at the reaction site rather than
constituent reactant and product fingerprints; (4) the
implementation and validation of a neural network-based
model that learns when certain modes of reactivity are more or
less likely to occur than other potential modes. Despite the
literature bias toward reporting only high-yielding reactions, we
develop a successful workflow that can be performed without
any manual curation using actual reactions reported in the
USPTO literature.
■ APPROACH
Overview. Our model predicts the outcome of a chemical
reaction in a two-step manner: (1) applying overgeneralized
Figure 1. Model framework combining forward enumeration and candidate ranking. The primary aim of this work is the creation of the parametrized
scoring model, which is trained to maximize the probability assigned to the recorded experimental outcome.
Figure 2. Overall reaction prediction framework: (a) A user inputs the reactants and conditions. (b) We identify potential electron donors and
acceptors using coarse approximations of electron-filled and -unfilled MOs. (c) Highly sensitive reactive site classifiers are trained and used to filter out
"Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models [Schwaller+, 2018]
• Reactions are treated as text: reactant/reagent SMILES strings are translated into product SMILES with an attention-based sequence-to-sequence model.

Test set | reactions | BLEU [36] | ROUGE [37] | top-1
Jin's USPTO test set [17] | 38,648 | 95.9 | 96.0 | 83.2
Lowe's test set [26] | 50,258 | 90.3 | 90.9 | 65.4
6.2 Comparison with the state of the art
To the best of our knowledge, no previous work has attempted to predict reac
US patent dataset of Lowe [26]. Table 6 shows a comparison with the Weisfeil
networks (WLDN) from Jin et al. [17] on their USPTO test set. To make a fair
all the multiple product reactions in the test set as false predictions for our mod
only on the single product reactions. By achieving 80.3% top-1 accuracy, we o
by a margin of 6.3%, which is even higher than for their augmented setup. A
rank candidates, but was trained on accurately predicting the top-1 outcome, it
the WLDN beats our model in the top-3 and top-5 accuracy. The decoding o
test set reactions takes on average 25 ms per reaction, inferred with a beam se
therefore compete with the state of the art.
Table 6: Comparison with Jin et al. [17]. The 1,352 multiple product reactions are counted as false predictions for our model.
Jin's USPTO test set [17], accuracies in [%]
Method | top-1 | top-2 | top-3 | top-5
WLDN [17] | 74.0 | – | 86.7 | 89.5
Our model | 80.3 | 84.7 | 86.2 | 87.5
(a) Distribution of top-1 probabilities (b) Coverage / Accuracy plot
Figure 1: Top-1 prediction confidence plots for Lowe's test set inferred with a beam search of 10.
4 Attention
Attention is the key to take into account complex long-range dependencies between multiple tokens. Specific functional groups, solvents or catalysts have an impact on the outcome of a reaction, even if they are far from the reaction center in the molecular graph and therefore also in the SMILES string. Figure 2 shows how the network learned to focus first on the C[O-] molecule, to map the [O-] in the input correctly to the O in the target, and to ignore the Br, which is replaced in the target.
(a) Attention weights (b) Reaction plotted with RDKit [27]
Figure 2: Reaction 120 from Jin's USPTO test set. The atom mapping between reactants and product is highlighted. SMILES: Brc1cncc(Br)c1.C[O-]>CN(C)C=O.[Na+]>COc1cncc(Br)c1
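Sequence-to-sequence models like the one above operate on token sequences rather than raw characters, so SMILES strings are first tokenized. The regex below is a common community pattern for SMILES tokenization, shown only as an illustration; it is not necessarily the exact tokenizer used in the paper.

```python
# Minimal regex-based SMILES tokenizer sketch.
import re

SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#\-\+\(\)\.\\/@0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

# Reactant string from the attention example above:
print(tokenize("Brc1cncc(Br)c1.C[O-]"))
```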
[Several figure-only slides step through the Weisfeiler-Lehman Difference Network (WLDN) pipeline on a small example graph (atoms u1–u5).]
Weisfeiler-Lehman Network (WLN) [Jin+, NIPS 2017] as an MPNN:
• Message function: M_t(h_v^t, h_u^t, e_vu) = τ(V CONCAT(h_u^t, e_vu)), where τ is a nonlinearity and V a learned matrix
• Update function: U_t(h_v^t, m_v^{t+1}) = τ(U1 h_v^t + U2 m_v^{t+1}), where U1 and U2 are learned matrices shared across layers
[Figure: the messages τ(V CONCAT(h_u^t, e_vu)) from the neighbors u1, u2 are summed (Σ) and fed to the update function.]
Weisfeiler-Lehman Network (WLN) [Jin+, NIPS 2017]: learning continuous atom representations
[Figure: node v with neighbors u1–u4; initial states h_v^(0), h_u^(0) and bond features f_uv are relabeled through L layers, and the final states are combined with matrices W(0), W(1), W(2) and summed (Σ).]
Two graphs are said to be isomorphic if their set representations are the same. The number of distinct labels grows exponentially with the number of iterations L.
WL Network: The discrete relabeling process does not directly generalize to continuous feature vectors. Instead, we appeal to neural networks to continuously embed the computations inherent in the WL test. Let r be the analogous continuous relabeling function. Then a node v ∈ G with neighbor nodes N(v), node features f_v, and edge features f_uv is "relabeled" according to
    r(v) = τ(U1 f_v + U2 Σ_{u ∈ N(v)} τ(V[f_u, f_uv]))    (1)
where τ(·) could be any non-linear function. We apply this relabeling operation iteratively to obtain context-dependent atom vectors
    h_v^(l) = τ(U1 h_v^(l-1) + U2 Σ_{u ∈ N(v)} τ(V[h_u^(l-1), f_uv]))    (1 ≤ l ≤ L)    (2)
where h_v^(0) = f_v and U1, U2, V are shared across layers. The final atom representations arise from mimicking the set comparison function in the WL isomorphism test, yielding
    c_v = Σ_{u ∈ N(v)} W(0) h_u^(L) ⊙ W(1) f_uv ⊙ W(2) h_v^(L)    (3)
The set comparison here is realized by matching each rank-1 edge tensor h_u^(L) ⊗ f_uv ⊗ h_v^(L) to a set of reference edges also cast as rank-1 tensors W(0)[k] ⊗ W(1)[k] ⊗ W(2)[k], where W[k] is the k-th row of matrix W. In other words, Eq. 3 above could be written as
    c_v[k] = Σ_{u ∈ N(v)} ⟨ W(0)[k] ⊗ W(1)[k] ⊗ W(2)[k], h_u^(L) ⊗ f_uv ⊗ h_v^(L) ⟩    (4)
The resulting c_v is a vector representation that captures the local chemical environment of the atom (through relabeling) and involves a comparison against a learned set of reference environments. The representation of the whole graph G is simply the sum over all the atom representations: c_G = Σ_v c_v.
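A minimal NumPy sketch of the WLN relabeling (eq. 2) and the local atom representation c_v (eq. 3) above, on a toy three-atom graph with random weights; τ is taken to be ReLU and all dimensions are arbitrary.

```python
# Sketch of WLN relabeling (eq. 2) and local atom representations c_v (eq. 3) on a toy graph.
import numpy as np

rng = np.random.default_rng(4)
d, d_f = 4, 3                                    # atom-vector size, bond-feature size
U1, U2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
V = rng.normal(size=(d, d + d_f))
W0, W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d_f)), rng.normal(size=(d, d))
tau = lambda x: np.maximum(x, 0.0)               # ReLU as the nonlinearity

adj = {0: [1], 1: [0, 2], 2: [1]}
f_atom = {v: rng.normal(size=d) for v in adj}    # initial atom features f_v = h_v^(0)
f_bond = {frozenset({0, 1}): rng.normal(size=d_f), frozenset({1, 2}): rng.normal(size=d_f)}

h = dict(f_atom)
for _ in range(2):                               # L relabeling iterations, weights shared across layers
    h = {v: tau(U1 @ h[v]
                + U2 @ sum(tau(V @ np.concatenate([h[u], f_bond[frozenset({u, v})]]))
                           for u in adj[v]))
         for v in adj}

c = {v: sum((W0 @ h[u]) * (W1 @ f_bond[frozenset({u, v})]) * (W2 @ h[v]) for u in adj[v])
     for v in adj}                               # local chemical environment of each atom
print(c[1])
```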
[Figure-only slide: the Weisfeiler-Lehman Difference Network (WLDN) example graph (atoms u1–u5) again.]
Finding reaction centers with the WLN [Jin+, NIPS 2017]
[Figure: the WLN atom representations c_u, c_v (from the final states h_u^(L), h_v^(L)) are scored pairwise to predict how likely each atom pair (u, v) is to change in the reaction.]
3.1.2 Finding Reaction Centers with WLN
We present two models to predict reactivity: the local and global models. Our local model is based directly on the atom representations c_u and c_v in predicting label y_uv. The global model, on the other hand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms outside of the reaction center may be necessary for the reaction to occur. For example, the reaction center may be influenced by certain reagents. We incorporate these distal effects into the global model through an attention mechanism.
Local Model: Let c_u, c_v be the atom representations for atoms u and v, respectively, as returned by the WLN. We predict the reactivity score of (u, v) by passing these through another neural network:
    s_uv = σ(u^T τ(M_a c_u + M_a c_v + M_b b_uv))    (5)
where σ(·) is the sigmoid function, and b_uv is an additional feature vector that encodes auxiliary information about the pair such as whether the two atoms are in different molecules or which type of bond connects them.
Global Model: Let α_uv be the attention score of atom v on atom u. The global context representation c̃_u of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from the attention module:
    c̃_u = Σ_v α_uv c_v;  α_uv = σ(u^T τ(P_a c_u + P_a c_v + P_b b_uv))    (6)
    s_uv = σ(u^T τ(M_a c̃_u + M_a c̃_v + M_b b_uv))    (7)
Note that the attention is obtained with sigmoid rather than softmax non-linearity since there may be multiple atoms relevant to a particular atom u.
Training: Both models are trained to minimize the following loss function:
    L(T) = -Σ_{R ∈ T} Σ_{u ≠ v ∈ R} y_uv log(s_uv) + (1 - y_uv) log(1 - s_uv)    (8)
Here we predict each label independently because of the large number of variables. For a given reaction with N atoms, we need to predict the reactivity score of O(N^2) pairs. This quadratic complexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless, we found independent prediction yields sufficiently good performance.
3.2 Candidate Generation
We select the top K atom pairs with the highest predicted reactivity score and designate them, collectively, as the reaction center. The set of candidate products are then obtained by enumerating all possible bond configuration changes within the set. While the resulting set of candidate products is exponential in K, many can be ruled out by invoking additional constraints. For example, every atom has a maximum number of neighbors they can connect to (valence constraint). We also leverage the statistical bias that reaction centers are very unlikely to consist of disconnected components (connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint. As we will show, the set of candidates arising from this procedure is more compact than those arising from templates without sacrificing coverage.
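A small sketch of the local reactivity score (eq. 5) and the pairwise cross-entropy loss (eq. 8) above; the atom vectors, pair features, and labels are random placeholders standing in for the WLN outputs c_u, c_v and the binary labels y_uv.

```python
# Sketch of the local reactivity score (eq. 5) and pairwise cross-entropy loss (eq. 8).
import numpy as np

rng = np.random.default_rng(5)
d, d_b = 4, 3
Ma, Mb = rng.normal(size=(d, d)), rng.normal(size=(d, d_b))
u_vec = rng.normal(size=d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tau = lambda x: np.maximum(x, 0.0)

def reactivity_score(c_u, c_v, b_uv):
    # s_uv = sigma(u^T tau(Ma c_u + Ma c_v + Mb b_uv))
    return sigmoid(u_vec @ tau(Ma @ c_u + Ma @ c_v + Mb @ b_uv))

def pair_loss(pairs):
    # L = -sum over pairs of [ y_uv log(s_uv) + (1 - y_uv) log(1 - s_uv) ]
    total = 0.0
    for c_u, c_v, b_uv, y in pairs:
        s = reactivity_score(c_u, c_v, b_uv)
        total -= y * np.log(s) + (1 - y) * np.log(1 - s)
    return total

pairs = [(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d_b), y) for y in (1, 0, 0)]
print(pair_loss(pairs))
```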
Global model with attention [Jin+, NIPS 2017]
[Figure-only slide: the attention weights α_uv combine the atom vectors c_v of all reactant atoms into a global context vector c̃_u for each atom u (eqs. 6–7), illustrated on the example graph.]
(Slide figure: global model, attention-weighted sum (Σ) of the WLN atom representations.)
(Slide figure: global model, attention over all reactant atoms u1-u4 yielding the context vectors c̃_u used in Eq. (7).)
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN), difference graph over the candidate product's atoms u1-u5.)
(Slide: candidate ranking, overview figure of the approach: "(1) we train a model to identify pairwise atom interactions ...".)
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN), continued.)
(Slide figure: candidate scoring with difference vectors d_v^(p_i) = c_v^(p_i) - c_v^(r) over atoms v1-v5, pooled by a sum (Σ) to score each candidate.)
(Slide figure: difference vectors d_v^(p_i) = c_v^(p_i) - c_v^(r) and bond labels y ∈ {0, 1, 2, 3} over the enumerated candidate products; the WLDN is applied on the resulting difference graph.)
Let c_v^{(p_i)} denote the learned atom representation of atom v in candidate product molecule p_i. We define the difference vector d_v^{(p_i)} pertaining to atom v as follows:

    d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)} ;    s(p_i) = u^T \tau( M \sum_{v \in p_i} d_v^{(p_i)} )    (9)

Recall that the reactants and products are atom-mapped so we can use v to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each (r, p_i) pair. This vector is then fed into another neural network to score the candidate product p_i.

Weisfeiler-Lehman Difference Network (WLDN) Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph D(r, p_i) is defined as a molecular graph which has the same atoms and bonds as p_i, with atom v's feature vector replaced by d_v^{(p_i)}. Operating on the difference graph has several benefits. First, in D(r, p_i), atom v's feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, D(r, p_i) explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector, by applying a separately parameterized WLN on top of D(r, p_i):

    h_v^{(p_i, l)} = \tau\Big( U_1 h_v^{(p_i, l-1)} + U_2 \sum_{u \in N(v)} \tau\big( V [ h_u^{(p_i, l-1)}, f_{uv} ] \big) \Big)    (1 \le l \le L)    (10)

    d_v^{(p_i, L)} = \sum_{u \in N(v)} W^{(0)} h_u^{(p_i, L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(p_i, L)}    (11)

where h_v^{(p_i, 0)} = d_v^{(p_i)}. The final score of p_i is s(p_i) = u^T \tau( M \sum_{v \in p_i} d_v^{(p_i, L)} ).

Training Both models are trained to minimize the softmax log-likelihood objective over the scores {s(p_0), s(p_1), ..., s(p_m)} where s(p_0) corresponds to the target.

4 Experiments

Data As a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes.

In addition, for comparison purposes we report the results on the subset of 15K reactions from this dataset (referred to as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing.

Setup for Reaction Center Identification The output of this component consists of K atom pairs with the highest reactivity scores. We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set.

The model features reflect basic chemical properties of atoms and bonds. Atom-level features include its elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring.

Both our local and global models are built upon a Weisfeiler-Lehman Network, with unrolled depth
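A minimal sketch of the difference-vector pooling (Eq. 9) and the WLDN refinement (Eqs. 10-11) follows. Shapes, the choice of τ as ReLU, and the masking scheme are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class WLDNScorer(nn.Module):
    """Scores a candidate product from difference vectors d_v = c_v(p_i) - c_v(r)."""
    def __init__(self, d, d_f, L=3):
        super().__init__()
        self.U1 = nn.Linear(d, d, bias=False)
        self.U2 = nn.Linear(d, d, bias=False)
        self.V = nn.Linear(d + d_f, d, bias=False)
        self.W0 = nn.Linear(d, d, bias=False)
        self.W1 = nn.Linear(d_f, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.M = nn.Linear(d, d, bias=False)
        self.u = nn.Linear(d, 1, bias=False)
        self.L = L

    def forward(self, d_vec, adj, f):
        """d_vec: (N, d) difference vectors; adj: (N, N) 0/1 adjacency of p_i;
        f: (N, N, d_f) bond features f_uv. Returns the scalar score s(p_i)."""
        tau = torch.relu
        h = d_vec                                          # h^(p_i, 0) = d^(p_i)
        for _ in range(self.L):                            # Eq. (10)
            msg = tau(self.V(torch.cat(
                [h[None, :, :].expand(h.size(0), -1, -1), f], dim=-1)))
            h = tau(self.U1(h) + self.U2((adj.unsqueeze(-1) * msg).sum(dim=1)))
        # Eq. (11): element-wise combination over neighbors
        pair = self.W0(h)[None, :, :] * self.W1(f) * self.W2(h)[:, None, :]
        d_L = (adj.unsqueeze(-1) * pair).sum(dim=1)        # (N, d)
        return self.u(tau(self.M(d_L.sum(dim=0))))         # s(p_i), pooling as in Eq. (9)

# Training: softmax cross-entropy over the candidate scores {s(p_0), ..., s(p_m)},
# with p_0 the recorded (true) product.
```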
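The atom- and bond-level features listed in the experimental setup can be extracted with RDKit; a sketch is shown below. The exact feature set and encoding used in the paper may differ.

```python
from rdkit import Chem

def atom_features(atom):
    """Atom features named in the paper: element, degree, #H, valence, aromaticity."""
    return [
        atom.GetAtomicNum(),          # elemental identity
        atom.GetDegree(),             # degree of connectivity
        atom.GetTotalNumHs(),         # number of attached hydrogens
        atom.GetImplicitValence(),    # implicit valence
        int(atom.GetIsAromatic()),    # aromaticity flag
    ]

def bond_features(bond):
    """Bond type (single/double/triple/aromatic), conjugation, ring membership."""
    bt = bond.GetBondType()
    return [
        int(bt == Chem.BondType.SINGLE),
        int(bt == Chem.BondType.DOUBLE),
        int(bt == Chem.BondType.TRIPLE),
        int(bt == Chem.BondType.AROMATIC),
        int(bond.GetIsConjugated()),
        int(bond.IsInRing()),
    ]

mol = Chem.MolFromSmiles("CC(=O)Nc1ccccc1")   # example reactant
atoms = [atom_features(a) for a in mol.GetAtoms()]
bonds = [bond_features(b) for b in mol.GetBonds()]
```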
(Slide figure: Weisfeiler-Lehman Difference Network (WLDN) applied to the enumerated candidates with bond labels y ∈ {0, 1, 2, 3}.)
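The coverage and precision-at-rank numbers reported in Table 1 below can be computed as in the following sketch, assuming simple Python containers for the ranked predictions and the ground truth.

```python
def coverage_at_k(pred_pairs_per_reaction, true_centers, k):
    """Fraction of reactions whose true reaction center is fully inside the top-K pairs.

    pred_pairs_per_reaction: list of ranked atom-pair lists, one per reaction
    true_centers:            list of sets of atom pairs that actually change
    """
    hits = sum(
        true.issubset(set(pred[:k]))
        for pred, true in zip(pred_pairs_per_reaction, true_centers)
    )
    return hits / len(true_centers)

def precision_at_n(ranked_candidates, true_product, n):
    """1 if the recorded product appears among the top-n ranked candidates."""
    return int(true_product in ranked_candidates[:n])
```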
Figure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle). Reactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue circle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the local model does not even include it in the top ten predictions. The attention map of the global model shows that atoms 1, 25, and 26 were determinants of atom 4's predicted reactivity.

(a) Reaction Center Prediction Performance. Coverage is reported by picking the top K (K = 6, 8, 10) reactivity pairs. |θ| is the number of model parameters.

    USPTO-15K
    Method   |θ|      K=6    K=8    K=10
    Local    572K     80.1   85.0   87.7
    Local    1003K    81.6   86.1   89.1
    Global   756K     86.7   90.1   92.2

    USPTO
    Local    572K     83.0   87.2   89.6
    Local    1003K    82.4   86.7   89.1
    Global   756K     89.8   92.0   93.3

    Avg. Num. of Candidates (USPTO)
    Template   482.3 out of 5006
    Global     60.9   246.5   1076

(b) Candidate Ranking Performance. Precision at ranks 1, 3, 5 is reported. (*) denotes that the true product was added if not covered by the previous stage.

    USPTO-15K
    Method         Cov.    P@1    P@3    P@5
    Coley et al.   100.0   72.1   86.6   90.7
    WLN            90.1    74.9   84.6   86.3
    WLDN           90.1    76.7   85.6   86.8
    WLN (*)        100.0   81.4   92.5   94.8
    WLDN (*)       100.0   84.1   94.1   96.1

    USPTO
    WLN            92.0    73.5   86.1   89.0
    WLDN           92.0    74.0   86.7   89.5
    WLN (*)        100.0   76.7   91.0   94.6
    WLDN (*)       100.0   77.8   91.9   95.4

Table 1: Results on USPTO-15K and USPTO datasets.
(Slide: discussion of results. The model's accuracy is compared against the top-performing template-based approach, which employs frequency-based heuristics to construct reaction templates; Table 1 is shown again for reference.)
A Human Evaluation Setup

Here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups.

Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first 8 are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top-1 accuracy.

    Chemist     56.3   50.0   40.0   63.8   66.3   65.0   40.0   58.8   25.0   16.3
    Our Model   69.1

Figure 5: Details of human performance. (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a.
References

Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network
• Jin, Wengong, et al. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems, 2017.

Template-based
• Wei, Jennifer N., David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions. 2016.
• Segler, Marwin H. S., and Mark P. Waller. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. 2017.
• Segler, Marwin, Mike Preuss, and Mark Waller. Towards "AlphaChem": Chemical Synthesis Planning with Tree Search and Deep Neural Network Policies. 2017.
• Coley, Connor W., et al. Prediction of Organic Reaction Outcomes Using Machine Learning. 2017.

Template-free
• Kayala, Matthew A., et al. Learning to predict chemical reactions. 2011.
• Schwaller, Philippe, et al. "Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. arXiv preprint, 2017.
Graph representation
• Rogers, David, and Mathew Hahn. Extended-connectivity fingerprints. 2010.
• Shervashidze, Nino, et al. Weisfeiler-Lehman graph kernels. 2011.

Graph convolution
• Gilmer, Justin, et al. Neural message passing for quantum chemistry. ICML, 2017.
• Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints. 2015.
• Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks. 2016.
• Schütt, Kristof, et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 2017.
• Vinyals, Oriol, Bengio, Samy, and Kudlur, Manjunath. Order matters: Sequence to sequence for sets. ICLR, 2016.
DeNA copyright document summarizes neural message passing approaches

  • 10. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n , 2 2 7 2 , 0)7 : (,+ 1 R N) 2 ( wM]r )7 : [ o GP e L • , 2 2 7 2 + M • i h sL k gm ] • M]i h a d k gm uM] n + M C i h k gm + RI] • 2 2 , 2 2 7 R u Ni h p L l sL oP k gm tp ] ntum Chemistry Oriol Vinyals 3 George E. Dahl 1 DFT 103 seconds Message Passing Neural Net 10 2 seconds E,!0, ... Targets )7 : (,+
  • 11. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Message Passing Neural Networks (MPNN), Gilmer et al. (2017): a common abstraction for graph networks on molecules. During the message passing phase, hidden states $h_v^t$ at each node are updated for $T$ steps according to $m_v^{t+1} = \sum_{w \in N(v)} M_t(h_v^t, h_w^t, e_{vw})$ and $h_v^{t+1} = U_t(h_v^t, m_v^{t+1})$, where $N(v)$ denotes the neighbors of $v$ in graph $G$. The readout phase then computes a feature vector for the whole graph, $\hat{y} = R(\{h_v^T \mid v \in G\})$; $M_t$, $U_t$, and $R$ are learned differentiable functions, and $R$ must be invariant to permutations of the node states. The neural fingerprint model of Duvenaud et al. (2015) is one instance of this framework, with message function $M_t(h_v^t, h_w^t, e_{vw}) = \mathrm{CONCAT}(h_w^t, e_{vw})$ and vertex update function $U_t(h_v^t, m_v^{t+1}) = \sigma(H_t^{\deg(v)} m_v^{t+1})$, where $H_t^{\deg(v)}$ is a learned matrix chosen by the degree of $v$ at step $t$.
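A minimal numpy sketch of the generic message / update / readout cycle described above. The toy 4-atom graph, the feature sizes, and the simple tanh-linear message and update functions are illustrative assumptions only, not the concrete models surveyed on these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy molecular graph: 4 atoms in a ring, each bond carrying a small edge-feature vector.
num_atoms, d_h, d_e, T = 4, 8, 3, 3
bonds = {(0, 1), (1, 2), (2, 3), (3, 0)}
edges = {e: rng.normal(size=d_e) for e in bonds}
h = rng.normal(size=(num_atoms, d_h))            # h_v^0: initial atom features

# Illustrative "learned" parameters (random placeholders) for linear message/update functions.
W_msg = rng.normal(size=(d_h + d_e, d_h)) * 0.1
W_upd = rng.normal(size=(2 * d_h, d_h)) * 0.1

def edge_feat(v, w):
    return edges[(v, w)] if (v, w) in edges else edges[(w, v)]

def neighbors(v):
    return [w for w in range(num_atoms) if (v, w) in edges or (w, v) in edges]

for t in range(T):
    # Message phase: m_v^{t+1} = sum_{w in N(v)} M_t(h_v^t, h_w^t, e_vw)
    m = np.zeros_like(h)
    for v in range(num_atoms):
        for w in neighbors(v):
            m[v] += np.tanh(np.concatenate([h[w], edge_feat(v, w)]) @ W_msg)
    # Update phase: h_v^{t+1} = U_t(h_v^t, m_v^{t+1})
    h = np.tanh(np.concatenate([h, m], axis=1) @ W_upd)

# Readout phase: a permutation-invariant sum over final node states.
y_hat = h.sum(axis=0)
print(y_hat.shape)
```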
  • 12. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Readout function of the neural fingerprint model (Duvenaud et al., 2015): $\hat{y} = R(\{h_v^{(T)} \mid v \in G\}) = f\big(\sum_{v,t} \mathrm{softmax}(W_t h_v^{(t)})\big)$, where $f$ is a neural network and $W_t$ is a learned matrix for each time step; node states from every step are soft-hashed and pooled into the graph representation.
  • 13. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Gated Graph Neural Networks (GG-NN), Li et al. (2016), as an MPNN: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = A_{e_{vw}} h_w^t$, where $A_{e_{vw}}$ is a learned matrix, one for each discrete edge (bond) label; the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$, the Gated Recurrent Unit of Cho et al. (2014), with weights tied across time steps.
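A small numpy sketch of this GG-NN style step, assuming three discrete bond types and a hand-rolled GRU cell; all weights are random placeholders for what would normally be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
bond_types = ["single", "double", "aromatic"]
# One matrix A_e per discrete bond type, as in the GG-NN message function.
A = {b: rng.normal(size=(d, d)) * 0.1 for b in bond_types}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal GRU cell for the vertex update U_t(h_v, m_v) = GRU(h_v, m_v).
Wz, Uz = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wr, Ur = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wh, Uh = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

def gru_update(h_v, m_v):
    z = sigmoid(Wz @ m_v + Uz @ h_v)
    r = sigmoid(Wr @ m_v + Ur @ h_v)
    h_tilde = np.tanh(Wh @ m_v + Uh @ (r * h_v))
    return (1 - z) * h_v + z * h_tilde

# One atom v with two neighbours: message m_v = sum_w A_{e_vw} h_w, then GRU update.
h_v = rng.normal(size=d)
neigh = [("single", rng.normal(size=d)), ("aromatic", rng.normal(size=d))]
m_v = sum(A[b] @ h_w for b, h_w in neigh)
h_v_next = gru_update(h_v, m_v)
print(h_v_next.shape)
```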
  • 14. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. GG-NN readout (Li et al., 2016): $R = \tanh\big(\sum_{v \in V} \sigma(i(h_v^{(T)}, h_v^{(0)})) \odot \tanh(j(h_v^{(T)}))\big)$, where $i$ and $j$ are neural networks and $\odot$ denotes element-wise multiplication; the $\sigma(i(\cdot))$ term acts as a soft gate deciding which nodes contribute to the graph-level vector.
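A numpy sketch of that gated readout, with single linear layers standing in for the networks $i$ and $j$ (an assumption made here for brevity).

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_atoms, d, d_out = 5, 6, 4
h_T = rng.normal(size=(num_atoms, d))            # final atom states h_v^(T)
h_0 = rng.normal(size=(num_atoms, d))            # initial atom states h_v^(0)

# i(.) and j(.) stand in for the two small neural networks; single linear layers here.
Wi = rng.normal(size=(2 * d, d_out)) * 0.1
Wj = rng.normal(size=(d, d_out)) * 0.1

gate = sigmoid(np.concatenate([h_T, h_0], axis=1) @ Wi)   # sigma(i(h_v^T, h_v^0))
value = np.tanh(h_T @ Wj)                                 # tanh(j(h_v^T))
R = np.tanh((gate * value).sum(axis=0))                   # graph-level representation
print(R.shape)
```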
  • 15. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Deep Tensor Neural Networks (DTNN), Schütt et al., as an MPNN: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = \tanh\big(W^{fc}\big((W^{cf} h_w^t + b_1) \odot (W^{df} e_{vw} + b_2)\big)\big)$, with learned matrices $W^{fc}$, $W^{cf}$, $W^{df}$ and biases $b_1$, $b_2$; the vertex update function is simply additive, $U_t(h_v^t, m_v^{t+1}) = h_v^t + m_v^{t+1}$.
  • 16. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. DTNN readout: $R(\{h_v^{(T)} \mid v \in G\}) = \sum_v \mathrm{NN}(h_v^{(T)})$, i.e. a small neural network is applied to each final node state and the per-node outputs are summed into the graph-level prediction.
  • 17. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Edge-network MPNN (Gilmer et al., 2017): the message function is $M_t(h_v^t, h_w^t, e_{vw}) = A(e_{vw}) h_w^t$, where $A(e_{vw})$ is a neural network that maps the (possibly continuous) edge feature vector $e_{vw}$ to a $d \times d$ matrix; the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \mathrm{GRU}(h_v^t, m_v^{t+1})$, the same gated update as in GG-NN.
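A short numpy sketch of the edge network alone, assuming a two-layer perceptron for $A(e_{vw})$; the sizes and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_e = 6, 4

# A(e_vw): a small network mapping continuous edge features to a d x d matrix,
# so the message becomes M(h_v, h_w, e_vw) = A(e_vw) @ h_w.
W1 = rng.normal(size=(d_e, 16)) * 0.1
W2 = rng.normal(size=(16, d * d)) * 0.1

def edge_network(e_vw):
    hidden = np.tanh(e_vw @ W1)
    return (hidden @ W2).reshape(d, d)

h_w = rng.normal(size=d)
e_vw = rng.normal(size=d_e)
message = edge_network(e_vw) @ h_w        # contribution of neighbour w to m_v^{t+1}
print(message.shape)
```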
  • 18. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. set2set readout (Vinyals et al., 2016): $R(\{h_v^{(T)} \mid v \in G\}) = \mathrm{set2set}(\{h_v^{(T)} \mid v \in G\})$, a Read-Process-and-Write block that is invariant to the order of its inputs. Treating the final node states as memories $m_i$, the process block iterates $q_t = \mathrm{LSTM}(q^*_{t-1})$, $e_{i,t} = f(m_i, q_t)$, $a_{i,t} = \exp(e_{i,t}) / \sum_j \exp(e_{j,t})$, $r_t = \sum_i a_{i,t} m_i$, and $q^*_t = [q_t\, r_t]$, where $f$ computes a single scalar from $m_i$ and $q_t$ (e.g. a dot product) and the LSTM takes no inputs; the vector retrieved from memory does not change if the memories are shuffled, and the final state serves as the graph representation.
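A numpy sketch of that process block, with a plain tanh-linear recurrence standing in for the input-less LSTM (an assumption made to keep the example short).

```python
import numpy as np

rng = np.random.default_rng(4)
num_atoms, d, steps = 5, 6, 3
memories = rng.normal(size=(num_atoms, d))       # m_i: final atom states h_v^(T)

# A plain linear recurrence stands in for the input-less LSTM of the process block.
W_q = rng.normal(size=(2 * d, d)) * 0.1
q_star = np.zeros(2 * d)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

for t in range(steps):
    q = np.tanh(q_star @ W_q)                    # q_t = "LSTM"(q*_{t-1})
    scores = memories @ q                        # e_{i,t} = <m_i, q_t>
    attn = softmax(scores)                       # a_{i,t}
    r = attn @ memories                          # r_t = sum_i a_{i,t} m_i
    q_star = np.concatenate([q, r])              # q*_t = [q_t, r_t]

graph_embedding = q_star                         # order-invariant readout of the atom set
print(graph_embedding.shape)
```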
  • 19. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Neural reaction fingerprints, Wei et al. (ACS Central Science): a reaction fingerprint, made by concatenating the fingerprints of the reactant and reagent molecules, is fed to a neural network that predicts a probability vector over 17 reaction types; the product is then obtained by applying the SMARTS transformation corresponding to the most probable type to the reactants (their Figure 1). The data set covers 16 basic alkyl halide and alkene reactions generated by simple SMARTS transformations, with 3,400 training and 17,000 test reactions balanced across reaction types; accuracy is the percentage of test reactions whose recorded type is matched. Cross-validation results (their Table 1, accuracy and negative log likelihood):
  baseline fingerprint (length 51): train NLL 0.2727, train acc. 78.8%, test NLL 2.5573, test acc. 24.7%
  Morgan fingerprint (length 891): train NLL 0.0971, train acc. 86.0%, test NLL 0.1792, test acc. 84.5%
  neural fingerprint (length 181): train NLL 0.0976, train acc. 86.0%, test NLL 0.1340, test acc. 85.7%
  The learned neural fingerprint matches Morgan accuracy with a much shorter fingerprint.
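A toy numpy sketch of the classification step in this pipeline: concatenate two fingerprint vectors and score 17 reaction types with a small network. The random 0/1 vectors merely stand in for real Morgan fingerprints (normally computed with a cheminformatics toolkit such as RDKit), and all weights are untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
fp_len, n_reaction_types = 891, 17

# Stand-ins for Morgan fingerprints of the reactant and reagent molecules;
# the reaction fingerprint is their concatenation.
reactant_fp = rng.integers(0, 2, size=fp_len).astype(float)
reagent_fp = rng.integers(0, 2, size=fp_len).astype(float)
reaction_fp = np.concatenate([reactant_fp, reagent_fp])

W1 = rng.normal(size=(2 * fp_len, 128)) * 0.01
W2 = rng.normal(size=(128, n_reaction_types)) * 0.1

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

probs = softmax(np.tanh(reaction_fp @ W1) @ W2)   # probability over 17 reaction types
predicted_type = int(np.argmax(probs))            # type whose SMARTS transform would be applied
print(predicted_type, probs.sum())
```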
  • 20. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Neural-symbolic rule selection, Segler & Waller (Chem. Eur. J. 2017): a neural network trained on millions of known reactions predicts which transformation rule to apply to the input molecule(s), and the same model serves both reaction prediction and retrosynthesis. The difficulty is that tens or hundreds of rules can formally match a molecule (their Figure 1: both a Suzuki and a Kumada coupling match the biaryl moiety of target 1, but the aldehyde in the molecular context conflicts with a Grignard reagent, so the Kumada coupling cannot be used); expert systems must encode such context by hand, whereas the network learns to prioritize rules and estimate selectivity from data. Top-n accuracy checks whether the correct rule is among the n highest-ranked rules. Results:
  Table 1 (103 hand-coded rules), columns Acc / Top 3-Acc / MRR / W. Prec. Reaction prediction: random 0.03 / 0.12 / 0.04 / 0.03; expert system 0.07 / 0.33 / 0.12 / 0.46; logistic regression 0.86 / 0.97 / 0.91 / 0.86; highway network 0.92 / 0.99 / 0.96 / 0.92; FC512 ELU 0.92 / 0.99 / 0.96 / 0.92. Retrosynthesis: random 0.03 / 0.12 / 0.04 / 0.03; expert system 0.05 / 0.30 / 0.06 / 0.11; logistic regression 0.64 / 0.95 / 0.77 / 0.62; highway network 0.77 / 0.98 / 0.86 / 0.77; FC512 ELU 0.78 / 0.98 / 0.87 / 0.78.
  Table 2 (8,720 automatically extracted rules), columns Acc / Top 10-Acc / MRR / W. Prec. Reaction prediction: random 0.00 / 0.00 / 0.00 / 0.00; expert system 0.02 / 0.18 / 0.02 / 0.06; logistic regression 0.41 / 0.65 / 0.49 / 0.31; highway network 0.78 / 0.98 / 0.86 / 0.77; FC512 ELU 0.77 / 0.97 / 0.85 / 0.76. Retrosynthesis: random 0.00 / 0.00 / 0.00 / 0.00; expert system 0.02 / 0.19 / 0.01 / 0.06; logistic regression 0.31 / 0.59 / 0.41 / 0.23; highway network 0.64 / 0.95 / 0.75 / 0.63; FC512 ELU 0.62 / 0.94 / 0.74 / 0.62.
  • 21. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Retrosynthesis planning example (Segler et al., ICLR 2017 workshop track, Figure 1): (a) retrosynthesis analyses the target (ibuprofen) in reverse, in the chemical representation, until a set of building blocks is found that can be bought or is already available; (b) the transformations (productions) applied in each analysis step; (c) synthesis: conduct the reactions in the lab following plan (a); (d) search-tree representation; (e) neural networks determine the actions.
  • 22. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Retrosynthesis as tree search: each node of the search tree is a set of molecules still to be made (S1 = {1}, S2 = {2, CO}, S3 = {3, CO, H2}, S4 = {4, CO, H2, Ac2O}), a neural network over a molecular descriptor (ECFP4) predicts $p(r_n \mid x)$, the probability of each transformation rule given the molecule, and the most probable rules are applied to expand the node. Monte Carlo Tree Search (MCTS) combines tree search with random sampling, and the networks are used in three of the four MCTS steps (selection, expansion, rollout). Selection uses a variation of the UCT function similar to AlphaGo's,
  $\arg\max_{v' \in \mathrm{children}(v)} \frac{Q(v')}{N(v')} + c\, P(a)\, \frac{\sqrt{N(v)}}{1 + N(v')}$,
  where $P(a)$ is the probability the policy network assigns to the action, $N(\cdot)$ are visit counts, $Q(v)$ is the accumulated reward at node $v$, and $c$ controls exploration. The policy network (top-1 accuracy 0.62) also restricts the branching factor to the 50 most promising of roughly 200 possible rules, so the expensive subgraph-isomorphism check of rule applicability is only run for those 50; during rollout, actions are sampled from the policy network.
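A minimal Python sketch of that selection rule; the child statistics, rule names, and the approximation of the parent visit count by the sum of its children's visits are illustrative assumptions.

```python
import math

def select_child(children, c_puct=1.5):
    """Pick the child maximising Q(v')/N(v') + c * P(a) * sqrt(N(v)) / (1 + N(v')),
    the AlphaGo-style UCT variant quoted on this slide."""
    # Approximate the parent visit count N(v) by the sum of its children's visits.
    n_parent = sum(ch["N"] for ch in children)
    def score(ch):
        exploit = ch["Q"] / ch["N"] if ch["N"] > 0 else 0.0
        explore = c_puct * ch["P"] * math.sqrt(n_parent) / (1 + ch["N"])
        return exploit + explore
    return max(children, key=score)

# Each child stores accumulated reward Q, visit count N, and the policy-network
# prior P(a) for the retrosynthetic rule leading to it (hypothetical numbers).
children = [
    {"rule": "Suzuki coupling", "Q": 3.0, "N": 5, "P": 0.40},
    {"rule": "Kumada coupling", "Q": 0.5, "N": 2, "P": 0.10},
    {"rule": "amide formation", "Q": 0.0, "N": 0, "P": 0.05},
]
print(select_child(children)["rule"])
```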
  • 23. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Edit-based candidate ranking, Coley et al. (ACS Central Science): the framework combines forward enumeration of candidate products with a learned ranking model trained to maximize the probability assigned to the recorded experimental outcome (their Figure 1). A candidate reaction is represented by the edits it requires at the reaction center: loss or gain of a hydrogen is encoded by 32 easy-to-compute features of that reactant atom, $a_i \in \mathbb{R}^{32}$, and loss or gain of a bond by the concatenation $[a_i, b_{ij}, a_j] \in \mathbb{R}^{68}$ of the two atom feature vectors and four bond features; the atom features are rapidly calculable structural and electronic descriptors of the local chemical environment. Each edit is embedded by an edit-type-specific network, the edit embeddings are summed into a candidate representation, and the candidate scores are softmax-normalised (with $k_B T = 1$) into probabilities. On USPTO reactions with 5-fold cross-validation, results are (their Table 1; top-n is the percentage of examples where the recorded product ranks within the top n candidates):
  random guess: loss 5.46, acc. 0.8%, top-3 2.3%, top-5 3.8%, top-10 7.6%
  baseline (fingerprint-based): loss 3.28, acc. 33.3%, top-3 48.2%, top-5 55.8%, top-10 65.9%
  edit-based: loss 1.34, acc. 68.5%, top-3 84.8%, top-5 89.4%, top-10 93.6%
  hybrid (edit-based + product fingerprints): loss 1.21, acc. 71.8%, top-3 86.7%, top-5 90.8%, top-10 94.6%
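A small Python sketch of the final ranking step, softmax-normalising raw candidate scores into probabilities; the scores below are invented for illustration, not outputs of the paper's model.

```python
import numpy as np

def rank_candidates(scores):
    """Softmax-normalise raw candidate-outcome scores (treated like pseudo
    free energies with kT = 1) and return candidates ranked by probability."""
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    return order, probs[order]

# Hypothetical scores produced by a scoring network for 4 candidate products.
order, probs = rank_candidates([2.1, -0.3, 0.8, -1.5])
print(order, probs.round(3))
```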
  • 24. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n • • •
  • 25. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Mechanistic reaction prediction (Journal of Chemical Information and Modeling), overall framework (their Figure 2): (a) the user inputs the reactants and conditions; (b) potential electron donors and acceptors are identified using coarse approximations of electron-filled and -unfilled MOs; (c) highly sensitive reactive-site classifiers are trained and used to filter out …
  • 26. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n y Fx • R pt y • y CG R pt C FoKe • ptn cln N C R i E R i F s 1 • ptn cln NF s v y r ah • g + 1 aoKFr • 1 y
  • 27. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Sequence-to-sequence reaction prediction (SMILES-to-SMILES with attention): the model translates reactant/reagent SMILES strings into the product SMILES, and attention captures long-range dependencies between tokens, since functional groups, solvents, or catalysts far from the reaction center can still affect the outcome. For example, the network learns to focus on the C[O-] molecule, map its [O-] to the product oxygen, and ignore the Br that is replaced (reaction 120 from Jin's USPTO test set: Brc1cncc(Br)c1.C[O-]>CN(C)C=O.[Na+]>COc1cncc(Br)c1). Results:
  Jin's USPTO test set: 38,648 reactions, BLEU 95.9, ROUGE 96.0, top-1 accuracy 83.2%
  Lowe's test set: 50,258 reactions, BLEU 90.3, ROUGE 90.9, top-1 accuracy 65.4%
  Comparison with the Weisfeiler-Lehman difference networks (WLDN) of Jin et al. on their USPTO test set, counting the 1,352 multiple-product reactions as false predictions for the seq2seq model: WLDN top-1 74.0, top-3 86.7, top-5 89.5; seq2seq model top-1 80.3, top-2 84.7, top-3 86.2, top-5 87.5 (%). The seq2seq model outperforms WLDN on top-1 by 6.3 points but, being trained to predict only the single best outcome, is beaten on top-3 and top-5; decoding takes about 25 ms per reaction with a beam search of 10.
  • 28. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. n C n / C n n
  • 29.–34. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Slides 29–34: step-by-step illustration of the Weisfeiler-Lehman Difference Network (WLDN) on a small example graph with atoms u1–u5.]
  • 35. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. The Weisfeiler-Lehman Network (WLN) expressed in the MPNN framework: the message function is $M_t(h_v^t, h_w^t, e_{vw}) = \tau(V\,\mathrm{CONCAT}(h_w^t, e_{vw}))$ and the vertex update function is $U_t(h_v^t, m_v^{t+1}) = \tau(U_1 h_v^t + U_2 m_v^{t+1})$, where $\tau$ is a non-linearity and $V$, $U_1$, $U_2$ are learned matrices shared across layers.
  • 36. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. WL Network (Jin et al.): the discrete WL relabeling process does not directly generalize to continuous feature vectors, so a neural network embeds the computations of the WL test. A node $v$ with neighbor nodes $N(v)$, node features $f_v$, and edge features $f_{uv}$ is "relabeled" according to $r(v) = \tau(U_1 f_v + U_2 \sum_{u \in N(v)} \tau(V[f_u, f_{uv}]))$, and this operation is applied iteratively to obtain context-dependent atom vectors $h_v^{(l)} = \tau\big(U_1 h_v^{(l-1)} + U_2 \sum_{u \in N(v)} \tau(V[h_u^{(l-1)}, f_{uv}])\big)$ for $1 \le l \le L$, with $h_v^{(0)} = f_v$ and $U_1$, $U_2$, $V$ shared across layers. The final atom representations mimic the set comparison of the WL isomorphism test, $c_v = \sum_{u \in N(v)} W^{(0)} h_u^{(L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(L)}$, which matches each rank-1 edge tensor $h_u^{(L)} \otimes f_{uv} \otimes h_v^{(L)}$ against a learned set of reference edges, i.e. $c_v[k] = \sum_{u \in N(v)} \langle W^{(0)}[k] \otimes W^{(1)}[k] \otimes W^{(2)}[k],\ h_u^{(L)} \otimes f_{uv} \otimes h_v^{(L)} \rangle$, where $W[k]$ is the $k$-th row of $W$. The resulting $c_v$ captures the local chemical environment of the atom against learned reference environments, and the whole-graph representation is simply $c_G = \sum_v c_v$.
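A compact numpy sketch of this continuous relabeling and the rank-1 set comparison, run on a random 4-atom chain; the dimensions and all weight matrices are illustrative placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(6)
num_atoms, d, d_e, L = 4, 6, 3, 2
bonds = {(0, 1), (1, 2), (2, 3)}                 # a 4-atom chain
f_edge = {e: rng.normal(size=d_e) for e in bonds}
f_atom = rng.normal(size=(num_atoms, d))

U1 = rng.normal(size=(d, d)) * 0.1
U2 = rng.normal(size=(d, d)) * 0.1
V = rng.normal(size=(d + d_e, d)) * 0.1
W0, W2m = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W1m = rng.normal(size=(d_e, d)) * 0.1

def nbrs(v):
    return [u for u in range(num_atoms) if (u, v) in bonds or (v, u) in bonds]

def edge(u, v):
    return f_edge[(u, v)] if (u, v) in f_edge else f_edge[(v, u)]

tau = np.tanh
h = f_atom.copy()
for _ in range(L):                               # continuous WL "relabeling"
    new_h = np.zeros_like(h)
    for v in range(num_atoms):
        agg = sum(tau(np.concatenate([h[u], edge(u, v)]) @ V) for u in nbrs(v))
        new_h[v] = tau(U1 @ h[v] + U2 @ agg)
    h = new_h

# Final atom representation c_v: compare each edge against learned reference environments.
c = np.zeros((num_atoms, d))
for v in range(num_atoms):
    for u in nbrs(v):
        c[v] += (W0 @ h[u]) * (edge(u, v) @ W1m) * (W2m @ h[v])
print(c.shape)
```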
  • 37. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Figure: the Weisfeiler-Lehman Difference Network (WLDN) applied to the example graph with atoms u1–u5.]
  • 38. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Finding reaction centers with the WLN (Jin et al., Sec. 3.1.2): two models predict the pairwise reactivity label $y_{uv}$ from the WLN atom representations $c_u$, $c_v$. The local model scores each atom pair directly; the global model additionally incorporates distal chemical effects (atoms outside the reaction center, e.g. reagents, can be necessary for the reaction to occur) through an attention mechanism, as detailed on the following slides. Both models are trained to minimize $L(T) = -\sum_{R \in T} \sum_{u \neq v \in R} y_{uv} \log(s_{uv}) + (1 - y_{uv}) \log(1 - s_{uv})$, with each of the $O(N^2)$ pair labels predicted independently, since higher-order dependencies between pairs would be prohibitively expensive; independent prediction nevertheless works well. The top-$K$ atom pairs by predicted reactivity are designated the reaction center, and candidate products are enumerated from all bond-configuration changes within that set, pruned by valence and connectivity constraints, giving a more compact candidate set than template enumeration without sacrificing coverage. A small numpy sketch of the pairwise scoring appears after slide 41.
  • 39. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Local model: the reactivity score of an atom pair $(u, v)$ is computed from the WLN atom representations by another small network, $s_{uv} = \sigma\big(u^\top \tau(M_a c_u + M_a c_v + M_b b_{uv})\big)$, where $\sigma$ is the sigmoid function and $b_{uv}$ is an additional feature vector encoding auxiliary information about the pair, such as whether the two atoms are in different molecules or which type of bond connects them.
  • 40. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [Diagram: the pairwise reactivity score $s_{uv}$ is computed for every atom pair of the reactants from the WLN atom representations $c_u$, $c_v$ and the pair feature vector $b_{uv}$.]
  • 41. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: global model, attention-weighted context vectors c̃_u over all reactant atoms feeding the reactivity score; the paper excerpt repeated on this slide is omitted]
  • 42. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: Weisfeiler-Lehman Difference Network (WLDN), candidate product graphs over atoms u1...u5 derived from the reactant graph]
  • 43. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: where candidate ranking sits in the pipeline; figure: overview of the approach, (1) a model is trained to identify pairwise atom interactions]
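As context for the ranking step, the sketch below illustrates the candidate-generation procedure described on slide 39 (take the top-K scored pairs as the reaction center, enumerate bond changes, prune with constraints). The pair scores, bond orders, and valence limits are toy assumptions, and the connectivity constraint is omitted for brevity.

```python
# Toy sketch of candidate generation (illustration only, not the paper's code).
from itertools import product

def enumerate_candidates(pair_scores, current_bonds, max_valence, K=6,
                         bond_orders=(0, 1, 2, 3)):
    """pair_scores: {(u, v): predicted reactivity score}
       current_bonds: {(u, v): bond order in the reactants}
       max_valence: {atom index: maximum allowed total bond order}"""
    center = sorted(pair_scores, key=pair_scores.get, reverse=True)[:K]
    candidates = []
    for orders in product(bond_orders, repeat=len(center)):
        edits = {p: o for p, o in zip(center, orders)
                 if o != current_bonds.get(p, 0)}          # keep only real changes
        if not edits:
            continue
        bonds = dict(current_bonds)
        bonds.update(dict(zip(center, orders)))
        # Valence constraint: total bond order per atom must not exceed its maximum.
        valence = {}
        for (u, v), o in bonds.items():
            valence[u] = valence.get(u, 0) + o
            valence[v] = valence.get(v, 0) + o
        if all(valence[a] <= max_valence.get(a, 4) for a in valence):
            candidates.append(edits)
    return candidates

scores = {(0, 1): 0.9, (1, 2): 0.7, (0, 2): 0.2}            # made-up reactivity scores
cands = enumerate_candidates(scores, current_bonds={(0, 1): 1},
                             max_valence={0: 4, 1: 4, 2: 4}, K=2)
print(len(cands), "candidate edit sets")
```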
  • 44. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: WLDN candidate product graphs, as on slide 42]
  • 45. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: difference vectors over atoms v1...v5 summed into a single candidate representation] For each candidate pair $(r, p_i)$, an atom-level difference vector is formed, $d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}$. The simpler ranking model pools these by summation and scores the candidate with one more layer, $s(p_i) = \mathbf{u}^T \tau(M \sum_{v \in p_i} d_v^{(p_i)})$ (9).
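A minimal sketch of this naive sum-pooling scorer, assuming $\tau$ = ReLU, random parameters, and made-up atom vectors; shapes and names are illustrative only.

```python
# Sketch of the sum-pooling ranking baseline (eq. (9)), not the authors' code.
import numpy as np

rng = np.random.default_rng(1)
n_atoms, d = 5, 8
c_reactant = rng.normal(size=(n_atoms, d))   # atom vectors of the reactants (atom-mapped)
c_candidate = rng.normal(size=(n_atoms, d))  # atom vectors of one candidate product

M = rng.normal(size=(d, d))
u_vec = rng.normal(size=d)

diff = c_candidate - c_reactant              # d_v^(p_i): nonzero mainly near the reaction center
score = u_vec @ np.maximum(M @ diff.sum(axis=0), 0.0)   # s(p_i) = u^T tau(M sum_v d_v)
print(float(score))
```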
  • 46. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: candidate products generated from the predicted reaction center, with per-pair bond labels y ∈ {0,1,2,3} and difference vectors d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}]
  • 47. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: difference graph D(r, p_i) with atom features replaced by difference vectors]
Candidate ranking: each candidate pair $(r, p_i)$ is scored by one of two models. The first model naively sums the difference vectors of all atom representations obtained from a WLN run on the candidate; the second, improved model, the WLDN, instead operates on a difference graph built from these vectors. We define the difference vector $d_v^{(p_i)}$ pertaining to atom $v$ as
$$d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}; \qquad s(p_i) = \mathbf{u}^T \tau\Big(M \sum_{v \in p_i} d_v^{(p_i)}\Big) \quad (9)$$
Recall that the reactants and products are atom-mapped, so we can use $v$ to refer to the same atom. The pooling operation is a simple sum over these difference vectors, resulting in a single vector for each $(r, p_i)$ pair. This vector is then fed into another neural network to score the candidate product $p_i$.
Weisfeiler-Lehman Difference Network (WLDN). Instead of simply summing all difference vectors, the WLDN operates on another graph called a difference graph. A difference graph $D(r, p_i)$ is defined as a molecular graph which has the same atoms and bonds as $p_i$, with atom $v$'s feature vector replaced by $d_v^{(p_i)}$. Operating on the difference graph has several benefits. First, in $D(r, p_i)$, atom $v$'s feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, $D(r, p_i)$ explicates neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector by applying a separately parameterized WLN on top of $D(r, p_i)$:
$$h_v^{(p_i, l)} = \tau\Big(U_1 h_v^{(p_i, l-1)} + U_2 \sum_{u \in N(v)} \tau\big(V[h_u^{(p_i, l-1)}, f_{uv}]\big)\Big), \quad 1 \le l \le L \quad (10)$$
$$d_v^{(p_i, L)} = \sum_{u \in N(v)} W^{(0)} h_u^{(p_i, L)} \odot W^{(1)} f_{uv} \odot W^{(2)} h_v^{(p_i, L)} \quad (11)$$
where $h_v^{(p_i, 0)} = d_v^{(p_i)}$. The final score of $p_i$ is $s(p_i) = \mathbf{u}^T \tau(M \sum_{v \in p_i} d_v^{(p_i, L)})$.
Training. Both models are trained to minimize the softmax log-likelihood objective over the scores $\{s(p_0), s(p_1), \dots, s(p_m)\}$, where $s(p_0)$ corresponds to the target.
4 Experiments. Data: as a source of data for our experiments, we used reactions from USPTO granted patents, collected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of 480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K, and 40K for training, development, and testing purposes. In addition, for comparison purposes we report results on the subset of 15K reactions from this dataset (referred to as USPTO-15K) used by Coley et al. [3]. They selected this subset to include reactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and 3K for training, development, and testing.
Setup for Reaction Center Identification: the output of this component consists of the K atom pairs with the highest reactivity scores.
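The following is a rough NumPy sketch of one WLDN pass over the difference graph (eqs. (10)-(11)), assuming $\tau$ = ReLU, a toy chain molecule, and randomly initialised parameters; it illustrates the equations and is not the authors' implementation.

```python
# Toy WLDN forward pass over a difference graph D(r, p_i).
import numpy as np

rng = np.random.default_rng(2)
n, d, d_bond, L = 5, 8, 3, 2
diff = rng.normal(size=(n, d))               # h_v^(p_i, 0) = d_v^(p_i)
f = rng.normal(size=(n, n, d_bond))          # f_uv: bond features of the candidate
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}   # assumed chain molecule

U1 = rng.normal(size=(d, d)); U2 = rng.normal(size=(d, d))
V = rng.normal(size=(d, d + d_bond))
W0 = rng.normal(size=(d, d)); W1 = rng.normal(size=(d, d_bond)); W2 = rng.normal(size=(d, d))
M = rng.normal(size=(d, d)); u_vec = rng.normal(size=d)
relu = lambda x: np.maximum(x, 0.0)

h = diff.copy()
for _ in range(L):                           # eq. (10): message passing on the difference graph
    h_new = np.zeros_like(h)
    for v in range(n):
        msg = sum(relu(V @ np.concatenate([h[u], f[u, v]])) for u in neighbors[v])
        h_new[v] = relu(U1 @ h[v] + U2 @ msg)
    h = h_new

d_final = np.zeros_like(h)                   # eq. (11): elementwise product of three projections
for v in range(n):
    d_final[v] = sum((W0 @ h[u]) * (W1 @ f[u, v]) * (W2 @ h[v]) for u in neighbors[v])

score = u_vec @ relu(M @ d_final.sum(axis=0))   # s(p_i) = u^T tau(M sum_v d_v^(p_i, L))
print(float(score))
# Candidates p_0 ... p_m would be ranked by these scores, trained with a
# softmax log-likelihood objective where p_0 is the recorded product.
```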
We compute the coverage as the proportion of reactions where all atom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is found in the model-generated candidate set. The model features reflect basic chemical properties of atoms and bonds. Atom-level features include the elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence, and aromaticity. Bond-level features include the bond type (single, double, triple, or aromatic), whether it is conjugated, and whether the bond is part of a ring. Both our local and global models are built upon a Weisfeiler-Lehman Network with a fixed unrolled depth.
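A small sketch of how the coverage metric above could be computed, assuming the predicted top-K pairs and the true reaction-center pairs are available as sets of atom-index pairs; the data below is a toy assumption.

```python
# Coverage: fraction of reactions whose true reaction-center pairs are all
# contained in the model's top-K predicted pairs (illustrative sketch).
def coverage(predicted_topk_pairs, true_center_pairs):
    """Both arguments: one set of (u, v) atom pairs per reaction."""
    hits = sum(true <= pred for pred, true in zip(predicted_topk_pairs, true_center_pairs))
    return hits / len(true_center_pairs)

pred = [{(1, 2), (2, 3), (4, 5)}, {(1, 2), (7, 8)}]
true = [{(1, 2), (2, 3)},         {(3, 4)}]
print(coverage(pred, true))   # 0.5 for this toy example
```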
  • 48. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [figure: WLDN applied to the difference graph of each candidate; per-pair bond labels y ∈ {0,1,2,3}]
  • 49. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 50. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 51. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 52. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Figure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle). Reactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue circle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the local model does not even include it in the top ten predictions. The attention map of the global model shows that atoms 1, 25, and 26 were determinants of atom 4's predicted reactivity.

Table 1: Results on USPTO-15K and USPTO datasets.
(a) Reaction center prediction performance. Coverage is reported by picking the top K (K = 6, 8, 10) reactivity pairs; |θ| is the number of model parameters.
  USPTO-15K   Method   |θ|     K=6    K=8    K=10
              Local    572K    80.1   85.0   87.7
              Local    1003K   81.6   86.1   89.1
              Global   756K    86.7   90.1   92.2
  USPTO       Local    572K    83.0   87.2   89.6
              Local    1003K   82.4   86.7   89.1
              Global   756K    89.8   92.0   93.3
  Avg. num. of candidates (USPTO)
              Template   -     482.3 out of 5006
              Global     -     60.9   246.5  1076

(b) Candidate ranking performance. Precision at ranks 1, 3, 5 is reported; (*) denotes that the true product was added if not covered by the previous stage.
  USPTO-15K   Method        Cov.    P@1    P@3    P@5
              Coley et al.  100.0   72.1   86.6   90.7
              WLN           90.1    74.9   84.6   86.3
              WLDN          90.1    76.7   85.6   86.8
              WLN (*)       100.0   81.4   92.5   94.8
              WLDN (*)      100.0   84.1   94.1   96.1
  USPTO       WLN           92.0    73.5   86.1   89.0
              WLDN          92.0    74.0   86.7   89.5
              WLN (*)       100.0   76.7   91.0   94.6
              WLDN (*)      100.0   77.8   91.9   95.4
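For reference, a small sketch of the precision-at-K measure reported in Table 1(b): a reaction counts as correct at rank K if the recorded product appears among the top-K ranked candidates. The ranked lists below are invented toy data.

```python
# Precision@K over ranked candidate products (illustrative sketch).
def precision_at_k(ranked_candidates, true_products, k):
    """ranked_candidates: one ranked list (best first) per reaction;
       true_products: the recorded product for each reaction."""
    correct = sum(truth in ranked[:k]
                  for ranked, truth in zip(ranked_candidates, true_products))
    return 100.0 * correct / len(true_products)

ranked = [["P1", "P2", "P3"], ["Q2", "Q1", "Q3"]]
truth = ["P1", "Q1"]
for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(ranked, truth, k):.1f}")
```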
  • 53. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: discussion of reaction center identification (Table 1a); accuracy is compared against the top-performing template-based approach, which employs frequency-based heuristics to construct reaction templates. The table excerpt repeated on this slide is omitted.]
  • 54. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 55. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. [slide: discussion of candidate ranking (Table 1b), WLN vs. WLDN precision at ranks 1, 3, 5 on USPTO-15K and USPTO. The table excerpt repeated on this slide is omitted.]
  • 56. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. Human Evaluation Setup: here we describe in detail the human evaluation results in Table 3. The evaluation dataset consists of eight groups, defined by the reaction template popularity as binned in Figure 4b, each with ten instances selected randomly from the USPTO test set. We invited in total ten chemists to predict the product given reactants in all groups.
Table 3: Human and model performance on 80 reactions randomly selected from the USPTO test set to cover a diverse range of reaction types. The first eight are experts with rich experience in organic chemistry (graduate and postdoctoral chemists) and the last two are graduate students in chemical engineering who use organic chemistry concepts regularly but have less formal training. Our model performs at the expert chemist level in terms of top-1 accuracy.
  Chemists (top-1 accuracy, %): 56.3, 50.0, 40.0, 63.8, 66.3, 65.0, 40.0, 58.8, 25.0, 16.3
  Our Model: 69.1
Figure 5: Details of human performance. (a) Histogram showing the distribution of question difficulties as evaluated by the average expert performance across all ten performers. (b) Comparison of model performance against human performance for sets of questions as grouped by the average human accuracy shown in Figure 5a.
  • 57. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
  • 58. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
n Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network
• Jin, Wengong, et al. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. Advances in Neural Information Processing Systems.
n Template-based
• Wei, Jennifer, David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions.
• Segler, Marwin H., and Mark Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction.
• Segler, Marwin, Mike Preuss, and Mark Waller. Towards "AlphaChem": chemical synthesis planning with tree search and deep neural network policies.
• Coley, Connor, et al. Prediction of organic reaction outcomes using machine learning.
n Template-free
• Kayala, Matthew A., et al. Learning to predict chemical reactions.
• Schwaller, Philippe, et al. "Found in Translation": predicting outcome of complex organic chemistry reactions using neural sequence-to-sequence models. arXiv preprint.
  • 59. Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
n Graph representation
• Rogers, David, and Mathew Hahn. Extended-connectivity fingerprints.
• Shervashidze, Nino, et al. Weisfeiler-Lehman graph kernels.
n Graph NN
• Gilmer, Justin, et al. Neural message passing for quantum chemistry.
• Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints.
• Li, Yujia, Tarlow, Daniel, Brockschmidt, Marc, and Zemel, Richard. Gated graph sequence neural networks.
• Schütt, Kristof, et al. Quantum-chemical insights from deep tensor neural networks. Nature Communications.
• Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for sets. ICLR.