Materials Informatics and Python

PyData 1Day Conference
2018-10-20

■
■ IT
■  
■
■  
■
■ twitter: @sfchaos @shifukushima 
1

2
2000 2006 2009 2018
• •
• etc.
•
•
•
•
• etc.
2014
•
•
“ ”

■Materials Informatics
■
■
■Materials Informatics Python
3

1.
2. Materials Informatics
3. Python Materials Informatics
4.
5.
4

Materials Informatics
■
5
http://cms.mtl.kyoto-u.ac.jp/informatics.html

Materials Informatics
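■ 2011: Materials Genome Initiative (United States)
https://www.mgi.gov/
■ 2015: NIMS "Materials research by Information Integration" Initiative (MI2I)
http://www.nims.go.jp/MII-I/
NIMS: National Institute for Materials Science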
7

■
■
■
■  
 
(Li 2 )
■ TDK
■
■ NEC
■  
etc.
8
 
https://tech.nikkeibp.co.jp/atcl/nxt/mag/ne/18/00030/00001/

■ NIPS
9
NIPS 2018 workshop: Machine Learning for Molecules and Materials
http://www.quantum-machine.org/workshops/nips2018draft/
NIPS 2017 workshop: Machine Learning for Molecules and Materials
http://www.quantum-machine.org/workshops/nips2017/

1.
4.
5.
10

SMILES
■
11
D. Weininger, J. Chem. Inf. Comput. Sci., 28, 31 (1988)
Benzene (C6H6):
c1ccccc1
Benzoic acid (C6H5COOH):
OC(=O)c1ccccc1
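As a rough illustration of how a SMILES string packs atoms, bonds, branches, and ring closures into plain text, the sketch below tokenizes the two strings above with a regular expression. The token pattern is a simplified assumption that covers only a small organic subset, not the full SMILES grammar, and `tokenize_smiles` and `count_atoms` are hypothetical helper names, not part of any library.

```python
import re

# Simplified token pattern (an assumption, not the full SMILES grammar):
# bracket atoms, two-letter halogens, one-letter organic-subset atoms,
# aromatic atoms, bond symbols, branch parentheses, and ring-closure digits.
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Cl|Br|[BCNOSPFI]|[bcnosp]|[=#\-/\\.:]|[()]|%\d{2}|\d"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; reject anything the pattern misses."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return tokens


def count_atoms(tokens):
    """Count atom tokens (element letters or bracket atoms); hydrogens stay implicit."""
    return sum(1 for tok in tokens if tok[0].isalpha() or tok[0] == "[")


benzene = tokenize_smiles("c1ccccc1")
benzoic_acid = tokenize_smiles("OC(=O)c1ccccc1")
print(benzene)
print(count_atoms(benzene), count_atoms(benzoic_acid))
```

In practice a cheminformatics toolkit such as RDKit handles parsing and validation; the regular expression here only shows what the characters mean.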

SMILES
■
12
D. Weininger, J. Chem. Inf. Comput. Sci., 28, 31 (1988)
https://ja.wikipedia.org/wiki/SMILES%E8%A8%98%E6%B3%95

Fingerprint
■
14
:
• D.Rogers and M.Hahn, J.Chem.Inf.Model.,50(5), 742(2010) 
https://pubs.acs.org/doi/10.1021/ci100050t
• 94  
https://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf

1.
4.
5.
15

RDKit
■
■ C++
■Python
16
https://www.rdkit.org/

RDKit
■Anaconda
17
$ conda install -c rdkit rdkit
Anaconda
https://www.rdkit.org/docs/Install.html

RDKit
■
■
■  
Getting Started with the RDKit in Python
https://www.rdkit.org/docs/GettingStartedInPython.html
■RDKit  
https://future-chem.com/rdkit-intro/
■  
https://github.com/chemo-wakate
18

■
23
https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors

■
25
1. 2. 3.
4.  
 
5.
 
Molecular neural network models with RDKit and Keras in Python
http://www.wildcardconsulting.dk/useful-information/molecular-neural-network-models-with-rdkit-and-keras-in-python/
Solubility prediction of compounds with Keras (neural network)
http://www.ag.kagawa-u.ac.jp/charlesy/2017/07/21/keras%E3%81%A7%E5%8C%96%E5%90%88%E7%89%A9%E3%81%AE%E6%BA%B6%E8%A7%A3%E5%BA%A6%E4%BA%88%E6%B8%AC%EF%BC%88%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC/

1.  
RDKit github
26
https://github.com/rdkit/rdkit/blob/master/Docs/Book/data
solubility.train.sdf (training data)
solubility.test.sdf (test data)
※ SDF (Structure-Data File)
: SDF  
https://www.chem-station.com/blog/2012/04/sdf.html

1.  
SDF (solubility.train.sdf)
27
n-pentane
RDKit 2D
5 4 0 0 0 0 0 0 0 0999 V2000
0.2606 0.1503 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.9000 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.9394 0.1503 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
M END
> <ID> (1)
1
> <NAME> (1)
n-pentane
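To make the layout of the record above concrete, here is a minimal sketch that reads the counts line, the atom block, and the bond block of a V2000 mol block using only the standard library. It assumes whitespace-separated fields, which holds for small molecules like this one but not for the fixed-width format in general, and it assumes the usual three header lines (name, program, comment). `parse_molblock` is a hypothetical helper; with RDKit installed, a reader such as `Chem.SDMolSupplier` would normally be used instead.

```python
MOLBLOCK = """n-pentane
     RDKit          2D

  5  4  0  0  0  0  0  0  0  0999 V2000
    0.2606    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.9000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9394    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  4  1  0
  4  5  1  0
M  END
"""


def parse_molblock(text):
    """Return (name, atoms, bonds) from a small V2000 mol block."""
    lines = text.splitlines()
    name = lines[0].strip()
    # Line 4 is the counts line: number of atoms, then number of bonds.
    counts = lines[3].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        begin, end, order = (int(v) for v in line.split()[:3])
        bonds.append((begin, end, order))  # atom indices are 1-based
    return name, atoms, bonds


name, atoms, bonds = parse_molblock(MOLBLOCK)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```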

1.
4.
5.
33

■ :
■  
https://www.jstage.jst.go.jp/article/cicsj/36/1/36_9/_pdf/-char/ja
■ Deep Learning  
https://kivantium.net/deep-for-chem
35

■
■ORGAN / ORGANIC
■MolGAN
■ChemTS
36

GAN
■
37
https://medium.com/@devnag/generative-adversarial-networks-gans-in-50-lines-of-code-pytorch-e81b79659e3f

SeqGAN
38
L.Yu, et al., AAAI2017.
https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14344/14489
https://github.com/LantaoYu/SeqGAN
State: the tokens generated up to step t-1
Action: the token generated at step t

ORGAN
■SeqGAN
39
Finally in SeqGAN the reward function is provided by D.
4 ORGAN
Figure 1: Schema for ORGAN. Left: D is trained as a classifier receiving as input a mix of real data and generated data by G. Right: G is trained by RL where the reward is a combination of D and the objectives, and is passed back to the policy function via Monte Carlo sampling. We penalize non-unique sequences.
Figure 1 illustrates the main idea of ORGAN. To take into account domain-specific desired objectives Oi, we extend the
SeqGAN
SMILES
G.Guimaraes et al.(2017)
https://arxiv.org/abs/1705.10843
https://github.com/gablg1/ORGAN
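The reward described in the ORGAN caption (a combination of the discriminator D and the domain objectives, with a penalty for non-unique sequences) can be sketched as below. The mixing weight `lam`, the linear form of the combination, and the choice to divide a repeated sequence's reward by its number of copies are assumptions made for illustration; `organ_rewards` is a hypothetical function name, and the scores are made-up numbers rather than outputs of a trained model.

```python
from collections import Counter


def organ_rewards(sequences, d_scores, o_scores, lam=0.5):
    """Mix discriminator and objective scores, penalizing repeated sequences.

    lam = 1.0 uses only the discriminator score (SeqGAN-like behaviour);
    lam = 0.0 uses only the objective score (plain RL on the objective).
    """
    copies = Counter(sequences)
    rewards = []
    for seq, d, o in zip(sequences, d_scores, o_scores):
        mixed = lam * d + (1.0 - lam) * o
        rewards.append(mixed / copies[seq])  # assumed form of the uniqueness penalty
    return rewards


batch = ["CCO", "c1ccccc1", "CCO"]
rewards = organ_rewards(batch, d_scores=[0.8, 0.6, 0.8], o_scores=[0.4, 1.0, 0.4])
print(rewards)
```

The duplicated `CCO` ends up with a lower reward than it would get alone, which nudges the generator away from collapsing onto a few sequences.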

ORGAN
40
Objective         Algorithm   Validity (%)   Diversity   Druglikeliness   Synthesizability   Solubility
-                 MLE         75.9           0.64        0.48 (0%)        0.23 (0%)          0.30 (0%)
-                 SeqGAN      80.3           0.61        0.49 (2%)        0.25 (6%)          0.31 (3%)
Druglikeliness    ORGAN       88.2           0.55        0.52 (8%)        0.32 (38%)         0.35 (18%)
Druglikeliness    OR(W)GAN    85.0           0.95        0.60 (25%)       0.54 (130%)        0.47 (57%)
Druglikeliness    Naive RL    97.1           0.8         0.57 (19%)       0.53 (126%)        0.50 (67%)
Synthesizability  ORGAN       96.5           0.92        0.51 (6%)        0.83 (255%)        0.45 (52%)
Synthesizability  OR(W)GAN    97.6           1.00        0.20 (-59%)      0.75 (223%)        0.84 (184%)
Synthesizability  Naive RL    97.7           0.96        0.52 (8%)        0.83 (256%)        0.46 (54%)
Solubility        ORGAN       94.7           0.76        0.50 (4%)        0.63 (171%)        0.55 (85%)
Solubility        OR(W)GAN    94.1           0.90        0.42 (-12%)      0.66 (185%)        0.54 (81%)
Solubility        Naive RL    92.7           0.75        0.49 (3%)        0.70 (200%)        0.78 (162%)
All/Alternated    ORGAN       96.1           92.3        0.52 (9%)        0.71 (206%)        0.53 (79%)
Table 1: Evaluation of metrics, on several generative algorithms and optimized for different objectives for molecules. Reported values are mean values of valid generated molecules. The percentage of improvement over the MLE baseline is reported in parenthesis. Values shown in bold indicate significant improvement. Shaded cell indicates direct optimized objectives.
Table 2 shows quantitative results comparing ORGAN to other baseline methods optimizing for three different metrics. ORGAN outperforms SeqGAN and MLE in all of the three metrics. Naive RL achieves a higher score than ORGAN for Ratio of Steps metric, but it under-performs in terms of
Druglikeliness, Synthesizability, Solubility

ORGANIC
■ORGAN
41
Methods
Figure 1: Usage of ORGANIC illustrated. In the training procedure we show the three fundamental components: a generator, a discriminator, and a reinforcement metric. Arrows indicate the flow of inputs and outputs between networks.
B. Sanchez-Lengeling, et al. (2017)
https://chemrxiv.org/articles/ORGANIC_1_pdf/5309668
https://github.com/aspuru-guzik-group/ORGANIC

MolGAN
■
■SMILES
42
MolGAN: An implicit generative model for small molecular graphs
[Figure: z ~ p(z) → Generator → Molecular graph → Discriminator (0/1) and Reward network (0/1); x ~ pdata(x) → Discriminator]
N. De Cao and T. Kipf (2018)

MolGAN
■
43
MolGAN: An implicit generative model for small molecular graphs
[Figure 2 diagram: z ~ p(z) → Generator → adjacency tensor A and annotation matrix X → sampled Ã and X̃ → Graph (Molecule) → GCN-based Discriminator (0/1) and Reward network (0/1)]
Figure 2. Outline of MolGAN. From left: the generator takes a sample from a prior distribution and generates a dense adjacency tensor A and an annotation matrix X. Subsequently, sparse and discrete Ã and X̃ are obtained from A and X respectively via categorical sampling. The combination of Ã and X̃ represents an annotated molecular graph which corresponds to a specific chemical compound. Finally, the graph is processed by both the discriminator and reward networks that are invariant to node order permutations and based on Relational-GCN (Schlichtkrull et al., 2017) layers.
loss and the RL loss: passing them to D and ˆR in order to make the gen-
eration stochastic while still forwarding continuous ob-
N.D.Cao and T.Kipf(2018)
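The categorical sampling step in the MolGAN caption (dense A and X turned into discrete Ã and X̃) can be sketched with the standard library alone. This is only an illustration under stated assumptions: probabilities are stored as nested lists, `A[i][j]` is a distribution over bond types with index 0 meaning "no bond", bonds are sampled once per unordered atom pair and mirrored, and the diagonal is left empty. `sample_one_hot` and `sample_graph` are hypothetical names, and nothing here reproduces the authors' implementation.

```python
import random


def sample_one_hot(probs, rng):
    """Draw one category from `probs` and return it as a one-hot list."""
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == k else 0 for i in range(len(probs))]


def sample_graph(X, A, seed=0):
    """Sample discrete atom types (X_tilde) and bond types (A_tilde)."""
    rng = random.Random(seed)
    n = len(X)
    n_bond_types = len(A[0][0])
    X_tilde = [sample_one_hot(row, rng) for row in X]
    A_tilde = [[[0] * n_bond_types for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bond = sample_one_hot(A[i][j], rng)
            A_tilde[i][j] = bond
            A_tilde[j][i] = list(bond)  # keep the graph undirected
    return X_tilde, A_tilde


# Three atoms; atom types [C, N, O]; bond types [none, single, double].
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.2, 0.3, 0.5]]
uniform = [1 / 3, 1 / 3, 1 / 3]
A = [[uniform for _ in range(3)] for _ in range(3)]
A[0][1] = [0.0, 1.0, 0.0]  # force a single bond between atoms 0 and 1
X_tilde, A_tilde = sample_graph(X, A, seed=42)
print(X_tilde)
print(A_tilde[0][1], A_tilde[0][2], A_tilde[1][2])
```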

SMILES
■ SMILES
44
Grammar Variational Autoencoder
[Figure 1 diagram, boxes 1 to 6:
1. SMILES grammar (subset): smiles → chain; chain → chain, branched atom; chain → branched atom; branched atom → atom, ringbond; branched atom → atom; atom → aromatic organic; atom → aliphatic organic; ringbond → digit; digit → '1'; digit → '2'; aromatic organic → 'c'; aliphatic organic → 'C'; aliphatic organic → 'N'; ...
2. input SMILES: 'c1ccccc1'
3. form parse tree
4. extract rules
5. convert to 1-hot vectors
6. map to latent space]
Figure 1. The encoder of the GVAE. We denote the start rule in blue and all rules that decode to terminal in green. See text for details.
tion rules. We describe how the GVAE works using a simple example.
Encoding. Consider a subset of the SMILES grammar as shown in Figure 1, box 1. These are the possible production rules that can be used for constructing a molecule. Imagine we are given as input the SMILES string for benzene: ‘c1ccccc1’. Figure 1, box 2 shows this molecule. To encode this molecule into a continuous latent representation we begin by using the SMILES grammar to parse this string into a parse tree (partially shown in box 3). This tree describes how ‘c1ccccc1’ is generated by the grammar. We decompose this tree into a sequence of production rules by performing a pre-order traversal on the branches of the parse tree going from left-to-right, shown in box 4. We convert these rules into 1-hot indicator vectors, where each dimension corresponds to a rule in the SMILES grammar, box 5. Letting K denote the total number of production
timesteps (production rules) allowed by the decoder. We will use these vectors in the rest of the decoder to select production rules.
To ensure that any sequence of production rules generated from the decoder is valid, we keep track of the state of the parsing using a last-in first-out (LIFO) stack. This is shown in Figure 2, box 3. At the beginning, every valid parse from the grammar must start with the start symbol: smiles, which is placed on the stack. Next we pop off whatever non-terminal symbol that was placed last on the stack (in this case smiles), and we use it to mask out the invalid dimensions of the logit vector. Formally, for every non-terminal $\alpha$ we define a fixed binary mask vector $m_\alpha \in [0, 1]^K$. This takes the value ‘1’ for all indices in $1, \ldots, K$ corresponding to production rules that have $\alpha$ on their left-hand-side.
In this case the only production rule in the grammar beginning with smiles is the first so we zero-out every dimension
M.J.Kusner, et al. ICML2017
http://proceedings.mlr.press/v70/kusner17a
https://github.com/mkusner/grammarVAE
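The encoding side of the GVAE (parse tree, then production rules in pre-order, then 1-hot vectors) can be illustrated on a much smaller case than benzene. The sketch below assumes a toy grammar with seven rules and a hand-written rule sequence for the string `CN`; it does not include a parser, and `derive` and `to_one_hot` are hypothetical helpers. `derive` replays the rules as a leftmost derivation to check that the sequence really spells the intended string.

```python
# Toy subset of a SMILES-like grammar (an assumption; terminals are quoted).
RULES = [
    ("smiles", ["chain"]),                  # 0
    ("chain", ["chain", "branched_atom"]),  # 1
    ("chain", ["branched_atom"]),           # 2
    ("branched_atom", ["atom"]),            # 3
    ("atom", ["aliphatic_organic"]),        # 4
    ("aliphatic_organic", ["'C'"]),         # 5
    ("aliphatic_organic", ["'N'"]),         # 6
]


def derive(rule_ids):
    """Apply each rule to the leftmost non-terminal and return the final string."""
    symbols = ["smiles"]
    for rule_id in rule_ids:
        lhs, rhs = RULES[rule_id]
        open_positions = [i for i, s in enumerate(symbols) if not s.startswith("'")]
        if not open_positions or symbols[open_positions[0]] != lhs:
            raise ValueError(f"rule {rule_id} does not apply here")
        pos = open_positions[0]
        symbols[pos:pos + 1] = rhs
    if any(not s.startswith("'") for s in symbols):
        raise ValueError("derivation is incomplete")
    return "".join(s.strip("'") for s in symbols)


def to_one_hot(rule_ids):
    """One row per rule application, one column per grammar rule."""
    return [[1 if k == rule_id else 0 for k in range(len(RULES))]
            for rule_id in rule_ids]


# Pre-order rule sequence for 'CN' under the toy grammar.
CN_RULES = [0, 1, 2, 3, 4, 5, 3, 4, 6]
print(derive(CN_RULES))
for row in to_one_hot(CN_RULES):
    print(row)
```

The resulting matrix of 1-hot rows is the kind of input an encoder network would map to the latent space.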
(variational autoencoder, VAE)
Grammar Variational Autoencoder
[Figure 2 diagram, boxes 1 to 7: map from latent space; convert to logits (up to max length); pop first non-terminal from the stack; mask out invalid rules; sample rule & push non-terminals onto stack; concatenate terminals ('c1ccccc1'); translate molecule]
Figure 2. The decoder of the GVAE. See text for details.
Algorithm 1 Sampling from the decoder
Input: Deterministic decoder output $F \in \mathbb{R}^{T_{max} \times K}$, masks $m_\alpha$ for each production rule $\alpha$
Output: Sampled productions X from p(X|z)
1: Initialize empty stack S, and push the start symbol S onto the top; set t = 0
2: while S is nonempty do
3:   Pop the last-pushed non-terminal $\alpha$ from the stack S
4:   Use Eq. (2) to sample a production rule R
5:   Let $x_t$ be the 1-hot vector corresponding to R
character-based VAE decoder is that at every point in the generated sequence, the character VAE can sample any possible character. There is no stack or masking operation. The grammar VAE however is constrained to select syntactically-valid sequences.
Syntactic vs. semantic validity. It is important to note that the grammar encodes syntactically valid molecules but not necessarily semantically valid molecules. This is mainly because of three reasons. First, certain molecules
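The stack-and-mask decoding loop of Algorithm 1 can be sketched as follows. Several things are assumptions for illustration: the seven-rule toy grammar, the use of a masked softmax as the sampling rule, and a small variation in which terminals are also pushed onto the stack so that the output string can be emitted directly. `sample_rules` is a hypothetical name, and the logits are hand-written rather than produced by a trained decoder.

```python
import math
import random

# Toy grammar (an assumption; terminals are quoted).
RULES = [
    ("smiles", ["chain"]),                  # 0
    ("chain", ["chain", "branched_atom"]),  # 1
    ("chain", ["branched_atom"]),           # 2
    ("branched_atom", ["atom"]),            # 3
    ("atom", ["aliphatic_organic"]),        # 4
    ("aliphatic_organic", ["'C'"]),         # 5
    ("aliphatic_organic", ["'N'"]),         # 6
]
# One binary mask per non-terminal: 1 where the rule has it on the left-hand side.
MASKS = {lhs: [int(rule[0] == lhs) for rule in RULES] for lhs, _ in RULES}


def sample_rules(logits, seed=0):
    """Sample a rule sequence from per-timestep logits using a LIFO stack."""
    rng = random.Random(seed)
    stack, rule_ids, text = ["smiles"], [], []
    t = 0
    while stack and t < len(logits):
        symbol = stack.pop()
        if symbol.startswith("'"):  # terminal: emit it, no timestep used
            text.append(symbol.strip("'"))
            continue
        mask = MASKS[symbol]
        top = max(lg for lg, m in zip(logits[t], mask) if m)
        # Masked softmax weights: invalid rules get exactly zero probability.
        weights = [math.exp(lg - top) if m else 0.0
                   for lg, m in zip(logits[t], mask)]
        rule_id = rng.choices(range(len(RULES)), weights=weights, k=1)[0]
        rule_ids.append(rule_id)
        # Push the right-hand side reversed so the leftmost symbol is popped first.
        stack.extend(reversed(RULES[rule_id][1]))
        t += 1
    while stack and stack[-1].startswith("'"):
        text.append(stack.pop().strip("'"))
    complete = not stack  # False if the timestep budget ran out
    return rule_ids, "".join(text), complete


# Logits that suppress rules 1 and 5, so only one derivation is possible.
forced = [[0.0, -1e9, 0.0, 0.0, 0.0, -1e9, 0.0] for _ in range(5)]
print(sample_rules(forced))
# Uniform logits: any grammatical string may come out, or the budget may run out.
print(sample_rules([[0.0] * len(RULES) for _ in range(30)], seed=1))
```

Whatever the logits are, every sampled rule has the popped non-terminal on its left-hand side, which is the syntactic guarantee the slide contrasts with a character-level decoder.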

SMILES
45
[Figure 1 diagram: a sequence of inputs (x1), (x2), (x3), (x4), ... (for example C, C, 1, ..., C) is fed to an RNN cell; at each step the outputs y(x1|w), y(x2|x<2, w), y(x3|x<3, w), y(x4|x<4, w), y(x5|x<5, w) cover the SMILES alphabet B, C, N, O, S, P, F, I, H, Cl, Br, 1, 2, 3, (, ), [, ]]
Figure 1: The recurren
imate the Q-function.
function activation is
acter in C. Here the
SMILES alphabet and
acters of the molecule
example. The initial
from the first hidden
continues until the en
during decoding, but its performance achieved by this method leaves scope fo
method requires hand-crafted grammatical rules for each application domain
In this paper, we propose a generative approach to modeling validity that
constraints of a given discrete space. We show how concepts from reinforce
used to define a suitable generative model and how this model can be approx
D.Janz, et al. ICLR2018
https://github.com/DavidJanz/molecule_grammar_rnn
LSTM

■ AlphaGO
■
47
and the first-degree neighbouring atoms. Only rules that occurred at
least 50 times in reactions published before 2015 were kept. For the
Prediction with the in-scope filter network
After the search space has been narrowed down by the expansion policy
[Figure 1 diagram. a, Chemical representation of the synthesis plan: target and precursor molecules numbered 1 to 9, with reagent Boc2O. b, Search tree representation: root (target) and positions A = {1}, B = {2,6}, C = {3,6}, D = {4,5,6}, E = {8,9}, F = {6,7,8}, with terminal solved states marked]
Figure 1 | Translation of the traditional chemists’ retrosynthetic route
representation to the search tree representation. a, The traditional
chemists’ retrosynthetic route representation (conditions omitted).
b, The search tree representation. The nodes in the tree represent the
synthetic position, and contain all precursors needed to make the
molecules of the preceding positions all the way down to the tree’s
root, which contains the target. Branches in the search tree correspond
to complete routes. Calculating the value of branches through task-
dependent scoring functions allows us to compare and rank different
routes. The target molecule can be solved if it can be deconstructed to a
set of readily available building blocks (marked red). Ph, phenyl; Boc,
tert-butyloxycarbonyl; TBS, tert-butyldimethylsilyl.
M.H.S.Segler, et al. Nature 555(7678), 604 (2018)
https://www.nature.com/articles/nature25978

48
[Figure 2 diagram. a, Synthesis planning with Monte Carlo tree search: (1) Selection: pick most promising position; (2) Expansion: retroanalyse, add new nodes to tree by expansion procedure (see b); (3) Rollout: pick and evaluate new position; (4) Update: incorporate evaluation in the search tree. b, Expansion procedure: target molecule A → invariant encoding (ECFP4) → expansion policy prioritizes transformations T1 ... Tn → keep the k best transformations and apply them to the target → reactions R1 ... Rk → for each reaction use in-scope filter → keep likely reactions → ranked precursor molecule positions B, C. Each step is labelled symbolic or neural.]
Figure 2 | Schematic of MCTS methodology. a, MCTS searches by
iterating over four phases. In the selection phase (1), the most urgent
node for analysis is chosen on the basis of the current position values.
In phase (2) this node may be expanded by processing the molecules of
the position A with the expansion procedure (b), which leads to new
positions B and C, which are added to the tree. Then, the most promising
new position is chosen, and a rollout phase (3) is performed by randomly
sampling transformations from the rollout policy until all molecules
are solved or a certain depth is exceeded. In the update phase (4), the
position values are updated in the current branch to reflect the result of the
rollout. b, Expansion procedure. First, the molecule (A) to retroanalyse is
converted to a fingerprint and fed into the policy network, which returns a
probability distribution over all possible transformations (T1 to Tn). Then,
only the k most probable transformations are applied to molecule A. This
yields the reactants necessary to make A, and thus complete reactions R1
to Rk. For each reaction, the reaction prediction is performed using the
in-scope filter, returning a probablity score. Improbable reactions are then
filtered out, which leads to the list of admissible actions and corresponding
precursor positions B and C.
M.H.S.Segler, et al. Nature 555(7678), 604 (2018)
https://www.nature.com/articles/nature25978
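The expansion procedure in panel b of the caption (rank transformations with the policy, keep the k most probable, apply them, then drop reactions the in-scope filter considers improbable) can be sketched as a small function. Everything concrete here is a stand-in: the policy, the rule application, and the in-scope filter are plain dictionaries with made-up values instead of neural networks and reaction templates, and `expand`, `k`, and `min_score` are hypothetical names.

```python
def expand(target, policy, apply_rule, in_scope, k=2, min_score=0.5):
    """Return admissible (rule, precursors, prior) triples for one target."""
    # 1. The policy gives a probability for every transformation; keep the k best.
    ranked = sorted(policy(target).items(), key=lambda kv: kv[1], reverse=True)[:k]
    positions = []
    for rule, prior in ranked:
        # 2. Apply the transformation to obtain the precursors (None if it fails).
        precursors = apply_rule(target, rule)
        if precursors is None:
            continue
        # 3. Keep the reaction only if the in-scope filter finds it likely.
        if in_scope(target, precursors) >= min_score:
            positions.append((rule, precursors, prior))
    return positions


# Toy stand-ins for the learned components (all values are invented).
POLICY = {"T1": 0.60, "T2": 0.30, "T3": 0.08, "T4": 0.02}
PRECURSORS = {"T1": ("B1", "B2"), "T2": ("C1",), "T3": ("D1",)}
SCOPE = {("B1", "B2"): 0.9, ("C1",): 0.2, ("D1",): 0.95}

result = expand(
    "A",
    policy=lambda mol: POLICY,
    apply_rule=lambda mol, rule: PRECURSORS.get(rule),
    in_scope=lambda mol, pre: SCOPE[pre],
    k=2,
)
print(result)
```

In a full search, each returned position would become a child node in the tree, with the policy probability available as a prior during the selection phase.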

Sequence-to-Sequence
49
Seq2seq Model. Neural sequence-to-sequence (seq2seq) models map one sequence to another and have recently shown state of the art performance in many tasks such as machine translation. It is based on an encoder-decoder architecture that consists of two recurrent neural networks
sequence log probability at each time step during decoding are retained, where N is the width of the beam. The decoding is stopped once the lengths of the candidate sequences reach the maximum decode length of 140 characters. The candidate sequences that contain an end of sequence character are considered to be complete. On average, about 97% of all
Figure 3. Seq2seq model architecture.
SMILES
SMILES(SMART)
B.Liu, et al. ACS. Cent. Sci. 3(10), 1103(2017)
https://pubs.acs.org/doi/full/10.1021/acscentsci.7b00303
https://github.com/pandegroup/reaction_prediction_seq2seq
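The decoding strategy quoted on this slide (retain the N candidates with the highest sequence log probability at each step, stop at a maximum decode length, and treat candidates containing an end-of-sequence token as complete) is beam search. The sketch below is a generic version under stated assumptions: `toy_model` is an invented next-token distribution standing in for the trained decoder, finished candidates are moved out of the beam as soon as they end, and `beam_search` is a hypothetical function name.

```python
import math

EOS = "<eos>"


def beam_search(next_log_probs, beam_width=2, max_len=140):
    """Return completed sequences as (tokens, log probability), best first."""
    beams = [((), 0.0)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Keep only the beam_width candidates with the highest log probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == EOS:
                complete.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return sorted(complete, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Invented next-token log probabilities that depend only on prefix length."""
    if len(prefix) == 0:
        return {"C": math.log(0.6), "O": math.log(0.4)}
    if len(prefix) == 1:
        return {"C": math.log(0.5), EOS: math.log(0.5)}
    return {EOS: math.log(0.9), "C": math.log(0.1)}


results = beam_search(toy_model, beam_width=2, max_len=5)
for tokens, score in results:
    print("".join(t for t in tokens if t != EOS), round(math.exp(score), 4))
```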

■Coley et al. (2017)
50
tension of the one-step strategy to multistep pathway planning is
.
characters (i.e., a product SMILES26
string without atom
C.W. Coley et al. ACS. Cent. Sci. 3(12), 1237 (2017)
https://pubs.acs.org/doi/full/10.1021/acscentsci.7b00355
https://github.com/connorcoley/retrosim

■Coley et al.
51
 
2018 9
http://www.molsci.jp/2018/pdf/4E13_w.pdf
Coming soon…
2018 10
(IBIS) 2018 11

1.
4.
5.
52

■Materials Informatics  
■
■Materials Informatics Python
53

■Materials Informatics
■  
https://www.jstage.jst.go.jp/article/ciqs/2017/0/2017_PL/_pdf/-char/ja
■  
https://www.jstage.jst.go.jp/article/cicsj/36/1/36_9/_pdf/-char/ja
■  
https://www.ssken.gr.jp/MAINSITE/event/2017/20171026-sci/lecture-01/SSKEN_sci2017_YoshidaRyo_presentation.pdf
54

■RDKit
■ https://www.rdkit.org/
■RDKit  
https://future-chem.com/rdkit-intro/
■  
https://github.com/chemo-wakate
■RDKit http://rdkit-users.jp/
55
