
Materials Informatics and Python

PyData 1Day Conference presentation

Published in: Data & Analytics

  1. Materials Informatics and Python
 PyData 1Day Conference, October 20, 2018
  2. twitter: @sfchaos @shifukushima
  3. [Timeline: 2000, 2006, 2009, 2014, 2018]
  4. ■ Materials Informatics ■ Materials Informatics Python
  5. 1. 2. Materials Informatics 3. Python Materials Informatics 4. 5.
  6. Materials Informatics ■ http://cms.mtl.kyoto-u.ac.jp/informatics.html
  7. Materials Informatics
  8. Materials Informatics
 ■ 2011: Materials Genome Initiative
 https://www.mgi.gov/
 ■ 2015: NIMS
 http://www.nims.go.jp/MII-I/
 NIMS: National Institute for Materials Science
  9. ■ (Li 2 ) ■ TDK ■ NEC ■ etc.
 https://tech.nikkeibp.co.jp/atcl/nxt/mag/ne/18/00030/00001/
  10. ■ NIPS
 NIPS2018 workshop: Machine Learning for Molecules and Materials
 http://www.quantum-machine.org/workshops/nips2018draft/
 NIPS2017 workshop: Machine Learning for Molecules and Materials
 http://www.quantum-machine.org/workshops/nips2017/
  12. SMILES ■ D. Weininger, J. Chem. Inf. Comput. Sci. 28, 31 (1988)
 Benzene (C6H6): c1ccccc1
 Benzoic acid (C6H5COOH): OC(=O)c1ccccc1
  13. SMILES ■ D. Weininger, J. Chem. Inf. Comput. Sci. 28, 31 (1988) https://ja.wikipedia.org/wiki/SMILES%E8%A8%98%E6%B3%95
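To make the two SMILES strings above concrete, here is a minimal sketch in plain Python that splits a SMILES string into tokens and counts the heavy atoms. The token pattern is a simplified assumption for illustration (it covers the organic subset, brackets, bonds, branches and ring digits) and is not a full SMILES parser; the function names are hypothetical.

```python
import re
from collections import Counter

# Simplified, hypothetical token pattern: bracket atoms, two-letter halogens,
# one-letter atoms (upper case = aliphatic, lower case = aromatic),
# then bond symbols, branch parentheses and ring-closure digits.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[=#\-+/\\()]|%\d{2}|\d")


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is left over."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def heavy_atom_counts(smiles):
    """Count non-bracket heavy atoms by element (hydrogens are implicit in SMILES)."""
    counts = Counter()
    for tok in tokenize(smiles):
        if tok[0].isalpha():  # atom token; bracket atoms are skipped in this sketch
            counts[tok.capitalize()] += 1
    return counts


benzene = heavy_atom_counts("c1ccccc1")
benzoic_acid = heavy_atom_counts("OC(=O)c1ccccc1")
print(benzene)       # Counter({'C': 6})
print(benzoic_acid)  # Counter({'C': 7, 'O': 2})
```

The counts agree with the formulas C6H6 and C6H5COOH once the implicit hydrogens are added back.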
  14. Y = f(X) Y X S
  15. Fingerprint ■
 • D.Rogers and M.Hahn, J.Chem.Inf.Model., 50(5), 742 (2010)
 https://pubs.acs.org/doi/10.1021/ci100050t
 • 94
 https://art.ist.hokudai.ac.jp/~takigawa/data/fpai94_takigawa.pdf
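A fingerprint turns a molecule into a fixed-length bit vector so that molecules can be compared or fed to a model. The sketch below is a toy stand-in and not the extended-connectivity fingerprint of the paper cited above: it hashes substrings of the SMILES text rather than atom neighbourhoods of the molecular graph. It only illustrates the bit-vector idea and the Tanimoto similarity that is commonly computed on such vectors. All names and parameters are assumptions.

```python
import zlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of length 1..max_len into a set of 'on' bit positions.

    Toy illustration only: real circular fingerprints hash graph neighbourhoods.
    """
    bits = set()
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike the built-in hash() for str
            bits.add(zlib.crc32(fragment.encode("ascii")) % n_bits)
    return bits


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two bit sets: |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


fp_benzene = toy_fingerprint("c1ccccc1")
fp_benzoic = toy_fingerprint("OC(=O)c1ccccc1")
print(tanimoto(fp_benzene, fp_benzene))  # 1.0
print(tanimoto(fp_benzene, fp_benzoic))
```

Representing the fingerprint as a set of bit positions keeps the similarity computation to two set operations.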
  17. RDKit ■ C++ ■ Python https://www.rdkit.org/
  18. RDKit ■ Anaconda
 $ conda install -c rdkit rdkit
 https://www.rdkit.org/docs/Install.html
  19. RDKit ■
 Getting Started with the RDKit in Python
 https://www.rdkit.org/docs/GettingStartedInPython.html
 ■ RDKit
 https://future-chem.com/rdkit-intro/
 ■
 https://github.com/chemo-wakate
  20. SMILES Mol
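The slide's own code was an image and did not survive in this transcript. As a hedged sketch of the SMILES-to-Mol step it names, the snippet below uses RDKit's `Chem.MolFromSmiles`. The helper name `describe` is hypothetical, and the import is guarded only so that the snippet can be pasted into an environment without RDKit, where the helper simply returns None.

```python
try:
    from rdkit import Chem
except ImportError:  # RDKit is not installed; describe() then returns None
    Chem = None


def describe(smiles):
    """Parse a SMILES string and return (canonical SMILES, heavy atoms, bonds)."""
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)  # returns None when the SMILES cannot be parsed
    if mol is None:
        return None
    return Chem.MolToSmiles(mol), mol.GetNumAtoms(), mol.GetNumBonds()


for smiles in ["c1ccccc1", "OC(=O)c1ccccc1"]:
    print(smiles, describe(smiles))
```

Checking the return value of `MolFromSmiles` for None is worth keeping in real pipelines, since invalid input does not raise an exception.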
  24. ■ https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
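The descriptor list linked above can be explored with a few calls from `rdkit.Chem.Descriptors`. The following is a minimal sketch, not the code from the slides. The selection of descriptors and the helper name `basic_descriptors` are assumptions, and the import is guarded so the snippet degrades to returning None when RDKit is absent.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import Descriptors
except ImportError:  # RDKit is not installed; basic_descriptors() then returns None
    Chem = Descriptors = None


def basic_descriptors(smiles):
    """Return a small dictionary of descriptors for one molecule."""
    if Chem is None:
        return None
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return None
    return {
        "MolWt": Descriptors.MolWt(mol),            # average molecular weight
        "MolLogP": Descriptors.MolLogP(mol),        # Crippen logP estimate
        "TPSA": Descriptors.TPSA(mol),              # topological polar surface area
        "NumHDonors": Descriptors.NumHDonors(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }


print(basic_descriptors("OC(=O)c1ccccc1"))
```

A dictionary per molecule converts directly into one row of a feature table for a regression model.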
  26. ■ 1. 2. 3. 4. 5.
 Molecular neural network models with RDKit and Keras in Python
 http://www.wildcardconsulting.dk/useful-information/molecular-neural-network-models-with-rdkit-and-keras-in-python/
 Keras
 http://www.ag.kagawa-u.ac.jp/charlesy/2017/07/21/keras%E3%81%A7%E5%8C%96%E5%90%88%E7%89%A9%E3%81%AE%E6%BA%B6%E8%A7%A3%E5%BA%A6%E4%BA%88%E6%B8%AC%EF%BC%88%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC/
  27. 1.
 RDKit GitHub https://github.com/rdkit/rdkit/blob/master/Docs/Book/data
 solubility.train.sdf (training data)
 solubility.test.sdf (test data)
 ※ SDF (Structure-Data File):
 https://www.chem-station.com/blog/2012/04/sdf.html
  28. 1.
 SDF (solubility.train.sdf)
 n-pentane
      RDKit          2D

   5  4  0  0  0  0  0  0  0  0999 V2000
     0.2606    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     1.3000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     2.6000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     3.9000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
     4.9394    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   1  2  1  0
   2  3  1  0
   3  4  1  0
   4  5  1  0
 M  END
 >  <ID>  (1)
 1

 >  <NAME>  (1)
 n-pentane
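To show how the n-pentane record is laid out, here is a minimal sketch that reads it with plain Python: a three-line header, a counts line, an atom block, a bond block, and named data fields. It handles only what this excerpt needs and is an illustration of the V2000 layout, not a general SDF reader. In practice RDKit's `Chem.SDMolSupplier` would be used, and `parse_record` is a hypothetical name.

```python
# The part of the n-pentane record shown on the slide (V2000 molfile plus data fields).
RECORD = """n-pentane
     RDKit          2D

  5  4  0  0  0  0  0  0  0  0999 V2000
    0.2606    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.3000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.9000    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.9394    0.1503    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
  3  4  1  0
  4  5  1  0
M  END
>  <ID>  (1)
1

>  <NAME>  (1)
n-pentane
"""


def parse_record(text):
    """Return (title, atoms, bonds, data fields) for one V2000 record."""
    lines = text.splitlines()
    title = lines[0].strip()
    # Counts line: atom and bond counts sit in fixed 3-character columns.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        begin, end, order = (int(v) for v in line.split()[:3])
        bonds.append((begin, end, order))
    fields = {}
    for i, line in enumerate(lines):
        if line.startswith(">"):  # data header such as ">  <ID>  (1)"
            key = line[line.index("<") + 1:line.index(">", 1)]
            fields[key] = lines[i + 1].strip()
    return title, atoms, bonds, fields


title, atoms, bonds, fields = parse_record(RECORD)
print(title, len(atoms), bonds, fields)
```

The five carbon atoms and four single bonds recovered here correspond to the chain drawn on the next slide.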
  29. 2. CH2CH2CH2CH2CH2
  35. ■ arXiv ■ 2
  36. ■
 https://www.jstage.jst.go.jp/article/cicsj/36/1/36_9/_pdf/-char/ja
 ■ Deep Learning
 https://kivantium.net/deep-for-chem
  37. ■ ORGAN / ORGANIC ■ MolGAN ■ ChemTS
  38. GAN ■ https://medium.com/@devnag/generative-adversarial-networks-gans-in-50-lines-of-code-pytorch-e81b79659e3f
  39. SeqGAN
 L.Yu, et al., AAAI2017.
 https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/download/14344/14489
 https://github.com/LantaoYu/SeqGAN
 State: the tokens generated up to step t-1. Action: the token chosen at step t.
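In this state/action framing, the value of choosing a token can be estimated by completing the partial sequence many times and averaging the discriminator's score over the completions. The sketch below illustrates that Monte Carlo estimate with toy stand-ins: the uniform generator, the `toy_discriminator` scoring rule, the vocabulary and the sequence length are all assumptions made for illustration and are not taken from the SeqGAN code.

```python
import random

VOCAB = ["C", "N", "O"]
MAX_LEN = 6


def rollout(prefix, rng):
    """Complete a partial sequence by sampling tokens uniformly (toy generator policy)."""
    seq = list(prefix)
    while len(seq) < MAX_LEN:
        seq.append(rng.choice(VOCAB))
    return seq


def toy_discriminator(seq):
    """Hypothetical stand-in for D: the fraction of carbon tokens, a score in [0, 1]."""
    return seq.count("C") / len(seq)


def action_value(state, action, n_rollouts=200, seed=0):
    """Monte Carlo estimate of Q(state, action): mean D score over completed sequences."""
    rng = random.Random(seed)
    prefix = list(state) + [action]
    total = sum(toy_discriminator(rollout(prefix, rng)) for _ in range(n_rollouts))
    return total / n_rollouts


state = ["C", "C"]        # tokens generated up to step t-1
for action in VOCAB:      # candidate token for step t
    print(action, round(action_value(state, action), 3))
```

Because the toy discriminator rewards carbon, the estimate for the action "C" comes out higher than for "N" or "O".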
  40. ORGAN ■ SeqGAN
 Finally, in SeqGAN the reward function is provided by D.
 Figure 1: Schema for ORGAN. Left: D is trained as a classifier receiving as input a mix of real data and generated data by G. Right: G is trained by RL where the reward is a combination of D and the objectives, and is passed back to the policy function via Monte Carlo sampling. We penalize non-unique sequences.
 Figure 1 illustrates the main idea of ORGAN. To take into account domain-specific desired objectives Oi, we extend the SeqGAN ...
 SMILES
 G.Guimaraes et al. (2017) https://arxiv.org/abs/1705.10843 https://github.com/gablg1/ORGAN
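The caption says the generator's reward combines the discriminator D with the domain objectives and that repeated sequences are penalized. One way to sketch this is shown below. The weighted-sum form with a mixing weight `lam` and the division by the number of repeats are assumptions for illustration, as are the function names; the source only states that a combination and a penalty exist.

```python
from collections import Counter


def combined_reward(d_score, objective_score, lam=0.5):
    """Weighted mix of discriminator score and objective score, both assumed in [0, 1]."""
    return lam * d_score + (1.0 - lam) * objective_score


def penalized_rewards(sequences, d_scores, objective_scores, lam=0.5):
    """Divide each reward by how often its sequence occurs in the batch (assumed penalty)."""
    counts = Counter(sequences)
    return [
        combined_reward(d, o, lam) / counts[s]
        for s, d, o in zip(sequences, d_scores, objective_scores)
    ]


batch = ["CCO", "CCO", "c1ccccc1"]
rewards = penalized_rewards(batch, d_scores=[0.8, 0.8, 0.6], objective_scores=[0.4, 0.4, 1.0])
print(rewards)
```

With `lam = 1` the reward reduces to the discriminator alone, and with `lam = 0` only the objective remains.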
  41. ORGAN
 Objective         Algorithm  Validity (%)  Diversity  Druglikeliness  Synthesizability  Solubility
                   MLE        75.9          0.64       0.48 (0%)       0.23 (0%)         0.30 (0%)
                   SeqGAN     80.3          0.61       0.49 (2%)       0.25 (6%)         0.31 (3%)
 Druglikeliness    ORGAN      88.2          0.55       0.52 (8%)       0.32 (38%)        0.35 (18%)
                   OR(W)GAN   85.0          0.95       0.60 (25%)      0.54 (130%)       0.47 (57%)
                   Naive RL   97.1          0.8        0.57 (19%)      0.53 (126%)       0.50 (67%)
 Synthesizability  ORGAN      96.5          0.92       0.51 (6%)       0.83 (255%)       0.45 (52%)
                   OR(W)GAN   97.6          1.00       0.20 (-59%)     0.75 (223%)       0.84 (184%)
                   Naive RL   97.7          0.96       0.52 (8%)       0.83 (256%)       0.46 (54%)
 Solubility        ORGAN      94.7          0.76       0.50 (4%)       0.63 (171%)       0.55 (85%)
                   OR(W)GAN   94.1          0.90       0.42 (-12%)     0.66 (185%)       0.54 (81%)
                   Naive RL   92.7          0.75       0.49 (3%)       0.70 (200%)       0.78 (162%)
 All/Alternated    ORGAN      96.1          92.3       0.52 (9%)       0.71 (206%)       0.53 (79%)
 Table 1: Evaluation of metrics, on several generative algorithms and optimized for different objectives for molecules. Reported values are mean values of valid generated molecules. The percentage of improvement over the MLE baseline is reported in parenthesis. Values shown in bold indicate significant improvement. Shaded cell indicates direct optimized objectives.
 Table 2 shows quantitative results comparing ORGAN to other baseline methods optimizing for three different metrics. ORGAN outperforms SeqGAN and MLE in all of the three metrics. Naive RL achieves a higher score than ORGAN for the Ratio of Steps metric, but it under-performs in terms of ...
 Druglikeliness, Synthesizability, Solubility
  42. ORGANIC ■ ORGAN
 Figure 1: Usage of ORGANIC illustrated. In the training procedure we show the three fundamental components: a generator, a discriminator, and a reinforcement metric. Arrows indicate the flow of inputs and outputs between networks.
 B. Sanchez-Lengeling, et al. (2017) https://chemrxiv.org/articles/ORGANIC_1_pdf/5309668 https://github.com/aspuru-guzik-group/ORGANIC
  43. MolGAN ■ SMILES
 MolGAN: An implicit generative model for small molecular graphs
 [Figure: molecular graph, generator, discriminator, reward network; z ~ p(z), x ~ pdata(x)]
 N. De Cao and T. Kipf (2018) https://arxiv.org/abs/1805.11973
  44. MolGAN ■
 MolGAN: An implicit generative model for small molecular graphs
 Figure 2. Outline of MolGAN. From left: the generator takes a sample from a prior distribution and generates a dense adjacency tensor A and an annotation matrix X. Subsequently, sparse and discrete Ã and X̃ are obtained from A and X respectively via categorical sampling. The combination of Ã and X̃ represents an annotated molecular graph which corresponds to a specific chemical compound. Finally, the graph is processed by both the discriminator and reward networks that are invariant to node order permutations and based on Relational-GCN (Schlichtkrull et al., 2017) layers.
 N. De Cao and T. Kipf (2018) https://arxiv.org/abs/1805.11973
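The categorical sampling step in the caption can be sketched with the standard library alone: each row of the annotation matrix X is treated as a distribution over atom types, and each entry of the adjacency tensor A as a distribution over bond types. The atom and bond vocabularies, the choice to sample only the upper triangle and mirror it, and the function names are assumptions for illustration and do not reproduce the MolGAN implementation.

```python
import random

ATOM_TYPES = ["C", "N", "O"]
BOND_TYPES = [None, 1, 2]  # None means "no bond"


def sample_index(probs, rng):
    """Draw one index from a categorical distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_graph(X, A, seed=0):
    """X[i]: distribution over atom types of node i.
    A[i][j]: distribution over bond types for the node pair (i, j)."""
    rng = random.Random(seed)
    n = len(X)
    atoms = [ATOM_TYPES[sample_index(X[i], rng)] for i in range(n)]
    bonds = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bond = BOND_TYPES[sample_index(A[i][j], rng)]
            bonds[i][j] = bonds[j][i] = bond  # mirror so the graph stays undirected
    return atoms, bonds


X = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
A = [[[0.5, 0.4, 0.1] for _ in range(3)] for _ in range(3)]
atoms, bonds = sample_graph(X, A)
print(atoms)
print(bonds)
```

The result is a discrete node list and a symmetric bond matrix, which together describe one candidate molecule.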
  45. SMILES ■ SMILES
 Grammar Variational Autoencoder
 [Figure 1 panels: (1) SMILES grammar, (2) input SMILES 'c1ccccc1', (3) form parse tree, (4) extract rules, (5) convert to 1-hot vectors, (6) map to latent space]
 Figure 1. The encoder of the GVAE. We denote the start rule in blue and all rules that decode to terminal in green. See text for details.
 We describe how the GVAE works using a simple example.
 Encoding. Consider a subset of the SMILES grammar as shown in Figure 1, box 1. These are the possible production rules that can be used for constructing a molecule. Imagine we are given as input the SMILES string for benzene: 'c1ccccc1'. Figure 1, box 2 shows this molecule. To encode this molecule into a continuous latent representation we begin by using the SMILES grammar to parse this string into a parse tree (partially shown in box 3). This tree describes how 'c1ccccc1' is generated by the grammar. We decompose this tree into a sequence of production rules by performing a pre-order traversal on the branches of the parse tree going from left-to-right, shown in box 4. We convert these rules into 1-hot indicator vectors, where each dimension corresponds to a rule in the SMILES grammar, box 5. Letting K denote the total number of production ...
 ... timesteps (production rules) allowed by the decoder. We will use these vectors in the rest of the decoder to select production rules.
 To ensure that any sequence of production rules generated from the decoder is valid, we keep track of the state of the parsing using a last-in first-out (LIFO) stack. This is shown in Figure 2, box 3. At the beginning, every valid parse from the grammar must start with the start symbol: smiles, which is placed on the stack. Next we pop off whatever non-terminal symbol that was placed last on the stack (in this case smiles), and we use it to mask out the invalid dimensions of the logit vector. Formally, for every non-terminal α we define a fixed binary mask vector m_α ∈ [0, 1]^K. This takes the value '1' for all indices in 1, ..., K corresponding to production rules that have α on their left-hand-side. In this case the only production rule in the grammar beginning with smiles is the first so we zero-out every dimension ...
 M.J.Kusner, et al. ICML2017 http://proceedings.mlr.press/v70/kusner17a https://github.com/mkusner/grammarVAE
 (variational autoencoder, VAE)
 Grammar Variational Autoencoder
 [Figure 2 panels: (1) map from latent space, (2) convert to logits, (3) stack, (4) mask out invalid rules, pop first non-terminal, (5) sample rule & push non-terminals onto stack, (6) concatenate terminals 'c1ccccc1', (7) translate molecule]
 Figure 2. The decoder of the GVAE. See text for details.
 Algorithm 1 Sampling from the decoder
 Input: Deterministic decoder output F ∈ R^(Tmax×K), masks m_α for each production rule α
 Output: Sampled productions X from p(X|z)
 1: Initialize empty stack S, and push the start symbol S onto the top; set t = 0
 2: while S is nonempty do
 3: Pop the last-pushed non-terminal α from the stack S
 4: Use Eq. (2) to sample a production rule R
 5: Let xt be the 1-hot vector corresponding to R
 ... character-based VAE decoder is that at every point in the generated sequence, the character VAE can sample any possible character. There is no stack or masking operation. The grammar VAE however is constrained to select syntactically-valid sequences.
 Syntactic vs. semantic validity. It is important to note that the grammar encodes syntactically valid molecules but not necessarily semantically valid molecules. This is mainly because of three reasons. First, certain molecules ...
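The stack-and-mask decoding described above can be sketched on a toy grammar. The three-symbol grammar, the hand-written logits, and the masked softmax standing in for the paper's Eq. (2) (which is not reproduced in this transcript) are all assumptions. The code only illustrates how popping a non-terminal and masking the logits restricts sampling to rules with that non-terminal on the left-hand side.

```python
import math
import random

# Hypothetical toy grammar: (left-hand side, right-hand side).
RULES = [
    ("smiles", ["chain"]),
    ("chain", ["atom", "chain"]),
    ("chain", ["atom"]),
    ("atom", ["C"]),
    ("atom", ["N"]),
    ("atom", ["O"]),
]
NONTERMINALS = {"smiles", "chain", "atom"}
# One fixed binary mask per non-terminal: 1 where the rule's left-hand side matches.
MASKS = {nt: [1 if lhs == nt else 0 for lhs, _ in RULES] for nt in NONTERMINALS}


def decode(logits, seed=0):
    """Sample one production per timestep; return (string, whether the parse completed)."""
    rng = random.Random(seed)
    stack = ["smiles"]  # every parse starts from the start symbol
    output = []
    for row in logits:  # one row of K logits per timestep
        if not stack:
            break
        symbol = stack.pop()  # last-pushed non-terminal
        top = max(row)
        # Masked softmax weights: invalid rules get weight 0.
        weights = [m * math.exp(l - top) for m, l in zip(MASKS[symbol], row)]
        rule = rng.choices(range(len(RULES)), weights=weights, k=1)[0]
        # Push right-to-left so the leftmost non-terminal is expanded first.
        for sym in reversed(RULES[rule][1]):
            if sym in NONTERMINALS:
                stack.append(sym)
            else:
                output.append(sym)  # in this grammar a terminal is always a rule's only symbol
    return "".join(output), not stack


def favor(rule, strength=50.0):
    """Logit row that strongly prefers one rule (illustrative decoder output)."""
    return [strength if k == rule else 0.0 for k in range(len(RULES))]


logits = [favor(0), favor(1), favor(3), favor(2), favor(5)]
result = decode(logits)
print(result)
```

Without the mask, the third row could have selected a chain rule while an atom was on top of the stack, which is the kind of invalid sequence that a character-level decoder can produce.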
  46. SMILES
 [Figure 1: RNN cells read the sequence inputs x1, x2, x3, x4 and output, at each step, y(x1|w), y(x2|x<2, w), ..., y(x5|x<5, w) over the SMILES alphabet B C N O S P F I H Cl Br 1 2 3 ( ) [ ]; the example input begins C C 1]
 In this paper, we propose a generative approach to modeling validity that ... constraints of a given discrete space.
 D.Janz, et al. ICLR2018 https://arxiv.org/abs/1712.01664 https://github.com/DavidJanz/molecule_grammar_rnn
 LSTM
  48. ■ AlphaGo
 Only rules that occurred at least 50 times in reactions published before 2015 were kept.
 [Figure 1 panels: (a) Chemical representation of the synthesis plan, (b) Search tree representation]
 Figure 1 | Translation of the traditional chemists' retrosynthetic route representation to the search tree representation. a, The traditional chemists' retrosynthetic route representation (conditions omitted). b, The search tree representation. The nodes in the tree represent the synthetic position, and contain all precursors needed to make the molecules of the preceding positions all the way down to the tree's root, which contains the target. Branches in the search tree correspond to complete routes. Calculating the value of branches through task-dependent scoring functions allows us to compare and rank different routes. The target molecule can be solved if it can be deconstructed to a set of readily available building blocks (marked red). Ph, phenyl; Boc, tert-butyloxycarbonyl; TBS, tert-butyldimethylsilyl.
 M.H.S.Segler, et al. Nature 555(7678), 604 (2018) https://www.nature.com/articles/nature25978
  49. [Figure 2 panels: (a) Synthesis planning with Monte Carlo tree search: (1) Selection, pick most promising position; (2) Expansion, retroanalyse and add new nodes to tree by expansion procedure (see b); (3) Rollout, pick and evaluate new position; (4) Update, incorporate evaluation in the search tree. (b) Expansion procedure: target molecule A, ECFP4 invariant encoding; expansion policy prioritizes transformations T1 ... Tn; keep the k best transformations and apply them to the target; for each reaction R1 ... Rk use in-scope filter; keep likely reactions; ranked precursor molecule positions B, C]
 Figure 2 | Schematic of MCTS methodology. a, MCTS searches by iterating over four phases. In the selection phase (1), the most urgent node for analysis is chosen on the basis of the current position values. In phase (2) this node may be expanded by processing the molecules of the position A with the expansion procedure (b), which leads to new positions B and C, which are added to the tree. Then, the most promising new position is chosen, and a rollout phase (3) is performed by randomly sampling transformations from the rollout policy until all molecules are solved or a certain depth is exceeded. In the update phase (4), the position values are updated in the current branch to reflect the result of the rollout. b, Expansion procedure. First, the molecule (A) to retroanalyse is converted to a fingerprint and fed into the policy network, which returns a probability distribution over all possible transformations (T1 to Tn). Then, only the k most probable transformations are applied to molecule A. This yields the reactants necessary to make A, and thus complete reactions R1 to Rk. For each reaction, the reaction prediction is performed using the in-scope filter, returning a probability score. Improbable reactions are then filtered out, which leads to the list of admissible actions and corresponding precursor positions B and C.
 M.H.S.Segler, et al. Nature 555(7678), 604 (2018) https://www.nature.com/articles/nature25978
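The expansion procedure in panel b can be sketched as a small function: rank transformations by the policy's probability, keep the k best, apply them, and discard results that the in-scope filter scores as improbable. Everything concrete below is a toy assumption. Molecules are plain strings, the "transformations" cut a string in two, and `toy_policy`, `toy_apply` and `toy_in_scope` stand in for the neural networks and reaction rules of the paper.

```python
def expand(molecule, policy, apply_transformation, in_scope, k=2, threshold=0.5):
    """Return admissible (transformation, precursors, prior) actions, best-ranked first."""
    ranked = sorted(policy(molecule).items(), key=lambda item: item[1], reverse=True)
    actions = []
    for name, prior in ranked[:k]:  # keep only the k most probable transformations
        precursors = apply_transformation(name, molecule)
        if precursors is None:  # transformation does not apply to this molecule
            continue
        if in_scope(molecule, precursors) >= threshold:  # filter out improbable reactions
            actions.append((name, precursors, prior))
    return actions


# Toy stand-ins (hypothetical): a "molecule" is a string, a transformation cuts it in two.
def toy_policy(molecule):
    return {"cut_1": 0.1, "cut_2": 0.6, "cut_3": 0.3}


def toy_apply(name, molecule):
    position = int(name.split("_")[1])
    if position >= len(molecule):
        return None
    return molecule[:position], molecule[position:]


def toy_in_scope(molecule, precursors):
    # Pretend that reactions producing a one-character fragment are unlikely.
    return 0.9 if all(len(p) >= 2 for p in precursors) else 0.2


actions = expand("ABCD", toy_policy, toy_apply, toy_in_scope, k=2)
print(actions)  # [('cut_2', ('AB', 'CD'), 0.6)]
```

Passing the three components in as functions mirrors the figure, where the policy and the in-scope filter are separate models.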
  50. Sequence-to-Sequence
 Seq2seq Model. Neural sequence-to-sequence (seq2seq) models map one sequence to another and have recently shown state of the art performance in many tasks such as machine translation. It is based on an encoder-decoder architecture that consists of two recurrent neural networks ...
 ... sequence log probability at each time step during decoding are retained, where N is the width of the beam. The decoding is stopped once the lengths of the candidate sequences reach the maximum decode length of 140 characters. The candidate sequences that contain an end of sequence character are considered to be complete. On average, about 97% of all ...
 Figure 3. Seq2seq model architecture.
 SMILES SMILES(SMART)
 B.Liu, et al. ACS Cent. Sci. 3(10), 1103 (2017) https://pubs.acs.org/doi/full/10.1021/acscentsci.7b00303 https://github.com/pandegroup/reaction_prediction_seq2seq
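The decoding rule in the excerpt (retain the N best candidates by summed log probability, treat candidates with an end-of-sequence character as complete, stop at a maximum length) is a beam search. Below is a minimal sketch with a hand-written next-token distribution in place of the trained decoder. The `toy_model` function, the `$` end-of-sequence symbol and the beam width used in the example are assumptions.

```python
import math

EOS = "$"                 # hypothetical end-of-sequence character
MAX_DECODE_LENGTH = 140   # maximum decode length quoted in the excerpt


def beam_search(next_log_probs, beam_width=3, max_len=MAX_DECODE_LENGTH):
    """Keep the beam_width best partial sequences by summed log probability."""
    beams = [("", 0.0)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + token, score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that end with EOS are complete and leave the beam.
            (complete if seq.endswith(EOS) else beams).append((seq, score))
        if not beams:
            break
    return sorted(complete, key=lambda c: c[1], reverse=True)


def toy_model(prefix):
    """Toy next-token log probabilities: prefers 'C', then ends after two tokens."""
    if len(prefix) < 2:
        probs = {"C": 0.6, "O": 0.3, EOS: 0.1}
    else:
        probs = {"C": 0.1, "O": 0.1, EOS: 0.8}
    return {token: math.log(p) for token, p in probs.items()}


results = beam_search(toy_model, beam_width=2)
for seq, score in results:
    print(seq, round(math.exp(score), 3))
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.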
  51. ■ Coley et al. (2017)
 C.W. Coley et al. ACS Cent. Sci. 3(12), 1237 (2017) https://pubs.acs.org/doi/full/10.1021/acscentsci.7b00355 https://github.com/connorcoley/retrosim
  52. ■ Coley et al.
 September 2018: http://www.molsci.jp/2018/pdf/4E13_w.pdf
 Coming soon… October 2018 (IBIS) November 2018
  54. ■ Materials Informatics ■ Materials Informatics Python
  55. ■ Materials Informatics ■
 https://www.jstage.jst.go.jp/article/ciqs/2017/0/2017_PL/_pdf/-char/ja
 ■
 https://www.jstage.jst.go.jp/article/cicsj/36/1/36_9/_pdf/-char/ja
 ■
 https://www.ssken.gr.jp/MAINSITE/event/2017/20171026-sci/lecture-01/SSKEN_sci2017_YoshidaRyo_presentation.pdf
  56. ■ RDKit ■ https://www.rdkit.org/
 ■ RDKit
 https://future-chem.com/rdkit-intro/
 ■
 https://github.com/chemo-wakate
 ■ RDKit http://rdkit-users.jp/
