, SDS AI Lab.
2019. 6. 23.
PR-173
1. Research Background
‘Chemical space’
• The goal of drug and material design is to identify novel molecules that have certain desirable
properties.
https://phys.org/news/2018-08-software-framework-drug-discovery-ieee.html
Virtual screening: Combination of Virtual Screening Protocol by in Silico toward the Discovery of Novel 4-Hydroxyphenylpyruvate Dioxygenase Inhibitors (2018)
Genetic algorithm: Evolutionary algorithms for de novo drug design – A survey (2015)
Main idea – continuous representation of molecules
• Hand-specified mutation rules are unnecessary
• We can enable the use of gradient-based optimization to make larger jumps in chemical space.
• A data-driven representation can leverage large sets of unlabeled chemical compounds to automatically build an
even larger implicit library.
Main idea – Variational autoencoder for manifold learning
https://www.slideshare.net/NaverEngineering/ss-96581209
Objective
• A new family of methods for exploring chemical space based on continuous encodings of molecules.
Keywords: Chemical Design, Data-Driven, Continuous, VAE
2. Methods
Model training
- Database (structure, property)
- Molecular descriptor
- Model structure & hyperparameter
Database
ZINC DB
- 250,000 drug-like, commercially available molecules from ZINC DB
https://zinc.docking.org/subsets/drug-like
QM9 DB
- Set of molecules with fewer than 9 heavy atoms
- 108,000 molecules were used.
http://quantum-machine.org/datasets/
Molecular descriptor
https://www.researchgate.net/publication/235919348_manual_for_chemopy/
Yibo Li et al., Journal of Cheminformatics. 2018
SMILES (Simplified molecular-input line-entry system)
Molecular fingerprint
Molecular graph
https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
Model structure – Variational autoencoder
Encoder: 1D CNN
• Structural parameters
- Max length: 120 (for ZINC), 34 (for QM9)
- Different characters: 35 (for ZINC), 22 (for QM9)
- Three 1D convolutional layers with filter sizes 9, 9, 10 and 9, 9, 11 convolution kernels, respectively
- Latent space dimension: 196 (for ZINC), 156 (for QM9)
Yibo Li et al., Journal of Cheminformatics. 2018
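The structural parameters above imply that each SMILES string is padded to a fixed length and turned into a (max length x number of characters) one-hot matrix before it enters the 1D CNN. The sketch below illustrates that idea only. The helper names, the space used as padding character, and the toy character set built from three example strings are assumptions, not the presenter's code; the real character sets have 35 (ZINC) and 22 (QM9) entries.

```python
# Minimal sketch: character-level one-hot encoding of a SMILES string.
# PAD, build_charset and one_hot_encode are hypothetical names for illustration.
PAD = " "  # assumed padding character


def build_charset(smiles_list):
    """Collect the distinct characters, with the padding character at index 0."""
    chars = sorted(set("".join(smiles_list)))
    return [PAD] + [c for c in chars if c != PAD]


def one_hot_encode(smiles, charset, max_len=120):
    """Return a max_len x len(charset) matrix of 0/1 values."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    index = {c: i for i, c in enumerate(charset)}
    padded = smiles.ljust(max_len, PAD)  # right-pad to the fixed length
    matrix = []
    for ch in padded:
        row = [0] * len(charset)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix


examples = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
charset = build_charset(examples)
encoded = one_hot_encode("CCO", charset, max_len=120)
```

Here `max_len=120` follows the ZINC setting on the slide; for QM9 the same function would be called with `max_len=34`.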
Model structure – Variational autoencoder
Decoder: gated recurrent unit (GRU)
Josep Arús-Pous et al., Journal of Cheminformatics. 2018
• Structural parameters
- The decoder feeds into three layers of gated recurrent unit (GRU) networks with a hidden dimension of 488.
- Property prediction: fully connected layers [1000, 1000]
3. Experimental Results
1) Mapping molecules to the latent space
5000 latent points -> 1000 attempts
Gaussian noise added to the encoding
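As a rough illustration of "Gaussian noise added to the encoding", the sketch below perturbs one latent vector many times. It covers only the sampling step: decoding each perturbed point would need a trained decoder, which is not shown. The function name, the noise scale, and the all-zero stand-in encoding are assumptions for illustration.

```python
import random


def perturb_latent(z_mean, noise_std=1.0, n_samples=5, seed=0):
    """Return n_samples copies of z_mean, each with independent Gaussian noise added."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [m + rng.gauss(0.0, noise_std) for m in z_mean]
        for _ in range(n_samples)
    ]


# Stand-in for the encoding of one molecule (196 dimensions, as for ZINC).
z_mean = [0.0] * 196
samples = perturb_latent(z_mean, noise_std=0.1, n_samples=1000)
```

Each of the 1000 perturbed vectors would then be passed to the decoder as one decoding attempt.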
1) Mapping molecules to the latent space
Distance in the latent space
Interpolation!
A continuous latent space allows interpolation of molecules
“Interpolating linearly between two points might pass by
an area of low probability, to keep the sampling on the
areas of high probability we utilize spherical interpolation
(slerp).”
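The spherical interpolation mentioned in the quote can be sketched as follows. This is the standard slerp formula written with the Python standard library, not the authors' implementation; the fallback to linear interpolation for (anti)parallel or zero-length vectors is an added assumption to avoid dividing by zero.

```python
import math


def slerp(z0, z1, t):
    """Spherical linear interpolation between latent vectors z0 and z1, with t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    if n0 == 0.0 or n1 == 0.0:
        # Angle undefined for a zero vector: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)      # angle between the two vectors
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-8:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


# Intermediate latent points between two encodings (toy 2-D vectors).
path = [slerp([1.0, 0.0], [0.0, 1.0], k / 4) for k in range(5)]
```

For two vectors of equal norm, every point on the path keeps that norm, which is why the path avoids the low-probability region near the origin that a straight line can cross.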
Distribution of chemical properties in the training sets versus the generated molecules.
Figure 2 in SI
• Property distributions of VAE-generated molecules are more similar to the original data set.
• The molecules generated by the VAE are new, as the combinatorial space is extremely large.
The mapping of property values to the latent space representation of molecules
Encoding -> draw
Sampling -> prediction -> draw
Performance of property prediction model
Optimization of Molecules via Properties
• Gradient-based optimization (Gaussian interpolation)
• Molecule generation from continuous latent space
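To make gradient-based optimization in a continuous latent space concrete, the sketch below runs plain gradient ascent on a toy differentiable "property" function. The quadratic surrogate, its analytic gradient, the step size, and all names are assumptions chosen for illustration; they stand in for a trained property predictor and are not the model used in the work being presented.

```python
def toy_property(z, target):
    """Hypothetical property surrogate: highest (zero) when z equals target."""
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def toy_gradient(z, target):
    """Analytic gradient of toy_property with respect to z."""
    return [-2.0 * (a - b) for a, b in zip(z, target)]


def gradient_ascent(z_start, target, lr=0.1, steps=50):
    """Follow the gradient uphill and return every latent point visited."""
    z = list(z_start)
    path = [list(z)]
    for _ in range(steps):
        grad = toy_gradient(z, target)
        z = [a + lr * g for a, g in zip(z, grad)]
        path.append(list(z))
    return path


target = [1.0, -2.0, 0.5]          # latent point with the best toy property
path = gradient_ascent([0.0, 0.0, 0.0], target)
scores = [toy_property(z, target) for z in path]
```

In the real setting, each point along `path` would be decoded back to a SMILES string to obtain candidate molecules with improved predicted properties.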
4. Conclusion
• We propose a new family of methods for exploring chemical space based on continuous encodings of molecules.
• These results, and their application to optimizing objective functions of molecular properties, have already influenced and will continue to influence new avenues for molecular design.
Thank you.

PR173 : Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules
