This study compares molecular descriptors calculated from 1D SMILES strings and 3D structures to determine if 3D structures are necessary for descriptor calculations. Descriptors for 94 molecules were calculated from both 1D and 3D formats. Most descriptors showed strong correlation between the two formats, indicating 3D structures are not essential. Comparison of descriptors calculated by MOE and RDKit showed low correlation, suggesting differences between software. The goal is to efficiently process over 1.7 million compounds for virtual screening on high performance computers by calculating descriptors from 1D SMILES strings if 3D structures are not required.
Comparing Structural Descriptors from 1D and 3D Formats
1. Objectives
Acknowledgements
• Svetlana Gelpí-Domínguez acknowledges support from NSF Northeast LSAMP Bridge to Doctorate Award
#1400382. She also thanks Dr. Felice C. Lightstone, Dr. Brian J. Bennion, Dr. Sergio Wong,
and Dr. Eric Schwegler for the opportunity to work in the 2017 CCMS cohort, and
Dr. Miguel Morales-Silva and Tony Baylis for their constant mentoring at LLNL. Prepared by LLNL
under Contract DE-AC52-07NA27344.
Molecular Descriptors: Comparing Structural Complexity and Software
Svetlana Gelpí-Domínguez1,2, Sergio Wong2, Brian J. Bennion2, Felice C. Lightstone2
1) Department of Chemistry, University of Connecticut, Storrs, CT 06269
2) Lawrence Livermore National Laboratory, Livermore, CA 94550
Methodology
Results and Discussion Conclusions
Figures A, B, C and D. A total of 94 molecules, each in 1-D and 3-D format, were used as input to calculate 188
molecular descriptors per molecule. Figs A, B, and C are correlation plots for the descriptors molecular weight,
hydrophobicity, and accessible surface area. Fig D shows that 130 of the 188 molecular descriptors (shaded in blue)
have an R² above 0.5; these do not depend on 3-D structural input. The other 58 descriptors (shown in orange) can only
be calculated from 3-D structure files, which explains their low R² values.
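The per-descriptor comparison behind Fig D can be sketched as follows. This is a minimal illustration, not the code used for the poster: it assumes R² is the squared Pearson correlation between the values obtained from the two input formats, and the function names and the dictionary layout (descriptor name mapped to one value per molecule) are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length value lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # A constant column carries no correlation information.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def split_by_r_squared(table_1d, table_3d, threshold=0.5):
    """Split shared descriptor names into (above, at_or_below) the threshold.

    Each table maps a descriptor name to a list of values, one per molecule,
    with the molecules in the same order in both tables (an assumption).
    """
    above, at_or_below = [], []
    for name in sorted(table_1d.keys() & table_3d.keys()):
        score = r_squared(table_1d[name], table_3d[name])
        (above if score > threshold else at_or_below).append(name)
    return above, at_or_below
```

Calling `split_by_r_squared` on the two descriptor tables returns the names that would fall in the blue group (R² above 0.5) and the orange group of Fig D.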
Figures E, F, G and H. Comparison of descriptors calculated using MOE and RDKit. Here we used 1-D
SMILES strings as input structures to compare the 34 descriptors that both programs have in common. We observe a
strong correlation in molecular weight and hydrophobicity (fig. E). Fig G shows a low correlation between the MOE
SlogP_VSA descriptor and the RDKit ‘MOE-like’ SlogP_VSA descriptor. Fig H shows the frequency of R² for the
34 descriptors used in this comparison. The majority of these descriptors have an R² under 0.6, indicating a
low correlation between the two programs.
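An R² frequency plot like Fig H (or Fig D) amounts to binning one R² value per descriptor. The sketch below shows one way to do that; the bin width of 0.1 and the function names are assumptions, since the poster does not state how the histograms were built.

```python
def r_squared_histogram(values, n_bins=10):
    """Count R² values in equal-width bins covering the interval [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"R² must lie in [0, 1], got {v}")
        # A value of exactly 1.0 is placed in the last bin.
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def fraction_below(values, cutoff):
    """Fraction of R² values strictly below the cutoff."""
    if not values:
        raise ValueError("no R² values given")
    return sum(1 for v in values if v < cutoff) / len(values)
```

With the R² values of the shared descriptors as input, `fraction_below(values, 0.6)` gives the share of descriptors that fall under the 0.6 mark discussed in the caption.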
Future work
• Produce a reliable Quantitative Structure–Activity Relationship (QSAR) model
that predicts the bioactivity of these molecules against an important receptor such
as estrogen receptor alpha.
• Are 3-D descriptors necessary to build accurate QSAR models?
• There is a strong correlation between descriptors calculated from 1-D
SMILES strings and descriptors calculated from 3-D quantum mechanical structures.
• The average R² between molecular descriptors calculated in MOE from 1-D SMILES
strings and from 3-D structures for the 94 molecules was 0.72. Fig D shows that
the majority of highly correlated molecular descriptors are 0-D, 1-D, and 2-D
descriptors. This means that 3-D structures are not essential as input when only
0-D, 1-D, and 2-D descriptors are calculated.
• In the MOE calculations, 3-D descriptor values do depend on the dimensionality
of the input structure.
• There is a low correlation between MOE descriptors and the ‘MOE-like’
descriptors found in RDKit (average R² of 0.12).
• If your QSAR model does not depend on 3-D descriptors then your pipeline can
become more efficient by using 1-D SMILES strings for your descriptor
calculations.
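The last conclusion above can be turned into a simple check at the start of a pipeline. The sketch below is illustrative only: the descriptor names and their dimension labels are assumptions made for this example, not a list taken from MOE or RDKit.

```python
# Hypothetical lookup table: names and dimension labels are assumptions
# chosen for illustration, not identifiers from MOE or RDKit.
DESCRIPTOR_DIMENSION = {
    "atom_count": 0,
    "molecular_weight": 0,
    "slogp": 2,
    "accessible_surface_area": 3,
    "dipole_moment": 3,
    "principal_moment_of_inertia": 3,
}


def needs_3d_input(descriptor_names, dimension_table=DESCRIPTOR_DIMENSION):
    """Return True if any requested descriptor is labeled 3-D."""
    unknown = [n for n in descriptor_names if n not in dimension_table]
    if unknown:
        raise KeyError(f"no dimension label for: {unknown}")
    return any(dimension_table[n] >= 3 for n in descriptor_names)


def choose_input_format(descriptor_names, dimension_table=DESCRIPTOR_DIMENSION):
    """Pick the cheapest input format that supports the requested descriptors."""
    if needs_3d_input(descriptor_names, dimension_table):
        return "3-D structure"
    return "1-D SMILES"
```

A model built only on entries such as `molecular_weight` and `slogp` would be routed to SMILES input, skipping 3-D structure generation entirely.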
SMILES string: c1ccccc1
Software: MOE (commercial) and RDKit (open source)
Abstract
What are descriptors? And how are they used? In a large effort to predict the
activity of over 1.7 million compounds in various in vitro assays, the time it
takes to extract molecules from a database and process them for virtual screening is
crucial. Applications need to take advantage of LLNL’s high-performance computing.
[Workflow diagram labels: 1.7 million compounds; 1-D Format (Simplified Molecular-Input Line-Entry System, SMILES, e.g. c1ccccc1); 3-D Format; Molecular Descriptor Calculation; Predict Activity.]
4 types of Molecular Descriptors:
• Topological: Atom count
• Geometrical: Principal Moment of Inertia
• Electronic: Dipole moment
• 3D Descriptors
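As a concrete example of descriptors that need only a SMILES string, the sketch below computes a few of them with RDKit. It is not the poster's pipeline: the function name and the choice of descriptors are assumptions, and geometrical or electronic descriptors such as the principal moment of inertia or dipole moment are left out because they require 3-D coordinates.

```python
def basic_descriptors(smiles):
    """Compute a few SMILES-derivable descriptors for one molecule."""
    # RDKit is imported inside the function so that the surrounding module
    # can still be loaded on machines where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        # Topological: a simple atom count (hydrogens excluded).
        "heavy_atom_count": mol.GetNumHeavyAtoms(),
        # Average molecular weight.
        "molecular_weight": Descriptors.MolWt(mol),
        # Wildman-Crippen logP, a hydrophobicity estimate.
        "slogp": Crippen.MolLogP(mol),
        # First bin of RDKit's MOE-type SlogP_VSA descriptor family.
        "slogp_vsa1": Descriptors.SlogP_VSA1(mol),
    }
```

Calling `basic_descriptors("c1ccccc1")` returns these four values for benzene without any 3-D structure being generated.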
Software used to perform calculations.
MOAD (clip.llnl.gov:5507)
• Are 3-D structures determined by ab initio methods better inputs than 1-D
SMILES strings for the calculation of molecular descriptors? Figures A-D
address this question.
• Are the MOE-like descriptors in RDKit the same as those calculated by MOE?
Figures E-H address this question.
• Upload the 1.7 million compounds with their calculated descriptors to the MOAD database.
[Figure panels: A), B), C) correlation plots; D) R² Frequency, Calculated Descriptors; E), F), G) correlation plots; H) R² Frequency, RDKit vs. MOE. Two panels are labeled Positive Control.]