Lecture 9: Molecular descriptors
BTT-516: Drug Designing and Development
Topics to be covered
1. Introduction
2. Types of molecular descriptors
3. Tools for descriptor calculations
4. Homework
Introduction
Molecular descriptors can be defined as mathematical representations of a molecule's
properties that are generated by algorithms.
The numerical values of molecular descriptors quantitatively describe the physical and
chemical information of the molecule.
One example is LogP, a quantitative measure of the lipophilicity of a molecule. It is
obtained by measuring how the molecule partitions between an aqueous phase and a
lipophilic phase, usually water and n-octanol.
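The LogP definition can be written as a one-line calculation. The sketch below assumes the usual convention that LogP is the base-10 logarithm of the ratio of the solute concentration in n-octanol to that in water at equilibrium; the function name `log_p` and the concentrations are hypothetical, chosen only for illustration.

```python
import math

def log_p(conc_octanol, conc_water):
    """Return log10 of the octanol/water partition coefficient."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)

# Hypothetical equilibrium concentrations, both in the same units (e.g. mol/L).
print(log_p(2.0, 0.02))  # solute prefers octanol, so the value is positive
print(log_p(0.05, 5.0))  # solute prefers water, so the value is negative
```

A positive value indicates a lipophilic molecule and a negative value a hydrophilic one.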
Molecular descriptors are useful for similarity searches in molecular libraries, where
molecules with similar physical or chemical properties are found by comparing their
descriptor values.
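A descriptor-based similarity search can be sketched as ranking library molecules by their distance to a query in descriptor space. The example assumes three descriptors per molecule (molecular weight, H-bond donors, H-bond acceptors) and a plain Euclidean distance; the function names and the small library are illustrative choices, not a prescribed method.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, library):
    """Return (name, distance) pairs sorted from most to least similar."""
    ranked = [(name, euclidean(query, vec)) for name, vec in library.items()]
    return sorted(ranked, key=lambda pair: pair[1])

# Descriptor vectors: (molecular weight, H-bond donors, H-bond acceptors)
library = {
    "methanol": (32.04, 1, 1),
    "ethanol": (46.07, 1, 1),
    "acetic acid": (60.05, 1, 2),
    "benzene": (78.11, 0, 0),
    "toluene": (92.14, 0, 0),
}
query = (60.10, 1, 1)  # 1-propanol

for name, dist in rank_by_similarity(query, library):
    print(f"{name:12s} {dist:6.2f}")
```

Because the descriptors are not scaled here, molecular weight dominates the distance; in practice descriptors are usually normalized before they are compared.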
Molecular descriptors are used in ADMET prediction models to correlate structure–property
relationships, which helps predict the ADMET properties of molecules from their descriptor
values (Khan and Sylte, 2007).
The molecular descriptors used in ADMET models can be classified by the level of molecular
representation required to calculate them:
• One-dimensional (1D)
• Two-dimensional (2D)
• Three-dimensional (3D)
One-dimensional (1D)
The 1D descriptors are the simplest type of molecular descriptor. They represent
information calculated from the molecular formula, including the count and type of atoms
in the molecule and the molecular weight.
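Since 1D descriptors need only the molecular formula, they can be computed with a few lines of code. The sketch below assumes a simple formula string without brackets or charges (for example "C9H8O4") and a small table of standard atomic masses; `atom_counts` and `molecular_weight` are hypothetical helper names.

```python
import re

# Standard atomic masses (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007,
               "O": 15.999, "S": 32.06, "Cl": 35.45}

def atom_counts(formula):
    """Count atoms per element in a simple formula such as 'C9H8O4'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def molecular_weight(formula):
    """Sum the atomic masses of all atoms in the formula."""
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())

aspirin = "C9H8O4"
print(atom_counts(aspirin))
print(round(molecular_weight(aspirin), 2))
```

Isomers share a molecular formula, so they have identical 1D descriptors; telling them apart requires 2D or 3D information.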
2D descriptors
The 2D descriptors are more complex than the 1D descriptors. They usually represent
molecular information regarding the size, shape, and electronic distribution of the molecule.
Calculating 2D descriptors depends mainly on the size of the underlying database; for parts
of a molecule where data is missing, the calculated value can be badly wrong.
3D descriptors
The 3D descriptors mainly describe properties related to the 3D conformation of the
molecule, such as intramolecular hydrogen bonding.
Examples of descriptors obtained from calculations involving the 3D structure of a molecule
are the polar and nonpolar surface areas (PSA and NPSA, respectively). More advanced
methods, such as quantum mechanics calculations, can be used to obtain 3D descriptors that
describe the valence electron distribution in the molecule (Bergström, 2005).
• 0D - bond counts, molecular weight, atom counts
• 1D - fragment counts, H-bond acceptors/donors, Crippen, PSA, SMARTS
• 2D - topological descriptors (Balaban, Randić, Wiener, BCUT, kappa, chi)
• 3D - geometrical descriptors (3D-WHIM, 3D autocorrelation, 3D-MoRSE) + surface
properties + CoMFA
• 4D - 3D coordinates + conformations (JChem conformer, CORINA, gold set,
CrystalEye)
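As an illustration of a topological (2D) descriptor from the list above, the Wiener index is the sum of the shortest-path distances, counted in bonds, between all pairs of heavy atoms. The sketch below assumes the molecule is given as a hydrogen-suppressed adjacency list with arbitrary atom numbering; the function name is an illustrative choice.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered atom pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives bond-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

# Carbon skeletons only (hydrogens suppressed).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The two isomers have the same formula, C4H10, but different Wiener indices, which shows how a 2D descriptor captures branching that 0D and 1D descriptors miss.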
Tools for descriptor calculations
A selection of commercial and free descriptor calculation utilities is collected under the
molecular descriptor software collection and the CompChem list; new programs are posted
to CCL.
• alvaDesc - new visual descriptor suite from Kode solutions covering 4000 descriptors
(developed by Alvascience)
• CDK descriptor GUI (free and open source - using open-source CDK and JOELib code)
• BlueDesc - Molecular Descriptor Calculator (free and open source - using CDK and JOELib
code, requires Java 1.6)
• ChemAxon JChem - descriptor package using the Marvin Java API (free academic license)
• ISIDA/QSPR - free fragment-based QSPR descriptor package
• E-Dragon (VCCLab) - free (150 molecules), now with GSFRAG, GSFRAG-L, ETState,
> 3000 descriptors
• MOLD2 - (FDA) a free 2D molecule descriptor package
• Toxicity Estimation Software Tool (T.E.S.T.) - (EPA) contains more than 790 2D
descriptors
• Open3DQSAR - pharmacophore modelling using molecular interaction fields (MIFs)
• Dragon - 5,270 molecular descriptors for Linux and Windows (Todeschini/Talete/Kode)
• PaDEL-Descriptor - based on CDK but includes an additional 737 2D and 3D descriptors
(NUS/Singapore)
• ADMEWORKS ModelBuilder - 400 descriptors (Jurs) and MOPAC (Stewart) (Fujitsu/Poland)
• QuBiLS-MIDAS - highly parallel software for three-dimensional molecular descriptor
calculation
Concepts for descriptor calculations and QSAR/QSPR modeling
• You need a large dataset with the molecular property to be modeled (e.g. logP, boiling
point). The more data points, the better. QSAR models built on 20 or fewer points exist,
but broad applications need to cover a large diversity space. Hundreds or thousands of
such values can be collected from databases or are now available from high-throughput
(HT) screening methods.
• You need the molecular structures themselves (as SMILES or SDF, in 2D or as optimized 3D
structures). Handling the molecules together with all their descriptors can be
challenging, so software that can do this is highly preferred.
• You need a descriptor package for descriptor calculation.
• You need to apply feature selection (a statistical process) to discard unimportant
(invariant) and sometimes highly correlated descriptors (orthogonalization).
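The feature selection step in the list above can be sketched as two filters: drop descriptors that never vary, then drop any descriptor that is highly correlated with one already kept. The cutoff of 0.95, the function names, and the small alkane table are assumptions made for illustration; real workflows often use more elaborate statistical methods.

```python
import math

def variance(values):
    """Population variance of a descriptor column."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation coefficient of two non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_descriptors(table, corr_cutoff=0.95):
    """Keep descriptors that vary and are not redundant with a kept one."""
    kept = []
    for name, column in table.items():
        if variance(column) < 1e-12:
            continue  # invariant: carries no information
        if any(abs(pearson(column, table[k])) > corr_cutoff for k in kept):
            continue  # highly correlated with a descriptor already kept
        kept.append(name)
    return kept

# Rows: butane, isobutane, pentane, isopentane, neopentane
table = {
    "n_carbon": [4, 4, 5, 5, 5],
    "n_hydrogen": [10, 10, 12, 12, 12],  # fully determined by n_carbon
    "n_rings": [0, 0, 0, 0, 0],          # invariant in this set
    "n_branch_points": [0, 1, 0, 1, 1],
}
print(select_descriptors(table))
```

Which of two correlated descriptors survives depends on column order in this sketch, so the order of the table encodes a preference.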
• You need to divide your molecule set into three parts: a training set (70%), a validation
set (30%), and an additional external set that is used in neither training nor validation.
(Sometimes the validation set is called the testing set, or vice versa.) Cross-validation
(n-fold or v-fold) or other resampling techniques (Monte Carlo sampling, jackknifing,
bootstrapping) need to be applied, especially if not enough molecules are available.
• You need to apply regression or classification methods (including meta-learning approaches).
• You need to make sure that future predictions are not made for compound classes that were
not covered during model development, since this usually results in wrong predictions.
This can be done by reporting error values, by fingerprint or substructure matching, or by
a simple dimension reduction method (PCA, PLS). For example, a logP method developed only
on alkanes will fail on complex drug molecules or on molecules with multiple -OH, -NH,
or -SH groups. Furthermore, a complete statistical description of the regression or
classification performance needs to be included.
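The data-splitting step above can be sketched with the standard library alone. The example first sets aside an external set, then splits the remainder 70/30 into training and validation sets, and also shows a plain k-fold scheme. The 20% external fraction, the fixed random seed, and the function names are assumptions for illustration only.

```python
import random

def three_way_split(items, external_fraction=0.2, train_fraction=0.7, seed=0):
    """Return (training, validation, external) lists with no overlap."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_external = round(external_fraction * len(shuffled))
    external, rest = shuffled[:n_external], shuffled[n_external:]
    n_train = round(train_fraction * len(rest))
    return rest[:n_train], rest[n_train:], external

def k_fold(items, k):
    """Yield (training, held_out) pairs; each item is held out exactly once."""
    items = list(items)
    folds = [items[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

molecule_ids = list(range(20))  # stand-ins for 20 molecules
train, validation, external = three_way_split(molecule_ids)
print(len(train), len(validation), len(external))

for training, held_out in k_fold(train + validation, k=4):
    print(len(training), len(held_out))
```

The external set is drawn before any modeling so that it never influences descriptor selection or model fitting.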
Utility of molecular descriptors
• The purpose of a molecular descriptor is to provide calculated properties of molecules
that serve as numerical descriptions or characterizations of those molecules in
other calculations, such as QSAR models, diversity analysis, or combinatorial
library design.
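In its simplest form, a QSAR/QSPR model is a regression of a measured property on one descriptor. The sketch below fits a least-squares line to the boiling points of six n-alkanes, using the carbon count as the only descriptor. The boiling points are approximate literature values, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line."""
    my = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Propane to octane: carbon count and approximate boiling point in deg C.
n_carbon = [3, 4, 5, 6, 7, 8]
boiling_point = [-42.1, -0.5, 36.1, 68.7, 98.4, 125.7]

slope, intercept = fit_line(n_carbon, boiling_point)
print(f"bp = {slope:.1f} * n_carbon + {intercept:.1f}")
print(f"R^2 = {r_squared(n_carbon, boiling_point, slope, intercept):.3f}")
```

A model like this is only meaningful for compounds similar to its training set: it was built on unbranched alkanes and should not be expected to predict boiling points of alcohols or drug-like molecules.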
Thank you
Er. Rajan Rolta
Faculty of Applied Sciences and Biotechnology
Shoolini University,
Village Bhajol, Solan (H.P)
+91-7018792621 (Mob No.)
rajanrolta@shooliniuniversity.com
