1. Quantitative Structure Activity
Relationship (QSAR): Statistical
method and Product concept
Presented by:
Radha Sureshrao Chafle
F. Y. M. Pharm. (Semester II)
(Pharmacology)
Guided by:
Dr. Mrs. Vandana S. Nikam
HOD, Associate Professor in Pharmacology
2. What is QSAR?
QSAR is a mathematical relationship
that describes the structural dependence
of biological activities by
physicochemical parameters, by indicator
variables encoding different structural
features, or by three-dimensional
molecular property profiles of the
compounds.
3. Contd.
Drugs that exert their biological effects by interaction
with a specific target must have a three-dimensional
structure that, in the arrangement of its functional groups
and in its surface properties, is more or less
complementary to a binding site.
The better the steric fit and the complementarity of the
surface properties of a drug to its binding site, the
higher its affinity will be and the higher its
biological activity may be.
4. Classical QSAR analyses
• consider only 2D structures.
• main field of application is substituent
variation of a common scaffold.
3D QSAR
• has a much broader scope.
• starts from 3D structures and
correlates biological activities with
3D-property fields.
5. History & Development of QSAR
1868, Crum-Brown and Fraser: Published an equation Φ =
f(C) i.e. assumption of physiological activity Φ as function of
the chemical structure C.
1900, H. H. Meyer and C. E. Overton: lipoid theory of
narcosis
1930s, L. Hammett: electronic sigma constants
1964, C. Hansch and T. Fujita: QSAR
1984, P. Andrews: affinity contributions of functional groups
1985, P. Goodford: GRID (hot spots at protein surface)
1988, R. Cramer: 3D QSAR
1992, H.-J. Bohm: LUDI interaction sites, docking, scoring
1997, C. Lipinski: bioavailability rule of five
1998, Ajay, W. P. Walters and M. A. Murcko; J. Sadowski
and H. Kubinyi: drug-like character
6. Basic Requirements in QSAR
Studies
all analogs belong to a congeneric series
all analogs exert the same mechanism of
action
all analogs bind in a comparable manner
the effects of isosteric replacement can be
predicted
binding affinity is correlated to interaction
energies
biological activities are correlated to binding
affinity
7. Molecular Properties and Their Parameters
Molecular Property | Corresponding Interaction | Parameters
Lipophilicity | hydrophobic interactions | log P, π, f, RM, χ
Polarizability | van der Waals interactions | MR, parachor, MV
Electron density | ionic bonds, dipole-dipole interactions, hydrogen bonds, charge transfer interactions | σ, R, F, κ, quantum chemical indices
Topology | steric hindrance, geometric fit | Es, rv, L, B, distances, volumes
8. QSAR models
Hansch model (property-property relationship):
Definition of the lipophilicity parameter π
πX = log P(R-X) - log P(R-H)
where P(R-X) is the n-octanol/water partition coefficient of the
substituted compound R-X and P(R-H) that of the parent compound R-H.
Linear Hansch model
log 1/C = a log P + b σ + c MR + ... + k
Nonlinear Hansch models
log 1/C = a (log P)^2 + b log P + c σ + ... + k
log 1/C = a π^2 + b π + c σ + ... + k
log 1/C = a log P - b log (βP + 1) + c σ + ... + k
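The parabolic model implies an optimum lipophilicity: setting the derivative of a (log P)^2 + b log P with respect to log P to zero gives log P(opt) = -b / (2a), which is a maximum only when a is negative. The following is a minimal Python sketch of that arithmetic. The coefficient values are hypothetical and chosen only for illustration; they are not fitted to any data set.

```python
def hansch_parabolic(log_p, a, b, k, c=0.0, sigma=0.0):
    """Evaluate log 1/C = a (log P)^2 + b log P + c sigma + k."""
    return a * log_p ** 2 + b * log_p + c * sigma + k


def optimum_log_p(a, b):
    """Vertex of the parabola; it is an activity maximum only when a < 0."""
    if a >= 0:
        raise ValueError("a must be negative for an activity maximum")
    return -b / (2.0 * a)


# Hypothetical coefficients, for illustration only.
A, B, K = -0.25, 1.0, 2.0
LOG_P_OPT = optimum_log_p(A, B)
BEST_ACTIVITY = hansch_parabolic(LOG_P_OPT, A, B, K)
```

With these assumed coefficients the optimum lies at log P = 2; compounds that are either more or less lipophilic are predicted to be less active.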
9. Contd.
Free-Wilson model (structure-property relationship)
log 1/C = Σ ai + µ
ai = substituent group contributions
µ = activity contribution of reference compound
Mixed Hansch/Free-Wilson model
log 1/C = a (log P)^2 + b log P + c σ + ... + Σ ai + k
log 1/C = a log P - b log (βP + 1) + c σ + ... + Σ ai + k
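As a rough illustration of the Free-Wilson equation, the sketch below adds the group contributions ai of the substituents present in a compound to the reference activity µ. The positions, substituents and numerical contributions are hypothetical; in a real analysis the ai values are obtained by regression on indicator variables.

```python
def free_wilson_activity(substituents, contributions, mu):
    """log 1/C = sum of a_i for the substituents present + mu."""
    return mu + sum(contributions[s] for s in substituents)


# Hypothetical group contributions a_i, keyed by (position, substituent).
CONTRIBUTIONS = {
    ("R1", "H"): 0.0,     # reference substituent contributes nothing
    ("R1", "Cl"): 0.5,
    ("R2", "H"): 0.0,
    ("R2", "CH3"): 0.25,
}
MU = 4.0  # hypothetical activity contribution of the reference compound

REFERENCE = free_wilson_activity([("R1", "H"), ("R2", "H")], CONTRIBUTIONS, MU)
PREDICTED = free_wilson_activity([("R1", "Cl"), ("R2", "CH3")], CONTRIBUTIONS, MU)
```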
11. n-Octanol/Water as a Standard
System
membrane analogous structure
hydrogen bond donor and acceptor
practically insoluble in water
no desolvation on transfer into organic
phase
very low vapor pressure
transparent in the UV region
large database of log P values
12. Additivity Principle of π Values (C. Hansch, 1964)
πX = log P(R-X) - log P(R-H)
The lipophilicity parameter π is an additive,
constitutive molecular parameter;
compare the Hammett equation:
ρσX = log K(R-X) - log K(R-H)
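Because π is additive, the log P of a substituted compound can be estimated by adding the π values of its substituents to the log P of the parent. A small Python sketch of this bookkeeping follows. The numbers are commonly quoted aromatic values (benzene log P 2.13; π of Cl 0.71, CH3 0.56, OH -0.67) and should be treated as illustrative assumptions to be checked against a current compilation.

```python
# Commonly quoted values, used here as illustrative assumptions.
LOG_P_BENZENE = 2.13
PI = {"Cl": 0.71, "CH3": 0.56, "OH": -0.67}


def estimate_log_p(log_p_parent, substituents, pi_values):
    """log P(R-X) = log P(R-H) + sum of the substituent pi values."""
    return log_p_parent + sum(pi_values[s] for s in substituents)


LOG_P_PHENOL = estimate_log_p(LOG_P_BENZENE, ["OH"], PI)
LOG_P_CHLOROTOLUENE = estimate_log_p(LOG_P_BENZENE, ["Cl", "CH3"], PI)
```

The estimate ignores interactions between substituents, so it is only as good as the additivity assumption itself.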
14. Hydrophobic Fragmental Constants f
The hydrophobic fragmental constant of a substituent
or molecular fragment represents the lipophilicity
contribution of that fragment.
(R. Rekker, The Hydrophobic Fragmental Constant,
Elsevier, Amsterdam 1977; R. Rekker, Eur. J. Med. Chem.
14, 479 (1979))
log P = Σ ai fi (R. Rekker, 1973)
where fi is the fragmental constant of fragment i and ai the
number of times that fragment occurs in the molecule.
Experimental Determination of Log P Values
- Shake flask method
- Reversed phase thin layer chromatography
- High performance liquid chromatography (HPLC)
15. Polarizability Parameters
Molar volume, Molar Refractivity, Parachor
MV = MW / d
MR = ((n^2 - 1) / (n^2 + 2)) · (MW / d)
PA = γ^(1/4) · MW / d
d = density; n = refraction index;
γ = surface tension
(MR is most often scaled by a factor of 0.1)
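The three polarizability parameters follow directly from the molecular weight MW, density d, refractive index n and surface tension γ. The Python sketch below evaluates them for benzene. The input values are approximate handbook figures near 20 °C and are quoted here as assumptions, with γ in mN/m and d in g/cm3.

```python
def molar_volume(mw, density):
    """MV = MW / d."""
    return mw / density


def molar_refractivity(mw, density, n):
    """MR = (n^2 - 1) / (n^2 + 2) * MW / d (Lorentz-Lorenz form)."""
    return (n ** 2 - 1.0) / (n ** 2 + 2.0) * mw / density


def parachor(mw, density, gamma):
    """PA = gamma^(1/4) * MW / d."""
    return gamma ** 0.25 * mw / density


# Approximate values for benzene near 20 degrees C (assumed inputs).
MW, D, N, GAMMA = 78.11, 0.8765, 1.5011, 28.9

MV = molar_volume(MW, D)
MR = molar_refractivity(MW, D, N)
PA = parachor(MW, D, GAMMA)
MR_SCALED = 0.1 * MR  # scaling usually applied before regression
```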
17. Quantum Mechanical Descriptors
Atom partial charges:
Mulliken population analysis (orbital population)
ESP charges (mapping EP to atom locations)
Dipole moment:
strength and orientation behavior of a molecule in an
electrostatic field
HOMO / LUMO (“frontier orbital theory”):
HOMO = energy of highest occupied molecular orbital,
“nucleophilicity”
LUMO = energy of lowest unoccupied molecular orbital,
“electrophilicity”
Superdelocalizability:
estimate for the reactivity of positions in aromatic hydrocarbons
18. 3D QSAR
3D QSAR is an extension of classical QSAR that
exploits the three-dimensional properties of the ligands
to predict their biological activity using robust
statistical analyses such as PLS, G/PLS and ANN.
3D QSAR uses probe-based sampling within a
molecular lattice to determine three-dimensional
properties of molecules and then correlates these
3D descriptors with biological activity.
Some major factors that contribute to the overall free
energy of binding, such as desolvation energetics,
temperature, diffusion, transport, pH and salt
concentration, are difficult to handle and are thus
usually ignored.
19. Classification of 3D QSAR
On the basis of intermolecular bonding:
• Ligand-based 3D QSAR, e.g. CoMFA, CoMSIA, COMPASS, CoMMA, SoMFA
• Receptor-based 3D QSAR, e.g. COMBINE, AFMoC, HIFA, CoRIA
On the basis of alignment criterion:
• Alignment-dependent 3D QSAR, e.g. CoMFA, CoMSIA, GERM, COMBINE, AFMoC, HIFA, CoRIA
• Alignment-independent 3D QSAR, e.g. COMPASS, CoMMA, HQSAR, WHIM, EVA/CoSA, GRIND
On the basis of chemometric techniques used:
• Linear 3D QSAR, e.g. CoMFA, CoMSIA, AFMoC, GERM, CoMMA, SoMFA
20. Comparative Molecular Field
Analysis (CoMFA)
In 1987, Cramer developed the predecessor of 3D
approaches, the Dynamic Lattice Oriented Molecular
Modeling System (DYLOMMS), which uses PCA to
extract vectors from the molecular interaction fields;
these vectors are then correlated with biological
activities.
CoMFA, a powerful 3D QSAR methodology, is a
combination of GRID and PLS.
21. Protocol for CoMFA
Determination of the bioactive conformations of the
molecules.
Superimposition (alignment) of the molecules,
using either manual or automated methods, in a
manner defined by the supposed mode of
interaction with the receptor.
The overlaid molecules are placed in the center of
a lattice grid with a spacing of 2 Å.
The steric and electrostatic fields are calculated
around the molecules with different probe groups
positioned at all intersections of the lattice.
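The field calculation in this protocol can be sketched in a few lines of Python. The example below places a probe at every intersection of a 2 Å lattice and sums a Lennard-Jones 6-12 steric term and a Coulomb electrostatic term over the atoms. Everything numerical apart from the 2 Å spacing is an assumption made for illustration: the two-atom "molecule", its charges and Lennard-Jones parameters, the +1 probe charge, the 332.06 kcal·Å/(mol·e^2) Coulomb constant and the 30 kcal/mol cut-off. It is not the force field used by any particular CoMFA implementation.

```python
import itertools
import math

COULOMB_K = 332.06   # kcal*Angstrom/(mol*e^2), unit conversion constant
CUTOFF = 30.0        # kcal/mol, assumed energy cut-off
SPACING = 2.0        # Angstrom, lattice spacing from the protocol
PROBE_CHARGE = 1.0   # assumed probe charge

# Hypothetical two-atom "molecule": (x, y, z, partial charge, Rmin, epsilon)
ATOMS = [
    (0.0, 0.0, 0.0, -0.4, 3.4, 0.1),
    (1.5, 0.0, 0.0, 0.4, 3.4, 0.1),
]


def clip(energy):
    """Truncate an energy to the interval [-CUTOFF, +CUTOFF]."""
    return max(-CUTOFF, min(CUTOFF, energy))


def fields_at(point):
    """Return (steric, electrostatic) probe energies at one lattice point."""
    steric = 0.0
    electrostatic = 0.0
    for x, y, z, q, r_min, eps in ATOMS:
        # Avoid division by zero when a lattice point sits on an atom.
        r = max(math.dist(point, (x, y, z)), 1e-3)
        ratio6 = (r_min / r) ** 6
        steric += eps * (ratio6 ** 2 - 2.0 * ratio6)
        electrostatic += COULOMB_K * q * PROBE_CHARGE / r
    return clip(steric), clip(electrostatic)


def lattice(lo, hi, spacing):
    """All intersections of a cubic lattice spanning [lo, hi] on each axis."""
    n = int(round((hi - lo) / spacing)) + 1
    axis = [lo + i * spacing for i in range(n)]
    return list(itertools.product(axis, repeat=3))


GRID = lattice(-4.0, 4.0, SPACING)
DESCRIPTORS = [fields_at(p) for p in GRID]  # one (steric, electrostatic) pair per point
```

Each molecule in an aligned data set would yield one such descriptor vector, and the resulting table is what the subsequent PLS step correlates with activity.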
22. Contd.
The PLS technique is used to correlate the
interaction energy or field values with the
biological activity, so that the quantitative
influence of specific chemical features of the
molecules on their biological activity can be
identified and extracted.
The results are expressed as correlation
equations with a number of latent variable
terms, each of which is a linear combination
of the original independent lattice descriptors.
23. Steps in CoMFA:
A set of molecules is first selected.
All molecules have to interact with the same kind
of receptor (or enzyme, ion channel, transporter)
in the same manner, i.e., with identical binding
sites in the same relative geometry.
A subgroup of the molecules is selected
as a training set to derive the
CoMFA model.
The remaining molecules form a
test set, which independently checks the validity
of the derived model(s).
24. Atomic partial charges are calculated and
(several) low energy conformations are
generated. A pharmacophore hypothesis is
derived to orient the superposition of all
individual molecules and to afford a rational
and consistent alignment.
A sufficiently large box is positioned around
the molecules and a grid distance is defined.
PLS analysis is the most appropriate method
for correlating the calculated field values with
the biological activities. Normally, cross-validation
is used to check the internal predictivity of the
derived model.
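Cross-validation can be illustrated with the leave-one-out scheme: each compound is left out in turn, the model is refitted on the rest, and the squared prediction errors are summed (PRESS) to give q^2 = 1 - PRESS / SS(total). The Python sketch below uses a one-descriptor straight-line fit in place of PLS to keep the example short; the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b of y = a + b x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_mean - b * x_mean, b


def loo_q_squared(x, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_total."""
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Hypothetical descriptor values and activities, for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1.1, 1.9, 3.2, 3.9, 5.1]
Q2 = loo_q_squared(X, Y)
```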
27. Drawbacks of CoMFA:
Too many adjustable parameters
Uncertainty in selection of compounds and
variables.
Fragmented contour maps with variable
selection procedures.
Hydrophobicity not well quantified
Cut-off limits used.
Low signal to noise ratio due to many useless
field variables.
Imperfections in potential energy functions.
Applicable only to in vitro data.
28. Comparative Molecular Similarity
Indices Analysis (CoMSIA)
Molecular similarity indices calculated
from modified SEAL similarity fields are
employed as descriptors to simultaneously
consider steric, electrostatic, hydrophobic
and hydrogen bonding properties.
These indices are estimated indirectly by
comparing the similarity of each molecule in
the dataset with a common probe atom
(having a radius of 1 Å, a charge of +1 and a
hydrophobicity of +1) positioned at the
intersections of a surrounding grid/lattice.
29. For computing similarity at all grid points, the
mutual distances between the probe atom and the
atoms of the molecules in the aligned dataset are
also taken into account.
To describe this distance dependence and
calculate the molecular properties, Gaussian type
functions are employed.
Since the underlying Gaussian-type functional
forms are ‘smooth’ with no singularities, their
slopes are not as steep as those of the Coulombic and
Lennard-Jones potentials in CoMFA; therefore no
arbitrary cut-off limits need to be defined.
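The Gaussian distance dependence can be sketched as follows. For one property, the similarity index at a grid point is taken as A = -Σ w(probe) · w(i) · exp(-α r(i)^2), summed over the atoms i of the molecule, where r(i) is the probe-atom distance. The attenuation factor α = 0.3 is a commonly used default and is assumed here, as are the two hypothetical atoms and their property values.

```python
import math

ALPHA = 0.3  # attenuation factor; assumed default


def similarity_index(grid_point, atoms, probe_value=1.0, alpha=ALPHA):
    """A = -sum_i w_probe * w_i * exp(-alpha * r_i^2) for one property."""
    total = 0.0
    for position, w_atom in atoms:
        r_squared = sum((g - p) ** 2 for g, p in zip(grid_point, position))
        total += probe_value * w_atom * math.exp(-alpha * r_squared)
    return -total


# Hypothetical atoms: (position in Angstrom, property value such as partial charge)
ATOMS = [((0.0, 0.0, 0.0), 0.5), ((1.5, 0.0, 0.0), -0.5)]

AT_NUCLEUS = similarity_index((0.0, 0.0, 0.0), ATOMS)  # finite even at r = 0
FAR_AWAY = similarity_index((20.0, 0.0, 0.0), ATOMS)   # decays smoothly to zero
```

The value stays finite even when the probe coincides with an atom, which is why no cut-off is needed.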
30. Comparison between CoMFA and CoMSIA
Feature | CoMFA | CoMSIA
Function type | Lennard-Jones potential, Coulomb potential | Gaussian
Descriptors | Interaction energies | Similarity indices
Cut-off | Required | Not required
Field | Steric, electrostatic | Steric, electrostatic, hydrophobic, hydrogen bond donor and hydrogen bond acceptor
Contour map | Often not contiguous | Contiguous
Model reproducibility | Poor | Good
*CoMSIA is provided by Tripos Inc. in the SYBYL software, along
with CoMFA.
31. Statistical methods used in QSAR
Linear Regression Analysis (LRA) | Multivariate Data Analysis | Pattern Recognition
Simple linear regression | Principal component analysis (PCA) | Cluster analysis
Multiple linear regression (MLR) | Principal component regression (PCR) | Artificial neural networks (ANNs)
Stepwise multiple linear regression | Partial least squares analysis (PLS) | k-nearest neighbor (kNN)
- | Genetic function approximation (GFA) | -
- | Genetic partial least squares (G/PLS) | -
32. Linear Regression Analysis (LRA)
Linear regression analyses are
considered easily interpretable
methods for QSAR analysis.
These techniques construct a statistical
model to represent the correlation of one
or more independent variables (x) with a
dependent (response) variable (y).
The model can be used to predict y
from the knowledge of the x variables, which
may be either quantitative or qualitative.
33. a. Simple Linear regression
method
Standard linear regression calculation to
generate a set of QSAR equations that
include a single independent descriptor x
and dependent variable y.
A one term linear equation is produced
separately for each independent variable
from the descriptor set.
y = a + bx
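For a single descriptor the least-squares coefficients have a closed form: b = Sxy / Sxx and a = mean(y) - b · mean(x). A minimal Python sketch follows; the descriptor and activity values are hypothetical and were generated from y = 1.5 + 1.0 x so that the fit can be checked by eye.

```python
def fit_simple_linear(x, y):
    """Return intercept a and slope b of the least-squares line y = a + b x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b


# Hypothetical descriptor values (e.g. log P) and activities (log 1/C).
LOG_P = [0.5, 1.0, 1.5, 2.0, 2.5]
ACTIVITY = [2.0, 2.5, 3.0, 3.5, 4.0]
A, B = fit_simple_linear(LOG_P, ACTIVITY)
```

Repeating the fit for each descriptor in turn gives the set of one-term equations described above.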
34. b. Multiple Linear regression (MLR)
Also referred to as the linear free energy relationship
(LFER) method.
Generates QSAR equations by performing
standard multivariable regression
calculations to identify the dependence of
a drug property on any or all of the
descriptors under investigation.
Involves more than one independent variable.
y = b0 + b1x1 + b2x2 + ... + bmxm + e
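The coefficients b0 ... bm that minimise the squared residuals e can be obtained from the normal equations (X'X) b = X'y. The sketch below solves them in plain Python with Gauss-Jordan elimination. The (log P, σ) pairs are hypothetical, and the activities are generated from y = 1.0 + 0.5 log P - 2.0 σ so that the fit should recover those coefficients. The solver assumes the descriptors are not collinear; otherwise the system is singular.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_mlr(descriptors, y):
    """Return [b0, b1, ..., bm] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in descriptors]  # leading 1 for b0
    m = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(m)]
    return solve(xtx, xty)


# Hypothetical (log P, sigma) pairs; activities follow the assumed equation exactly.
X = [(0.5, 0.0), (1.0, 0.2), (1.5, -0.1), (2.0, 0.4), (2.5, 0.1)]
Y = [1.0 + 0.5 * lp - 2.0 * s for lp, s in X]
COEFFS = fit_mlr(X, Y)
```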
35. c. Stepwise multiple linear
regression
Commonly used variant of MLR.
Creates a multiple-term linear equation, but not
all the independent variables are used.
Each independent variable is sequentially
added to the equation and a new regression is
performed every time.
The new term is retained only if the model
passes a test for significance.
Useful when the number of descriptors is
large and the key descriptors are unknown.
36. Multivariate Data Analysis
It replaced LRA.
It tries to explain an extended set of
variables by means of a reduced number
of new latent variables that carry the
maximum amount of information relevant
to the problem.
37. a. Principal component analysis
(PCA)
Data reduction technique that does not
itself generate a QSAR model.
Creates a new set of orthogonal descriptors,
i.e. principal components, which describe
most of the information contained in the
independent variables.
Reduces the dimensionality of a multivariate
data set of descriptors to the actual
amount of data available.
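As a rough sketch of the idea, the first principal component can be found by power iteration on the covariance matrix of the mean-centered descriptors. The Python example below does this without external libraries. The two-descriptor data set is hypothetical, with the second column exactly twice the first, so a single component carries all the variance. The all-ones start vector is assumed not to be orthogonal to the first component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    v = [1.0] * m  # start vector, assumed not orthogonal to the first component
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    scores = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return v, scores


# Hypothetical descriptors; the second column is exactly twice the first.
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
LOADINGS, SCORES = first_principal_component(DATA)
```

The scores replace the two correlated descriptors with one new variable, which is the sense in which PCA reduces dimensionality.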
38. b. Principal component regression
(PCR)
When principal components are employed
as the independent variables in a
linear regression, the method is termed
principal component regression.
PCR uses the scores from the PCA
decomposition as regressors in the QSAR
model to generate a multiple-term linear
equation.
39. c. Partial least square analysis (PLS)
An iterative regression procedure that
produces its solution by a linear
transformation of a large number of original
descriptors into a small number of new
orthogonal terms called latent variables.
PLS is able to analyze complex SAR data
in a more realistic way.
It is able to interpret the influence of
molecular structure on biological activity.
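To make the latent-variable idea concrete, the sketch below extracts a single PLS component for one response (often called PLS1): the weight vector is the direction in descriptor space with maximal covariance with y, the scores are the projections onto it, and y is regressed on those scores. Further components, which would be obtained by deflating X and repeating, are omitted. The data are hypothetical and deliberately collinear, with the second descriptor equal to twice the first, a case that ordinary MLR cannot handle.

```python
import math


def pls1_one_component(x_rows, y):
    """Fit a one-latent-variable PLS model; return (weights, q, predict)."""
    n, m = len(x_rows), len(x_rows[0])
    x_mean = [sum(r[j] for r in x_rows) / n for j in range(m)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(m)] for r in x_rows]
    yc = [v - y_mean for v in y]
    # Weight vector w proportional to X'y, normalised to unit length.
    w = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(m)]
    norm = math.sqrt(sum(c * c for c in w))
    w = [c / norm for c in w]
    # Latent variable scores t = X w, then regress y on t.
    t = [sum(xc[i][j] * w[j] for j in range(m)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(m))
        return y_mean + q * score

    return w, q, predict


# Hypothetical collinear descriptors; activities follow y = 1 + x1.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
Y = [2.0, 3.0, 4.0, 5.0]
W, Q, PREDICT = pls1_one_component(X, Y)
NEW_ACTIVITY = PREDICT((5.0, 10.0))
```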
40. d. Genetic function approximation
(GFA)
Serves as an alternative to standard
regression analysis for building QSAR
equations.
Employs the natural principles of the evolution of
species, in which improvements arise through
recombination.
Suitable for obtaining QSAR equations when
dealing with a large number of independent
variables.
Produces multiple models, which are evolved from
a set of initial models using genetic algorithms.
41. e. Genetic partial least squares
(G/PLS)
It is a valuable analytical tool that has
evolved by combining the best features of
GFA and PLS.
42. Pattern Recognition
The method is based on the principle of
analogy.
It is used to detect distance or closeness
within large amounts of multivariate data.
43. a. Cluster analysis
Statistical pattern recognition method used to
investigate the relationship between
observations associated with several
properties and to partition the data set into
categories consisting of similar elements.
Allows for the consideration of the inactive
compounds in the analysis.
Can be used to study a large set of
substituents to identify subsets which share
similar physical properties.
44. b. Artificial neural networks (ANNs)
The technique has its origin in the real
neurons present in an animal brain.
ANNs are parallel computational systems
consisting of groups of highly
interconnected processing elements called
neurons, which are arranged in a series of
layers.
a) First layer: Input layer
b) Subsequent layer(s): Hidden layer
c) Last layer: Output layer
45. Contd.
Each layer does its own
computations and passes the results to the
next one.
The weighted inputs are summed up and
supplied to the hidden layers, where a
nonlinear transfer function does the required
processing.
The results of the transfer function are
communicated to the neurons in the output
layer, where the results are interpreted and
finally presented to the user.
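The forward pass described above can be sketched for a network with one hidden layer: each hidden neuron sums its weighted inputs, applies a nonlinear transfer function (a sigmoid is assumed here), and the output neuron combines the hidden results. The weights and biases below are hypothetical and untrained; training, for example by back-propagation, is not shown.

```python
import math


def sigmoid(z):
    """Nonlinear transfer function (assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> one hidden layer (sigmoid) -> single linear output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical, untrained weights: 2 descriptors -> 2 hidden neurons -> 1 output.
HIDDEN_W = [[0.5, -0.5], [1.0, 1.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [2.0, 4.0]
OUTPUT_B = 1.0

PREDICTION = forward([1.0, 1.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
```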
46. c. k-nearest neighbor (kNN)
One of the simplest machine learning algorithms.
Most commonly used for classifying a new
pattern (e.g. a molecule).
The technique is based on a simple distance
learning approach, in which an unknown/new
molecule is classified according to the
majority of its k nearest neighbors in the
training set.
The nearness is determined by a Euclidean
distance metric (e.g. a similarity measure
computed using the structural descriptors of
the molecules).
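A minimal Python sketch of this majority-vote classification follows. The two-descriptor training set and its class labels are hypothetical; in practice the descriptors would usually be scaled before distances are computed.

```python
import math
from collections import Counter


def knn_classify(query, training_set, k=3):
    """Assign the majority label among the k nearest training molecules."""
    ranked = sorted(training_set, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (descriptor vector, class) pairs.
TRAINING = [
    ((1.0, 0.2), "active"),
    ((1.2, 0.1), "active"),
    ((0.9, 0.3), "active"),
    ((3.0, -0.5), "inactive"),
    ((3.2, -0.4), "inactive"),
    ((2.9, -0.6), "inactive"),
]

LABEL_A = knn_classify((1.1, 0.2), TRAINING, k=3)
LABEL_B = knn_classify((3.1, -0.5), TRAINING, k=3)
```

An odd k avoids ties when there are only two classes.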
47. Importance of statistical parameter
Equations generated/established in QSAR
studies are linear regression equations.
A number of equations may be
generated/established for one
problem/case under study.
Statistics helps in selecting the one
best-fit equation out of them.
48. Contd.
This may be done by checking the standard
deviation/variance and other related
parameters for the data set used for the QSAR
studies.
The correlation coefficient computed for the
data set under study also helps in selecting
the appropriate QSAR equation.
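Two of the parameters usually compared between candidate equations can be computed as follows: the squared correlation coefficient r^2 = 1 - SS(residual) / SS(total) and the standard deviation of the residuals s = sqrt(SS(residual) / (n - m - 1)), where n is the number of compounds and m the number of descriptors. The observed and calculated activities in this Python sketch are hypothetical.

```python
import math


def regression_statistics(y_observed, y_calculated, n_descriptors):
    """Return (r_squared, s) for a fitted QSAR equation."""
    n = len(y_observed)
    y_mean = sum(y_observed) / n
    ss_res = sum((o - c) ** 2 for o, c in zip(y_observed, y_calculated))
    ss_tot = sum((o - y_mean) ** 2 for o in y_observed)
    r_squared = 1.0 - ss_res / ss_tot
    s = math.sqrt(ss_res / (n - n_descriptors - 1))
    return r_squared, s


# Hypothetical observed and calculated log 1/C values for a one-descriptor equation.
Y_OBS = [2.0, 3.0, 4.0, 5.0, 6.0]
Y_CALC = [2.1, 2.9, 4.0, 5.1, 5.9]
R2, S = regression_statistics(Y_OBS, Y_CALC, n_descriptors=1)
```

Among equations with a comparable number of descriptors, the one with the higher r^2 and lower s would normally be preferred.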
49. Reference:
Kubinyi H., Introduction, In: Mannhold R., Larsen P.,
Timmerman H. QSAR: Hansch Analysis and related
approaches. New York, VCH Publishers, 1993.
p. 4-8, 27-54, 57-68, 159