Statistical Methods for QSAR Modeling

Statistical Methods Used in QSAR
SUBMITTED BY:
UPASANA SHARMA
(M.Pharm: Pharmaceutical Chemistry)
SUBMITTED TO:
Dr. SALAHUDDIN
(PROFESSOR)
NOIDA INSTITUTE OF ENGINEERING AND TECHNOLOGY (PHARMACY INSTITUTE)
GREATER NOIDA
UPASANA SHARMA (16/05/2023)

QSAR Workflow:
1) Collection of ligands
2) Generation of descriptors
3) Feature selection
4) Construction of the model
5) Validation of the model

QSAR Model:
1) Build models and calculate the minimum-energy conformation
2) Calculate descriptors
3) Shortlist descriptors
   • Based on correlation coefficient
   • Based on cross-correlation coefficient
   • Based on dissimilarity distance
   • Based on cluster analysis
   • Based on genetic function approach
4) Develop the regression relationship and estimate statistics (R², adjusted R², predictive R², F test, residuals)
5) Test the model with an external data set
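The first two shortlisting criteria above (correlation with activity, and cross-correlation between descriptors) can be sketched as follows. This is a minimal illustration in plain Python, not the procedure of any particular QSAR package; the function names, the thresholds `min_r` and `max_cross_r`, and the toy data are all assumptions made for the example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def shortlist(descriptors, activity, min_r=0.3, max_cross_r=0.9):
    """Keep descriptors correlated with activity, dropping near-duplicates.

    descriptors: dict mapping descriptor name -> list of values per compound.
    The thresholds are illustrative assumptions, not recommended values.
    """
    # Rank descriptors by |r| with the activity, strongest first.
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], activity)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if abs(pearson(descriptors[name], activity)) < min_r:
            continue  # too weakly related to the activity
        # Skip a descriptor that is highly cross-correlated with one already kept.
        if any(abs(pearson(descriptors[name], descriptors[k])) > max_cross_r for k in kept):
            continue
        kept.append(name)
    return kept

# Hypothetical toy data: five compounds, three descriptors.
activity = [1.0, 2.1, 2.9, 4.2, 5.0]
descriptors = {
    "logP": [0.5, 1.0, 1.5, 2.0, 2.5],
    "logP_copy": [1.0, 2.0, 3.0, 4.0, 5.0],  # perfectly correlated with logP
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
}
selected = shortlist(descriptors, activity)
```

With this toy data only one of the two redundant descriptors survives, and the descriptor unrelated to activity is dropped.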

Model construction + feature selection = statistical analysis
(for a large or a small number of descriptors)
Methods:
• Regression-based approach
  - Simple linear regression method
  - Multiple linear regression method
  - Partial least squares method
• Classification-based approach
  - Cluster analysis
  - Principal component analysis
  - Logistic regression
• Machine learning techniques
  - Artificial neural network
  - Support vector machine
  - Gene expression programming
A model can be constructed using:
• Linear properties: linear regression, partial regression
• Non-linear properties: artificial neural network
• Learning algorithms: supervised, semi-supervised, unsupervised

Validation:
Regression-based QSAR model
• Validation metrics for internal validation
  - Least-squares fitting
  - Chi-squared (χ²) and root mean squared error (RMSE)
  - Cross-validation: leave-one-out (LOO) and leave-some-out (LSO)
  - True Q² and rm² metrics
• Validation metrics for external validation
  - Predictive R² (Q²F1)
  - Q²F2 and Q²F3
  - Golbraikh and Tropsha's criteria
  - Root mean square error of prediction (RMSEP)
Validation metrics for classification-based methods
• Wilks' lambda statistic (a lower value is better)
• Canonical index (Rc)
• Chi-square (χ²)
• Squared Mahalanobis distance

Regression-based methods:
• Simple linear regression method:
  One descriptor
  Y = b0 + b1x1 + e
• Multiple linear regression method:
  Y = b0 + b1x1 + b2x2 + ... + bnxn + e
• Non-linear regression method:
  At least one parameter enters the model non-linearly
  Y = f(x, B) + e (B = unknown parameters)
  Observational data are modeled by a function that is a non-linear combination of the model parameters and depends on one or more independent variables.
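As a rough illustration of the multiple linear regression equation above, the coefficients b0 ... bn can be estimated by ordinary least squares. The sketch below solves the normal equations in plain Python; the helper names `solve` and `fit_mlr` and the noise-free toy data are assumptions for the example, and real QSAR work would normally use an established statistics library.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_mlr(x_rows, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bn*xn; returns [b0, b1, ..., bn]."""
    design = [[1.0] + list(row) for row in x_rows]  # prepend the intercept column
    p = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2*x1 - 0.5*x2 (no noise term e).
x_rows = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in x_rows]
coefficients = fit_mlr(x_rows, y)
```

Because the toy data contain no noise, the fit recovers the generating coefficients up to rounding error. Simple linear regression is the special case with one descriptor per row.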

• Partial least squares (PLS) method:
  - Related to principal component regression.
  - A multivariate statistical method, generalizing multiple regression analysis, that predicts or analyzes a set of dependent variables from a set of independent variables.
  - Applied in the 3D-QSAR technique Comparative Molecular Field Analysis (CoMFA).
  - Used in combination with other techniques: GPLS (genetic), FAPLS (factor analysis), OSCPLS (orthogonal signal correction).
  - COMPACT (computer-optimized molecular parametric analysis of chemical toxicity) predicts carcinogenicity and other forms of toxicity.
  - Software: SIMCA-P, UNSCRAMBLER, SPM
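To make the idea behind PLS more concrete, the sketch below fits a PLS model with a single latent variable and one dependent variable (often called PLS1). It is a simplified illustration under stated assumptions, not the algorithm as implemented in the packages listed above: the function name `pls1_one_component` and the toy data are hypothetical, and practical PLS extracts several components and chooses their number by cross-validation.

```python
from math import sqrt

def pls1_one_component(x_rows, y):
    """Fit a single-latent-variable PLS1 model; returns a predict(row) function."""
    n, p = len(x_rows), len(x_rows[0])
    x_mean = [sum(r[j] for r in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(p)] for r in x_rows]  # centred X
    yc = [v - y_mean for v in y]                                 # centred y
    # Weight vector: direction in descriptor space with maximal covariance with y.
    w = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of each compound onto the weight vector.
    t = [sum(xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centred activity on the scores.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict

# Hypothetical data: the second descriptor is exactly twice the first,
# a collinear case in which ordinary multiple linear regression breaks down.
x_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]]
y = [3.0 + row[0] for row in x_rows]
predict = pls1_one_component(x_rows, y)
prediction = predict([6.0, 12.0])
```

The example is chosen so that a single latent variable captures all the information in the two collinear descriptors, which is the situation PLS is designed for.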

Classification-based approach:
• Cluster analysis:
  - Clustering places similar data into groups in a way that maximizes similarity within groups and dissimilarity between groups.
  - It is a multivariate technique that forms groups based on distance (proximity).
  - Hierarchical clustering can be agglomerative or divisive.
  - k-means clustering is a partition-based clustering method.
  - It is used for data reduction and hypothesis generation in data mining, statistical data analysis and pattern recognition.
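The partition-based k-means procedure mentioned above can be sketched in a few lines. This is a minimal version of the standard assign-then-update loop, assuming the initial centroids are supplied by the caller; the function names and the two-descriptor toy points are illustrative only.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty.
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else list(old)
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical compounds described by two descriptors, forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, labels = kmeans(points, [points[0], points[3]])
```

Here the first three points end up in one cluster and the last three in the other. In practice the result depends on the initial centroids, so the algorithm is usually run from several starting points.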

• Principal component analysis (PCA):
  - It transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.
  - PCA reduces the attribute space from a larger number of variables to a smaller number of factors (mutually uncorrelated variables).
  - PCA is a dimensionality reduction or data compression method; there is no guarantee that the resulting dimensions are interpretable.
  - Objective: to select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal components.
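As an illustration of how correlated descriptors collapse onto a principal component, the sketch below extracts only the first component of a small data set by power iteration on the covariance matrix. It is a simplified stand-in for a full PCA, which would normally use an eigendecomposition or SVD from a numerical library; the function name and the toy data are assumptions for the example.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (loading vector, fraction of variance explained, scores) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the descriptors.
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)] for i in range(p)]
    # Power iteration: repeated multiplication converges to the leading eigenvector,
    # provided the starting vector is not orthogonal to it.
    v = [1.0] * p
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, explained, scores

# Hypothetical data: two perfectly correlated descriptors (x2 = 2 * x1).
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
loadings, explained, scores = first_principal_component(rows)
```

Because the two toy descriptors are perfectly correlated, a single component carries essentially all of the variance, and the loadings show how strongly each original variable contributes to it.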

Machine learning techniques:
• Artificial neural network
  - Mimics the behavior of biological neurons.
  - It has an input layer, a hidden layer and an output layer.
  - Types: propagation neural networks, probabilistic neural networks, Kohonen self-organizing maps and Bayesian regularized neural networks.
• Support vector machine
  - Uses a linear classifier to classify data into two categories.
  - It is used in combination with other methods such as MLR and PLS to build more powerful and accurate QSAR models.
• Gene expression programming
  - Builds on the genetic algorithm and genetic programming.
  - It is used to calculate dermal penetration, EC50 and binding affinity.
  - Variant: improved gene expression programming.
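The input, hidden and output layers described for artificial neural networks above can be illustrated with a single forward pass. The sketch assumes one hidden layer with sigmoid activation and a linear output neuron; the weights and descriptor values are made-up numbers, and training the weights (for example by back-propagation) is not shown.

```python
from math import exp

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate descriptor values through one hidden layer to a single output."""
    # Hidden layer: each neuron applies a sigmoid to a weighted sum of the inputs.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + b)
        for weights, b in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: linear combination of the hidden activations.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical network: 2 descriptors -> 2 hidden neurons -> 1 predicted activity.
hidden_weights = [[0.5, -0.2], [0.1, 0.4]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, 2.0]
output_bias = 0.5
predicted_activity = forward([1.5, 0.3], hidden_weights, hidden_biases, output_weights, output_bias)
```

The non-linear activation in the hidden layer is what lets such a network model non-linear structure-activity relationships.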

Validation:
• Validation guards against chance correlation among the numerous descriptors used in the model and against over-fitting of the data. It assesses the accuracy and predictive ability of the model (using a training set and a test set).
• The Organisation for Economic Co-operation and Development (OECD) gives five principles for validating a model:
1) a defined endpoint
2) an unambiguous algorithm
3) a defined domain of applicability
4) appropriate measures of goodness-of-fit, robustness and prediction accuracy
5) a mechanistic interpretation, if possible

Regression-based QSAR model:
Validation metrics for internal validation
Internal validation uses the training-set molecules themselves to test the predictability of the model.
• Least-squares fitting
  It measures the squared correlation coefficient (R²) between the predicted and experimental activity values.
  If the difference between R² and adjusted R² is less than 0.3, the model is considered acceptable.
• Chi-squared (χ²) and RMSE:
  These values are used to assess the predictive quality of a model.
  χ² reflects the difference between the experimentally determined bioactivity values and the values predicted by the model.
  For a model with a large R² (≥ 0.7), χ² and RMSE should be below 0.5 and 0.3, respectively.
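The goodness-of-fit statistics named above can be computed directly from observed and predicted activities. The sketch below assumes R² is calculated as 1 - SS_res/SS_tot, which coincides with the squared correlation coefficient for a least-squares fit with an intercept evaluated on its own training set; the function names and the toy values are illustrative.

```python
from math import sqrt

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_compounds, n_descriptors):
    """Penalize R² for the number of descriptors used in the model."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_descriptors - 1)

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Hypothetical activities for four training compounds and a one-descriptor model.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.1, 5.9, 7.2, 7.8]
r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_compounds=4, n_descriptors=1)
error = rmse(observed, predicted)
```

For these toy values R² is about 0.98, adjusted R² about 0.97 and RMSE about 0.16, so the gap between R² and adjusted R² is small and RMSE falls under the 0.3 guideline quoted above.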

Cross-validation:
Cross-validation is an internal validation technique. In its general form, leave-group-out (LGO), one molecule or a group of molecules is left out while the model is built, and the predictability of the model is then evaluated on the molecules left out.
• Leave-one-out (LOO) cross-validation
  - One compound is left out and the QSAR model is constructed using the remaining compounds.
  - The eliminated compound is used to test the resulting model; this is repeated for every compound.
  - Predictability is assessed by PRESS (predicted residual sum of squares), from which the cross-validated R² (Q²) and SDEP (standard deviation of error of prediction) are obtained.
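The leave-one-out procedure above can be sketched for a one-descriptor linear model. The code assumes the common definitions Q² = 1 - PRESS / Σ(y - ȳ)² and SDEP = sqrt(PRESS / n); the function names and the toy data are hypothetical.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def loo_cross_validation(x, y):
    """Return (PRESS, Q², SDEP) from leave-one-out cross-validation."""
    n = len(x)
    press = 0.0
    for i in range(n):
        # Rebuild the model without compound i ...
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # ... and accumulate its squared prediction error for the left-out compound.
        press += (y[i] - (intercept + slope * x[i])) ** 2
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    return press, 1.0 - press / ss_tot, sqrt(press / n)

# Hypothetical descriptor values and activities for five compounds.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
press, q2, sdep = loo_cross_validation(x, y)
```

Leave-many-out cross-validation follows the same pattern, except that a group of compounds is removed in each round instead of a single one.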

• True Q² value:
  - True Q² is used for small data sets.
  - Q² should not be treated as ultimate proof of good predictability.
  - An acceptable model has Q² > 0.5.
• LSO (leave-some-out) or LMO (leave-many-out)
  - A set of compounds is eliminated and models are built with the remaining compounds.
  - The left-out compounds are then used to check the predictability of the model.
A good model shows a low RMSE value and a high R² value.

Validation metrics for external validation
• Predictive R² or Q²(F1): measures the agreement between observed and predicted test-set data. The model is considered predictive if Q²(F1) > 0.5.
• Q²(F2) and Q²(F3): computed using the mean of the test data set and of the training data set, respectively. A threshold value of 0.5 is defined for both metrics.
• Golbraikh and Tropsha's criteria: four conditions evaluated on the training and test data sets.
  For good predictive power, a QSAR model should satisfy the following conditions:
  i. Q² (training) > 0.5
  ii. R² (test) > 0.6
  iii. (r² - r0²)/r² < 0.1 or (r² - r'0²)/r² < 0.1, where r0² and r'0² are the R² values for regression through the origin of predicted vs. observed and observed vs. predicted activities, respectively.
  iv. 0.85 ≤ k ≤ 1.15 or 0.85 ≤ k' ≤ 1.15, where k and k' are the slopes of the corresponding regression lines through the origin.
• Other metrics include RMSEP (root mean square error of prediction), which quantifies the prediction error of the QSAR model.
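The external validation metrics above can be sketched as follows. The code assumes the commonly used definitions: Q²(F1) scales the test-set error by the deviation of test activities from the training mean, Q²(F2) by their deviation from the test mean, and Q²(F3) by the training-set variance, while k and k' are slopes of regression lines forced through the origin. The function name and the toy activities are hypothetical.

```python
from math import sqrt

def external_metrics(y_train, y_test, y_pred):
    """Compute Q²(F1), Q²(F2), Q²(F3), RMSEP and the slopes k and k'."""
    n_train, n_test = len(y_train), len(y_test)
    mean_train = sum(y_train) / n_train
    mean_test = sum(y_test) / n_test
    # Sum of squared prediction errors on the external test set.
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    cross = sum(o * p for o, p in zip(y_test, y_pred))
    return {
        "q2_f1": 1.0 - press / sum((o - mean_train) ** 2 for o in y_test),
        "q2_f2": 1.0 - press / sum((o - mean_test) ** 2 for o in y_test),
        "q2_f3": 1.0 - (press / n_test)
                 / (sum((o - mean_train) ** 2 for o in y_train) / n_train),
        "rmsep": sqrt(press / n_test),
        # Slopes of the regression lines through the origin.
        "k": cross / sum(p * p for p in y_pred),        # observed vs. predicted
        "k_prime": cross / sum(o * o for o in y_test),  # predicted vs. observed
    }

# Hypothetical training activities, test activities and test-set predictions.
y_train = [4.0, 5.0, 6.0, 7.0, 8.0]
y_test = [5.0, 6.0, 7.0]
y_pred = [5.2, 5.9, 6.8]
metrics = external_metrics(y_train, y_test, y_pred)
passes_thresholds = (
    metrics["q2_f1"] > 0.5
    and metrics["q2_f2"] > 0.5
    and metrics["q2_f3"] > 0.5
    and 0.85 <= metrics["k"] <= 1.15
)
```

In this toy case the training and test means happen to be equal, so Q²(F1) and Q²(F2) coincide; they differ whenever the two sets are centred differently.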

Validation metrics for classification-based methods
- cluster analysis and PCA (principal component analysis):
The validation metrics employed in classification-based methods are:
• Wilks' lambda (λ) statistic: the ratio of the within-group sum of squares to the total dispersion. Its value lies in the range 0 < λ < 1; a lower value corresponds to a higher level of discrimination.
• Canonical index (Rc): estimates the strength of the relationship between the dependent and independent variables.
• Chi-square (χ²): checks the quality of the classification-based model.
• Squared Mahalanobis distance: measures how far a data point lies from the centre of a group of data points, taking the covariance of the data into account.
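For a single discriminating variable, Wilks' lambda reduces to the within-group sum of squares divided by the total sum of squares, which makes the "lower is better" rule easy to see. The sketch below covers only this univariate case (the multivariate statistic uses determinants of the corresponding matrices); the function name and the toy groups are assumptions for the example.

```python
def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-group SS divided by total SS."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    # Within-group SS: squared deviations of each value from its own group mean.
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return ss_within / ss_total

# Hypothetical descriptor values for active and inactive compounds.
well_separated = wilks_lambda([[7.9, 8.1, 8.0], [4.9, 5.1, 5.0]])
overlapping = wilks_lambda([[5.0, 6.0, 7.0], [5.5, 6.5, 7.5]])
```

The well-separated groups give a value close to 0, while the overlapping groups give a value close to 1, meaning the variable barely discriminates between the classes.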


THANK YOU
