4. Correlation
• Strong correlation implies good predictivity
– I have observed a correlation so you must use my rule
• Multivariate data analysis (e.g. PCA) usually involves
transformation to orthogonal basis
• Applying cutoffs (e.g. MW restriction) to data can
distort correlations
• Noise and range limits in data
5. Quantifying strengths of relationships between
continuous variables
• Correlation measures
– Pearson product-moment correlation coefficient (R)
– Spearman's rank correlation coefficient (ρ)
– Kendall rank correlation coefficient (τ)
• Quality of fit measures
– Coefficient of determination (R²) is the fraction of the
variance in Y that is explained by model
– Root mean square error (RMSE)
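The measures listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a replacement for a statistics package: the function names are our own, Kendall's coefficient is computed in its simple tau-a form (ties count as neither concordant nor discordant), and the example data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient (R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def r_squared(y, y_pred):
    """Fraction of the variance in Y explained by the predictions."""
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root mean square error of the predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y))


# Hypothetical data; the "model" here simply predicts Y = X.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson(x, y), spearman(x, y), kendall_tau_a(x, y))
print(r_squared(y, x), rmse(y, x))
```

Note that the correlation measures describe the data alone, whereas R² and RMSE describe how well a particular model reproduces Y.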
6. Preparation of synthetic data sets
Kenny & Montanari (2013) JCAMD 27:1-13 DOI
Add Gaussian noise (SD=10) to Y
7. Correlation inflation by hiding variation
See Hopkins, Mason & Overington (2006) Curr Opin Struct Biol 16:127-136 DOI
Leeson & Springthorpe (2007) NRDD 6:881-890 DOI
Data is naturally binned (X is an integer) and mean value of Y is calculated for each
value of X. In some studies, averaged data is only presented graphically and it is left to
the reader to judge the strength of the correlation.
Panel correlations, first row: R = 0.34, R = 0.30, R = 0.31
Panel correlations, second row: R = 0.67, R = 0.93, R = 0.996
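The effect of averaging within bins can be reproduced with synthetic data. The sketch below assumes a generating model that is ours, not taken from the cited studies: X is an integer from 1 to 10, Y = X plus Gaussian noise with SD = 10, and there are 100 points per value of X. It compares the correlation for the raw data with the correlation for the bin means.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation coefficient (R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)

# Assumed model: integer X in 1..10, 100 points each, Y = X + noise (SD = 10).
xs = [x for x in range(1, 11) for _ in range(100)]
ys = [x + random.gauss(0, 10) for x in xs]

# Mean of Y for each value of X; this step discards the within-bin variation.
bin_x = sorted(set(xs))
bin_mean_y = [
    sum(y for x, y in zip(xs, ys) if x == b) / xs.count(b) for b in bin_x
]

r_raw = pearson(xs, ys)
r_binned = pearson(bin_x, bin_mean_y)
print(f"R (raw data, N={len(xs)}): {r_raw:.2f}")
print(f"R (bin means, N={len(bin_x)}): {r_binned:.2f}")
```

The trend is the same in both cases; only the variation that is allowed to count against it has changed.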
8. Correlation Inflation in Flatland
See Lovering, Bikker & Humblet (2009) JMC 52:6752-6756 DOI
     N = 1202                         N = 8
R    0.247 (95% CI: 0.193 | 0.299)    0.972 (95% CI: 0.846 | 0.995)
ρ    0.215 (P < 0.0001)               0.970 (P < 0.0001)
τ    0.148 (P < 0.0001)               0.909 (P = 0.0018)
9. Masking variation with standard error
See Gleeson (2008) JMC 51:817-834 DOI
Partition by value of X into 4 bins with equal numbers of data points and display 95%
confidence interval for mean (green) and mean ± SD (blue) for each bin.
Panel correlations: R = 0.12, R = 0.29, R = 0.28
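Why a confidence interval for the mean can mask variation is easy to check numerically. In this sketch (our own illustration, with hypothetical Gaussian data of mean 50 and SD 10), the sample SD stays close to 10 as N grows, while the standard error and the approximate 95% confidence interval for the mean shrink as 1/√N.

```python
import math
import random
import statistics


def spread_summary(sample):
    """Return (SD, standard error of the mean, half-width of ~95% CI)."""
    sd = statistics.stdev(sample)
    sem = sd / math.sqrt(len(sample))
    return sd, sem, 1.96 * sem  # 1.96 assumes a normal approximation


random.seed(1)
summaries = {}
for n in (40, 400, 4000):
    # Hypothetical data: the same distribution, sampled at different N.
    sample = [random.gauss(50, 10) for _ in range(n)]
    summaries[n] = spread_summary(sample)
    sd, sem, half = summaries[n]
    print(f"N={n:5d}  SD={sd:6.2f}  SEM={sem:6.3f}  CI half-width={half:6.3f}")
```

Error bars based on the standard error therefore describe how well the mean is known, not how much the individual values vary.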
10. The error of standard error
ANOVA for binned data sets
N       Bins    Degrees of Freedom    F         P
40      4       3                     0.2596    0.8540
400     4       3                     12.855    < 0.0001
4000    4       3                     115.35    < 0.0001
4000    2       1                     270.91    < 0.0001
4000    8       7                     50.075    < 0.0001
“In each plot provided, the width of the errors bars and the difference in the mean
values of the different categories are indicative of the strength of the relationship
between the parameters.” Gleeson (2008) JMC 51:817-834 DOI
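A rough sketch of the calculation behind such a table follows. It bins by X into groups of equal size and computes the one-way ANOVA F statistic from first principles. The generating model (X uniform on 0 to 10, Y = X plus Gaussian noise with SD = 10) and the helper names are assumptions made for illustration, so the F values will not match those tabulated.

```python
import random


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between = len(groups) - 1          # bins - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def equal_count_bins(xs, ys, n_bins):
    """Sort by X and split Y into n_bins groups of equal size.

    Any remainder after integer division is dropped.
    """
    pairs = sorted(zip(xs, ys), key=lambda p: p[0])
    size = len(pairs) // n_bins
    return [[y for _, y in pairs[i * size:(i + 1) * size]] for i in range(n_bins)]


random.seed(2)
f_by_n = {}
for n in (40, 400, 4000):
    # Assumed model: the same weak trend at every sample size.
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [x + random.gauss(0, 10) for x in xs]
    f_by_n[n] = one_way_anova_f(equal_count_bins(xs, ys, 4))
    print(f"N={n:5d}  bins=4  F={f_by_n[n]:8.2f}")
```

Because F grows with N for a fixed underlying trend, a highly significant difference between bin means says little about the strength of the relationship.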
11. Know your data
• Assays are typically run in replicate making it possible
to estimate assay variance
• Every assay has a finite dynamic range and it may not
always be obvious what this is for a particular assay
• Dynamic range may have been sacrificed for
throughput but this, by itself, does not make the
assay bad
• We need to be able to analyse in-range and
out-of-range data within a single unified framework
– See Lind (2010) QSAR analysis involving assay results which are only known to
be greater than, or less than some cut-off limit. Mol Inf 29:845-852 DOI
12. Depicting variation with
percentile plots
This graphical representation of data makes it easy
to visualize variation and can be used with mixed
in-range and out-of-range data. See Colclough et
al (2008) BMCL 16:6611-6616 DOI
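One reason percentiles cope with mixed data is that they depend only on rank order. The sketch below is our own illustration, not the procedure of the cited paper: results reported as "greater than the upper limit" are coded as +infinity, and a nearest-rank percentile is still well defined as long as it falls among the in-range values. The measurements are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical measurements; two results were above the assay limit.
measured = [3.0, 12.0, 25.0, 40.0, 55.0, 80.0, math.inf, math.inf]

for p in (25, 50, 75, 90):
    print(f"{p}th percentile: {percentile_nearest_rank(measured, p)}")
```

A percentile that comes back as infinity simply means that it lies beyond the dynamic range of the assay.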
13. Binning continuous data restricts your options for analysis and
places burden of proof on you to show that your conclusions are
independent of the binning scheme. Think before you bin!
Averaging the binned data was your idea so don’t try blaming me this time!
14. Correlation inflation: some stuff to think about
• Model continuous data as continuous data
– RMSE is most relevant to prediction but you still need R²
– Fitted parameters may provide insight (e.g. solubility is more sensitive than
potency to lipophilicity)
• When selecting training data think in terms of Design of Experiments
(e.g. evenly spaced values of X)
• Try to achieve normally distributed Y (e.g. use pIC50 rather than IC50)
• Never make statements about the strength of a relationship when
you’ve hidden or masked variation in the data (unless you want a
starring role in Correlation Inflation 2)
• To be meaningful, a measure of the spread of a distribution must be
independent of sample size
• Reviewers/editors, mercilessly purge manuscripts of statements like,
“A negative correlation was observed between X and Y” or “A and B are
correlated/linked”
15. Ligand efficiency metrics (LEMs) considered harmful
• We use LEMs to normalize activity with respect to risk
factors such as molecular size and lipophilicity
• What do we mean by normalization?
• We make assumptions about underlying relationship
between activity and risk factor(s) when we define an
LEM
• LEM as measure of extent to which activity beats a
trend?
Kenny, Leitão & Montanari (2014) JCAMD 28:699-710 DOI
16. Ligand efficiency metrics
Scale activity/affinity by risk factor
LE = −ΔG/HA
Offset activity/affinity by risk factor
LipE = pIC50 − ClogP
No reason that dependence of activity on risk factor should be restricted to
one of these two linear models
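Rearranging the two definitions makes the assumed linear models explicit. This is a sketch that assumes the usual sign convention LE = −ΔG/HA, where HA is the number of heavy atoms:

```latex
\begin{align}
\mathrm{LE} = -\frac{\Delta G}{\mathrm{HA}}
  &\;\Longleftrightarrow\;
  -\Delta G = \mathrm{LE} \times \mathrm{HA}
  && \text{(zero intercept, slope } \mathrm{LE}\text{)} \\
\mathrm{LipE} = \mathrm{pIC}_{50} - \mathrm{ClogP}
  &\;\Longleftrightarrow\;
  \mathrm{pIC}_{50} = \mathrm{LipE} + 1 \times \mathrm{ClogP}
  && \text{(intercept } \mathrm{LipE}\text{, unit slope)}
\end{align}
```

Scaling fixes the intercept at zero and offsetting fixes the slope at one; neither constraint comes from the data.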
17. Use trend actually observed in data for normalization
rather than some arbitrarily assumed trend
18. There’s a reason why we say standard free energy
of binding…
ΔG = ΔH − TΔS = RT ln(Kd/C0)
• Adoption of 1 M as standard concentration is
arbitrary
• A view of a chemical system that changes with
the choice of standard concentration is
thermodynamically invalid
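The dependence on standard concentration can be checked with two made-up compounds. The sketch below works in log units, using −log10(Kd/C0) per heavy atom, which is proportional to LE at fixed temperature. The compounds, their Kd values, and the choice of 0.1 M, 1 M and 10 M as standard concentrations are all hypothetical.

```python
import math


def scaled_affinity(kd_molar, heavy_atoms, c0_molar):
    """-log10(Kd/C0) per heavy atom; proportional to LE at fixed temperature."""
    return -math.log10(kd_molar / c0_molar) / heavy_atoms


# Hypothetical compounds: (Kd in mol/L, number of heavy atoms).
fragment = (1e-3, 10)
lead = (1e-6, 20)

comparison = {}
for c0 in (0.1, 1.0, 10.0):
    comparison[c0] = (
        scaled_affinity(*fragment, c0_molar=c0),
        scaled_affinity(*lead, c0_molar=c0),
    )
    frag_le, lead_le = comparison[c0]
    print(f"C0 = {c0:4.1f} M  fragment: {frag_le:.3f}  lead: {lead_le:.3f}")
```

With these numbers the two compounds are equally efficient at 1 M, the lead looks more efficient at 0.1 M, and the fragment looks more efficient at 10 M, so the ranking depends on an arbitrary choice of units.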
20. Scaling transformation of parallel lines by dividing Y by X
(This is how ligand efficiency is calculated)
Size dependency of LE is consequence of non-zero intercept
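The algebra behind this statement is short. Assuming a family of parallel lines with common slope b and intercepts a_i (the symbols are ours, introduced for illustration):

```latex
\begin{align}
Y_i &= a_i + bX \\
\frac{Y_i}{X} &= b + \frac{a_i}{X}
\end{align}
```

Unless a_i = 0, the scaled quantity Y_i/X still varies with X, and lines that were parallel before the transformation no longer are.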
21. Affinity plotted against molecular weight for minimal binding
elements against various targets in inhibitor deconstruction
study showing variation in intercept term
Hajduk PJ (2006) J Med Chem 49:6972–6976 DOI
Is it valid to combine results from different assays in LE analysis?
22. Offsetting transformation of lines with different slope and
common intercept by subtracting X from Y
(This is how lipophilic efficiency is calculated)
Thankfully (hopefully?) nobody has ‘discovered’
lipophilicity-dependent lipophilic efficiency yet
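The corresponding algebra for offsetting, assuming lines with a common intercept a and slopes b_i (again, symbols introduced here for illustration):

```latex
\begin{align}
Y_i &= a + b_i X \\
Y_i - X &= a + (b_i - 1)X
\end{align}
```

Subtracting X removes the dependence on X only when b_i = 1; for any other slope the offset quantity still varies with X.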
23. Linear fit of ΔG for published data set
Mortenson & Murray (2011) JCAMD 25:663-667 DOI
25. Some more stuff to think about
• Normalize activity using trend actually observed in
data (this means you have to model the data)
• Residuals are invariant with respect to choice of
standard concentration
• Residuals can be used with other functional forms
(e.g. non-linear and multi-linear)
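The invariance of residuals is straightforward to verify. In the sketch below, which uses hypothetical Kd values and heavy atom counts and an ordinary least-squares line as the assumed trend, changing the standard concentration shifts every ΔG by the same constant. Only the fitted intercept changes, so the residuals are identical.

```python
import math

RT = 0.593  # kcal/mol, approximate value near 298 K


def delta_g(kd_molar, c0_molar):
    """Standard free energy of binding: RT ln(Kd/C0)."""
    return RT * math.log(kd_molar / c0_molar)


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def residuals(x, y):
    """Observed Y minus the value predicted by the fitted line."""
    slope, intercept = linear_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical data set.
heavy_atoms = [10, 14, 18, 22, 26, 30]
kd = [2e-3, 3e-4, 1e-5, 4e-6, 2e-7, 5e-8]

res_1M = residuals(heavy_atoms, [delta_g(k, 1.0) for k in kd])
res_1mM = residuals(heavy_atoms, [delta_g(k, 1e-3) for k in kd])

for ha, r1, r2 in zip(heavy_atoms, res_1M, res_1mM):
    print(f"HA={ha:2d}  residual (C0 = 1 M): {r1:7.3f}  (C0 = 1 mM): {r2:7.3f}")
```

The same comparison would hold for any fitted model that includes a free constant term, since the shift in ΔG is absorbed by that term.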