Tales of correlation inflation
(I prefer my food cooked and my data raw)
Peter W Kenny, Universidade de São Paulo
Correlation
• Strong correlation implies good predictivity
– I have observed a correlation so you must use my rule
• Multivariate data analysis (e.g. PCA) usually involves
transformation to an orthogonal basis
• Applying cutoffs (e.g. MW restriction) to data can
distort correlations
Quantifying strengths of relationships between
continuous variables
• Correlation measures
– Pearson product-moment correlation coefficient (R)
– Spearman's rank correlation coefficient (ρ)
– Kendall rank correlation coefficient (τ)
• Quality of fit measures
– Coefficient of determination (R²) is the fraction of the
variance in Y that is explained by the model
– Root mean square error (RMSE)
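The three correlation measures listed above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the Kendall coefficient is the tau-a variant (no tie correction), and the toy data (Y = X cubed) are invented to show a monotonic but non-linear relationship.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient (R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    There is no tie correction, so use it only when ties are rare.
    """
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical toy data: monotonic but non-linear (Y = X cubed)
x = list(range(1, 11))
y = [v ** 3 for v in x]
print(f"R   = {pearson(x, y):.3f}")
print(f"rho = {spearman(x, y):.3f}")
print(f"tau = {kendall_tau_a(x, y):.3f}")
```

The rank-based measures report a perfect monotonic relationship for these data while R does not, which is one reason to quote more than one measure.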
Size of effect for categorical X
• Difference in mean values of Y for X = A and X = B
– Scale by standard deviation: Cohen’s d (independent of sample size)
– Scale by standard error: Student’s t (depends on sample size)
• R² can be seen as analogous to Cohen’s d
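The contrast between the two scalings can be checked numerically. The sketch below assumes two independent groups with a pooled standard deviation; the function name and the Y values are hypothetical. Replicating the same data 100 times leaves the difference in means and the spread essentially unchanged, so d barely moves while t grows roughly with the square root of the sample size.

```python
import math


def cohens_d_and_t(a, b):
    """Return (Cohen's d, Student's t) for two independent groups.

    Both start from the difference in means; d divides it by the pooled
    standard deviation, t divides it by the standard error of the difference.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss_a = sum((v - ma) ** 2 for v in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    pooled_sd = math.sqrt((ss_a + ss_b) / (na + nb - 2))
    d = (mb - ma) / pooled_sd
    t = (mb - ma) / (pooled_sd * math.sqrt(1 / na + 1 / nb))
    return d, t


# Hypothetical Y values for X = A and X = B
group_a = [4.1, 5.0, 5.6, 6.2, 4.8, 5.3, 6.0, 4.5, 5.9, 5.1]
group_b = [5.2, 6.1, 6.4, 7.0, 5.7, 6.3, 6.9, 5.5, 6.8, 6.0]

d_small, t_small = cohens_d_and_t(group_a, group_b)
# Same difference in means and same spread, but 100 times as many points
d_large, t_large = cohens_d_and_t(group_a * 100, group_b * 100)

print(f"N = 10 per group:   d = {d_small:.2f}, t = {t_small:.2f}")
print(f"N = 1000 per group: d = {d_large:.2f}, t = {t_large:.2f}")
```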
N = 1202: R = 0.247 (95% CI: 0.193 | 0.299), ρ = 0.215 (P < 0.0001), τ = 0.148 (P < 0.0001)
N = 8: R = 0.972 (95% CI: 0.846 | 0.995), ρ = 0.970 (P < 0.0001), τ = 0.909 (P = 0.0018)
Correlation Inflation in Flatland
See Lovering, Bikker & Humblet (2009) JMC 52:6752-6756 DOI
Preparation of synthetic data sets
Kenny & Montanari (2013) JCAMD 27:1-13 DOI
Add Gaussian noise
(SD=10) to Y
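A synthetic data set of this kind can be sketched as follows. Only the Gaussian noise with SD = 10 comes from the slide; the range of X (integers 0 to 10), the number of points per X value, the unit slope and the fixed seed are assumptions made for illustration and are not taken from Kenny & Montanari (2013). The sketch compares R for the raw data with R for the per-X mean values of Y.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation coefficient (R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed design: X = 0..10, 1000 points per X value, Y = X + noise
x_values = list(range(11))
points_per_x = 1000
noise_sd = 10.0  # Gaussian noise, SD = 10, as on the slide

x_raw, y_raw = [], []
for x in x_values:
    for _ in range(points_per_x):
        x_raw.append(x)
        y_raw.append(x + random.gauss(0.0, noise_sd))

# Average Y for each value of X, which hides the variation within each bin
y_means = [
    sum(y for xi, y in zip(x_raw, y_raw) if xi == x) / points_per_x
    for x in x_values
]

r_raw = pearson(x_raw, y_raw)
r_means = pearson(x_values, y_means)
print(f"R (raw data, N = {len(x_raw)}): {r_raw:.3f}")
print(f"R (mean Y per X, N = {len(y_means)}): {r_means:.3f}")
```

Under these assumptions the averaged data give a much larger R than the raw data, even though the underlying relationship is the same.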
Correlation inflation by hiding variation
See Hopkins, Mason & Overington (2006) Curr Opin Struct Biol 16:127-136 DOI
Leeson & Springthorpe (2007) NRDD 6:881-890 DOI
Data is naturally binned (X is an integer) and mean value of Y is calculated for each
value of X. In some studies, averaged data is only presented graphically and it is left to
the reader to judge the strength of the correlation.
R = 0.34 R = 0.30 R = 0.31
R = 0.67 R = 0.93 R = 0.996
Masking variation with standard error
See Gleeson (2008) JMC 51:817-834 DOI
Partition by value of X into 4 bins with equal numbers of data points and display 95%
confidence interval for mean (green) and mean ± SD (blue) for each bin.
R = 0.12 R = 0.29 R = 0.28
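The difference between the two kinds of error bar can be sketched numerically. The data model (X uniform on 0 to 10, Y = X plus Gaussian noise with SD = 10), the function name and the normal-approximation factor of 1.96 are assumptions for illustration. The within-bin SD stays near 10 at every sample size, while the 95% confidence interval for the mean shrinks as N grows.

```python
import math
import random


def bin_summaries(x, y, n_bins=4):
    """Sort by X, split into bins with equal counts, summarise Y per bin.

    Returns a list of (mean, sd, ci95_half_width) tuples. The half-width
    uses the normal approximation 1.96 * SD / sqrt(n). Points left over
    when N is not divisible by n_bins are ignored.
    """
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    summaries = []
    for b in range(n_bins):
        ys = [p[1] for p in pairs[b * size:(b + 1) * size]]
        mean = sum(ys) / len(ys)
        sd = math.sqrt(sum((v - mean) ** 2 for v in ys) / (len(ys) - 1))
        summaries.append((mean, sd, 1.96 * sd / math.sqrt(len(ys))))
    return summaries


random.seed(1)  # fixed seed so the sketch is reproducible
for n in (40, 400, 4000):
    x = [random.uniform(0.0, 10.0) for _ in range(n)]
    y = [xi + random.gauss(0.0, 10.0) for xi in x]
    mean, sd, ci = bin_summaries(x, y)[0]
    print(f"N = {n:4d}: first bin SD = {sd:5.2f}, 95% CI half-width = {ci:5.2f}")
```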
The error of standard error
“In each plot provided, the width of the errors bars and the difference in the mean
values of the different categories are indicative of the strength of the relationship
between the parameters.” Gleeson (2008) JMC 51:817-834 DOI
ANOVA for binned data sets
N      Bins   Degrees of Freedom   F        P
40     4      3                    0.2596   0.8540
400    4      3                    12.855   < 0.0001
4000   4      3                    115.35   < 0.0001
4000   2      1                    270.91   < 0.0001
4000   8      7                    50.075   < 0.0001
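A one-way ANOVA on binned data can be sketched in plain Python. The data model (X uniform on 0 to 10, Y = X plus Gaussian noise with SD = 10), the function names and the seed are assumptions, so the F values will not reproduce the table. Only the F statistic is computed here; a P value needs the F distribution, which a statistics library such as SciPy (scipy.stats.f_oneway) can supply.

```python
import random


def anova_f(groups):
    """One-way ANOVA F statistic: between-group / within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        ss_between += len(g) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def equal_count_bins(x, y, n_bins):
    """Sort by X and split the Y values into n_bins groups of equal size."""
    ys = [yi for _, yi in sorted(zip(x, y))]
    size = len(ys) // n_bins
    return [ys[b * size:(b + 1) * size] for b in range(n_bins)]


random.seed(2)  # fixed seed so the sketch is reproducible
for n in (40, 400, 4000):
    x = [random.uniform(0.0, 10.0) for _ in range(n)]
    y = [xi + random.gauss(0.0, 10.0) for xi in x]
    f = anova_f(equal_count_bins(x, y, 4))
    print(f"N = {n:4d}, 4 bins: F = {f:.2f}")
```

The strength of the underlying relationship is the same in all three runs; only the sample size differs, which is why F (like the standard error) says little about how strongly Y depends on X.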
Know your data
• Assays are typically run in replicate, making it possible
to estimate assay variance
• Every assay has a finite dynamic range and it may not
always be obvious what this is for a particular assay
• Dynamic range may have been sacrificed for
throughput but this, by itself, does not make the
assay bad
• We need to be able to analyse in-range and out-of-range
data within a single unified framework
– See Lind (2010) QSAR analysis involving assay results which are only known to
be greater than, or less than some cut-off limit. Mol Inf 29:845-852 DOI
Depicting variation with
percentile plots
This graphical representation of data makes it easy
to visualize variation and can be used with mixed
in-range and out-of-range data. See Colclough et
al (2008) BMCL 16:6611-6616 DOI
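One way to see why percentiles cope with out-of-range results is that they depend only on rank order. The sketch below is an illustration under stated assumptions, not the method of Colclough et al: the function name, the nearest-rank definition of a percentile, the coding of out-of-range results as infinities and the assay values are all hypothetical.

```python
import math


def percentile_with_limits(values, q, lower=-math.inf, upper=math.inf):
    """Nearest-rank percentile that tolerates out-of-range results.

    Results reported as "< lower" are coded as -inf and "> upper" as +inf.
    Their ranks are still known, so a percentile is well defined whenever
    it falls on an in-range value; otherwise a qualified string is returned.
    """
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))  # nearest-rank method
    value = ordered[rank - 1]
    if value == math.inf:
        return f"> {upper}"
    if value == -math.inf:
        return f"< {lower}"
    return value


# Hypothetical assay results with a dynamic range of 1 to 100;
# three results were only known to be greater than 100
results = [3, 8, 12, 20, 35, 50, 75, math.inf, math.inf, math.inf]
for q in (10, 25, 50, 75, 90):
    print(q, percentile_with_limits(results, q, lower=1, upper=100))
```

The median and lower percentiles are unaffected by the out-of-range results, whereas a mean or standard deviation could not be computed without first deciding what value to assign to them.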
Binning continuous data restricts your options for analysis and
places burden of proof on you to show that your conclusions are
independent of the binning scheme. Think before you bin!
Averaging the binned data was your idea so don’t try blaming me this time!
Some stuff to think about
• Model continuous data as continuous data
– RMSE is most relevant to prediction but you still need R²
– Fitted parameters may provide insight (e.g. solubility is more sensitive than
potency to lipophilicity)
• When selecting training data think in terms of Design of Experiments
(e.g. evenly spaced values of X)
• Try to achieve normally distributed Y (e.g. use pIC50 rather than IC50)
• Never make statements about the strength of a relationship when
you’ve hidden variation in the data (unless you want a starring role in
Correlation Inflation 2)
• To be meaningful a measure of the spread of a distribution must be
independent of sample size
• Reviewers/editors, mercilessly purge manuscripts of statements like,
“A negative correlation was observed between X and Y” or “A and B are
correlated/linked”
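Two of the points above, using pIC50 rather than IC50 and reporting both RMSE and R² for a model of continuous data, can be sketched together. The function names and the six data points are hypothetical, and RMSE is computed here as the square root of the mean squared residual (dividing by N rather than by the degrees of freedom).

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); IC50 is supplied in nanomolar."""
    return 9.0 - math.log10(ic50_nm)


def linear_fit(x, y):
    """Ordinary least squares for Y = intercept + slope * X.

    Returns (slope, intercept, rmse, r2), with RMSE computed as
    sqrt(sum of squared residuals / N).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, math.sqrt(ss_res / n), 1.0 - ss_res / ss_tot


# Hypothetical data: lipophilicity (logP) and IC50 in nM for six compounds
logp = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ic50_nm = [5200.0, 2100.0, 1500.0, 480.0, 310.0, 90.0]
pic50 = [pic50_from_ic50_nm(v) for v in ic50_nm]

slope, intercept, rmse, r2 = linear_fit(logp, pic50)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"RMSE = {rmse:.2f}, R2 = {r2:.2f}")
```

The fitted slope is the kind of parameter the slide refers to: it says how sensitive the response is to X, which neither RMSE nor R² conveys on its own.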

Tales of correlation inflation (2013 CADD GRC)

Data-analytic sins in property-based molecular design
Data-analytic sins in property-based molecular design Data-analytic sins in property-based molecular design
Data-analytic sins in property-based molecular design
Peter Kenny
 
BrazMedChem2014
BrazMedChem2014BrazMedChem2014
BrazMedChem2014
Peter Kenny
 
Statistics.pdf
Statistics.pdfStatistics.pdf
Statistics.pdf
Shruti Nigam (CWM, AFP)
 
measure of dispersion.pptx
measure of dispersion.pptxmeasure of dispersion.pptx
measure of dispersion.pptx
SoujanyaLk1
 
Measures of Dispersion
Measures of DispersionMeasures of Dispersion
Measures of Dispersion
KainatIqbal7
 
Statistics
StatisticsStatistics
Statistics
megamsma
 
Medical statistics2
Medical statistics2Medical statistics2
Medical statistics2
Amany El-seoud
 
Measures of Variation.pdf
Measures of Variation.pdfMeasures of Variation.pdf
Measures of Variation.pdf
MuhammadFaizan389
 
Lect w8 w9_correlation_regression
Lect w8 w9_correlation_regressionLect w8 w9_correlation_regression
Lect w8 w9_correlation_regression
Rione Drevale
 
Molecular design: How to and how not to?
Molecular design:  How to and how not to?Molecular design:  How to and how not to?
Molecular design: How to and how not to?
Peter Kenny
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data Description
Farhan Alfin
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptx
Anusuya123
 
Measures of dispersion
Measures of dispersionMeasures of dispersion
Measures of dispersion
Nilanjan Bhaumik
 
Measures of dispersion discuss 2.2
Measures of dispersion discuss 2.2Measures of dispersion discuss 2.2
Measures of dispersion discuss 2.2
Makati Science High School
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
Kazi Toufiq Wadud
 
Numerical measures stat ppt @ bec doms
Numerical measures stat ppt @ bec domsNumerical measures stat ppt @ bec doms
Numerical measures stat ppt @ bec doms
Babasab Patil
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf
AmanuelDina
 
Measure of Dispersion in statistics
Measure of Dispersion in statisticsMeasure of Dispersion in statistics
Measure of Dispersion in statistics
Md. Mehadi Hassan Bappy
 
SEM
SEMSEM

Similar to Tales of correlation inflation (2013 CADD GRC) (20)

Data-analytic sins in property-based molecular design
Data-analytic sins in property-based molecular design Data-analytic sins in property-based molecular design
Data-analytic sins in property-based molecular design
 
BrazMedChem2014
BrazMedChem2014BrazMedChem2014
BrazMedChem2014
 
Statistics.pdf
Statistics.pdfStatistics.pdf
Statistics.pdf
 
measure of dispersion.pptx
measure of dispersion.pptxmeasure of dispersion.pptx
measure of dispersion.pptx
 
Measures of Dispersion
Measures of DispersionMeasures of Dispersion
Measures of Dispersion
 
Statistics
StatisticsStatistics
Statistics
 
Medical statistics2
Medical statistics2Medical statistics2
Medical statistics2
 
Measures of Variation.pdf
Measures of Variation.pdfMeasures of Variation.pdf
Measures of Variation.pdf
 
Lect w8 w9_correlation_regression
Lect w8 w9_correlation_regressionLect w8 w9_correlation_regression
Lect w8 w9_correlation_regression
 
Statistics chm 235
Statistics chm 235Statistics chm 235
Statistics chm 235
 
Molecular design: How to and how not to?
Molecular design:  How to and how not to?Molecular design:  How to and how not to?
Molecular design: How to and how not to?
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data Description
 
Basic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptxBasic Statistical Descriptions of Data.pptx
Basic Statistical Descriptions of Data.pptx
 
Measures of dispersion
Measures of dispersionMeasures of dispersion
Measures of dispersion
 
Measures of dispersion discuss 2.2
Measures of dispersion discuss 2.2Measures of dispersion discuss 2.2
Measures of dispersion discuss 2.2
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
 
Numerical measures stat ppt @ bec doms
Numerical measures stat ppt @ bec domsNumerical measures stat ppt @ bec doms
Numerical measures stat ppt @ bec doms
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf
 
Measure of Dispersion in statistics
Measure of Dispersion in statisticsMeasure of Dispersion in statistics
Measure of Dispersion in statistics
 
SEM
SEMSEM
SEM
 

More from Peter Kenny

LE Metrics (EuroQSAR2016)
LE Metrics (EuroQSAR2016)LE Metrics (EuroQSAR2016)
LE Metrics (EuroQSAR2016)
Peter Kenny
 
PWK EuroQSAR
PWK EuroQSARPWK EuroQSAR
PWK EuroQSAR
Peter Kenny
 
Thermodynamics for medicinal chemistry design
Thermodynamics for medicinal chemistry designThermodynamics for medicinal chemistry design
Thermodynamics for medicinal chemistry design
Peter Kenny
 
partition coefficients in drug discovery
partition coefficients in drug discoverypartition coefficients in drug discovery
partition coefficients in drug discovery
Peter Kenny
 
Property-based molecular design: where next? (12-Jun-2015)
Property-based molecular design: where next? (12-Jun-2015)Property-based molecular design: where next? (12-Jun-2015)
Property-based molecular design: where next? (12-Jun-2015)
Peter Kenny
 
Ligand efficiency: nice concept shame about the metrics
Ligand efficiency: nice concept shame about the metricsLigand efficiency: nice concept shame about the metrics
Ligand efficiency: nice concept shame about the metrics
Peter Kenny
 
Aspects of pharmaceutical molecular design (Fidelta version)
Aspects of pharmaceutical molecular design (Fidelta version)Aspects of pharmaceutical molecular design (Fidelta version)
Aspects of pharmaceutical molecular design (Fidelta version)
Peter Kenny
 
Aspects of pharmaceutical molecular design (Belgrade version)
Aspects of pharmaceutical molecular design (Belgrade version)Aspects of pharmaceutical molecular design (Belgrade version)
Aspects of pharmaceutical molecular design (Belgrade version)
Peter Kenny
 
IQSC Oct 2014
IQSC Oct 2014IQSC Oct 2014
IQSC Oct 2014
Peter Kenny
 
UCT Oct 2014
UCT Oct 2014UCT Oct 2014
UCT Oct 2014
Peter Kenny
 
Aspects of pharmaceutical molecular design
Aspects of pharmaceutical molecular designAspects of pharmaceutical molecular design
Aspects of pharmaceutical molecular design
Peter Kenny
 
Perspective of pharmaceutical molecular design
Perspective of pharmaceutical molecular designPerspective of pharmaceutical molecular design
Perspective of pharmaceutical molecular designPeter Kenny
 
Some new directions for pharmaceutical molecular design
Some new directions for pharmaceutical molecular designSome new directions for pharmaceutical molecular design
Some new directions for pharmaceutical molecular design
Peter Kenny
 
A survey of halogens (2008 EuroCUP)
A survey of halogens (2008 EuroCUP)A survey of halogens (2008 EuroCUP)
A survey of halogens (2008 EuroCUP)
Peter Kenny
 
Fragment screening library workshop (IQPC 2008)
Fragment screening library workshop (IQPC 2008)Fragment screening library workshop (IQPC 2008)
Fragment screening library workshop (IQPC 2008)
Peter Kenny
 
Design of fragment screening libraries (IQPC 2008)
Design of fragment screening libraries (IQPC 2008)Design of fragment screening libraries (IQPC 2008)
Design of fragment screening libraries (IQPC 2008)
Peter Kenny
 
Design of compound libraries for fragment screening (Feb 2012 version)
Design of compound libraries for fragment screening (Feb 2012 version)Design of compound libraries for fragment screening (Feb 2012 version)
Design of compound libraries for fragment screening (Feb 2012 version)
Peter Kenny
 
Design of fragment screening libraries (Feb 2010 version)
Design of fragment screening libraries (Feb 2010 version)Design of fragment screening libraries (Feb 2010 version)
Design of fragment screening libraries (Feb 2010 version)
Peter Kenny
 
Lipophilicity in the context of molecular design
Lipophilicity in the context of molecular designLipophilicity in the context of molecular design
Lipophilicity in the context of molecular design
Peter Kenny
 
From screening to molecular interactions: A short tour
From screening to molecular interactions: A short tour From screening to molecular interactions: A short tour
From screening to molecular interactions: A short tour
Peter Kenny
 

More from Peter Kenny (20)

LE Metrics (EuroQSAR2016)
LE Metrics (EuroQSAR2016)LE Metrics (EuroQSAR2016)
LE Metrics (EuroQSAR2016)
 
PWK EuroQSAR
PWK EuroQSARPWK EuroQSAR
PWK EuroQSAR
 
Thermodynamics for medicinal chemistry design
Thermodynamics for medicinal chemistry designThermodynamics for medicinal chemistry design
Thermodynamics for medicinal chemistry design
 
partition coefficients in drug discovery
partition coefficients in drug discoverypartition coefficients in drug discovery
partition coefficients in drug discovery
 
Property-based molecular design: where next? (12-Jun-2015)
Property-based molecular design: where next? (12-Jun-2015)Property-based molecular design: where next? (12-Jun-2015)
Property-based molecular design: where next? (12-Jun-2015)
 
Ligand efficiency: nice concept shame about the metrics
Ligand efficiency: nice concept shame about the metricsLigand efficiency: nice concept shame about the metrics
Ligand efficiency: nice concept shame about the metrics
 
Aspects of pharmaceutical molecular design (Fidelta version)
Aspects of pharmaceutical molecular design (Fidelta version)Aspects of pharmaceutical molecular design (Fidelta version)
Aspects of pharmaceutical molecular design (Fidelta version)
 
Aspects of pharmaceutical molecular design (Belgrade version)
Aspects of pharmaceutical molecular design (Belgrade version)Aspects of pharmaceutical molecular design (Belgrade version)
Aspects of pharmaceutical molecular design (Belgrade version)
 
IQSC Oct 2014
IQSC Oct 2014IQSC Oct 2014
IQSC Oct 2014
 
UCT Oct 2014
UCT Oct 2014UCT Oct 2014
UCT Oct 2014
 
Aspects of pharmaceutical molecular design
Aspects of pharmaceutical molecular designAspects of pharmaceutical molecular design
Aspects of pharmaceutical molecular design
 
Perspective of pharmaceutical molecular design
Perspective of pharmaceutical molecular designPerspective of pharmaceutical molecular design
Perspective of pharmaceutical molecular design
 
Some new directions for pharmaceutical molecular design
Some new directions for pharmaceutical molecular designSome new directions for pharmaceutical molecular design
Some new directions for pharmaceutical molecular design
 
A survey of halogens (2008 EuroCUP)
A survey of halogens (2008 EuroCUP)A survey of halogens (2008 EuroCUP)
A survey of halogens (2008 EuroCUP)
 
Fragment screening library workshop (IQPC 2008)
Fragment screening library workshop (IQPC 2008)Fragment screening library workshop (IQPC 2008)
Fragment screening library workshop (IQPC 2008)
 
Design of fragment screening libraries (IQPC 2008)
Design of fragment screening libraries (IQPC 2008)Design of fragment screening libraries (IQPC 2008)
Design of fragment screening libraries (IQPC 2008)
 
Design of compound libraries for fragment screening (Feb 2012 version)
Design of compound libraries for fragment screening (Feb 2012 version)Design of compound libraries for fragment screening (Feb 2012 version)
Design of compound libraries for fragment screening (Feb 2012 version)
 
Design of fragment screening libraries (Feb 2010 version)
Design of fragment screening libraries (Feb 2010 version)Design of fragment screening libraries (Feb 2010 version)
Design of fragment screening libraries (Feb 2010 version)
 
Lipophilicity in the context of molecular design
Lipophilicity in the context of molecular designLipophilicity in the context of molecular design
Lipophilicity in the context of molecular design
 
From screening to molecular interactions: A short tour
From screening to molecular interactions: A short tour From screening to molecular interactions: A short tour
From screening to molecular interactions: A short tour
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf


Tales of correlation inflation (2013 CADD GRC)

  • 1. Tales of correlation inflation (I prefer my food cooked and my data raw) Peter W Kenny, Universidade de São Paulo
  • 2. Correlation
      • Strong correlation implies good predictivity
          – I have observed a correlation so you must use my rule
      • Multivariate data analysis (e.g. PCA) usually involves transformation to an orthogonal basis
      • Applying cutoffs (e.g. MW restriction) to data can distort correlations
  • 3. Quantifying strengths of relationships between continuous variables
      • Correlation measures
          – Pearson product-moment correlation coefficient (R)
          – Spearman's rank correlation coefficient (ρ)
          – Kendall rank correlation coefficient (τ)
      • Quality of fit measures
          – Coefficient of determination (R²) is the fraction of the variance in Y that is explained by the model
          – Root mean square error (RMSE)
  • 4. Size of effect for categorical X
      • Start from the difference in mean values of Y for X = A and X = B
          – Scale by standard deviation: Cohen’s d (independent of sample size)
          – Scale by standard error: Student’s t (depends on sample size)
      • R² can be seen as analogous to Cohen’s d
  • 5. Correlation Inflation in Flatland
      See Lovering, Bikker & Humblet (2009) JMC 52:6752-6756 DOI
      • N = 1202: R = 0.247 (95% CI: 0.193 | 0.299); ρ = 0.215 (P < 0.0001); τ = 0.148 (P < 0.0001)
      • N = 8: R = 0.972 (95% CI: 0.846 | 0.995); ρ = 0.970 (P < 0.0001); τ = 0.909 (P = 0.0018)
  • 6. Preparation of synthetic data sets
      See Kenny & Montanari (2013) JCAMD 27:1-13 DOI
      • Add Gaussian noise (SD = 10) to Y
  • 7. Correlation inflation by hiding variation
      See Hopkins, Mason & Overington (2006) Curr Opin Struct Biol 16:127-136 DOI; Leeson & Springthorpe (2007) NRDD 6:881-890 DOI
      Data is naturally binned (X is an integer) and the mean value of Y is calculated for each value of X. In some studies, averaged data is only presented graphically and it is left to the reader to judge the strength of the correlation.
      [Figure: six panels labelled R = 0.34, R = 0.30, R = 0.31, R = 0.67, R = 0.93, R = 0.996]
  • 8. Masking variation with standard error
      See Gleeson (2008) JMC 51:817-834 DOI
      Partition by value of X into 4 bins with equal numbers of data points and display the 95% confidence interval for the mean (green) and mean ± SD (blue) for each bin.
      [Figure: three panels labelled R = 0.12, R = 0.29, R = 0.28]
  • 9. The error of standard error
      “In each plot provided, the width of the errors bars and the difference in the mean values of the different categories are indicative of the strength of the relationship between the parameters.” Gleeson (2008) JMC 51:817-834 DOI

      ANOVA for binned data sets
      N       Bins    Degrees of Freedom    F         P
      40      4       3                     0.2596    0.8540
      400     4       3                     12.855    < 0.0001
      4000    4       3                     115.35    < 0.0001
      4000    2       1                     270.91    < 0.0001
      4000    8       7                     50.075    < 0.0001
  • 10. Know your data
      • Assays are typically run in replicate, making it possible to estimate assay variance
      • Every assay has a finite dynamic range and it may not always be obvious what this is for a particular assay
      • Dynamic range may have been sacrificed for throughput but this, by itself, does not make the assay bad
      • We need to be able to analyse in-range and out-of-range data within a single unified framework
          – See Lind (2010) QSAR analysis involving assay results which are only known to be greater than, or less than some cut-off limit. Mol Inf 29:845-852 DOI
  • 11. Depicting variation with percentile plots
      This graphical representation of data makes it easy to visualize variation and can be used with mixed in-range and out-of-range data. See Colclough et al (2008) BMCL 16:6611-6616 DOI
  • 12. Binning continuous data restricts your options for analysis and places burden of proof on you to show that your conclusions are independent of the binning scheme. Think before you bin! Averaging the binned data was your idea so don’t try blaming me this time!
  • 13. Some stuff to think about
      • Model continuous data as continuous data
          – RMSE is most relevant to prediction but you still need R²
          – Fitted parameters may provide insight (e.g. solubility is more sensitive than potency to lipophilicity)
      • When selecting training data, think in terms of Design of Experiments (e.g. evenly spaced values of X)
      • Try to achieve normally distributed Y (e.g. use pIC50 rather than IC50)
      • Never make statements about the strength of a relationship when you’ve hidden variation in the data (unless you want a starring role in Correlation Inflation 2)
      • To be meaningful, a measure of the spread of a distribution must be independent of sample size
      • Reviewers/editors, mercilessly purge manuscripts of statements like, “A negative correlation was observed between X and Y” or “A and B are correlated/linked”
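
The inflation described on slides 6 to 9 can be reproduced with a small simulation. The sketch below is illustrative only and is not the procedure used in the cited papers. It assumes a linear relationship Y = X over ten integer values of X, adds Gaussian noise with SD = 10 as on slide 6, and then compares the Pearson R of the raw data with the R obtained after replacing each bin by its mean. The function names (pearson_r, make_data, bin_summary), the range of X, and the random seed are hypothetical choices made for this example.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_data(n_per_x, x_values=range(1, 11), noise_sd=10.0, seed=0):
    """Assumed model: Y = X plus Gaussian noise, with integer X only."""
    rng = random.Random(seed)
    xs, ys = [], []
    for x in x_values:
        for _ in range(n_per_x):
            xs.append(x)
            ys.append(x + rng.gauss(0.0, noise_sd))
    return xs, ys


def bin_summary(xs, ys):
    """For each value of X: (x, mean of Y, SD of Y, standard error of mean)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    rows = []
    for x in sorted(groups):
        vals = groups[x]
        sd = statistics.stdev(vals)
        rows.append((x, statistics.fmean(vals), sd, sd / math.sqrt(len(vals))))
    return rows


for n_per_x in (10, 100, 1000):
    xs, ys = make_data(n_per_x)
    rows = bin_summary(xs, ys)
    r_raw = pearson_r(xs, ys)
    # Averaging discards the within-bin variation before R is computed.
    r_means = pearson_r([row[0] for row in rows], [row[1] for row in rows])
    mean_sd = statistics.fmean([row[2] for row in rows])
    mean_se = statistics.fmean([row[3] for row in rows])
    print(f"n/bin={n_per_x:5d}  R(raw)={r_raw:.2f}  R(bin means)={r_means:.2f}  "
          f"SD={mean_sd:.1f}  SE={mean_se:.2f}")
```

Under these assumptions, the R of the raw data should stay roughly constant as the number of points per bin grows, while the R of the bin means should climb towards 1. The within-bin SD should stay near the noise level, whereas the standard error shrinks with sample size, which is why error bars based on standard error say little about the strength of the underlying relationship.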