SlideShare a Scribd company logo
1 of 23
Download to read offline
Elsevier Predictive Modeling |
Presented By
Date
Bioactivity Predictive Modeling
Using Reaxys Medicinal Chemistry Data to Build Models for Any Target
Matthew CLARK
5 May 2016
Elsevier Predictive Modeling |
R&D Solutions
FOR PHARMA & LIFE SCIENCES
Essential Tools and Solutions for Pharma &
Life Sciences
https://www.elsevier.com/rd-solutions/pharma-and-life-
sciences-solutions
Matthew Clark, Ph. D. m.clark@elsevier.com
Elsevier Predictive Modeling |
• Reaxys Medicinal Chemistry has millions of bioactivity measurements for
thousands of targets.
• Data from Aureus, Elsevier, and GVK Bio
• Covering patents and journal articles
• The measurements are associated with compounds with chemical structures.
• We can combine the structures/activities to make Quantitative Structure
Activity Relationship (QSAR) predictive models.
• Here we use a KNIME automation workflow to demonstrate a system that can
create and evaluate a predictive model for any selected target.
• Of course not all are ‘predictive’, but the workflow evaluates this
• Each model takes from 2-15 minutes to create
• It provides an extremely flexible workbench for exploring QSAR
• Elsevier services can integrate the system with your in-house
structure/activity databases.
Summary
Elsevier Predictive Modeling | 4
THE WORLD’S LARGEST & BEST-ORGANIZED MEDICINAL CHEMISTRY DATABASE
The Reaxys Medicinal Chemistry database contains:
This unique collection of data is sourced from:
Elsevier Predictive Modeling | 5
Various Activity Metrics are Consolidated to a Single Value
• pIC50, pEC50, pED50, pID50, pLC50, pLD50, pKi, pKd,
pKb, pD2, pD’2, pA2 , IC50, EC50, ED50, ID50, LC50, LD50,
Ki, Kd, Kb, Ka, Ke , % Inhibition
pX is computed
Original values are preserved;
this is an additional computed
descriptor
Elsevier Predictive Modeling | 6
Conversion of % inhibition
• If Concentration of the compound is not Available p[X] is not calculated
• IF CONCENTRATION Available as a range
• p[X] is not calculated
• If concentration is available as a single value
• p[X] is calculated
• If % inhibition > 25 p[X] = -log10(IC50)
𝑰𝑪 𝟓𝟎 =
𝟏𝟎𝟎∗[𝑪]
% 𝒊𝒏𝒉𝒊𝒃𝒊𝒕𝒊𝒐𝒏
− [C]
• If % inhibition < 25 p[X] = 1
Elsevier Predictive Modeling | 7
Qualitative results
• Reported as Inactive
• p[x] = 1
• Reported as Active
• Concentration reported
• p[X] = - log10(Concentration)
• Concentration is not reported
• p[X] not calculated
Elsevier Predictive Modeling | 8
Process for Computing pX
Conc
Unit?
Data
concentration
Not concentration
Conversion to
concentration
Quality
metric?
Yes
No
Quality
Rules
pX not
calculated
is
Range?
Range
processing
Log
unit?
No
Yes
pX
Yes
-log10(affinity)
No
Low
quality
Acceptable
quality
pX not
calculated
Not
convertible
Elsevier Predictive Modeling |
• Issue: some targets have thousands of measurements
• Solution: randomly select N examples from each decade of pX activity
• Selecting from each decade provides a more balanced data set across the activity range
• In this study only IC50 measurements are used, not % inhibition or Kd
• N is from 50 to 200 in these examples.
• Compounds are selected so that training and test sets have no overlap
• Use random selection of both
• For each compound the median pX value is chosen when multiple measurements
are reported
• pX = -log(molar activity)
• Partial-least-squares correlation is used, selecting the optimal number of
descriptor components
• Model quality is assessed
• I like RMS error of prediction because it most directly answers the question “how accurate do I
expect the prediction to be?”
• r2, F and other statistics are also computed.
9
Elements of the Model Making Workflow
Elsevier Predictive Modeling | 10
Random Selection to ‘Flatten’ Activity Distribution
1 2 43 65 7 8 9 10
-log(M)
#of
compounds
1 2 43 65 7 8 9 10
-log(M)
#of
compounds
Raw reports – may be thousands for a
popular target. Activities and structure
distribution may skew statistics
Randomly selected ‘N’
compounds per decade
More diverse set of
structures and more
uniform distribution of
activities
Elsevier Predictive Modeling |
KNIME Workflow
Input:
• target name,
• number of compounds to use from
each decade of log(Activity)
• activity type, e.g. IC50
Output:
• model from training set
• predictions of test set
Elsevier Predictive Modeling |
• Powerful statistical methods can select variables to create a model that looks
great on the training set
• However the real test is how well the model can predict activity for a
compound not in the training set.
• In these examples the Training and Test sets are not overlapping; no test
compound is in the training set.
• In this study the best metric for the model is the RMS Error of Prediction –
the errors seen when predicting activities for compounds not in the training
set.
• This is useful to provide the “error bar” for the predictions.
• The required accuracy depends on what you are using the prediction for
• The Pearson r2 provides a metric of “fit” but is dependent on the range of the data and
does not directly tell you how much error to expect.
• Other metrics for prediction might include how similar the compound is any compounds in
the training set. One might expect the predictions to be less reliable if the compound is
very different.
12
The Importance of the Test Set
Elsevier Predictive Modeling |
• The descriptors are properties computed for the compounds that are related
to the activities
• They encode elements of the structure and other properties of the
compounds.
• In this study we combined 2 sets of computed descriptors
• RD Kit
• CDK Toolkit
13
Descriptors
Elsevier Predictive Modeling |
library(pls)
model <- plsr(pX ~ ., data = knime.in)
# compute RMS error of prediction for each number of components
components <- RMSEP(model)
optimalComps = length(components$comps) - 1
# find optimal number of components in the model based on
# decreasing rms error of prediction for the models.
for (i in 1:optimalComps) {
if (components$val[i] - components$val[i+1] <= 0.0001) {
optimalComps = i
break
}
}
# make final model with optimal components
model <- plsr(pX ~ ., data = knime.in, ncomp = optimalComps)
sprintf("number optimal components: %d RMS error of prediction: %2.3f",
optimalComps, components$val[optimalComps] )
14
R Script for Partial Least Squares
Create model with all
possible components
Find model with
optimal # of
components
Make final model
with optimal # of
components
Elsevier Predictive Modeling |
• This model is moderately useful to separate best
from worst molecules.
• The test set statistics are influenced by 2 outliers
with absurdly high predicted binding Without those
the rms error is about 0.9.
15
ORL-1
Model
RMS Error of
Prediction
Training Set 0.71
Test Set 1.3
y = 0.9912x
R² = 0.6787
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set y = 0.9956x
R² = 0.2808
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Elsevier Predictive Modeling |
• p38a has 2 binding sites (‘normal’ and allosteric)
which make a simple QSAR more difficult if the
compounds are not separated
• Test set RMSEP suggests predictive value is limited
16
p38a (Q16539)
Model
RMS Error of
Prediction
Training Set 1.15
Test Set 1.45
y = 0.9734x
R² = 0.3237
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set y = 0.951x
R² = 0.0775
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Elsevier Predictive Modeling |
• This target has fewer reported
compounds and activities than others
• Test suggests reasonable predictive
power to distinguish active from
inactive compounds in-silico
17
ABL Kinase (P00519)
Model
RMS Error of
Prediction
Training Set 0.63
Test Set 0.96
y = 0.9919x
R² = 0.8741
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.9868x
R² = 0.7179
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted pX Measured
Test Set
Elsevier Predictive Modeling |
• Prediction error of 1 log unit is useful model for in-
silico screening
18
ErbB1
y = 0.9814x
R² = 0.5975
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.991x
R² = 0.3463
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Model
RMS Error of
Prediction
Training Set 0.89
Test Set 0.89
Elsevier Predictive Modeling |
• Results are similar for this ABL example.
• For ‘final’ models larger selections or all data points can be used to increase
robustness
• The automation aspect makes repeated trials easy to do
19
Q: Since random sets are selected, what happens if you repeat trials?
A: The results are nearly identical in a statistical sense
Model
RMS Error of
Prediction
Training Set 0.63
Test Set 1.6
Model
RMS Error of
Prediction
Training Set 0.62
Test Set 1.0
y = 0.9924x
R² = 0.8803
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.9964x
R² = 0.7402
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
y = 0.9924x
R² = 0.8804
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set y = 1.0049x
R² = 0.4629
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
y = 0.9914x
R² = 0.8597
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.9959x
R² = 0.7053
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Model
RMS Error of
Prediction
Training Set 0.66
Test Set 1.0
Elsevier Predictive Modeling | 20
Q: What if you use a higher level class instead of a single protein?
A: In this case at least, the model isn’t as good.
y = 0.9495x
R² = 0.3565
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.9526x
R² = 0.0029
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Model
RMS Error of
Prediction
Training Set 1.47
Test Set 1.99
VEGFR*
VEGFR* (200,000 bioactivities reported)
Includes VEGFR 1, 2 3 , FLT1, FLT3 etc. etc.
There is probably too much variation in the sub-target structures that
makes the model ineffective
Model is not useful.
Your mileage may vary with class
Elsevier Predictive Modeling |
• EGFR* (200,000 bioactivities reported)
• Includes VEGFR, FLT1, FLT3 etc. etc.
• There is probably too much variation in the sub-
target structures that makes the model ineffective
• Model is not useful.
• Your mileage may vary with class
21
EGFR (P00533)
Model
RMS Error of
Prediction
Training Set 1.24
Test Set 1.59
y = 0.9721x
R² = 0.711
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Training Set
y = 0.9514x
R² = 0.4955
0
2
4
6
8
10
12
14
16
0 2 4 6 8 10 12 14 16
pXPredicted
pX Measured
Test Set
Elsevier Predictive Modeling |
• VEGFR
• FLT1
• FLT4
• FLK-1 (VEGFR2)
• VEGFR [Human]
22
Models for Sub-Variants of VEGFR – Some of the variants
contribute more to the model than others
Target RMS EP Test Set RMS EP Training
VEGFR* 2.0 1.5
FLT1 1.4 1.8
FLT4 0.87 0.86
FLK-1 1.8 2.1
VEGFR [Human] 1.0 0.92
Elsevier Predictive Modeling |
• This tool makes it very easy to do answer many many questions
with minimal time and effort, as shown.
• One can add ability to search for specific scaffolds and combine with target
query.
• It provides an unprecedented tool to explore QSAR by target, class
and other criteria.
• One can assess the likelihood that there is sufficient data to make a
model for a given target
• The automatic evaluation using a test set provides nearly instant feedback on
the prospects for making a predictive model.
• Because it uses the KNIME system, nearly any descriptor set can be
used for making the models.
• There are additional open-source and commercial descriptor tools available.
• One can probably get better results with better descriptors.
23
Conclusions

More Related Content

Viewers also liked

ACS Spring 2016 Combining semantic triple stores across knowledge domains
ACS Spring 2016 Combining semantic triple stores across knowledge domainsACS Spring 2016 Combining semantic triple stores across knowledge domains
ACS Spring 2016 Combining semantic triple stores across knowledge domainsMatthew Clark
 
Time-based bioactivity analysis
Time-based bioactivity analysisTime-based bioactivity analysis
Time-based bioactivity analysisMatthew Clark
 
Mobile First SEO The Marketer Edition
Mobile First SEO The Marketer EditionMobile First SEO The Marketer Edition
Mobile First SEO The Marketer EditionVirginia S. Willis
 
Reaxys in Education and Research at ETH Zurich
Reaxys in Education and Research at ETH ZurichReaxys in Education and Research at ETH Zurich
Reaxys in Education and Research at ETH ZurichAnn-Marie Roche
 
Reaxys Medicinal Chemistry Nov20 2014
Reaxys Medicinal Chemistry Nov20 2014Reaxys Medicinal Chemistry Nov20 2014
Reaxys Medicinal Chemistry Nov20 2014Reaxys
 
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...Dr. Haxel Consult
 
Introduction to Reaxys
Introduction to ReaxysIntroduction to Reaxys
Introduction to ReaxysReaxys
 
Reaxys Medicinal Chemistry 2014
Reaxys Medicinal Chemistry 2014Reaxys Medicinal Chemistry 2014
Reaxys Medicinal Chemistry 2014Reaxys
 
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-First
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-FirstAleyda Solis: Reenfocando tu SEO para un mundo Mobile-First
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-FirstWe Are Marketing
 
Good Practices to Maximize Your Mobile SEO
Good Practices to Maximize Your Mobile SEOGood Practices to Maximize Your Mobile SEO
Good Practices to Maximize Your Mobile SEOAleyda Solís
 
Mobile-First SEO at #InOrbit2017
Mobile-First SEO at #InOrbit2017Mobile-First SEO at #InOrbit2017
Mobile-First SEO at #InOrbit2017Aleyda Solís
 

Viewers also liked (12)

ACS Spring 2016 Combining semantic triple stores across knowledge domains
ACS Spring 2016 Combining semantic triple stores across knowledge domainsACS Spring 2016 Combining semantic triple stores across knowledge domains
ACS Spring 2016 Combining semantic triple stores across knowledge domains
 
Time-based bioactivity analysis
Time-based bioactivity analysisTime-based bioactivity analysis
Time-based bioactivity analysis
 
Mobile First SEO The Marketer Edition
Mobile First SEO The Marketer EditionMobile First SEO The Marketer Edition
Mobile First SEO The Marketer Edition
 
2014.apr.rx
2014.apr.rx2014.apr.rx
2014.apr.rx
 
Reaxys in Education and Research at ETH Zurich
Reaxys in Education and Research at ETH ZurichReaxys in Education and Research at ETH Zurich
Reaxys in Education and Research at ETH Zurich
 
Reaxys Medicinal Chemistry Nov20 2014
Reaxys Medicinal Chemistry Nov20 2014Reaxys Medicinal Chemistry Nov20 2014
Reaxys Medicinal Chemistry Nov20 2014
 
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...
ICIC 2014 Increasing the efficiency of pharmaceutical research through data i...
 
Introduction to Reaxys
Introduction to ReaxysIntroduction to Reaxys
Introduction to Reaxys
 
Reaxys Medicinal Chemistry 2014
Reaxys Medicinal Chemistry 2014Reaxys Medicinal Chemistry 2014
Reaxys Medicinal Chemistry 2014
 
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-First
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-FirstAleyda Solis: Reenfocando tu SEO para un mundo Mobile-First
Aleyda Solis: Reenfocando tu SEO para un mundo Mobile-First
 
Good Practices to Maximize Your Mobile SEO
Good Practices to Maximize Your Mobile SEOGood Practices to Maximize Your Mobile SEO
Good Practices to Maximize Your Mobile SEO
 
Mobile-First SEO at #InOrbit2017
Mobile-First SEO at #InOrbit2017Mobile-First SEO at #InOrbit2017
Mobile-First SEO at #InOrbit2017
 

Similar to Bioactivity Predictive ModelingMay2016

How predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinarHow predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinarAnn-Marie Roche
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
TCI in general pracice - reliability (2006)
TCI in general pracice - reliability (2006)TCI in general pracice - reliability (2006)
TCI in general pracice - reliability (2006)Evangelos Kontopantelis
 
Modeling Chemical Datasets
Modeling Chemical DatasetsModeling Chemical Datasets
Modeling Chemical DatasetsAbhik Seal
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdfAlexanderLerch4
 
Measurement system analysis Presentation.ppt
Measurement system analysis Presentation.pptMeasurement system analysis Presentation.ppt
Measurement system analysis Presentation.pptjawadullah25
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methodsjoycemi_la
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methodsjoycemi_la
 
Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1khairulhuda242
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...WiMLDSMontreal
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation SpecializationAndrea Rubio
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning modelsKyriakos Chatzidimitriou
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciencesfsmart01
 
AIAA-MAO-RegionalError-2012
AIAA-MAO-RegionalError-2012AIAA-MAO-RegionalError-2012
AIAA-MAO-RegionalError-2012OptiModel
 

Similar to Bioactivity Predictive ModelingMay2016 (20)

How predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinarHow predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinar
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
TCI in general pracice - reliability (2006)
TCI in general pracice - reliability (2006)TCI in general pracice - reliability (2006)
TCI in general pracice - reliability (2006)
 
Modeling Chemical Datasets
Modeling Chemical DatasetsModeling Chemical Datasets
Modeling Chemical Datasets
 
Model selection
Model selectionModel selection
Model selection
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf
 
Measurement system analysis Presentation.ppt
Measurement system analysis Presentation.pptMeasurement system analysis Presentation.ppt
Measurement system analysis Presentation.ppt
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methods
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methods
 
Optimization in QBD
Optimization in QBDOptimization in QBD
Optimization in QBD
 
Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1Week 12 Dimensionality Reduction Bagian 1
Week 12 Dimensionality Reduction Bagian 1
 
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
Using Feature Grouping as a Stochastic Regularizer for High Dimensional Noisy...
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization[M3A4] Data Analysis and Interpretation Specialization
[M3A4] Data Analysis and Interpretation Specialization
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
 
Ds for finance day 3
Ds for finance day 3Ds for finance day 3
Ds for finance day 3
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
 
AIAA-MAO-RegionalError-2012
AIAA-MAO-RegionalError-2012AIAA-MAO-RegionalError-2012
AIAA-MAO-RegionalError-2012
 
Machine learning meetup
Machine learning meetupMachine learning meetup
Machine learning meetup
 

Bioactivity Predictive ModelingMay2016

  • 1. Elsevier Predictive Modeling | Presented By Date Bioactivity Predictive Modeling Using Reaxys Medicinal Chemistry Data to Build Models for Any Target Matthew CLARK 5 May 2016
  • 2. Elsevier Predictive Modeling | R&D Solutions FOR PHARMA & LIFE SCIENCES Essential Tools and Solutions for Pharma & Life Sciences https://www.elsevier.com/rd-solutions/pharma-and-life- sciences-solutions Matthew Clark, Ph. D. m.clark@elsevier.com
  • 3. Elsevier Predictive Modeling | • Reaxys Medicinal Chemistry has millions of bioactivity measurements for thousands of targets. • Data from Aureus, Elsevier, and GVK Bio • Covering patents and journal articles • The measurements are associated with compounds with chemical structures. • We can combine the structures/activities to make Quantitative Structure Activity Relationship (QSAR) predictive models. • Here we use a KNIME automation workflow to demonstrate a system that can create and evaluate a predictive model for any selected target. • Of course not all are ‘predictive’, but the workflow evaluates this • Each model takes from 2-15 minutes to create • It provides an extremely flexible workbench for exploring QSAR • Elsevier services can integrate the system with your in-house structure/activity databases. Summary
  • 4. Elsevier Predictive Modeling | 4 THE WORLD’S LARGEST & BEST-ORGANIZED MEDICINAL CHEMISTRY DATABASE The Reaxys Medicinal Chemistry database contains: This unique collection of data is sourced from:
  • 5. Elsevier Predictive Modeling | 5 Various Activity Metrics are Consolidated to a Single Value • pIC50, pEC50, pED50, pID50, pLC50, pLD50, pKi, pKd, pKb, pD2, pD’2, pA2 , IC50, EC50, ED50, ID50, LC50, LD50, Ki, Kd, Kb, Ka, Ke , % Inhibition pX is computed Original values are preserved; this is an additional computed descriptor
  • 6. Elsevier Predictive Modeling | 6 Conversion of % inhibition • If Concentration of the compound is not Available p[X] is not calculated • IF CONCENTRATION Available as a range • p[X] is not calculated • If concentration is available as a single value • p[X] is calculated • If % inhibition > 25 p[X] = -log10(IC50) 𝑰𝑪 𝟓𝟎 = 𝟏𝟎𝟎∗[𝑪] % 𝒊𝒏𝒉𝒊𝒃𝒊𝒕𝒊𝒐𝒏 − [C] • If % inhibition < 25 p[X] = 1
  • 7. Elsevier Predictive Modeling | 7 Qualitative results • Reported as Inactive • p[x] = 1 • Reported as Active • Concentration reported • p[X] = - log10(Concentration) • Concentration is not reported • p[X] not calculated
  • 8. Elsevier Predictive Modeling | 8 Process for Computing pX Conc Unit? Data concentration Not concentration Conversion to concentration Quality metric? Yes No Quality Rules pX not calculated is Range? Range processing Log unit? No Yes pX Yes -log10(affinity) No Low quality Acceptable quality pX not calculated Not convertible
  • 9. Elsevier Predictive Modeling | • Issue: some targets have thousands of measurements • Solution: randomly select N examples from each decade of pX activity • Selecting from each decade provides a more balanced data set across the activity range • In this study only IC50 measurements are used, not % inhibition or Kd • N is from 50 to 200 in these examples. • Compounds are selected so that training and test sets have no overlap • Use random selection of both • For each compound the median pX value is chosen when multiple measurements are reported • pX = -log(molar activity) • Partial-least-squares correlation is used, selecting the optimal number of descriptor components • Model quality is assessed • I like RMS error of prediction because it most directly answers the question “how accurate do I expect the prediction to be?” • r2, F and other statistics are also computed. 9 Elements of the Model Making Workflow
  • 10. Elsevier Predictive Modeling | 10 Random Selection to ‘Flatten’ Activity Distribution 1 2 43 65 7 8 9 10 -log(M) #of compounds 1 2 43 65 7 8 9 10 -log(M) #of compounds Raw reports – may be thousands for a popular target. Activities and structure distribution may skew statistics Randomly selected ‘N’ compounds per decade More diverse set of structures and more uniform distribution of activities
  • 11. Elsevier Predictive Modeling | KNIME Workflow Input: • target name, • number of compounds to use from each decade of log(Activity) • activity type, e.g. IC50 Output: • model from training set • predictions of test set
  • 12. Elsevier Predictive Modeling | • Powerful statistical methods can select variables to create a model that looks great on the training set • However the real test is how well the model can predict activity for a compound not in the training set. • In these examples the Training and Test sets are not overlapping; no test compound is in the training set. • In this study the best metric for the model is the RMS Error of Prediction – the errors seen when predicting activities for compounds not in the training set. • This is useful to provide the “error bar” for the predictions. • The required accuracy depends on what you are using the prediction for • The Pearson r2 provides a metric of “fit” but is dependent on the range of the data and does not directly tell you how much error to expect. • Other metrics for prediction might include how similar the compound is any compounds in the training set. One might expect the predictions to be less reliable if the compound is very different. 12 The Importance of the Test Set
  • 13. Elsevier Predictive Modeling | • The descriptors are properties computed for the compounds that are related to the activities • They encode elements of the structure and other properties of the compounds. • In this study we combined 2 sets of computed descriptors • RD Kit • CDK Toolkit 13 Descriptors
  • 14. Elsevier Predictive Modeling | library(pls) model <- plsr(pX ~ ., data = knime.in) # compute RMS error of prediction for each number of components components <- RMSEP(model) optimalComps = length(components$comps) - 1 # find optimal number of components in the model based on # decreasing rms error of prediction for the models. for (i in 1:optimalComps) { if (components$val[i] - components$val[i+1] <= 0.0001) { optimalComps = i break } } # make final model with optimal components model <- plsr(pX ~ ., data = knime.in, ncomp = optimalComps) sprintf("number optimal components: %d RMS error of prediction: %2.3f", optimalComps, components$val[optimalComps] ) 14 R Script for Partial Least Squares Create model with all possible components Find model with optimal # of components Make final model with optimal # of components
  • 15. Elsevier Predictive Modeling | • This model is moderately useful to separate best from worst molecules. • The test set statistics are influenced by 2 outliers with absurdly high predicted binding Without those the rms error is about 0.9. 15 ORL-1 Model RMS Error of Prediction Training Set 0.71 Test Set 1.3 y = 0.9912x R² = 0.6787 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9956x R² = 0.2808 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set
  • 16. Elsevier Predictive Modeling | • p38a has 2 binding sites (‘normal’ and allosteric) which make a simple QSAR more difficult if the compounds are not separated • Test set RMSEP suggests predictive value is limited 16 p38a (Q16539) Model RMS Error of Prediction Training Set 1.15 Test Set 1.45 y = 0.9734x R² = 0.3237 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.951x R² = 0.0775 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set
  • 17. Elsevier Predictive Modeling | • This target has fewer reported compounds and activities than others • Test suggests reasonable predictive power to distinguish active from inactive compounds in-silico 17 ABL Kinase (P00519) Model RMS Error of Prediction Training Set 0.63 Test Set 0.96 y = 0.9919x R² = 0.8741 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9868x R² = 0.7179 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set
  • 18. Elsevier Predictive Modeling | • Prediction error of 1 log unit is useful model for in- silico screening 18 ErbB1 y = 0.9814x R² = 0.5975 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.991x R² = 0.3463 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set Model RMS Error of Prediction Training Set 0.89 Test Set 0.89
  • 19. Elsevier Predictive Modeling | • Results are similar for this ABL example. • For ‘final’ models larger selections or all data points can be used to increase robustness • The automation aspect makes repeated trials easy to do 19 Q: Since random sets are selected, what happens if you repeat trials? A: The results are nearly identical in a statistical sense Model RMS Error of Prediction Training Set 0.63 Test Set 1.6 Model RMS Error of Prediction Training Set 0.62 Test Set 1.0 y = 0.9924x R² = 0.8803 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9964x R² = 0.7402 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set y = 0.9924x R² = 0.8804 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 1.0049x R² = 0.4629 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set y = 0.9914x R² = 0.8597 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9959x R² = 0.7053 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set Model RMS Error of Prediction Training Set 0.66 Test Set 1.0
  • 20. Elsevier Predictive Modeling | 20 Q: What if you use a higher level class instead of a single protein? A: In this case at least, the model isn’t as good. y = 0.9495x R² = 0.3565 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9526x R² = 0.0029 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set Model RMS Error of Prediction Training Set 1.47 Test Set 1.99 VEGFR* VEGFR* (200,000 bioactivities reported) Includes VEGFR 1, 2 3 , FLT1, FLT3 etc. etc. There is probably too much variation in the sub-target structures that makes the model ineffective Model is not useful. Your mileage may vary with class
  • 21. Elsevier Predictive Modeling | • EGFR* (200,000 bioactivities reported) • Includes VEGFR, FLT1, FLT3 etc. etc. • There is probably too much variation in the sub- target structures that makes the model ineffective • Model is not useful. • Your mileage may vary with class 21 EGFR (P00533) Model RMS Error of Prediction Training Set 1.24 Test Set 1.59 y = 0.9721x R² = 0.711 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9514x R² = 0.4955 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set
  • 22. Elsevier Predictive Modeling | • VEGFR • FLT1 • FLT4 • FLK-1 (VEGFR2) • VEGFR [Human] 22 Models for Sub-Variants of VEGFR – Some of the variants contribute more to the model than others Target RMS EP Test Set RMS EP Training VEGFR* 2.0 1.5 FLT1 1.4 1.8 FLT4 0.87 0.86 FLK-1 1.8 2.1 VEGFR [Human] 1.0 0.92
  • 23. Elsevier Predictive Modeling | • This tool makes it very easy to do answer many many questions with minimal time and effort, as shown. • One can add ability to search for specific scaffolds and combine with target query. • It provides an unprecedented tool to explore QSAR by target, class and other criteria. • One can assess the likelihood that there is sufficient data to make a model for a given target • The automatic evaluation using a test set provides nearly instant feedback on the prospects for making a predictive model. • Because it uses the KNIME system, nearly any descriptor set can be used for making the models. • There are additional open-source and commercial descriptor tools available. • One can probably get better results with better descriptors. 23 Conclusions