Olivier Barberan
Senior Product Manager
Data-driven drug discovery:
How predictive models help medicinal
chemists design better drugs
Putting Data to Work |
Today’s discussion
• Elsevier is a partner and vendor to 100% of the world’s top
pharmaceutical companies
• It provides leading data solutions in chemistry, biology, drug safety
and biomedical review
• Our workflow solutions are based on clearly defined use cases
BUT as with many data-driven companies, our data is used for specific
purposes via different user interfaces.
Our Challenge: How to externalize data for:
• Complementary data integration
• Bioactivity prediction
• Use in additional use cases and workflows
The challenge pharma faces: investing in earlier development stages
builds up the pipeline and reduces attrition from foreseeable causes
Development stages and their main activities:
• Target ID & Validation: characterize & understand disease
• Lead ID & Validation: identify, design & validate leads
• Pre-clinical: cull/prioritize leads
• Clinical (Phase I to III): determine safety and efficacy profile
• Post-Launch: manage risk & compliance; improve patient care
Cost (/NME), in the order listed on the slide: $125 M, $773 M, $200 M, $1,460 M, $3–5 B
Source: Tufts Center for the Study of Drug Development, Nov 2014
“We cannot fail for reasons we could have predicted. We
should fail only for reasons we could not predict.”
—Dr Moncef Slaoui
Head of Global R&D, GSK
• Low margin of safety is a major
cause of attrition in Phase I and II
• Lack of efficacy is a major cause of
attrition in Phase II and III
Better-informed decisions at the Lead ID
& Validation stage generate better-optimized
leads and mitigate failures
and misinvestments
80% 65% 69% 12%
Success Rate
“Usability & Re-usability of data is critical to our success.
We are experts at capturing data for purpose, and poor at
making that data available for any purpose ”
—Dr M. Ramsey,
Chief Data Officer, GSK
Excerption of facts – how Reaxys Medicinal
Chemistry handles the literature
Excerption of facts and Indexing of literature and patents is done by
pharmacologists for Medicinal chemists
Substance-record
Bibliographic-record
Excerption
Indexing
Bioactivity-record
Potency &
selectivity
DMPK properties
Physical properties
Safety
pharmacology
The lead optimization challenge: Optimization of early
substance to potential drug
Hit Identification Lead gen. and optimization
Drug discovery involves different personas and needs
Medicinal chemists: “I need to
optimize chemical structures in order to
improve affinity, selectivity, and ADMET
properties and to decrease side effects”
Computational chemists:
“I need to find hits for druggable targets”
(virtual screening)
Integration /
Professional
Services
In house data
Hit to lead: Virtual screening
Ligand-based virtual screening – Using Reaxys Medicinal
Chemistry
Objective
• Describe an In Silico Screening approach
using Reaxys Medicinal Chemistry
Case Study on T-Type calcium channels
Ligand-based in silico screening
730 Query structures
Representation & Chemical
Space Molecular descriptors &
Fingerprints
Virtual Screening
Pharmacophoric Similarity
314 Hits
"Drug-like" Filtering
1. Molecular diversity and chemical originality
2. Compound availability
39 compounds ordered for testing
28 M Substances
Chemical space based on
synthesized substances
Pipeline pilot
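The similarity step in this funnel can be sketched in a few lines. This is only a minimal illustration, assuming each structure is represented by a fingerprint stored as a set of "on" bit positions and compared with the Tanimoto coefficient. The slide names pharmacophoric similarity in Pipeline Pilot, whose exact settings are not given, so the function names, the toy fingerprints, and the 0.7 cutoff below are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # positions of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient |A & B| / |A | B| (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def screen(queries: List[Fingerprint],
           library: Dict[str, Fingerprint],
           threshold: float) -> List[Tuple[str, float]]:
    """Keep library substances whose best similarity to any query
    structure reaches the threshold, most similar first."""
    hits = []
    for substance_id, fp in library.items():
        best = max((tanimoto(fp, q) for q in queries), default=0.0)
        if best >= threshold:
            hits.append((substance_id, best))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data standing in for the query structures and the substance library.
queries = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
library = {
    "S1": frozenset({1, 2, 3, 4}),  # identical to the first query
    "S2": frozenset({2, 3, 5, 6}),  # close to the second query
    "S3": frozenset({7, 8, 9}),     # unrelated
}
hits = screen(queries, library, threshold=0.7)
print(hits)
```

In a real workflow the hits would then go through the "drug-like", diversity, and availability filters listed above before compounds are ordered.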
Biological testing of the 39 hits on Cav3.2
Electrophysiology experiments: screening at 10 µM
on Cav3.2 T-type channels
[Figure: peak current inhibition (%), 0 to 100, by compound # (1 to 39, at 10 µM)]
9 compounds with inhibition > 75%
15 compounds with inhibition > 50%
Reaxys Medicinal Chemistry supports Ligand-based virtual
screening approaches
Lead optimization
Predictive Modeling of On- and Off-Target Bioactivities
TITLE OF PRESENTATION | 13
Issue
• Need to make best possible decisions to rank/decide on drug
candidates
• Use all available information to help make decisions about potential
therapeutics
Method
• Build hundreds of models for on- and off-target activities
• Create a model building engine in KNIME that can make models for large
lists of targets using data from Reaxys Medicinal Chemistry.
• Selectivity panels.
• Toxicology pathways (from pathway analysis).
• PK/ADME models
Challenge: Make best decisions based on data
• Issue: some targets have thousands of measurements
• Solution: Randomly select N examples from each decade of pX activity
• Selecting from each decade provides a more balanced data set across the activity range
• In this study only IC50 measurements are used, not % inhibition or Kd
• N is from 50 to 200 in these examples.
• Compounds are selected so that training and test sets have no overlap
• Use random selection of both
• For each compound the median pX value is chosen when multiple measurements are
reported
• pX = -log10(activity in mol/L); for example, an IC50 of 1 µM gives pX = 6
• Partial-least-squares correlation is used, selecting the optimal number of descriptor
components
• Model quality is assessed
• I like RMS error of prediction because it most directly answers the question “how accurate do I expect the
prediction to be?”
• r2, F and other statistics are also computed.
Elements of the model making workflow
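Two of the data-preparation steps above, the pX conversion and the median over repeated measurements, can be sketched as follows. This is a minimal illustration, assuming IC50 values are supplied in mol/L as (compound id, IC50) pairs; the function names and compound ids are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def to_px(activity_molar: float) -> float:
    """pX = -log10(activity in mol/L); 1 µM (1e-6 M) gives pX = 6."""
    return -math.log10(activity_molar)


def median_px_per_compound(
        measurements: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Collapse repeated IC50 reports to one median pX per compound."""
    by_compound = defaultdict(list)
    for compound_id, ic50_molar in measurements:
        by_compound[compound_id].append(to_px(ic50_molar))
    return {cid: median(values) for cid, values in by_compound.items()}


# Hypothetical reports: three IC50 values for one compound, one for another.
measurements = [
    ("cmpd-A", 1e-6),
    ("cmpd-A", 1e-7),
    ("cmpd-A", 1e-8),
    ("cmpd-B", 1e-5),
]
px = median_px_per_compound(measurements)
print(px)
```

Taking the median rather than the mean keeps a single outlying report from shifting the value used for training.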
Random Selection to ‘Flatten’ Activity Distribution
[Figure: two histograms of # of compounds vs. -log(M), decades 1 to 10, before and after random selection]
Raw reports – may be thousands for a
popular target. Activities and structure
distribution may skew statistics
Randomly selected ‘N’
compounds per decade
Result - diverse set of
structures and more uniform
distribution of activities
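The per-decade selection and the non-overlapping training/test split can be sketched as below. This is a minimal illustration under stated assumptions: compounds are binned by the integer part of their pX, `n_per_decade` plays the role of N, and the 25% test fraction and fixed random seed are hypothetical choices, not values taken from the slides.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_per_decade(px_by_compound: Dict[str, float],
                      n_per_decade: int,
                      seed: int = 0) -> List[str]:
    """Randomly pick at most n_per_decade compounds from each pX decade."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for compound_id, px in sorted(px_by_compound.items()):
        bins[math.floor(px)].append(compound_id)
    selected = []
    for decade in sorted(bins):
        members = bins[decade]
        selected.extend(rng.sample(members, min(n_per_decade, len(members))))
    return selected


def split_train_test(compounds: Sequence[str],
                     test_fraction: float = 0.25,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle once and cut, so no compound lands in both sets."""
    rng = random.Random(seed)
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Skewed toy data: 5 compounds in decade 4, 100 in decade 6, 20 in decade 8.
px_by_compound = {}
px_by_compound.update({f"d4_{i}": 4.0 + i / 10 for i in range(5)})
px_by_compound.update({f"d6_{i}": 6.0 + i / 100 for i in range(100)})
px_by_compound.update({f"d8_{i}": 8.0 + i / 20 for i in range(20)})

selected = sample_per_decade(px_by_compound, n_per_decade=10)
train, test = split_train_test(selected, test_fraction=0.25)
print(len(selected), len(train), len(test))
```

With N = 10 the crowded decade 6 contributes no more compounds than decade 8, which is the flattening effect described above.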
High-Level workflow for creating models
• Uses a list of UNIPROT target names, and NCBI
synonyms
• Attempts to create a model for each target
• Rejects models for insufficient data, or low
quality
• Powerful statistical methods can select variables to create a model that looks great on
the training set
• However the real test is how well the model can predict activity for a compound not in
the training set.
• In these examples the Training and Test sets are not overlapping; no test compound is
in the training set.
• In this study the best metric for the model is the RMS Error of Prediction – the errors
seen when predicting activities for compounds not in the training set.
• This is useful to provide the “error bar” for the predictions.
• The required accuracy depends on what you are using the prediction for
• The Pearson r² provides a metric of “fit” but is dependent on the range of the data and does not
directly tell you how much error to expect.
• Other metrics for prediction might include how similar the compound is to any compound in the training
set. One might expect the predictions to be less reliable if the compound is very different.
The Importance of the Test Set
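The point about RMS error versus r² can be checked numerically. The sketch below is an illustration with made-up numbers: both toy test sets have exactly the same prediction errors (alternating +0.5 and -0.5 pX units), but one spans pX 4 to 9 and the other only pX 6 to 7. The function names are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error of prediction, in pX units."""
    return math.sqrt(mean((m - p) ** 2 for m, p in zip(measured, predicted)))


def pearson_r2(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared Pearson correlation between measured and predicted values."""
    mx, my = mean(measured), mean(predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy ** 2 / (sxx * syy)


errors = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

wide_measured = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
wide_predicted = [m + e for m, e in zip(wide_measured, errors)]

narrow_measured = [6.0, 6.2, 6.4, 6.6, 6.8, 7.0]
narrow_predicted = [m + e for m, e in zip(narrow_measured, errors)]

rmse_wide = rmse(wide_measured, wide_predicted)
rmse_narrow = rmse(narrow_measured, narrow_predicted)
r2_wide = pearson_r2(wide_measured, wide_predicted)
r2_narrow = pearson_r2(narrow_measured, narrow_predicted)
print(rmse_wide, rmse_narrow, r2_wide, r2_narrow)
```

The RMS error is 0.5 in both cases, while r² is much lower for the narrow-range set, which is why the RMS error of prediction is the more direct "error bar".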
Model Engine
• Creates series of Reaxys queries for each decade of activity
• Uses PLS/R to create and evaluate models
• Creates spreadsheet with RMS error of prediction and plots
• The descriptors are properties computed for the compounds that are
related to the activities
• They encode elements of the structure and other properties of the
compounds.
• In this study we combined 2 sets of computed descriptors
• RDKit
• CDK (Chemistry Development Kit)
Descriptors
• Results are similar for this ABL example.
• For ‘final’ models larger selections or all data points can be used to increase
robustness
• The automation aspect makes repeated trials easy to do
Q: Since random sets are selected, what happens if you repeat trials?
A: The results are nearly identical in a statistical sense
RMS Error of Prediction for the three repeated trials, as listed on the slide (Training Set / Test Set):
0.63 / 1.6
0.62 / 1.0
0.66 / 1.0
[Figure: pX Predicted vs. pX Measured (both axes 0 to 16), Training Set and Test Set panels for three repeated trials]
Trendlines in slide order (Training Set; Test Set):
y = 0.9924x, R² = 0.8803; y = 0.9964x, R² = 0.7402
y = 0.9924x, R² = 0.8804; y = 1.0049x, R² = 0.4629
y = 0.9914x, R² = 0.8597; y = 0.9959x, R² = 0.7053
Spreadsheet created for each model
• Training set, test set
• Statistics for training and test sets.
EGFR Model Example
[Figure: pX Predicted vs. pX Measured (both axes 0 to 16). EGFR Training Set: y = 0.9829x, R² = 0.8205. EGFR Test Set: y = 0.9765x, R² = 0.7151]
RMSEP  Type          target  protein  trainingSetSize  testSetSize
0.97   Training Set  P00533  EGFR     1583
1.2    Test Set      P00533  EGFR                      521
JAK2 Model Example
[Figure: pX Predicted vs. pX Measured (both axes 0 to 16). JAK2 Training Set: y = 0.9889x, R² = 0.7638. JAK2 Test Set: y = 0.9956x, R² = 0.6749]
RMSEP  Type          target  protein  trainingSetSize  testSetSize
0.76   Training Set  O60674  JAK2     1372
0.89   Test Set      O60674  JAK2                      446
GSK3 Model Example
[Figure: pX Predicted vs. pX Measured (both axes 0 to 16). GSK3β Training Set: y = 0.9911x, R² = 0.813. GSK3β Test Set: y = 1.0008x, R² = 0.5678]
RMSEP  Type          target  protein  trainingSetSize  testSetSize
0.65   Training Set  P49841  GSK3β    1271
0.97   Test Set      P49841  GSK3β                     427
Example of Poor Model
[Figure: pX Predicted vs. pX Measured (both axes 0 to 16). PDK1 Training Set: y = 0.9795x, R² = -3.788. PDK1 Test Set: y = 0.9867x, R² = -15.88]
RMSEP  Type          target  protein  trainingSetSize  testSetSize
0.90   Training Set  P98161  PKD1     60
1.05   Test Set      P98161  PKD1                      26
Prediction of Activities
• Reads in all models
• Reads in a set of molecules in SDF format
• Predicts activity for each model
• Computes a Z value for classifying each compound as active (pX > 6) or inactive
• Classifies as active/inactive based on the desired confidence
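The slides do not define the Z value, so the following is only a sketch under explicit assumptions: Z is taken as (predicted pX - 6) / RMSEP of the model, and the prediction error is assumed to be normally distributed, so that the standard normal CDF of Z approximates the probability that the true pX exceeds 6. The function names, the "uncertain" class, and the 90% confidence level are hypothetical.

```python
from statistics import NormalDist

ACTIVE_PX = 6.0  # activity cutoff used on the slide (pX > 6)


def z_value(predicted_px: float, rmsep: float,
            threshold: float = ACTIVE_PX) -> float:
    """Assumed definition: distance from the cutoff in units of RMSEP."""
    return (predicted_px - threshold) / rmsep


def classify(predicted_px: float, rmsep: float,
             confidence: float = 0.9) -> str:
    """Label a prediction given the confidence required for a call."""
    p_active = NormalDist().cdf(z_value(predicted_px, rmsep))
    if p_active >= confidence:
        return "active"
    if p_active <= 1.0 - confidence:
        return "inactive"
    return "uncertain"


# Hypothetical predictions from a model with a test-set RMSEP of 1.0.
calls = {px: classify(px, rmsep=1.0) for px in (8.0, 6.5, 4.0)}
print(calls)
```

Raising the confidence level widens the "uncertain" band around pX = 6; a model with a larger RMSEP has the same effect.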
HeatMap for Kinase panel
• Yellow → Black: increasing predicted inhibition based on Z value
• Selectivity metric: % of kinases highly inhibited
Columns (71 kinase models, in order): predIRAK3, predO00141SGK1, predO14757CHK1, predO14920ikbkb, predO14976GAK, predO15264p38δMAPK, predO15530PDK1, predO43353RIP2, predO60674JAK2, predO75116ROCK2, predO75582MSK1, predO96013PAK4, predO96017CHK2, predP00519ABL1, predP00533EGFR, predP04049c-Raf, predP04626ERBB2, predP06239Lck, predP06493CDK1, predP11309PIM1, predP12931Src, predP15056BRAF, predP15735PHK, predP17252PKCα, predP23443S6K1, predP23458JAK1, predP24941CDK2-CyclinA, predP27361ERK1, predP27448MARK3, predP27986PIK3, predP28028B-Raf, predP28482ERK2, predP31749AKT1, predP35968VEGFR2, predP36507MEK2, predP36888FLT3, predP41240CSK, predP45983JNK1, predP45984JNK2, predP48730CK1δ, predP49841GSK3β, predP51812RSK2, predP51955NEK2a, predP53778p38γMAPK, predP53779JNK3, predP98161PKD1, predQ02750MEK1, predQ05513PKCζ, predQ13131AMPK, predQ13188MST2, predQ13627DYRK1A, predQ14012CaMK1, predQ14680MELK, predQ15418RSK1, predQ15746SmMLCK, predQ16513PRK2, predQ16539p38αMAPK, predQ86V86PIM3, predQ8IW41PRAK, predQ8IWQ3BRSK2, predQ8N5S9CaMKKα, predQ8TD08ERK8, predQ92630DYRK2, predQ96RR4CaMKKβ, predQ96SB4SRPK1, predQ9BUB5MNK1, predQ9H2X6HIPK2, predQ9H422HIPK3, predQ9HBH9MNK2, predQ9P286PAK5, predQ9UQB9AuroraC
Rows (molecule name: heat-map values in column order; selectivity):
fedratinib: 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 1 2 1 0 1 -1 0 1 -1 0 0 -1 0 0 3 -1 0 0 0 0 0 1 2 1 0 1 0 0 -1 1 0 -1 0 0 0 -1 -1 0 -1 -1 0 0 -1 0 0 1 0 0 0 -2 0 1 1 0 2; selectivity 31%
ruxolitinib: 0 -1 -1 0 1 1 0 0 2 2 0 1 -2 -1 0 0 0 -1 -1 0 -3 2 -1 0 1 1 0 -3 0 0 3 1 2 1 -1 0 0 2 -1 1 1 0 0 0 0 0 -2 -2 0 0 0 -1 -2 0 -2 0 0 -2 -3 0 0 1 0 0 0 0 1 0 0 0 0; selectivity 24%
gefitinib: 0 -1 0 1 1 1 1 0 1 0 0 0 0 1 1 0 2 -3 1 1 1 -1 0 -1 0 -1 0 0 0 1 3 0 1 1 0 0 -2 0 -1 1 0 1 0 0 -1 0 0 -1 0 0 0 -1 1 0 -1 -1 0 1 -3 0 0 1 0 0 0 0 1 -1 -1 0 0; selectivity 30%
imatinib: 0 0 1 2 0 1 1 1 0 0 0 0 1 0 1 1 0 2 0 -1 0 0 0 0 -2 0 1 0 0 0 3 0 0 0 1 -2 -1 0 1 1 0 1 0 1 0 0 0 -1 0 0 -1 -1 0 0 -1 -1 1 0 -2 0 0 1 0 0 0 -3 0 1 1 0 2; selectivity 30%
staurosporine: 1 -1 2 2 1 1 1 1 2 2 1 2 0 0 0 3 -3 -4 2 2 2 1 2 4 4 0 3 -1 1 3 -8 -1 2 0 2 3 0 2 4 1 2 3 0 1 0 2 2 0 2 2 2 -1 2 2 2 3 5 3 2 1 1 1 1 2 0 4 0 0 1 1 2; selectivity 73%
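The selectivity column can be recomputed from the heat-map rows. The slide does not say what counts as "highly inhibited"; the sketch below assumes it means a heat-map value of at least 1, an inference that happens to reproduce the 31% shown for fedratinib and the 73% shown for staurosporine. The function name and the cutoff parameter are hypothetical.

```python
from typing import List

# Heat-map rows copied from the panel above (71 kinase models each).
FEDRATINIB = (
    "0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 1 2 1 0 1 -1 0 1 -1 0 0 -1 0 0 3 -1 0 0 "
    "0 0 0 1 2 1 0 1 0 0 -1 1 0 -1 0 0 0 -1 -1 0 -1 -1 0 0 -1 0 0 1 0 0 0 -2 "
    "0 1 1 0 2"
)
STAUROSPORINE = (
    "1 -1 2 2 1 1 1 1 2 2 1 2 0 0 0 3 -3 -4 2 2 2 1 2 4 4 0 3 -1 1 3 -8 -1 2 "
    "0 2 3 0 2 4 1 2 3 0 1 0 2 2 0 2 2 2 -1 2 2 2 3 5 3 2 1 1 1 1 2 0 4 0 0 "
    "1 1 2"
)


def parse_row(row: str) -> List[int]:
    """Turn a whitespace-separated heat-map row into integers."""
    return [int(value) for value in row.split()]


def selectivity_percent(values: List[int], cutoff: int = 1) -> int:
    """Percent of kinases whose value reaches the (assumed) cutoff."""
    inhibited = sum(1 for v in values if v >= cutoff)
    return round(100 * inhibited / len(values))


fedratinib = parse_row(FEDRATINIB)
staurosporine = parse_row(STAUROSPORINE)
print(selectivity_percent(fedratinib), selectivity_percent(staurosporine))
```

A lower percentage means fewer kinases are predicted to be hit, so the promiscuous inhibitor staurosporine scores far higher than the marketed drugs in the panel.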
Fedratinib – trials stopped due to cerebral edema in patients.
IRAK3 identified by pathway analysis as significant for brain
toxicity
Rinker, "Predicting adverse drug events using literature-based pathway analysis", 251st National Meeting of the
American Chemical Society, San Diego, CA, March 13-17, 2016
HeatMap for Kinase panel (repeated from the earlier slide)
Callouts: IRAK3, JAK2, JAK1
Conclusions
• This tool makes it very easy to answer many questions with
minimal time and effort, as shown.
  ➢ One can add the ability to search for specific scaffolds and combine it with a target query.
• It provides an unprecedented tool to explore QSAR by target, class and
other criteria.
• One can assess the likelihood of having sufficient data to make a model for a
given target.
  ➢ The automatic evaluation using a test set provides nearly instant feedback on the
prospect of making a predictive model.
• Because it uses the KNIME system, nearly any descriptor set can be
used for making the models.
  ➢ There are additional open-source and commercial descriptor tools available.
  ➢ One can probably get better results with better descriptors.
ADMET Properties influencing medicinal Chemistry design
Prediction of ADMET properties
Influencing medicinal chemistry design
Step 0
• logD7.4
• Solubility
• Protein Binding
• hERG
• Rat PPB
• Metabolic Stability
• CYP inhib
• Caco2
• etc.
Step 1
• logD7.4
• Protein Binding
• Solubility
• Metabolic Stability
• hERG
• etc.
Step 2
• Rat PPB
• Hu heps
• CYP inhib
• Caco2
• NaV1.5
• etc.
AstraZeneca’s global hERG QSAR model has contributed to the
reduction in the synthesis of ‘red flag’ compounds (compounds
measured to have a hERG potency of < 1 µM)
from 25.8% of all compounds tested in 2003 to only 6% in 2010.
Cumming, J.G., Davis, A.M. et al. Nat. Rev. Drug Disc. (2013) 12, 948–962
Prediction of Cardiotoxic drugs related to hERG blockade
How to avoid hERG inhibition?
QT interval prolongation can lead to serious arrhythmias, which can be fatal.
A large number of cases of QT prolongation induced by non-cardiac drugs have been reported, and
the number continues to increase, with the withdrawal of some blockbuster medicines.
hERG seems to be the main target behind this adverse side effect.
In silico models could be rapid and powerful tools to screen out potential hERG blockers as
early as possible during the discovery process.
Extraction & Methodology
hERG data set: 640 mol, molecules tested on hERG (Kv11.1)
Subsets of molecules, according to detailed biological protocols
2D molecular descriptor sets
Recursive Partitioning (RP), MOE QuaSAR Classify
Predictive models
Cross-validation
External validation
[Figure: representation of hERG ligands within the chemical space of the NCI database according to the first two PCA axes; legend: NCI database, hERG data set]
Model 3
Class  Threshold  Training / Test
HIGH   1 µM       50 / 46 mol
WEAK   10 µM      50 / 9 mol

             Training           Test
Descriptors  Relevant  P_VSA    Relevant  P_VSA
High         48/50     47/50    45/46     42/46
Weak         49/50     49/50    8/9       9/9
All          97%       96%      96%       93%
Correct classifications determined by 5-fold cross-validation (Training) and
by external validation (Test) for each descriptor set.
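The "All" row can be reproduced from the per-class counts. The short check below pools the High and Weak results for each descriptor set and rounds to a whole percent; the dictionary layout and function name are hypothetical, while the counts are the ones in the table.

```python
from typing import Dict, List, Tuple

# (correct, total) per class, in the order High, Weak.
model3: Dict[Tuple[str, str], List[Tuple[int, int]]] = {
    ("Training", "Relevant"): [(48, 50), (49, 50)],
    ("Training", "P_VSA"): [(47, 50), (49, 50)],
    ("Test", "Relevant"): [(45, 46), (8, 9)],
    ("Test", "P_VSA"): [(42, 46), (9, 9)],
}


def percent_correct(results: List[Tuple[int, int]]) -> int:
    """Pooled accuracy over all classes, rounded to a whole percent."""
    correct = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return round(100 * correct / total)


overall = {key: percent_correct(counts) for key, counts in model3.items()}
for (dataset, descriptors), pct in overall.items():
    print(f"{dataset:8s} {descriptors:8s} {pct}%")
```

Because the external test set contains only 9 weak blockers against 46 high-potency ones, the pooled test accuracy is dominated by the High class.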
[Figure: recursive partitioning decision trees with + / - branches, including the "Relevant P_VSA" tree. Descriptors shown at the splits: SlogP_VSA7, SlogP_VSA2, SMR_VSA6, PEOE_VSA+1, SMR_VSA5, SMR_VSA4, PEOE_VSA+0, SlogP, PEOE_VSA_FHYD, SMR_VSA1, vsa_pol]
Summary
➢ Reaxys Medicinal Chemistry enables the retrieval of high-quality datasets with
chemical diversity and homogeneous biological activities.
➢ Pertinent predictive models of hERG activity have been designed using
recursive partitioning analysis with 2D molecular descriptors.
➢ With Reaxys Medicinal Chemistry, a fast virtual screening approach could be
used as an early tool in the drug discovery process to avoid cardiotoxic side
effects related to hERG blockade.
Even when models perform well, designers need interpretable models.
Wrap Up
“They care mostly about
accessing our data through
an API, KNIME, or Pipeline Pilot”
“They want a product they can
use right out of the box”
The new Reaxys Medicinal Chemistry supports the hit-to-lead and lead
optimization processes by providing relevant, high-quality data to scientists.
Computational Chemists
High-quality data on many
different topics (efficacy, ADMET,
animal models)
Large amounts of data to build
models
Medicinal Chemists
Accessing the data through third
party tools
Reaxys Medicinal Chemistry is able to support both Computational and
Medicinal chemists.

How predictive models help Medicinal Chemists design better drugs_webinar

  • 1.
    Olivier Barberan Senior ProductManager Data driven drug discovery : How Predictive models help medicinal chemists design better drugs.
  • 2.
    Putting Data toWork | Today’s discussion • Elsevier is a partner and vendor to 100% of the world’s top pharmaceutical companies • It provides leading data solutions in chemistry, biology, drug safety and biomedical review • Our workflow solutions are based on clearly defined use cases BUT as with many data-driven companies, our data is used for specific purposes via different user interfaces. Our Challenge: How to externalize data for: • Complementary data integration • Bioactivity prediction • Use in additional use cases and workflows 2
  • 3.
    Putting Data toWork | The Challenge Pharma face : Investing in earlier development stages builds up the pipeline and reduces attrition from foreseeable causes 3 Target ID & Validation Lead ID & Validation Pre-clinical Clinical (Phase I to III) Post-Launch Characterize & understand disease Identify, design & validate leads Cull/prioritize leads Determine safety and efficacy profile Manage risk & compliance; improve patient care Source: Tufts Center for the study of drug development, Nov 2014 $125 M $773 M $200 M $1,460 M $3–5 B Cost (/NME) “We cannot fail for reasons we could have predicted. We should fail only for reasons we could not predict.” —Dr Moncef Slaoui Head of Global R&D, GSK • Low margin of safety is a major cause of attrition in Phase I and II • Lack of efficacy is a major cause of attrition in Phase II and III Better informed decisions at the Lead ID & Validation stage generates more optimized leads and mitigates failures and miss-investments 80% 65% 69% 12% Success Rate “Usability & Re-usability of data is critical to our success. We are experts at capturing data for purpose, and poor at making that data available for any purpose ” —Dr M. Ramsey, Chief Data Officer, GSK
  • 4.
    | 4 Excerption offacts – how Reaxys Medicinal Chemistry handles the literature Excerption of facts and Indexing of literature and patents is done by pharmacologists for Medicinal chemists Substance-record Bibliographic-record ExcerptionIndexing Bioactivity-record
  • 5.
    Potency & selectivity DMPK properties Physicalproperties Safety pharmacology The lead optimization challenge: Optimization of early substance to potential drug
  • 6.
    Hit Identification Leadgen. and optimization Drug discovery involves different personas and needs Medicinal chemists: “I need to optimize chemical structures in order to improve affinity, selectivity ADMET properties and decrease side effects” Computational chemists: “I need to find hits for druggable targets” (virtual screening) Integration / Professional Services In house data
  • 7.
  • 8.
    Hit to lead: Virtual screening
  • 9.
    Putting Data toWork | Ligand-based virtual screening – Using Reaxys Medicinal Chemistry Objective • Describe an In Silico Screening approach using Reaxys Medicinal Chemistry Case Study on T-Type calcium channels
  • 10.
    Putting Data toWork | Ligand-based in silico screening 730 Query structures Representation & Chemical Space Molecular descriptors & Fingerprints Virtual Screening Pharmacophoric Similarity N O N N N O N N N 314 Hits "Drug-like" Filtering 1. Molecular diversity and chemical originality 2. Compounds availability 39 compounds ordered for testing 28 M Substances Chemical space based on synthesized substances Pipeline pilot
  • 11.
    Putting Data toWork | 11 Biological testing of the 39 hits on Cav3.2 Electrophysiology experiments: Screening @10 µM on Cav3.2 T-Type channels 1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627 2930313233343536373839 0 25 50 75 100 Peakcurrentinhibition(%) 28 9 compounds with a % inhibition > 75% 15 compounds with a % inhibition >50% Compound # (@ 10µM) Back to referring slide Reaxys Medicinal Chemistry supports Ligand-based virtual screening approaches
  • 12.
    Putting Data toWork | Lead optimization Predictive Modeling of On-And-Off Target Bioactivities
  • 13.
    TITLE OF PRESENTATION| 13 Issue • Need to make best possible decisions to rank/decide on drug candidates • Use all available information to help make decisions about potential therapeutics Method • Build hundreds of models for on and off-target activities • Create a model building engine in KNIME that can make models for large lists of targets using data from Reaxys Medicinal Chemistry. • Selectivity panels. • Toxicology pathways (from pathway analysis). • PK/ADME models Challenge: Make best decisions based on data
  • 14.
    TITLE OF PRESENTATION| • Issue: some targets have thousands of measurements • Solution: Randomly select N examples from each decade of pX activity • Selecting from each decade provides a more balanced data set across the activity range • In this study only IC50 measurements are used, not % inhibition or Kd • N is from 50 to 200 in these examples. • Compounds are selected so that training and test sets have no overlap • Use random selection of both • For each compound the median pX value is chosen when multiple measurements are reported • pX = -log(molar activity) • Partial-least-squares correlation is used, selecting the optimal number of descriptor components • Model quality is assessed • I like RMS error of prediction because it most directly answers the question “how accurate do I expect the prediction to be?” • r2, F and other statistics are also computed. 14 Elements of the model making workflow
  • 15.
    TITLE OF PRESENTATION| 15 Random Selection to ‘Flatten’ Activity Distribution 1 2 43 65 7 8 9 10 -log(M) #of compounds 1 2 43 65 7 8 9 10 -log(M) #of compounds Raw reports – may be thousands for a popular target. Activities and structure distribution may skew statistics Randomly selected ‘N’ compounds per decade Result - diverse set of structures and more uniform distribution of activities
  • 16.
    TITLE OF PRESENTATION| High-Level workflow for creating models • Uses a list of UNIPROT target names, and NCBI synonyms • Attempts to create a model for each target • Rejects models for insufficient data, or low quality
  • 17.
    TITLE OF PRESENTATION| • Powerful statistical methods can select variables to create a model that looks great on the training set • However the real test is how well the model can predict activity for a compound not in the training set. • In these examples the Training and Test sets are not overlapping; no test compound is in the training set. • In this study the best metric for the model is the RMS Error of Prediction – the errors seen when predicting activities for compounds not in the training set. • This is useful to provide the “error bar” for the predictions. • The required accuracy depends on what you are using the prediction for • The Pearson r2 provides a metric of “fit” but is dependent on the range of the data and does not directly tell you how much error to expect. • Other metrics for prediction might include how similar the compound is any compounds in the training set. One might expect the predictions to be less reliable if the compound is very different. 17 The Importance of the Test Set
  • 18.
    TITLE OF PRESENTATION| 18 Model Engine • Creates series of Reaxys queries for each decade of activity • Uses PLS/R to create and evaluate models • Creates spreadsheet with RMS error of prediction and plots
  • 19.
    TITLE OF PRESENTATION| • The descriptors are properties computed for the compounds that are related to the activities • They encode elements of the structure and other properties of the compounds. • In this study we combined 2 sets of computed descriptors • RD Kit • CDK Toolkit 19 Descriptors
  • 20.
    TITLE OF PRESENTATION| • Results are similar for this ABL example. • For ‘final’ models larger selections or all data points can be used to increase robustness • The automation aspect makes repeated trials easy to do 20 Q: Since random sets are selected, what happens if you repeat trials? A: The results are nearly identical in a statistical sense Model RMS Error of Prediction Training Set 0.63 Test Set 1.6 Model RMS Error of Prediction Training Set 0.62 Test Set 1.0 y = 0.9924x R² = 0.8803 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9964x R² = 0.7402 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set y = 0.9924x R² = 0.8804 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 1.0049x R² = 0.4629 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set y = 0.9914x R² = 0.8597 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Training Set y = 0.9959x R² = 0.7053 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured Test Set Model RMS Error of Prediction Training Set 0.66 Test Set 1.0
  • 21.
    TITLE OF PRESENTATION| • Training set, test set • Statistics for training and test sets. 21 Spreadsheet created for each model
  • 22.
    TITLE OF PRESENTATION| 22 EGFR Model Example y = 0.9829x R² = 0.8205 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured EGFR Training Set y = 0.9765x R² = 0.7151 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured EGFR Test Set RMSEP Type target protein trainingSetSize testSetSize 0.97 Training Set P00533 EGFR 1583 1.2 Test Set P00533 EGFR 521
  • 23.
    TITLE OF PRESENTATION| RMSEP Type target protein trainingSetSize testSetSize 0.76 Training Set O60674 JAK2 1372 0.89 Test Set O60674 JAK2 446 23 JAK2 Model Example y = 0.9889x R² = 0.7638 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured JAK2 Training Set y = 0.9956x R² = 0.6749 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured JAK2 Test Set
  • 24.
    TITLE OF PRESENTATION| RMSEP Type target protein trainingSetSize testSetSize 0.65 Training Set P49841 GSK3β 1271 0.97 Test Set P49841 GSK3β 427 24 GSK3 Model Example y = 0.9911x R² = 0.813 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured GSK3β Training Set y = 1.0008x R² = 0.5678 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16pXPredicted pX Measured GSK3β Test Set
  • 25.
    TITLE OF PRESENTATION| RMSEP Type target protein trainingSetSize testSetSize 0.90 Training Set P98161 PKD1 60 1.05 Test Set P98161 PKD1 26 25 Example of Poor Model y = 0.9795x R² = -3.788 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured PDK1 Training Set y = 0.9867x R² = -15.88 0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 pXPredicted pX Measured PDK1 Test Set
  • 26.
    TITLE OF PRESENTATION| 26 Prediction of Activities • Reads in all models • Reads in a set of molecules in SDF format • Predicts activity for each model • Computes Z value for classification of active/inactive of pX > 6 • Classify as active/inactive based on desired confidence
  • 27.
    HeatMap for Kinase panel • Yellow to Black: increasing predicted inhibition based on Z value • Selectivity metric: % of kinases highly inhibited
    Kinases, in column order (UniProt accession and name): IRAK3, O00141 SGK1, O14757 CHK1, O14920 ikbkb, O14976 GAK, O15264 p38δ MAPK, O15530 PDK1, O43353 RIP2, O60674 JAK2, O75116 ROCK2, O75582 MSK1, O96013 PAK4, O96017 CHK2, P00519 ABL1, P00533 EGFR, P04049 c-Raf, P04626 ERBB2, P06239 Lck, P06493 CDK1, P11309 PIM1, P12931 Src, P15056 BRAF, P15735 PHK, P17252 PKCα, P23443 S6K1, P23458 JAK1, P24941 CDK2-CyclinA, P27361 ERK1, P27448 MARK3, P27986 PIK3, P28028 B-Raf, P28482 ERK2, P31749 AKT1, P35968 VEGFR2, P36507 MEK2, P36888 FLT3, P41240 CSK, P45983 JNK1, P45984 JNK2, P48730 CK1δ, P49841 GSK3β, P51812 RSK2, P51955 NEK2a, P53778 p38γ MAPK, P53779 JNK3, P98161 PKD1, Q02750 MEK1, Q05513 PKCζ, Q13131 AMPK, Q13188 MST2, Q13627 DYRK1A, Q14012 CaMK1, Q14680 MELK, Q15418 RSK1, Q15746 SmMLCK, Q16513 PRK2, Q16539 p38α MAPK, Q86V86 PIM3, Q8IW41 PRAK, Q8IWQ3 BRSK2, Q8N5S9 CaMKKα, Q8TD08 ERK8, Q92630 DYRK2, Q96RR4 CaMKKβ, Q96SB4 SRPK1, Q9BUB5 MNK1, Q9H2X6 HIPK2, Q9H422 HIPK3, Q9HBH9 MNK2, Q9P286 PAK5, Q9UQB9 AuroraC
    Predicted Z values per molecule, in the same column order, followed by the selectivity metric:
    fedratinib: 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 1 2 1 0 1 -1 0 1 -1 0 0 -1 0 0 3 -1 0 0 0 0 0 1 2 1 0 1 0 0 -1 1 0 -1 0 0 0 -1 -1 0 -1 -1 0 0 -1 0 0 1 0 0 0 -2 0 1 1 0 2 (selectivity 31%)
    ruxolitinib: 0 -1 -1 0 1 1 0 0 2 2 0 1 -2 -1 0 0 0 -1 -1 0 -3 2 -1 0 1 1 0 -3 0 0 3 1 2 1 -1 0 0 2 -1 1 1 0 0 0 0 0 -2 -2 0 0 0 -1 -2 0 -2 0 0 -2 -3 0 0 1 0 0 0 0 1 0 0 0 0 (selectivity 24%)
    gefitinib: 0 -1 0 1 1 1 1 0 1 0 0 0 0 1 1 0 2 -3 1 1 1 -1 0 -1 0 -1 0 0 0 1 3 0 1 1 0 0 -2 0 -1 1 0 1 0 0 -1 0 0 -1 0 0 0 -1 1 0 -1 -1 0 1 -3 0 0 1 0 0 0 0 1 -1 -1 0 0 (selectivity 30%)
    imatinib: 0 0 1 2 0 1 1 1 0 0 0 0 1 0 1 1 0 2 0 -1 0 0 0 0 -2 0 1 0 0 0 3 0 0 0 1 -2 -1 0 1 1 0 1 0 1 0 0 0 -1 0 0 -1 -1 0 0 -1 -1 1 0 -2 0 0 1 0 0 0 -3 0 1 1 0 2 (selectivity 30%)
    staurosporine: 1 -1 2 2 1 1 1 1 2 2 1 2 0 0 0 3 -3 -4 2 2 2 1 2 4 4 0 3 -1 1 3 -8 -1 2 0 2 3 0 2 4 1 2 3 0 1 0 2 2 0 2 2 2 -1 2 2 2 3 5 3 2 1 1 1 1 2 0 4 0 0 1 1 2 (selectivity 73%)
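The selectivity metric on this slide ("% of kinases highly inhibited") can be sketched in a few lines. The slide does not state the cutoff for "highly inhibited"; the cutoff of Z >= 1 used below is an assumption. With that cutoff, the function returns values that round to the slide's figures for the two rows copied here from the heatmap. The function name `selectivity` is illustrative, not part of any Reaxys or KNIME API.

```python
def selectivity(z_values, cutoff=1):
    """Percentage of kinases whose predicted Z value is at or above the cutoff.

    The cutoff of 1 is an assumption; the slide does not state it.
    """
    hits = sum(1 for z in z_values if z >= cutoff)
    return 100.0 * hits / len(z_values)


# Predicted Z values for the 71 kinases, copied from the heatmap rows.
fedratinib = [
    0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
    0, 0, 0, 0, 0, 0, 1, 2, 1, 0,
    1, -1, 0, 1, -1, 0, 0, -1, 0, 0,
    3, -1, 0, 0, 0, 0, 0, 1, 2, 1,
    0, 1, 0, 0, -1, 1, 0, -1, 0, 0,
    0, -1, -1, 0, -1, -1, 0, 0, -1, 0,
    0, 1, 0, 0, 0, -2, 0, 1, 1, 0,
    2,
]
ruxolitinib = [
    0, -1, -1, 0, 1, 1, 0, 0, 2, 2,
    0, 1, -2, -1, 0, 0, 0, -1, -1, 0,
    -3, 2, -1, 0, 1, 1, 0, -3, 0, 0,
    3, 1, 2, 1, -1, 0, 0, 2, -1, 1,
    1, 0, 0, 0, 0, 0, -2, -2, 0, 0,
    0, -1, -2, 0, -2, 0, 0, -2, -3, 0,
    0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
    0,
]

print(round(selectivity(fedratinib)))   # 31
print(round(selectivity(ruxolitinib)))  # 24
```

A lower percentage means fewer kinases are predicted to be strongly inhibited, i.e. a more selective compound under this metric.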
  • 28.
    Fedratinib: trials stopped due to cerebral edema in patients.
  • 29.
    IRAK3 identified by pathway analysis as significant for brain toxicity. Source: Rinker, "Predicting adverse drug events using literature-based pathway analysis", 251st National Meeting of the American Chemical Society, San Diego, CA, March 13-17, 2016.
  • 30.
  • 31.
    HeatMap for Kinase panel (the heatmap of slide 27, with the IRAK3, JAK2 and JAK1 columns highlighted) • Yellow to Black: increasing predicted inhibition based on Z value • Selectivity metric (% of kinases highly inhibited): fedratinib 31%, ruxolitinib 24%, gefitinib 30%, imatinib 30%, staurosporine 73%
  • 32.
    Conclusions • This tool makes it very easy to answer many questions with minimal time and effort, as shown. One can add the ability to search for specific scaffolds and combine it with a target query. • It provides an unprecedented tool to explore QSAR by target, class and other criteria. • One can assess the likelihood of having sufficient data to build a model for a given target: the automatic evaluation using a test set provides nearly instant feedback on the prospect of making a predictive model. • Because it uses the KNIME system, nearly any descriptor set can be used for making the models. There are additional open-source and commercial descriptor tools available. One can probably get better results with better descriptors.
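The "automatic evaluation using a test set" idea can be illustrated outside KNIME with a minimal hold-out check. This is only a sketch under stated assumptions: the model is a one-descriptor least-squares line, and the names `holdout_r2`, `min_records` and `test_fraction` as well as the threshold of 30 records are hypothetical choices, not settings from the workflow shown in the talk.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_r2(records, test_fraction=0.2, min_records=30, seed=0):
    """Return R^2 on a held-out test set, or None if there is too little data.

    records: list of (descriptor_value, activity) pairs for one target.
    min_records is a hypothetical sufficiency threshold.
    """
    if len(records) < min_records:
        return None  # not enough data to attempt a model
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(2, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    test_ys = [y for _, y in test]
    mean_y = sum(test_ys) / len(test_ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in test)
    ss_tot = sum((y - mean_y) ** 2 for y in test_ys)
    return 1.0 - ss_res / ss_tot


# Synthetic, perfectly linear data: the held-out R^2 should be close to 1.
synthetic = [(float(i), 2.0 * i + 1.0) for i in range(50)]
print(holdout_r2(synthetic))
print(holdout_r2(synthetic[:10]))  # None: below the sufficiency threshold
```

A held-out score near 1 suggests a target worth modeling; a score near or below 0 suggests the data or descriptors are not yet sufficient.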
  • 33.
    ADMET properties influencing medicinal chemistry design
  • 34.
    Prediction of ADMET properties influencing medicinal chemistry design • Step 0: logD7.4, solubility, protein binding, hERG, rat PPB, metabolic stability, CYP inhibition, Caco2, etc. • Step 1: logD7.4, protein binding, solubility, metabolic stability, hERG, etc. • Step 2: rat PPB, Hu heps, CYP inhibition, Caco2, NaV1.5, etc. • AstraZeneca's global hERG QSAR model has contributed to the reduction in the synthesis of 'red flag' compounds (compounds that are measured to have a hERG potency of <1 μM) from 25.8% of all compounds tested in 2003 to only 6% in 2010. Cumming, J.G., Davis, A.M. et al. Nat. Rev. Drug Disc. (2013) 12, 948–962
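The 'red flag' statistic quoted above is a simple proportion: the share of tested compounds whose measured hERG potency is below 1 μM. A minimal sketch follows; expressing potency as an IC50 in μM is an assumption, the function name is illustrative, and the example values are made up.

```python
RED_FLAG_THRESHOLD_UM = 1.0  # from the slide: hERG potency < 1 uM is a red flag


def red_flag_percentage(herg_ic50_um):
    """Percentage of tested compounds with hERG IC50 below the threshold.

    Treating 'potency' as an IC50 in micromolar is an assumption.
    """
    flagged = sum(1 for value in herg_ic50_um if value < RED_FLAG_THRESHOLD_UM)
    return 100.0 * flagged / len(herg_ic50_um)


# Made-up IC50 values (uM) for ten hypothetical compounds.
example = [0.3, 0.8, 2.5, 12.0, 30.0, 45.0, 7.1, 0.05, 15.0, 60.0]
print(red_flag_percentage(example))  # 30.0
```

Tracked per year over all compounds tested, this is the quantity that the citation reports falling from 25.8% to 6%.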
  • 35.
    Prediction of cardiotoxic drugs related to hERG blockade
  • 36.
    How to avoid hERG inhibition? QT interval prolongation can lead to serious arrhythmias, which can be fatal. A large and growing number of non-cardiac drugs have been reported to induce QT prolongation, and some blockbuster medicines have been withdrawn as a result. hERG seems to be the main target behind this adverse effect. In silico models could be rapid and powerful tools to screen out potential hERG blockers as early as possible during the discovery process.
  • 37.
    Extraction & Methodology • hERG data set: 640 molecules tested on hERG (Kv11.1) • Subsets of molecules according to detailed biological protocols • 2D molecular descriptor sets • Recursive Partitioning (RP), MOE QuaSAR-Classify • Predictive models • Cross-validation and external validation • [Figure: representation of hERG ligands within the chemical space of the NCI database according to the first two PCA axes; legend: NCI database, hERG data set]
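Recursive partitioning, the modeling technique named on this slide, can be sketched as a small classification tree that repeatedly picks the descriptor threshold giving the purest HIGH/WEAK split. The code below is a generic Gini-impurity version written for illustration; it is not the MOE QuaSAR-Classify implementation, and the descriptor values in the example are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (descriptor index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # candidate threshold between two observed values
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Recursively partition the data; leaves are class labels (strings)."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Made-up values for two hypothetical 2D descriptors per molecule.
descriptors = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]]
classes = ["WEAK", "WEAK", "HIGH", "HIGH"]
tree = build_tree(descriptors, classes)
print(predict(tree, [0.95, 3.0]))  # HIGH
```

Because every node is a single descriptor compared against a threshold, the resulting model can be read as a set of rules, which is the interpretability argument made later in the deck.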
  • 38.
    Model 3 • Classes: HIGH (1 µM threshold) and WEAK (10 µM threshold) • Molecules (training / test): High 50 / 46, Weak 50 / 9 • Correct classifications determined by 5-fold cross-validation (Training) and by external validation (Test) for each descriptor set:
                 Training               Test
                 Relevant    P_VSA      Relevant    P_VSA
        High     48/50       47/50      45/46       42/46
        Weak     49/50       49/50      8/9         9/9
        All      97%         96%        96%         93%
    [Figure: recursive partitioning trees for the Relevant and P_VSA descriptor sets, with + and - branches; descriptors shown at the nodes include SlogP_VSA7, SlogP_VSA2, SMR_VSA6, PEOE_VSA+1, SMR_VSA5, SMR_VSA4, PEOE_VSA+0, SlogP, PEOE_VSA_FHYD, SMR_VSA1 and vsa_pol]
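The "All" row follows from the per-class counts: overall accuracy is the total number of correct classifications divided by the total number of molecules. The short check below reproduces the four percentages from the High and Weak rows; the dictionary layout and function name are illustrative only.

```python
# (correct, total) per class, taken from the Model 3 table.
results = {
    "Training / Relevant": {"High": (48, 50), "Weak": (49, 50)},
    "Training / P_VSA": {"High": (47, 50), "Weak": (49, 50)},
    "Test / Relevant": {"High": (45, 46), "Weak": (8, 9)},
    "Test / P_VSA": {"High": (42, 46), "Weak": (9, 9)},
}


def overall_accuracy(per_class):
    """Overall percentage of correct classifications across all classes."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return 100.0 * correct / total


for name, per_class in results.items():
    print(name, round(overall_accuracy(per_class)))
# Training / Relevant 97
# Training / P_VSA 96
# Test / Relevant 96
# Test / P_VSA 93
```

Note that the test set is unbalanced (46 High versus 9 Weak), so the overall test figures are dominated by the High class.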
  • 39.
    Summary • Reaxys Medicinal Chemistry enables the retrieval of high-quality datasets with chemical diversity and homogeneous biological activities. • Pertinent predictive models of hERG activity have been designed using recursive partitioning analysis with 2D molecular descriptors. • From Reaxys Medicinal Chemistry, a fast virtual screening approach could be used as an early tool in the drug discovery process to avoid cardiotoxic side effects related to hERG blockade. • Even if models perform well, designers need interpretable models.
  • 40.
    Wrap Up • The new Reaxys Medicinal Chemistry supports the hit-to-lead and lead optimization processes by providing relevant, high-quality data to scientists. • Computational chemists: "They care mostly about accessing our data through API, KNIME, Pipeline Pilot." They need high-quality data on many different topics (efficacy, ADMET, animal models), large amounts of data to build models, and access to the data through third-party tools. • Medicinal chemists: "They want a product they can use right out of the box." • Reaxys Medicinal Chemistry is able to support both computational and medicinal chemists.

Editor's Notes

  • #7 Challenges: - Big data and increasing complexity - Multiple sources of information - A wide variety of tools and processes - End users from different departments, wide diversity of domain expertise - Finding the right information Needs: - All relevant data must be accessible and actionable - Information must be flexibly delivered - All content and solutions must integrate into the existing environment of tools and processes, and with other data sources - Large-scale data mining and processing must be possible - Comprehensive tools that can integrate and/or replace other tools.
  • #41 Medium-sized pharma companies sit in between; consequently, they are potentially interested in both approaches: product out of the box and content integration. The strategy to adopt depends on the personas involved in the deal: computational chemists, chemoinformaticians, etc. (HT data users) => Data + KNIME/API; chemists, medicinal chemists, etc. (LT data users) => UI.