4. NIH spent a decade funding HTS efforts as part of the MLSCN and MLPCN
By 2010, $576.6M in funding
Various definitions of a probe
Potency, selectivity, solubility and availability
Little has been done to learn from this work
5.
6. Lajiness et al. - 13 chemists assessed 22,000 compounds (2000 each) for drug or lead likeness.
Not consistent in rejecting undesirable compounds
(J Med Chem 2004, 47: 4891-6)
Hack et al. - 145 chemists asked to fill holes in a screening library
(J Chem Inf Model 2011, 51, 3275-86)
Kutchukian et al. - medicinal chemists surveyed on selecting fragments for a lead
Lack of consensus in compound selection
(PLOS ONE 2012, 7, e48476)
Since the Rule of 5 there has been considerable focus on more rules: ALERTS, PAINS, QED, BadApple, etc.
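To make the kind of rule mentioned above concrete, here is a minimal sketch of a Rule of 5 check. It assumes the four properties have already been computed elsewhere (the function name and the example values are hypothetical), and it does not attempt to reproduce ALERTS, PAINS, QED or BadApple, which need substructure matching or reference data.

```python
def rule_of_five_violations(mol_weight, logp, h_bond_donors, h_bond_acceptors):
    """Count Rule of 5 violations from precomputed properties.

    Thresholds: MW > 500, logP > 5, H-bond donors > 5, H-bond acceptors > 10.
    """
    checks = [
        mol_weight > 500,
        logp > 5,
        h_bond_donors > 5,
        h_bond_acceptors > 10,
    ]
    return sum(checks)


# Hypothetical property values, for illustration only.
small_polar = rule_of_five_violations(350.0, 2.1, 2, 5)
large_greasy = rule_of_five_violations(650.0, 6.2, 2, 5)
print(small_polar, large_greasy)
```

A compound is usually flagged only when more than one threshold is exceeded, so the count is returned rather than a yes/no answer.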
7. But do we really need a crowd?
Could 1 medicinal chemist be enough?
> 40 years experience
8. Chris Lipinski scored the original 64 compounds; he was close to the median
Found more probes since 2009
Now scored more than 300 NIH probes for desirability
Extensive due diligence
▪ Based on literature (public/private)
▪ Chemical Reactivity
10. Representing molecules of different classes from public and commercial databases
ML010 (CID 17757274)
valsartan (CID 60846)
CAS 1164083-19-5
US20120040982 (CID 57498937)
ML160 (CID 824820)
11. Properties from CDD
Properties from Discovery Studio
Higher molecular weight, rotatable bond count and heavy atom count are associated with desirable probes
12. Yellow - desirable
Blue - undesirable
Yellow - chemical probes
Blue - MicroSource Spectrum compounds
13. Desirable probes are less likely to be filtered by PAINS or BadApple as promiscuous than those scored as undesirable (Fisher's exact test, p<0.0001 for PAINS and p=0.04 for BadApple).
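For readers unfamiliar with the test quoted on this slide, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table (desirable/undesirable versus flagged/not flagged). The counts in the example are hypothetical and are not the counts behind the reported p-values; the function name is also an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(x):
        # Unnormalised hypergeometric weight (kept as an exact integer).
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(n, row1)


# Hypothetical counts: rows = undesirable / desirable,
# columns = flagged by a filter / not flagged.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

Working with integer weights avoids floating-point ties when deciding which tables count as "at least as extreme".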
14. 322 NIH MLP probes clustered into 44 groups using ECFP_6 fingerprints, with a Tanimoto similarity threshold of >0.11 for cluster membership.
Blue - desirable
Red - undesirable
Circle area is proportional to cluster size, and singletons are represented as a dot.
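The slide does not say which clustering algorithm was used. As an illustration only, here is a sketch of Tanimoto similarity on sparse-integer fingerprints plus a simple leader-style clustering that uses the >0.11 threshold; the clustering scheme and the toy fingerprints are assumptions, not the authors' method.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as integer collections."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_cluster(fingerprints, threshold=0.11):
    """Assign each fingerprint to the first cluster whose leader is similar enough.

    Returns a list of clusters, each a list of indices; the first index in
    each cluster is its leader.
    """
    clusters = []
    for idx, fp in enumerate(fingerprints):
        for members in clusters:
            if tanimoto(fp, fingerprints[members[0]]) > threshold:
                members.append(idx)
                break
        else:
            # No existing leader was similar enough: start a new cluster.
            clusters.append([idx])
    return clusters


# Toy fingerprints (made-up feature identifiers).
toy = [[1, 2, 3], [2, 3, 4], [10, 11], [10, 12]]
print(leader_cluster(toy))
```

A low threshold such as 0.11 yields a small number of broad clusters, which matches the 44 groups reported for 322 probes.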
15. Drug discovery is repetitive and there are 1000s of diseases
Drug discovery is high risk
Do we need robots or just smarter programs that discover the ideas we test?
16. What would happen if we could model Chris’s decisions?
NIH probes
Potential for other non medicinal chemists to benefit
Streamline scoring compounds, save time
17. FCFP_6 descriptors + 8 simple descriptors
Leave out 50% x 100 validation of Bayesian models
5 fold cross validation for n307 models
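The slides do not give the model equations. The sketch below shows one published formulation of a Laplacian-corrected naive Bayes classifier that is often paired with FCFP-style fingerprints; whether it matches the models described here in every detail is an assumption, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(fingerprints, labels):
    """Return a per-feature weight: log((A_f + 1) / (N_f * base_rate + 1)).

    N_f is the number of training molecules containing feature f and
    A_f the number of those that are labelled desirable (label 1).
    """
    base_rate = sum(labels) / len(labels)
    seen, active = Counter(), Counter()
    for fp, label in zip(fingerprints, labels):
        for feature in set(fp):
            seen[feature] += 1
            if label:
                active[feature] += 1
    return {
        f: math.log((active[f] + 1) / (seen[f] * base_rate + 1)) for f in seen
    }


def score(weights, fingerprint):
    """Sum the weights of known features; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(fingerprint))


# Toy training set: made-up feature identifiers, 1 = desirable.
fps = [[1, 2], [1, 3], [4, 5], [4, 6]]
labels = [1, 1, 0, 0]
weights = train_laplacian_bayes(fps, labels)
print(score(weights, [1, 2]), score(weights, [4, 5]))
```

The "+1" terms pull the estimate for rarely seen features toward zero, so a feature observed once cannot dominate the score.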
18.
19. • The colors on the heat map correspond to the value of
the indicated metric for each probe, listed vertically.
• The scale was normalized internally with green
corresponding to the optimal condition within each
metric.
20.
21. MODELS RESIDE IN PAPERS
NOT ACCESSIBLE… THIS IS UNDESIRABLE
How do we share them?
How do we use them?
22. Open Extended Connectivity Fingerprints
ECFP_6 FCFP_6
Collected,
deduplicated,
hashed
Sparse integers
• Invented for Pipeline Pilot: public method, proprietary details
• Often used with Bayesian models: many published papers
• Built a new implementation: open source, Java, CDK
– stable: fingerprints don't change with each new toolkit release
– well defined: easy to document precise steps
– easy to port: already migrated to iOS (Objective-C) for TB Mobile app
• Provides core basis feature for CDD open source model service
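The "collected, deduplicated, hashed" pipeline can be sketched with a toy circular fingerprint. This is a simplified illustration and not the CDK or Pipeline Pilot algorithm: atom typing is reduced to a single integer label, duplicates are removed only by hash value, and the CRC32 hash and all names are assumptions. ECFP_6 denotes a diameter of 6 bonds, hence the default radius of 3.

```python
import zlib


def stable_hash(values):
    """Hash a sequence of ints to a 32-bit integer that is stable across runs."""
    return zlib.crc32(",".join(str(v) for v in values).encode("ascii"))


def circular_fingerprint(atom_labels, bonds, radius=3):
    """Toy circular fingerprint returned as sorted sparse integers.

    atom_labels: one integer per atom (e.g. atomic number).
    bonds: (atom_i, atom_j, bond_order) tuples.
    """
    neighbors = {i: [] for i in range(len(atom_labels))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Layer 0: one identifier per atom.
    ids = [stable_hash((label,)) for label in atom_labels]
    features = set(ids)  # collecting into a set deduplicates

    for layer in range(1, radius + 1):
        new_ids = []
        for atom in range(len(atom_labels)):
            # Sort the neighbourhood so the result ignores atom ordering.
            env = sorted((order, ids[nbr]) for order, nbr in neighbors[atom])
            flat = [layer, ids[atom]]
            for order, nbr_id in env:
                flat.extend((order, nbr_id))
            new_ids.append(stable_hash(flat))
        ids = new_ids
        features.update(ids)
    return sorted(features)


# Ethanol heavy atoms C-C-O, described with atomic numbers and single bonds.
print(circular_fingerprint([6, 6, 8], [(0, 1, 1), (1, 2, 1)]))
```

Using an explicitly specified hash is what keeps such fingerprints stable from one release to the next, which is the property the slide highlights.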
23. Data + One Click =
Uses Bayesian algorithm and FCFP_6 fingerprints
24. Rebuilt the n307 model in CDD Models
3 fold cross validation
ROC = 0.69
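As a reminder of what the ROC value summarises, here is a minimal sketch that computes the area under the ROC curve as the fraction of (desirable, undesirable) pairs that a model ranks correctly. The scores and labels are made up, and the function name is an assumption.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts 1 for each positive scored above a negative, 0.5 for ties.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores and labels (1 = desirable).
print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so 0.69 indicates a modest but real ranking signal.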
25. http://goo.gl/PVkQeo
Making the data more accessible as we are drowning in molecules
[Figure: log database size (millions); axis from -1 to 3.5]
26. Ligand efficiency higher in undesirable compounds
Bayesian model preferable in classifying desirable compounds vs other molecule quality metrics
Model could improve probe selection and score libraries prior to more extensive due diligence
Probes could be scored by additional chemists dependent on needs, e.g. bias to CNS, anticancer
[Figure labels: CNS, Anticancer, NIH probes]
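Ligand efficiency is mentioned on this slide without a definition. One common approximation is LE ≈ 1.37 × pIC50 / (number of heavy atoms), in kcal/mol per heavy atom near room temperature. The sketch below uses that approximation; whether the deck used exactly this definition is an assumption, and the example values are hypothetical.

```python
def ligand_efficiency(pic50, heavy_atoms):
    """Approximate ligand efficiency in kcal/mol per heavy atom.

    Uses LE ~= 1.37 * pIC50 / heavy atom count, a common approximation.
    """
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return 1.37 * pic50 / heavy_atoms


# Hypothetical compounds with the same potency but different sizes.
print(ligand_efficiency(7.0, 25), ligand_efficiency(7.0, 40))
```

Because the heavy atom count is in the denominator, smaller compounds score higher at equal potency, which is consistent with the earlier observation that desirable probes tend to have more heavy atoms.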
27. Complexities in finding the NIH MLP probes in PubChem
Identifier and structure searches in CAS SciFinder™ reveal an extreme disclosure
The parallel worlds of commercial and public database disclosure do not completely intersect
Integration and intersections of databases and the need for bioassay ontology adoption
[Figure labels: Public, Commercial]
28. Need more collaboration or openness in terms of availability of chemistry and biology data.
Increased communication between the various databases, both public and proprietary
Major hurdles prevent this from happening: too much commercial value in proprietary databases
Clearly CAS and the other commercial vendors have to take notice
29. We acknowledge that the Bayesian model software within CDD was developed with support from Award Number 9R44TR000942-02 “Biocomputation across distributed private datasets to enhance drug discovery” from the NCATS.
SE gratefully acknowledges Biovia (formerly Accelrys) for providing Discovery Studio.
SE thanks Jeremy Yang for the link to BadApple.
30. Litterman NK, Lipinski CA, Bunin BA, Ekins S. Computational Prediction and Validation of an Expert's Evaluation of Chemical Probes. J Chem Inf Model. 2014 Oct 27;54(10):2996-3004. doi: 10.1021/ci500445u. Epub 2014 Oct 7.
Lipinski CA, Litterman N, Southan C, Williams AJ, Clark AM, Ekins S. The parallel worlds of public and commercial bioactive chemistry data. J Med Chem. Epub 2014 Nov 21.
Editor's Notes
From left to right: the documented probe is ML010 (CID 17757274); the drug is valsartan (CID 60846); a prophetic compound is CAS 1164083-19-5 from WO 2001056358 (not in PubChem or ChemSpider) 42; a text-extracted compound is from US20120040982 17 (CID 57498937); and one of the probes with incomplete data linkage is ML160 (CID 824820).