Predicting Pharmacology
  Willem van Hoorn, Pfizer Global Research & Development, Sandwich, UK
  [email_address]
  Pipeline Pilot UGM, San Diego, Mar 2006
Standing on the Shoulders of Giants
  Willem van Hoorn, with Gaia Paolini, Richard Shapland, Andrew Hopkins, Jonathan Mason
The Work of Giants
  • Unified DB: 4.8 M structures, 275k active compounds, 600k activities (IC50, etc.), 3k targets, 800 human targets
  • Sources: Inpharmatica StARLITe, Cerep Bioprint, Thomson IDDB, Pfizer in-house
  • Storage: Oracle / DayCart cartridge, structures stored as SMILES
  • Pipeline Pilot: canonical tautomers, salt stripping, etc.
  • Access: ODBC components + web service
  • Pfizer compound structure retrieval
Why Giants Are Required
Unified Database as Starting Point
  • Unified DB → polypharmacology interaction network
  • Unified DB → Bayesian (Learn Molecular Categories): predicting activities
  • Unified DB → Linear Discriminant Analysis (LDA): predicting gene families
Polypharmacology Network From Binding Data
  • Node: target
  • Edge: compound
  • [Figure: network coloured by gene class. Legend: Metalloproteases; Cysteine proteases; Serine proteases; Phosphodiesterases; Aminergic GPCRs; Peptide GPCRs; GPCRs (others: classes A, B & C); Enzymes (hydrolases, transferases, oxidoreductases & others); Ion Channels; Nuclear hormone receptors; Aspartyl proteases; Kinases; Miscellaneous]
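The node/edge definition above can be made concrete with a small sketch. This is an illustration only, not the code behind the slide: the function name `build_network`, the compound IDs, and the target names are all hypothetical, and two targets are assumed to be linked whenever at least one compound is active against both.

```python
from collections import defaultdict
from itertools import combinations


def build_network(activities):
    """Build a target-target network from (compound, target) activity pairs.

    Returns a dict mapping a sorted target pair to the set of compounds
    that are active against both targets (the 'edge' compounds).
    """
    targets_by_compound = defaultdict(set)
    for compound, target in activities:
        targets_by_compound[compound].add(target)

    edges = defaultdict(set)
    for compound, targets in targets_by_compound.items():
        # Every pair of targets hit by the same compound gets an edge.
        for a, b in combinations(sorted(targets), 2):
            edges[(a, b)].add(compound)
    return dict(edges)


# Hypothetical toy binding data, for illustration only.
activities = [
    ("cpd1", "5-HT2A"), ("cpd1", "D2"),
    ("cpd2", "D2"), ("cpd2", "5-HT2A"),
    ("cpd3", "PDE5"),
    ("cpd4", "PDE5"), ("cpd4", "PDE6"),
]
network = build_network(activities)
for (a, b), shared in sorted(network.items()):
    print(a, b, len(shared))
```

The number of shared compounds per edge could serve as an edge weight when drawing the network.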
Deriving Multi-Category Bayesian Model
  • Unified DB → 238k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, ≥ 10 actives / target
  • Descriptor: FCFP_6
  • Training set: 90% / 214k compounds → 698 models
  • Test set: 10% / 23,792 compounds, 55,781 activities
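A minimal sketch of how a multi-category model of this kind could be assembled, one two-class model per target. It assumes the Laplacian-corrected naive Bayes estimator commonly described for fingerprint-based Bayesian models; the slides do not state the estimator, so treat that as an assumption. Feature names, target names, and function names are hypothetical, and plain strings stand in for FCFP_6 features. The scores produced here are on a different scale from the cut-offs quoted in the slides.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """samples: list of (feature_set, is_active). Returns {feature: weight}.

    Weight per feature (assumed Laplacian correction):
        log((A_F + 1) / (N_F * base_rate + 1))
    where N_F = samples containing the feature, A_F = actives among them.
    """
    n_total = len(samples)
    n_active = sum(1 for _, active in samples if active)
    base_rate = n_active / n_total
    seen, seen_active = Counter(), Counter()
    for features, active in samples:
        for f in features:
            seen[f] += 1
            if active:
                seen_active[f] += 1
    return {f: math.log((seen_active[f] + 1) / (seen[f] * base_rate + 1))
            for f in seen}


def score(weights, features):
    """Sum the weights of the features present; unseen features add 0."""
    return sum(weights.get(f, 0.0) for f in features)


def train_multi_category(compounds):
    """compounds: list of (feature_set, set_of_active_targets).

    Trains one model per target; compounds not annotated as active on a
    target are treated as inactive for that target's model.
    """
    targets = set().union(*(t for _, t in compounds))
    return {t: train_laplacian_bayes([(f, t in ts) for f, ts in compounds])
            for t in targets}


# Hypothetical toy data: string "features" stand in for fingerprint bits.
compounds = [
    ({"a", "b"}, {"T1"}),
    ({"a", "c"}, {"T1"}),
    ({"c", "d"}, {"T2"}),
    ({"d", "e"}, {"T2"}),
]
models = train_multi_category(compounds)
print(round(score(models["T1"], {"a", "b"}), 3))
print(round(score(models["T1"], {"d", "e"}), 3))
```

Scoring a new compound against every model in `models` yields one score per target, which is the matrix of predictions assessed on the following slides.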
Assessing the Predictions of the Random Test Set
  • Large number of predictions: 23,792 * 698 ~ 16.6M
  • 55,781 activities, rest unknown, presumed inactive
  • Interpretation of Bayesian score?
      Score ≥ cut-off: active, rest inactive
      # predicted actives = F(cut-off)
  • Comparison with random:
      For each cut-off: calculate number of predicted actives
      Generate exactly the same number of random predicted actives
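The assessment procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original Pipeline Pilot protocol: `evaluate`, the score dictionary layout, and the toy compound/target names are hypothetical, and unknown compound/target pairs are counted as inactive, as on the slide.

```python
import random


def evaluate(scores, known_actives, cutoff, seed=0):
    """Count true/false positives and false negatives at a score cut-off.

    scores:        {(compound, target): bayesian_score} for every pair scored
    known_actives: set of (compound, target) pairs with measured activity
    Pairs scoring >= cutoff are predicted active; everything else inactive.
    The random baseline draws exactly as many pairs as were predicted.
    """
    predicted = {pair for pair, s in scores.items() if s >= cutoff}
    tp = len(predicted & known_actives)
    fp = len(predicted) - tp
    fn = len(known_actives) - tp

    rng = random.Random(seed)
    random_predicted = set(rng.sample(sorted(scores), len(predicted)))
    random_tp = len(random_predicted & known_actives)
    return {"predicted": len(predicted), "tp": tp, "fp": fp, "fn": fn,
            "random_tp": random_tp}


# Hypothetical toy scores and activities, for illustration only.
scores = {
    ("c1", "T1"): 80, ("c1", "T2"): 10,
    ("c2", "T1"): 60, ("c2", "T2"): -5,
    ("c3", "T1"): 55, ("c3", "T2"): 70,
}
known_actives = {("c1", "T1"), ("c3", "T2"), ("c2", "T2")}
result = evaluate(scores, known_actives, cutoff=50)
print(result)
```

Repeating the call over a range of cut-offs gives the number of predicted actives as a function of the cut-off, together with a matched random baseline at each point.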
Assessing the Predictions of the Random Test Set
  • Bayesian cut-off: 50
  • 58,428 predictions / 17,210 compounds
  • 16,281 compounds with ≥ 1 correct prediction
  • 31,600 true positives (random: 292): enrichment ~ 100 fold
  • 26,828 false positives (random: 55,489)
  • 24,181 false negatives
Predicted Polypharmacology Network At Bayesian Cut-off 50
  • [Figure: predicted network. Edge legend: true positive prediction, false positive prediction. Gene classes shown: Nuclear hormone receptors; Ion Channels; Phosphodiesterases; Aminergic GPCRs; Peptide GPCRs; GPCRs (others); Enzymes (others)]
Predicted Polypharmacology Network At Bayesian Cut-off 50
  • At confidence level 50, most predictions are intra gene class
  • Quite a few false positive connections coincide with true positives
      Exceptions: Ion Channels, Enzymes-others
      Although the prediction is wrong, the connection is right?
      Or the prediction is right and the connection is a false negative (not measured?)
      Most interesting part of predicted connections to test
  • Compare to Peter Willett's work in similarity searches: (next) nearest neighbours of inactive nearest neighbours are equally likely to be active as the nearest neighbours themselves: J. Med. Chem. 2005, 48, 7049
A More Challenging Test Set: Cerep Bioprint
  • Unified DB → 238k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, ≥ 10 actives / target
  • Descriptor: FCFP_6
  • Training set: 237k compounds → 694 models
  • Test set: Bioprint, 997 compounds, 316 targets
A More Challenging Test Set: Cerep Bioprint
  • Bayesian cut-off: 50
  • 720 predictions / 291 compounds
  • 210 compounds with ≥ 1 correct prediction
  • 433 true positives (random: 17): enrichment ~ 25 fold
  • 287 false positives (random: 55,489)
  • 12,281 false negatives
Another Look At The Same Data
  • Bayesian cut-off: 0
  • 36,222 predictions
  • 6,121 true positives
  • 30,101 false positives
  • 6,593 false negatives
  • 48% of actives in 11% of data
  • Plus 378 extra predicted targets
A More Challenging Test Set: Cerep Bioprint
  • Bioprint is harder to predict than the 10% random test set
  • Data can be interpreted depending on need
      Few high confidence predictions: appropriate for triaging HTS hits
      Many low confidence predictions: appropriate for risk assessment of a lead
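The summary figures for Bioprint follow from the counts on these slides. The short check below is a sketch that assumes "11% of data" means the predicted pairs as a fraction of the full 997 compound by 316 target matrix, and that the 378 extra targets are the models without a Bioprint counterpart; neither reading is stated explicitly on the slides.

```python
# Counts quoted on the Bioprint slides.
compounds, targets, models = 997, 316, 694

# Cut-off 50.
tp_50, random_tp_50 = 433, 17
enrichment_50 = tp_50 / random_tp_50

# Cut-off 0.
tp_0, fp_0, fn_0 = 6121, 30101, 6593
predictions_0 = tp_0 + fp_0                    # pairs predicted active
recall_0 = tp_0 / (tp_0 + fn_0)                # fraction of known actives found
fraction_of_matrix = predictions_0 / (compounds * targets)  # assumed meaning of "data"

# Models for targets that Bioprint does not cover (assumed interpretation).
extra_targets = models - targets

print(f"enrichment at cut-off 50: {enrichment_50:.1f} fold")
print(f"actives recovered at cut-off 0: {recall_0:.0%}")
print(f"fraction of compound x target matrix flagged: {fraction_of_matrix:.0%}")
print(f"extra predicted targets: {extra_targets}")
```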
Linear Discriminant Analysis
  • Similar to PCA, but where PCA tries to represent the classes, LDA tries to discover what distinguishes them
  • Compare the letters O and Q: PCA focuses on the circle, LDA on the tail
  • Web example: distinguish between genuine and false banknotes (H. Lohninger, Teach/Me Data Analysis, http://www.vias.org/tmdatanaleng)
  • [Figure: banknote with measured dimensions: length, height, left rim, bottom rim, diagonal]
  • Training set: 200 banknotes, 100 genuine / 100 forgeries

    NOTE   Length  Left   Right  Bottom  Top    Diagonal  Genuine
    BN1    214.8   131.0  131.1   9.000   9.700  141.0    true
    BN2    214.6   129.7  129.7   8.100   9.500  141.7    true
    BN3    214.8   129.7  129.7   8.700   9.600  142.2    true
    BN4    214.8   129.7  129.6   7.500  10.40   142.0    true
    BN5    215.0   129.6  129.7  10.40    7.700  141.8    true
    BN6    215.7   130.8  130.5   9.000  10.10   141.4    true
    BN7    215.5   129.5  129.7   7.900   9.600  141.6    true
    BN8    214.5   129.6  129.2   7.200  10.70   141.7    true
    BN9    214.9   129.4  129.7   8.200  11.00   141.9    true
    BN10   215.2   130.4  130.3   9.200  10.00   140.7    true
    …
    BN195  214.9   130.3  130.5  11.60   10.60   139.8    false
    BN196  215.0   130.4  130.3   9.900  12.10   139.6    false
    BN197  215.1   130.3  129.9  10.30   11.50   139.7    false
    BN198  214.8   130.3  130.4  10.60   11.10   140.0    false
    BN199  214.7   130.7  130.8  11.20   11.20   139.4    false
    BN200  214.3   129.9  129.9  10.20   11.50   139.6    false
Predicting Forgeries with LDA and Bayesian
  • LDA

    NOTE  Length  Left   Right  Bottom  Top    Diagonal  BankNotes  LD1
    BN1   215.1   130.0  129.8   9.100  10.20  141.5     true        2.501
    BN2   214.7   130.7  130.8  11.20   11.20  139.4     false      -4.561
    BN3   214.3   129.9  129.9  10.20   11.50  139.6     false      -3.390
    BN4   214.7   130.0  129.4   7.800  10.00  141.2     true        4.060

  • Bayesian

    NOTE  Length  Left   Right  Bottom  Top    Diagonal  BankNotesBayes
    BN1   215.1   130.0  129.8   9.100  10.20  141.5      1.992
    BN2   214.7   130.7  130.8  11.20   11.20  139.4     -6.611
    BN3   214.3   129.9  129.9  10.20   11.50  139.6     -6.341
    BN4   214.7   130.0  129.4   7.800  10.00  141.2      1.771
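To illustrate the idea behind the LD1 score, here is a minimal two-class Fisher discriminant in plain Python. It is a sketch under stated assumptions: it uses only two of the six descriptors (Bottom, Diagonal) and only the 16 training rows visible on the slide, with equal class priors, so its scores will not reproduce the LD1 values quoted above, which come from the full 200-note model. All function names are illustrative.

```python
def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]


def scatter(rows, m):
    """Within-class scatter terms (sxx, sxy, syy) around the class mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy


def fit_lda(genuine, forged):
    """Fisher direction w = Sw^-1 (mean_genuine - mean_forged), 2-D case."""
    mg, mf = mean(genuine), mean(forged)
    gxx, gxy, gyy = scatter(genuine, mg)
    fxx, fxy, fyy = scatter(forged, mf)
    sxx, sxy, syy = gxx + fxx, gxy + fxy, gyy + fyy
    det = sxx * syy - sxy * sxy
    dx, dy = mg[0] - mf[0], mg[1] - mf[1]
    # Explicit inverse of the 2x2 pooled scatter matrix applied to (dx, dy).
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    midpoint = ((mg[0] + mf[0]) / 2, (mg[1] + mf[1]) / 2)
    return w, midpoint


def ld_score(w, midpoint, note):
    """Positive score: closer to the genuine class; negative: forged."""
    return w[0] * (note[0] - midpoint[0]) + w[1] * (note[1] - midpoint[1])


# (Bottom, Diagonal) for the training rows shown on the slide.
genuine = [(9.0, 141.0), (8.1, 141.7), (8.7, 142.2), (7.5, 142.0),
           (10.4, 141.8), (9.0, 141.4), (7.9, 141.6), (7.2, 141.7),
           (8.2, 141.9), (9.2, 140.7)]
forged = [(11.6, 139.8), (9.9, 139.6), (10.3, 139.7), (10.6, 140.0),
          (11.2, 139.4), (10.2, 139.6)]
# The four test notes from the prediction slide.
test_notes = [(9.1, 141.5), (11.2, 139.4), (10.2, 139.6), (7.8, 141.2)]

w, midpoint = fit_lda(genuine, forged)
predictions = [ld_score(w, midpoint, n) > 0 for n in test_notes]
print(predictions)
```

Even with two descriptors, the sign of the score separates the four test notes in the same way as the BankNotes column above; the diagonal carries most of the discriminating weight in this subset.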
Predicting Gene Class by Physical Properties
  • Compounds binding to different gene classes possess different physical property distributions
  • [Figure: property distributions per gene class: Mw, clogP]
  • Can this be used to predict gene class from physical properties alone?
  • How does LDA compare to Bayesian?
Predicting Gene Class by Physical Properties
  • Unified DB → 148k actives (≤ 10 µM), human target, Mw < 1000, pass reactivity filter, binding to single target class only
  • 20 gene classes: Aminergic GPCRs; Aspartyl Proteases; Cysteine Proteases; Enzymes- others; GPCRs Class A- others; GPCRs Class B; GPCRs Class C; Hydrolases; Ion Channels- Ligand_Gated; Ion Channels- others; Kinases- others; Metalloproteases; Nuclear hormone receptors; Others; Oxidoreductases; PDEs; Peptide GPCRs; Protein Kinases; Serine Proteases; Transferases
Predicting Gene Class by Physical Properties
  • 10 descriptors: Molecular_Weight; Num_H_Acceptors; Num_H_Donors; Num_RotatableBonds; Molecular_PolarSurfaceArea; No_IonCenters; Molecular_Solubility; Molecular_SurfaceArea; ClogP *; Andrews*
  • 147,534 compounds: 118,118 (training) + 29,416 (test)
Predicting Gene Class by Physical Properties
  • Number of compounds per target class: experiment versus predicted (correct) for LDA, Bayes and random

    Target class                Experiment  LDA (correct)  Bayes (correct)  Random (correct)
    Transferases                   727         1 (0)        1012 (125)       1460 (36)
    Serine Proteases               913       349 (137)       792 (133)       1526 (53)
    Protein Kinases               2927      5309 (1423)      341 (147)       1488 (148)
    Peptide GPCRs                 5027      8123 (2811)     2809 (1135)      1461 (236)
    PDEs                          1178       791 (248)      2176 (392)       1468 (56)
    Oxidoreductases               1385       888 (241)      1437 (329)       1492 (54)
    Others                        3336      2638 (499)        90 (47)        1465 (167)
    Nuclear hormone receptors     1238       482 (163)      2083 (345)       1459 (53)
    Metalloproteases               849       279 (74)       1626 (293)       1515 (47)
    Kinases- others                198         0 (0)        1545 (100)       1430 (11)
    Ion Channels- others           594       152 (59)        964 (104)       1441 (29)
    Ion Channels- Ligand_Gated     764        47 (0)        2109 (280)       1448 (52)
    Hydrolases                     286         0 (0)         350 (42)        1461 (15)
    GPCRs Class C                  339         0 (0)        3346 (146)       1438 (29)
    GPCRs Class B                  226         1 (0)        2340 (115)       1477 (14)
    GPCRs Class A- others         2647      1268 (366)       962 (309)       1524 (117)
    Enzymes- others               2574      1969 (321)         1 (0)         1451 (135)
    Cysteine Proteases             252        75 (28)       1464 (73)        1470 (13)
    Aspartyl Proteases             728      1180 (613)      1670 (614)       1479 (29)
    Aminergic GPCRs               3228      5864 (2042)     2299 (902)       1463 (153)
    Total                        29416     29416 (9025)    29416 (5631)     29416 (1447)
Predicting Gene Class by Physical Properties
  • Enrichment over random: LDA ~ 6 fold, Bayes ~ 4 fold
  • Bayesian: more equal spread
  • LDA: some baskets contain too many eggs?
  • Some of the misclassifications might be true: many missing values
  • Unbiased and fast method to (pre)screen a large compound collection
  • Compare with other unbiased methods: docking, pharmacophore search
Conclusions
  • Data from heterogeneous sources can be combined in one knowledge base
  • Predictive Bayesian models can be derived from it
  • Models are adaptive: regenerate to incorporate the latest experimental results
  • Models are not a replacement for experiment
  • Models can lead to substantially lower screening investment
  • Drug design compared to supermarket stock inventory: just-in-time delivery vs. just enough screening
  • Don't discount simple molecular properties
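The enrichment figures and the "baskets and eggs" remark can be checked against the counts in the results table. The sketch below uses only the per-class predicted counts and the totals from that table; the "share of predictions in the three largest classes" is a measure chosen here for illustration, not one used on the slides.

```python
# Totals from the results table: correct predictions out of 29,416 test compounds.
test_compounds = 29416
correct = {"LDA": 9025, "Bayes": 5631, "Random": 1447}

# Number of compounds each method assigned to each of the 20 classes.
predicted_lda = [1, 349, 5309, 8123, 791, 888, 2638, 482, 279, 0,
                 152, 47, 0, 0, 1, 1268, 1969, 75, 1180, 5864]
predicted_bayes = [1012, 792, 341, 2809, 2176, 1437, 90, 2083, 1626, 1545,
                   964, 2109, 350, 3346, 2340, 962, 1, 1464, 1670, 2299]


def enrichment(method):
    """Correct predictions relative to the random assignment."""
    return correct[method] / correct["Random"]


def top3_share(predicted):
    """Fraction of all predictions that land in the three largest classes."""
    return sum(sorted(predicted, reverse=True)[:3]) / sum(predicted)


for method in ("LDA", "Bayes"):
    print(f"{method}: {enrichment(method):.1f} fold over random, "
          f"{correct[method] / test_compounds:.0%} correct")
print(f"random: {correct['Random'] / test_compounds:.0%} correct")
print(f"LDA top-3 class share:   {top3_share(predicted_lda):.0%}")
print(f"Bayes top-3 class share: {top3_share(predicted_bayes):.0%}")
```

The random rate of roughly 1 in 20 is what 20 classes would suggest, and the concentration of LDA predictions in a few large classes is one way to read the "too many eggs" comment.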
 
