Exploring virtual compound space with Bayesian statistics

Search 10^12 virtual compounds in minutes with a Bayesian model made from 10^6 combinatorial chemistry derived compounds. This work has been published: http://pubs.acs.org/doi/abs/10.1021/ci900072g


Published in Technology , Education

Transcript

  • 1. Exploring virtual compound space with Bayesian statistics
    Willem P van Hoorn, Chemistry, Pfizer Global Research and Development, Sandwich, UK
    Pipeline Pilot UGM, San Diego, Mar 2007
  • 2. Overview
    • An embarrassment of riches
    • Methodology
    • Validation
    • Interpreting the results
    • How about the singleton file?
    • Implementation
    • Conclusions
  • 3. An embarrassment of riches
    • I have a compound. I want to make many analogues by combinatorial chemistry.
    • At Pfizer, libraries are made internally and by multiple external suppliers.
    • Chemist X used to know the entire protocol collection but even he has lost track.
    • The Pfizer virtual library is estimated at 10^12 compounds*
    • This number is huge. Suppose 1000 CPUs each process 1000 compounds a second: run time of >11 days per search, or a maximum of 31 searches a year.
    • Can’t this be done quicker?
    * As an aside: 10^12 is a trillion in US and modern British usage, and a billion in traditional British usage: http://en.wikipedia.org/wiki/Names_of_large_numbers
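The run-time estimate on this slide can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation; the function name is made up for illustration, and the inputs are the figures quoted on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_days(n_compounds, n_cpus, compounds_per_cpu_per_second):
    """Days needed to enumerate every compound once at the given throughput."""
    seconds = n_compounds / (n_cpus * compounds_per_cpu_per_second)
    return seconds / SECONDS_PER_DAY


# Figures from the slide: 10^12 compounds, 1000 CPUs, 1000 compounds/s per CPU.
days = brute_force_days(10**12, 1000, 1000)
searches_per_year = int(365 // days)
print(f"{days:.1f} days per search, {searches_per_year} searches per year")
```

With these inputs the search takes about 11.6 days, which is where the ">11 days" and "31 searches a year" figures come from.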
  • 4. Methods
  • 5. Bayesian Learning, single category
    • Fingerprints are calculated for each molecule
    • Check how often fingerprint bit is observed
    • And how often in “Good” compound
    • Assign weighting factor taking into account both activity ratio and sampling size
    • For instance: “Good”/Total ratio of 90/100 is statistically more relevant than 9/10
    • Model distinguishes “Good” from “Bad”
    • Prediction: likelihood molecule is “Good”
    • Standard component from Pipeline Pilot
    Figure labels: Data set (assay data); Fingerprint bits ~ substructures; “Good” Actives; “Bad” Inactives; Bayesian Model; Rev Thomas Bayes, ca 1702 - 1761
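The point about 90/100 versus 9/10 can be illustrated with a Laplacian-corrected weight, a form often described for this kind of Bayesian fingerprint model. Whether the Pipeline Pilot component uses exactly this formula is an assumption, and the base rate of 0.1 is a hypothetical value chosen for the example.

```python
import math


def bit_weight(n_good_with_bit, n_total_with_bit, base_rate):
    """Log of the Laplacian-corrected enrichment of "Good" compounds for one bit.

    base_rate is the overall fraction of "Good" compounds in the data set.
    The +1 terms pull sparsely sampled bits towards a weight of zero.
    """
    expected_good = n_total_with_bit * base_rate
    return math.log((n_good_with_bit + 1) / (expected_good + 1))


base_rate = 0.1  # hypothetical overall "Good" fraction
w_large = bit_weight(90, 100, base_rate)  # same 90% ratio, 100 observations
w_small = bit_weight(9, 10, base_rate)    # same 90% ratio, 10 observations
w_unseen = bit_weight(0, 0, base_rate)    # bit never observed
print(w_large, w_small, w_unseen)
```

Both bits have the same raw ratio, but the better-sampled one receives the larger weight, and a bit that was never observed contributes nothing.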
  • 6. Bayesian Learning, multiple categories
    • Data set contains multiple categories
    • For each category, one model is built
    • Model X describes what separates category X from all other categories
    • Available in Pipeline Pilot since version 4.5.2
    • Dataset: Pfizer library file
    • Category: library name
    • Prediction made: Library names ranked by likelihood compound originates from that library
    Figure labels: Pfizer library file; Fingerprint bits ~ substructures; Library 1 … Library N; Bayesian Models
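The one-model-per-category idea can be sketched as one-vs-rest training followed by ranking. This is a toy illustration, not the Pipeline Pilot implementation: the weight formula is the Laplacian-corrected form assumed for illustration, fingerprint bits are represented by hypothetical substructure names, and the library names are invented.

```python
import math
from collections import Counter, defaultdict


def train_models(training_set):
    """Build one weight table per library from (library_name, set_of_bits) pairs."""
    total = len(training_set)
    lib_sizes = Counter(lib for lib, _ in training_set)
    bit_totals = Counter()
    bit_in_lib = defaultdict(Counter)
    for lib, bits in training_set:
        for bit in bits:
            bit_totals[bit] += 1
            bit_in_lib[lib][bit] += 1
    models = {}
    for lib, size in lib_sizes.items():
        base_rate = size / total  # fraction of all compounds in this library
        models[lib] = {
            bit: math.log((bit_in_lib[lib][bit] + 1) / (n * base_rate + 1))
            for bit, n in bit_totals.items()
        }
    return models


def rank_libraries(models, query_bits, top_n=16):
    """Score the query against every library model and return the best top_n."""
    scores = {
        lib: sum(weights.get(bit, 0.0) for bit in query_bits)
        for lib, weights in models.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical toy training set: two libraries, bits named after substructures.
training = [
    ("LibA", {"amide", "pyridine"}),
    ("LibA", {"amide", "pyridine", "Cl"}),
    ("LibB", {"amide", "sulfonamide"}),
    ("LibB", {"sulfonamide", "Cl"}),
]
models = train_models(training)
ranking = rank_libraries(models, {"amide", "pyridine"})
print(ranking)
```

The query shares its distinguishing bit (pyridine) with LibA only, so LibA is ranked first; the amide bit, present in both libraries, contributes little either way.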
  • 7. Building the multi-category Bayesian models
    • Pfizer library file: all compounds made in-house and externally by combinatorial chemistry: O(6) compounds, O(3) libraries
    • 12.5K Pfizer singleton diversity subset
    Figure labels: Pfizer compound database; 50%; 50%
  • 8. A singleton library?
    • 12.5K diverse singleton set added as separate “library”
    • Set consists of diverse subset out of larger set of “clean” singleton compounds
    • Clean: R4.5 compliant, no structural alerts, reactive groups, etc
    • Bayesian model describes what these diverse compounds have in common vs. all library compounds.
    • Difference is not presence of nitro, aldehyde or other undesirable group
    • Difference must be in fragments (chemistry) not obtainable by library chemistry
    • High score for this model may indicate that one is outside library chemistry space
  • 9. Multi-category Bayesian predictions
    • A probe: UK-92480 (Sildenafil)
    • By default the top 16 libraries are calculated:
    Library: Singleton, Library1, Library2, ..., Library15
    Score: 104.57, 84.10, 43.97, ..., 12.63
  • 10. Bayesian predictions are exemplified by Nearest Neighbour search
    • Exemplify libraries by identifying nearest neighbours from library file, default top 6.
    • Final output: 16 x 6 = 96 compounds (one-plate screenable hypothesis)
    Figure labels: 1. Singleton (in file: 12500); R1; R2; 849914-95-0; 139755-82-1; 298214-47-8; no CAS; 155879-54-2; 223430-18-0; UK-A; UK-B; UK-C; UK-D; UK-E
    Library: Singleton, Library1, Library2, ..., Library15
    Score: 104.57, 84.10, 43.97, ..., 12.63
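The exemplification step can be sketched as a similarity ranking inside each predicted library. The slide does not say which similarity measure is used; Tanimoto similarity on fingerprint bit sets is assumed here, and all compound IDs, library names and bits below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbours(probe_bits, library_compounds, top_n=6):
    """Return the IDs of the top_n compounds most similar to the probe."""
    ranked = sorted(
        library_compounds.items(),
        key=lambda item: tanimoto(probe_bits, item[1]),
        reverse=True,
    )
    return [compound_id for compound_id, _ in ranked[:top_n]]


def exemplify(probe_bits, ranked_libraries, library_file, per_library=6):
    """Pick nearest neighbours from each predicted library, in ranked order."""
    return {
        lib: nearest_neighbours(probe_bits, library_file[lib], per_library)
        for lib in ranked_libraries
    }


# Hypothetical toy library file: library -> {compound_id: fingerprint bits}.
library_file = {
    "Library1": {"c1": {1, 2, 3, 4}, "c2": {1, 2}, "c3": {7, 8}},
    "Library2": {"c4": {1, 2, 3}, "c5": {9}},
}
probe = {1, 2, 3, 4}
plate = exemplify(probe, ["Library1", "Library2"], library_file, per_library=2)
print(plate)
```

With the defaults from the slide (16 libraries, 6 neighbours each) the same loop yields the 96 compounds of a one-plate hypothesis.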
  • 11. What is searched?
    • Virtual library: 25 x 31
    • Real library compounds (yellow): 19 x 25
    • Bayesian model fully covers 19 x 25 matrix despite being trained on only yellow products
    • All fingerprints in virtual red product are contained in crossed yellow products
    • Virtual products outside real library in area 1 share 1 monomer with real library: still (partially) within scope of Bayesian model
    • Only products in area 2 don’t share a monomer, probably less well predicted
    • Easier to be out of scope in virtual library without template or common core (amide formation)
    • Model based on O(6) compounds covers most of Pfizer virtual library O(12)
    • This coverage is at library ID level, not compound level
    Figure labels: areas 1, 1 and 2 of the virtual library matrix; crossed (x) yellow products
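The scope argument on this slide can be made concrete for a two-component library: a virtual product is best covered when both of its monomers occur in some real product, partially covered when one does, and least covered when neither does. The sketch below assumes products are identified by a pair of monomer names, which is a simplification introduced for illustration; all names are hypothetical.

```python
def monomer_overlap(virtual_product, real_products):
    """Classify a virtual (R1, R2) product by the monomers it shares with real products."""
    r1_seen = {r1 for r1, _ in real_products}
    r2_seen = {r2 for _, r2 in real_products}
    r1, r2 = virtual_product
    if (r1, r2) in real_products:
        return "real product"
    shared = (r1 in r1_seen) + (r2 in r2_seen)
    return {
        2: "both monomers seen",
        1: "one monomer seen (area 1)",
        0: "no monomer seen (area 2)",
    }[shared]


# Hypothetical real library: only two of the possible products were made.
real = {("A1", "B1"), ("A2", "B2")}
print(monomer_overlap(("A1", "B2"), real))  # both monomers seen
print(monomer_overlap(("A1", "B9"), real))  # one monomer seen (area 1)
print(monomer_overlap(("A9", "B9"), real))  # no monomer seen (area 2)
```

The first case corresponds to the red virtual product on the slide, whose fingerprint bits are all contained in the crossed real products.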
  • 12. A note on coverage of chemical space
    • Electron density around hydrogen atom is described as probability distribution
    • Electron density is never zero even at large distance
    • Chemists tend to apply cut-off (atomic radius)
    • Similarly, coverage of chemical space by Bayesian models is probabilistic
    • Coverage is therefore hard to define, having an amide in common is probably not coverage as a chemist would see it, but what is?
    • Time will tell whether a Bayesian score cut-off for chemical space coverage can be established
  • 13. Validation
  • 14. Random test set
    • Exclude singleton library from O(6) compounds, take random x% as test set: 9452 compounds
    • Top 5 predicted library ID compared with real library ID
    Test set          9452
    Found in top 1    9068, 96%
    Found in top 5    9411, 99.6%
    Not in top 5      41, 0.4%
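The recall metric used in this validation can be sketched as a top-k hit rate. The function and the toy predictions are illustrative only; the final lines simply recompute the percentages from the counts given on the slide.

```python
def top_k_recall(predictions, true_labels, k):
    """Fraction of compounds whose true library is among the first k predictions."""
    hits = sum(
        1 for ranked, truth in zip(predictions, true_labels) if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked predictions for three test compounds.
predictions = [
    ["LibA", "LibB", "LibC"],
    ["LibB", "LibA", "LibC"],
    ["LibC", "LibB", "LibA"],
]
truth = ["LibA", "LibA", "LibA"]
recall_top1 = top_k_recall(predictions, truth, 1)
recall_top2 = top_k_recall(predictions, truth, 2)

# Counts from the slide: 9452 test compounds, 9068 in top 1, 9411 in top 5.
pct_top1 = round(100 * 9068 / 9452)
pct_top5 = round(100 * 9411 / 9452, 1)
print(recall_top1, recall_top2, pct_top1, pct_top5)
```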
  • 15. 41 compounds with correct library not in top 5
    compound_ID        correct library_id   ranked_as  Bayesian score
    PF-A               Internal Library 1   349        -29.0651987519029
    Another PF number  Internal Library 2   243        -13.3644982689003
    Another PF number  Internal Library 3   69         -0.400118961865439
    Another PF number  Internal Library 4   63         0.614583090788282
    Another PF number  Internal Library 1   53         -0.13987606970494
    Another PF number  Internal Library 5   50         -7.32271642948761
    Another PF number  Internal Library 1   35         3.41709994966454
    Another PF number  Internal Library 3   22         9.57829190295786
    Another PF number  Internal Library 1   22         10.0504136444794
    Another PF number  External Library 1   20         22.8956131731457
    Another PF number  External Library 2   19         18.8320528385981
    Another PF number  Internal Library 3   15         12.1074842056827
    Another PF number  Internal Library 1   14         54.6179465790837
    Another PF number  Internal Library 3   13         16.6244027916311
    Another PF number  Internal Library 1   12         6.74173586963795
    Another PF number  Internal Library 1   12         17.0964105412622
    Another PF number  Internal Library 1   11         58.7994305333701
    Another PF number  External Library 3   10         58.2193181435516
    Another PF number  Internal Library 1   10         12.5102031415206
    Another PF number  Internal Library 6   9          19.4093857882624
    Another PF number  Internal Library 1   8          20.6383651456158
    Another PF number  External Library 3   8          73.0633503114444
    Another PF number  External Library 4   8          18.5429747446516
    Another PF number  Internal Library 1   8          36.9730841061725
    Another PF number  Internal Library 1   7          34.8859762378176
    Another PF number  Internal Library 3   7          17.3617539873978
    Another PF number  Internal Library 1   7          30.8582847036755
    Another PF number  Internal Library 1   7          41.848859585633
    Another PF number  External Library 5   7          25.8587564812026
    Another PF number  External Library 6   6          33.5395919145182
    Another PF number  External Library 7   6          39.9074521984672
    Another PF number  External Library 8   6          32.3248563198852
    Another PF number  External Library 9   6          23.9421542596281
    Another PF number  External Library 10  6          95.1965176091739
    Another PF number  Internal Library 1   6          53.8715604224809
    Another PF number  Internal Library 1   6          28.2709230508615
    Another PF number  Internal Library 1   6          48.1827060771728
    Another PF number  Internal Library 1   6          17.7689907755174
    Another PF number  Internal Library 7   6          57.5694207876578
    Another PF number  Internal Library 7   6          53.9529913359943
    Another PF number  Internal Library 7   6          56.184768481901
  • 16. Worst mispredicted: PF-A
    • PF-A: amide formation, Monomer 1 + Monomer 2
    • No registration error
    • General remark: in-house libraries have broad scope, therefore harder to predict
    • Internal library 1: 29,800 compounds registered, monomers known for 28,670
    • 120 of these contain Monomer 1, but only 1 compound contains Monomer 2: PF-A is an atypical product
  • 17. So what was found?
    • Bayesian predictions:
      • 1. External library 11: Amide formation
      • 2. External library 12: Amide formation
      • …
    • Similar products to probe
    • Same chemistry as Internal library 1
    • Not really a failure after all
    Figure labels: Similar to monomer 1; Similar to monomer 2
  • 18. Six Bayesian categorisation models are available
    • Fingerprint
      • ECFP_6
        • Based on atom type (C, N, O, etc)
        • Chemical descriptor
        • Default choice for similarity searching
        • Therefore default for Bayesian search
      • FCFP_6
        • Based on atom function (donor, acc, etc)
        • Functional descriptor
        • Default choice for activity modelling
        • Might give chemotype jump
    • How to compensate for different sizes of libraries in training set?
      • Not (no compensation)
        • Fastest
        • Works well
        • Therefore default
      • #Enrichment
        • Favours smaller libraries
        • Slower to calculate
      • #EstPGood
        • Favours larger libraries
        • Slower to calculate
  • 19. Recall of known library id as function of model
    • Exclude singleton library from O(6) compounds, take random x% as test set: 9452 compounds
    • Top 5 predicted library ID compared with real library ID
    Model            Found in top 1  Found in top 5  Not in top 5
    ECFP             9068, 96%       9411, 99.6%     41, 0.4%
    FCFP             8920, 94%       9367, 99.1%     85, 0.9%
    ECFP_Enrichment  5692, 60%       9247, 97.8%     205, 2.2%
    FCFP_Enrichment  6093, 64%       9344, 98.9%     108, 1.1%
    ECFP_EstPGood    8372, 89%       9439, 99.9%     13, 0.1%
    FCFP_EstPGood    8547, 90%       9441, 99.9%     11, 0.1%
  • 20. Comparison of six Bayesian models
    • Evaluation of absolute models at least twice as slow
    • “#Enrichment” models perform significantly worse
    • “#EstPGood” models perform better when looking at top 5 (but worse in top 1)
    • ECFP / FCFP fingerprint results very similar
    • Recall rate used as metric, but search method intended as idea generator
    • Hard to predict what chemist considers best idea, advice is to try more than one
  • 21. Interpreting the results
  • 22. Opening the Bayesian black box
    • Bayesian score ~ likelihood of compound originating from library
    • But how is this score derived?
    • Scitegic fingerprint bits ~ substructures
    • Bayesian model consists of fingerprint bits + weighting factors
    • Score is the sum of the weights of the substructures present in molecule
    • High scoring parts of molecule can be highlighted by these substructures
    Figure labels: Probe; Fingerprint bits; A library; Filter by weight; Color probe by FP
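Because the score is a plain sum of per-substructure weights, the contribution of each part of the probe can be read straight from the model. The sketch below assumes a model stored as a dictionary from fingerprint bit to weight; the bit names, weights and the 0.5 cut-off are hypothetical values chosen for illustration.

```python
def score_and_highlights(weights, probe_bits, min_weight=0.5):
    """Return the Bayesian score and the bits worth highlighting on the probe."""
    # Only bits known to the model contribute; unknown bits are ignored.
    present = {bit: weights[bit] for bit in probe_bits if bit in weights}
    score = sum(present.values())
    # "Filter by weight": keep the bits that contribute at least min_weight.
    highlights = sorted(
        (bit for bit, weight in present.items() if weight >= min_weight),
        key=lambda bit: present[bit],
        reverse=True,
    )
    return score, highlights


# Hypothetical library model and probe fingerprint.
library_weights = {
    "pyrazolopyrimidinone": 2.0,
    "sulfonamide": 1.25,
    "ethoxyphenyl": 0.25,
    "nitro": -1.5,
}
probe_bits = {"pyrazolopyrimidinone", "sulfonamide", "ethoxyphenyl", "unknown_bit"}
score, highlights = score_and_highlights(library_weights, probe_bits)
print(score, highlights)
```

The highlighted bits are the substructures one would colour on the probe to show what a given library "recognises".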
  • 23. Probe is highlighted by what each library recognises
    • Libraries shown: 1. Singleton; 2. In-house 1; 3. In-house 2; 4. External 1; 5. External 2; 6. External 3; 7. External 4; 8. External 5; 9. External 6; 10. External 7; 11. External 8; 12. External 9; 13. External 10; 14. External 11; 15. External 12; 16. External 13
    • In-house 1 yields compounds similar to right-hand side of probe
    • In-house 2 yields compounds similar to left-hand side of probe
  • 24. Highlighted probes compared to actual compounds retrieved
    Figure panels: 2. In-house 1; 3. In-house 2; 4. External 1
  • 25. How about the singleton file?
  • 26. How about the Pfizer singleton file?
    • 150 ≤ Mw ≤ 750; AlogP ≤ 7.5
    • Pass reactive group filters
    • O(6) compounds, liquid sample
    • O(5) compounds, solid sample
    • Calculate top 1 predicted library
    • Equals cluster by predicted library
    • Map singletons on combinatorial chemistry space
    Figure labels: Pfizer compound database; All singletons: O(6) compounds
  • 27. O(6) liquid singleton compounds mapped to O(3) libraries
    • As expected, “Singleton” library dominates
    • Generally: good spread
    • None mapped (size of library):
      • Library X1 (1)
      • Library X2 (11)
      • Library X3 (1)
      • Library X4 (2)
    Figure labels: Singleton; O(4)
  • 28. Pfizer solids and vendor compounds have been mapped to libraries
    • O(5) solid samples for which no liquid sample is available
    • O(6) structures from ChemNavigator not in Pfizer files
    Figure labels: 7 unmapped libraries; 6 unmapped libraries; O(4); O(5); Singleton; Singleton
  • 29. Mapped singleton/vendor compounds can be searched by similarity
    • Extend SAR
    • Spark idea for novel monomers, templates
    Figure labels: 16; 4 x 96; Library compounds; Singleton compounds, liquid; Singleton compounds, solid; Singleton compounds, vendor; 147676-92-4
  • 30. Implementation
  • 31. Bayesian search implemented as web service
    • 1. Query, one or more of:
      • (file of) Pfizer ID
      • SMILES (file)
      • mol/sd file
      • sketch
    • 2. Options, defaults are OK:
      • # of ranked libraries (16)
      • # of nearest neighbors (6)
      • Bayesian model
      • Unified output?
    • Output after ~5-10 min: pdf report with ranked libraries + NN examples
    Figure labels: User; Last model update, overview of coverage, etc; example report pages (ranked libraries 1. Singleton to 16. External 13; “1. Singleton (in file: 12500)” with nearest neighbours)
  • 32. A happy user
  • 33. Conclusions
    • Advantages
      • Fast and very accurate in retrieving known library IDs
      • Number of output libraries/compounds is tuneable
      • Based on registered compounds:
        • precedented chemistry
        • exemplified compounds ready for screening: instant hypothesis testing
      • Does not know chemistry, no need to encode chemical reactions
      • Coverage of singleton / vendor chemical space
      • Neat pdf output
      • Proven ability to jump chemotypes
    • Disadvantages
      • Novel libraries or monomers only “detected” once in registered product and models have been regenerated
      • Example products are real, more similar virtual products are probably missed
  • 34. Acknowledgements
    • Thanks for contributing ideas, challenges and/or willingness to test:
      • Andy Bell
      • Bruce Lefker
      • Dafydd Owen
      • Dan Kung
      • Graham Smith
      • Jens Loesel
      • Kevin Dack
      • Dave Rogers (Scitegic)