Library design and series selection by Pareto ranking


Pfool - Pareto fast optimisation of libraries: tool to design combinatorial libraries of infinite size with Pareto ranking

  1. 1. Library design and series selection by Pareto ranking. Willem P. van Hoorn & Robert T. Smith, Pfizer Global Research and Development, Sandwich, United Kingdom
  2. 2. Content
     • Why Pareto ranking?
     • Fast approximation to Pareto to weed out likely losers
     • Dealing with large libraries by random sampling
     • Recent results of libraries designed for target X
     • How were the target X series identified?
     • Conclusions
  3. 3. Pareto ranking, the art of compromise
     [Figure: compounds plotted on axes X and Y]
     • You want to be here (high X, high Y), but these are your compounds
     • Two gold compounds are better than the black compound: it is "dominated", as would be all other compounds in the shaded area
     • Five compounds are special (gold): going from one to the other, you can improve X or Y but not both. They are the best compromises (the Pareto front)
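The dominance rule on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the Pipeline Pilot component: the function names and the toy coordinates are made up, and it assumes higher is better in every dimension.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b in every dimension and
    strictly better in at least one (assumption: higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the non-dominated points, by pairwise comparison."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Toy (X, Y) scores: five compromise compounds plus two dominated ones.
compounds = [(1, 9), (3, 8), (5, 6), (7, 3), (9, 1), (4, 5), (2, 2)]
front = pareto_front(compounds)
print(front)
```

Moving along the returned front trades X against Y; the last two toy compounds are each beaten on both axes by at least one front member.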
  4. 4. Why not apply cut-offs?
     [Figure: X-Y plot with the regions Y > cut-off and X > cut-off marked]
     • Applying cut-offs results in settling 5 times for nearly the same (mediocre) compromise
     • 5 compounds on the Pareto front make, on average, the same compromise, but sample much more space
  5. 5. Pareto ranking, a summary
     • Well-established theory, especially in engineering
     • Probably the best method for sampling when there are conflicting objectives
     • Fast implementation available in Pipeline Pilot
     But:
     • Scales as N², prohibitively slow for larger N (the Pipeline Pilot component is limited to 10k compounds)
     • Will pick compounds you would never have picked (this is not a disadvantage, but you might find it unsettling)
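To make the N² cost concrete, here is a hedged sketch of ranking into successive Pareto fronts by repeatedly peeling off the non-dominated set. It is a naive illustration (names and data are invented, higher is assumed better), not the Pipeline Pilot implementation; every peel compares all remaining pairs, which is why large N becomes slow.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    # Assumption: higher is better in every dimension.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points: List[Sequence[float]]) -> List[int]:
    """Front number (1 = best) for each point, by peeling fronts off one at a time."""
    remaining = set(range(len(points)))
    ranks = [0] * len(points)
    front_no = 0
    while remaining:
        front_no += 1
        # Pairwise comparison over everything still unranked: O(N^2) per front.
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        for i in front:
            ranks[i] = front_no
        remaining -= set(front)
    return ranks

toy = [(1, 9), (3, 8), (5, 6), (7, 3), (9, 1), (4, 5), (2, 2)]
ranks = pareto_ranks(toy)
print(ranks)
```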
  6. 6. Increasing the speed of Pareto ranking. 1
     • Usually only the top X highest-ranking compounds are needed, therefore:
     • For each dimension, take the top X, combine and Pareto rank. Done!
     Unfortunately:
     • [Figure: 2D example, coloured by front]
     • There is a blind spot in the middle where compounds are missed
     • This approach does not work
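A toy illustration of the blind spot, under the assumption that higher is better on both axes. The data and the helper name `top_k_union` are hypothetical; the point is only that a well-balanced compound can sit on the Pareto front without being in the top X of any single dimension.

```python
from typing import List, Sequence, Set

def top_k_union(points: List[Sequence[float]], k: int) -> Set[int]:
    """Union of the indices of the top-k points taken separately in each dimension."""
    chosen: Set[int] = set()
    for d in range(len(points[0])):
        order = sorted(range(len(points)), key=lambda i: points[i][d], reverse=True)
        chosen.update(order[:k])
    return chosen

# Index 3 is the balanced compound (6, 6); nothing beats it on both axes.
toy = [(10, 0), (9, 1), (8, 2), (6, 6), (2, 8), (1, 9), (0, 10)]
shortlist = top_k_union(toy, k=2)
print(sorted(shortlist))
print(3 in shortlist)
```

Only the extremes of each axis survive the per-dimension cut, so the compromise compound in the middle never reaches the Pareto ranking step.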
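  7. 7. Increasing the speed of Pareto ranking. 2
     [Figure: X-Y plot with the worst value of X and the worst value of Y marked]
     • Approximate the 2D Pareto front as a circle: the best compounds are furthest away from the worst (X, Y)
     • Linear-scaling calculation: $R^2 = \Delta X^2 + \Delta Y^2$, where $\Delta X$ and $\Delta Y$ are the distances from the worst values of X and Y
     • The distance is also defined for higher dimensions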
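A minimal sketch of the distance-to-worst score as a pre-filter. Assumptions that are not on the slide: higher is better in every dimension, the worst corner is taken as the per-dimension minimum of the data, and no scaling is applied between dimensions. The function names are illustrative.

```python
from typing import List, Sequence

def r2_to_worst(points: List[Sequence[float]]) -> List[float]:
    """Squared distance of each point from the worst corner (per-dimension minimum)."""
    dims = range(len(points[0]))
    worst = [min(p[d] for p in points) for d in dims]
    return [sum((p[d] - worst[d]) ** 2 for d in dims) for p in points]

def r2_prefilter(points: List[Sequence[float]], keep: int) -> List[int]:
    """Indices of the `keep` points furthest from the worst corner (one pass plus a sort)."""
    scores = r2_to_worst(points)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return order[:keep]

toy = [(0, 0), (3, 4), (1, 1), (6, 0)]
print(r2_to_worst(toy))
print(r2_prefilter(toy, keep=2))
```

Only the shortlisted points would then go through the full pairwise Pareto ranking. In practice the dimensions would probably need to be put on comparable scales first, otherwise the largest-valued score dominates the distance.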
  8. 8. How the R2 approximation works in practice
     Example:
     • Enumerate a library of 500,000 compounds
     • Pareto-rank in 2D to get the top 500 compounds (gold standard)
     • Perform the R2 approximation, then Pareto-rank the top 10,000 to identify the top 500 compounds
     • 486/521 of the top compounds were successfully identified (93%)
     • A typical library size is 130 compounds
  9. 9. How the R2 approximation works in practice
     [Figure: what the R2 approximation missed (blue); what was included instead (red)]
  10. 10. Dealing with large virtual libraries
     • Enumeration "in vacuo" is fast
     • Processing of many products becomes unmanageable (time and memory)
     [Figure: workflow timings: 10M in ~42 min; 10M in ~59 min; Mw filter; one Bayesian model, 10M in ~9.5 hours]
  11. 11. Dealing with large virtual libraries by random sampling
     Assumption:
     • Certain monomers yield products with consistently higher scores than others
     Hot monomer analysis:
     • Enumerate a random subset of the library
     • Calculate the product scores (R2 distance to worst) of these compounds
     • Assign the scores to the constituent monomers of each product
     • For each monomer in the design, calculate the average product score
     • Take the top x monomers and send only these for full enumeration
     This has worked well in the past for 1D design. Does it work for multiple dimensions?
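The hot monomer steps above can be sketched as follows for a two-component (B x C) library. Everything here is illustrative: `product_score` stands in for the R2 distance-to-worst of an enumerated product, and the integer monomers and toy scoring function are invented for the example.

```python
import random
from collections import defaultdict

def hot_monomers(monomers_b, monomers_c, product_score, sample_size, top_n, seed=0):
    """Rank monomers by the average score of the randomly sampled products they occur in."""
    rng = random.Random(seed)
    n_c = len(monomers_c)
    totals = {"B": defaultdict(float), "C": defaultdict(float)}
    counts = {"B": defaultdict(int), "C": defaultdict(int)}
    # Sample product indices without enumerating the whole library.
    for idx in rng.sample(range(len(monomers_b) * n_c), sample_size):
        b, c = monomers_b[idx // n_c], monomers_c[idx % n_c]
        score = product_score(b, c)
        # Credit the product score to both constituent monomers.
        for pos, monomer in (("B", b), ("C", c)):
            totals[pos][monomer] += score
            counts[pos][monomer] += 1

    def top(pos):
        avg = {m: totals[pos][m] / counts[pos][m] for m in totals[pos]}
        return sorted(avg, key=avg.get, reverse=True)[:top_n]

    return top("B"), top("C")

# Hypothetical monomers and a toy additive score (higher is better).
b_monomers = list(range(20))
c_monomers = list(range(10))
toy_score = lambda b, c: 10 * b + c
hot_b, hot_c = hot_monomers(b_monomers, c_monomers, toy_score, sample_size=60, top_n=5)
print(hot_b, hot_c)
```

Monomers that never appear in the sample get no average at all, so the sample has to be large enough for every monomer to be seen several times.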
  12. 12. Example of R2 ranking / random product sampling
     • VRXN-3-00352 (Pfizer combinatorial chemistry reaction ID)
     • A x B x C = 1 x 1116 x 314
     • Virtual library (VL) = 350,424
     • Randomly sampled 11,142 products (3.2%)
     • Optimised in 3D: Bayesian models for target X (↑), target Y (↓), target Z (↓)
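As a quick arithmetic check of the numbers on this slide (variable names are arbitrary):

```python
# Library size from the monomer counts A x B x C.
n_a, n_b, n_c = 1, 1116, 314
library_size = n_a * n_b * n_c

# Fraction of the virtual library covered by the random sample.
sampled = 11_142
sampled_percent = round(100 * sampled / library_size, 1)

print(library_size, sampled_percent)
```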
  13. 13. Random 3.2% subset of VRXN-3-00352 coloured by R2
     [Figure: axes are the 1116 B monomers ranked by average R2 and the 314 C monomers ranked by average R2; the best 100 x 100 block is marked]
  14. 14. Full VL of VRXN-3-00352 with top 14 Pareto fronts
     [Figure: axes are the 1116 B monomers and the 314 C monomers, each ranked by average R2 over all products]
     • In red: Pareto-ranked compounds (11,327)
     • In blue: the rest (Pareto front >= 15)
     • (This takes about a weekend to calculate)
  15. 15. Top 100 monomers: R2 sampled versus true ranks
     [Figure: R2 rank on the full 350k library versus R2 rank on the sampled subset, for B and for C; top 100 marked on each axis]
     • Monomer B: in common 83; best rank of missing 52; worst rank of replacement 242
     • Monomer C: in common 93; best rank of missing 83; worst rank of replacement 145
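One way the three statistics on this slide could be computed is sketched below. The definitions are an interpretation, not taken from the slide: "missing" is read as monomers in the true top n that the sampled ranking leaves out, and "replacement" as monomers the sampled ranking puts in their place, both reported by their true rank. Names and data are hypothetical.

```python
from typing import Dict, List, Optional

def compare_top_n(true_order: List[str], sampled_order: List[str], n: int) -> Dict[str, Optional[int]]:
    """Compare the top-n of a sampled ranking against the top-n of the true ranking."""
    true_rank = {m: r for r, m in enumerate(true_order, start=1)}
    true_top, sampled_top = set(true_order[:n]), set(sampled_order[:n])
    missing = true_top - sampled_top        # truly good, but not picked
    replacements = sampled_top - true_top   # picked instead
    return {
        "in_common": len(true_top & sampled_top),
        "best_rank_of_missing": min((true_rank[m] for m in missing), default=None),
        "worst_rank_of_replacement": max((true_rank[m] for m in replacements), default=None),
    }

true_order = ["A", "B", "C", "D", "E", "F"]
sampled_order = ["A", "B", "D", "F", "C", "E"]
stats = compare_top_n(true_order, sampled_order, n=3)
print(stats)
```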
  16. 16. Compounds found by R2 sampling per Pareto front
     Pareto front:   1     2     3     4     5
     Contains:       171   246   337   445   523
     Success rate:   94%   82%   78%   71%   58%
     [Figure legend: compounds found by sampling and enumerating 100 x 100 monomers; compounds not found by the above]
     • A typical design contains ~100-200 compounds
  17. 17. Recent libraries designed using Pareto ranking
     • Project: target X
     • Wanted: selectivity over target Y and target Z
     • Targets X, Y and Z are all kinases
     • Compounds made so far with VRXN-3-00582 and VRXN-3-00352 show promise
     VRXN-3-00582 series:
     • Compounds screened so far: target Y and target Z selectivity tracks
     • Designed library of 120 compounds using target X, Y and Z models
     VRXN-3-00352 series:
     • Existing compounds were (designed to be) target Z selective
     • Can the selectivity be reversed?
     • Designed library of 120 compounds using target X, Y and Z models
  18. 18. What has Pareto done for me?
     [Figure: X score, Y score and Z score panels. Top graphs: model score distribution of the full VL; lower graphs: first Pareto front]
  19. 19. VRXN-3-00582 results
     • WP001398: difficult chemistry, only 22 compounds were made
     • Lorna Mitchell, Nunzio Sciammetta, Ian Marsh
     [Figure: X / Z selectivity and X / Y selectivity plots]
     • Compound values shown: X IC50 = 770 nM, Y IC50 = 9 µM, Z IC50 = 3.7 µM
     • Compound values shown: X IC50 = 997 nM, Y IC50 = 7.5 µM, Z IC50 = 11 µM
     • > 5-fold selective (5), inverse selective (2), inactive (14), non-selective (1)
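For reference, fold selectivity can be worked out from the IC50 values as the ratio of off-target to on-target IC50. The sketch below uses the common definition (an assumption; the slide does not spell it out) and reads the garbled units on the slide as µM, converted here to nM.

```python
def fold_selectivity(target_ic50_nm: float, off_target_ic50_nm: float) -> float:
    """Selectivity for the target over the off-target; values above 1 favour the target."""
    return off_target_ic50_nm / target_ic50_nm

# IC50 values in nM for the two compounds on the slide: (X, Y, Z).
compounds = [(770, 9000, 3700), (997, 7500, 11000)]
for x, y, z in compounds:
    print(round(fold_selectivity(x, y), 1), round(fold_selectivity(x, z), 1))
```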
  20. 20. VRXN-3-00352 results
     • WP001524: 77 compounds were made
     • Lorna Mitchell, Nunzio Sciammetta, Ian Marsh
     [Figure: X / Z selectivity and X / Y selectivity plots]
     • Compound values shown: X over Y = 40-fold, X over Z = 5.6-fold
     • Compound values shown: X over Y = 7-fold, X over Z = 13-fold
     • 8 compounds with the desired profile
  21. 21. How were these series found from HTS hits?
     Nightly scheduled download:
     • Download current project IC50 data: target X, Y, Z
     • Combine with previous target X, Y, Z screening data
     • For most compounds: incomplete set of experimental data
     • How to compare compounds / series of compounds?
     • Build Bayesian models for target X, Y, Z (Pipeline Pilot component)
     • For all compounds: complete set of 3 predictions available
     • Use Pareto sorting to find the best predicted selective compounds
     • How well does predicted activity/selectivity work?
  22. 22. Measured activity tracks with Bayesian model score
     [Figure: two panels, Bayesian score for target X versus Bayesian score for target Y; one coloured by X activity, one coloured by Y activity]
     • Red: IC50 > 2500 nM (inactive)
     • Yellow: IC50 <= 2500 nM (moderately active)
     • Blue: IC50 <= 250 nM (active)
  23. 23. Predicted vs experimental selectivity X over Y
     [Figure: Bayesian score for X versus Bayesian score for Y]
     • Light grey: outside top 10k
     • Dark grey: top 10k Pareto ranked, but no experimental selectivity
     • Red: < 10-fold selective
     • Yellow: < 50-fold selective
     • Blue: >= 50-fold selective
     • The area with the highest predicted selectivity includes multiple greys
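The colour bins in this legend amount to a simple threshold function on the measured fold selectivity. A small sketch (the function name is invented; the thresholds are the ones in the legend):

```python
def selectivity_colour(fold: float) -> str:
    """Map measured X-over-Y fold selectivity to the colour used in the plot legend."""
    if fold >= 50:
        return "blue"    # >= 50-fold selective
    if fold >= 10:
        return "yellow"  # 10-fold up to, but not including, 50-fold
    return "red"         # < 10-fold selective

for fold in (5.6, 40, 50):
    print(fold, selectivity_colour(fold))
```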
  24. 24. Predicted vs experimental selectivity X over Y
     [Figure: Bayesian score for X versus Bayesian score for Y; classes are Good (>50), Moderate (<50), Bad (<10)]
     • The area with the highest predicted selectivity has the highest proportion of truly selective compounds
  25. 25. Nearly identical picture for X over Z
     [Figure: two X/Z selectivity plots]
  26. 26. Ranking of series
     • Calculate the average of each model per series
     • Pareto rank the series by the 3 averaged model scores
     • Non-dominated series = best series
     • Focus attention/screening/design on these series
     [Figure: X-Y plots of compounds, series and best series, coloured by library ID]
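The series ranking steps above can be sketched as follows. This is an illustration under stated assumptions: all scores are oriented so that higher is better (for example, the off-target model scores would be negated first), and the series IDs and numbers are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    # Assumption: higher is better in every dimension.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def best_series(compounds: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Average each model score per series, then return the non-dominated series."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for series_id, scores in compounds:
        acc = sums.setdefault(series_id, [0.0] * len(scores))
        for d, s in enumerate(scores):
            acc[d] += s
        counts[series_id] += 1
    means = {sid: tuple(v / counts[sid] for v in acc) for sid, acc in sums.items()}
    return sorted(
        sid for sid, m in means.items()
        if not any(dominates(other, m) for oid, other in means.items() if oid != sid)
    )

# Hypothetical (series, (score X, score Y, score Z)) records.
toy = [
    ("S1", (8, 2, 1)), ("S1", (6, 4, 3)),
    ("S2", (3, 3, 3)), ("S2", (1, 1, 1)),
    ("S3", (2, 9, 9)),
]
winners = best_series(toy)
print(winners)
```

Averaging hides spread within a series, so a series with a few outstanding compounds and many poor ones can still be dominated.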
  27. 27. Results
     • The current lead series was found immediately among the many series identified in HTS
     • This series would have been found anyway, but these were not:
     [Figure: X-Y plot]
     • 48 compounds
     • 5 compounds with selectivity against Y
     • 3 of these also selective against Z
     • Original library designed in Z project to be Z selective
     • 103 compounds
     • 5 with proven selectivity against Y
     • But none selective against Z
  28. 28. Conclusions
     • Multi-dimensional design of very large virtual libraries enabled by:
       - R2 approximation to Pareto sort
       - Random product sampling to identify hot monomers
     • Automatic Pareto ranking / Bayesian modelling is useful to quickly identify selective series
     • Plans:
       - Include similarity to hit/lead as a design dimension in Pfool
       - Post-fix the design to prevent too many single-use monomers
  29. 29. July 18, 2006. More of science and decision making can be automated. Science 3 April 2009: Vol. 324, no. 5923, pp. 85-89. Pfizer sponsors a PhD position here. Interested? Contact me or Ross King (
  30. 30. Spare slides etc
  31. 31. Pareto library design (Pfool) workflow
     PGVL hub:
     • Entry point
     • Start new design
     • Retrieve existing design
     • Select monomers
     • Start enumeration
     • Select products
     Pfizer in-house tool to:
     • Finalise design
     • Register library
     Engine: Pipeline Pilot services monomer processing and enumeration
  32. 32. Pfool is started as Pipeline Pilot webservice
  33. 33. Pfool input
     Required input:
     • VRXN/protocol that is known to work
     • Design dimensions, <= 6 of:
       - Bayesian model from ADP
       - ADME Bayesian model
       - Your own model
       - Property calculable in Pipeline Pilot (Mw)
     Optional input:
     • Own list of monomers
     • Last step transformation
     [Screenshot: TargetX Bayesian model; Send to Spotfire]
  34. 34. Filter monomers in Spotfire
     [Screenshot: filters for Mw, AlogP, availability in Pfizer stores, and protecting groups]
  35. 35. Product enumeration parameters
  36. 36. Start the enumeration / Pareto ranking of products
     [Screenshot: "Job can be retrieved via:"; "Job has successfully started"]
     • You can log off when you see this
  37. 37. Accessing existing designs
     • (Re)send original monomers to Spotfire
     • (Re)send filtered monomers to Spotfire
     • Send designed products to Spotfire
     [Screenshot: TargetX, TargetY and TargetZ Bayesian models]
  38. 38. Accessing existing jobs
     • Create file for Pfizer in-house tool
     [Screenshot: TargetX and TargetY Bayesian models]
  39. 39. Finish / register design in PGVL hub