ChEMBL UGM May 2011

Learn from the ChEMBL database what makes a compound drug-like


  1. 1. Lessons from ChEMBL<br />Willem P. van Hoorn<br />Senior Solutions Consultant<br />Willem.vanhoorn@accelrys.com<br />
  2. 2. Those who cannot remember the <br />past are condemned to repeat it<br />Contents<br />
  3. 3. ‘Nasties’<br />Things you would not like to see in your hits<br />Specifically: reactive/labile chemical groups<br />Is the compound still on the plate?<br />Activity due to (selective) non-covalent binding?<br />Some overlap with frequent hitters/aggregators<br />Peroxides, aldehydes, etc<br />Not ‘structural alerts’<br />Off-target toxicity<br />Toxic compounds after metabolic activation<br />hERG binders, anilines, etc<br />
  4. 4. This is not a new concept<br />If you are a chemist you know many of these<br />If you have been working in pharma you know more of these<br />Pharma companies probably all have their in-house list of ‘forbidden/risky/ugly’ structures<br />Some publications but no definitive public list<br />Thus reinvention of the wheel, wasted effort<br />
  5. 5. ChEMBL: <br />“the most comprehensive ever seen in a public database.” (Wikipedia)<br />“…cover a significant fraction of the SAR and discovery of modern drugs” (ChEMBL website)<br />This must be a good source to learn what goes<br />Experienced scientists who cared enough about compounds to measure the activity and submit the results to peer-reviewed journals<br />ChEMBL as a teacher<br />
  6. 6. To learn we also need to know what not to do:<br />Compound vendor catalogues<br />Fewer constraints on reactivity / stability<br />Drive for diversity<br />More customers than just pharma:<br />Should be enriched in nasties compared to ChEMBL<br />ChEMBL as a teacher<br />
  7. 7. Lesson 1<br />ChEMBL<br />Release 7<br />Dump all compounds, keep largest fragment<br />Unique canonical smiles: 597,255<br />Vendor reagents<br />Pipeline Pilot examples: Maybridge + Asinex<br />186,967 unique compounds<br />Build Bayesian model ‘reagentlike’<br />Vendor “good” v. ChEMBL “baseline”<br />What do reagents have in common that ChEMBL compounds don’t?<br />
  8. 8. Training/Test: Random 80% / 20%<br />Excellent separation ChEMBL / Reagent<br />Reagentlike model<br />Leave-one-out enrichment<br />Test set enrichment<br />
  9. 9. Done?<br />
  10. 10. A look at high and low scoring compounds<br />Colour atoms by contribution to Bayesian score<br />Red: high contribution: reagent-like<br />Blue: low contribution: not reagent-like<br />Color gradient over set of molecules<br />
  11. 11. High scoring molecules<br />
  12. 12. More high scoring molecules<br />They do contain ‘nasty’ groups…<br />But they don’t stand out against the rest of the molecules (all red).<br />
  13. 13. Low scoring molecules<br />Etc<br />
  14. 14. High scoring features<br />Low scoring features<br />High and low scoring reagent features<br />Seen 1029 times, of which in reagent set 1024 times<br />Many variations of peptide bond and other polypeptide features:<br />635 out of 639 in reagent set<br />
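As a rough illustration of why a feature seen 1029 times, 1024 of them in the reagent set, scores so highly, the sketch below assumes the Laplacian-corrected estimator often described for this kind of Bayesian categorisation model: weight = ln((A + 1) / (T·P + 1)), where T is the number of samples containing the feature, A the number of those in the "good" class, and P the overall fraction of "good" samples. The slides state neither the estimator nor the base rate, so the formula, the function name `laplacian_weight`, and the use of the full-set sizes from Lesson 1 are all assumptions.

```python
import math

def laplacian_weight(n_good_with_feature, n_with_feature, base_rate):
    """Laplacian-corrected log weight of one feature (assumed estimator)."""
    return math.log((n_good_with_feature + 1) / (n_with_feature * base_rate + 1))

# Assumption: base rate taken from the full-set sizes quoted in Lesson 1,
# not from the actual 80% training split.
N_REAGENT = 186_967
N_CHEMBL = 597_255
BASE_RATE = N_REAGENT / (N_REAGENT + N_CHEMBL)

# Feature from the slide: seen 1029 times, 1024 of them in the reagent set.
reagent_feature = laplacian_weight(1024, 1029, BASE_RATE)
# Hypothetical counter-example: same frequency, but never seen in a reagent.
chembl_feature = laplacian_weight(0, 1029, BASE_RATE)

print(f"reagent-like feature weight: {reagent_feature:.2f}")
print(f"ChEMBL-like feature weight:  {chembl_feature:.2f}")
```

Under these assumptions a feature that is almost exclusive to one set gets a clearly positive weight, a feature absent from it gets a strongly negative one, and a feature that was never observed gets a weight of zero.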
  15. 15. Learning the difference was too easy<br />Small organic vs large polypeptide<br />Both sets contain many series, model learns common core instead of (nasty) decorations<br />Metric: compounds / Murcko frames<br />ChEMBL: ~6.7, reagent: ~9.0 <br />Number of frames / in common: ~81k / ~6k<br />I need to resit this class<br />Conclusions from lesson 1<br />
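The compounds-per-frame metric can be sketched as follows. This is a minimal illustration, assuming the Murcko frame of each compound has already been computed by a cheminformatics toolkit and is available as a string; the frame strings below are hypothetical toy data, not values from ChEMBL or the vendor sets.

```python
from collections import Counter

def compounds_per_frame(frames):
    """Average number of compounds sharing each distinct frame."""
    counts = Counter(frames)
    return len(frames) / len(counts)

# Hypothetical toy data: one frame identifier per compound.
chembl_frames = ["c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
reagent_frames = ["c1ccccc1", "c1ccccc1", "c1ccccc1",
                  "c1ccsc1", "c1ccsc1", "c1ccsc1"]

chembl_ratio = compounds_per_frame(chembl_frames)
reagent_ratio = compounds_per_frame(reagent_frames)

# Frames found in both sets, and all distinct frames overall.
shared_frames = set(chembl_frames) & set(reagent_frames)
total_frames = set(chembl_frames) | set(reagent_frames)

print(f"ChEMBL compounds/frame:  {chembl_ratio:.2f}")
print(f"reagent compounds/frame: {reagent_ratio:.2f}")
print(f"frames total / in common: {len(total_frames)} / {len(shared_frames)}")
```

A high compounds-per-frame ratio together with a small overlap of frames means a model can separate the two sets by recognising cores alone, which is the failure described on this slide.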
  16. 16. Restrict to organic small molecules<br />AlogP < 6, Mw < 600, organic compound filter<br />Bayesian Model<br />ECFP_2 (smaller features compared to ECFP_6)<br />Less likely to capture whole core<br />Lesson 2: Rebalancing the training set<br />
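The small-molecule restriction amounts to a simple property filter. The sketch below assumes AlogP, molecular weight, and an "is organic" flag have been computed upstream; the `Compound` class, the function name, and all property values are hypothetical and only illustrate the thresholds quoted on the slide.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    alogp: float
    mol_weight: float
    is_organic: bool  # assumed result of an upstream organic compound filter

def is_small_organic(c, max_alogp=6.0, max_mw=600.0):
    """Apply the slide's thresholds: AlogP < 6, Mw < 600, organic only."""
    return c.is_organic and c.alogp < max_alogp and c.mol_weight < max_mw

# Hypothetical compounds with made-up property values.
compounds = [
    Compound("small_aromatic", alogp=2.1, mol_weight=310.4, is_organic=True),
    Compound("large_polypeptide", alogp=-3.5, mol_weight=1450.0, is_organic=True),
    Compound("greasy_lipid", alogp=8.2, mol_weight=540.0, is_organic=True),
    Compound("metal_complex", alogp=1.0, mol_weight=420.0, is_organic=False),
]

kept = [c.name for c in compounds if is_small_organic(c)]
print(kept)
```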
  17. 17. Still a predictive model<br />
  18. 18. A typical high scoring compound <br />~neutral score for parts presumed common to both sets like phenyl<br />~positive score for nasty parts<br />
  19. 19. Low scoring example<br />Many sugars, phosphates, steroids, etc<br />
  20. 20. High scoring features<br />Low scoring features<br />Some ECFP_2 features<br />
  21. 21. Less learning of “series by template”<br />But it still happens, don’t need to capture whole ring to capture sugar, steroid, etc<br />Some of expected nasty features found<br />But many are not<br />Better training set needed<br />Series: similar in both clean/nasty training set, so that difference is not the template<br />Many ChEMBL compounds are odd<br />I have still not learned the lesson<br />Conclusions from lesson 2<br />
  22. 22. ChEMBL:<br />What I should have started with:<br />All compounds with IC50 or Ki expressed in nM, <br />Against human target,<br />Include reference: journal, volume, year, page<br />569,569 activities<br />223,896 compounds<br />14,383 references<br />Lesson 3: Learning from (big) pharma<br />
  23. 23. Looking up author affiliation in PubMed<br />NCBI Entrez Utilities Web Service (Text Analytics component collection)<br />This takes ~4 hours in a weekend (PubMed usage restriction)<br />13,410 references<br />564,422 activities<br />214,747 compounds<br />Something wrong with some BMCL refs?<br />
  26. 26. Top 10 affiliations<br />
  27. 27. Where is Pfizer?<br />And 318 more… Similar for other contributors<br />
  28. 28. if DocAuthorsAffiliation rlike 'univers|Faculty|hospital|National.*Institute.*Health|Polytechnic' then<br />Published_by := 'Academic';<br />elsif DocAuthorsAffiliation rlike 'Pfizer' then<br />Published_by := 'Pfizer';<br />elsif DocAuthorsAffiliation rlike 'warner.*lambert|parke.*davis' then<br />Published_by := 'Warner-Lambert';<br />elsif DocAuthorsAffiliation rlike 'Pharmacia|Upjohn' then<br />Published_by := 'Pharmacia';<br />elsif DocAuthorsAffiliation rlike 'Wyeth' then<br />Published_by := 'Wyeth';<br />elsif DocAuthorsAffiliation rlike 'Merck' then<br />Published_by := 'Merck';<br />…<br />else<br />Published_by := 'Other';<br />end if;<br />Merging affiliations<br />
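For readers without Pipeline Pilot, the same first-match-wins rule chain can be approximated in Python. This is a sketch, not the author's implementation: it contains only the rules visible on the slide (the rules elided by "…" are not reconstructed), and it assumes that `rlike` performs an unanchored, case-insensitive regular-expression match. The function name `published_by` and the example affiliation strings are hypothetical.

```python
import re

# Rules as shown on the slide, in order. Further rules are elided there.
RULES = [
    ("Academic", r"univers|Faculty|hospital|National.*Institute.*Health|Polytechnic"),
    ("Pfizer", r"Pfizer"),
    ("Warner-Lambert", r"warner.*lambert|parke.*davis"),
    ("Pharmacia", r"Pharmacia|Upjohn"),
    ("Wyeth", r"Wyeth"),
    ("Merck", r"Merck"),
]
# Assumption: matching is case-insensitive and unanchored.
COMPILED = [(label, re.compile(pattern, re.IGNORECASE)) for label, pattern in RULES]

def published_by(affiliation):
    """Return the label of the first rule that matches, else 'Other'."""
    for label, pattern in COMPILED:
        if pattern.search(affiliation):
            return label
    return "Other"

# Hypothetical affiliation strings.
for text in [
    "Department of Chemistry, University of Example",
    "Pfizer Global Research and Development",
    "Parke-Davis Pharmaceutical Research",
    "Example Biotech Inc.",
]:
    print(f"{text!r} -> {published_by(text)}")
```

Because the first matching rule wins, an affiliation that mentions both a university and a company is labelled 'Academic'; the order of the rules is therefore part of the classification.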
  29. 29. Ranked contributors to ChEMBL<br />
  30. 30. Creating balanced training/test sets<br />Affiliation: Pharma, Other, Academic<br />Keep 602 targets for which measured activities are available for all 3 affiliations<br />Same target, same pharmacophore, some me-too work: less series learning<br />
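Selecting the targets that have measured activities from all three affiliation classes is a set-coverage check. The sketch below assumes each activity record has been reduced to a (target, affiliation) pair; the target names and the function name `targets_covered_by_all` are hypothetical.

```python
from collections import defaultdict

AFFILIATIONS = {"Pharma", "Academic", "Other"}

def targets_covered_by_all(activities):
    """Return the targets that have at least one activity per affiliation."""
    seen = defaultdict(set)
    for target, affiliation in activities:
        seen[target].add(affiliation)
    return {t for t, affs in seen.items() if AFFILIATIONS <= affs}

# Hypothetical (target, affiliation) pairs.
activities = [
    ("target_A", "Pharma"), ("target_A", "Academic"), ("target_A", "Other"),
    ("target_B", "Pharma"), ("target_B", "Academic"),
    ("target_C", "Other"),
]

balanced_targets = targets_covered_by_all(activities)
print(sorted(balanced_targets))
```

Only activities against the retained targets would then go into the training and test sets, so that each class sees the same targets.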
  31. 31. Bayesian model based on <= 2005 data<br />Descriptors: ECFP_6 + Ro5 physical properties<br />Categorical model: Pharma/Academic/Other<br />
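Training on data up to and including 2005 and predicting later data is a temporal split rather than a random one. A minimal sketch, assuming each record carries the publication year of its reference; the record layout and values are hypothetical.

```python
def temporal_split(records, cutoff_year=2005):
    """Train on records published up to the cutoff year, test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test

# Hypothetical records: compound identifier, affiliation class, publication year.
records = [
    {"compound": "cpd_1", "affiliation": "Pharma", "year": 1999},
    {"compound": "cpd_2", "affiliation": "Academic", "year": 2005},
    {"compound": "cpd_3", "affiliation": "Other", "year": 2006},
    {"compound": "cpd_4", "affiliation": "Pharma", "year": 2009},
]

train, test = temporal_split(records)
print(len(train), "training records,", len(test), "test records")
```

Unlike the random 80% / 20% split used in Lesson 1, this tests whether the model generalises to compounds published after it was built.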
  32. 32. Predicting affiliation post 2005<br />Academic<br />Pharma<br />Other<br />Other not different<br />Academic/Pharma distinct<br />
  34. 34. What makes a compound ‘Pharma’<br />Aromatic rings, aromatic rings, aromatic rings. IP? Absence of decorations means these are not distinctive.<br />Number of times feature observed / how many times in academic / pharma<br />
  35. 35. What makes a compound ‘Academic’<br />Aliphatic, single rings, bold usage of F and other decorations, etc. Maybe not nasty but not very druglike.<br />Number of times feature observed / how many times in academic / pharma<br />
  36. 36. Most Pharma-like compounds<br />For each target, compound with highest ‘Pharma’ score and true origin<br />
  37. 37. Most Academic-like compounds<br />For each target, compound with highest ‘Academic’ score and true origin<br />
  38. 38. Set out to learn nasty model, ended up with a (non)drug-like model<br />Pharma is ‘a bit’ underrepresented<br />10% of MDDR is in ChEMBL (Dave Rogers)<br />ChEMBL c/should include patent literature<br />Over the years (big) pharma has delivered the goods and learned what does (not) work in a structure. Some of this knowledge can be extracted from ChEMBL.<br />Ignore this at your peril<br />Conclusions<br />
