
ChEMBL UGM May 2011



Learn from the ChEMBL database what makes a compound drug-like



  1. 1. Lessons from ChEMBL<br />Willem P. van Hoorn<br />Senior Solutions Consultant<br /><br />
  2. 2. Those who cannot remember the <br />past are condemned to repeat it<br />Contents<br />
  3. 3. ‘Nasties’<br />Things you would not like to see in your hits<br />Specifically: reactive/labile chemical groups<br />Is the compound still on the plate?<br />Activity due to (selective) non-covalent binding?<br />Some overlap with frequent hitters/aggregators<br />Peroxides, aldehydes, etc<br />Not ‘structural alerts’<br />Off-target toxicity<br />Toxic compounds after metabolic activation<br />hERG binders, anilines, etc<br />
  4. 4. This is not a new concept<br />If you are a chemist you know many of these<br />If you have been working in pharma you know more of these<br />Pharma companies probably all have their in-house list of ‘forbidden/risky/ugly’ structures<br />Some publications but no definitive public list<br />Thus reinvention of the wheel, wasted effort<br />
  5. 5. ChEMBL: <br />“the most comprehensive ever seen in a public database.” (Wikipedia)<br />“…cover a significant fraction of the SAR and discovery of modern drugs” (ChEMBL website)<br />This must be a good source to learn what goes<br />Experienced scientists who cared enough about compounds to measure the activity and submit the results to peer-reviewed journals<br />ChEMBL as a teacher<br />
  6. 6. To learn we also need to know what not to do:<br />Compound vendor catalogues<br />Fewer constraints on reactivity / stability<br />Drive for diversity<br />More customers than just pharma:<br />Should be enriched in nasties compared to ChEMBL<br />ChEMBL as a teacher<br />
  7. 7. Lesson 1<br />ChEMBL<br />Release 7<br />Dump all compounds, keep largest fragment<br />Unique canonical SMILES: 597,255<br />Vendor reagents<br />Pipeline Pilot examples: Maybridge + Asinex<br />186,967 unique compounds<br />Build Bayesian model ‘reagentlike’<br />Vendor “good” v. ChEMBL “baseline”<br />What do reagents have in common that ChEMBL compounds don’t?<br />
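The slides do not say which Bayesian variant was used. As a rough illustration only, the sketch below assumes a Laplacian-corrected naive Bayes of the kind often described for fingerprint-based models: each feature gets a weight of log((B + 1) / (A·P + 1)), where A is how often the feature is seen, B how often it is seen in the "good" (reagent) class, and P the overall fraction of "good" samples. The feature names are hypothetical stand-ins for fingerprint features.

```python
import math
from collections import Counter

def train_laplacian_bayes(samples):
    """samples: list of (set_of_features, is_reagent). Returns feature -> weight."""
    n_total = len(samples)
    n_reagent = sum(1 for _, is_reagent in samples if is_reagent)
    base_rate = n_reagent / n_total
    seen = Counter()          # A: occurrences of each feature
    seen_reagent = Counter()  # B: occurrences within the reagent class
    for features, is_reagent in samples:
        for f in features:
            seen[f] += 1
            if is_reagent:
                seen_reagent[f] += 1
    # Laplacian-corrected log ratio (assumed variant, see lead-in)
    return {f: math.log((seen_reagent[f] + 1) / (seen[f] * base_rate + 1))
            for f in seen}

def score(weights, features):
    """Sum the weights of known features; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in features)

# Toy data with made-up feature names, not real fingerprints.
samples = [
    ({"phenyl", "aldehyde"}, True),
    ({"phenyl", "peroxide"}, True),
    ({"phenyl", "amide"}, False),
    ({"phenyl", "amide", "sugar"}, False),
]
weights = train_laplacian_bayes(samples)
```

In this toy set a feature present in every sample (`phenyl`) gets a neutral weight, while features confined to one class pull the score up or down.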
  8. 8. Training/Test: Random 80% / 20%<br />Excellent separation ChEMBL / Reagent<br />Reagentlike model<br />Leave-one-out enrichment<br />Test set enrichment<br />
  9. 9. Done?<br />
  10. 10. A look at high and low scoring compounds<br />Colour atoms by contribution to Bayesian score<br />Red: high contribution: reagent-like<br />Blue: low contribution: not reagent-like<br />Color gradient over set of molecules<br />
  11. 11. High scoring molecules<br />
  12. 12. More high scoring molecules<br />They do contain ‘nasty’ groups…<br />But they don’t stand out against the rest of the molecules (all red).<br />
  13. 13. Low scoring molecules<br />Etc<br />
  14. 14. High scoring features<br />Low scoring features<br />High and low scoring reagent features<br />Seen 1029 times, of which in reagent set 1024 times<br />Many variations of peptide bond and other polypeptide features:<br />635 out of 639 in reagent set<br />
  15. 15. Learning the difference was too easy<br />Small organic vs large polypeptide<br />Both sets contain many series, model learns common core instead of (nasty) decorations<br />Metric: compounds / Murcko frames<br />ChEMBL: ~6.7, reagent: ~9.0 <br />Number of frames / in common: ~81k / ~6k<br />I need to resit this class<br />Conclusions from lesson 1<br />
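The compounds-per-frame metric above is a simple ratio. A minimal sketch, assuming the Murcko frame of each compound has already been computed by a cheminformatics toolkit and is available as a string; the identifiers and frame labels below are placeholders.

```python
def compounds_per_frame(frame_by_compound):
    """Average number of compounds sharing one frame."""
    frames = set(frame_by_compound.values())
    return len(frame_by_compound) / len(frames)

def shared_frames(frames_a, frames_b):
    """Frames that occur in both sets."""
    return set(frames_a.values()) & set(frames_b.values())

# Placeholder data: compound id -> precomputed frame label.
set_a = {"A1": "F1", "A2": "F1", "A3": "F2"}
set_b = {"B1": "F1", "B2": "F3", "B3": "F3", "B4": "F3"}

ratio_a = compounds_per_frame(set_a)
ratio_b = compounds_per_frame(set_b)
common = shared_frames(set_a, set_b)
```

A high ratio means many compounds per frame, that is, many series; a small overlap in frames makes it easy for a model to separate the sets by core alone.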
  16. 16. Restrict to organic small molecules<br />AlogP < 6, Mw < 600, organic compound filter<br />Bayesian Model<br />ECFP_2 (smaller features compared to ECFP_6)<br />Less likely to capture whole core<br />Lesson 2: Rebalancing the training set<br />
  17. 17. Still a predictive model<br />
  18. 18. A typical high scoring compound <br />~neutral score for parts presumed common to both sets like phenyl<br />~positive score for nasty parts<br />
  19. 19. Low scoring example<br />Many sugars, phosphates, steroids, etc<br />
  20. 20. High scoring features<br />Low scoring features<br />Some ECFP_2 features<br />
  21. 21. Less learning of “series by template”<br />But it still happens, don’t need to capture whole ring to capture sugar, steroid, etc<br />Some of expected nasty features found<br />But many are not<br />Better training set needed<br />Series: similar in both clean/nasty training set, so that difference is not the template<br />Many ChEMBL compounds are odd<br />I have still not learned the lesson<br />Conclusions from lesson 2<br />
  22. 22. ChEMBL:<br />What I should have started with:<br />All compounds with IC50 or Ki expressed in nM, <br />Against human target,<br />Include reference: journal, volume, year, page<br />569,569 activities<br />223,896 compounds<br />14,383 references<br />Lesson 3: Learning from (big) pharma<br />
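The selection on this slide can be sketched as a filter over exported activity rows. The field names (`type`, `units`, `organism`, `compound`, `reference`) are assumptions for illustration and are not the ChEMBL schema; the rows are made up.

```python
def select_activities(records):
    """Keep IC50/Ki values in nM against a human target that carry a reference."""
    return [
        r for r in records
        if r["type"] in {"IC50", "Ki"}
        and r["units"] == "nM"
        and r["organism"] == "Homo sapiens"
        and r.get("reference")
    ]

def summarize(records):
    """Count activities, distinct compounds and distinct references."""
    return {
        "activities": len(records),
        "compounds": len({r["compound"] for r in records}),
        "references": len({r["reference"] for r in records}),
    }

# Made-up rows with hypothetical field names.
rows = [
    {"type": "IC50", "units": "nM", "organism": "Homo sapiens", "compound": "C1", "reference": "R1"},
    {"type": "Ki", "units": "nM", "organism": "Homo sapiens", "compound": "C1", "reference": "R2"},
    {"type": "IC50", "units": "uM", "organism": "Homo sapiens", "compound": "C2", "reference": "R1"},
    {"type": "IC50", "units": "nM", "organism": "Rattus norvegicus", "compound": "C3", "reference": "R3"},
    {"type": "IC50", "units": "nM", "organism": "Homo sapiens", "compound": "C4", "reference": None},
]
summary = summarize(select_activities(rows))
```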
  23. 23. Looking up author affiliation in PubMed<br />NCBI Entrez Utilities Web Service (Text Analytics component collection)<br />This takes ~4 hours in a weekend (PubMed usage restriction)<br />13,410 references<br />564,422 activities<br />214,747 compounds<br />Something wrong with some BMCL refs?<br />
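The slides used a Pipeline Pilot component for the affiliation lookup. As a swapped-in illustration, the sketch below uses only the Python standard library: it builds an E-utilities `efetch` URL for a batch of PubMed IDs and pulls `Affiliation` elements out of the returned XML. The sample XML and its PMID are invented so the parser can be exercised offline; a real run would fetch the URL and throttle requests to respect NCBI's usage limits.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmids):
    """Build an efetch URL returning PubMed records as XML."""
    query = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}, safe=","
    )
    return f"{EFETCH}?{query}"

def affiliations_by_pmid(xml_text):
    """Map each PMID in the XML to the affiliation strings found in its record."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        result[pmid] = [a.text for a in article.iter("Affiliation") if a.text]
    return result

# Invented record for offline testing; not a real PubMed entry.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>0</PMID>
      <Article>
        <Affiliation>Department of Chemistry, Example University</Affiliation>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

url = efetch_url(["1", "2"])
parsed = affiliations_by_pmid(SAMPLE_XML)
```

Searching for `Affiliation` anywhere below the article keeps the parser independent of where exactly the element sits in the record.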
  26. 26. Top 10 affiliations<br />
  27. 27. Where is Pfizer?<br />And 318 more… Similar for other contributors<br />
  28. 28. if DocAuthorsAffiliation rlike 'univers|Faculty|hospital|National.*Institute.*Health|Polytechnic' then<br />Published_by := 'Academic';<br />elsif DocAuthorsAffiliation rlike 'Pfizer' then<br />Published_by := 'Pfizer';<br />elsif DocAuthorsAffiliation rlike 'warner.*lambert|parke.*davis' then<br />Published_by := 'Warner-Lambert';<br />elsif DocAuthorsAffiliation rlike 'Pharmacia|Upjohn' then<br />Published_by := 'Pharmacia';<br />elsif DocAuthorsAffiliation rlike 'Wyeth' then<br />Published_by := 'Wyeth';<br />elsif DocAuthorsAffiliation rlike 'Merck' then<br />Published_by := 'Merck';<br />…<br />else<br />Published_by := 'Other';<br />end if;<br />Merging affiliations<br />
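The rule chain on this slide translates directly into an ordered list of regular expressions. A minimal Python sketch, containing only the rules shown on the slide (the elided ones stay elided) and assuming that `rlike` matches case-insensitively, which the mixed-case patterns suggest but the slides do not state:

```python
import re

# Ordered rules: the first matching pattern wins, as in the if/elsif chain.
RULES = [
    ("Academic", r"univers|Faculty|hospital|National.*Institute.*Health|Polytechnic"),
    ("Pfizer", r"Pfizer"),
    ("Warner-Lambert", r"warner.*lambert|parke.*davis"),
    ("Pharmacia", r"Pharmacia|Upjohn"),
    ("Wyeth", r"Wyeth"),
    ("Merck", r"Merck"),
    # further rules are elided on the slide ("…") and are not reconstructed here
]

def published_by(affiliation):
    """Return the label of the first rule whose pattern occurs in the affiliation."""
    for label, pattern in RULES:
        if re.search(pattern, affiliation, flags=re.IGNORECASE):
            return label
    return "Other"
```

Because the academic rule comes first, an affiliation naming both a university and a company is labelled 'Academic', matching the order of the original chain.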
  29. 29. Ranked contributors to ChEMBL<br />
  30. 30. Creating balanced training/test sets<br />Affiliation: Pharma, Other, Academic<br />Keep 602 targets for which measured activities are available for all 3 affiliations<br />Same target, same pharmacophore, some me-too work: less series learning<br />
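Keeping only the targets measured by all three affiliation classes amounts to a set-containment check per target. A minimal sketch over (target, affiliation) pairs, with placeholder target names:

```python
from collections import defaultdict

def targets_with_all_affiliations(activities, required=("Pharma", "Academic", "Other")):
    """Return targets that have at least one activity from every required affiliation."""
    seen = defaultdict(set)
    for target, affiliation in activities:
        seen[target].add(affiliation)
    needed = set(required)
    return {target for target, affs in seen.items() if needed <= affs}

# Placeholder (target, affiliation) pairs.
pairs = [
    ("T1", "Pharma"), ("T1", "Academic"), ("T1", "Other"),
    ("T2", "Pharma"), ("T2", "Academic"),
    ("T3", "Other"),
]
balanced_targets = targets_with_all_affiliations(pairs)
```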
  31. 31. Bayesian model based on <= 2005 data<br />Descriptors: ECFP_6 + Ro5 physical properties<br />Categorical model: Pharma/Academic/Other<br />
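Training on data up to 2005 and predicting later publications is a time-based split rather than a random one. A minimal sketch, assuming each record carries a publication `year` (a hypothetical field name) and that a categorical prediction is simply the class with the highest score:

```python
def temporal_split(records, cutoff_year=2005):
    """Train on records published up to the cutoff, test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test

def predicted_class(class_scores):
    """Pick the affiliation class with the highest model score."""
    return max(class_scores, key=class_scores.get)

# Made-up records and scores for illustration.
records = [{"id": "a", "year": 2003}, {"id": "b", "year": 2005},
           {"id": "c", "year": 2006}, {"id": "d", "year": 2010}]
train, test = temporal_split(records)
label = predicted_class({"Pharma": 1.2, "Academic": -0.4, "Other": 0.1})
```

A time-based split tests whether the model generalizes to compounds made after training, which a random split cannot show.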
  32. 32. Predicting affiliation post 2005<br />Academic<br />Pharma<br />Other<br />Other not different<br />Academic/Pharma distinct<br />
  34. 34. What makes a compound ‘Pharma’<br />Aromatic rings, aromatic rings, aromatic rings. IP? Absence of decorations means these are not distinctive.<br />Number of times feature observed / how many times in academic / pharma<br />
  35. 35. What makes a compound ‘Academic’<br />Aliphatic, single rings, bold usage of F and other decorations, etc. Maybe not nasty but not very druglike.<br />Number of times feature observed / how many times in academic / pharma<br />
  36. 36. Most Pharma-like compounds<br />For each target, compound with highest ‘Pharma’ score and true origin<br />
  37. 37. Most Academic-like compounds<br />For each target, compound with highest ‘Academic’ score and true origin<br />
  38. 38. Set out to learn nasty model, ended up with a (non)drug-like model<br />Pharma is ‘a bit’ underrepresented<br />10% of MDDR is in ChEMBL (Dave Rogers)<br />ChEMBL c/should include patent literature<br />Over the years (big) pharma has delivered the goods and learned what does (not) work in a structure. Some of this knowledge can be extracted from ChEMBL.<br />Ignore this at your peril<br />Conclusions<br />