Promiscuous patterns and perils in PubChem and the MLSCN






  • Promiscuous patterns and perils in PubChem and the MLSCN
    • Jeremy J Yang, Cristian Bologa, Tudor Oprea
    • Division of Biocomputing, Dept of Biochem. & Mol. Biology
    • NM Mol. Libraries Screening Center
    • University of New Mexico
  • Goals
    • Improve HTS success rate by pre-filtering or retro-filtering
    • Detection of promiscuous compounds
    • Effectively data-mine PubChem, developing tools to ask related questions
    • Generalizable knowledge
  • Background
    • HTS is a game of chance
    • Druglike, ADMET, Ro5, leadlike, probelike
    • Smaller still: fragment-based screening
    • Ligand efficiency
    • Compound fitness relative to HTS assay method
    • False positives
    • True but useless positives
    • Aggregators
    • Reactives
    • Studying the signal vs studying the noise
    • NIH Roadmap, MLI, MLSCN (MLPCN), MLSMR
    [Figure: 10^6 to 10^0]
  • Promiscuity defined
    • Known types
      • aggregators
      • reactives
      • true binders
    • Bioactivity for multiple targets, i.e. “frequent-hitter”, non-selective binder
    • Scaffold promiscuity [working definition]: multi-target bioactivity for involved scaffold
    • Scaffold may be a determinant or simply an informatic device.
  • Real vs phony promiscuity
    • Apparent promiscuity may be due to:
      • Actual promiscuous binding
      • Artifactual promiscuity
        • Reactivity
        • Aggregation
        • Fluorescence
      • Experimental errors
    • For a given assay method, actual and artifactual promiscuity are equivalent for most HTS intents and purposes
  • PubChem: cathedral or bazaar?
    • Mission and success of PubChem
      • 38 million compounds (March 2008)
      • High volume worldwide use by scientific community
    • Mission and success of NIH MLI and MLSCN
      • 10 centers, 3 years
      • ~100? probes, ~1000 assays, ~30M data
    [Figure: (a) National Cathedral, Washington, DC; (b) Santa Fe Flea Market, Santa Fe, NM]
    ref: “The Cathedral and the Bazaar”, Eric Raymond, 1997.
  • PubChem and MLSCN*
    • Publicly available bioactivity data on this scale is an unprecedented accomplishment and opportunity.
    • With rapid growth, data quality is an important concern.
    • Overall goals broader than individual HTS campaigns
    • Big idea: MLSCN+PubChem, reaching critical mass?
    *MLSCN = Molecular Libraries Screening Center Network, to be MLPCN, Molecular Libraries Program Center Network
  • PubChem and MLSMR
    • Molecular Libraries Small Molecule Repository
    • Managed by BioFocusDPI/Galapagos
    • ~ 300k compounds (March 2008) and growing
    • Used for primary HTS by all MLI centers
    Plot c/o Victor Panchenco, BioFocusDPI
  • MLSMR and MLSCN actives
    • MLSMR actives: 94,148
    • MLSCN actives: 104,078
    • Active in both: 93,616; active in either: 104,610
    • MLSMR only: 532; MLSCN only: 10,462
    (Some MLSCN compounds, esp. from secondary assays, come from other sources such as commercial vendors.) *June 2008
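The six counts on this slide are mutually consistent if they are read as two sets and their overlap. That reading is an assumption inferred from the arithmetic, not something the slide states. A minimal Python check under that assumption:

```python
# Counts copied from the slide. The mapping of each number to a Venn
# region is an assumption inferred from the arithmetic, not stated there.
mlsmr_actives = 94_148
mlscn_actives = 104_078
both = 93_616
either = 104_610

# Compounds active in only one of the two sets.
mlsmr_only = mlsmr_actives - both
mlscn_only = mlscn_actives - both

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
union_check = mlsmr_actives + mlscn_actives - both

print(mlsmr_only, mlscn_only, union_check == either)
```

Under this reading, about 10% of MLSCN actives are not MLSMR compounds, which fits the note about other sources.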
  • Selected published pre-filtering expert knowledge*
    • Rishton
      • Reactive compounds and in vitro false positives in HTS, Drug Discov. Today, 2, 382-384.
    • Hann
      • Strategic pooling of compounds for high-throughput screening, Mike Hann et al., J. Chem. Inf. Comp. Sci., 1999, 39, 897-902.
    • Rishton
      • Nonleadlikeness and leadlikeness in biochemical screening, G. M. Rishton, Drug Disc. Today, 8, 2003, 86-96.
    • Seidler
      • Identification and Prediction of Promiscuous Aggregating Inhibitors among Known Drugs, J. Seidler et al., J. Med Chem, 2003, 46, 4477-4486.
    *generalizable domain knowledge
  • Selected pre-filtering semi-public expertise
    • Blake
      • James Blake, Sybyl script lint_sln.spl, formerly bundled with Sybyl, 2001(?).
    • Commercial vendors
      • Property predictions (LogP, ADME-Tox, solubility)
      • Filtering tools and APIs
      • Some defined patterns
    • Roche
      • “Development of a Virtual Screening Method for Identification of 'Frequent Hitters' in Compound Libraries”, Roche et al., J. Med. Chem., 45, 2002, 137-142. [neural network, 345 descriptors]
  • MLSMR filtering protocols
    • MLSMR has a set of SMARTS-based filters, “excluded functionality filters”, used to reject compounds unfit for HTS
    • Filters developed in collaboration with MLSCN Chemistry working group
    • WG chose to be less restrictive than recommended by BioFocusDPI.
    • Optimum filtering is not a solved problem
    • One reason: fitness is assay-dependent
  • MLSMR re-filtered...
    • Using a combined reactive filter
      • 49k of 286k rejected (17%)
      • Top reactive patterns:
    *search failed via PC GUI
  • MLSMR re-filtered... Example rejects:
  • Pre-filtering at UNM: Java servlet using ChemAxon/JChem
  • Activity multiplicity: all assays
    • Compounds active in any assay: 103894 MLSCN compounds, 724 MLSCN assays
    • [Plot annotation: “peril?”]
  • Activity multiplicity: screening assays
    • Compounds active in any screening assay: 89954 MLSCN compounds, 268 MLSCN assays
    • [Plot annotation: “peril”]
  • Top 100 hier-scaffolds*, active PubChem MLSMR
    • Example scaffold #1
    *Wilkens, J. Med Chem 2005
  • Top active hier-scaffolds: example #1
    • Scaffold: c1c2c([nH]c(=O)cn2)ncn1
    • 33rd most common; 510 active compounds; 34 assays
    • [Table: top 12 of 510 compounds; columns: CID, #assays active]
  • Top active hier-scaffolds: example #1 #compounds vs #assays in which they are active
  • Top active hier-scaffolds: example #1 All from NCGC
  • Other promiscuous scaffolds
    • [Table columns: COMPOUNDS: total/tested/active; ASSAYS: tested/active; SAMPLES: tested/active]
  • Activity mining with PubChem GUI Lots of great functionality, but not everything...
  • Activity mining with command line Automation is good...
  • Top active hier-scaffolds: example #2
    • Scaffold: c1ccc(cc1)NC(=O)c2ccco2
    • 62nd most common; 523 active compounds; 208 assays
    • [Table: top 12 of 523 compounds; columns: CID, #assays active]
  • Top active hier-scaffolds: example #2 #compounds vs #assays in which they are active
  • Top active hier-scaffolds: example #3
    • Scaffold: c1nc-2c(=O)[nH]c(=O)nc2[nH]n1 (toxoflavin)
    • 1307th most common; 27 active compounds; 140 assays
  • Top active hier-scaffolds: example #3 #compounds vs #assays in which they are active
  • Digression: PubChem bug...
    • Substructure search for CID 66541: 143 hits
    • Substructure search for C2(C1=NC=NNC1=NC(N2)=O)=O: 143 hits
    • Substructure search for c1nc-2c(=O)[nH]c(=O)nc2[nH]n1: 0 hits
    • Ergo: aromatic structural queries not allowed?
  • Top active hier-scaffolds: example #3
  • More example scaffolds and histograms of #compounds vs. #assays in which compounds are active
  • Scaffold promiscuity vs. SEA*
    • For a given scaffold, what are the active molecules and the bioassays in which they are active?
    • SEA (Similarity Ensemble Approach): for a given query molecule, are there bioactive similars, and for what targets?
    *Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotech 25 (2), 197-206 (2007). http://shoichetlab.compbio.ucsf.edu/~keiser/sea/
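The scaffold-centric question on this slide can be sketched as a grouping of activity records by scaffold. The record layout `(scaffold, cid, aid)`, the function name `summarize_scaffolds`, and all CIDs and AIDs below are hypothetical; this is not the authors' script, only one plausible way to organize the bookkeeping once scaffolds have been assigned by some other tool.

```python
from collections import defaultdict

def summarize_scaffolds(records):
    """Group (scaffold, cid, aid) activity records by scaffold.

    Returns {scaffold: {"compounds": set of CIDs, "assays": set of AIDs}}.
    """
    summary = defaultdict(lambda: {"compounds": set(), "assays": set()})
    for scaffold, cid, aid in records:
        summary[scaffold]["compounds"].add(cid)
        summary[scaffold]["assays"].add(aid)
    return dict(summary)

# Made-up records: each tuple means "compound cid was active in assay aid".
records = [
    ("c1ccc(cc1)NC(=O)c2ccco2", 101, 1),
    ("c1ccc(cc1)NC(=O)c2ccco2", 101, 2),
    ("c1ccc(cc1)NC(=O)c2ccco2", 102, 2),
    ("c1ccccc1", 103, 3),
]

summary = summarize_scaffolds(records)
# List scaffolds by number of distinct assays with at least one active compound.
for scaffold, hits in sorted(summary.items(), key=lambda kv: -len(kv[1]["assays"])):
    print(scaffold, len(hits["compounds"]), len(hits["assays"]))
```

Using sets means a compound tested active twice in the same assay is counted once, which seems closer to the "#compounds vs #assays" view used in the scaffold examples.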
  • Possibilities
    • Activity in multiple assays is expected when assays are related (screening/confirmatory, etc.)
    • If compounds are not promiscuous, one possibility is that the assays are redundant (also interesting)
    • Need rule of thumb? Active in >5 HTS screens deserves a red flag?
    • Need assay classification/canonicalization – i.e. rigorous informatics?
    • Scaffold “popularity” generally desirable.
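The ">5 HTS screens" rule of thumb floated above could be tried out roughly as follows. The function name `flag_frequent_hitters`, the `(cid, aid)` pair layout, and the data are all hypothetical, and the threshold is a parameter because the slide poses it as a question rather than a settled value.

```python
from collections import defaultdict

def flag_frequent_hitters(active_pairs, max_assays=5):
    """Return {cid: n_assays} for compounds active in more than max_assays assays."""
    assays_by_cid = defaultdict(set)
    for cid, aid in active_pairs:
        assays_by_cid[cid].add(aid)
    return {
        cid: len(aids)
        for cid, aids in assays_by_cid.items()
        if len(aids) > max_assays
    }

# Made-up data: compound 1 is active in 7 assays, compound 2 in 2.
active_pairs = [(1, aid) for aid in range(7)] + [(2, 0), (2, 1)]
flagged = flag_frequent_hitters(active_pairs)
print(flagged)
```

A raw count like this treats every assay as independent, so it would overstate promiscuity wherever screening and confirmatory assays for the same target are both present; some assay grouping would be needed first.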
  • Possibilities
    • The global bioactivity landscape (nature) calls for a comprehensive view of bioactivity, chemical space, promiscuity, privileged structures and other patterns. In other words, chem+bio informatics.*
    *"Is There a General Model for Bioactivity?", T.I. Oprea, O. Ursu, C.G. Bologa, and L.A. Sklar, The 8th International Conference on Chemical Structures, June, 2008, Noordwijkerhout, The Netherlands (http://www.int-conf-chem-structures.org/pdf/B-6.pdf).
  • Data mining methodology notes
    • Get all active MLSCN/MLSMR compounds (Entrez)
    • HierScaf perception w/ all MLSCN compounds (OE)
    • Extract compounds for selected scaffolds (OE)
    • Download compounds (PUG)
    • Find all bioassays in which cpds are active, using local PubChem bioassay ftp-mirror
    • Get bioassay summaries (Entrez)
    • Try our scripts: http://pangolin.health.unm.edu/kit/
    PUG = NCBI PubChem Power User Gateway (http); Entrez = NCBI Entrez eUtils API; OE = OpenEye OEChem. All code Perl or Python. *June 2008
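As an illustration of the Entrez steps listed above, the sketch below only builds an E-utilities ELink request URL and does not contact NCBI. It is not taken from the authors' kit: the helper name `elink_active_compounds_url`, the placeholder AID, and in particular the link name `pcassay_pccompound_active` are assumptions that should be checked against the current Entrez documentation before use.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def elink_active_compounds_url(aid):
    """Build an ELink URL asking for compounds linked as active to one bioassay."""
    params = {
        "dbfrom": "pcassay",
        "db": "pccompound",
        "id": str(aid),
        # Assumed link name; verify against the current Entrez link list.
        "linkname": "pcassay_pccompound_active",
    }
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"

url = elink_active_compounds_url(1234)  # 1234 is a placeholder AID
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect and to throttle, which matters when looping over hundreds of assays.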
  • Now for some general comments... [Figure: 10^6 to 10^0]
  • Probability, in the HTS game etc. In other words: Quantity vs. Quality
    \[ E_{\text{not striking out}} = 1 - E_{\text{striking out}} \]
    \[ E_{\text{success}} = 1 - (1 - E_{\text{hit}})^{N} \]
    where $E_{\text{hit}}$ = probability of a hit per try and $N$ = number of tries
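The success formula on this slide can be evaluated directly. The sketch below assumes every try is independent and has the same hit probability, which the slide implies but does not state; the function names and the example numbers (a 0.1% hit rate, a 95% target) are illustrative only.

```python
import math

def p_success(p_hit, n_tries):
    """Probability of at least one hit in n_tries independent tries."""
    return 1.0 - (1.0 - p_hit) ** n_tries

def tries_needed(p_hit, target):
    """Smallest N with p_success(p_hit, N) >= target.

    Assumes 0 < p_hit < 1 and 0 < target < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_hit))

# Illustrative numbers, not from the talk.
print(round(p_success(0.001, 1000), 4))
print(tries_needed(0.001, 0.95))
```

The second function inverts the formula for N, which makes the quantity-versus-quality trade-off concrete: halving the per-try hit probability roughly doubles the number of tries needed for the same chance of success.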
  • Probability, more hard lessons
    • De novo molecular design
    • Case study: Grok and Grope*, 1992, Weininger, Blaney, Dixon
    • GA -> virtual library, docking fitness, lots of cpu cycles
    • But you need to recognize a good hit, including all aspects of fitness (ADMET, synthesis, etc.). (approximation from memory)
    • conclusions:
      • medchem knowledge is important
      • chemical space is huge
    *"Method and Apparatus for Designing Molecules with Desired Properties by Evolving Successive Populations," David Weininger, U.S. patent US5434796, 1995.
  • Probability, more quantity vs quality
    • “Quantitative high-throughput screening: A titration-based approach that efficiently identifies biological activities in large chemical libraries”, Inglese et al., PNAS, 103, 2006, 11473-11478.
  • Probability and prejudice
    • Lucky CEOs, coaches, and fund managers
    • Kahneman's 2002 Nobel prize for economics-psychology
    • Good stories vs. Occam's razor
    • Confirmation bias
    • Pathological pattern-recognizers are we.
  • Statistics and significance
    • “The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy)”, Stephen R. Johnson, J. Chem. Inf. Model., 48 (1), 25-26, 2008.
  • Conclusions
    • Promiscuous compounds afflict HTS, but effects can be mitigated by pre-filtering and post-analysis including recognition of patterns
    • Promiscuous patterns can be natural (chemistry based) or manufactured (e.g. analog series)
    • For scaffolds, bad-promiscuity is continuous with good-promiscuity, a.k.a. privileged structures, i.e. scaffolds which frequently enable selective bioactive compounds.
  • Conclusions
    • PubChem/MLSCN is a rich, growing, unprecedented public source of bioactivity data offering new research avenues.
    • PubChem APIs provide good methods of compound and bioassay data mining...
    • ...BUT – work remains to be done to fully utilize PubChem and integrate with other data sources and procedures.
  • Conclusions (from Chris Lipinski*)
    • Designing screening libraries:
      • know and use the pharma industry filters
      • use expert medicinal chemistry advice
      • get the best chemistry quality you can afford
    *Chris Lipinski, Nanosyn Open House talk, Feb 16, 2008.
  • Acknowledgements, thanks
    • Cristian Bologa, UNM
    • Tudor Oprea, UNM
    • Oleg Ursu, UNM
    • Steve Mathias, UNM
    • Chris Lipinski, Melior
    • PubChem team
    • NCBI team
    • Software c/o: OpenEye
    • contact: jjyang@salud.unm.edu
  • Scaffolds and chemotypes
    • Nature and chemical scaffolds
    • Scaffolds and shape
    • Scaffolds, aromatic hetero rings and bioactivity*
    • Human understanding and scaffolds (SAR)
    • Synthesis and scaffolds
    • Scaffolds and fragment-based screening
    • Commerce and scaffolds
    • Scaffold definitions
    • Wilkens et al. Hierarchical Scaffolds
    *Ertl, et al., Quest for the Rings
  • HierScaffolds:
    • scaffolds are one or several rings connected by linkers
    • compounds can be related by any of their scaffolds