Digging out Structures for Repurposing:
     Non-competitive Intelligence


             PubChem Seminar April 2013

  Christopher Southan, TW2Informatics, Göteborg, Sweden




                                                          [1]
Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan




                                                                                  [2]
Outline


•   Trawling for repurposing-relevant data
•   Code names statistics and name > structure triage
•   The NCATS/MRC challenge
•   Story of JNJ-39393406
•   Scaling up code name hunting and x-mapping
•   Code names in clinical trials, MeSH, PubChem
•   Story of PF-04457845
•   Trials, MeSH and PubChem code name intersects
•   Conclusions




                                                        [3]
Intelligence: trawling compound information


              Competitive                            Non-competitive

• Directed towards commercially              • Directed towards repositioning any
  positioning and/or repurposing               compound
  own portfolio                              • Collaborative approaches to IP
• Major big pharma activity                    holders (but new IP possible)
• Mixed commercial/public sources            • Can utilise public resources alone
• Internal specialists                       • Different domain expert entry
• Typically a closed activity (i.e. little     points
  open “best practice”)                      • Predominantly an open activity
• Typically therapeutic area aligned           (e.g. OSDD)
                                             • Can be hypothesis-neutral


                                                                               [4]
Structures:
connecting to repurposing-relevant data

•   Code names and synonyms
•   Resolving these to structures
•   Database entries
•   BioAssay results
•   Target/pathway links
•   In vitro & in vivo research papers
•   Clinical trial results and papers
•   Patents for analogues and SAR
•   Comparative in vivo data
•   Mendelian and GWAS disease links
•   Expression data for cpds
•   In silico modeling (including rare or NTDs)
•   Vendor similarity matches

                                                  [5]
Code names: 2-15 year information hole




                       Pharmaprojects
                       2009-10 figures




                                         [6]
Drugs, code names, INN/USANs and structures:
              few congruent hard numbers

•   Pharmaprojects (2013) drug profiles ~ 50,000
•   Thomson Reuters Cortellis (2012) drug monographs = 41,889
•   Pharmaprojects (via ProQuest, 2012) records ~ 35,000
•   Thomson Reuters Partnering (2011 structures, PMID: 22024215) = 17,901
•   Pharmaprojects (2003 structures) = 14,000
•   ChEMBL USANs (2013) = 10,568
•   PubChem (2013) “USAN [synonym] OR INN [synonym]” = 9,890
•   Pharmaprojects (2010 in development, no structure count) = 9,737
•   GVKBIO Clinical Candidate structures (2008, PMID:20298516) = 8,864
•   Pharmaprojects (2010 review, no structures) Phase 1+2+3 = 3,828




                                                                       [7]
Code names: major repurposing potential – but..
• ~ 95% of the 30K are/will become “parked” or “abandoned”
• Can be repurposed in silico at least
• Obvious hierarchy: leads > development > clinical trials > INN > approved

• Problems
   – New code names < 50% - 70% blinded (i.e. no structures)
   – Some older code names never un-blinded
   – Code naming practices independent and completely ad hoc
   – Publications, conference reports, clinical trials entries, press releases
     and portfolio listings linked to “blinded” code names (no structures)
   – Even for public declarations (e.g. papers) data linked into “the system”
     (e.g. synonym mapping) is patchy
   – Code originators do not provenance public database entries
   – Data supporting non-progression decisions rarely disclosed
   – Hundreds of code stems are listed at http://chembl.blogspot.se/p/research-code-stems.html

                                                                            [8]
Code name-to-structure mapping triage

Dig out the code names    Name/image > struc

 PubChem Substance        • chemicalize.org, OPSIN,
                            Chemical Identifier Resolver,
 PubChem Compound           sketchers, OSRA


    PubMed/MeSH           • Cross-checks:
                             –   SMILES/SDF/InChI strings
                                 PubChem and ChemSpider
    Google Scholar           –   InChIKey in Google
                             –   SureChemOpen patent search
    Google Images            –   Clinicaltrials.gov
                             –   Synonym trawling

 Google open (filtered)
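
The name > structure step of this triage can be partly scripted. The following is a minimal sketch, assuming the public PubChem PUG REST and NCI Chemical Identifier Resolver URL patterns; the function names are illustrative and not part of the original slides. It only builds the lookup URLs and parses a PubChem JSON reply, so the network call itself is left to the reader.

```python
from urllib.parse import quote

# Base URLs for two of the open resolvers named on this slide.
PUBCHEM_PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
NCI_RESOLVER = "https://cactus.nci.nih.gov/chemical/structure"


def pubchem_name_to_cids_url(name):
    """URL asking PubChem which CIDs carry this name as a synonym."""
    return f"{PUBCHEM_PUG}/compound/name/{quote(name, safe='')}/cids/JSON"


def resolver_inchikey_url(identifier):
    """URL asking the NCI resolver for a Standard InChIKey (name or SMILES in)."""
    return f"{NCI_RESOLVER}/{quote(identifier, safe='')}/stdinchikey"


def parse_cids(payload):
    """Pull the CID list out of a PUG REST JSON reply; empty list if no hit."""
    return payload.get("IdentifierList", {}).get("CID", [])


if __name__ == "__main__":
    # A code name that is PubChem -ve returns a fault document, hence no CIDs.
    print(pubchem_name_to_cids_url("PF-04457845"))
    print(resolver_inchikey_url("PF-04457845"))
    print(parse_cids({"Fault": {"Code": "PUGREST.NotFound"}}))
```

An empty CID list for a code name is the "PubChem -ve" case, at which point the manual cross-checks listed above (InChIKey in Google, patent search, synonym trawling) take over.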

                                                              [9]
The NCATS/MRC industry sponsored
repurposing exercise: the joy of code lists




                                              [10]
NCATS/MRC repurposing candidates




http://cdsouthan.blogspot.se/2012/09/mrc-22-vs-ncats-58-repurposing-lists.html
                                                                                 [11]
NCATS/MRC: summary statistics



                                PMID 23159359




              •   70 code names – no structures
              •   18 INNs & 4 codes-only in PubChem
              •   24 strucs “dug out” but PubChem-ve
              •   24 codes remain blinded
                                                   [12]
Sleuthing down a JNJ-39393406 structure:
        from darkness to twilight




                                           [13]
JNJ-39393406: NCATS documentation PubChem -ve




                                               [14]
JNJ-39393406: ClinicalTrials.gov




                                   [15]
JNJ-39393406 in PubMed




                         [16]
JNJ-39393406: open Google




                            [17]
JNJ-39393406: Google Scholar (was) structure -ve




                                                   [18]
JNJ-39393406 in Google images: finally a mapping




But where did these two vendors get their mapping from?
                                                           [19]
(Probable) JNJ-39393406 in PubChem:
CID 1675566 patent-only sources and near-neighbours




                                                  [20]
(Probable) JNJ-39393406:
SureChemOpen patent match
   with corroborative data

   PubChem SID 152835708




                             Cf NCATS data




                                             [21]
More JNJ-39393406 mystery:
InChIKey in Google > ChemSpider > 3rd vendor




                                               [22]
Not all JNJ-s are blinded: JNJ-40418677
   IUPAC in abstract but code still PubChem –ve




IUPAC name converted at chemicalize.org for PubChem mapping
                                                              [23]
Scaling-up code name retrieval:
       wild card searches
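
Once text has been retrieved by a wild card search, a local pattern match can pull out the individual codes. This is a rough sketch only: the stem list (GSK, AZD, JNJ, PF) mirrors the wild cards used in this talk, while the digit-length range and the hyphenated output form are assumptions, since code naming practices are ad hoc.

```python
import re
from collections import Counter

# Stem, optional hyphen or space, then 4-8 digits (assumed range).
CODE_PATTERN = re.compile(r"\b(GSK|AZD|JNJ|PF)[- ]?(\d{4,8})\b")


def extract_codes(text):
    """Return code names found in text, normalised to STEM-digits."""
    return [f"{stem}-{digits}" for stem, digits in CODE_PATTERN.findall(text)]


def count_by_stem(codes):
    """Count distinct codes per company stem."""
    return Counter(code.split("-", 1)[0] for code in set(codes))


if __name__ == "__main__":
    sample = "JNJ-39393406 and JNJ 40418677 were compared with PF-04457845."
    codes = extract_codes(sample)
    print(codes)
    print(count_by_stem(codes))
```

Normalising to one written form matters because the same code can appear with a hyphen, a space or neither, and each variant is a separate string to a search engine.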




                                  [24]
Phases & codes in Clinicaltrials.gov:
                           thin on results

• Interventional studies = 115356, of which 7895 have results (7%)

• Results | Interventional Studies | Phase 1, 2, 3 | Industry = 4477

• Interventional Studies | GSK* | Phase 1, 2, 3 | Industry = 1004
• Results | Interventional Studies | GSK* | Phase 1, 2, 3 | Industry = 122 (12%)

• Interventional Studies | GSK* OR AZD* OR JNJ* OR PF0* | Phase 1, 2, 3 |
  Industry = 1640

• Results | Interventional Studies | GSK* OR AZD* OR JNJ* OR PF0* | Phase
  1, 2, 3 | Industry = 185 (11%)
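
The percentages on this slide follow directly from the counts. A small check, using only the figures quoted above (the function name is illustrative):

```python
def percent_with_results(with_results, total):
    """Share of studies with posted results, rounded to a whole percent."""
    return round(100 * with_results / total)


# (label, studies with results, all studies) as quoted on the slide.
COUNTS = [
    ("All interventional", 7895, 115356),
    ("GSK* industry Phase 1-3", 122, 1004),
    ("GSK*/AZD*/JNJ*/PF0* industry Phase 1-3", 185, 1640),
]

if __name__ == "__main__":
    for label, with_results, total in COUNTS:
        print(f"{label}: {percent_with_results(with_results, total)}%")
```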



                                                                             [25]
alltrials.net: public pressure > more results > more
              repurposing opportunities




             http://www.youtube.com/watch?v=lQ6YTU5kGXw&feature=youtu.be&t=28m39s

                                                             [26]
Stemming code names in MeSH




                              [27]
Code names in PubChem Compound (CIDs)




       CID:SID ratio 275:1039           [28]
Codes in PubChem: selected matches




                                     [29]
“GSK-” in ChEMBL : 61




                        [30]
Tracking PF-04457845 through the system




                                          [31]
PubMed intersects: finding PF-04457845




                                         [32]
PF-04457845:
   PubMed




           [33]
PF-04457845: ClinicalTrials.gov




                                  [34]
PF-04457845:
  PubChem CID
    24771824

 Substance (SID)
capture of activity,
vendor and patent
     sources




                   [35]
Wikipedia: links to other development compounds




                 But who put them in?



                                                  [36]
PF-04457845: (almost) a total system success

•   Declared efficacy failure > possible repurposing candidate
•   Selection of analogues and a probe [18F]PF-9811 (CID 70679467)
•   The “system” did well because of good publishing practice (e.g. full text)
•   Code, structure, target, papers, trials and patents all connected
•   5 mg for $275

But:
• Serendipitous finding (no “efficacy failure” or “study stopped” tags)
• Lack of ClinicalTrials.gov <> PubMed cross-links
• BindingDB using deprecated ChEBI ID
• PMID:21505060 not yet in ChEMBL
• No direct target or patent nos. in CID record because no DrugBank,
  SCRIPDB or IBM capture
• [18F]PF-9811 PubChem, [(18)F]PF-9811 PubMed, PF-9811-18F Books
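
The three spellings of the radiolabelled probe in the last bullet illustrate why synonym matching across sources fails. A minimal sketch of one way to reduce them to a common key follows; the two patterns handle only the variants seen here (bracketed prefix and hyphenated suffix) and are an assumption, not a general nomenclature parser. An unlabelled code that happens to end in digits plus letters would be misread as carrying an isotope.

```python
import re

# Prefix forms: [18F]... or [(18)F]...
PREFIX_ISOTOPE = re.compile(r"\[\(?(\d+)\)?([A-Z]+)\]")
# Suffix form: ...-18F
SUFFIX_ISOTOPE = re.compile(r"-(\d+)([A-Z]+)$")


def normalise_label(name):
    """Return (base code, isotope or None) for a possibly radiolabelled name."""
    s = name.strip().upper()
    m = PREFIX_ISOTOPE.search(s)
    if m:
        base = s[:m.start()] + s[m.end():]
        return base.strip("- "), m.group(1) + m.group(2)
    m = SUFFIX_ISOTOPE.search(s)
    if m:
        return s[:m.start()].strip("- "), m.group(1) + m.group(2)
    return s, None


if __name__ == "__main__":
    for variant in ("[18F]PF-9811", "[(18)F]PF-9811", "PF-9811-18F"):
        print(variant, "->", normalise_label(variant))
```

All three variants collapse to the same (code, isotope) pair, which could then serve as a join key between PubChem, PubMed and book hits.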
                                                                                 [37]
Looking at code name intersects in different
            parts of the system
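
Intersects of this kind reduce to set operations once each source has been turned into a list of normalised code names. The sketch below uses placeholder codes (XYZ-0001 and so on) and hypothetical source labels; none of the memberships are real data.

```python
def code_membership(sources):
    """Map each code name to the sorted list of sources that contain it."""
    all_codes = set().union(*sources.values())
    return {
        code: sorted(name for name, codes in sources.items() if code in codes)
        for code in sorted(all_codes)
    }


def in_all(sources):
    """Codes present in every source (the centre of the Venn diagram)."""
    return set.intersection(*sources.values())


# Placeholder data only: hypothetical codes, hypothetical memberships.
EXAMPLE = {
    "trials": {"XYZ-0001", "XYZ-0002", "XYZ-0003"},
    "pubchem": {"XYZ-0002", "XYZ-0003", "XYZ-0004"},
    "mesh": {"XYZ-0003", "XYZ-0004"},
}

if __name__ == "__main__":
    print(in_all(EXAMPLE))
    for code, found_in in code_membership(EXAMPLE).items():
        print(code, found_in)
```

Codes that appear in the trials set but in neither of the others are the candidates for the manual structure hunting described earlier.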




                                               [38]
ClinicalTrials.gov: JNJ* word cloud




  JNJ-28431754 = Canagliflozin = CID 24812758


                                                [39]
Company Pipelines: GSK codes for 2012




                                        [40]
GSK codes: PubChem vs. 2012 Pipeline




                                       [41]
Clinical Trials, PubChem, MeSH: GSK




                                      [42]
Clinical Trials, PubChem, MeSH: JNJ




                                      [43]
Clinical, PubChem, MeSH, & 2012 Pipeline: GSK




                                               [44]
Conclusions


• Stalled development candidates, designated by company codes,
  constitute a large potential repurposing information estate
• Historical in vitro, pharmacological & clinical data linked to ~ 30K codes
• But only 40-50% have structures assignable from open sources
• An even smaller proportion have code names in PubChem
• Public name>struc>data capture is ad hoc and needs improving
• Repurposing-relevant relationships are not easy to dig out
• Some “non competitive intelligence” approaches are shown here
• The big push for transparency and open access should improve
  disclosure, data capture, linkage and repurposing opportunities

                                  Happy hunting!

              TED Talk: Francis Collins: We need better drugs -- now
         http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html
                                                                                  [45]

Editor's Notes

  • #24 IUPAC in abstract converted by MeSH but not transferred to PubChem. Chemicalize.org used for conversion, matched patent sources. Therefore the structure is there but the code synonym is not. No one's responsibility to submit the code-to-struc