Navigating between patents, papers,
abstracts and databases using public
          sources and tools



       Christopher Southan (1) and Sean Ekins (2)
     (1) TW2Informatics, Göteborg, Sweden
     (2) Collaborative Drug Discovery, North Carolina, USA

                   ACS, April 2013




                                                       [1]
[2]
ACS Abstract

Engaging with chemistry in the biosciences requires navigation between
journals, patents, abstracts, databases and Google results, and connecting across
millions of structures specified only in text. The ability to do this in public
sources has been revolutionised by several trends: a) ChEMBL's capture of SAR
from journals, c) the deposition of three major automated patent extractions
(SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures, d)
open tools such as chemicalize.org, OPSIN, and OSCAR that enable the
conversion of IUPAC names or images to structures, e) the indexing of chemical
terms (e.g. InChIKeys) that turns Google searches into a merged global
repository of 40 to 50 million structures. Details of these trends, including
PubChem intersect statistics, will be presented, along with practical examples
from selected tools. New structure-sharing trends, such as patent crowdsourcing,
Dropbox, blogs, figshare and open lab notebooks, will also be considered.




                                                                                  [3]
Getting chemistry out of text and linking to data:
  some is done but we have to dig for the rest




                                                 [4]
Estimates for chemical text tombs


• Journal chemistry awaiting public extraction: ~10 to 20 million entombed?
• Majority of useful patent chemistry already publicly extracted, but ~5
  to 10 million still to go?
• PubMed abstracts and MeSH chemistry: ~0.5 million still entombed?
• Other unique, useful, text-only (i.e. no database cross-references)
  chemistry on the web: ~0.1 to 0.5 million entombed?




                                                                          [5]
What’s out there: publicly disinterred structures

    •   InChIKey in Google ~ 50 million
    •   PubChem = 48 million
    •   PubChem, rule-of-five (ROF) compliant with Mw 250-800 (lead-like) = 31 million
    •   ChemSpider = 28 million
    •   PubChem all docs (papers & patents) = 16 million
    •   PubChem patents = 15 million
    •   SureChemOpen = 13 million
    •   PubChem journal sources (PubMed + ChEMBL) = 1 million




~90% of all structures in databases have their primary origin in text sources



                                                                                [6]
Medicinal chemistry patents (tombs with lids off)

 • 18,777,229 patents, 2,208,422 WO’s (i.e. ~ 9 per family)
 • WO, C07 or A61 = 469,856
 • WO, C07D or A61K = 235,854
 • WO, C07D = 72,737 (assignee vs. year plots below)




                                                              [7]
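The ratios behind the patent counts on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the figures are copied from the slide, but reading the "~9" as all patent documents divided by WO publications is an assumption, and the percentage shares are derived values that the slide does not state.

```python
# Counts as quoted on the slide.
TOTAL_PATENTS = 18_777_229
WO_TOTAL = 2_208_422
WO_SUBSETS = {
    "C07 or A61": 469_856,
    "C07D or A61K": 235_854,
    "C07D": 72_737,
}

# Assumption: "~9" is read as all patent documents per WO publication.
docs_per_wo = TOTAL_PATENTS / WO_TOTAL

# Share of all WO publications falling in each classification subset (percent).
wo_share = {code: 100 * n / WO_TOTAL for code, n in WO_SUBSETS.items()}

print(f"documents per WO: {docs_per_wo:.1f}")
for code, pct in wo_share.items():
    print(f"WO, {code}: {pct:.1f}% of WO publications")
```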
PubMed at 22 mill:
~ 10% with chemistry (guarded tombs)




      “Free full text” = 575,513 (24%)




                                         [8]
Top-5 Med Chem journals (4% lids off tombs)




             “Free full text” = 2671 (4.3%)
                                              [9]
Growth: escaping the tombs

• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1

(both sets of CIDs cumulative)
                     [10]
Patents in PubChem:
         post-bang total vs. unique content




PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only
                                                                     [11]
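The percentages on this slide convert to absolute CID counts as follows. A minimal sketch using only the three figures quoted above; the "patent plus other sources" value is derived by subtraction and is not stated on the slide.

```python
# Figures as quoted on the slide.
PUBCHEM_CIDS = 47.3e6
FRACTION_WITH_PATENT = 0.32   # CIDs whose sources include patents
FRACTION_PATENT_ONLY = 0.20   # CIDs found only in patent sources

with_patent = PUBCHEM_CIDS * FRACTION_WITH_PATENT
patent_only = PUBCHEM_CIDS * FRACTION_PATENT_ONLY

# Derived: CIDs seen in patents and in at least one other source.
patent_and_other = with_patent - patent_only

for label, count in [
    ("include patents", with_patent),
    ("patent-only", patent_only),
    ("patent plus other sources", patent_and_other),
]:
    print(f"{label}: {count / 1e6:.1f} million CIDs")
```

The first value, about 15.1 million, agrees with the "PubChem patents = 15 million" figure given earlier in the deck.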
Citations: connections between tombs
     but still need to disinter structures

[Diagram: Papers, Abstracts and Patents; annotation: PubMed "relatedness" heuristics]




                                              [12]
Databases <> structures <> documents:
        links, but few reciprocal

[Diagram: Papers, Abstracts and Patents, annotated in order of appearance with
 0.8 mill (ChEMBL), 12K, 0.2 mill (mainly MeSH) and 15 mill]




                                                     [13]
Post-document retrieval: basic questions

1.    What is the name:IUPAC:image:other ratio in the document?
2.    Which tools might be appropriate for first-pass extractions?
3.    How many and what proportion of strucs can be extracted?
4.    Which SAR/in vivo/clinical data are linked to strucs?
5.    Which document sections include the key strucs ?
6.    Which database entries have links (back) to this document?
7.    Which strucs have InChIKey matches in Google, & database entries?
8.    Which strucs have synthesis data?
9.    What other documents specify and/or cite this struc ?
10.   Which database records for this struc have links to other documents?
11.   What relationship connections can be made using similarity searches?
12.   What intersects and differences are discernible within a document set ?



                                                                                [14]
Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
   – Semantic names (all sources)
   – Code names (press releases, papers and abstracts)
   – IUPAC names (papers, patents and abstracts)
   – Images (papers, patents, & Google images)
   – SMILES (open lab books)
   – InChI strings (open lab books)
   – SDF files (open lab books, & github)

Convert these to a structure (e.g. SDF, SMILES, InChI) then:
   – Search InChIKey in Google
   – Search major databases
   – Search SureChemOpen
   – Compare extracted sets for intersects and diffs
   – Extend exact match connectivity with similarity searching
                                                                 [15]
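The first triage step, identifying which specification type a string belongs to, can be roughed out with pattern matching. This is a heuristic sketch, not a validator: the InChIKey pattern follows the standard 14-10-1 uppercase layout, but the code-name and SMILES patterns are simplifying assumptions, and the function name `classify` is hypothetical. Anything unrecognised falls through to "name" (semantic or IUPAC), which needs a name-to-structure tool.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
# Assumption: code names look like 2-6 capitals followed by 3+ digits.
CODE_NAME_RE = re.compile(r"[A-Z]{2,6}-?\d{3,}")
# Assumption: SMILES restricted to common organic atoms, bonds and ring digits.
SMILES_RE = re.compile(r"[BCNOPSFIHbcnopslr0-9@+\-\[\]()=#$%/\\.]+")


def classify(token: str) -> str:
    """Guess the structure specification type of one text token."""
    t = token.strip()
    if t.startswith("InChI="):
        return "InChI"
    if INCHIKEY_RE.fullmatch(t):
        return "InChIKey"
    if t.lower().endswith((".sdf", ".mol")):
        return "SDF file"
    if CODE_NAME_RE.fullmatch(t):
        return "code name"
    if SMILES_RE.fullmatch(t):
        return "SMILES"
    return "name"


if __name__ == "__main__":
    for example in ["MMV390048", "InChI=1S/CH4/h1H4", "CC(=O)Oc1ccccc1C(=O)O"]:
        print(example, "->", classify(example))
```

Short all-capital words can be misread as SMILES by this pattern, so borderline hits are worth checking by eye before conversion.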
Triage example: antimalarial starting point

The MMV390048 code name is linked to an image in press reports
but is PubChem- and PubMed-negative




                         [16]
Images: convert and search

                      Real chemists sketch them in a jiffy;

   the rest of us can use OSRA: Optical Structure Recognition Application




(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
                                                                            [17]
Making connections:
image > structure > database > documents




                 CID 53311393 > ChEMBL > PubMed
                 SureChem or chemicalize.org > patent


                                                        [18]
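The CID-to-document hop can also be scripted rather than clicked through. The sketch below only builds request URLs for PubChem's PUG REST service; the helper names are hypothetical, the talk itself used the web interfaces, and the endpoint paths reflect the PUG REST layout as understood at the time of writing, so they should be checked against the current documentation before use.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties=("InChIKey", "MolecularFormula")):
    """URL for computed properties of one compound record."""
    return f"{PUG_REST}/compound/cid/{cid}/property/{','.join(properties)}/JSON"


def xref_url(cid, xref_types=("PubMedID", "PatentID")):
    """URL for document cross-references (PubMed IDs, patent IDs) of one CID."""
    return f"{PUG_REST}/compound/cid/{cid}/xrefs/{','.join(xref_types)}/JSON"


def fetch_json(url, timeout=30):
    """Fetch and decode a JSON response (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # CID taken from the slide; the URLs are printed, not fetched.
    print(property_url(53311393))
    print(xref_url(53311393))
```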
Patent SAR from WO2011086531:
Collating activities via SureChemOpen

     CID 53311393 >




                                        [19]
Patent SAR results: top-20 from 39 IC50s




                                           [20]
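Ranking collated activities to get a top-20 is a simple sort once the IC50s are in a table. A minimal sketch, assuming each record is a dict with an `ic50_nM` field; the record layout, the function name and all values below are placeholders for illustration and are not the patent data.

```python
def top_n_by_potency(records, n=20, key="ic50_nM"):
    """Return the n most potent records (lowest IC50 first), skipping blanks."""
    measured = [r for r in records if r.get(key) is not None]
    return sorted(measured, key=lambda r: r[key])[:n]


# Placeholder data only: 39 made-up records plus one without a value.
records = [{"example": f"Ex{i}", "ic50_nM": float(10 * i)} for i in range(1, 40)]
records.append({"example": "Ex40", "ic50_nM": None})

top20 = top_n_by_potency(records)
print(len(top20), top20[0]["example"], top20[-1]["example"])
```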
Results > figshare




http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
                                                               [21]
Structures > MyNCBI




http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
                                                                         [22]
SAR Table: iOS app from Molecular Materials Informatics

SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF ->
Dropbox -> SAR Table -> edit in data, R-group decompose -> share


                           [23]
InChIKey in Google: instant orthogonal joining




                                                 [24]
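An InChIKey search is just a quoted exact-match query, so the link can be generated for any key. A minimal sketch: the function name is hypothetical, and the example key is the well-known one for aspirin, used only for illustration rather than the compound from the slides.

```python
import re
from urllib.parse import quote_plus

INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def google_inchikey_url(inchikey: str) -> str:
    """Build a Google search URL for an exact-match InChIKey query."""
    key = inchikey.strip()
    if not INCHIKEY_RE.fullmatch(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    # Double quotes force an exact string match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{key}"')


if __name__ == "__main__":
    print(google_inchikey_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Searching on the first 14-character block alone would be a looser variant that ignores the stereo and isotope layer; that needs a different validation pattern from the one sketched here.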
Chemicalize.org: 413 strucs from WO2011086531



CID 53311393 ->




                                            [25]
Using OPSIN and chemicalize.org to fix
     recalcitrant IUPACs from WO2011086531




Can quasi-manually extract ~ 10 more “split IUPAC” examples
                                                              [26]
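Part of the quasi-manual repair can be scripted before a name is handed to a name-to-structure tool. The sketch below assumes that a "split IUPAC" is a name broken by line breaks or stray spaces next to hyphens, commas or brackets; that reading, the function name and the example fragment are all illustrative assumptions. Spaces between whole words (e.g. before "acid") are deliberately kept.

```python
import re


def rejoin_split_name(raw: str) -> str:
    """Rejoin an IUPAC-style name broken across lines or by stray spaces."""
    # Collapse line breaks and runs of whitespace to single spaces.
    text = " ".join(raw.split())
    # Drop a space that follows a hyphen, comma or opening bracket.
    text = re.sub(r"(?<=[-,(\[{]) ", "", text)
    # Drop a space that precedes a hyphen, comma or closing bracket.
    text = re.sub(r" (?=[-,)\]}])", "", text)
    return text


if __name__ == "__main__":
    broken = "3-(4-methylsulfonylphenyl)-\n   pyridin-2-amine"
    print(rejoin_split_name(broken))
```

A hyphen at a line end may be a typesetting hyphen rather than part of the name, so rejoined names still need a conversion check.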
Clustering document extraction sets: CheS-Mapper




  WO2011086531 -> chemicalize.org -> 413 cpds download ->
  CheS-Mapper -> cluster 8 -> export 53 cpds

                                                            [27]
PubChem -> ChEMBL -> PMID -> assay -> strucs
                   • CHEMBL2041980 (structure)
                   • PMID 22390538 (paper)
                   • CHEMBL2045642 (assay for 32 strucs
                     from paper)
                   • The 32 CIDs all have patent matches




                                                       [28]
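The structure and assay records named on this slide can be addressed directly through the ChEMBL web services. A minimal sketch that only builds the URLs: the helper names are hypothetical, the paths follow the ChEMBL REST data API as understood at the time of writing, and the talk itself navigated through the web pages.

```python
from urllib.parse import urlencode

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(chembl_id: str) -> str:
    """URL for one ChEMBL molecule record in JSON."""
    return f"{CHEMBL_API}/molecule/{chembl_id}.json"


def assay_activities_url(assay_chembl_id: str, limit: int = 100) -> str:
    """URL listing the activity rows reported for one ChEMBL assay."""
    query = urlencode({"assay_chembl_id": assay_chembl_id, "limit": limit})
    return f"{CHEMBL_API}/activity.json?{query}"


if __name__ == "__main__":
    # Identifiers taken from the slide.
    print(molecule_url("CHEMBL2041980"))
    print(assay_activities_url("CHEMBL2045642"))
```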
Venny: intersects, diffs, de-dupes and merges


                                 1) WO2011086531 matches in PubChem

                                 2) CheS-Mapper cluster 8 from WO2011086531

                                 3) ChEMBL assayed cpds from PMID 22390538

                                 (handles any regular strings, e.g. db IDs,
                                 SMILES, InChI or InChIKey)


                                                        [29]
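The same three-way comparison can be reproduced offline with plain set operations on identifier strings. A minimal sketch: the function name and region labels are hypothetical, and the short keys below are placeholders standing in for InChIKeys or database IDs.

```python
def venn3(a, b, c):
    """Return every region of a three-set Venn diagram plus the merged set."""
    a, b, c = set(a), set(b), set(c)  # converting to sets also de-dupes
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "all_three": a & b & c,
        "merged": a | b | c,
    }


# Placeholder identifiers only.
patent_matches = ["K1", "K2", "K3", "K4", "K2"]  # contains a duplicate
cluster_8 = ["K2", "K3", "K5"]
chembl_assayed = ["K3", "K4", "K6"]

regions = venn3(patent_matches, cluster_8, chembl_assayed)
for name, members in regions.items():
    print(name, sorted(members))
```

Exact string matching means the three lists must use the same identifier type and formatting before comparison.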
The open toolbox facilitates extraction and
  collation of 10 to 30 million structures
             entombed in text




                                              [30]
Conclusions

• The ability to extract chemical structures from text and web sources
  has been transformed by an expansion of the public toolbox
• The PubChem big bang increases the probability that extracted structures
  have exact or similarity matches in databases
• Paradoxically, the patent corpus is now completely open while access
  to journal text is still restricted
• However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target
  mapped structures from ~ 50K papers
• The submission of ~15 mill. patent structures to PubChem ensures at
  least representation from the majority of medicinal chemistry patents
  (many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
  expanding set of options for surfacing their results.



                                                                          [31]
You can find me @...CDD Booth 205

PAPER ID: 13433
PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical analyses”
April 8th 8.35am Room 349

PAPER ID: 14750
PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models”
April 9th 1.30pm Room 353

PAPER ID: 21524
PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and tools”
April 9th 3.50pm Room 350

PAPER ID: 13358
PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets”
April 10th 8.30am Room 357

PAPER ID: 13382
PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates”
April 10th 10.20am Room 350

PAPER ID: 13438
PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery”
April 10th 3.05pm Room 350
                                                                           [32]

Using In Silico Tools in Repurposing Drugs for Neglected and Orphan Diseases
 
Five Ways to Use Social Media to Raise Awareness for Your Paper or Research
Five Ways to Use Social Media to Raise Awareness for Your Paper or ResearchFive Ways to Use Social Media to Raise Awareness for Your Paper or Research
Five Ways to Use Social Media to Raise Awareness for Your Paper or Research
 
Open zika presentation
Open zika presentation Open zika presentation
Open zika presentation
 
academic / small company collaborations for rare and neglected diseasesv2
 academic / small company collaborations for rare and neglected diseasesv2 academic / small company collaborations for rare and neglected diseasesv2
academic / small company collaborations for rare and neglected diseasesv2
 
CDD models case study #3
CDD models case study #3 CDD models case study #3
CDD models case study #3
 
CDD models case study #2
CDD models case study #2 CDD models case study #2
CDD models case study #2
 
CDD Models case study #1
CDD Models case study #1 CDD Models case study #1
CDD Models case study #1
 
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
Using Machine Learning Models Based on Phenotypic Data to Discover New Molecu...
 
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
CDD: Vault, CDD: Vision and CDD: Models software for biologists and chemists ...
 
The future of computational chemistry b ig
The future of computational chemistry b igThe future of computational chemistry b ig
The future of computational chemistry b ig
 
#ZikaOpen: Homology Models -
#ZikaOpen: Homology Models - #ZikaOpen: Homology Models -
#ZikaOpen: Homology Models -
 
Slas talk 2016
Slas talk 2016Slas talk 2016
Slas talk 2016
 
Pros and cons of social networking for scientists
Pros and cons of social networking for scientistsPros and cons of social networking for scientists
Pros and cons of social networking for scientists
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

Navigating between patents, papers, abstracts and databases using public sources and tools

  • 1. Navigating between patents, papers, abstracts and databases using public sources and tools Christopher Southan1 and Sean Ekins2 TW2Informatics, Göteborg, Sweden, Collaborative Drug Discovery, North Carolina, USA ACS, April 2013 [1]
  • 2. [2]
  • 3. ACS Abstract Engaging with chemistry in the biosciences requires navigation between journals, patents, abstracts, databases, Google results and connecting across millions of structures specified only in text. The ability to do this in public sources has been revolutionised by several trends a) ChEMBL's capture of SAR from journals c) the deposition of three major automated patent extractions (SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures, d) open tools such as chemicalize.org, OPSIN, and OSCAR that enable the conversion of IUPAC names or images to structures e) the indexing of chemical terms (e.g. InChIKeys) that turn Google searches into a merged global repository of 40 to 50 million structures. Details of these trends, including PubChem intersect statistics, will be presented, along with practical examples from selected tools. New structure sharing trends will also be considered such as patent crowdsourcing, dropbox, blogs, figshare and open lab notebooks. [3]
  • 4. Getting chemistry out of text and linking to data: some is done but we have to dig for the rest [4]
  • 5. Estimates for chemical text tombs • Journal chemistry public extraction, ~10 to 20 million entombed? • Majority of useful patent chemistry already publicly extracted, but ~5 to 10 million still to go? • PubMed abstracts and MeSH chemistry ~0.5 million still entombed? • Other unique, useful, text-only (i.e. no database cross-references) chemistry on the web ~0.1 to 0.5 million entombed? [5]
  • 6. What’s out there: publicly disinterred structures • InChIKey in Google ~ 50 million • PubChem = 48 million • PubChem ROF + 250-800 Mw (lead-like) = 31 million • ChemSpider = 28 million • PubChem all docs (papers & patents) = 16 million • PubChem patents = 15 million • SureChemOpen = 13 million • PubChem journal sources (PubMed + ChEMBL) = 1 million ~90% of all structures in databases have their primary origin in text sources [6]
  • 7. Medicinal chemistry patents (tombs with lids off) • 18,777,229 patents, 2,208,422 WO’s (i.e. ~ 9 per family) • WO, C07 or A61 = 469,856 • WO, C07D or A61K = 235,854 • WO, C07D = 72,737 (assignee vs. year plots below) [7]
  • 8. PubMed at 22 mill: ~ 10% with chemistry (guarded tombs) “Free full text” = 575,513 (24%) [8]
  • 9. Top-5 Med Chem journals (4% lids off tombs) “Free full text” = 2671 (4.3%) [9]
  • 10. Growth: (escaping the tombs) • Patent “big bang” (SureChem & SCRIPDB in 2012) • Literature “slow burn” (ChEMBL 2009 jump) • Paradox - patents:papers 15:1 (both sets of CIDs cumulative) [10]
  • 11. Patents in PubChem: post-bang total vs. unique content PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only [11]
  • 12. Citations: connections between tombs but still need to disinter structures Papers Abstracts PubMed Patents "relatedness" heuristics [12]
  • 13. Databases <> structures < > documents: links, but few reciprocal Papers Abstracts 0.8 mill (ChEMBL) 12K 0.2 mill (mainly MeSH) Patents 15 mill [13]
  • 14. Post-document retrieval: basic questions
      1. What is the name:IUPAC:image:other ratio in the document?
      2. Which tools might be appropriate for first-pass extractions?
      3. How many and what proportion of strucs can be extracted?
      4. Which SAR/in vivo/clinical data is linked to strucs?
      5. Which document sections include the key strucs?
      6. Which database entries have links (back) to this document?
      7. Which strucs have InChIKey matches in Google, & database entries?
      8. Which strucs have synthesis data?
      9. What other documents specify and/or cite this struc?
      10. Which database records for this struc have links to other documents?
      11. What relationship connections can be made using similarity searches?
      12. What intersects and differences are discernible within a document set?
      [14]
  • 15. Triaging document or webpage chemistry
      • Identify the structure specification types, e.g.
          – Semantic names (all sources)
          – Code names (press releases, papers and abstracts)
          – IUPAC names (papers, patents and abstracts)
          – Images (papers, patents, & Google images)
          – SMILES (open lab books)
          – InChI strings (open lab books)
          – SDF files (open lab books, & github)
      Convert these to a structure (e.g. SDF, SMILES, InChI) then:
          – Search InChIKey in Google
          – Search major databases
          – Search SureChemOpen
          – Compare extracted sets for intersects and diffs
          – Extend exact match connectivity with similarity searching
      [15]
  • 16. Triage example: antimalarial starting point The MMV390048 code name is linked to an image in press reports but is PubChem and PubMed -ve [16]
  • 17. Images: convert and search Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3) [17]
  • 18. Making connections: image > structure > database > documents CID 53311393 > ChEMBL > PubMed SureChem or chemicalize.org > patent [18]
  • 19. Patent SAR from WO2011086531: Collating activities via SureChemOpen CID 53311393 > [19]
  • 20. Patent SAR results: top-20 from 39 IC50s [20]
  • 23. SAR Table: iOS app from Molecular Materials Informatics SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share [23]
  • 24. InChIKey in Google: instant orthogonal joining [24]
  • 25. Chemicalize.org: 413 strucs from WO2011086532 CID 53311393 -> [25]
  • 26. Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532 Can quasi-manually extract ~ 10 more “split IUPAC” examples [26]
  • 27. Clustering document extraction sets: CheS-Mapper WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds [27]
  • 28. PubChem -> ChEMBL -> PMID -> assay -> strucs • CHEMBL2041980 (structure) • PMID 22390538 (paper) • CHEMBL2045642 (assay for 32 strucs from paper) • The 32 CIDs all have patent matches • [28]
  • 29. Venny: intersects, diffs, de-dupes and merges 1) WO2011086531 matches in PubChem 2) CheS-Mapper cluster 8 from WO2011086532 3) ChEMBL assayed cpds from PMID 22390538 (handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey) [29]
  • 30. The open toolbox facilitates extraction and collation of 10 to 30 million structures entombed in text [30]
  • 31. Conclusions
      • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox
      • The PubChem big-bang increases probability of extraction having database exact or similarity matches
      • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted
      • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target mapped structures from ~ 50K papers
      • The submission of ~15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers)
      • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results.
      [31]
  • 32. You can find me @...CDD Booth 205
      • PAPER ID: 13433 PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical analyses” April 8th 8.35am Room 349
      • PAPER ID: 14750 PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models” April 9th 1.30pm Room 353
      • PAPER ID: 21524 PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and tools” April 9th 3.50pm Room 350
      • PAPER ID: 13358 PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets” April 10th 8.30am Room 357
      • PAPER ID: 13382 PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates” April 10th 10.20am Room 350
      • PAPER ID: 13438 PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery” April 10th 3.05 pm Room 350
      [32]
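
The "Search InChIKey in Google" step from slide 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `classify_identifier` and `google_query_url` are hypothetical, the check relies on the standard 14-10-1 uppercase layout of an InChIKey, and the example key is the well-known one for aspirin rather than a compound from the slides.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_identifier(text: str) -> str:
    """Rough triage of a string pulled from a document (hypothetical helper)."""
    s = text.strip()
    if s.startswith("InChI="):
        return "InChI"
    if INCHIKEY_RE.match(s):
        return "InChIKey"
    # SMILES, IUPAC names and code names need real chemistry tools to tell apart.
    return "other"


def google_query_url(inchikey: str) -> str:
    """Build an exact-phrase web search URL for an InChIKey."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


# Illustrative key only (aspirin), not taken from the presentation.
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

if __name__ == "__main__":
    print(classify_identifier(ASPIRIN_KEY))
    print(google_query_url(ASPIRIN_KEY))
```

Quoting the key asks the search engine for an exact match, which is what makes the InChIKey useful as a join between databases, blogs and shared files.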

Editor's Notes

  1. 70 million substances in CAS suggest a 20-30 million shortfall (i.e. SciFinder only) but they include virtuals and libraries. SureChem will continue patent extraction but expect an asymptote of true novels only soon. PubMed capture is largely dependent on MeSH, but a lot of IUPAC chemistry is only annually updated, and some is not captured. SureChem, IBM and chemicalize all indicate that, including MeSH terms, at least 0.5 million structures could be extracted from PubMed. No idea how much web-unique chemistry (not in documents or databases) is out there, but open lab books will increase this.
  2. InChIKeys: estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 is a very approximate circumscription of the property space that has some possibility of bioactivity. Probably a proportion of vendor structures may have never been committed to text. There are some virtuals “out there”, including some patent extractions, but they are difficult to estimate.
  3. Note the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note the big pharma patent decline post-2008. The average number of exemplified cpds with activity data per patent (family) is unknown, but GVK's curation average is ~50.
  4. Using the top level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~¼, but there are a lot of biological journals in this set.
  5. Select the core journals used for med chem extraction by GVKBIO and ChEMBL. Not a large corpus. Both extract ~15 cpds per paper. Note the proportion of “free full text” is low.
  6. Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
  7. PubChem hit 15 million patent-derived structures in March 2013. The largest unique content is SureChemOpen. Thomson uniqueness is low because a) they include at least 30% journal extractions and b) the Derwent WPI content (was) also in DiscoveryGate. IBM are only pre-2000 patents and the extracted content overlaps with other sources.
  8. Citations are a core tradition but they do not provide direct structure <-> structure links. Patents cite papers but papers rarely cite patents (with the exception of patent reviews).
  9. Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology the major patent offices could put links in the PDFs but are unlikely to do so.
  10. The problem “how do I find the chemistry out there relevant to my interests” is a general search retrieval recall and specificity challenge that cannot be addressed here. Beyond PubMed and Google it’s getting better (e.g. indexing of full text patents) but there are still issues (e.g. text mining of chemical journals is still very restricted). Once you have found the documents or text, these are the typical set of questions you might want to address, especially in regard to choosing which tools are best for the job.
  11. Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  12. Self-explanatory. Note my blog post was indexed.
  13. The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
  14. SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note none of these sources have MMV390048 as a synonym, so all the connections are via structure.
  15. We can start off with patent links. Note in this case numbered image capture, as opposed to the IUPAC listing, was important to manually collate the structure against the correct IC50.
  16. From manual cross-checking between the individual example structures and the IC50 table, the Excel sheet can be populated.
  17. Useful way to share results that is citable. Indexed in Google but no live links in the Excel sheet (yet).
  18. Can upload CID lists and download as a saved and public collection
  19. This is the Pistoia / Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
  20. InChIKey search picks up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
  21. The CID links straight through to chemicalize and will just re-extract the whole patent in a few seconds. The 413 gave 358 hits in PubChem.
  22. IUPAC names have a lot of usage variants and OCR mistakes. Typically gaps, line breaks, 1 instead of 1 and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
  23. Total extractions from patents can include a lot of low Mw common reagent chemistry. CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  24. ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID but can via the assay ID that is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
  25. Very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
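
The set operations described in the last note (intersects, diffs, de-dupes and merges over any regular strings) can also be reproduced locally. The following is a minimal Python sketch, not the Venny tool itself; the function names and the placeholder identifiers are hypothetical and stand in for real InChIKeys, CIDs or SMILES.

```python
def dedupe(ids):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def compare_three(a, b, c):
    """Return the main Venn regions for three lists of identifier strings."""
    sa, sb, sc = set(a), set(b), set(c)
    return {
        "all_three": sa & sb & sc,
        "a_and_b_only": (sa & sb) - sc,
        "only_a": sa - sb - sc,
        "only_b": sb - sa - sc,
        "only_c": sc - sa - sb,
        "merged": sa | sb | sc,
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real compounds from the presentation.
    patent_matches = ["ID-1", "ID-2", "ID-3", "ID-3"]
    cluster_export = ["ID-2", "ID-3", "ID-4"]
    paper_assayed = ["ID-3", "ID-5"]

    regions = compare_three(dedupe(patent_matches), cluster_export, paper_assayed)
    for name, members in regions.items():
        print(name, sorted(members))
```

Comparing on a canonical string such as the InChIKey, rather than on names or raw SMILES, is what makes the intersects meaningful across patent, paper and database sources.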