Navigating between patents, papers, abstracts and databases using public sources and tools

Christopher Southan (1) and Sean Ekins (2)
(1) TW2Informatics, Göteborg, Sweden
(2) Collaborative Drug Discovery, North Carolina, USA

ACS, April 2013

[1]

ACS Abstract

Engaging with chemistry in the biosciences requires navigation between
journals, patents, abstracts, databases and Google results, and connecting across
millions of structures specified only in text. The ability to do this in public
sources has been revolutionised by several trends: a) ChEMBL's capture of SAR
from journals; c) the deposition of three major automated patent extractions
(SureChem, IBM and SCRIPDB) in PubChem for over 15 million structures; d)
open tools such as chemicalize.org, OPSIN, and OSCAR that enable the
conversion of IUPAC names or images to structures; e) the indexing of chemical
terms (e.g. InChIKeys) that turns Google searches into a merged global
repository of 40 to 50 million structures. Details of these trends, including
PubChem intersect statistics, will be presented, along with practical examples
from selected tools. New structure-sharing trends will also be considered, such
as patent crowdsourcing, Dropbox, blogs, figshare and open lab notebooks.

[3]

Getting chemistry out of text and linking to data:
some is done but we have to dig for the rest

[4]

Estimates for chemical text tombs

• Journal chemistry, public extraction: ~10 to 20 million still entombed?
• Majority of useful patent chemistry already publicly extracted, but ~5
to 10 million still to go?
• PubMed abstracts and MeSH chemistry: ~0.5 million still entombed?
• Other unique, useful, text-only (i.e. no database cross-references)
chemistry on the web: ~0.1 to 0.5 million entombed?

[5]

What’s out there: publicly disinterred structures

• InChIKey in Google ~ 50 million
• PubChem = 48 million
• PubChem ROF + 250-800 Mw (lead-like) = 31 million
• ChemSpider = 28 million
• PubChem all docs (papers & patents) = 16 million
• PubChem patents = 15 million
• SureChemOpen = 13 million
• PubChem journal sources (PubMed + ChEMBL) = 1 million

~90% of all structures in databases have their primary origin in text sources

[6]

Medicinal chemistry patents (tombs with lids off)

• 18,777,229 patents, 2,208,422 WOs (i.e. ~9 per family)
• WO, C07 or A61 = 469,856
• WO, C07D or A61K = 235,854
• WO, C07D = 72,737 (assignee vs. year plots below)

[7]

PubMed at 22 mill:
~ 10% with chemistry (guarded tombs)

“Free full text” = 575,513 (24%)

[8]

Top-5 Med Chem journals (4% lids off tombs)

“Free full text” = 2671 (4.3%)
[9]

Growth: (escaping the tombs)

• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1

(both sets of CIDs cumulative)
[10]

Patents in PubChem:
post-bang total vs. unique content

PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only
[11]

Citations: connections between tombs
but still need to disinter structures

[Diagram labels: Papers; Abstracts; PubMed; Patents; "relatedness" heuristics]

[12]

Databases <> structures <> documents:
links, but few reciprocal

Papers Abstracts

0.8 mill
(ChEMBL)

12K 0.2 mill (mainly MeSH)

Patents

15 mill

[13]

Post-document retrieval: basic questions

1. What is the name:IUPAC:image:other ratio in the document?
2. Which tools might be appropriate for first-pass extractions?
3. How many and what proportion of strucs can be extracted?
4. Which SAR/in vivo/clinical data is linked to strucs?
5. Which document sections include the key strucs?
6. Which database entries have links (back) to this document?
7. Which strucs have InChIKey matches in Google, & database entries?
8. Which strucs have synthesis data?
9. What other documents specify and/or cite this struc?
10. Which database records for this struc have links to other documents?
11. What relationship connections can be made using similarity searches?
12. What intersects and differences are discernible within a document set?

[14]

Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books, & github)

Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
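
As a rough illustration of the first triage step, the sketch below guesses the specification type of a single text token. The function name and both regular expressions are assumptions made for this example: the InChIKey pattern follows the standard 14-10-1 letter layout, while the SMILES pattern is only a heuristic for the organic subset and is not a parser. Anything unrecognised is left for a name-to-structure or image converter.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Heuristic only: organic-subset atoms, bracket atoms, bonds, branches, ring digits.
SMILES_RE = re.compile(
    r"(?:Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\[\]]+\]|[-=#$:/\\.()%]|\d)+"
)


def classify_identifier(text: str) -> str:
    """Guess whether a token is an InChI, InChIKey, SMILES, or a name/code."""
    token = text.strip()
    if token.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.fullmatch(token):
        return "inchikey"
    # Require at least one letter so that bare numbers are not called SMILES.
    if SMILES_RE.fullmatch(token) and re.search(r"[A-Za-z]", token):
        return "smiles"
    # Semantic names, code names and IUPAC names all end up here.
    return "name_or_code"
```

Short upper-case abbreviations made only of C, N, O, P, S, F, I or B will be misread as SMILES, so results are best treated as a first-pass sort before conversion and searching.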
[15]

Triage example: antimalarial starting point

The MMV390048 code name is linked to an image in press reports
but is PubChem- and PubMed-negative

[16]

Images: convert and search

Real chemists sketch them in a jiffy;

the rest of us can use OSRA: Optical Structure Recognition Application

(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
[17]

Making connections:
image > structure > database > documents

CID 53311393 > ChEMBL > PubMed
SureChem or chemicalize.org > patent
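
For readers who prefer to script the CID lookup, a minimal sketch follows. It only builds a request URL in the PubChem PUG REST style; the helper name and the choice of properties are assumptions for illustration, and fetching or parsing the response is left out.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(cid, properties=("InChIKey", "MolecularFormula")):
    """Build a PUG REST URL requesting selected properties for one CID as JSON."""
    props = ",".join(properties)
    # Keep commas unescaped so the property list stays readable in the URL.
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{quote(props, safe=',')}/JSON"


# The CID from this slide:
print(pubchem_property_url(53311393))
```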

[18]

Patent SAR from WO2011086531:
Collating activities via SureChemOpen

CID 53311393 >

[19]

Patent SAR results: top-20 from 39 IC50s

[20]

Results > figshare

http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
[21]

Structures > MyNCBI

http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
[22]

SAR Table: iOS app from Molecular Materials Informatics

SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF ->
Dropbox -> SAR Table -> edit in data, R-group decompose -> share

[23]

InChIKey in Google: instant orthogonal joining
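
A small sketch of how such a search can be composed: the InChIKey is quoted so that Google treats it as an exact string. The function name is hypothetical, and the example key is the widely listed aspirin InChIKey, used here only as a stand-in. Searching the first 14-letter block alone relaxes the match to the same connectivity skeleton.

```python
from urllib.parse import urlencode


def google_inchikey_url(inchikey, skeleton_only=False):
    """Return a Google search URL for an exact (quoted) InChIKey match."""
    key = inchikey.strip().upper()
    # The first block of an InChIKey encodes the connectivity layer only.
    term = key.split("-")[0] if skeleton_only else key
    return "https://www.google.com/search?" + urlencode({"q": f'"{term}"'})


print(google_inchikey_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(google_inchikey_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", skeleton_only=True))
```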

[24]

Chemicalize.org: 413 strucs from WO2011086532

CID 53311393 ->

[25]

Using OPSIN and chemicalize.org to fix
recalcitrant IUPACs from WO2011086532

Can quasi-manually extract ~ 10 more “split IUPAC” examples
[26]

Clustering document extraction sets: CheS-Mapper

WO2011086531 -> chemicalize.org -> 413 cpds download ->
CheS-Mapper -> cluster 8 -> export 53 cpds

[27]

PubChem -> ChEMBL -> PMID -> assay -> strucs
• CHEMBL2041980 (structure)
• PMID 22390538 (paper)
• CHEMBL2045642 (assay for 32 strucs
from paper)
• The 32 CIDs all have patent matches

[28]

Venny: intersects, diffs, de-dupes and merges

1) WO2011086531 matches in PubChem

2) CheS-Mapper cluster 8 from WO2011086532

3) ChEMBL assayed cpds from PMID 22390538

(handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
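
The same intersect, diff, de-dupe and merge operations can be sketched with plain Python sets. This is not how Venny works internally, just an equivalent for scripted use; the function name is an assumption and the one-letter identifiers are placeholders rather than real compound IDs.

```python
def compare_sets(named_lists):
    """De-duplicate each list, then report merged, common and list-unique items."""
    # Building sets strips whitespace, drops empty strings and removes duplicates.
    sets = {
        name: {item.strip() for item in items if item.strip()}
        for name, items in named_lists.items()
    }
    merged = set().union(*sets.values())
    common = set.intersection(*sets.values()) if sets else set()
    unique = {
        name: members - set().union(*(s for n, s in sets.items() if n != name))
        for name, members in sets.items()
    }
    return {"merged": merged, "common": common, "unique": unique}


result = compare_sets({
    "patent_pubchem": ["A", "B", "B", "C"],  # placeholder IDs
    "cluster_8": ["B", "C", "D"],
    "chembl_assay": ["C", "E"],
})
print(sorted(result["common"]))
```

Any regular string works as a key, so the inputs can be database IDs, SMILES, InChI or InChIKeys, provided each list uses the same identifier type.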

[29]

The open toolbox facilitates extraction and
collation of 10 to 30 million structures
entombed in text

[30]

Conclusions

• The ability to extract chemical structures from text and web sources
has been transformed by an expansion of the public toolbox
• The PubChem big-bang increases the probability that extracted structures
have exact or similarity matches in databases
• Paradoxically, the patent corpus is now completely open while access
to journal text is still restricted
• However, ChEMBL has extracted ~0.8 mill. SAR-linked and target-mapped
structures from ~50K papers
• The submission of ~15 mill. patent structures to PubChem ensures at
least representation from the majority of medicinal chemistry patents
(many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
expanding set of options for surfacing their results.

[31]

You can find me @...CDD Booth 205

PAPER ID: 13433
PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical
analyses”
April 8th 8.35am Room 349

PAPER ID: 14750
PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery
Using Bayesian Models”
April 9th 1.30pm Room 353

PAPER ID: 21524
PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and
tools”
April 9th 3.50pm Room 350

PAPER ID: 13358
PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets”

PAPER ID: 13382
PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided
repurposing candidates”

PAPER ID: 13438
PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery”
April 10th 3.05pm Room 350

[32]
