Mining Drug Targets, Structures and Activity Data

Mining Drug Targets, Structures and
Activity Data Using Open Full-Text
Patent Sources and Web Tools

Christopher Southan

ChrisDS Consulting, Göteborg, Sweden

Prepared for BioIT, Boston, April 2012,
Track 11, Open Source Solutions, Wednesday, 13:45

[1]

Key Relationships
Extractable from Patents and Papers
MAQALPWLLLWMGAGVLPAHGTQHGIRLPLRSGLGGA
PLGLRLPRETDEEPEEPGRRGSFVEMVDNLRGKSGQGY
YVEMTVGSPPQTLNILVDTGSSNFAVGAAPHPFLHRYYQ
RQLSSTYRDLRKGVYVPYTQGKWEGELGTDLVSIPHGP
NVTVRANIAAITESDKFFINGSNWEGILGLAYAEIARPDD
SLEPFFDSLVKQTHVPNLFSLQLCGAGFPLNQSEVLASV
GGSMIIGGIDHSLYTGSLWYTPIRREWYYEVIIVRVEINGQ
DLKMDCKEYNYDKSIVDSGTTNLRLPKKVFEAAVKSIKA
ASSTEKFPDGFWLGEQLVCWQAGTTPWNIFPVISLYLM
GEVTNQSFRITILPQQYLRPVEDVATSQDDCYKFAISQSS
TGTVMGAVIMEGFYVVFDRARKRIGFAVSACHVHDEFRT
AAVEGPFVTLDMEDCGYNIPQTDESTLMTIAYVMAAICAL
FMLPLCLMVCQWRCLRCLRQQHDDFADDISLLK

Document > Assay > Result > Compound > Target

2011 PMID 21569515

2010 doi:10.1007/978-3-642-15120-0_9

Important ”bag of targets” exceptions (e.g. bacterial/parasite whole cells)
[3]

The Good News: Patent Mining Utility
• Novel bioactive chemical structures related to drug discovery exceed those in
journals by at least five-fold.
• Encompass academic, as well as commercial, global med. chem. output.
• Targets, assays, mechanisms of action, disease descriptions and in-vivo data.
• ~ 70% of data initially patent-only, some never disclosed elsewhere.
• Include synthetic descriptions and other useful enabling information.
• Precede journal or meeting reports by ~ 1.5 to 5 years.
• Can be complementary to papers (e.g. larger SAR matrix).
• Intersect with papers at chemistry, target, disease, author and citation levels
• IP exploitable for Neglected Tropical Disease research becoming ”open”.

[4]

The Bad News: Patent Mining Can be Tough
• High-specificity retrieval of relevant documents difficult
• Massive chaff-to-wheat ratio in 100s of pages
• Differences in layout, house style and data location
• Markush permutation
• Variability in IUPAC strings and image rendering
• Use of non-standard gene/protein names
• Obfuscation via:
– Qualitative or binned assay results
– Structure-to-data links non-obvious, patchy or absent
– Less than 50% of titles include target names
– The ”hiding the lead and core structures” game
– Blunderbuss disease and use exemplifications
– Tense ambiguity (i.e. ”could be” vs. ”was” done)
• Quality judgments difficult
• Patents cite papers and patents but few papers cite patents
• Document redundancy of Kind codes, patent families and equivalents
• Finding drug candidate first-filings is difficult
• The PDF hamburger problem and OCR noise
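
The kind-code redundancy noted above can be reduced before any manual triage. A minimal sketch, assuming a simplified publication-number pattern (two-letter country code, digits, optional kind code); the function names are hypothetical, the kind codes and the US number are illustrative only, and patent families or equivalents are not handled:

```python
import re

# Assumed, simplified pattern: country code, serial digits, optional kind code
# such as A1 or B2. Real publication numbers have more variants than this.
_PUBLICATION = re.compile(r"^([A-Z]{2})[\s-]*(\d+)[\s-]*([A-Z]\d?)?$")


def strip_kind_code(publication_number: str) -> str:
    """Return country code plus serial number, dropping any kind code."""
    cleaned = publication_number.strip().upper()
    match = _PUBLICATION.match(cleaned)
    if match is None:
        # Leave anything unrecognised untouched rather than guess.
        return cleaned
    country, serial, _kind = match.groups()
    return f"{country}{serial}"


def collapse_kind_codes(publication_numbers):
    """Group publication numbers that differ only by kind code."""
    groups = {}
    for number in publication_numbers:
        groups.setdefault(strip_kind_code(number), []).append(number)
    return groups


# Illustrative hit list: kind codes and the US number are made up.
hits = ["EP2391601A1", "EP 2391601 B1", "US1234567B2"]
for key, members in collapse_kind_codes(hits).items():
    print(key, members)
```

Numbers that do not fit the assumed pattern are passed through unchanged, so nothing is silently merged.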
[5]

Reasons for Rolling-your-own Patent
Chemistry and Data Extraction
• Limited budget
• You are likely to be a tacit super-curator by profession
• Best-of-both-worlds synergy with licensed sources (e.g. digging deeper)
• Combine automated outputs with manual triage
• Develop a technical understanding and comparison of vendor offerings
• Commercial dbs cap the number of manually-extracted examples
• Need SAR analogues for a few targets rather than many (e.g. mechanistic
enzymology or systems chemical biology)
• Only require data sampling across specific disease areas
• Not overly concerned about false-negatives (i.e. don’t need
comprehensive prior-art check or scoping of claims)
• Open tools operate on any text or web source, not just patents
• You may already have commercial text mining capability
• Flexibility of intersecting patent with literature chemistry (e.g. ChEMBL,
journals you subscribe to, PubMed and PMC)
• You can slice-and-dice PubChem patent chemistry in ways
complementary to commercial databases

[6]

Open Sources and Tools Overview
• Searching metadata, abstracts and text
– Official public portals: EPO/Espacenet, USPTO, WIPO, EBI CiteExplore
– Open full-text: FreePatentsOnline, Google patents, Google Scholar, et al.
• Metadata, full-text and chemical structure search - SureChemOpen
• Bulk name-to-structure conversion - ChemAxon Chemicalize
• Individual name-to-structure - OPSIN
• Conversion of images to structures - OSRA
• Sketcher inputs – many options
• Corroborative search in SureChemOpen, PubChem, ChemSpider, Chemicalize
• EPO patent number searching in PubChem
• PDF24.org for cutting pages and OnlineOCR.net for sections or tables
• Utopia bioentity mark-up
(those below not included in this presentation but relevant)
• NCI/CADD Chemical Identifier Resolver and Online SMILES Translator
• Open cheminformatics tools – CDK, ChemViz, Taverna, OpenBabel etc.
• OSCAR/PatentEye, Murray-Rust group, organic-reaction.com Laconde et al,
SCRIPDB, Juristica group
(n.b. Google should give urls for all these source and tool names)
[7]

So What’s in PubChem?

[8]

PubChem Patent-derived Content ~6 million

• ~ 2.8 million Discovery Gate/Thomson Pharma intersect mainly Derwent WPI
pharmaceutical patents plus some journal extractions
• ~ 5.1 million (allpat above) is the union of Thomson/Derwent plus IBM
• ~ 3.5 million of these are Lipinski-ROF compliant
• ~ 40% journal-extracted structures in ChEMBL have a match in the 5.1 million
• ~ 70% of these are Lipinski-ROF compliant
• ~ 90% of these have assay data
• ~ 60% of the IBM structures (1.5 million) are novel as defined by unique CIDs
• ~ 2.3 million SureChem pre-2007 structures also in there (but not selectable)
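
The Lipinski-ROF filter used in the counts above can be sketched as a simple descriptor check. This assumes the descriptors are already available (for example from a database's computed properties) and uses the usual rule-of-five cut-offs; the slides do not say how many violations were tolerated, so `max_violations` is an assumption, and the class and function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    molecular_weight: float  # Da
    logp: float              # calculated octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    checks = [
        d.molecular_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def is_rof_compliant(d: Descriptors, max_violations: int = 0) -> bool:
    """Strict by default; pass max_violations=1 for the looser convention."""
    return rule_of_five_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small = Descriptors(350.4, 2.1, 2, 5)
large = Descriptors(650.0, 6.2, 1, 12)
print(is_rof_compliant(small), rule_of_five_violations(large))
```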
[9]

Chemistry > Patents in PubChem

[10]

You found a CID, what are the Patent and
Journal links?
PubChem > BioAssay/ChEMBL > CiteExplore/PubMed > PDB

[11]

Patent Links from SLING and IBM

[12]

PubChem > SureChem > Patent > Structure >
Data > Target

[13]

Target-Centric Patent Searching

[14]

Synonym Recall

• Title only BACE1 = 8
• Title + abstract BACE1 = 97
• Title + abstract BACE2 = 29
• Title + abstract BACE = 392
• Title + abstract ”Beta secretase” = 1056
• Title + abstract memapsin = 87
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin = 1383
• Title + abstract BACE1 OR "Beta secretase" OR BACE OR BACE2 OR
Memapsin AND inhibitors = 841
• Same query to PubMed (this interface) = 1031
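
The synonym queries above can be assembled programmatically so the same term list is reused across portals. A minimal sketch with hypothetical function names; the explicit brackets around the OR block are an assumption, since the slide does not state how the interface resolves OR/AND precedence or whether it accepts brackets:

```python
def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(synonyms, required=None):
    """OR the synonyms together, then optionally AND one required term."""
    or_block = " OR ".join(quote_term(s) for s in synonyms)
    if required is None:
        return or_block
    # Brackets make the intended precedence explicit.
    return f"({or_block}) AND {quote_term(required)}"


bace_synonyms = ["BACE1", "Beta secretase", "BACE", "BACE2", "Memapsin"]
print(build_query(bace_synonyms))
print(build_query(bace_synonyms, required="inhibitors"))
```

Hit counts will differ between interfaces, so any comparison of recall should use the identical query string.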
[15]

Target Query > Patent Retrieval from Espacenet

[16]

Linking Examples to Data in the Patent

[17]

Extracting Chemical Structures

[18]

IUPAC-to-structure: OPSIN

• Installable application
• Also chemical dictionary conversions

Result: the Example 31 structure is a 24 nM BACE1 inhibitor
[19]

Image-to-structure: OSRA

• Patchy results but fixable by editing and similarity iteration in PubChem
• Also an installable application
• Useful to cross-check between images and IUPACs

[20]

Follow-up Searching

[21]

Structure Search in PubChem
SMILES (via OPSIN, OSRA, SureChem, Chemicalize or sketcher)

Often see stereo differences from the Derwent entry in PubChem
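
The slides use the PubChem web interface; the same SMILES searches can also be scripted. A sketch that only builds request URLs for PubChem's PUG REST service (using that service here is an assumption, not something shown in the slides), with hypothetical function names. SMILES containing characters such as "/" or "#" may need to be sent by POST or as a URL argument instead:

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def identity_search_url(smiles: str) -> str:
    """URL that asks for the CID(s) matching a SMILES string."""
    encoded = quote(smiles, safe="")
    return f"{PUG_REST}/compound/smiles/{encoded}/cids/TXT"


def similarity_search_url(smiles: str, threshold: int = 95) -> str:
    """URL for a 2D similarity search at a Tanimoto threshold (percent)."""
    encoded = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/fastsimilarity_2d/smiles/"
        f"{encoded}/cids/TXT?Threshold={threshold}"
    )


# Ethanol as a stand-in; paste the SMILES from OPSIN, OSRA or a sketcher.
print(identity_search_url("CCO"))
print(similarity_search_url("CCO", threshold=95))
```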
[22]

PubChem Similarity ”Walking”

• 2D and 3D different results
• Can do multiple steps
• Can ”read” CID history
• Possible to ”walk” between patents
• Look for links to ChEMBL, BioAssay, PubMed, chemical suppliers etc.
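
Multi-step walking can be pictured as a breadth-first expansion from a starting CID. A minimal sketch, assuming a caller-supplied `neighbours` function that returns the similar CIDs for one CID (hypothetical; in practice it would wrap a similarity search), demonstrated on a toy graph with made-up identifiers:

```python
from collections import deque


def similarity_walk(start_cid, neighbours, max_steps=2):
    """Breadth-first walk; returns {cid: number of steps from the start}."""
    seen = {start_cid: 0}
    queue = deque([start_cid])
    while queue:
        cid = queue.popleft()
        if seen[cid] == max_steps:
            continue  # do not expand beyond the step limit
        for other in neighbours(cid):
            if other not in seen:
                seen[other] = seen[cid] + 1
                queue.append(other)
    return seen


# Toy neighbour lists standing in for similarity search results.
toy = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(similarity_walk(1, lambda cid: toy.get(cid, []), max_steps=2))
```

Recording the step count for each CID keeps the history readable, so a compound can be traced back to the query it was reached from.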
[23]

Direct Patent <> Chemistry

[24]

SureChemOpen: Patent Retrieval

• Patent searching, chemistry-to-patent and patent-to-chemistry in one portal
• Higher rate of name-to-structure conversion than Chemicalize or OPSIN (but not
bulk export)
[25]

SureChemOpen, WIPO, OPSIN and PubChem

Result: 1 nM (?) BACE2 inhibitor
with assay and synthesis details.
[26]

SureChemOpen: Structure > Patent

Direct answers to: ”which patents contain compounds similar to my query”
and ”show me all the compounds in these patents”
[27]

Non-target Activity Data and Bulk
Chemistry Extraction

[28]

Malaria Query: CiteExplore > WIPO

Example 60, sub-200 nM potency,
with solubility and clearance data
[29]

Espacenet EP2391601 > ChemAxon Chemicalize.org
• Description URL from Espacenet pasted into Chemicalize.org
• Most of 74 examples converted
• Example 60 had 4 analogues in PubChem at 95% Tanimoto (e.g. CID 46852300)
but no exact match
• Claims section was Markush description so no relevant structures converted
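
The Tanimoto figure quoted above is the ratio of shared to total fingerprint bits. A minimal sketch on toy bit sets; PubChem computes it on its own substructure fingerprints, which are not reproduced here:

```python
def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient: shared set bits divided by all set bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / len(union)


# Toy fingerprints as sets of "on" bit positions (made up for illustration).
query = set(range(20))       # 20 bits set
analogue = set(range(1, 20)) # 19 of the same bits set
print(tanimoto(query, analogue))  # 19 shared / 20 total = 0.95
```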

[30]

EP2391601 > Chemicalize > PubChem

Chemicalize Similarity listing PubChem Tanimoto sub-cluster

• EP2391601 description text > Chemicalize SDF download > PubChem
Structure Search upload = 311 structures
• Of these 206 have PubChem exact matches
• Of these 176 have Thomson Pharma matches
• The example cluster (Thomson/Derwent extraction) is ~15
• The example cluster from Chemicalize is ~ 90
• Ipso facto Chemicalize extracted at least 70 novel structures
• But only 10 examples were in the highest-potency bin
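
The counts above were read from PubChem search results and cluster sizes. Where both extractions are available as identifier lists, the same comparison can be done with set operations. A sketch with a hypothetical function name and placeholder identifiers (in practice something like InChIKeys or canonical SMILES from one toolkit):

```python
def compare_extractions(own, reference):
    """Split two structure-identifier collections into shared and unique parts."""
    own, reference = set(own), set(reference)
    return {
        "shared": own & reference,
        "only_own": own - reference,
        "only_reference": reference - own,
    }


# Placeholder identifiers, not real structures.
chemicalize_set = ["S1", "S2", "S3", "S4", "S5"]
derwent_set = ["S1", "S2", "S9"]

result = compare_extractions(chemicalize_set, derwent_set)
for label, members in result.items():
    print(label, len(members), sorted(members))
```

The comparison is only as good as the identifiers: stereo or tautomer differences between sources will show up as false "unique" structures.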
[31]

Tips and Tricks

[32]

Tables and Recalcitrant IUPACs
PDF > Find tables > Snip image > Online OCR > WordPad > Chemicalize / OPSIN / OSRA

• Iterative fixing of OCR errors (e.g. 1 vs l)
• Cross-check Mw in the document
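
The iterative OCR fixing can be partly automated by generating candidate spellings for characters that OCR often confuses, then trying each candidate in a name-to-structure converter. A minimal sketch; the confusion table and function name are assumptions, and the example name is made up:

```python
from itertools import product

# Assumed confusion pairs; extend for the OCR engine in use.
CONFUSABLE = {"1": "1l", "l": "l1", "0": "0O", "O": "O0"}


def ocr_variants(text: str, limit: int = 256):
    """List spellings of text with confusable characters swapped.

    The first variant is always the text as read. The number of variants
    doubles with every confusable character, hence the limit.
    """
    choices = [CONFUSABLE.get(ch, ch) for ch in text]
    variants = []
    for combo in product(*choices):
        variants.append("".join(combo))
        if len(variants) >= limit:
            break
    return variants


# A name where OCR read "l" as "1".
for candidate in ocr_variants("2-ch1oropyridine"):
    print(candidate)
```

Candidates that convert to a structure can then be checked against the Mw given in the document.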

[33]

Utopia Mark-up of Patent Introduction

Bioentity mark-up (green) via EMBL Reflect with rich call-out options
[34]

Tips for Joining Everything up
• SureChemOpen is continuing to back-fill and add features.
• Check the Chemicalize archive (~ 0.5 million) for unique content.
• Between Chemicalize, OSRA, OPSIN and sketching you can extract most things
(e.g. journal or meeting abstracts, PubMed Central full-text, catalogues, wiki
pages, blog posts and MeSH IUPACs).
• Check PubChem ”same connectivity” for tautomer forms in different CIDs.
• Check PubChem ”similar” compounds for analogues even if you cannot track
back to a patent number.
• Most PDB ligands published by companies have a patent analogue series.
• Espacenet text chemicalizes well but FreePatentsOnline can be better.
• Google Scholar tracks patent citations.
• Full-text is good but don’t forget to eyeball the original PDF
• You can ”walk” between patents by 2D/3D clusters, inventors or citations.
• Less-common author/inventor names may track a journal paper back to a patent.
• CiteExplore includes selectable ChEMBL structure links.
• Check ChEMBL structures for SureChem links via ChemSpider.
• On a good day you can paste OCR table data into Excel.
• You can set SciBitely patent keyword alerts and see posts on Twitter.
[35]

Conclusions

• Roll-your-own patent mining can take you a long way.
• Complementary to commercial databases.
• Target-centric recall and specificity are reasonable.
• Published patents are indexed and open text-extracted within weeks.
• You need perspicacity to dig out SAR details.
• Can cherry pick examples by potency or collate whole series
• Establishing intersects between journal articles and patents is valuable.
• Exemplified structures typically cover a broader range of analogue space
and SAR data than papers.
• You can ”walk” between patents via citation and chemistry clustering.
• PubChem already contains over 6 million patent-derived structures with
more depositions and links expected.
• The increased public surfacing of chemical structures and bioactivity data
from patents will expedite medicinal chemistry, tropical disease research
and chemical biology.

[36]

Questions Welcome

ChrisDS Consulting: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan – at - hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/ (includes postings on patent themes)
LinkedIn: http://www.linkedin.com/in/cdsouthan
Website: http://www.cdsouthan.info/CDS_prof.htm
Publications: http://www.citeulike.org/user/cdsouthan/publications/order/year
Citations: http://scholar.google.com/citations?user=y1DsHJ8AAAAJ&hl=en
Presentations: http://www.slideshare.net/cdsouthan

[37]
