Searching for patent information in PubChem

Sunghwan Kim (sunghwan.kim@nih.gov),
Paul Thiessen, Asta Gindulyte, Evan Bolton
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
ACS Fall 2018 National Meeting in Boston, MA
Sunday, August 19, 2018

2
What is PubChem?
▪ NIH’s chemical information resource.
▪ Collects public-domain chemical data from >620 data sources.
▪ Disseminates it back to the public free of charge.
[Diagram: data collection from government agencies, university labs, publishers, pharma companies, chemical vendors, and public databases; data dissemination (free of charge) to the public.]

3
▪ Data organization in PubChem
[Diagram: data contributors make two kinds of deposition. Substance deposition supplies depositor-provided substance descriptions, from which unique chemical structures are extracted through standardization. Assay deposition supplies depositor-provided bioactivity test results, which give the activity of tested “substances”; the activity of “compounds” is derived from the associated “substances”.]

4
▪ Data organization in PubChem
[Diagram: the substance-deposition and assay-deposition flow from data contributors, with unique chemical structure extraction through standardization and the activity of “compounds” derived from associated “substances”, labeled with the record identifiers Substance ID (SID), Compound ID (CID), and Assay ID (AID).]

5
▪ PubChem (https://pubchem.ncbi.nlm.nih.gov)
▪ PubChem contains:
• >247.2 million substance descriptions,
• >96.4 million unique chemical structures,
• >236.7 million biological activity test results,
• >1.25 million biological assays, covering >10,000 unique protein sequence targets.
The largest collection of publicly available chemical information, from >620 data sources.
(as of August 15, 2018)

6
Patent Information in PubChem

7
▪ Patent Information Sources
▪ SureChEMBL (formerly SureChem) (https://www.surechembl.org/)
▪ IBM Almaden Research Center (https://www.research.ibm.com/labs/almaden/)
▪ SCRIPDB (http://dcv.uhnres.utoronto.ca/SCRIPDB/search/)
▪ NextMove Software (https://www.nextmovesoftware.com/)
▪ BindingDB (https://www.bindingdb.org/)

8
[Bar chart, logarithmic axis from 1 to 100,000,000: # SIDs with patents, # CIDs with patents, and # patent IDs for each data source (SureChEMBL, IBM, SCRIPDB, NextMove, BindingDB).]

9
# Patent IDs                   6,858,886
# SIDs with patent links      40,149,647
# CIDs with patent links      21,211,221
# SID-patent pairs           405,234,094
# CID-patent pairs           350,995,421
21 million compounds associated with 6.9 million patent documents.

10
How to Access PubChem Patent Information

11
How to access PubChem patent information
1. How to find patent information for a given chemical.
2. How to find chemicals mentioned in a given patent document.
3. How to retrieve all chemicals with patent information.
4. How to search for chemicals with patent information through:
• Identity/similarity search
• Substructure/superstructure search
5. How to retrieve chemicals associated with a patent classification.
6. How to access patent information programmatically.


13
▪ Compound Summary page
▪ Provides an aggregated view of all information available in PubChem for a given chemical.
▪ Can be accessed:
• from various search/analysis tools
• via a simple URL ending with the CID or common chemical name
Example: aspirin (CID 2244)
https://pubchem.ncbi.nlm.nih.gov/compound/2244
https://pubchem.ncbi.nlm.nih.gov/compound/aspirin
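
As a minimal sketch, the URL pattern above can be built in Python. The function name `compound_summary_url` is hypothetical, and percent-encoding the identifier is an assumption made here so that names containing spaces or slashes stay in a single path segment.

```python
from urllib.parse import quote

# Base address taken from the two example URLs on the slide.
BASE = "https://pubchem.ncbi.nlm.nih.gov/compound/"

def compound_summary_url(cid_or_name):
    """Return the Compound Summary URL for a CID (int) or a chemical name (str).

    Hypothetical helper; percent-encoding is an assumption, not part of the slide.
    """
    return BASE + quote(str(cid_or_name), safe="")

print(compound_summary_url(2244))
print(compound_summary_url("aspirin"))
```

Both calls reproduce the two example addresses shown on the slide.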

14
▪ Compound Summary page
▪ Includes patent information on a given chemical.
• Drug patents from FDA Orange Book and DrugBank
• Depositor-provided patents that mention the chemical
• WIPO International Patent Classification
• Related records with patent information

18
Link to the “Patent View” page (to be discussed later)

19
Link to the USPTO page
Link to the “FDA Orange Book” page

20
WIPO International Patent Classification (IPC)

25
▪ Patent View
▪ PubChem generates the Patent View page for a patent document available in PubChem.
▪ The Patent View provides:
• Patent title and abstract
• Inventor and applicant
• Application and publication dates
• List of chemicals mentioned
• Patent classification information based on the WIPO International Patent Classification (IPC).

26
▪ Patent View
▪ Accessible via a simple web address ending with the patent number.
Example: the Patent View page for EP0521471:
https://pubchem.ncbi.nlm.nih.gov/patent/EP0521471
▪ It can also be accessed through several PubChem tools and services, such as:
• Compound Summary
• PubChem Search
• Classification Browser

27
Go to “PubChem Search” for structure search!

37
Type “has_patent”[filter]
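
The same filter can in principle be sent programmatically. The sketch below only builds a query URL; it assumes that the filter is meant for the PubChem Compound Entrez database (`pccompound`) and that the NCBI E-utilities ESearch endpoint accepts it in the same form as the search box. The helper name `has_patent_search_url` and its parameters are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the filter from the slide is evaluated against the PubChem
# Compound Entrez database ("pccompound") through NCBI E-utilities ESearch.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def has_patent_search_url(extra_term="", retmax=20):
    """Build an ESearch URL for compounds that pass the has_patent filter.

    extra_term is an optional additional query (hypothetical parameter)
    that is combined with the filter using AND.
    """
    term = '"has_patent"[filter]'
    if extra_term:
        term = f"({extra_term}) AND {term}"
    params = {"db": "pccompound", "term": term,
              "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

print(has_patent_search_url())
```

Opening the printed URL, or fetching it with any HTTP client, would return the hit count and the first `retmax` CIDs if the assumptions above hold.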


58
▪ Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification)
▪ Browse PubChem data using a classification of interest.
▪ Search for records annotated with the desired classification/term.
▪ Available ontologies/classifications:
• MeSH
• ChEBI
• FDA Pharm Classes
• KEGG
• LIPID MAPS classification system for lipids
• PubChem Compound Table of Contents
• PubChem BioAssay Classification
• WHO ATC Code (Anatomical Therapeutic Chemical Classification System)
• WIPO International Patent Classification
• ...

61
Select “WIPO: International Patent Classification”

63
Click to retrieve the 7,597 chemicals

65
▪ Classification Browser (https://pubchem.ncbi.nlm.nih.gov/classification)
▪ Useful for retrieval of compounds from a small node (~10^3 compounds)
▪ Not good for retrieving compounds from a very large node (~10^6 compounds)
▪ This issue will be addressed in the future.

67
PUG-REST
▪ Representational State Transfer (REST)-style interface.
▪ Simplified access route, without the overhead of XML or SOAP envelopes.
▪ Access to data that are not accessible through other PUG services.
▪ Intended to handle short, synchronous requests (<30 seconds).

68
▪ URL construction for a PUG-REST request
https://pubchem.ncbi.nlm.nih.gov/rest/pug/<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]
Prolog (common to all PUG-REST requests): https://pubchem.ncbi.nlm.nih.gov/rest/pug
<INPUT>: specifies identifiers of interest
• by identifiers
• by chemical name
• by chemical structure search
• by cross reference
• by listkey, ...
<OPERATION>: specifies what to do with the input
• get full records
• get molecular properties
• get synonyms or images
• get cross references
• many other operations
<OUTPUT>: specifies the desired output format
• XML, JSON, JSONP, ASNB, ASNT, PNG, SDF, CSV, TXT
[?OPTIONS]: options specific to some operations
▪ The three parts are (mostly) independent of each other.
▪ Many different requests can be built by combining them.
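
The construction rule above can be sketched as a small Python helper. The function name `pug_rest_url` and its argument layout are hypothetical; only the prolog and the `<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]` order come from the slide. The two sample requests (synonyms by name, properties by CID) are illustrative choices that match the "get synonyms" and "get molecular properties" operations listed above.

```python
from urllib.parse import quote, urlencode

# Prolog common to all PUG-REST requests (from the slide).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_rest_url(input_parts, operation_parts, output, options=None):
    """Join <INPUT>, <OPERATION>, and <OUTPUT> onto the prolog.

    Hypothetical helper: each part is a list of path segments, and
    options is an optional dict appended as a query string.
    """
    # Keep commas unescaped so comma-separated lists stay readable.
    segments = [quote(str(p), safe=",")
                for p in (*input_parts, *operation_parts, output)]
    url = PROLOG + "/" + "/".join(segments)
    if options:
        url += "?" + urlencode(options)
    return url

# Synonyms of aspirin, looked up by chemical name, as JSON.
print(pug_rest_url(["compound", "name", "aspirin"], ["synonyms"], "JSON"))
# Two molecular properties of CID 2244, as CSV.
print(pug_rest_url(["compound", "cid", 2244],
                   ["property", "MolecularFormula,MolecularWeight"], "CSV"))
```

Keeping the three parts as separate arguments mirrors the point that they are (mostly) independent: the same input can be paired with a different operation or output format without touching the rest of the call.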

69
▪ Retrieve all Patent IDs associated with CID 2244.
https://......../rest/pug/compound/cid/2244/xrefs/PatentID/TXT
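
A minimal Python sketch of this request follows. It assumes that the elided host is the prolog shown on the URL-construction slide and that the TXT output lists one patent ID per line. The helper names are hypothetical, and the 30-second timeout simply mirrors the stated limit for PUG-REST requests.

```python
from urllib.request import urlopen

# Assumption: the elided part of the URL is the common PUG-REST prolog.
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def patent_ids_url(cid):
    """URL that lists the patent IDs cross-referenced to one CID."""
    return f"{PROLOG}/compound/cid/{int(cid)}/xrefs/PatentID/TXT"

def parse_txt(body):
    """Split TXT output into values, assuming one value per line."""
    return [line.strip() for line in body.splitlines() if line.strip()]

def fetch_patent_ids(cid, timeout=30):
    """Send the request and return the patent IDs as a list of strings."""
    with urlopen(patent_ids_url(cid), timeout=timeout) as resp:
        return parse_txt(resp.read().decode("utf-8"))

# Only the URL is printed here; no network request is made at import time.
print(patent_ids_url(2244))
```

Calling `fetch_patent_ids(2244)` would send the request for aspirin; the number of IDs returned depends on the current state of the database.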

70
▪ Retrieve all compounds associated with Patent US20050159403A1.
https://....../rest/pug/compound/xref/PatentID/US20050159403A1/cids/TXT
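
The reverse lookup can be sketched the same way. The code assumes that the elided host is the common PUG-REST prolog, that the TXT output is a whitespace-separated list of CIDs, and that an HTTP 404 status means no matching record was found. The helper names are hypothetical.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the elided part of the URL is the common PUG-REST prolog.
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cids_for_patent_url(patent_id):
    """URL that lists the CIDs cross-referenced to one patent ID."""
    return f"{PROLOG}/compound/xref/PatentID/{quote(patent_id, safe='')}/cids/TXT"

def parse_cids(body):
    """Convert TXT output into integers, assuming whitespace-separated CIDs."""
    return [int(token) for token in body.split()]

def fetch_cids_for_patent(patent_id, timeout=30):
    """Send the request; return [] when the service answers 404."""
    try:
        with urlopen(cids_for_patent_url(patent_id), timeout=timeout) as resp:
            return parse_cids(resp.read().decode("utf-8"))
    except HTTPError as err:
        if err.code == 404:  # assumption: 404 means "no record found"
            return []
        raise

# Only the URL is printed here; no network request is made at import time.
print(cids_for_patent_url("US20050159403A1"))
```

Calling `fetch_cids_for_patent("US20050159403A1")` would send the request shown on the slide.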

71
Limitations of PubChem Patent Information

72
▪ Limitations
1. PubChem does not directly extract information from patents. Instead, it relies on voluntary contributions from data sources.
• Lag time between PubChem and original data sources.
• If the data sources are wrong, so is PubChem.
• PubChem does not cover all patent documents.
2. Not all patents worldwide are considered.
• Primary focus on USPTO
• EPO, WIPO, JPO

73
▪ Limitations
3. Multiple patent documents about a single invention (e.g., with different kind codes) are aggregated into a single Patent View page.
• It is not possible to tell which of these documents mentions a given chemical.
4. Only WIPO IPC is available.
• Cooperative Patent Classification (CPC) information is not available at this time.

75
▪ 21 M unique compounds associated with 6.9 M patents from five data sources:
• SureChEMBL
• IBM
• SCRIPDB
• NextMove
• BindingDB
▪ On the Summary page for each compound:
• Patent IDs
• Patent Classifications
• FDA Orange Book patents
• Structurally similar compounds with patent information

76
▪ Various search types for chemicals with patent information are supported.
• Text (chemical name) search
▪ Classification Browser to retrieve compounds with a given patent classification
▪ Programmatic access to patent information through PUG-REST

77
Acknowledgements
▪ The PubChem Team
Evan Bolton
Jie Chen
Tiejun Cheng
Asta Gindulyte
Jia He
Siqian He
Qingliang Li
Benjamin Shoemaker
Paul Thiessen
Bo Yu
Leonid Zaslavsky
Jian Zhang
▪ PubChem depositors, users, and collaborators
▪ Funded by the National Library of Medicine
