EUGM15 - Matthias Negri, Árpád Figyelmesi (Boehringer Ingelheim, ChemAxon): Chemistry-enriched patent curation - automatized chemical and semantic analysis and elaboration of large patent sets

Matthias Negri, PhD
Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG
Chemistry-Enriched Patent Curation
semi-automatic analysis and elaboration of patents
ChemAxon UGM 2015, Budapest, 20 May 2015
Árpád Figyelmesi
ChemAxon

Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow – Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook
Negri Matthias, ChemAxon UGM 2015

Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
– as ONE image file (tables with chemistry and bioactivity data)
– as chemistry-only image files embedded within table tags
5. Markush structures/formulas with R-groups
---------------------------------------------------------------------------------------
→ Currently NO commercial solution covers all of these cases
→ Most of the cases are handled in the patent curation workflow
(Markush/R-group formulas are recognized and stored separately)

Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files as NEW source for text-mining
3. Extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/substructure frequency in compound sets of patents
5. …

Semi-automatic Patent Curation Workflow
Overview – current state
2 parallel branches starting from the same INPUT:
- KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML
- ChemCC branch: FASTER, BUT less informative/flexible - ChemCC as the (near) future perspective
I2E API in KNIME – batch indexing, text-mining and (relational) data retrieval

Linked tools/technologies
1. KNIME/XPATH
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s,
Naming, Molconverter, Structure checker, Standardizer, …)
4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)
5. Optical Structure Recognition – Keymodule CLiDE Batch

ChemCurator (ChemCC)
Computer-aided chemical data extraction
- English, Chinese and Japanese N2S
- Markush Editor
- Structure Checker
- Hit visualization
- Third party OSR technologies
Árpád Figyelmesi, ChemAxon UGM 2015

Name to Structure
- Support for many nomenclatures (common, drug names, …)
– IUPAC names
– Custom dictionaries
– English (2008)
– Chinese (2013)
– Japanese (2014)

Compound Extraction View
Screenshot panels: Compound list, Project explorer, Annotated document, Selected structures

Markush Extraction View
Screenshot panels: Markush editor, Example structures, Annotated document, Project explorer, Selected structures, Structure checker

General Document Curation
Extract Markush Structures from patents
Extract specific structures
- Journal articles
- Company reports
- Patent examples
Structure extraction wizards
- Exclude fragments, chemical elements, etc.

Integration & Information Sharing
Other ChemAxon products:
- Direct IJC schema connection
- Project sharing function
- Accessible from Plexus, IJC, etc.
Third party tools:
- Standard file formats
- Export functions
- Easily processable projects

a) Input sources and b) bibliographic data
a) Input sources
- Files with patent-ID lists
- XML collection
- …
b) Retrieval of bibliographic information and attachment data
- Family ID, patent references, expiration date, etc.
- Attachment files MOL/CDX (US patents only), TIF files
- …

c) Chemistry retrieval/extraction/filtering
1. ChemCurator branch
- Data retrieval (XML, attachments) from IFI Claims Direct BI-server
- ChemCurator project creation/sharing/annotation → HTML output
- Chemistry extraction via name2structure/document2structure → SDF output
- Generation of a pre-annotated patent set stored as ChemCC projects
→ Faster, but lower quality in the chemistry extraction process

2. KNIME branch
- OCR-error CLEAN-UP in KNIME → improved chemistry recognition
- MOL/CDX/TIF - Standardizer, Structure Checker → filter out formulas, solvents, R-groups
→ Higher quality and more control in the chemistry extraction process
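The OCR clean-up idea can be illustrated with a small dictionary-based correction, written here in plain Python rather than as KNIME nodes. The correction table `OCR_FIXES` and the function `clean_ocr` are hypothetical; the actual dictionary used in the workflow is not given in the slides.

```python
import re

# Hypothetical correction dictionary: OCR misreading -> intended name fragment.
OCR_FIXES = {
    "rnethyl": "methyl",
    "arnino": "amino",
    "pheny1": "phenyl",
    "ch1oro": "chloro",
}

# Longest keys first, so longer misreadings win over shorter overlapping ones.
_FIX_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(OCR_FIXES, key=len, reverse=True)),
    re.IGNORECASE,
)


def clean_ocr(name):
    """Apply dictionary fixes, then remove stray spaces around locants."""
    fixed = _FIX_PATTERN.sub(lambda m: OCR_FIXES[m.group(0).lower()], name)
    fixed = re.sub(r"\s*-\s*", "-", fixed)             # "chloro- N" -> "chloro-N"
    fixed = re.sub(r"(?<=\d),\s+(?=\d)", ",", fixed)   # "2, 3" -> "2,3"
    return fixed.strip()
```

A cleaned name can then be passed to a name-to-structure converter. Names that still fail to convert are candidates for new dictionary entries.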

2. KNIME branch
- MOL → IUPAC
- CDX → IUPAC
- TIFF (via CLiDE) → IUPAC

2. KNIME - Chemistry “Normalization”
Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks
- IUPAC name
- String length (the output order of chemicals can differ in multi-molecule image/multiMOL files)
- OCR correction (“dictionary”-based)
(Within KNIME) set up a relation between each TIFF/attachment file and
1. (one or more) IUPAC name(s)
2. a position/section in the text/document
Merge IUPAC, Clean-Up IUPAC
If NO IUPAC → IMG-name is set
“Normalize” IUPAC names
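The two checks on the merged MOL/CDX/TIF output (name identity and string length) and the image-name fallback could be sketched as follows. The function names, the returned fields and the source priority order are assumptions for illustration, not the actual KNIME implementation.

```python
# Assumed order of trust between sources; not stated in the slides.
SOURCE_PRIORITY = ("MOL", "CDX", "TIF")


def normalize_name(name):
    """Lower-case and drop whitespace so trivial differences do not count."""
    return "".join(name.lower().split())


def compare_sources(names_by_source, image_name):
    """Compare names for one attachment, e.g. {"MOL": ..., "CDX": ..., "TIF": ...}."""
    present = {src: name for src, name in names_by_source.items() if name}
    if not present:
        # No IUPAC name from any source: fall back to the image name.
        return {"name": image_name, "same_name": False, "same_length": False}
    normalized = [normalize_name(name) for name in present.values()]
    same_name = len(set(normalized)) == 1
    # Equal length with different text hints at the same components
    # being listed in a different order (multi-molecule files).
    same_length = len({len(name) for name in normalized}) == 1
    chosen = next(
        (present[src] for src in SOURCE_PRIORITY if src in present),
        next(iter(present.values())),
    )
    return {"name": chosen, "same_name": same_name, "same_length": same_length}
```

Records where `same_name` is false but `same_length` is true are worth a closer look, since they may differ only in component order. With a single available source both flags are trivially true.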

d) TIF/attachment replacement with IUPAC names
Chemistry present as text is recognized and extracted either via
- Text-mining (I2E Chemistry – d2s works in the background) or
- KNIME/ChemCC using annotate/molconvert
Replacement:
<chemistry> vs IUPAC
IUPAC-enriched XML
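A minimal sketch of the replacement step using Python's standard XML library. The element and attribute names (`chemistry`, `img`, `file`) are assumptions about the patent XML layout, and `enrich_with_names` is a hypothetical helper. Where no IUPAC name is known, the image file name is kept as a placeholder.

```python
import xml.etree.ElementTree as ET


def enrich_with_names(xml_text, names_by_file):
    """Replace the content of each <chemistry> element with an IUPAC name."""
    root = ET.fromstring(xml_text)
    for chem in list(root.iter("chemistry")):
        img = chem.find(".//img")
        file_name = img.get("file") if img is not None else None
        tail = chem.tail                 # text after the element must survive
        attributes = dict(chem.attrib)
        chem.clear()                     # drops children, text, tail, attributes
        chem.attrib.update(attributes)
        chem.tail = tail
        # Fall back to the image file name when no IUPAC name is available.
        chem.text = names_by_file.get(file_name) or file_name or ""
    return ET.tostring(root, encoding="unicode")
```

Keeping the `<chemistry>` element and replacing only its content preserves the position of each structure in the running text, which matters for later text-mining.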

d) TIF/attachment replacement with IUPAC names – OCR errors in chemical names
Figure labels: TIF, CDX, MOL; this text chunk is replaced by the IUPAC name

e) Bioactivity/tabular data extraction with KNIME/XPATH
XPATH/XML parsing and extraction of:
- Tables
- Rows - XML tags & strings
- Entries - XML tags & strings
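The table/row/entry extraction can be sketched outside KNIME with Python's standard XML library, which supports a limited XPath subset. The tag names `table`, `row` and `entry` are assumed here (they follow the CALS-style table model common in patent XML); the actual XPath expressions of the workflow are not shown in the slides.

```python
import xml.etree.ElementTree as ET


def extract_tables(xml_text):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("table"):
        rows = []
        for row in table.iter("row"):
            cells = []
            for entry in row.findall("entry"):
                # Join nested text (e.g. <b>, <sub>) and collapse whitespace.
                cells.append(" ".join("".join(entry.itertext()).split()))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Each extracted row keeps its cell order, so a column such as the example number can later be used to link bioactivity values to chemistry.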

f) Text-/data-mining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E API/text-mining
- Indexing
- Pre-defined queries
- Results retrieval
- Saved as SDF files (KNIME)
(Chemistry-related) information retrieved by text-mining
- Example Nr.
- Bioactivity data from tables
- Claims, regions where chemistry appears in patents
- Genes, diseases
f) Bioactivity data using I2E multi-queries – 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. – IUPAC
2. Example Nr. – Bioactivity data
Figure labels: Table, Image; for comparison, chemistry in PDF; result columns: IUPAC, Bioactivity, Example Nr.
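The two query steps share the example number as key, so their results can be joined on it. A minimal sketch, assuming each query returns (example label, value) pairs; the functions `example_key` and `link_by_example` and the label formats are hypothetical.

```python
import re


def example_key(label):
    """Reduce labels such as "Example 12" or "Ex. 12a" to "12" / "12a"."""
    match = re.search(r"\d+[a-z]?", label.lower())
    return match.group(0) if match else None


def link_by_example(name_hits, activity_hits):
    """Join (label, IUPAC) hits with (label, bioactivity) hits on example number."""
    names = {}
    for label, iupac in name_hits:
        key = example_key(label)
        if key is not None:
            names.setdefault(key, iupac)  # keep the first name per example
    linked = []
    for label, value in activity_hits:
        key = example_key(label)
        if key in names:
            linked.append({"example": key, "iupac": names[key], "bioactivity": value})
    return linked
```

Bioactivity hits without a matching example number are dropped here; in practice they may be worth logging for manual review.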

g) Visualize data-/text-mining results in ChemCC
- SDF file imported into ChemCC project + automatic mapping to existing chemistry

Lessons learned, weak-points, limitations
1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch
- Chemistry check/normalization – 3 input sources → improved quality
- Improved chemistry recall - ALL images (incl. tables and drawings)
- More filtering options in the KNIME workflow than in ChemCurator alone
- IUPAC-enriched XML as new source for I2E
Advantages of ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)
- Faster
- Image processing using CLiDE is already integrated with naming

Lessons learned, weak-points, limitations
2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)
- Missing attachment files
- No tables present in XML
- Error rate in chemistry recognition (OPSIN vs n2s/d2s)
- …
→ NEEDS: different workflows/branches, patent-file clean-up (OCR)
3. Time- and computational-resource-consuming process

Outlook
1. KNIME Workflow
- Add new data fields to chemicals: BI-internal codes, genes, targets, etc.
- Use ChemCC HTML output as source for text-mining
- Ontology mapping
- Expand the workflow to other sources (internal PDFs, literature full-text)
- Use KNIME to connect to BI-internal workflows, DBs, etc.
→ Chemistry-linked information in a patent DB → improved (semantic) search

Outlook
2. ChemCurator
- Improved n2s
- New command-line functions
- Complex-phrase requests from IFI server
- Improved SDF import
- Preprocessing wizards

Thank you!