SlideShare a Scribd company logo
Daniel Lowe and Roger Sayle, NextMove Software Ltd, Cambridge, UK
daniel@nextmovesoftware.com
NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com
Introduction
Conclusions
Example application: Database of chemicals from Korean patents
Chinese, Japanese and Korean (CJK) patents account for over half of all national
patent filings and hence are of increasing importance to patent informatics. In
the domain of chemistry the specific chemicals mentioned in a patent are
often crucial for finding relevant patents and prior art searching.
The most interesting chemicals in a patent are typically the novel ones, which
are described using systematic chemical nomenclature. Fortunately while CJK
chemical names appear quite different to European chemical names, the
underlying nomenclature is essentially the same. Hence we have developed
tools to translate CJK chemical names to English such that existing chemical
entity recognition and conversion tools may be used.
† Lowe, D. M.; Sayle, R. A. LeadMine: A Grammar and Dictionary Driven Approach to
Entity Recognition. J. Cheminformatics 2015, 7, S5.
We have developed state of the art chemical name translation for Chinese,
Japanese and Korean chemical names. As an example, we have shown that
over a million chemical compounds can be extracted from Korean patents and
that many of these are novel and/or published earlier than their USPTO
counterparts.
Chemistry Enabling Chinese, Japanese
and Korean Patents
63 thousand Korean patent applications (as PDF) were analysed (spanning
1990 to March 2015). After translation, LeadMine† was used to extract
1,740,040 distinct compounds with a mean heavy atom count of 27.1. By
contrast without translation only 39,824 distinct compounds could be
extracted!
Asian chemical nomenclature
Features
• Simplified and traditional Chinese characters are accepted interchangeably
where their meaning is equivalent
• Erroneous spaces and common OCR errors handled in Chinese documents
• KIPO (Korean Intellectual Property Office) patents as PDFs can be
converted to XML with patent sections annotated
• The translation tool is available as a Java library or command-line utility
6-aminopyrimidine-2,4,5-triol
Chinese (Hanzi used for each morpheme)
6-氨基嘧啶-2,4,5-三醇
Japanese (Phonetic translation to Katakana)
6-アミノピリミジン-2,4,5-トリオール
Korean (Phonetic translation to Hangul)
6-아미노피리미딘-2,4,5-트리올
ammonia radical pyrimidine three alcohol
amino pyrimidine tri ol
amino pyrimidine tri ol
Challenging cases
Word order (Chinese)
Chinese Literal translation English
间硝基氯化苄 meta-nitrochlorine of benzyl meta-nitrobenzyl chloride
Context sensitivity (Chinese)
Implicit ester word (Japanese/Korean)
Acid character has two interpretations (Chinese/Japanese/Korean)
Japanese Literal translation English
酢酸エチル acetic acid ethyl acetic acid ethyl ester
0
10000
20000
30000
40000
50000
60000
70000
80000
0 10 20 30 40 50 60
Numberofdistinctcompounds
Number of heavy atoms
With Translation
No Translation
Performance on Japanese and Chinese
DataSet Same structure obtained from English
name and translated Asian name*
51,215 Chinese/English catalogue
names (from chemBlink)
87.1%
260,965 Japanese/English catalogue
names (from ChemicalBook)
88.3%
*As the Asian name may be semantically different to the English name (due to
a mistake in the catalogue) this gives a lower bound on actual performance
Compared to an analysis of USPTO patents (1976-March 2015) 230,770
compounds were novel to the Korean patents. In the period 2006-2014, 46%
of compounds appeared in a KIPO filing before a USPTO filing.
Different nomenclature used in English (Chinese)
Chinese Literal translation English
丁酮二酸 butanonedioic acid oxaloacetic acid
酮 ketone 环己酮 cyclohexanone 2-酮丁酸 2-ketobutanoic acid
醋酸 acetic acid 醋酸钠 acetate sodium
OCR errors
Patent text OCR error Translation after fix
已二酸二甲酯 已 should be 己 hexanedioic acid dimethyl ester
Adding spaces in English translation (Chinese/Japanese)
Japanese Literal translation English
ピペリジン臭化水素酸塩 piperidinehydrobromide piperidine hydrobromide

More Related Content

Viewers also liked

Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sight
NextMove Software
 

Viewers also liked (6)

Sketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sightSketchy sketches hiding chemistry in plain sight
Sketchy sketches hiding chemistry in plain sight
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
CINF 29: Visualization and manipulation of Matched Molecular Series for decis...
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 

Similar to Chemistry Enabling Chinese, Japanese and Korean Patents (7)

C8 akumaran
C8 akumaranC8 akumaran
C8 akumaran
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 
Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...Improved chemical text mining of patents using infinite dictionaries, transla...
Improved chemical text mining of patents using infinite dictionaries, transla...
 
Overview of SureChEMBL
Overview of SureChEMBLOverview of SureChEMBL
Overview of SureChEMBL
 
ELSE IF 2019: Multilingual Text Analytics for Extracting Pharma Real-World Ev...
ELSE IF 2019: Multilingual Text Analytics for Extracting Pharma Real-World Ev...ELSE IF 2019: Multilingual Text Analytics for Extracting Pharma Real-World Ev...
ELSE IF 2019: Multilingual Text Analytics for Extracting Pharma Real-World Ev...
 
Chemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical PatentsChemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical Patents
 
Chemical Named Entity Recognition
Chemical Named Entity RecognitionChemical Named Entity Recognition
Chemical Named Entity Recognition
 

More from NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
NextMove Software
 

More from NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»НАДІЯ ФЕДЮШКО БАЦ  «Професійне зростання QA спеціаліста»
НАДІЯ ФЕДЮШКО БАЦ «Професійне зростання QA спеціаліста»
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

Chemistry Enabling Chinese, Japanese and Korean Patents

  • 1. Daniel Lowe and Roger Sayle, NextMove Software Ltd, Cambridge, UK daniel@nextmovesoftware.com NextMove Software Limited Innovation Centre (Unit 23) Cambridge Science Park Milton Road, Cambridge England CB4 0EY www.nextmovesoftware.co.uk www.nextmovesoftware.com Introduction Conclusions Example application: Database of chemicals from Korean patents Chinese, Japanese and Korean (CJK) patents account for over half of all national patent filings and hence are of increasing importance to patent informatics. In the domain of chemistry the specific chemicals mentioned in a patent are often crucial for finding relevant patents and prior art searching. The most interesting chemicals in a patent are typically the novel ones, which are described using systematic chemical nomenclature. Fortunately while CJK chemical names appear quite different to European chemical names, the underlying nomenclature is essentially the same. Hence we have developed tools to translate CJK chemical names to English such that existing chemical entity recognition and conversion tools may be used. † Lowe, D. M.; Sayle, R. A. LeadMine: A Grammar and Dictionary Driven Approach to Entity Recognition. J. Cheminformatics 2015, 7, S5. We have developed state of the art chemical name translation for Chinese, Japanese and Korean chemical names. As an example, we have shown that over a million chemical compounds can be extracted from Korean patents and that many of these are novel and/or published earlier than their USPTO counterparts. Chemistry Enabling Chinese, Japanese and Korean Patents 63 thousand Korean patent applications (as PDF) were analysed (spanning 1990 to March 2015). After translation, LeadMine† was used to extract 1,740,040 distinct compounds with a mean heavy atom count of 27.1. By contrast without translation only 39,824 distinct compounds could be extracted! Asian chemical nomenclature Features • Simplified and traditional Chinese characters are accepted interchangeably where their meaning is equivalent • Erroneous spaces and common OCR errors handled in Chinese documents • KIPO (Korean Intellectual Property Office) patents as PDFs can be converted to XML with patent sections annotated • The translation tool is available as a Java library or command-line utility 6-aminopyrimidine-2,4,5-triol Chinese (Hanzi used for each morpheme) 6-氨基嘧啶-2,4,5-三醇 Japanese (Phonetic translation to Katakana) 6-アミノピリミジン-2,4,5-トリオール Korean (Phonetic translation to Hangul) 6-아미노피리미딘-2,4,5-트리올 ammonia radical pyrimidine three alcohol amino pyrimidine tri ol amino pyrimidine tri ol Challenging cases Word order (Chinese) Chinese Literal translation English 间硝基氯化苄 meta-nitrochlorine of benzyl meta-nitrobenzyl chloride Context sensitivity (Chinese) Implicit ester word (Japanese/Korean) Acid character has two interpretations (Chinese/Japanese/Korean) Japanese Literal translation English 酢酸エチル acetic acid ethyl acetic acid ethyl ester 0 10000 20000 30000 40000 50000 60000 70000 80000 0 10 20 30 40 50 60 Numberofdistinctcompounds Number of heavy atoms With Translation No Translation Performance on Japanese and Chinese DataSet Same structure obtained from English name and translated Asian name* 51,215 Chinese/English catalogue names (from chemBlink) 87.1% 260,965 Japanese/English catalogue names (from ChemicalBook) 88.3% *As the Asian name may be semantically different to the English name (due to a mistake in the catalogue) this gives a lower bound on actual performance Compared to an analysis of USPTO patents (1976-March 2015) 230,770 compounds were novel to the Korean patents. In the period 2006-2014, 46% of compounds appeared in a KIPO filing before a USPTO filing. Different nomenclature used in English (Chinese) Chinese Literal translation English 丁酮二酸 butanonedioic acid oxaloacetic acid 酮 ketone 环己酮 cyclohexanone 2-酮丁酸 2-ketobutanoic acid 醋酸 acetic acid 醋酸钠 acetate sodium OCR errors Patent text OCR error Translation after fix 已二酸二甲酯 已 should be 己 hexanedioic acid dimethyl ester Adding spaces in English translation (Chinese/Japanese) Japanese Literal translation English ピペリジン臭化水素酸塩 piperidinehydrobromide piperidine hydrobromide