Towards automated mining
of chemical structures
in Chinese Patents
Daniel Bonniot de Ruisselet
ChemAxon
ICIC 2013, Vienna
October 16th 2013
Why Chinese patents matter
• Volume, exploding...
• Increasingly innovative
• Potential infringement, lawsuits
– Apple (2008, 2012, 2013), Schneider Electric, Samsung, ...
• Hard to access because of language

Why chemical mining matters
• Find interesting patent(s) using text search
– Each patent can contain 100s of chemical names
– Convert them automatically to structures
– Enables chemical calculations

• Find interesting patent(s) using chemical structure search
– Requires building a chemical database index

• Track structures across multiple patents
– Including multiple languages
– Searching for prior art, infringement, …
– Chemical similarity search

• ...

Putting it together
Chinese patents matter & chemical mining matters
→ Chemical mining of Chinese patents matters

ChemAxon?
• Cheminformatics, since 1998
• All of the top 15 global pharmas are customers
• Chemical database: indexing and searching
• English Name to Structure
• Document to Structure
• Missing piece: Chinese Name to Structure

Chinese Name to Structure

邓巍 (Wei Deng, a.k.a. David)
• Builds on English Name to Structure
• Specific dictionaries
• Changes in algorithms...

The Challenges
1. Chinese texts have no spaces
2. Ester & Salt

乙酸乙酯

Ethyl Acetate
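
The ester example reads as 乙酸 (acetic acid) + 乙 (ethyl) + 酯 (ester): the Chinese name puts the acid first and the alkyl group second, the reverse of the English word order. Salts follow the same acid-first pattern (乙酸钠 is sodium acetate). A minimal sketch of that reordering follows; the three lookup tables and the function name `translate_ester_or_salt` are illustrative assumptions covering only one-character stems, not ChemAxon's dictionaries or algorithm.

```python
import re

# Tiny illustrative tables (assumptions for this sketch, far from complete).
ACID_TO_ANION = {"甲酸": "formate", "乙酸": "acetate", "丙酸": "propionate"}
ALKYL = {"甲": "methyl", "乙": "ethyl", "丙": "propyl", "丁": "butyl"}
METAL = {"钠": "sodium", "钾": "potassium"}


def translate_ester_or_salt(name):
    """Translate '<acid><alkyl>酯' or '<acid><metal>' into English word order.

    Returns None when the name does not fit either toy pattern.
    """
    ester = re.fullmatch(r"(.酸)(.)酯", name)
    if ester and ester.group(1) in ACID_TO_ANION and ester.group(2) in ALKYL:
        # Chinese: acid + alkyl + 酯; English: alkyl + anion.
        return f"{ALKYL[ester.group(2)]} {ACID_TO_ANION[ester.group(1)]}"
    salt = re.fullmatch(r"(.酸)(.)", name)
    if salt and salt.group(1) in ACID_TO_ANION and salt.group(2) in METAL:
        # Chinese: acid + metal; English: metal + anion.
        return f"{METAL[salt.group(2)]} {ACID_TO_ANION[salt.group(1)]}"
    return None


print(translate_ester_or_salt("乙酸乙酯"))  # ethyl acetate
```

A real converter has to handle multi-character stems, locants, and nested substituents; the point here is only the word-order swap.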
The Challenges
3. English: name alterations
丁烷 → buta + ane → butane

4. Chinese: many characters have different meanings
盐 = salt
酸 = acid
盐酸 = hydrochloric acid
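
Because 盐酸 means something different from 盐 followed by 酸, and because there are no spaces to mark token boundaries, a tokenizer has to prefer the longest dictionary entry at each position. The sketch below uses forward maximum matching, a common segmentation heuristic chosen here for illustration; the slides do not say which algorithm ChemAxon uses, and `LEXICON` and `tokenize` are hypothetical names.

```python
# Toy lexicon built from the slide's own examples.
LEXICON = {"盐": "salt", "酸": "acid", "盐酸": "hydrochloric acid"}


def tokenize(text, lexicon=LEXICON):
    """Greedy forward maximum matching: take the longest known entry first.

    Characters that match no entry are emitted as single-character tokens.
    """
    max_len = max(len(entry) for entry in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(tokenize("盐酸盐"))  # ['盐酸', '盐']
```

With this rule 盐酸盐 (a hydrochloride) is split into 盐酸 + 盐 rather than three unrelated characters.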
OCR Error correction
3-(笨基)丙酸 → 3-(苯基)丙酸
苯 = benzene
苯基 = phenyl
丙酸 = propionic acid
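
One way to picture this kind of correction: keep a table of visually confusable characters and accept a substitution only when it turns the surrounding text into a known chemical term. The sketch below is an assumption-laden illustration, not the method presented in the talk; `CONFUSABLE`, `KNOWN_TERMS`, and `correct_ocr` are hypothetical names and the tables hold just the slide's example.

```python
# Visually similar characters that OCR may confuse (illustrative only).
CONFUSABLE = {"笨": ["苯"]}
# Multi-character chemical terms used to validate a substitution.
KNOWN_TERMS = {"苯基", "丙酸"}


def _covered_by_known_term(text, pos):
    """True if some known term occurs in `text` overlapping index `pos`."""
    for term in KNOWN_TERMS:
        start = text.find(term, max(0, pos - len(term) + 1))
        if start != -1 and start <= pos:
            return True
    return False


def correct_ocr(name):
    """Replace a confusable character only if that yields a known term."""
    chars = list(name)
    for i, ch in enumerate(chars):
        for replacement in CONFUSABLE.get(ch, ()):
            candidate = chars[:i] + [replacement] + chars[i + 1:]
            if _covered_by_known_term("".join(candidate), i):
                chars = candidate
                break
    return "".join(chars)


print(correct_ocr("3-(笨基)丙酸"))  # 3-(苯基)丙酸
```

Requiring a known term around the replaced character keeps ordinary words containing 笨 untouched.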
Chinese Document to Structure
• Additional challenge: no spaces
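• 如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体。
  (“tert-Butyl {5-[2-(4-n-octylphenyl)ethyl]-2,2-dimethyl-1,3-dioxan-5-yl}carbamate, shown in formula I, is an important intermediate in the synthesis of fingolimod and its derivatives.”)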

• XML Markup
– Patent metadata
– Encoding of characters
– Tags (e.g. <p>)
• Document annotation
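
To make the two obstacles on this slide concrete (markup around the text, and no spaces inside it), here is a rough sketch that strips tags, decodes character entities, and then looks for runs of characters that can belong to a chemical name. The character inventory `NAME_CHARS` is an assumption covering only the example sentence, and `candidate_names` is a hypothetical helper; a production system would rely on full dictionaries and a grammar instead of a character class.

```python
import re
from html import unescape

# Characters allowed inside a candidate name (assumption: only what the
# example sentence needs; a real inventory is far larger).
NAME_CHARS = "正辛基苯乙二甲氧六环氨酸叔丁酯"
LOCANT_CHARS = "0123456789-,()[]{}"

_TAG = re.compile(r"<[^>]+>")
_CANDIDATE = re.compile("[" + re.escape(NAME_CHARS + LOCANT_CHARS) + "]{4,}")


def candidate_names(xml_fragment):
    """Return (offset, text) pairs for likely chemical names in a fragment."""
    # Drop tags such as <p>, then decode entities such as &amp;.
    text = unescape(_TAG.sub("", xml_fragment))
    spans = []
    for match in _CANDIDATE.finditer(text):
        # Ignore runs made only of digits and punctuation (dates, ranges).
        if any(ch in NAME_CHARS for ch in match.group()):
            spans.append((match.start(), match.group()))
    return spans


fragment = (
    "<p>如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}"
    "氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体。</p>"
)
for offset, name in candidate_names(fragment):
    print(offset, name)
```

The returned offset is relative to the tag-stripped text, which is the kind of location information a document annotation step would need.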

Document to Database
Validation 1: Chinese name to structure
• Test set: 38,600 Chinese names + CAS number
• Contains unusual, incorrect, ambiguous names, radicals, inorganic salts, ...
• Conversion rate = 59 – 79 %
• Accuracy = 91%
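
The slide does not define the two metrics. A plausible reading, assumed here, is that conversion rate is the share of names for which any structure is produced, and accuracy is the share of converted names whose structure matches the reference tied to the CAS number. The sketch below computes both under that assumption; the record layout and the placeholder identifiers `"A"`, `"B"`, `"C"` are made up for illustration and are not data from the study.

```python
def score(results):
    """Return (conversion_rate, accuracy) for a list of result records.

    Each record is a dict with 'predicted' (a structure identifier or None
    when conversion failed) and 'reference' (the expected identifier).
    """
    converted = [r for r in results if r["predicted"] is not None]
    correct = [r for r in converted if r["predicted"] == r["reference"]]
    conversion_rate = len(converted) / len(results)
    # Accuracy is measured only over the names that were converted.
    accuracy = len(correct) / len(converted) if converted else 0.0
    return conversion_rate, accuracy


toy_results = [
    {"predicted": "A", "reference": "A"},
    {"predicted": "B", "reference": "B"},
    {"predicted": "C", "reference": "A"},   # converted but wrong
    {"predicted": None, "reference": "B"},  # not converted
]
print(score(toy_results))
```

Under this definition a tool can trade one metric against the other: refusing doubtful names lowers the conversion rate but raises accuracy.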

Validation 2: Chinese patents
• 54K Chinese patents with automated English translation
• Filter: structures with at least 20 heavy atoms, and patents with at least 20 structures
• Remains: 2108 patents
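
The filter can be sketched as two passes: first drop small structures, then drop patents that are left with too few structures. Applying the two thresholds in that order is an assumption, as is the input layout (a mapping from a patent identifier to the heavy-atom counts of its extracted structures); `filter_patents` and the identifiers `P1` to `P3` are hypothetical.

```python
MIN_HEAVY_ATOMS = 20   # threshold from the slide
MIN_STRUCTURES = 20    # threshold from the slide


def filter_patents(heavy_atoms_by_patent):
    """Keep patents with at least MIN_STRUCTURES sufficiently large structures.

    `heavy_atoms_by_patent` maps a patent id to a list of heavy-atom counts,
    one per extracted structure.
    """
    kept = {}
    for patent_id, counts in heavy_atoms_by_patent.items():
        large = [n for n in counts if n >= MIN_HEAVY_ATOMS]
        if len(large) >= MIN_STRUCTURES:
            kept[patent_id] = large
    return kept


demo = {
    "P1": [25] * 20 + [5] * 3,   # 20 large structures: kept
    "P2": [25] * 19 + [5] * 10,  # only 19 large structures: dropped
    "P3": [30] * 40,             # kept
}
print(sorted(filter_patents(demo)))  # ['P1', 'P3']
```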

Conclusions
• Patent volume in Chinese is booming
• It is important to mine & monitor it
• Automated solutions are needed, but hard
• General purpose auto translation is not enough
• Chinese Name to Structure (N2S) already gives better results
• ChemAxon can build solutions for specific workflows
• More collaboration with patent providers is needed to keep improving quality and solutions

谢谢! (Thank you!)

Extra information

Automatic OCR Error Correction
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate
→ (2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide
→ N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
我们日前止在研究开友中文化字名称的 OCR 白动纠错工力能
→ 我们目前正在研究开发中文化学名称的 OCR 自动纠错功能
  (“We are currently researching and developing automatic OCR error correction for Chinese chemical names.”)
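
The first example above involves three classic OCR confusions: `rn` read for `m`, `1` for `l`, and `0` for `o`. A minimal sketch of one possible correction strategy is a breadth-first search over such substitutions, accepting the first variant that appears in a lexicon of known name fragments. This is an illustration under stated assumptions, not ChemAxon's algorithm: `CONFUSIONS`, `LEXICON`, and `correct_name` are hypothetical, the lexicon holds only the two fragments needed for the example, and the `Λr` for `N` case is not handled.

```python
import re
from collections import deque

# (misread, intended) pairs; illustrative, not exhaustive.
CONFUSIONS = [("rn", "m"), ("1", "l"), ("0", "o")]
# Toy lexicon of valid name fragments (assumption for this sketch).
LEXICON = {"methylsulfanyl", "hydroxybutanedioate"}


def _variants(token):
    """Yield every string reachable from `token` by one substitution."""
    for wrong, right in CONFUSIONS:
        start = token.find(wrong)
        while start != -1:
            yield token[:start] + right + token[start + len(wrong):]
            start = token.find(wrong, start + 1)


def _repair(token, max_edits=3):
    """Breadth-first search for a lexicon entry within `max_edits` edits."""
    if token in LEXICON:
        return token
    seen = {token}
    queue = deque([(token, 0)])
    while queue:
        current, edits = queue.popleft()
        if edits == max_edits:
            continue
        for variant in _variants(current):
            if variant in LEXICON:
                return variant
            if variant not in seen:
                seen.add(variant)
                queue.append((variant, edits + 1))
    return token  # nothing better found: leave the token unchanged


def correct_name(name):
    """Repair alphanumeric fragments; keep locants and punctuation as-is."""
    parts = re.split(r"([^A-Za-z0-9]+)", name)
    fixed = [
        _repair(p) if p.isalnum() and any(c.isalpha() for c in p) else p
        for p in parts
    ]
    return "".join(fixed)


print(correct_name("(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate"))
```

Purely numeric tokens are skipped so that genuine locants such as `2` and `3` are never rewritten.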
From Document to Structures

Non-searchable patent (50 pages)

Structure (text + image) + location
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also import SMILES, InChI, CAS number …
– Images: OSRA, ...
– Works with scanned non-searchable PDF
– Returns structures and their location in the document
• Supported formats:
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin, …)
– PDF, text, XML, HTML

ChemAxon’s “Document to Database”
• Data in DB:
– Structures
– Source (name, smiles, embedded, …) and location
– Documents, Authors, Metadata...

• Questions:
– What structures appear in a specific document?
– What documents contain a structure/substructure/...?
– What documents written since 2010 in location X contain substructure S?
– ...
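
The questions above map naturally onto a small relational layout: documents, structures, and a table recording where each structure occurs. The sketch below uses Python's built-in `sqlite3` with a made-up schema and three made-up documents; it is not ChemAxon's database design. It also substitutes exact matching on a SMILES string for substructure search, which in practice needs a chemistry-aware index.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and toy rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY, title TEXT, year INTEGER, location TEXT);
        CREATE TABLE structures (
            id INTEGER PRIMARY KEY, smiles TEXT UNIQUE);
        CREATE TABLE occurrences (
            document_id INTEGER REFERENCES documents(id),
            structure_id INTEGER REFERENCES structures(id),
            source TEXT,      -- name, smiles, embedded, ...
            page INTEGER);    -- location inside the document
    """)
    db.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", [
        (1, "Patent A", 2009, "CN"),
        (2, "Patent B", 2012, "CN"),
        (3, "Patent C", 2013, "US"),
    ])
    db.executemany("INSERT INTO structures VALUES (?, ?)", [
        (1, "CC(=O)OCC"),  # ethyl acetate
        (2, "CCO"),        # ethanol
    ])
    db.executemany("INSERT INTO occurrences VALUES (?, ?, ?, ?)", [
        (1, 1, "name", 3), (2, 1, "name", 7),
        (2, 2, "smiles", 8), (3, 1, "embedded", 2),
    ])
    return db


def structures_in_document(db, document_id):
    """What structures appear in a specific document?"""
    rows = db.execute(
        "SELECT DISTINCT s.smiles FROM structures s "
        "JOIN occurrences o ON o.structure_id = s.id "
        "WHERE o.document_id = ? ORDER BY s.smiles", (document_id,))
    return [row[0] for row in rows]


def documents_with_structure(db, smiles, since_year=None, location=None):
    """What documents contain a structure, optionally by year and location?"""
    sql = ("SELECT DISTINCT d.title FROM documents d "
           "JOIN occurrences o ON o.document_id = d.id "
           "JOIN structures s ON s.id = o.structure_id "
           "WHERE s.smiles = ?")
    params = [smiles]
    if since_year is not None:
        sql += " AND d.year >= ?"
        params.append(since_year)
    if location is not None:
        sql += " AND d.location = ?"
        params.append(location)
    sql += " ORDER BY d.title"
    return [row[0] for row in db.execute(sql, params)]


db = build_demo_db()
print(structures_in_document(db, 2))
print(documents_with_structure(db, "CC(=O)OCC", since_year=2010, location="CN"))
```

Keeping the source and page on the occurrence row is what lets a hit point back to the exact place in the document where the structure was found.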

Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

ICIC 2013 Conference Proceedings Daniel Bonniot ChemAxon

  • 1. Towards automated mining of chemical structures in Chinese Patents. Daniel Bonniot de Ruisselet, ChemAxon. ICIC 2013, Vienna, October 16th 2013
  • 5. Why Chinese patents matter • Volume, exploding... • Increasingly innovative • Potential infringement, lawsuits ● Apple (2008, 2012, 2013), Schneider Electric, Samsung, ... • Hard to access because of language
  • 6. Why chemical mining matters • Find interesting patent(s) using text search – Each patent can contain 100s of chemical names – Convert them automatically to structures – Enables chemical calculations • Find interesting patent(s) using chemical structure search – Requires building a chemical database index • Track structures across multiple patents – Including multiple languages – Searching for prior art, infringement, … – Chemical similarity search • ...
  • 7. Putting it together: Chinese patents matter & chemical mining matters → Chemical mining of Chinese patents matters
  • 8. ChemAxon? • Cheminformatics, since 1998 • All of the top 15 global pharmas are customers • Chemical database: indexing and searching • English Name to Structure • Document to Structure • Missing piece: Chinese Name to Structure
  • 9. Chinese Name to Structure • 邓巍 (Wei Deng, a.k.a. David) • Builds on English Name to Structure • Specific dictionaries • Changes in algorithms...
  • 10. The Challenges 1. Chinese texts have no spaces 2. Ester & Salt: 乙酸乙酯 = ethyl acetate
  • 11. The Challenges 3. English: name alterations: 丁烷 → buta + ane → butane 4. Chinese: many characters have different meanings: 盐 = salt, 酸 = acid, 盐酸 = hydrochloric acid
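
The first and fourth challenges above (no spaces, and characters whose meaning depends on their neighbours) can be illustrated with greedy forward maximum matching, a classic dictionary-based segmentation technique. This is only a minimal sketch: the slides do not say which algorithm ChemAxon uses, and the `FRAGMENTS` dictionary and `segment` function are hypothetical names introduced for illustration.

```python
# Hypothetical toy dictionary: Chinese name fragment -> English gloss.
# A real system would need a far larger, curated dictionary.
FRAGMENTS = {
    "乙酸": "acetic acid",
    "乙": "ethyl",
    "酯": "ester",
    "盐酸": "hydrochloric acid",
    "盐": "salt",
    "酸": "acid",
    "丁烷": "butane",
}


def segment(name, fragments=FRAGMENTS):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that matches,
    so that 盐酸 is read as one fragment rather than 盐 followed by 酸.
    """
    max_len = max(len(key) for key in fragments)
    tokens = []
    pos = 0
    while pos < len(name):
        for size in range(min(max_len, len(name) - pos), 0, -1):
            piece = name[pos:pos + size]
            if piece in fragments:
                tokens.append(piece)
                pos += size
                break
        else:
            # Unknown character: emit it alone so the gap stays visible.
            tokens.append(name[pos])
            pos += 1
    return tokens
```

With this toy dictionary, `segment("乙酸乙酯")` splits the name into 乙酸, 乙 and 酯, and `segment("盐酸")` keeps 盐酸 as a single fragment. Greedy matching can pick the wrong split on real names, which is one reason the slides describe the problem as hard.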
  • 12. OCR Error correction: 3-(笨基)丙酸, where OCR misread 苯 as 笨 • 苯 = benzene • 苯基 = phenyl • 丙酸 = propionic acid
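
One way to picture the correction shown on this slide is a lookup against a table of visually confusable characters, accepting a substitution only when it yields a known name fragment. This is a sketch under stated assumptions, not the method used in the product: `CONFUSABLE`, `KNOWN` and `correct_fragment` are hypothetical, and the tables hold only the characters from the slide.

```python
from itertools import product

# Hypothetical confusion table: OCR output character -> plausible intended characters.
CONFUSABLE = {"笨": ["苯"]}

# Hypothetical dictionary of known chemical name fragments (from the slide).
KNOWN = {"苯", "苯基", "丙酸"}


def correct_fragment(fragment, known=KNOWN, confusable=CONFUSABLE):
    """Return a known fragment reachable by swapping confusable characters.

    Returns the fragment unchanged if it is already known, and None if no
    combination of substitutions produces a known fragment.
    """
    if fragment in known:
        return fragment
    # For each character, try the character itself plus its look-alikes.
    options = [[ch] + confusable.get(ch, []) for ch in fragment]
    for candidate in product(*options):
        text = "".join(candidate)
        if text in known:
            return text
    return None
```

Here `correct_fragment("笨基")` yields 苯基 (phenyl), while a fragment with no dictionary match returns `None` and is left for other handling.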
  • 14. Chinese Document to Structure • Additional challenge: no spaces • 如式 I 所示的 {5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体。 (English: tert-Butyl {5-[2-(4-n-octylphenyl)ethyl]-2,2-dimethyl-1,3-dioxan-5-yl}carbamate, shown in formula I, is an important intermediate in the synthesis of fingolimod and its derivatives.)
  • 16. Chinese Document to Structure • Additional challenge: no spaces (same example sentence as slide 14) • XML Markup ● Patent metadata ● Encoding of characters ● Tags (e.g. <p>) • Document annotation
  • 21. Validation 1: Chinese name to structure • Test set: 38,600 Chinese names + CAS number • Contains unusual, incorrect, ambiguous names, radicals, inorganic salts • Conversion rate = 59–79% • Accuracy = 91%
  • 22. Validation 2: Chinese patents • 54K Chinese patents with automated English translation • Filter: structures with at least 20 heavy atoms, and patents with at least 20 structures • Remains: 2108 patents
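
The filter on this slide can be read as two steps: keep only structures with at least 20 heavy atoms, then keep only patents that still have at least 20 such structures. That order is an assumption, since the slide does not spell it out. The sketch below uses a hypothetical input shape (patent id mapped to a list of heavy-atom counts) and a hypothetical function name, `filter_patents`.

```python
def filter_patents(patents, min_heavy_atoms=20, min_structures=20):
    """Apply the two-step filter to a mapping of patent id -> heavy-atom counts.

    Assumed reading of the slide: small structures are dropped first, and a
    patent is kept only if enough large structures remain.
    """
    kept = {}
    for patent_id, heavy_atom_counts in patents.items():
        large = [n for n in heavy_atom_counts if n >= min_heavy_atoms]
        if len(large) >= min_structures:
            kept[patent_id] = large
    return kept


# Made-up demo data: the ids and counts are illustrative, not from the study.
demo = {
    "CN-A": [25] * 20 + [5] * 3,   # 20 large structures: kept
    "CN-B": [25] * 19 + [5] * 10,  # only 19 large structures: dropped
    "CN-C": [30] * 40,             # 40 large structures: kept
}
result = filter_patents(demo)
```

In the demo data, only the patents with 20 or more large structures survive, mirroring how the 54K patents on the slide shrink to 2108.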
  • 23. Validation 2: Chinese patents
  • 24. Conclusions • Patent volume in Chinese is booming • It is important to mine & monitor it • Automated solutions are needed, but hard • General purpose auto translation is not enough • Chinese N2S already gives better results • ChemAxon can build solutions for specific workflows • More collaboration with patent providers is needed to keep improving quality and solutions • 谢谢! (Thank you!)
  • 26. Automatic OCR Error Correction • (2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate • Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide • 我们日前止在研究开友中文化字名称的 OCR 白动纠错工力能 → 我们目前正在研究开发中文化学名称的 OCR 自动纠错功能 (English: We are currently researching and developing automatic OCR error correction for Chinese chemical names)
  • 27. From Document to Structures: Non-searchable patent (50 pages) → Structure (text + image) + location
  • 28. ChemAxon’s “Document to Structure” • Extract chemical information from documents – Names: powered by the Naming Technology – Also import SMILES, InChI, CAS number … – Images: OSRA, ... – Works with scanned non-searchable PDF – Returns structures and their location in the document
  • 29. ChemAxon’s “Document to Structure” • Supported formats: – MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt … – Embedded structure objects (ChemDraw, Symyx, Marvin, …) – PDF, text, XML, HTML
  • 30. ChemAxon’s “Document to Database” • Data in DB: – Structures – Source (name, smiles, embedded, …) and location – Documents, Authors, Metadata... • Questions: – What structures appear in a specific document? – What documents contain a structure/substructure/...? – What documents written since 2010 in location X contain substructure S? – ...
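
The questions on the last slide map naturally onto relational queries. The sketch below uses Python's built-in `sqlite3` module with a hypothetical two-table schema (`documents`, `hits`) and made-up rows; it is not ChemAxon's schema. It also matches structures by exact SMILES string only, because true substructure search needs a cheminformatics engine that plain SQL does not provide.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and demo rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY, title TEXT, year INTEGER, country TEXT);
        CREATE TABLE hits (
            doc_id INTEGER REFERENCES documents(id),
            smiles TEXT, source TEXT, page INTEGER);
    """)
    db.executemany(
        "INSERT INTO documents VALUES (?, ?, ?, ?)",
        [(1, "Demo patent A", 2009, "CN"), (2, "Demo patent B", 2012, "CN")],
    )
    db.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?)",
        [
            (1, "CCOC(C)=O", "name", 3),   # ethyl acetate, found from a name
            (2, "CCOC(C)=O", "name", 7),
            (2, "CCCC", "image", 9),       # butane, found from an image
        ],
    )
    return db


def structures_in(db, doc_id):
    """What structures appear in a specific document?"""
    rows = db.execute(
        "SELECT smiles FROM hits WHERE doc_id = ? ORDER BY smiles", (doc_id,))
    return [row[0] for row in rows]


def documents_with(db, smiles, since_year=0):
    """What documents from since_year onwards contain this exact structure?"""
    rows = db.execute(
        """SELECT DISTINCT d.id FROM documents d
           JOIN hits h ON h.doc_id = d.id
           WHERE h.smiles = ? AND d.year >= ?
           ORDER BY d.id""",
        (smiles, since_year))
    return [row[0] for row in rows]
```

For example, `documents_with(db, "CCOC(C)=O", since_year=2010)` on the demo database returns only the 2012 document. Storing the `source` and `page` columns alongside each hit is what lets a result point back to the location in the original document, as the slide describes.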