Matthias Negri, PhD
Scientific Information Center
Boehringer Ingelheim Pharma GmbH & Co. KG
Chemistry-Enriched Patent Curation
semi-automatic analysis and elaboration of patents
ChemAxon UGM 2015, Budapest, 20 May 2015
Árpád Figyelmesi
ChemAxon
Content
1. Chemistry in patents
2. Why do we need a patent curation workflow?
3. Semi-automatic Patent Curation Workflow - Overview
4. Linked tools/technologies
5. ChemCurator (ChemCC)
6. Semi-automatic Patent Curation Workflow – Step by Step
7. Lessons learned, weak-points, limitations
8. Outlook
Negri Matthias, ChemAxon UGM 2015
Chemistry in patents
Chemistry appears in diverse forms in patents:
1. TEXT - IUPAC names, common names, etc.
2. IMAGES - embedded within or attached to the document
3. ATTACHMENTS (MOL/CDX)
4. TABLES
– as a single image file (tables with chemistry and bioactivity data)
– as chemistry-only image files embedded within table tags
5. Markush Structures/Formulas with R-groups
---------------------------------------------------------------------------------------
→ Currently NO commercial solution covers all these cases
→ Most of the cases are considered in the patent curation workflow
(Markush/R-group Formulas are recognized and stored separately)
Why do we need a patent curation workflow?
Motivations:
1. Linked chemistry-retrieval from patents (+ chemistry as images)
2. IUPAC-enriched XML patent files → as a NEW source for text-mining
3. Extraction of bioactivity data/targets/diseases/… in relation to chemistry
4. Similarity/Substructure frequency in compound sets of patents
5. …
Semi-automatic Patent Curation Workflow
Overview – current state
2 parallel branches (common INPUT):
- KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML
- ChemCC branch: FASTER, but LESS informative/flexible; ChemCC as the (near) future perspective
- I2E API via KNIME: batch indexing, text-mining and (relational) data retrieval
Linked tools/technologies
1. KNIME/XPATH
2. ChemAxon ChemCurator (ChemCC)
3. Other ChemAxon tools in KNIME nodes (document2structure/d2s,
Naming, Molconverter, Structure checker, Standardizer, …)
4. Text/data-mining – Linguamatics I2E (+I2E Chemistry)
5. Optical Structure Recognition – Keymodule CLiDE Batch
ChemCurator (ChemCC)
Computer-aided chemical data extraction
- English, Chinese and Japanese N2S
- Markush Editor
- Structure Checker
- Hit visualization
- Third party OSR technologies
Árpád Figyelmesi, ChemAxon UGM 2015
ChemCurator (ChemCC)
Name to Structure
- Support for many nomenclatures (common, drug names, …)
- IUPAC names
- Custom dictionaries
- English (2008)
- Chinese (2013)
- Japanese (2014)
ChemCurator (ChemCC)
Compound Extraction View
[Screenshot; panels: Compound list, Project explorer, Annotated document, Selected structures]
ChemCurator (ChemCC)
Markush Extraction View
[Screenshot; panels: Markush editor, Example structures, Annotated document, Project explorer, Selected structures, Structure checker]
ChemCurator (ChemCC)
General Document Curation
Extract Markush Structures from patents
Extract specific structures
- Journal articles
- Company reports
- Patent examples
Structure extraction wizards
- Exclude fragments, chemical elements, etc.
ChemCurator (ChemCC)
Integration & Information Sharing
Other ChemAxon products:
- Direct IJC schema connection
- Project sharing function
- Accessible from Plexus, IJC, etc.
Third party tools:
- Standard file formats
- Export functions
- Easily processable projects
Semi-automatic Patent Curation Workflow
a) input sources and b) bibliographic data
a) Input sources
- files with lists of patent IDs
- XML collection
- …
b) Retrieval of bibliographic information and attachment data
- family ID, patent references, expiration date, etc.
- Attachment files MOL/CDX (US patents only), TIF files
- …
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
1. ChemCurator branch
- data retrieval (XML, attachments) from IFI Claims Direct BI-server
- ChemCurator project creation/sharing/annotation → html output
- Chemistry extraction name2structure/document2structure → sdf output
→ Generation of pre-annotated patent set stored as ChemCC projects
→ Faster, but lower quality within the chemistry extraction process
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME branch
- Clean-up of OCR errors in KNIME → improved chemistry recognition
- MOL/CDX/TIF - standardizer, structure checker → filter formulas, solvents, R-groups
→ Higher quality and more control in chemistry extraction process
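A dictionary-based clean-up of OCR errors in chemical names could look roughly like the sketch below. The slides do not show the actual correction dictionary, so the entries in OCR_FIXES and the function name clean_ocr are purely illustrative assumptions; in the presented workflow this step runs inside KNIME, not in a standalone script.

```python
import re

# Hypothetical correction dictionary: typical OCR confusions ("rn" for "m",
# digit "1" for letter "l") restricted to chemical name fragments, so that
# ordinary words are not rewritten by accident.
OCR_FIXES = {
    "rnethyl": "methyl",
    "arnino": "amino",
    "pheny1": "phenyl",
    "ch1oro": "chloro",
}

# Longest fragments first, so that a longer entry wins over a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(OCR_FIXES, key=len, reverse=True)),
    re.IGNORECASE,
)


def clean_ocr(name: str) -> str:
    """Replace known OCR-damaged fragments in a chemical name."""
    return _PATTERN.sub(lambda m: OCR_FIXES[m.group(0).lower()], name)


cleaned = clean_ocr("4-ch1oro-N-rnethylaniline")
```

Restricting the dictionary to whole name fragments rather than single character pairs is a deliberate choice here: a blanket "rn" to "m" rule would also damage correct names.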
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME branch
- MOL → IUPAC
- CDX → IUPAC
- TIFF (via CLiDE) → IUPAC
Semi-automatic Patent Curation Workflow
c) chemistry retrieval/extraction/filtering
2. KNIME - Chemistry “Normalization”
- (within KNIME) set up a relation between each TIFF/attachment file
1. to (one or more) IUPAC name(s)
2. to a position/section in the text/document
Merging and comparison of the converted chemistry output of MOL/CDX/TIF – 2 “quality” checks
- IUPAC
- string length (different output order of chemicals in multiple-molecule image/multi-MOL files)
- OCR-correction (“dictionary” based)
Workflow steps shown:
- Merge IUPAC
- Clean-Up IUPAC
- If NO IUPAC → IMG-name is set
- “Normalize” IUPAC names
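One possible reading of the two "quality" checks is sketched below: names converted from the MOL, CDX and TIF sources are first compared directly, and if they differ, their string lengths are compared, since multi-molecule files may list the same components in a different order. The function names, the ";" component separator and the returned labels are assumptions made for illustration only, not the authors' implementation.

```python
def normalize(name: str) -> str:
    """Lower-case a name and drop all whitespace before comparison."""
    return "".join(name.lower().split())


def compare_names(names: dict) -> str:
    """Compare IUPAC names obtained from several sources for one structure.

    `names` maps a source label (e.g. "mol", "cdx", "tif") to the converted
    name; empty strings mean that the source yielded no name.
    """
    normalized = {src: normalize(n) for src, n in names.items() if n}
    if not normalized:
        return "no_name"        # caller would fall back to the image name
    distinct = set(normalized.values())
    if len(distinct) == 1:
        return "identical"      # check 1: the names agree
    if len({len(v) for v in distinct}) == 1:
        return "same_length"    # check 2: possibly just a different order
    return "mismatch"


result = compare_names({
    "mol": "ethanol; benzene",
    "cdx": "benzene; ethanol",
    "tif": "Ethanol;benzene",
})
```

Equal length is only a hint, not proof of identity, so "same_length" results would still need a closer look.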
Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
Chemistry present as text is recognized and extracted either via
- text mining (I2E Chemistry, with d2s working in the background) or
- within KNIME/ChemCC using annotate/molconvert
Replacement:
<chemistry> vs IUPAC
IUPAC-enriched XML
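The replacement step can be illustrated with a small sketch that swaps each <chemistry> element for its IUPAC name and falls back to the image file name when no name is available. The XML layout (a <chemistry id="..."> element wrapping an <img file="..."/>), the IDs and the function name are assumptions for illustration; real patent XML and the KNIME implementation may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a patent paragraph with two embedded structures.
SAMPLE = (
    '<p>Example 1: <chemistry id="CHEM-1"><img file="a.tif"/></chemistry>'
    ' was obtained as a solid. Example 2: '
    '<chemistry id="CHEM-2"><img file="b.tif"/></chemistry>.</p>'
)


def replace_chemistry(root, names_by_id):
    """Replace every <chemistry> element by plain text (name or image name)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for chem in list(root.iter("chemistry")):
        parent = parent_of[chem]
        img = chem.find("img")
        fallback = img.get("file", "") if img is not None else ""
        name = names_by_id.get(chem.get("id", ""), fallback)
        text = name + (chem.tail or "")
        siblings = list(parent)
        index = siblings.index(chem)
        if index == 0:
            # No preceding sibling: append to the parent's leading text.
            parent.text = (parent.text or "") + text
        else:
            # Otherwise append to the tail of the preceding sibling.
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + text
        parent.remove(chem)
    return root


root = replace_chemistry(
    ET.fromstring(SAMPLE), {"CHEM-1": "2-acetyloxybenzoic acid"}
)
enriched = ET.tostring(root, encoding="unicode")
```

Keeping the element's tail text is the important detail: dropping it would silently delete the sentence fragment that follows each structure.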
Semi-automatic Patent Curation Workflow
d) TIF/attachment replacement with IUPAC names
OCR errors in chemical names
[Screenshot; labels: TIF, CDX, MOL. This text chunk is replaced by the IUPAC name]
Semi-automatic Patent Curation Workflow
e) Bioactivity/tabular data extraction with KNIME/XPATH
XPATH/XML parsing and extraction of:
- Tables
- Rows - XML tags & strings
- Entries - XML tags & strings
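The table, row and entry extraction can be sketched with XPath-style queries as below. The workflow itself uses KNIME's XPath node; this standalone version relies on Python's xml.etree.ElementTree, and the element names (<table>, <row>, <entry>) as well as the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical patent fragment with one bioactivity table.
SAMPLE = """<description>
  <tables>
    <table>
      <tgroup>
        <tbody>
          <row><entry>Example Nr.</entry><entry>IC50 [nM]</entry></row>
          <row><entry>1</entry><entry>12</entry></row>
          <row><entry>2</entry><entry>340</entry></row>
        </tbody>
      </tgroup>
    </table>
  </tables>
</description>"""


def extract_tables(xml_text: str) -> list:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iterfind(".//table"):
        rows = []
        for row in table.iterfind(".//row"):
            # itertext() also collects text nested in inline child tags.
            cells = ["".join(e.itertext()).strip() for e in row.iterfind("./entry")]
            rows.append(cells)
        tables.append(rows)
    return tables


tables = extract_tables(SAMPLE)
```

Each table comes back as plain nested lists, which makes it straightforward to pair an example number with its bioactivity value in a later step.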
Semi-automatic Patent Curation Workflow
f) Text-/datamining with Linguamatics I2E via KNIME
IUPAC-enriched XML as source for I2E API/textmining
- indexing
- pre-defined queries
- results retrieval
- saved as SDF files (KNIME)
(Chemistry-related) information retrieved by text-mining
- Example Nr.
- Bioactivity data from tables
- Claims, regions where chemistry appears in patents
- Genes, diseases
Semi-automatic Patent Curation Workflow
f) Bioactivity Data using I2E multi-queries – 2 steps
Source: (IUPAC-enriched) XML
1. Example Nr. – IUPAC
2. Example Nr. – Bioactivity data
[Screenshots; labels: Table, Image; for comparison, chemistry in PDF]
Semi-automatic Patent Curation Workflow
g) Visualize data-/textmining results in ChemCC
- SDF file imported into ChemCC project + automatic mapping to existing chemistry
[Screenshot; columns: IUPAC, Bioactivity, Example Nr.]
Lessons learned, weak-points, limitations
1. Advantages KNIME Full-Mode (MOL/CDX/TIF) vs ChemCC branch
- chemistry check/normalization – 3 input sources → improved quality
- improved chemistry recall - ALL images (incl. tables and drawings)
- More filtering options in KNIME workflow vs ChemCurator only
- IUPAC-enriched XML as new source for I2E
Advantages ChemCC vs KNIME Full-Mode (MOL/CDX/TIF)
- faster
- Image processing using CLiDE is already incorporated with naming
Lessons learned, weak-points, limitations
2. No full automation of the workflow due to lack of homogeneity in patent data (US vs WO, EP, etc.)
- Missing attachment files
- No tables present in XML
- Error rate in chemistry recognition (OPSIN vs n2s/d2s)
- …
→ NEEDS: different workflows/branches, patent-files clean-up (OCR)
3. Time- and compute-intensive process
Outlook
1. KNIME Workflow
- Add new data fields to Chemicals: BI-internal codes, genes, targets, etc.
- Usage of ChemCC html output as source for textmining
- Ontology mapping
- Expand workflow by including other sources (internal PDF, literature full-text)
- Use KNIME to interconnect to BI-internal workflows, DB, etc.
→ chemistry-linked information in a patent-DB → improved (semantic) search
Outlook
2. ChemCurator
- Improved n2s
- New command-line functions
- Complex-phrase requests from IFI server
- Improved SDF import
- Preprocessing wizards
Thank you!
WMF 2024 - Unlocking the Future of Data Powering Next-Gen AI with Vector Data...
 
DevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps ServicesDevOps Consulting Company | Hire DevOps Services
DevOps Consulting Company | Hire DevOps Services
 
ACE - Team 24 Wrapup event at ahmedabad.
ACE - Team 24 Wrapup event at ahmedabad.ACE - Team 24 Wrapup event at ahmedabad.
ACE - Team 24 Wrapup event at ahmedabad.
 
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...Transforming Product Development using OnePlan To Boost Efficiency and Innova...
Transforming Product Development using OnePlan To Boost Efficiency and Innova...
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Upturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in NashikUpturn India Technologies - Web development company in Nashik
Upturn India Technologies - Web development company in Nashik
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
Liberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptxLiberarsi dai framework con i Web Component.pptx
Liberarsi dai framework con i Web Component.pptx
 
TMU毕业证书精仿办理
TMU毕业证书精仿办理TMU毕业证书精仿办理
TMU毕业证书精仿办理
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
 

EUGM15 - Matthias Negri, Árpád Figyelmesi (Boehringer Ingelheim, ChemAxon): Chemistry-enriched patent curation - automatized chemical and semantic analysis and elaboration of large patent sets

  • 1. Matthias Negri, PhD, Scientific Information Center, Boehringer Ingelheim Pharma GmbH & Co. KG. Chemistry-Enriched Patent Curation: semi-automatic analysis and elaboration of patents. ChemAxon UGM 2015, Budapest, 20 May 2015. Árpád Figyelmesi, ChemAxon.
  • 2. Content 1. Chemistry in patents 2. Why do we need a patent curation workflow? 3. Semi-automatic Patent Curation Workflow - Overview 4. Linked tools/technologies 5. ChemCurator (ChemCC) 6. Semi-automatic Patent Curation Workflow – Step by Step 7. Lessons learned, weak-points, limitations 8. Outlook 2Negri Matthias, ChemAxon UGM 2015
  • 3. Chemistry in patents. Chemistry appears in diverse forms in patents: 1. TEXT: IUPAC names, common names, etc. 2. IMAGES: embedded within or attached to the document. 3. ATTACHMENTS (MOL/CDX). 4. TABLES: either as ONE image file (tables with chemistry and bioactivity data) or as chemistry-only image files embedded within table tags. 5. Markush structures/formulas with R-groups. Currently NO commercial solution covers all these cases. Most of the cases are covered by the patent curation workflow (Markush/R-group formulas are recognized and stored separately).
  • 4. Why do we need a patent curation workflow? Motivations: 1. Linked chemistry retrieval from patents (+ chemistry as images). 2. IUPAC-enriched XML patent files as a NEW source for text-mining. 3. Extraction of bioactivity data/targets/diseases/… in relation to chemistry. 4. Similarity/substructure frequency in compound sets of patents. 5. …
  • 5. Semi-automatic Patent Curation Workflow. Overview, current state: 2 parallel branches starting from the INPUT. KNIME branch: SLOWER & memory intensive, BUT higher quality, more control & IUPAC-enriched XML. ChemCC branch: FASTER, but LESS informative/flexible; ChemCC as the (near) future perspective. I2E API in KNIME: batch indexing, text-mining and (relational) data retrieval.
  • 6. Linked tools/technologies: 1. KNIME/XPATH. 2. ChemAxon ChemCurator (ChemCC). 3. Other ChemAxon tools in KNIME nodes (document2structure/d2s, Naming, Molconverter, Structure Checker, Standardizer, …). 4. Text/data-mining: Linguamatics I2E (+ I2E Chemistry). 5. Optical Structure Recognition: Keymodule CLiDE Batch.
  • 7. Content 1. Chemistry in patents 2. Why do we need a patent curation workflow? 3. Semi-automatic Patent Curation Workflow - Overview 4. Linked tools/technologies 5. ChemCurator (ChemCC) 6. Semi-automatic Patent Curation Workflow – Step by Step 7. Lessons learned, weak-points, limitations 8. Outlook
  • 8. ChemCurator (ChemCC): computer-aided chemical data extraction. English, Chinese and Japanese N2S; Markush Editor; Structure Checker; hit visualization; third-party OSR technologies. Árpád Figyelmesi, ChemAxon UGM 2015
  • 9. ChemCurator (ChemCC): Name to Structure. Support for many nomenclatures (common names, drug names, …); IUPAC names; custom dictionaries. English (2008), Chinese (2013), Japanese (2014).
  • 10. ChemCurator (ChemCC): Compound Extraction View. Panels shown: compound list, project explorer, annotated document, selected structures.
  • 11. ChemCurator (ChemCC): Markush Extraction View. Panels shown: Markush editor, example structures, annotated document, project explorer, selected structures, structure checker.
  • 12. ChemCurator (ChemCC): General Document Curation. Extract Markush structures from patents. Extract specific structures from journal articles, company reports and patent examples. Structure extraction wizards: exclude fragments, chemical elements, etc.
  • 13. ChemCurator (ChemCC): Integration & Information Sharing. Other ChemAxon products: direct IJC schema connection; project sharing function; accessible from Plexus, IJC, etc. Third-party tools: standard file formats; export functions; easily processable projects.
  • 14. Content 1. Chemistry in patents 2. Why do we need a patent curation workflow? 3. Semi-automatic Patent Curation Workflow - Overview 4. Linked tools/technologies 5. ChemCurator (ChemCC) 6. Semi-automatic Patent Curation Workflow – Step by Step 7. Lessons learned, weak-points, limitations 8. Outlook
  • 15. Semi-automatic Patent Curation Workflow: a) input sources and b) bibliographic data. a) Input sources: files with lists of patent IDs; XML collections; … b) Retrieval of bibliographic information and attachment data: family ID, patent references, expiration date, etc.; attachment files MOL/CDX (US patents only), TIF files; …
  • 16. Semi-automatic Patent Curation Workflow: c) chemistry retrieval/extraction/filtering. 1. ChemCurator branch: data retrieval (XML, attachments) from the IFI Claims Direct BI-server; ChemCurator project creation/sharing/annotation → HTML output; chemistry extraction with name2structure/document2structure → SDF output; generation of a pre-annotated patent set stored as ChemCC projects. Faster, but lower quality in the chemistry extraction process.
  • 17. Semi-automatic Patent Curation Workflow: c) chemistry retrieval/extraction/filtering. 2. KNIME branch: clean-up of OCR errors in KNIME → improved chemistry recognition; MOL/CDX/TIF; Standardizer, Structure Checker; filtering of formulas, solvents, R-groups. Higher quality and more control in the chemistry extraction process.
  • 18. Semi-automatic Patent Curation Workflow: c) chemistry retrieval/extraction/filtering. 2. KNIME branch: MOL → IUPAC; CDX → IUPAC; TIFF (via CLiDE) → IUPAC.
  • 19. Semi-automatic Patent Curation Workflow: c) chemistry retrieval/extraction/filtering. 2. KNIME: chemistry “normalization”. Merging and comparison of the converted chemistry output of MOL/CDX/TIF, with 2 “quality” checks on the IUPAC names: string length (the output order of chemicals differs in multi-molecule image/multi-MOL files) and OCR correction (“dictionary” based). Within KNIME, a relation is set up between each TIFF/attachment file and 1. (one or more) IUPAC name(s), 2. a position/section in the text/document. Workflow steps: merge IUPAC; clean up IUPAC; if NO IUPAC name is available, the image name is set; “normalize” IUPAC names.
  • 20. Semi-automatic Patent Curation Workflow: d) TIF/attachment replacement with IUPAC names. Chemistry present as text is recognized and extracted either via text-mining (I2E Chemistry, with d2s working behind the scenes) or within KNIME/ChemCC using annotate/molconvert. Replacement: <chemistry> vs IUPAC → IUPAC-enriched XML.
  • 21. Semi-automatic Patent Curation Workflow: d) TIF/attachment replacement with IUPAC names. OCR errors in chemical names. TIF, CDX, MOL: this text chunk is replaced by the IUPAC name.
  • 22. Semi-automatic Patent Curation Workflow: e) Bioactivity/tabular data extraction with KNIME/XPATH. XPATH/XML parsing and extraction of: tables; rows (XML tags & strings); entries (XML tags & strings).
  • 23. Semi-automatic Patent Curation Workflow: f) Text-/data-mining with Linguamatics I2E via KNIME. IUPAC-enriched XML as source for I2E API/text-mining: indexing; pre-defined queries; results retrieval → saved as SDF files (KNIME). (Chemistry-related) information retrieved by text-mining: Example Nr.; bioactivity data from tables; claims and regions where chemistry appears in patents; genes, diseases.
  • 24. Semi-automatic Patent Curation Workflow: f) Bioactivity data using I2E multi-queries, 2 steps. Source: (IUPAC-enriched) XML. 1. Example Nr. – IUPAC. 2. Example Nr. – bioactivity data. Result columns: IUPAC, Bioactivity, Example Nr. Figure: table and image views, with the chemistry in the PDF shown for comparison.
  • 25. Semi-automatic Patent Curation Workflow: g) Visualize data-/text-mining results in ChemCC. SDF file imported into a ChemCC project + automatic mapping to existing chemistry.
  • 26. Lessons learned, weak points, limitations. 1. Advantages of KNIME Full-Mode (MOL/CDX/TIF) vs the ChemCC branch: chemistry check/normalization from 3 input sources → improved quality; improved chemistry recall, covering ALL images (incl. tables and drawings); more filtering options in the KNIME workflow vs ChemCurator only; IUPAC-enriched XML as a new source for I2E. Advantages of ChemCC vs KNIME Full-Mode (MOL/CDX/TIF): faster; image processing using CLiDE is already incorporated with naming.
  • 27. Lessons learned, weak points, limitations. 2. No full automation of the workflow, due to the lack of homogeneity in patent data (US vs WO, EP, etc.): missing attachment files; no tables present in XML; error rate in chemistry recognition (OPSIN vs n2s/d2s); … NEEDS: different workflows/branches, patent-file clean-up (OCR). 3. Time- and computational-resource-consuming process.
  • 28. Outlook. 1. KNIME workflow: add new data fields to chemicals (BI-internal codes, genes, targets, etc.); use ChemCC HTML output as a source for text-mining; ontology mapping; expand the workflow to include other sources (internal PDFs, literature full text); use KNIME to interconnect with BI-internal workflows, databases, etc.; chemistry-linked information in a patent DB → improved (semantic) search.
  • 29. Outlook. 2. ChemCurator: improved n2s; new command-line functions; complex-phrase requests from the IFI server; improved SDF import; preprocessing wizards.
  • 30. Thank You!
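
Slide 19 mentions two quality checks when merging the names converted from MOL, CDX and TIF files: a string-length comparison (because multi-molecule files can list chemicals in a different order) and a dictionary-based OCR correction. The slides do not show how either check is implemented, so the following is only a minimal Python sketch under assumptions: the `OCR_FIXES` dictionary, the function names and the sample names are all hypothetical.

```python
# Hypothetical dictionary of OCR confusions in chemical names (assumed, not from the slides).
OCR_FIXES = {"rnethyl": "methyl", "arnino": "amino", "pheny1": "phenyl"}


def clean_name(name):
    """Apply the dictionary-based OCR corrections to one chemical name."""
    for wrong, right in OCR_FIXES.items():
        name = name.replace(wrong, right)
    return name.strip()


def same_chemistry(names_a, names_b):
    """Compare the name lists of two sources, ignoring the output order."""
    a = [clean_name(n) for n in names_a]
    b = [clean_name(n) for n in names_b]
    # Cheap pre-check: the total string length does not depend on the order.
    if sum(map(len, a)) != sum(map(len, b)):
        return False
    return sorted(a) == sorted(b)


# Made-up names as they might come out of a MOL and a TIF conversion.
from_mol = ["4-arninophenol", "benzoic acid"]
from_tif = ["benzoic acid", "4-aminophenol "]
print(same_chemistry(from_mol, from_tif))
```

The length check only rules out obvious mismatches quickly; the sorted comparison is what decides whether two sources agree.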
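
Slides 19 to 21 describe replacing each image/attachment reference in the patent XML with its IUPAC name, falling back to the image name when no IUPAC name is available. The sketch below illustrates that idea with Python's standard `xml.etree.ElementTree`. The `<chemistry>`/`<img file="…">` layout, the file name and the lookup table are assumptions made for illustration; the real workflow runs in KNIME and its details are not given in the slides.

```python
import xml.etree.ElementTree as ET

# Assumed XML layout: an image reference wrapped in a <chemistry> element.
SAMPLE_XML = (
    '<p>Compound <chemistry id="CHEM-1"><img file="C00001.TIF"/></chemistry>'
    " was tested.</p>"
)

# Hypothetical lookup produced by the normalization step: file name -> IUPAC name.
IUPAC_BY_FILE = {"C00001.TIF": "2-acetyloxybenzoic acid"}


def enrich_with_iupac(xml_text, names_by_file):
    """Replace each image reference inside <chemistry> with a chemical name."""
    root = ET.fromstring(xml_text)
    # list() so the tree is not modified while it is being iterated.
    for chem in list(root.iter("chemistry")):
        img = chem.find("img")
        if img is None:
            continue
        file_name = img.get("file")
        chem.remove(img)
        # If no IUPAC name is known, the image name is kept instead.
        chem.text = names_by_file.get(file_name, file_name)
    return ET.tostring(root, encoding="unicode")


enriched = enrich_with_iupac(SAMPLE_XML, IUPAC_BY_FILE)
print(enriched)
```

Keeping the `<chemistry>` element and its `id` means the enriched text can still be traced back to the original attachment.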
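
Slide 22 describes XPATH/XML parsing to extract tables, rows and entries for the bioactivity data. As a rough illustration outside KNIME, the same extraction can be sketched with the limited XPath support in Python's standard library. The tag names (`table`, `row`, `entry`) and the sample values are assumptions; actual patent XML differs between patent offices.

```python
import xml.etree.ElementTree as ET

# Hypothetical patent fragment; tag names and values are made up for illustration.
SAMPLE_TABLE_XML = """
<description>
  <table id="TABLE-1">
    <row><entry>Example Nr.</entry><entry>IC50 [nM]</entry></row>
    <row><entry>1</entry><entry>12</entry></row>
    <row><entry>2</entry><entry>340</entry></row>
  </table>
</description>
"""


def extract_tables(xml_text):
    """Return {table id: rows}, where each row is a list of entry strings."""
    root = ET.fromstring(xml_text)
    tables = {}
    for table in root.findall(".//table"):
        rows = []
        for row in table.findall(".//row"):
            # itertext() also collects text nested in child tags of an entry.
            rows.append(
                ["".join(entry.itertext()).strip() for entry in row.findall("entry")]
            )
        tables[table.get("id")] = rows
    return tables


tables = extract_tables(SAMPLE_TABLE_XML)
for row in tables["TABLE-1"]:
    print(row)
```

Each row keeps its entries in document order, so an "Example Nr." column can later be joined to the names found by text-mining.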