SlideShare a Scribd company logo
1 of 14
Download to read offline
Expanding the scope of “literature data”
with document to structure tools
“PatentInformatics” applications at Aptuit
Alfonso Pozzan
Computational and Analytical Chemistry
Drug Design and Discovery Department
Aptuit Verona (Italy)
Chemaxon 2014 User Group Meeting
Budapest 20th-21st May 2014
2
Major Site Facilities
Drug Design & Discovery, Preclinical Bioscience, API
Development & Manufacture, Solid State Chemistry
Solid & Oral Dosage Form. Dev., Physical & Analytical Chemistry
Clinical Sciences
Verona
Background
• In the last decades the number of freely/easily available information
relating to the biological data of “drug like” molecules is consistently
increased. In the past, the majority of published small-molecule bioactivity
data were exclusively available via commercial products (SciFinder,
CrossFire, AureusScience, ...).
– Small companies, and academic institutions can now extend the scope
of knowledge based drug discovery thanks to these initiatives
• ChEMBL
• DrugBank
• BindingDB
• PubChem
• ChemSpider
• Like in the case of ChEMBL, the core data are manually extracted from the
full text of peer-reviewed scientific publications in a variety of journals, such
as Journal of Medicinal Chemistry, Bioorganic Medicinal Chemistry Letters
and Journal of Natural Products.
• For patents there is now the possibility to search +20M structures using the
SureChem Open application
3
Chemaxon Document 2 Structure
4
Case study
• Orexin1 Orexin2 antagonists recent patents (23, 4 in Japanese*)
– WO2013182972A, US2012165331, WO2013139730, WO2013127913A1, WO2013119639, WO2013092893A1,
WO13092893, WO2013068935, WO2010048012, WO2010048016A, WO2008147518A1, WO2010048013A1,
WO2004026866A1, WO2013050938, WO2013059163, WO2013062858, WO2013062857, WO2013059222,
WO2011006960A1, WO2012039371A1*, WO2012081692A1*, WO2012153729A1*, WO2013123240A1*
• Structure extraction made using Chemaxon molconvert (6.2.1) and Igor
Filippov OSRA (2.0.0) all running on a IBM X3950 Linux (RH) server
– molconvert mrv patent.pdf –o patent.mrv
• Initial raw “structures” where ~30k (not uniques check)
– Using a simple MW<600, NumHeavy > 6, ClogP= “exist” gives ~13k cps (not uniq check)
5
* Japanese Patents were translated in English using the Google patents translation capabilities
Removing not relevant structures…
• Automatic structure extraction from a batch of documents (patents) lead to
a number of structures that are not relevant and/or some artifacts…
• Running Markush based Sub Structure Searches (SSS) based on each patent
would be accurate but very time consuming
– We would like to have in place a sustainable “unsupervised” process to
repeat for each target of interest.
• However simple filters can be effectively used
– A) Remove “artifacts” with MW= 0
• A typical artifacts would be the word “halogen” , “alkyl”, “aryl”, … usually those have
MW= 0 and can be easily removed (~3k entries)
– B) Remove molecules containing atoms “either than” {C,N,O,P,S,F,Cl,Br,I}
• [!$([#6,#7,#8,#15,#16,#9,#17,#35,#53])] either = 1 (~6k entries)
– C) Remove molecules that does not contain any carbon atom
• [$[(#6)] Carbon = 0 [~12k entries]
• Combining A or B or C results in 18k entries removed from the initial list
6
Removing not relevant structures…
7
[Log]
Color coded by Molecular
Complexity as defined
by number of “unique”
linear fragments ( 1 to 7
atoms in length) of each
molecule*
Removing not relevant structures…
• The following cut-offs were added:
– Fragments for each molecule >= 130
• This filters helps in removing lower complexity symmetric reagents
– Num Heavy >= 22
– NumDuplicates <= 3
• This leaved us with ~1200 “unique” structures (1878 non
unique structures from the patents…)
• At this point we were left with some questions…
– Did we get/kept all the relevant structure from all patents?
– What is the level of recognition/conversion at each
document level?
8
Retrospective analysis
9
All the initial patents
(23) were represented
at the end of the
analysis. There is a big
variability in the
number of structures
extracted from each
patent, this led us in
having a more
detailed analysis at
single patent level.
The figure represent the number of structures originating from
each patent that reached the end of the conversion and filtering process
Some examples of recognized and unrecognized
structures from the same patent (WO2004026866A1)
10
This structure was
not recognized
This structure was
recognized and
converted
This seems to be more an OCR issue than a
text2structure one…
From D2S to “PatentInformatics”
11
Stereocentre expansion
Pka Fix
ConfAnalysis
Patents 2D
Patents 3D
Shape
Similarity
Pharmacophore
screening
This is a very simple patent mining platform were 3D structures derived
from the Orexin patents example have been annotated with various
descriptors including the RMS fit with the OX1 and OX2 pharmacophores.
Flipper, fixPka, Omega2 and ROCS (from OpenEye Software were used to
prepare, and generate 3D conformations as well to perform Shape
Similarities). MOE was used for the ph4 analysis and screening
Pseudo automatic Patents Data Extraction example:
Structure and IC50 extraction (355 datapoints)
reported in patent WO2013050938A1
12
The following example highlight some
useful application of data extraction from
patents. In this case D2S is needed but not
sufficient as combination of pdf OCR and
Perl scripting/parsing was needed
The Overall Process
13
Conclusions
PROS
• Chemaxon Documents 2 Structure and Filippov’s OSRA combined with an
appropriate post processing workflow can effectively increase the knowledge on
a particular target of interest.
– Sometime this could touch areas not covered from commercially available
databases (particular papers, patents, papers in different languages)
• The overall process allow to “curate” the level of information extraction
– Structures “active” on a specific target (higher automation)
– Structures and associated curve data (lower automation)
• Need to have in place an effective post processing, in this case large amount of
patents can be mined.
• High level of automation can be achieved
CONS
• The process presented here it is still far from being perfect
– great variability from patent to patent
• Extracting biological data and associate them to structures can still take some
time (not fully automated)
14

More Related Content

Similar to EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data” with document to structure tools

Legal analysis of source code
Legal analysis of source codeLegal analysis of source code
Legal analysis of source code
Robert Viseur
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
Srinath Perera
 

Similar to EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data” with document to structure tools (20)

Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Legal analysis of source code
Legal analysis of source codeLegal analysis of source code
Legal analysis of source code
 
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
Brains, Data, and Machine Intelligence (2014 04 14 London Meetup)
 
OpenChain Webinar #58 - FOSS License Management through aliens4friends in Ecl...
OpenChain Webinar #58 - FOSS License Management through aliens4friends in Ecl...OpenChain Webinar #58 - FOSS License Management through aliens4friends in Ecl...
OpenChain Webinar #58 - FOSS License Management through aliens4friends in Ecl...
 
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
OSLC KM (Knowledge Management): elevating the meaning of data and operations ...
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
Norman and McCraken, "OpenURL Implementation: Link Resolution That Users Will...
Norman and McCraken, "OpenURL Implementation: Link Resolution That Users Will...Norman and McCraken, "OpenURL Implementation: Link Resolution That Users Will...
Norman and McCraken, "OpenURL Implementation: Link Resolution That Users Will...
 
1005 cern-active mq-v2
1005 cern-active mq-v21005 cern-active mq-v2
1005 cern-active mq-v2
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
Lab6 rtos
Lab6 rtosLab6 rtos
Lab6 rtos
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Exokernel Operating System (1)
Exokernel Operating System (1)Exokernel Operating System (1)
Exokernel Operating System (1)
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
 
USUGM 2014 - Erin Bolstad (ChemAxon): Consultancy report - New capabilities a...
USUGM 2014 - Erin Bolstad (ChemAxon): Consultancy report - New capabilities a...USUGM 2014 - Erin Bolstad (ChemAxon): Consultancy report - New capabilities a...
USUGM 2014 - Erin Bolstad (ChemAxon): Consultancy report - New capabilities a...
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
 
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming AnalyticsDEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
DEBS 2015 Tutorial : Patterns for Realtime Streaming Analytics
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Digitizing documents to provide a public spectroscopy database
Digitizing documents to provide a public spectroscopy databaseDigitizing documents to provide a public spectroscopy database
Digitizing documents to provide a public spectroscopy database
 
And Then There Are Algorithms
And Then There Are AlgorithmsAnd Then There Are Algorithms
And Then There Are Algorithms
 
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...Im symposium presentation -  OCR and Text analytics for Medical Chart Review ...
Im symposium presentation - OCR and Text analytics for Medical Chart Review ...
 

More from ChemAxon

Translating data to predictive models
Translating data to predictive modelsTranslating data to predictive models
Translating data to predictive models
ChemAxon
 

More from ChemAxon (20)

Akos Tarcsay (ChemAxon): How fast is Chemaxon RDBMS Search?
Akos Tarcsay (ChemAxon): How fast is Chemaxon RDBMS Search?Akos Tarcsay (ChemAxon): How fast is Chemaxon RDBMS Search?
Akos Tarcsay (ChemAxon): How fast is Chemaxon RDBMS Search?
 
Chemaxon EU UGM 2022 | Translating data to predictive models
Chemaxon EU UGM 2022 | Translating data to predictive modelsChemaxon EU UGM 2022 | Translating data to predictive models
Chemaxon EU UGM 2022 | Translating data to predictive models
 
Translating data to predictive models
Translating data to predictive modelsTranslating data to predictive models
Translating data to predictive models
 
Efficient biomolecular structural data handling and analysis - Webinar with D...
Efficient biomolecular structural data handling and analysis - Webinar with D...Efficient biomolecular structural data handling and analysis - Webinar with D...
Efficient biomolecular structural data handling and analysis - Webinar with D...
 
Biomolecule structural data management
Biomolecule structural data managementBiomolecule structural data management
Biomolecule structural data management
 
Cheminfo Stories 2021 | Virtual UGM | Marvin Pro: The first release
Cheminfo Stories 2021 | Virtual UGM | Marvin Pro: The first releaseCheminfo Stories 2021 | Virtual UGM | Marvin Pro: The first release
Cheminfo Stories 2021 | Virtual UGM | Marvin Pro: The first release
 
Enhanced stereochemistry representation
Enhanced stereochemistry representation Enhanced stereochemistry representation
Enhanced stereochemistry representation
 
Intellectual property (IP) intelligence solutions designed for the way resear...
Intellectual property (IP) intelligence solutions designed for the way resear...Intellectual property (IP) intelligence solutions designed for the way resear...
Intellectual property (IP) intelligence solutions designed for the way resear...
 
GPS for Chemical Space - Digital Assistants to Support Molecule Design - Chem...
GPS for Chemical Space - Digital Assistants to Support Molecule Design - Chem...GPS for Chemical Space - Digital Assistants to Support Molecule Design - Chem...
GPS for Chemical Space - Digital Assistants to Support Molecule Design - Chem...
 
Patent Data for Artificial Intelligence based Drug Discovery
Patent Data for Artificial Intelligence based Drug DiscoveryPatent Data for Artificial Intelligence based Drug Discovery
Patent Data for Artificial Intelligence based Drug Discovery
 
Cheminfo Stories APAC 2020 - Chemical Descriptors & Standardizers for Machine...
Cheminfo Stories APAC 2020 - Chemical Descriptors & Standardizers for Machine...Cheminfo Stories APAC 2020 - Chemical Descriptors & Standardizers for Machine...
Cheminfo Stories APAC 2020 - Chemical Descriptors & Standardizers for Machine...
 
Research data management on the cloud
Research data management on the cloudResearch data management on the cloud
Research data management on the cloud
 
Cheminfo Stories APAC 2020 - Introducing Design Hub & Compound Registration
Cheminfo Stories APAC 2020 - Introducing Design Hub & Compound RegistrationCheminfo Stories APAC 2020 - Introducing Design Hub & Compound Registration
Cheminfo Stories APAC 2020 - Introducing Design Hub & Compound Registration
 
Cheminfo Stories APAC 2020 - JChem Engines introduction
Cheminfo Stories APAC 2020 - JChem Engines introduction Cheminfo Stories APAC 2020 - JChem Engines introduction
Cheminfo Stories APAC 2020 - JChem Engines introduction
 
Cheminfo Stories APAC 2020 - Database management on desktop with JChem for Of...
Cheminfo Stories APAC 2020 - Database management on desktop with JChem for Of...Cheminfo Stories APAC 2020 - Database management on desktop with JChem for Of...
Cheminfo Stories APAC 2020 - Database management on desktop with JChem for Of...
 
Cheminfo Stories APAC 2020 -- Markush technology
Cheminfo Stories APAC 2020 -- Markush technology Cheminfo Stories APAC 2020 -- Markush technology
Cheminfo Stories APAC 2020 -- Markush technology
 
JChem Microservices
JChem MicroservicesJChem Microservices
JChem Microservices
 
Migration from joc to jpc or choral
Migration from joc to jpc or choralMigration from joc to jpc or choral
Migration from joc to jpc or choral
 
ChemAxon's Compliance Checker - Cheminfo Stories 2020 Day 5
ChemAxon's Compliance Checker - Cheminfo Stories 2020 Day 5ChemAxon's Compliance Checker - Cheminfo Stories 2020 Day 5
ChemAxon's Compliance Checker - Cheminfo Stories 2020 Day 5
 
Chemicalize Pro - Cheminfo Stories 2020 Day 5
Chemicalize Pro - Cheminfo Stories 2020 Day 5Chemicalize Pro - Cheminfo Stories 2020 Day 5
Chemicalize Pro - Cheminfo Stories 2020 Day 5
 

Recently uploaded

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 

EUGM 2014 - Alfonso Pozzan (Aptuit): Expanding the scope of “literature data” with document to structure tools

  • 1. Expanding the scope of “literature data” with document to structure tools “PatentInformatics” applications at Aptuit Alfonso Pozzan Computational and Analytical Chemistry Drug Design and Discovery Department Aptuit Verona (Italy) Chemaxon 2014 User Group Meeting Budapest 20th-21st May 2014
  • 2. 2 Major Site Facilities Drug Design & Discovery, Preclinical Bioscience, API Development & Manufacture, Solid State Chemistry Solid & Oral Dosage Form. Dev., Physical & Analytical Chemistry Clinical Sciences Verona
  • 3. Background • In the last decades the number of freely/easily available information relating to the biological data of “drug like” molecules is consistently increased. In the past, the majority of published small-molecule bioactivity data were exclusively available via commercial products (SciFinder, CrossFire, AureusScience, ...). – Small companies, and academic institutions can now extend the scope of knowledge based drug discovery thanks to these initiatives • ChEMBL • DrugBank • BindingDB • PubChem • ChemSpider • Like in the case of ChEMBL, the core data are manually extracted from the full text of peer-reviewed scientific publications in a variety of journals, such as Journal of Medicinal Chemistry, Bioorganic Medicinal Chemistry Letters and Journal of Natural Products. • For patents there is now the possibility to search +20M structures using the SureChem Open application 3
  • 4. Chemaxon Document 2 Structure 4
  • 5. Case study • Orexin1 Orexin2 antagonists recent patents (23, 4 in Japanese*) – WO2013182972A, US2012165331, WO2013139730, WO2013127913A1, WO2013119639, WO2013092893A1, WO13092893, WO2013068935, WO2010048012, WO2010048016A, WO2008147518A1, WO2010048013A1, WO2004026866A1, WO2013050938, WO2013059163, WO2013062858, WO2013062857, WO2013059222, WO2011006960A1, WO2012039371A1*, WO2012081692A1*, WO2012153729A1*, WO2013123240A1* • Structure extraction made using Chemaxon molconvert (6.2.1) and Igor Filippov OSRA (2.0.0) all running on a IBM X3950 Linux (RH) server – molconvert mrv patent.pdf –o patent.mrv • Initial raw “structures” where ~30k (not uniques check) – Using a simple MW<600, NumHeavy > 6, ClogP= “exist” gives ~13k cps (not uniq check) 5 * Japanese Patents were translated in English using the Google patents translation capabilities
  • 6. Removing not relevant structures… • Automatic structure extraction from a batch of documents (patents) lead to a number of structures that are not relevant and/or some artifacts… • Running Markush based Sub Structure Searches (SSS) based on each patent would be accurate but very time consuming – We would like to have in place a sustainable “unsupervised” process to repeat for each target of interest. • However simple filters can be effectively used – A) Remove “artifacts” with MW= 0 • A typical artifacts would be the word “halogen” , “alkyl”, “aryl”, … usually those have MW= 0 and can be easily removed (~3k entries) – B) Remove molecules containing atoms “either than” {C,N,O,P,S,F,Cl,Br,I} • [!$([#6,#7,#8,#15,#16,#9,#17,#35,#53])] either = 1 (~6k entries) – C) Remove molecules that does not contain any carbon atom • [$[(#6)] Carbon = 0 [~12k entries] • Combining A or B or C results in 18k entries removed from the initial list 6
  • 7. Removing not relevant structures… 7 [Log] Color coded by Molecular Complexity as defined by number of “unique” linear fragments ( 1 to 7 atoms in length) of each molecule*
  • 8. Removing not relevant structures… • The following cut-offs were added: – Fragments for each molecule >= 130 • This filters helps in removing lower complexity symmetric reagents – Num Heavy >= 22 – NumDuplicates <= 3 • This leaved us with ~1200 “unique” structures (1878 non unique structures from the patents…) • At this point we were left with some questions… – Did we get/kept all the relevant structure from all patents? – What is the level of recognition/conversion at each document level? 8
  • 9. Retrospective analysis 9 All the initial patents (23) were represented at the end of the analysis. There is a big variability in the number of structures extracted from each patent, this led us in having a more detailed analysis at single patent level. The figure represent the number of structures originating from each patent that reached the end of the conversion and filtering process
  • 10. Some examples of recognized and unrecognized structures from the same patent (WO2004026866A1) 10 This structure was not recognized This structure was recognized and converted This seems to be more an OCR issue than a text2structure one…
  • 11. From D2S to “PatentInformatics” 11 Stereocentre expansion Pka Fix ConfAnalysis Patents 2D Patents 3D Shape Similarity Pharmacophore screening This is a very simple patent mining platform were 3D structures derived from the Orexin patents example have been annotated with various descriptors including the RMS fit with the OX1 and OX2 pharmacophores. Flipper, fixPka, Omega2 and ROCS (from OpenEye Software were used to prepare, and generate 3D conformations as well to perform Shape Similarities). MOE was used for the ph4 analysis and screening
  • 12. Pseudo automatic Patents Data Extraction example: Structure and IC50 extraction (355 datapoints) reported in patent WO2013050938A1 12 The following example highlight some useful application of data extraction from patents. In this case D2S is needed but not sufficient as combination of pdf OCR and Perl scripting/parsing was needed
  • 14. Conclusions PROS • Chemaxon Documents 2 Structure and Filippov’s OSRA combined with an appropriate post processing workflow can effectively increase the knowledge on a particular target of interest. – Sometime this could touch areas not covered from commercially available databases (particular papers, patents, papers in different languages) • The overall process allow to “curate” the level of information extraction – Structures “active” on a specific target (higher automation) – Structures and associated curve data (lower automation) • Need to have in place an effective post processing, in this case large amount of patents can be mined. • High level of automation can be achieved CONS • The process presented here it is still far from being perfect – great variability from patent to patent • Extracting biological data and associate them to structures can still take some time (not fully automated) 14