252nd ACS National Meeting, Philadelphia PA, USA 25th August 2016
Sketchy sketches: Hiding chemistry
in plain sight
Daniel Lowe, John May and Roger Sayle
NextMove Software
Cambridge, UK
Overview
• Motivation for mining sketches
• Tricky cases when interpreting sketches
• Combining text-mining with sketch
interpretation
Motivation
• The chemical matter discussed in a document
is often critical in determining if it is relevant
• Chemical sketches are not indexed by text-
mining
• If chemical sketches can be made “chemistry
searchable” this helps with:
– Identifying relevant documents
– Prior-art searching
What input should be used?
• Image-to-structure tools
(OSRA/Clide/Imago etc.) work with images
– Introduces OCR errors on atom labels
– Crossing bonds present difficulties
– Often find spurious chemistry in non-chemical images
• Where the sketch is available in a “computer-readable”
format, can these issues be avoided?
Sources of ChemDraw sketches
• United States patents (2001-present)
– Over 24 million ChemDraw files!
• Journal articles (albeit in most cases not
publicly accessible)
• Theses (albeit only if the original manuscript is
made available)
Ambiguous symbols
Symbol   Naïve interpretation   Possible meaning
Ac       Actinium               acetyl
Ar       Argon                  aryl
B        Boron                  Generic label
D        Deuterium              Generic label
P        Phosphorus             Generic label
Ra       Radium                 Generic label
Rb       Rubidium               Generic label
V        Vanadium               Generic label
W        Tungsten               Generic label
Y        Yttrium                Generic label
Ambiguous symbols (cont.)
• Can disambiguate with text-mining:
– E.g. “B is aryl or heteroaryl”, “B is boron”
• Can disambiguate by connectivity, e.g. is an
yttrium atom with one bond likely?
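The two disambiguation routes above can be sketched as a small lookup. This is a minimal illustration, not the talk's implementation: the `AMBIGUOUS_SYMBOLS` table, the `interpret_label` function and the "plausible neighbour count" values are all assumptions made for the example.

```python
# Symbol -> (element reading, alternative reading, neighbour counts at which
# the element reading is kept). The symbols and readings come from the slide;
# the neighbour counts are illustrative assumptions only.
AMBIGUOUS_SYMBOLS = {
    "Ac": ("actinium", "acetyl", {0}),
    "Ar": ("argon", "aryl", {0}),
    "B": ("boron", "generic label", {3, 4}),
    "D": ("deuterium", "generic label", {1}),
    "P": ("phosphorus", "generic label", {3, 4, 5}),
    "Y": ("yttrium", "generic label", {0}),
}


def interpret_label(symbol, neighbour_count, text_definitions=None):
    """Pick a reading for an atom label drawn in a sketch.

    text_definitions is a hypothetical dict of definitions mined from the
    document text, e.g. {"B": "aryl or heteroaryl"}; it takes precedence.
    """
    if text_definitions and symbol in text_definitions:
        return text_definitions[symbol]
    if symbol not in AMBIGUOUS_SYMBOLS:
        return symbol  # not an ambiguous symbol: leave untouched
    element, alternative, plausible_counts = AMBIGUOUS_SYMBOLS[symbol]
    # Fall back to connectivity: keep the element only if the bonding is plausible.
    return element if neighbour_count in plausible_counts else alternative


print(interpret_label("Y", 1))
print(interpret_label("B", 1, {"B": "aryl or heteroaryl"}))
```

Text evidence is checked first here because an explicit definition such as “B is boron” is stronger than a valence heuristic; a real system might weigh the two differently.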
Attachment point representation
(Below: naïve interpretation)
tert-butyl
methyl
tert-butyl
methyl
Implicit Attachment point
representation
Unlabelled
methyl
Under-valent
atom
Sketch parser needs to be given a hint that
the sketch is a substituent definition!
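One way to picture the “under-valent atom” case is a valence check that only runs when the parser has been told the sketch is a substituent definition. The sketch below is an assumption-laden illustration: `SketchAtom`, `DEFAULT_VALENCE`, `open_valences` and `find_attachment_points` are hypothetical names, and the valence table is deliberately tiny.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default valences for a few common elements (illustrative only).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


@dataclass
class SketchAtom:
    element: str
    bond_order_sum: int                       # sum of drawn bond orders
    explicit_hydrogens: Optional[int] = None  # None: no hydrogens written on the label


def open_valences(atom):
    """Number of valences left unfilled by drawn bonds and written hydrogens."""
    if atom.explicit_hydrogens is None or atom.element not in DEFAULT_VALENCE:
        return 0  # hydrogens are implicit, so nothing can be inferred
    used = atom.bond_order_sum + atom.explicit_hydrogens
    return max(DEFAULT_VALENCE[atom.element] - used, 0)


def find_attachment_points(atoms, is_substituent_definition):
    """Indices of atoms treated as attachment points, given the hint."""
    if not is_substituent_definition:
        return []  # without the hint, keep the naive interpretation
    return [i for i, atom in enumerate(atoms) if open_valences(atom) > 0]


# A label written as "CH2" with one single bond, next to an unlabelled carbon.
atoms = [SketchAtom("C", 1, 2), SketchAtom("C", 1)]
print(find_attachment_points(atoms, is_substituent_definition=True))
```

The unlabelled-methyl case cannot be detected this way, since nothing in the connection table distinguishes it from a real methyl; that is why the external hint is needed.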
Formula Interpretation
Input ChemDraw 15 This work
HATU
C4F9
H3PO4
CON(cHex)2 No result
III-2 No result
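To see why some of these inputs are harder than others, consider a deliberately naive reader of flat formulas. It is only a sketch of the first step (element counting), not the interpreter described in the talk; `KNOWN_ELEMENTS`, `TOKEN` and `parse_flat_formula` are names invented for the example, and the element set is intentionally small.

```python
import re

# A small, assumed element set; a real parser would cover the periodic table.
KNOWN_ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_flat_formula(text):
    """Return element counts for a flat formula, or None if it is not one.

    Brackets, abbreviations such as cHex, and labels such as III-2 are
    rejected rather than guessed at.
    """
    if not text:
        return None
    counts = {}
    position = 0
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None or match.group(1) not in KNOWN_ELEMENTS:
            return None
        symbol, digits = match.groups()
        counts[symbol] = counts.get(symbol, 0) + int(digits or 1)
        position = match.end()
    return counts


for entry in ["HATU", "C4F9", "H3PO4", "CON(cHex)2", "III-2"]:
    print(entry, parse_flat_formula(entry))
```

Under these assumptions only C4F9 and H3PO4 parse; HATU would need an abbreviation dictionary, CON(cHex)2 needs bracket and group handling, and III-2 superficially looks like iodine atoms but is rejected by the hyphen.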
Categorisation
1) Sketch Type
Molecule
Reaction
Substituent
No connection table
2) Detail
Specific
Generic
Unknown
3) Confidence in interpretation
High
Medium
Low
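The three-part category can be modelled as a small record. This is one possible data structure, assumed for illustration; the class and method names are not from the talk, and the choice to leave detail and confidence empty when they do not apply follows the “Molecule/Unknown” and “No Connection Table” labels used on the example slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SketchType(Enum):
    MOLECULE = "Molecule"
    REACTION = "Reaction"
    SUBSTITUENT = "Substituent"
    NO_CONNECTION_TABLE = "No Connection Table"


class Detail(Enum):
    SPECIFIC = "Specific"
    GENERIC = "Generic"
    UNKNOWN = "Unknown"


class Confidence(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass(frozen=True)
class SketchCategory:
    sketch_type: SketchType
    detail: Optional[Detail] = None          # absent when there is no connection table
    confidence: Optional[Confidence] = None  # absent when the detail is unknown

    def label(self):
        """Render the category in the slash-separated form used on the slides."""
        parts = [self.sketch_type, self.detail, self.confidence]
        return "/".join(part.value for part in parts if part is not None)


print(SketchCategory(SketchType.MOLECULE, Detail.SPECIFIC, Confidence.HIGH).label())
print(SketchCategory(SketchType.MOLECULE, Detail.UNKNOWN).label())
print(SketchCategory(SketchType.NO_CONNECTION_TABLE).label())
```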
Examples of categorisation
Molecule/Specific/High Substituent/Specific/High
Examples of categorisation
Molecule/Generic/Low
Examples of categorisation
Molecule/Unknown
Formula uninterpretable so can’t know for sure
whether molecule is specific or generic!
Reaction/Specific/Medium
Two reactions extracted
Examples of categorisation
No Connection Table
Repeated group detection
Electron Localisation
Some delocalised systems don’t yield valid SMILES,
so they are converted to a localised system
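Localising a delocalised system amounts to choosing alternating double bonds so that every atom that needs one gets exactly one. A common way to frame this is as a perfect matching on the delocalised atoms; the backtracking search below is a minimal sketch of that framing, not the talk's algorithm, and `localise_double_bonds` is a name made up for the example.

```python
def localise_double_bonds(atoms, bonds):
    """Choose bonds to draw as double so each listed atom gets exactly one.

    atoms: atom indices that each need one double bond.
    bonds: iterable of (a, b) pairs; bonds to atoms outside `atoms` are ignored.
    Returns a frozenset of frozenset({a, b}) pairs, or None if no assignment exists.
    """
    atoms = frozenset(atoms)
    neighbours = {atom: set() for atom in atoms}
    for a, b in bonds:
        if a in atoms and b in atoms:
            neighbours[a].add(b)
            neighbours[b].add(a)

    def solve(unmatched, chosen):
        if not unmatched:
            return chosen
        atom = min(unmatched)  # always extend from the lowest unmatched atom
        for partner in sorted(neighbours[atom] & unmatched):
            result = solve(unmatched - {atom, partner},
                           chosen | {frozenset((atom, partner))})
            if result is not None:
                return result
        return None  # dead end: this atom cannot be given a double bond

    return solve(atoms, frozenset())


six_ring = [(i, (i + 1) % 6) for i in range(6)]
five_ring = [(i, (i + 1) % 5) for i in range(5)]
print(localise_double_bonds(range(6), six_ring))   # a benzene-like ring
print(localise_double_bonds(range(5), five_ring))  # odd ring: no solution
print(localise_double_bonds(range(4), five_ring))  # atom 4 excluded, e.g. it carries the charge
```

The five-membered example shows why some delocalised drawings fail outright: with all five atoms demanding a double bond there is no valid assignment, whereas excluding one atom makes the system localisable.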
Positional variation
Naïve export:
Association of R-groups with
ring atoms captured
Evaluation
(Dec 2015 US patent applications)
Molecule
Reaction
Substituent
Comparison with other approaches
Approach                         Not found by text-mining   Also found by text-mining
This work (parsing CDX files)    49,119                     36,829 (42.8%)
Image to structure*              49,836                     35,545 (41.6%)
ChemDraw exported Mol files*     58,169                     28,926 (33.2%)
*Results courtesy of the SureChEMBL database
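The percentages in this comparison are consistent with “also found by text-mining” being expressed as a share of each approach's total (both columns added together). That reading is an inference from the numbers, not something stated on the slide; the short check below uses the slide's counts, with `RESULTS` and `share_also_found` as illustrative names.

```python
# (not found by text-mining, also found by text-mining), as reported on the slide.
RESULTS = {
    "This work (parsing CDX files)": (49_119, 36_829),
    "Image to structure": (49_836, 35_545),
    "ChemDraw exported Mol files": (58_169, 28_926),
}


def share_also_found(not_found, also_found):
    """Percentage of an approach's structures that text-mining also found."""
    return 100 * also_found / (not_found + also_found)


for approach, (not_found, also_found) in RESULTS.items():
    print(f"{approach}: {share_also_found(not_found, also_found):.2f}%")
```

The computed shares agree with the slide's figures to within rounding.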
Exemplified compound
R-group Tables
Approach
• Sketches are extracted to extended SMILES capturing:
– R-group labels
– Positional variation
– Repeat groups
• USPTO tables precisely describe how they should be
displayed but are weak on semantics
– Heuristics used to determine which lines are the same row
– Table caption disambiguated from table column headings
– Column widths used to determine columns
– Colspans detected
• Name to structure used to interpret chemical names/formulas as
R-groups; sketches interpreted as R-groups
• Structure assembled from core and R-groups
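The final assembly step can be pictured with a toy template notation. The sketch below is not the extended SMILES used in the talk: the `{R1}` placeholder syntax, the `FRAGMENTS` dictionary (standing in for name-to-structure) and the `assemble` function are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for name-to-structure: table cell text -> SMILES fragment
# written from its attachment atom. Ring-closure digits must not clash with the core.
FRAGMENTS = {"H": "[H]", "Cl": "Cl", "OMe": "OC", "Ph": "c2ccccc2"}
PLACEHOLDER = re.compile(r"\{(R\d+)\}")


def assemble(core, row):
    """Substitute each R-group placeholder in the core with the row's fragment.

    Returns None when any cell is missing or cannot be interpreted, so the
    row is reported as unconverted rather than guessed.
    """
    fragments = {}
    for label in PLACEHOLDER.findall(core):
        value = row.get(label)
        if value not in FRAGMENTS:
            return None
        fragments[label] = FRAGMENTS[value]
    return PLACEHOLDER.sub(lambda match: fragments[match.group(1)], core)


core = "Cc1ccc({R1})cc1{R2}"  # toy core; placeholders are never first in the string
table_rows = [
    {"R1": "Cl", "R2": "OMe"},
    {"R1": "H", "R2": "Ph"},
    {"R1": "3,4-diCl", "R2": "H"},  # not interpretable by this toy dictionary
]
for row in table_rows:
    print(row, assemble(core, row))
```

Plain string substitution only works because the toy core keeps every placeholder after an atom and the fragments avoid the core's ring-closure digit; a real implementation would join connection tables instead.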
Core variation
Positional variation
Incorrect formula
Substituents defined as sketches
Current results
• 2001 to June 2016 USPTO patent applications:
– 1.96 million potential table entries detected
– 1.13 million (57.9%) converted to specific
chemical structures
– 621 thousand unique chemical structures
Novelty of results
(versus other pipelines)
Data type                             Unique compounds   Not found in text/sketches   Not found in text   Not found in sketches
Exemplified compound R-group tables   621,140            529,417 (85.2%)              541,974 (87.3%)     590,889 (95.1%)
Text                                  4,759,009          0%                           0%                  2,960,937 (62.2%)
Sketches                              4,479,113          0%                           2,681,041 (59.9%)   0%
Structural identity checks performed using StdInChI
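A novelty figure of this kind reduces to a set difference over structure identifiers. The sketch below assumes each pipeline's output is already available as a set of identifier strings (StdInChI in the talk; short placeholder strings here, not real InChIs), and `novelty` is an illustrative name.

```python
def novelty(source, *others):
    """Count and percentage of identifiers in `source` absent from all `others`."""
    seen_elsewhere = set().union(*others)
    novel = source - seen_elsewhere
    return len(novel), 100 * len(novel) / len(source)


# Placeholder identifiers standing in for StdInChI strings.
r_group_tables = {"id-A", "id-B", "id-C", "id-D"}
text = {"id-A", "id-X", "id-Y"}
sketches = {"id-B", "id-X"}

print(novelty(r_group_tables, text, sketches))  # not found in text or sketches
print(novelty(r_group_tables, text))            # not found in text
print(novelty(r_group_tables, sketches))        # not found in sketches
```

Comparing on a canonical identifier is what makes the set difference meaningful: two pipelines may emit different SMILES for the same structure, but should agree on its StdInChI.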
Heavy atom count distribution
Novelty of results
(versus PubChem)
Data type                             Unique compounds   Not in PubChem      Not in PubChem (SureChEMBL)
Exemplified compound R-group tables   621,140            496,831 (80.0%)     532,166 (85.7%)
Text                                  4,759,009          564,886 (11.9%)     911,976 (19.2%)
Sketches                              4,479,113          886,991 (19.8%)     1,179,229 (26.3%)
Structural identity checks performed using StdInChI
Current limitations
• Application of variable repeat groups
• Obscure ways of depicting attachment points
• R-groups defined in terms of other R-groups
• R-groups defined elsewhere in the document
• Positional variation R-group representing multiple groups e.g. “3,4-diCl”
• Formulas involving substituted rings e.g. “4-ClPh”
• “Formulas” that mix systematic names with formula e.g. “4-OMe-phenyl”
• Algorithmic numbering of simple ring systems (for positional variation)
• Ditto marks
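To make the “3,4-diCl” limitation concrete, here is a minimal sketch of what expanding such an entry would involve. It is not part of the described system; `SUBSTITUENTS`, `MULTIPLIERS`, `ENTRY` and `expand_positional_entry` are hypothetical, and the substituent list is intentionally short so that entries like “4-ClPh” are rejected rather than misread.

```python
import re

MULTIPLIERS = {"di": 2, "tri": 3, "tetra": 4}
SUBSTITUENTS = {"F", "Cl", "Br", "Me", "OMe"}  # assumed, deliberately small
ENTRY = re.compile(r"^(\d+(?:,\d+)*)-(di|tri|tetra)?([A-Z][A-Za-z0-9]*)$")


def expand_positional_entry(entry):
    """Expand e.g. '3,4-diCl' into [(3, 'Cl'), (4, 'Cl')], or return None."""
    match = ENTRY.match(entry)
    if match is None:
        return None
    locants, multiplier, substituent = match.groups()
    if substituent not in SUBSTITUENTS:
        return None  # e.g. '4-ClPh': the locant refers to the ring, not the core
    positions = [int(locant) for locant in locants.split(",")]
    expected = MULTIPLIERS[multiplier] if multiplier else 1
    if len(positions) != expected:
        return None  # locant count disagrees with the multiplier
    return [(position, substituent) for position in positions]


for entry in ["3,4-diCl", "4-F", "4-ClPh", "4-OMe-phenyl", "3,4-triCl"]:
    print(entry, expand_positional_entry(entry))
```

Even with the expansion in hand, the result still depends on the ring numbering problem listed above: position 3 only means something once the core's atoms have been numbered.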
Current limitations
x implicitly 1?
Which is position 8?
Nested R-group definition
Partially defined by this text and the table
Conclusions
• Direct interpretation of ChemDraw files can provide
precision benefits over using ChemDraw exported
Mol files or optical structure recognition approaches
• Structures from R-group tables are not handled by
existing text-mining approaches (e.g. SureChEMBL)
• Extracting structures from R-group tables is
complementary to existing approaches
Acknowledgements
• George Papadatos
• Funding provided by:
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com

Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Hiding Chemistry in Plain Sight: Interpreting Chemical Sketches from Text

  • 1. Sketchy sketches: Hiding chemistry in plain sight
      Daniel Lowe, John May and Roger Sayle
      NextMove Software, Cambridge, UK
      252nd ACS National Meeting, Philadelphia PA, USA, 25th August 2016
  • 2. Overview
      • Motivation for mining sketches
      • Tricky cases when interpreting sketches
      • Combining text-mining with sketch interpretation
  • 3. Motivation
      • The chemical matter discussed in a document is often critical in determining whether it is relevant
      • Chemical sketches are not indexed by text-mining
      • If chemical sketches can be made “chemistry searchable” this helps with:
          – Identifying relevant documents
          – Prior-art searching
  • 4. What input should be used?
      • Image-to-structure tools (OSRA/Clide/Imago etc.) work with images
          – Introduces OCR errors on atom labels
          – Crossing bonds present difficulties
          – Can often find chemistry in non-chemical images
      • Where the sketch is available in a “computer-readable” format, can these issues be avoided?
  • 5. Sources of ChemDraw sketches
      • United States patents (2001-present)
          – Over 24 million ChemDraw files!
      • Journal articles (albeit in most cases not publicly accessible)
      • Theses (albeit only if the original manuscript is made available)
  • 6. Ambiguous symbols
      Symbol   Naïve interpretation   Possible meaning
      Ac       Actinium               acetyl
      Ar       Argon                  aryl
      B        Boron                  Generic label
      D        Deuterium              Generic label
      P        Phosphorus             Generic label
      Ra       Radium                 Generic label
      Rb       Rubidium               Generic label
      V        Vanadium               Generic label
      W        Tungsten               Generic label
      Y        Yttrium                Generic label
  • 7. Ambiguous symbols (cont.)
      • Can disambiguate with text-mining:
          – E.g. “B is aryl or heteroaryl”, “B is boron”
      • Can disambiguate by connectivity, e.g. is an yttrium atom with one bond likely?
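The two disambiguation strategies on slide 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interpret_symbol`, the "X is ..." text pattern, and the set of symbols treated as implausible with a single bond are all hypothetical choices made here, not the rules used by the authors' software.

```python
import re

# Element names for the ambiguous symbols listed in the table on slide 6.
ELEMENT_NAMES = {
    "Ac": "actinium", "Ar": "argon", "B": "boron", "D": "deuterium",
    "P": "phosphorus", "Ra": "radium", "Rb": "rubidium",
    "V": "vanadium", "W": "tungsten", "Y": "yttrium",
}

# ASSUMPTION (illustrative only): for these symbols a single bond is treated
# as implausible for the element, so the symbol is read as a label instead.
UNLIKELY_WITH_ONE_BOND = {"Ac", "Ar", "Ra", "V", "W", "Y"}


def interpret_symbol(symbol, bond_count, text=""):
    """Return 'element' or 'label' for an ambiguous atom symbol."""
    # 1) Text evidence: a definition such as "B is boron" or "B is aryl".
    match = re.search(rf"\b{re.escape(symbol)} is (\w+)", text)
    if match:
        defined_as = match.group(1).lower()
        return "element" if defined_as == ELEMENT_NAMES.get(symbol) else "label"
    # 2) Connectivity evidence: e.g. an yttrium atom with one bond is unlikely.
    if symbol in UNLIKELY_WITH_ONE_BOND and bond_count == 1:
        return "label"
    # 3) No evidence either way: fall back to the naive element reading.
    return "element"
```

For example, `interpret_symbol("B", 3, "wherein B is aryl or heteroaryl")` gives `"label"`, while the same call with the text `"B is boron"` gives `"element"`. Text evidence is checked first here because an explicit definition is stronger than a valence heuristic; that ordering is a design choice of this sketch.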
  • 8. Attachment point representation (below: naïve interpretation)
      [Sketches labelled: tert-butyl, methyl, tert-butyl, methyl]
  • 9. Implicit attachment point representation
      [Sketches labelled: Unlabelled methyl, Under-valent atom]
      Sketch parser needs to be given a hint that the sketch is a substituent definition!
  • 10. Formula Interpretation
      Columns: Input, ChemDraw 15, This work
      Inputs: HATU; C4F9; H3PO4; CON(cHex)2 (No result); III-2 (No result)
  • 11. Categorisation
      1) Sketch type: Molecule, Reaction, Substituent, No connection table
      2) Detail: Specific, Generic, Unknown
      3) Confidence in interpretation: High, Medium, Low
  • 12. Examples of categorisation
      Molecule/Specific/High
      Substituent/Specific/High
  • 13. Examples of categorisation
      Molecule/Generic/Low
  • 14. Examples of categorisation
      Molecule/Unknown
      Formula uninterpretable, so can’t know for sure whether the molecule is specific or generic!
  • 15. Reaction/Specific/Medium
      Two reactions extracted
  • 16. Examples of categorisation
      No Connection Table
  • 17. Repeated group detection
  • 18. Electron Localisation
      Some delocalised systems don’t yield valid SMILES; convert to a localised system
  • 19. Positional variation
      Panel captions: “Naïve export”; “Association of R-groups with ring atoms captured”
  • 20. Evaluation (Dec 2015 US patent applications)
      Chart panels: Molecule, Reaction, Substituent
  • 21. Comparison with other approaches
      Approach                         Not found by text-mining   Also found by text-mining
      This work (parsing CDX files)    49,119                     36,829 (42.8%)
      Image to structure*              49,836                     35,545 (41.6%)
      ChemDraw exported Mol files*     58,169                     28,926 (33.2%)
      *Results courtesy of the SureChEMBL database
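The percentages on slide 21 appear to be the share of each approach's structures that text-mining also found, i.e. "also found" divided by the sum of both columns. The slide does not state this explicitly, so the short check below treats it as an assumption; the function name `overlap_percent` is made up for illustration.

```python
# Counts copied from slide 21: (not found by text-mining, also found by text-mining).
COMPARISON = {
    "This work (parsing CDX files)": (49_119, 36_829),
    "Image to structure": (49_836, 35_545),
    "ChemDraw exported Mol files": (58_169, 28_926),
}


def overlap_percent(not_found, also_found):
    """Share of an approach's structures that text-mining also found, in percent.

    ASSUMPTION: the denominator is the sum of the two columns.
    """
    return 100.0 * also_found / (not_found + also_found)


for approach, (not_found, also_found) in COMPARISON.items():
    print(f"{approach}: {overlap_percent(not_found, also_found):.2f}%")
```

Under that assumption the computed values agree with the slide's figures to within rounding.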
  • 22. Exemplified compound R-group Tables
  • 23. Approach
      • Sketches are extracted to extended SMILES capturing:
          – R-group labels
          – Positional variation
          – Repeat groups
      • USPTO tables precisely describe how tables should be displayed but are weak on semantics
          – Heuristics used to determine which lines are the same row
          – Table caption disambiguated from table column headings
          – Column widths used to determine columns
          – Colspans detected
      • Name to structure used to interpret chemical names/formulas as R-groups; sketches interpreted as R-groups
      • Structure assembled from core and R-groups
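The last step on slide 23, assembling a structure from a core and R-groups, can be illustrated with plain string substitution. This is a deliberately naive sketch, not the authors' method: the `[R1]`-style placeholder is a hypothetical notation (it is not valid SMILES, and the slides mention extended SMILES instead), each fragment is assumed to be a SMILES string whose first atom is the attachment atom, and the function name `assemble` is invented here. A real pipeline would use a cheminformatics toolkit.

```python
import re


def assemble(core, r_groups):
    """Replace each [Rn] placeholder in `core` with the matching fragment.

    ASSUMPTIONS (illustrative only): placeholders look like "[R1]", and each
    fragment is a SMILES string that starts with its attachment atom.
    """
    def substitute(match):
        label = match.group(1)
        fragment = r_groups[label]  # KeyError if the table row lacks this R-group
        # Ring-closure digits in a fragment could collide with those of the
        # core, so this naive sketch conservatively rejects any digit or '%'.
        if "%" in fragment or any(ch.isdigit() for ch in fragment):
            raise ValueError(f"fragment {fragment!r} needs a real toolkit")
        return fragment

    return re.sub(r"\[(R\d+)\]", substitute, core)


# One table row: a disubstituted benzene core with two R-group columns.
row = {"R1": "OC", "R2": "Cl"}
print(assemble("c1cc([R1])ccc1[R2]", row))  # c1cc(OC)ccc1Cl
```

The digit check shows why string substitution alone is fragile: a phenyl fragment written as `c1ccccc1` would reuse ring-closure label 1 while the core's ring is still open, silently producing the wrong molecule.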
  • 24. Core variation
  • 25. Positional variation
      Incorrect formula
  • 26. Substituents defined as sketches
  • 27. Current results
      • 2001 to June 2016 USPTO patent applications:
          – 1.96 million potential table entries detected
          – 1.13 million (57.9%) converted to specific chemical structures
          – 621 thousand unique chemical structures
  • 28. Novelty of results (versus other pipelines)
      Data type                             Unique compounds   Not found in text/sketches   Not found in text   Not found in sketches
      Exemplified compound R-group tables   621,140            529,417 (85.2%)              541,974 (87.3%)     590,889 (95.1%)
      Text                                  4,759,009          0%                           0%                  2,960,937 (62.2%)
      Sketches                              4,479,113          0%                           2,681,041 (59.9%)   0%
      Structural identity checks performed using StdInChI
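The novelty columns on slide 28 amount to set differences over structure identifiers. A minimal sketch, assuming each pipeline's output has already been reduced to a set of StdInChI strings; the function name `novelty` and the toy sets are illustrative and are not data from the talk.

```python
def novelty(table, text, sketches):
    """Count table-derived structures that are missing from the other sources.

    Each argument is a set of StdInChI strings, so equality of strings stands
    in for structural identity.
    """
    return {
        "unique": len(table),
        "not_in_text_or_sketches": len(table - text - sketches),
        "not_in_text": len(table - text),
        "not_in_sketches": len(table - sketches),
    }


# Toy identifiers for methane, water, ethanol and benzene (illustration only).
METHANE = "InChI=1S/CH4/h1H4"
WATER = "InChI=1S/H2O/h1H2"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

counts = novelty(
    table={METHANE, WATER, ETHANOL},
    text={WATER, BENZENE},
    sketches={ETHANOL},
)
print(counts)
```

Comparing canonical identifiers rather than raw SMILES avoids counting the same compound twice when two pipelines write it differently.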
  • 29. Heavy atom count distribution
  • 30. Novelty of results (versus PubChem)
      Data type                             Unique compounds   Not in PubChem      Not in PubChem (SureChEMBL)
      Exemplified compound R-group tables   621,140            496,831 (80.0%)     532,166 (85.7%)
      Text                                  4,759,009          564,886 (11.9%)     911,976 (19.2%)
      Sketches                              4,479,113          886,991 (19.8%)     1,179,229 (26.3%)
      Structural identity checks performed using StdInChI
  • 31. Current limitations
      • Application of variable repeat groups
      • Obscure ways of depicting attachment points
      • R-groups defined in terms of other R-groups
      • R-groups defined elsewhere in the document
      • Positional variation R-group representing multiple groups, e.g. “3,4-diCl”
      • Formulas involving substituted rings, e.g. “4-ClPh”
      • “Formulas” that mix systematic names with formulas, e.g. “4-OMe-phenyl”
      • Algorithmic numbering of simple ring systems (for positional variation)
      • Ditto marks
  • 32. Current limitations
      [Sketch annotations: “x implicitly 1?”; “Which is position 8?”; “Nested R-group definition”; “Partially defined by this text and the table”]
  • 33. Conclusions
      • Direct interpretation of ChemDraw files can provide precision benefits over using ChemDraw exported Mol files or optical structure recognition approaches
      • Structures from R-group tables are not handled by existing text-mining approaches (e.g. SureChEMBL)
      • Extracting structures from R-group tables is complementary to existing approaches
  • 34. Acknowledgements
      • George Papadatos
      • Funding provided by:
  • 35. Thank you for your time!
      http://nextmovesoftware.com
      http://nextmovesoftware.com/blog
      daniel@nextmovesoftware.com