State and Future of the IUPAC InChI, NIH, 16-18 Aug 2017
Building on Sand
John Mayfield, Roger Sayle
NextMove Software Ltd
Standard InChIs on non-standard molfiles
MDL VALENCE (MDLBENCH1)
                   2012                           2017
Toolkit            Version  Accuracy  Precision   Version     Accuracy  Precision
-----------------  -------  --------  ---------   ----------  --------  ---------
CDK                1.4.13   92.65%    95.11%      2.0         100.00%   100.00%
Open Babel         2.3.90   91.73%    93.34%      GitHub      100.00%   100.00%
MDL/BIOVIA Direct  8.0      90.30%    99.76%      2017        97.67%    97.73%
OEChem             1.9      97.20%    99.78%      20170613    97.20%    99.78%
ChemAxon           5.1      88.98%    92.99%      17.17       93.13%    97.33%
GGA/EPAM Indigo    1.1.4    70.80%    97.52%      1.3.0.r16   97.22%    97.22%
RDKit              2012.09  13.62%    22.74%      2017.03.03  67.30%    85.83%
Valence is defined either explicitly (safe) or implicitly as a default value
“The correct valence is specified by MDL/ISIS”
Roger Sayle, MDL Bench, Cheminformatics Toolkits: A Personal Perspective, RDKit UGM, Oct 2012
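The difference between explicit and implicit valence can be sketched in a few lines. The snippet below decodes the valence field of a V2000 atom line (0 = no marking, 1 to 14 = explicit valence, 15 = zero valence) and falls back to a default when nothing is marked. The DEFAULT_VALENCE table and the function name are illustrative assumptions that cover only neutral C, N and O; the real MDL valence model also depends on charge and bonding, which is where readers can disagree.

```python
# Hypothetical, deliberately tiny default-valence table (neutral atoms only).
# A real reader needs a full model; this is an assumption for illustration.
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(symbol, bond_order_sum, vvv=0):
    """Return the implicit hydrogen count for one atom.

    vvv is the V2000 valence field: 0 means "no marking" (use a default),
    1..14 give the valence explicitly, and 15 means zero valence.
    """
    if vvv == 15:
        valence = 0
    elif 1 <= vvv <= 14:
        valence = vvv  # explicit: every reader should agree
    elif vvv == 0:
        valence = DEFAULT_VALENCE[symbol]  # implicit: depends on the reader's model
    else:
        raise ValueError(f"invalid valence field: {vvv}")
    # Never report a negative hydrogen count for over-bonded atoms.
    return max(valence - bond_order_sum, 0)
```

With an explicit valence the result does not depend on the default table at all, which is why the slide calls that route safe.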
MDL VALENCE-MAGEDDON
BIOVIA 2017 changes the interpretation of MDL files
Changes the molecular formula (MF) of 213,097 records in PubChem Compound
MDL MASS DELTA (MDLBENCH2)
MDL files originally stored atomic mass delta
‣InChI inherited this decision
‣Resolved by M ISO in molfile
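As a minimal sketch of the absolute-mass route, the function below reads an "M  ISO" property line. It assumes the usual fixed-width V2000 property layout (a three-character entry count followed by four-character atom index and value fields); the function name is hypothetical.

```python
def parse_m_iso(line):
    """Parse a V2000 'M  ISO' property line into {atom_index: absolute_mass}."""
    if not line.startswith("M  ISO"):
        raise ValueError("not an M  ISO line")
    count = int(line[6:9])  # number of (atom, mass) pairs on this line
    isotopes = {}
    for i in range(count):
        start = 9 + 8 * i  # each pair occupies 8 characters
        atom_index = int(line[start:start + 4])
        mass = int(line[start + 4:start + 8])
        isotopes[atom_index] = mass
    return isotopes
```

Because the value is an absolute mass number, no reference mass table is needed to interpret it.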
Toolkit        Version     B    Te     Sg
-------------  ----------  ---  -----  -----
BIOVIA Direct  2017        11B  128Te  266Sg
CDK            2.0         11B  130Te  258Sg
ChemAxon       17.17       11B  130Te  0Sg
DataWarrior    4.6.0       11B  130Te  0Sg
InChI          1.0.5       11B  130Te  269Sg
Indigo         1.3.0b      11B  128Te  271Sg
OEChem         20170613    11B  130Te  263Sg
Open Babel     2.4.1       10B  127Te  271Sg
RDKit          2017.03.03  11B  130Te  271Sg
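A mass delta only has meaning relative to a reference mass, so two readers with different reference tables will report different isotopes for the same file. The sketch below illustrates this with two assumed conventions: the rounded standard atomic weight and the most abundant isotope. The table contents, the convention names and the function name are illustrative assumptions, and the benchmark inputs themselves are not reproduced here.

```python
# Illustrative reference data (assumptions for this sketch, two elements only).
STANDARD_ATOMIC_WEIGHT = {"B": 10.81, "Te": 127.60}
MOST_ABUNDANT_ISOTOPE = {"B": 11, "Te": 130}


def mass_from_delta(symbol, delta, base="rounded_weight"):
    """Convert a molfile mass delta into an absolute mass number.

    base selects the reference mass the delta is applied to:
    'rounded_weight' rounds the standard atomic weight,
    'most_abundant' uses the most abundant natural isotope.
    """
    if base == "rounded_weight":
        reference = round(STANDARD_ATOMIC_WEIGHT[symbol])
    elif base == "most_abundant":
        reference = MOST_ABUNDANT_ISOTOPE[symbol]
    else:
        raise ValueError(f"unknown base: {base}")
    return reference + delta
```

For boron the two conventions agree, while for tellurium they differ by two mass units, so the same delta yields two different isotopes.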
stereo parity (MDLBENCH3)
                           0D           2D           3D
Toolkit        Version     0  1  2  3   0  1  2  3   0  1  2  3
-------------  ----------  ----------   ----------   ----------
ChemAxon       17.17       -  S  R  -   -  -  -  -   R  R  R  R
CDK            2.0         -  S  R  -   -  -  -  -   R  R  R  -
Open Babel     2.4.1       -  S  R  -   -  -  -  -   R  R  R  R
OEChem         20170613    -  S  R  -   -  S  R  -   R  R  R  R
InChI          1.0.5       -  -  -  -   -  -  -  -   R  R  R  R
RDKit          2017.03.03  -  -  -  -   -  -  -  -   -  -  -  -
BIOVIA Direct  2017        -  -  -  -   -  -  -  -   -  R  R  R
Indigo         1.3.0b      -  -  -  -   -  -  -  -   R  R  R  R
The table shows default behaviour, which can often be changed: Open Babel and CDK have options to use the parity value for 2D input.
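One pattern visible in the table (parity honoured for 0D input, coordinates preferred otherwise) can be sketched as follows. The code reads the coordinates and the stereo parity field from V2000 atom lines and decides which parities to trust. The function names and the use_parity_in_2d option are hypothetical, and this is not the implementation of any toolkit listed above.

```python
def parse_atom_line(line):
    """Extract coordinates, symbol and stereo parity from a V2000 atom line."""
    padded = line.ljust(42)
    x, y, z = (float(padded[i:i + 10]) for i in (0, 10, 20))
    symbol = padded[31:34].strip()
    parity_text = padded[39:42].strip()
    parity = int(parity_text) if parity_text else 0
    return (x, y, z), symbol, parity


def dimensionality(coords):
    """Classify coordinates as 0 (all zero), 2 (all z are zero) or 3."""
    if any(z != 0.0 for _, _, z in coords):
        return 3
    if any(x != 0.0 or y != 0.0 for x, y, _ in coords):
        return 2
    return 0


def parities_to_trust(atom_lines, use_parity_in_2d=False):
    """Return {atom_number: parity} for the parities this policy would honour."""
    atoms = [parse_atom_line(line) for line in atom_lines]
    dim = dimensionality([coords for coords, _, _ in atoms])
    if dim == 0 or (dim == 2 and use_parity_in_2d):
        # Only parity 1 (odd) and 2 (even) define a configuration.
        return {i + 1: p for i, (_, _, p) in enumerate(atoms) if p in (1, 2)}
    return {}  # stereo would be perceived from the coordinates instead
```

The use_parity_in_2d flag stands in for the kind of option the note above attributes to Open Babel and CDK.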
zero-order bonds
[Figure: the same Fe complex drawn with five bonding conventions: Plain, Coordination, Dashed, Charge Separated (N+, N+, Fe2-) and Omitted]
Bonding required to describe configuration
Representation is part of the solution (and sometimes part of the problem); normalisation is still required
How can they be represented in a molfile?
ctab representation
…
CTfile Formats “Nov 2011 onwards” V3000 only, many tools allow it in V2000
Alex Clark. Accurate Specification of Molecular Structures: The Case for Zero-Order Bonds and Explicit Hydrogen Counting. J. Chem. Inf. Model. 2011, 51, 3149–3157
(Syntax Extensions)
ctab representation
M STY 1 1 DAT
M SAL 1 2 12 29
M SDT 1 MRV_COORDINATE_BOND_TYPE
M SDD 1 0.0000 0.0000 DR ALL 0 0
M SED 1 31
ChemAxon specific information in MDL MOL files, http://docs.chemaxon.com
(Semantic Extensions)
PubChem SD File Formatted Data V2.0.1
ftp://ftp.ncbi.nih.gov/pubchem/specifications
BondTypeID Meaning
---------- -----------------
5 Dative Bond
6 Complex Bond
7 Ionic Bond
255 Unspecified or Unknown Connectivity
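A reader that wants to handle these extended bond types could start from a lookup like the one below, which encodes only the four BondTypeID values listed above. The parser assumes one "atom1 atom2 type" triple per line in an SD data item; that layout and both names are assumptions for this sketch, not taken from the PubChem specification.

```python
from enum import IntEnum


class PubChemBondType(IntEnum):
    """Non-covalent BondTypeID values from the table above."""
    DATIVE = 5
    COMPLEX = 6
    IONIC = 7
    UNKNOWN = 255


def parse_nonstandard_bonds(tag_value):
    """Parse lines of 'atom1 atom2 bond_type' into (a1, a2, type) tuples.

    The three-integer layout is an assumption; type IDs outside the
    table above raise ValueError rather than being guessed.
    """
    bonds = []
    for row in tag_value.splitlines():
        if not row.strip():
            continue
        a1, a2, type_id = (int(token) for token in row.split())
        bonds.append((a1, a2, PubChemBondType(type_id)))
    return bonds
```

Keeping these bonds in a separate list leaves the ordinary bond block untouched, which is the sense in which this is a semantic rather than a syntax extension.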
summary
Systematic benchmarks highlight differences in interpretations
‣Often simple to change, but can need agreement
‣Chemistry is a moving target
The format has been extended in several different ways to handle zero-order bonds
‣Can cause unexpected behaviour elsewhere
‣Normalisation still difficult
Acknowledgements
Noel O’Boyle and Shuzhe Wang
ENDS
sgroups
Annotation layer over part of a structure
Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
[Figure: example Sgroups with panels labelled Display Shortcut, Polymer, Mixture and Data; component labels 25% and 75%]
enhanced stereo 1
Enhanced stereo is for handling racemic mixtures and relative stereochemistry
[Figure: five structures labelled A to E, annotated with enhanced stereo groups (&1, &2); one carries the note "and enantiomer"]
BIOVIA (NEMA-KEY)
A,B,C,D 47CZTH5YZKMZ9K3MVCCVHSUF2378UH
E NULL
ChemAxon (CXSMILES)
A,D C[C@H](O)[C@@H](O)C=O |&1:1,3,r|
B C[C@@H](O)[C@H](O)C=O |&1:1,3,r|
C C[C@H](O)[C@@H](O)C=O |r|
D C[C@@H](O)[C@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
DataWarrior
A,B,C,D gNq`AjdmsURQAh@
E dgLF@@rnT|bTtARfcUSUQHPUDtZP@
enhanced stereo 1
Enhanced stereo is a shortcut for racemic mixtures and relative stereochemistry
[Figure: structures labelled A to E, annotated n/a, &1, &2 and &1]
BIOVIA (NEMA-KEY)
A,B,D,E NULL
ChemAxon (CXSMILES)
A,D C[C@H](O)[C@@H](O)C=O |&1:3,r|
B C[C@@H](O)[C@H](O)C=O |&1:1,r|
D C[C@@H](O)[C@@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
DataWarrior
A,B,D gNq`AjdmsURQA`@
E dgLF@@rnT|bTtARfcUSUQHPUDdZP@
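The CXSMILES strings above carry the enhanced stereo groups in the trailing |...| extension, for example &1:1,3 for an "and" group over atoms 1 and 3 (zero-based). The sketch below pulls those groups out. It handles only the a:, oN: and &N: fields plus simple flags such as r; the function name is hypothetical and the parser is far from a complete CXSMILES reader.

```python
import re

# A group starts with 'a', 'oN' or '&N', then ':' and the first atom index.
GROUP_START = re.compile(r"^(a|o\d+|&\d+):(\d+)$")


def parse_enhanced_stereo(cxsmiles):
    """Return (smiles, {group_label: [atom indices]}, [other flags])."""
    smiles, _, rest = cxsmiles.partition(" |")
    extension = rest.rstrip().rstrip("|")
    groups, flags, current = {}, [], None
    for token in extension.split(","):
        token = token.strip()
        match = GROUP_START.match(token)
        if match:
            current = match.group(1)
            groups.setdefault(current, []).append(int(match.group(2)))
        elif token.isdigit() and current is not None:
            # A bare number continues the atom list of the open group.
            groups[current].append(int(token))
        elif token:
            current = None
            flags.append(token)
    return smiles, groups, flags
```

Applied to the first ChemAxon entry, this separates the SMILES from a single &1 group covering atoms 1 and 3, with r left over as a flag.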

Transport in plants G1.pptx Cambridge IGCSE
 

Building on Sand: Standard InChIs on non-standard molfiles

  • 1. State and Future of the IUPAC InChI, NIH, 16-18 Aug 2017. Building on Sand: Standard InChIs on non-standard molfiles. John Mayfield, Roger Sayle, NextMove Software Ltd
  • 2. MDL VALENCE (MDLBENCH1)
       Valence defined either explicitly (safe) or implicitly as a default value
       “The correct valence is specified by MDL/ISIS”
       Roger Sayle, MDL Bench, Cheminformatics Toolkits: A Personal Perspective, RDKit UGM, Oct 2012

                          |------------ 2012 -------------|  |------------- 2017 --------------|
       Toolkit            Version   Accuracy  Precision      Version     Accuracy  Precision
       -----------------  --------  --------  ---------      ----------  --------  ---------
       CDK                1.4.13    92.65%    95.11%         2.0         100.00%   100.00%
       Open Babel         2.3.90    91.73%    93.34%         GitHub      100.00%   100.00%
       MDL/BIOVIA Direct  8.0       90.30%    99.76%         2017        97.67%    97.73%
       OEChem             1.9       97.20%    99.78%         20170613    97.20%    99.78%
       ChemAxon           5.1       88.98%    92.99%         17.17       93.13%    97.33%
       GGA/EPAM Indigo    1.1.4     70.80%    97.52%         1.3.0.r16   97.22%    97.22%
       RDKit              2012.09   13.62%    22.74%         2017.03.03  67.30%    85.83%
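The distinction between explicit and default valence on the slide above can be sketched in a few lines. This is a minimal illustration, not any toolkit's implementation: the `vvv` encoding (0 = use default, 1 to 14 = explicit valence, 15 = zero valence) follows the V2000 atom block, but the `DEFAULT_VALENCES` table is a deliberately tiny assumption that covers only a few neutral atoms, and the function names are hypothetical.

```python
# Simplified default valences for a few neutral atoms. This table is an
# assumption for illustration only; the real MDL/ISIS valence model covers
# every element and also depends on formal charge.
DEFAULT_VALENCES = {"C": (4,), "N": (3, 5), "O": (2,), "S": (2, 4, 6)}


def resolve_valence(symbol, bond_order_sum, vvv=0):
    """Return the valence implied by the V2000 atom-block 'vvv' field."""
    if vvv == 15:
        # 15 encodes an explicit valence of zero.
        return 0
    if 1 <= vvv <= 14:
        # Explicit valence: nothing is left to interpretation.
        return vvv
    # vvv == 0: fall back to the smallest default valence that fits the bonds.
    for valence in DEFAULT_VALENCES.get(symbol, ()):
        if valence >= bond_order_sum:
            return valence
    # No default fits (or the element is not in the table): add no hydrogens.
    return bond_order_sum


def implicit_hydrogens(symbol, bond_order_sum, vvv=0):
    """Implicit hydrogen count = resolved valence minus the bond order sum."""
    return max(0, resolve_valence(symbol, bond_order_sum, vvv) - bond_order_sum)
```

With an explicit `vvv` the hydrogen count is fixed by the file; with `vvv = 0` it depends entirely on the reader's default table, which is where readers can disagree.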
  • 4. MDL MASS DELTA (MDLBENCH2)
       MDL files originally stored atomic mass delta
         ‣ InChI inherited this decision
         ‣ Resolved by M ISO in molfile

       Toolkit        Version     B     Te     Sg
       -------------  ----------  ----  -----  -----
       BIOVIA Direct  2017        11B   128Te  266Sg
       CDK            2.0         11B   130Te  258Sg
       ChemAxon       17.17       11B   130Te  0Sg
       DataWarrior    4.6.0       11B   130Te  0Sg
       InChI          1.0.5       11B   130Te  269Sg
       Indigo         1.3.0b      11B   128Te  271Sg
       OEChem         20170613    11B   130Te  263Sg
       Open Babel     2.4.1       10B   127Te  271Sg
       RDKit          2017.03.03  11B   130Te  271Sg
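A mass delta is only meaningful relative to a reference mass, and a reader has to choose that reference. The sketch below shows how three plausible conventions give different isotopes for the same delta, whereas an `M  ISO` line states the absolute mass directly. The convention names and helper functions are hypothetical, and no claim is made that any toolkit in the table uses a particular one.

```python
# Standard atomic weights (rounded) and most abundant isotopes for two of the
# elements discussed above.
ATOMIC_WEIGHT = {"B": 10.81, "Te": 127.60}
MOST_ABUNDANT = {"B": 11, "Te": 130}


def reference_mass(symbol, convention):
    """Integer mass that a delta is measured from, under a given convention."""
    if convention == "round":      # nearest integer to the atomic weight
        return round(ATOMIC_WEIGHT[symbol])
    if convention == "truncate":   # atomic weight with the fraction dropped
        return int(ATOMIC_WEIGHT[symbol])
    if convention == "abundant":   # mass number of the most abundant isotope
        return MOST_ABUNDANT[symbol]
    raise ValueError(f"unknown convention: {convention}")


def mass_from_delta(symbol, delta, convention):
    """Absolute isotope mass implied by an atom-block mass difference."""
    return reference_mass(symbol, convention) + delta


def parse_m_iso(line):
    """Parse an 'M  ISO' line into {atom_index: absolute_mass}."""
    fields = line.split()
    if fields[:2] != ["M", "ISO"]:
        raise ValueError("not an M  ISO line")
    count = int(fields[2])
    values = [int(f) for f in fields[3:3 + 2 * count]]
    return dict(zip(values[0::2], values[1::2]))
```

For a delta of +1 on tellurium the three conventions give 129, 128 and 131 respectively, while `parse_m_iso("M  ISO  1   5 130")` leaves no room for interpretation.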
  • 5. stereo parity (MDLBENCH3)
       Columns give the parity value (0-3) stored in the atom block, for 0D, 2D and 3D input.

                                  |--- 0D ---|  |--- 2D ---|  |--- 3D ---|
       Toolkit        Version     0  1  2  3    0  1  2  3    0  1  2  3
       -------------  ----------  -  -  -  -    -  -  -  -    -  -  -  -
       ChemAxon       17.17       -  S  R  -    -  -  -  -    R  R  R  R
       CDK            2.0         -  S  R  -    -  -  -  -    R  R  R  -
       Open Babel     2.4.1       -  S  R  -    -  -  -  -    R  R  R  R
       OEChem         20170613    -  S  R  -    -  S  R  -    R  R  R  R
       InChI          1.0.5       -  -  -  -    -  -  -  -    R  R  R  R
       RDKit          2017.03.03  -  -  -  -    -  -  -  -    -  -  -  -
       BIOVIA Direct  2017        -  -  -  -    -  -  -  -    -  R  R  R
       Indigo         1.3.0b      -  -  -  -    -  -  -  -    R  R  R  R

       Table shows default behaviour, which can often be tweaked: Open Babel and CDK have options to use the parity value for 2D input.
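The parity value the table refers to is a fixed-width field in each V2000 atom line. A minimal sketch of reading it, together with a coordinate-based guess at 0D/2D/3D, might look like the following. The function names are hypothetical, the line builder exists only to produce correctly aligned test input, and the dimensionality check is a simple heuristic assumed here rather than the rule any listed toolkit applies.

```python
# Meaning of the V2000 atom-block stereo parity field.
PARITY_MEANING = {0: "not stereo", 1: "odd", 2: "even", 3: "either or unmarked"}


def format_atom_line(x, y, z, symbol, mass_diff=0, charge=0, parity=0):
    """Build the leading fixed-width fields of a V2000 atom line."""
    return (f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}"
            f"{mass_diff:>2}{charge:>3}{parity:>3}")


def parse_atom_line(line):
    """Slice coordinates, symbol, mass difference, charge code and parity."""
    return {
        "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
        "symbol": line[31:34].strip(),
        "mass_diff": int(line[34:36]),
        "charge": int(line[36:39]),
        "parity": int(line[39:42]),
    }


def dimensionality(atoms):
    """Heuristic: all-zero coordinates -> 0D, all z zero -> 2D, else 3D."""
    coords = [atom["xyz"] for atom in atoms]
    if all(c == (0.0, 0.0, 0.0) for c in coords):
        return "0D"
    if all(c[2] == 0.0 for c in coords):
        return "2D"
    return "3D"
```

Whether a reader then trusts `parity`, recomputes stereo from coordinates, or ignores both is exactly the default behaviour the table compares.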
  • 6. zero-order bonds
       [Figure: an Fe complex drawn in five representations: Plain, Coordination, Dashed, Charge Separated (N+, N+, Fe2-), Omitted]
       Bonding required to describe configuration
       Representation part of the solution (and sometimes part of the problem), normalisation still required
       How can they be represented in a molfile?
  • 8. ctab representation
       M STY 1 1 DAT
       M SAL 1 2 12 29
       M SDT 1 MRV_COORDINATE_BOND_TYPE
       M SDD 1 0.0000 0.0000 DR ALL 0 0
       M SED 1 31
       ChemAxon specific information in MDL MOL files, http://docs.chemaxon.com
       (Semantic Extensions)

       PubChem SD File Formatted Data V2.0.1, ftp://ftp.ncbi.nih.gov/pubchem/specifications
       BondTypeID  Meaning
       ----------  -----------------
       5           Dative Bond
       6           Complex Bond
       7           Ionic Bond
       255         Unspecified or Unknown Connectivity
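The ChemAxon records on the slide above form a data Sgroup attached to two atoms. A minimal sketch of collecting such records is given below. It splits on whitespace rather than reading fixed-width columns, handles only the `STY`, `SAL`, `SDT` and `SED` tags shown on the slide, and uses hypothetical function names; the data value is kept as written, without interpreting what it refers to.

```python
def parse_data_sgroups(lines):
    """Collect Sgroup type, atom list, field name and data keyed by index."""
    sgroups = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) < 4 or tokens[0] != "M":
            continue
        tag = tokens[1]
        if tag == "STY":
            # M STY <count> then <index> <type> pairs
            for i in range(int(tokens[2])):
                index, kind = int(tokens[3 + 2 * i]), tokens[4 + 2 * i]
                sgroups.setdefault(index, {})["type"] = kind
        elif tag == "SAL":
            # M SAL <index> <count> then atom indices
            index, count = int(tokens[2]), int(tokens[3])
            atoms = [int(t) for t in tokens[4:4 + count]]
            sgroups.setdefault(index, {}).setdefault("atoms", []).extend(atoms)
        elif tag == "SDT":
            # M SDT <index> <field name>
            sgroups.setdefault(int(tokens[2]), {})["field"] = tokens[3]
        elif tag == "SED":
            # M SED <index> <data>
            sgroups.setdefault(int(tokens[2]), {})["data"] = " ".join(tokens[3:])
    return sgroups


def coordinate_bond_sgroups(lines, field="MRV_COORDINATE_BOND_TYPE"):
    """Return only the DAT Sgroups that carry the given field name."""
    return {
        index: group
        for index, group in parse_data_sgroups(lines).items()
        if group.get("type") == "DAT" and group.get("field") == field
    }
```

A reader that does not know the field name sees only an ordinary data Sgroup, so the coordinate-bond annotation is lost unless the tool recognises this convention.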
  • 9. summary
       Systematic benchmarks highlight differences in interpretations
         ‣ Often simple to change, but can need agreement
         ‣ Chemistry is a moving target
       The format has already been enhanced in several different ways to handle zero-order bonds
         ‣ Can cause unexpected behaviour elsewhere
         ‣ Normalisation still difficult
       Acknowledgements: Noel O’Boyle and Shuzhe Wang
  • 10. ENDS
  • 11. sgroups
       Annotation layer over part of a structure
       Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
       Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
       [Figure labels: Display Shortcut, Polymer, Mixture, Data; 25%, 75%]
  • 12. enhanced stereo 1
       Enhanced stereo is for handling racemic mixtures and relative stereochemistry
       [Figure: structures A, B, C, D, E; labels: &1 &1, &2 &2&1 &1, and enantiomer]

       BIOVIA (NEMA-KEY)
         A,B,C,D  47CZTH5YZKMZ9K3MVCCVHSUF2378UH
         E        NULL
       ChemAxon (CXSMILES)
         A,D      C[C@H](O)[C@@H](O)C=O |&1:1,3,r|
         B        C[C@@H](O)[C@H](O)C=O |&1:1,3,r|
         C        C[C@H](O)[C@@H](O)C=O |r|
         D        C[C@@H](O)[C@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
       DataWarrior
         A,B,C,D  gNq`AjdmsURQAh@
         E        dgLF@@rnT|bTtARfcUSUQHPUDtZP@
  • 13. enhanced stereo 1
       Enhanced stereo is a shortcut for racemic mixtures and relative stereochemistry
       [Figure: structures A, B, C, D, E; labels: n/a, &1, &2&1]

       BIOVIA (NEMA-KEY)
         A,B,D,E  NULL
       ChemAxon (CXSMILES)
         A,D      C[C@H](O)[C@@H](O)C=O |&1:3,r|
         B        C[C@@H](O)[C@H](O)C=O |&1:1,r|
         D        C[C@@H](O)[C@@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
       DataWarrior
         A,B,D    gNq`AjdmsURQA`@
         E        dgLF@@rnT|bTtARfcUSUQHPUDdZP@
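The CXSMILES strings on the two enhanced stereo slides carry their stereo groups in the trailing `|...|` block. A minimal sketch of pulling those groups out is shown below. It is an assumption-laden simplification: it recognises only `&n:`, `on:` and `a:` group prefixes followed by comma-separated atom indices, treats anything else (such as `r`) as an opaque flag, and uses hypothetical function names rather than any toolkit's API.

```python
import re

# A token such as "&1:1", "o2:4" or "a:0" opens a stereo group.
GROUP_START = re.compile(r"^(a|[&o]\d+):(\d+)$")


def split_cxsmiles(text):
    """Split 'SMILES |extension|' into the SMILES and the extension body."""
    smiles, _, rest = text.strip().partition(" ")
    extension = rest.strip()
    if len(extension) >= 2 and extension.startswith("|") and extension.endswith("|"):
        return smiles, extension[1:-1]
    return smiles, ""


def parse_enhanced_stereo(text):
    """Return (smiles, {group_label: [atom indices]}, [other flags])."""
    smiles, extension = split_cxsmiles(text)
    groups, flags, current = {}, [], None
    for token in filter(None, extension.split(",")):
        match = GROUP_START.match(token)
        if match:
            # Start of a new group; its first atom index follows the colon.
            current = match.group(1)
            groups.setdefault(current, []).append(int(match.group(2)))
        elif token.isdigit() and current is not None:
            # A bare number continues the atom list of the open group.
            groups[current].append(int(token))
        else:
            # Anything else is kept as an uninterpreted flag.
            flags.append(token)
            current = None
    return smiles, groups, flags
```

Applied to the slide 12 entry `C[C@H](O)[C@@H](O)C=O |&1:1,3,r|`, this yields one `&1` group over atoms 1 and 3 plus the flag `r`; the slide 13 entries differ only in which atoms the group lists.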