cheminformatics toolkits:
            a personal perspective

                                 Roger Sayle
                       NextMove Software Ltd
                            Cambridge, UK



1st RDKit User Group Meeting, London, UK, 4th October 2012
overview

• Models of Chemistry
  – Implicit and Explicit Hydrogens
  – Atom types
  – Aromaticity Models
  – Valence Models
• The “mdlbench.sdf” benchmark
• Some words on performance
about the author

•   RasMol, 1989-1997.
•   GSK (Glaxo Wellcome), 1994-1997.
•   Daylight/Metaphorics, 1998-2001.
•   OpenEye Scientific Software, 2001-2010.
•   AstraZeneca, Abbott & NCBI, 2009.
•   NextMove Software, 2010-
models of chemistry

• Algorithms + Data Structures = Programs
• Abstraction of storage and representation
  from assumptions about the “real world”.
connection table compression

• A distinguishing feature between
  computational chemistry and cheminformatics
  is the representation of hydrogens.
• Typically in organic chemistry, approximately
  half of the bonds in a “full” connection table
  are the bonds to terminal hydrogen atoms.
hydrogen representations

• The two representations of the structure are
  equivalent, and either form can be deduced
  from the other [or even a hybrid form].
• In OEChem terminology:
  – OESuppressHydrogens(mol)
  – OEAddExplicitHydrogens(mol)
• Confusingly, Daylight’s SMILES toolkit
  auto-suppresses hydrogens during dt_modoff, and a
  bug in OEChem sprouts chiral hydrogens.
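The equivalence of the two forms can be illustrated with a toy connection table. The sketch below is only an illustration: the `Mol` class and both function names are hypothetical and are not the OEChem or Daylight implementations. It ignores isotopes, charges, bridging hydrogens, atom maps and chirality, and treats every bond as a single bond.

```python
from dataclasses import dataclass, field


@dataclass
class Mol:
    """Toy connection table: element symbols, implicit H counts, single bonds."""
    elements: list
    implicit_h: list
    bonds: set = field(default_factory=set)  # each bond is frozenset({i, j})


def add_explicit_hydrogens(mol):
    """Turn every implicit hydrogen into an explicit H atom plus a bond."""
    elements = list(mol.elements)
    bonds = set(mol.bonds)
    for i, count in enumerate(mol.implicit_h):
        for _ in range(count):
            elements.append("H")
            bonds.add(frozenset({i, len(elements) - 1}))
    return Mol(elements, [0] * len(elements), bonds)


def suppress_hydrogens(mol):
    """Fold terminal H atoms bonded to a non-H atom into implicit counts."""
    extra_h = [0] * len(mol.elements)
    keep = []
    for i, element in enumerate(mol.elements):
        neighbours = [j for bond in mol.bonds if i in bond for j in bond if j != i]
        if (element == "H" and mol.implicit_h[i] == 0 and len(neighbours) == 1
                and mol.elements[neighbours[0]] != "H"):
            extra_h[neighbours[0]] += 1  # hydrogen becomes an implicit count
        else:
            keep.append(i)
    new_index = {old: new for new, old in enumerate(keep)}
    bonds = {frozenset(new_index[a] for a in bond)
             for bond in mol.bonds if all(a in new_index for a in bond)}
    implicit_h = [mol.implicit_h[i] + extra_h[i] for i in keep]
    return Mol([mol.elements[i] for i in keep], implicit_h, bonds)


methane = Mol(["C"], [4])                  # one atom, no bonds
explicit = add_explicit_hydrogens(methane)  # five atoms, four bonds
roundtrip = suppress_hydrogens(explicit)    # back to one atom, no bonds
```

For methane the round trip is lossless, which is the sense in which either form can be deduced from the other.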
handle with care

• Isotopes of hydrogen: deuterium, tritium and
  protium.
• Charged hydrogens.
• Bridging hydrogens.
• Hydrogens with atom maps.
• Hydrogens with non-single bonds.

• Preserving chirality on tetrahedral parent.
hydrogens in file formats
• Implicit/Explicit is also preserved with file I/O.
• SMILES: [H]C([H])([H])[H] vs C (or [CH4]).
• MDL connection tables:
[CH4]

  – Implicit hydrogens (RDKit):

     RDKit

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$

  – Explicit hydrogens (OpenBabel):

 OpenBabel10021215582D

  5  4  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  1  5  1  0  0  0  0
M  END
$$$$
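The difference between the two methane records is already visible in the V2000 counts line, whose first two three-character fields hold the atom and bond counts. A minimal sketch of reading them follows; `parse_counts_line` is a hypothetical helper, not part of any toolkit named here.

```python
def parse_counts_line(line):
    """Return (atom_count, bond_count) from an MDL V2000 counts line."""
    if "V2000" not in line:
        raise ValueError("not a V2000 counts line")
    # Fixed-width fields: columns 1-3 are the atom count, 4-6 the bond count.
    return int(line[0:3]), int(line[3:6])


implicit_form = "  1  0  0  0  0  0  0  0  0  0999 V2000"
explicit_form = "  5  4  0  0  0  0  0  0  0  0999 V2000"

print(parse_counts_line(implicit_form))  # methane with implicit hydrogens
print(parse_counts_line(explicit_form))  # methane with explicit hydrogens
```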
hydrogen related properties

• A number of properties are influenced by the
  choice of hydrogen representation.
  – NumAtoms()
  – ImplicitHCount() / ExplicitHCount()/TotalHCount()
  – Degree() / ExplicitDegree() / HvyDegree()
  – Valence() / ExplicitValence() / HvyValence()
• Several of these are exposed by SMARTS.
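To see how these properties depend on the representation, consider the carbon of methane in both forms. The sketch below uses a hypothetical `Atom` class; the method names echo the slide, but the definitions are assumptions chosen for illustration (in particular, that the unqualified `degree` and `valence` count implicit hydrogens while the `explicit_` variants do not), and individual toolkits may define them differently.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    element: str
    implicit_h: int = 0
    neighbours: tuple = ()  # (element, bond_order) per explicit neighbour

    def explicit_h_count(self):
        return sum(1 for element, _ in self.neighbours if element == "H")

    def total_h_count(self):
        return self.implicit_h + self.explicit_h_count()

    def explicit_degree(self):
        return len(self.neighbours)

    def degree(self):
        # Assumed definition: explicit connections plus implicit hydrogens.
        return self.explicit_degree() + self.implicit_h

    def hvy_degree(self):
        return sum(1 for element, _ in self.neighbours if element != "H")

    def explicit_valence(self):
        return sum(order for _, order in self.neighbours)

    def valence(self):
        # Assumed definition: explicit bond orders plus implicit hydrogens.
        return self.explicit_valence() + self.implicit_h


implicit_c = Atom("C", implicit_h=4)                 # "C" or "[CH4]"
explicit_c = Atom("C", neighbours=(("H", 1),) * 4)   # "[H]C([H])([H])[H]"
```

Under these definitions the totals (`total_h_count`, `degree`, `valence`) agree between the two forms, while the explicit-only properties differ.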
atom types

• Another distinction between computational
  chemistry toolkits and cheminformatics
  toolkits concerns the roles of “atom types”.
• In cheminformatics, atoms can be described
  by the atomic number and charge, but in
  molecular modeling simulations more is
  required → Domain of applicability.
• Ideally, integer atom types should be an
  annotation not a fundamental pre-requisite.
aromaticity

• The primary function of aromaticity in
  cheminformatics is to treat the multiple
  Kekulé forms of benzenoid rings as equivalent.
slippery slope

• Unfortunately the problem with aromaticity is
  that it’s a useful physical property of a
  molecule with uses beyond cheminformatics!
• Conjugated structures behave differently to
  saturated systems, and hence the degree to
  which a pi-system is shared and resonance
  stabilized is useful in computational chemistry.
aromaticity and qsar

Name: Furan   Test: #96   Exptl: 1.34

Original XLogP
  Atom type   Per atom   Subtotal
  R-O-R         0.327      0.327
  R=CHX        -0.166     -0.332
  R=CHR         0.236      0.472
  H             0.046      0.184
  Total:                   0.651

Aromatic Furan XLogP
  Atom type   Per atom   Subtotal
  R-O-R         0.327      0.327
  R-CH-X        0.142      0.284
  R-CH-R        0.281      0.562
  H             0.046      0.184
  Total:                   1.357
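The two furan totals can be reproduced by summing the per-atom contributions. In the sketch below the atom counts (1, 2, 2, 4) are not stated on the slide; they are inferred from the ratio of each subtotal to its per-atom value and are consistent with furan's one oxygen, four CH carbons and four hydrogens.

```python
# (per-atom contribution, number of atoms); counts are inferred, see above
original_types = {
    "R-O-R": (0.327, 1),
    "R=CHX": (-0.166, 2),
    "R=CHR": (0.236, 2),
    "H": (0.046, 4),
}
aromatic_types = {
    "R-O-R": (0.327, 1),
    "R-CH-X": (0.142, 2),
    "R-CH-R": (0.281, 2),
    "H": (0.046, 4),
}
EXPERIMENTAL_LOGP = 1.34


def additive_logp(table):
    """Sum contribution * count over all atom types, rounded to 3 places."""
    return round(sum(value * count for value, count in table.values()), 3)


for label, table in (("original", original_types), ("aromatic", aromatic_types)):
    predicted = additive_logp(table)
    error = abs(predicted - EXPERIMENTAL_LOGP)
    print(f"{label}: predicted {predicted:.3f}, error {error:.3f}")
```

Typing the ring atoms as aromatic moves the prediction from 0.651 to 1.357, against the experimental 1.34.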
aromaticity

• The upshot is that different notions of
  aromaticity make sense in different contexts,
  with no one universally accepted [acceptable]
  answer.
  – Aromaticity is one of those unpleasant topics that is
    simultaneously simple and impossibly complicated. Since
    neither experimental nor theoretical chemists can agree
    with each other about a definition, it’s necessary to pick
    something arbitrary and stick to it. This is the approach
    taken in the RDKit.
aromaticity models

• A better approach is to acknowledge the
  different (conflicting) definitions of
  aromaticity and support them much like atom
  types; using the Tripos aromaticity model for
  XLogP and handling Sybyl mol2 files, the
  Daylight aromaticity model for SMILES strings,
  the MDL aromaticity model for MACCS keys,
  the CACTVS aromaticity model for PubChem
  fingerprints and so on.
quinonoid forms

• Unfortunately, quinonoid forms make even
  the simplest aromaticity models (MDL’s) more
  complex than just alternating single and
  double bonds around a six-membered ring.
longer paths

• MDL’s aromaticity model doesn’t consider
  azulene to be aromatic.
anti-aromaticity

• Some conjugated ring systems perhaps
  shouldn’t be considered equivalent...
buried ring aromaticity
avoid sssr for aromaticity




SMILES: C12P3P1P23   SMILES: P12C3P1P23
you don’t want sssr anyway!
aromaticity models by atom type
ToolKit     *CH=*   *O*   *N*   *C(=O)*   *As=*   *Te*   *B*   *B=*   *S(=O)*
MDL           1      N     N       N        N      N      N     N        N
Tripos        1      N     N       N        N      N      N     N        N
Merck         1      2     2       N        N      N      N     N        N
Daylight      1      2     2       0        1      N      N     N        2
OEChem        1      2     2       0        1      2      N     N        2
RDKit         1      2     2       0        N      2      N     1        2
ChemAxon      1      2     2       0        1      N      N     N        2
OpenBabel     1      2     2       0        N      N      N     N        2
Indigo        1      2     2       N        1      2!     0     1        N
CDK           1      2     2       N        1      N      N     N        N
Derwent       1                    1?


             π electrons contributed to the Hückel 4n+2 calculation
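The table can be read as a lookup from atom type to π-electron contribution, followed by a 4n+2 test on the ring total. The sketch below encodes only the MDL and Daylight rows for three atom types and reads "N" as "this atom type cannot be aromatic in this model", which is an interpretation of the table rather than something the slide states. Real toolkits apply further conditions (ring membership, fused systems, exocyclic bonds) that are not modelled here.

```python
# None stands for "N" in the table: the atom type is not allowed to be aromatic.
PI_ELECTRONS = {
    "MDL": {"*CH=*": 1, "*O*": None, "*N*": None},
    "Daylight": {"*CH=*": 1, "*O*": 2, "*N*": 2},
}


def satisfies_huckel(ring_atom_types, model):
    """True if every ring atom may be aromatic and the pi total is 4n+2."""
    total = 0
    for atom_type in ring_atom_types:
        electrons = PI_ELECTRONS[model].get(atom_type)
        if electrons is None:
            return False
        total += electrons
    return total >= 2 and (total - 2) % 4 == 0


benzene = ["*CH=*"] * 6
furan = ["*O*"] + ["*CH=*"] * 4
cyclobutadiene = ["*CH=*"] * 4

for name, ring in (("benzene", benzene), ("furan", furan),
                   ("cyclobutadiene", cyclobutadiene)):
    verdicts = {model: satisfies_huckel(ring, model) for model in PI_ELECTRONS}
    print(name, verdicts)
```

With these rows, benzene passes under both models, furan passes only under the Daylight row, and cyclobutadiene (four π electrons) fails under both.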
valence models

• To make life more interesting, cheminformatics
  file formats that support implicit hydrogens
  (Daylight and MDL) allow the number of implicit
  hydrogens on an atom to be omitted, to save
  space, relying instead on the default number of
  implicit hydrogens for a given atom’s
  environment.
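A default-hydrogen rule of this kind can be sketched for unbracketed, non-aromatic SMILES atoms: fill the atom up to its lowest "normal" valence that is not below the current bond order sum. The valence table below follows the rule as commonly documented for SMILES (for example in the OpenSMILES specification) and is not taken from the slides; `default_implicit_h` is a hypothetical helper. Charged atoms, aromatic atoms and the MDL model need their own rules.

```python
# Normal valences for the SMILES organic subset (assumption: as commonly
# documented for SMILES; not part of the original slides).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def default_implicit_h(element, bond_order_sum):
    """Implicit H count for an unbracketed aliphatic atom."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bond order sum already exceeds every normal valence


print(default_implicit_h("C", 0))  # "C" is methane
print(default_implicit_h("N", 3))  # trimethylamine nitrogen
print(default_implicit_h("S", 3))  # rounds up to the next normal valence
```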
example valence model

• For example, Daylight’s valence model for
  SMILES is that all aromatic nitrogens have no
  implicit hydrogens by default.
• If a hydrogen is present, it has to be specified
  as “[nH]”.
• This means that “[n]” should never occur in a
  generated/canonical SMILES string.
the “mdlbench.sdf” benchmark

• To test the fidelity of MDL valence
  implementations in the SD file readers, we
  evaluate the SMILES generated from a
  standard reference test file.
• This SD file contains 10,208 connection tables:
  – 114 different elements plus “D” and “T”
  – 11 charge states (from -4 to +6 inclusive)
  – 8 environments (minimum valences from 0 to 7)
  – 116*11*8 = 10208
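The size of the benchmark follows from enumerating every combination of element, charge and environment. The sketch below only reproduces the count; the element labels are placeholders (`Z1` to `Z114` plus `D` and `T`) and say nothing about how the actual file was generated.

```python
from itertools import product

# 114 elements (placeholder labels by atomic number) plus "D" and "T"
elements = [f"Z{z}" for z in range(1, 115)] + ["D", "T"]
charges = range(-4, 7)        # -4 to +6 inclusive: 11 charge states
min_valences = range(0, 8)    # 0 to 7: 8 environments

grid = list(product(elements, charges, min_valences))
print(len(elements), len(charges), len(min_valences), len(grid))
```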
mdlbench.sdf examples

• The correct valence is specified by MDL/ISIS.

• Neutral carbon should be four valent.
  – Everyone agrees on this (hopefully).
• +1 Nitrogen cation should be four valent.
• +1 Carbon cation should be three valent.
• -4 Boron should have a valence of 1.
  – OEChem says 0, OpenBabel says 3, RDKit says 7...
mdlbench.sdf results
ToolKit               Failures   Incorrect   Correct   Recall   Precision
OEChem v1.9             264         22        9922     97.20%   99.78%
CDK v1.4.13             264        486        9458     92.65%   95.11%
OpenBabel v2.3.90       176        668        9364     91.73%   93.34%
CACTVS v3.407           352        511        9339     91.49%   94.81%
MDL Direct v8.0         968         22        9218     90.30%   99.76%
ChemAxon v5.10          440        685        9083     88.98%   92.99%
Pipeline Pilot v9.0     968        243        8997     88.14%   97.37%
ChemDraw v12.0          704        548        8956     87.74%   94.23%
GGA Indigo v1.1.4      2797        184        7227     70.80%   97.52%
MOE v2011.10           4221        936        5051     49.48%   84.37%
RDKit v2012_09         4095        4723       1390     13.62%   22.74%
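The Recall and Precision columns are not defined on the slide. The figures are consistent with recall being the correct count over all 10,208 records and precision being the correct count over the records that produced an answer (correct plus incorrect). The sketch below assumes those definitions and checks them against the first and last rows.

```python
TOTAL_RECORDS = 10208


def recall(correct, total=TOTAL_RECORDS):
    """Percentage of all benchmark records answered correctly."""
    return 100.0 * correct / total


def precision(correct, incorrect):
    """Percentage of the answers given that were correct."""
    return 100.0 * correct / (correct + incorrect)


# (failures, incorrect, correct) copied from the table
rows = {
    "OEChem v1.9": (264, 22, 9922),
    "RDKit v2012_09": (4095, 4723, 1390),
}

for name, (failures, incorrect, correct) in rows.items():
    print(f"{name}: recall {recall(correct):.2f}%, "
          f"precision {precision(correct, incorrect):.2f}%")
```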
mdlbench.sdf summary

• Although many toolkits can read MDL SD files,
  they don’t all agree on the semantics.
• Different readers, sometimes from the same
  company, can generate different SMILES from
  the exact same MOL file.
• Fortunately, most compounds encountered in
  the pharmaceutical industry fall into the
  widely understood “well-behaved” subset.
a little about performance
• Time to find molecules containing chlorine in the
  250,251 SMILES of the NCI August 2000 data set.
                 ToolKit         Times (secs)   Unfair   Count
     UNIX grep                      0.06        0.06     43509
     OpenEye OEChem v1.8             4.7         2.5     43509
     Daylight Toolkit v4.95          8.0                 43509
     OELib CVS (2002-04-01)         32.8        3.25     43509
     ChemAxon JChem v5.10           39.4                 43509
     GGA Indigo Toolkit v1.1.4      41.6                 43508*
     RDKit v2011_03_2               109.6                42692*
     OpenBabel v2.3.1               148.0                43509
     CDK v1.4.13                    1278                 43501*
     PerlMol v0.35                  1775                 43509
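The grep baseline does no chemistry at all: it only looks for the two characters "Cl" in each line. A rough Python equivalent is sketched below, assuming one record per line with the SMILES as the first whitespace-separated field; the function name and the sample records are made up for illustration. A toolkit-based search would instead parse each SMILES and test atomic numbers.

```python
def count_chlorine_records(lines):
    """Count records whose SMILES field contains the substring 'Cl'."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields and "Cl" in fields[0]:
            count += 1
    return count


sample = [
    "CCl chloroethane",
    "c1ccccc1 benzene",
    "[Na+].[Cl-] salt",
    "CC(=O)O acetic_acid",
    "ClC(Cl)Cl chloroform",
    "",
]
print(count_chlorine_records(sample))
```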
conclusions

• There is a one-to-one mapping between MDL
  connection tables and SMILES, and perfect
  portable round-tripping should be possible.
• A major step towards this is to separate the
  chemistry from the computer science in
  cheminformatics toolkits to permit flexible
  models of valence and aromaticity.
acknowledgements

• Sorel Muresan and Plamen Petrov, AstraZeneca,
  Molndal, Sweden.
• Richard Hall, Astex Pharmaceuticals, Cambridge, UK.
• Markus Sitzmann, NCI, Washington DC, USA.
• Daniel Lowe and Noel O’Boyle, NextMove Software,
  Cambridge, UK.

• Thank you for your time. Questions?

 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up Friday
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 

Cheminformatics toolkits: a personal perspective

  • 1. cheminformatics toolkits: a personal perspective. Roger Sayle, NextMove Software Ltd, Cambridge, UK. 1st RDKit User Group Meeting, London, UK, 4th October 2012
  • 2. overview • Models of Chemistry – Implicit and Explicit Hydrogens – Atom types – Aromaticity Models – Valence Models • The “mdlbench.sdf” benchmark • Some words on performance
  • 3. about the author • RasMol, 1989-1997. • GSK (Glaxo Wellcome), 1994-1997. • Daylight/Metaphorics, 1998-2001. • OpenEye Scientific Software, 2001-2010. • AstraZeneca, Abbott & NCBI, 2009. • NextMove Software, 2010-
  • 4. models of chemistry • Algorithms + Data Structures = Programs • Abstraction of storage and representation from assumptions about the “real world”.
  • 5. connection table compression • A distinguishing feature between computational chemistry and cheminformatics is the representation of hydrogens. • Typically in organic chemistry, approximately half of the bonds in a “full” connection table are the bonds to terminal hydrogen atoms.
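The "approximately half" figure can be checked with simple arithmetic. The sketch below is only an illustration: it assumes a single connected molecule, so the number of heavy-atom bonds is heavy atoms − 1 + rings, and the function name is hypothetical.

```python
def hydrogen_bond_fraction(heavy_atoms: int, rings: int, hydrogens: int) -> float:
    """Fraction of bonds in a full connection table that go to hydrogens.

    Assumes one connected component, so the heavy-atom skeleton has
    heavy_atoms - 1 + rings bonds (double bonds count as one bond each).
    """
    heavy_bonds = heavy_atoms - 1 + rings
    return hydrogens / (heavy_bonds + hydrogens)


# Benzene, C6H6: 6 ring bonds plus 6 C-H bonds.
print(hydrogen_bond_fraction(6, 1, 6))
# Aspirin, C9H8O4: 13 heavy atoms, one ring, 8 hydrogens.
print(hydrogen_bond_fraction(13, 1, 8))
```

Suppressing the hydrogens removes exactly the bonds counted in the numerator, which is where the compression comes from.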
  • 6. hydrogen representations • The two representations of the structure are equivalent, and either form can be deduced from the other [or even a hybrid form]. • In OEChem terminology: – OESuppressHydrogens(mol) – OEAddExplicitHydrogens(mol) • Confusingly, Daylight’s SMILES toolkit auto-suppresses hydrogens during dt_modoff, and a bug in OEChem sprouts chiral hydrogens.
  • 7. handle with care • Isotopes of hydrogen: deuterium, tritium and protium. • Charged hydrogens. • Bridging hydrogens. • Hydrogens with atom maps. • Hydrogens with non-single bonds. • Preserving chirality on tetrahedral parent.
  • 8. hydrogens in file formats • Implicit/Explicit is also preserved with file I/O. • SMILES: [H]C([H])([H])[H] vs C (or [CH4]). • MDL connection tables (shown side by side on the slide):

      Implicit hydrogens (1 atom, 0 bonds), program line "RDKit":

        [CH4]
             RDKit

          1  0  0  0  0  0  0  0  0  0999 V2000
            0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
        M  END
        $$$$

      Explicit hydrogens (5 atoms, 4 bonds), program line "OpenBabel10021215582D":

         OpenBabel10021215582D

          5  4  0  0  0  0  0  0  0  0999 V2000
            0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
            0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
            0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
            0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
            0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
          1  2  1  0  0  0  0
          1  3  1  0  0  0  0
          1  4  1  0  0  0  0
          1  5  1  0  0  0  0
        M  END
        $$$$
  • 9. hydrogen related properties • A number of properties are influenced by the choice of hydrogen representation. – NumAtoms() – ImplicitHCount() / ExplicitHCount()/TotalHCount() – Degree() / ExplicitDegree() / HvyDegree() – Valence() / ExplicitValence() / HvyValence() • Several of these are exposed by SMARTS.
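To make the dependence on representation concrete, here is a minimal sketch of a toy atom class. The method names echo the slide, but their exact semantics are assumptions made for illustration (each toolkit defines them slightly differently), and the sketch handles single bonds only.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Atom:
    symbol: str
    implicit_h: int = 0
    neighbors: list["Atom"] = field(default_factory=list)  # explicit neighbours only

    def explicit_h_count(self) -> int:
        return sum(1 for n in self.neighbors if n.symbol == "H")

    def total_h_count(self) -> int:
        return self.implicit_h + self.explicit_h_count()

    def explicit_degree(self) -> int:
        return len(self.neighbors)

    def degree(self) -> int:
        # Assumed here to include implicit hydrogens.
        return len(self.neighbors) + self.implicit_h

    def hvy_degree(self) -> int:
        return sum(1 for n in self.neighbors if n.symbol != "H")


def methane_implicit() -> list[Atom]:
    return [Atom("C", implicit_h=4)]


def methane_explicit() -> list[Atom]:
    carbon = Atom("C")
    atoms = [carbon]
    for _ in range(4):
        hydrogen = Atom("H")
        carbon.neighbors.append(hydrogen)
        hydrogen.neighbors.append(carbon)
        atoms.append(hydrogen)
    return atoms


for build in (methane_implicit, methane_explicit):
    atoms = build()
    c = atoms[0]
    print(build.__name__, len(atoms), c.explicit_degree(), c.degree(), c.total_h_count())
```

In this sketch the atom count and the explicit degree change with the representation, while the total hydrogen count, degree and heavy degree do not.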
  • 10. atom types • Another distinction between computational chemistry toolkits and cheminformatics toolkits concerns the roles of “atom types”. • In cheminformatics, atoms can be described by the atomic number and charge, but in molecular modeling simulations more is required → Domain of applicability. • Ideally, integer atom types should be an annotation not a fundamental pre-requisite.
  • 11. aromaticity • The primary function of aromaticity in cheminformatics is to treat the multiple Kekulé forms of benzenoid rings as equivalent.
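As a rough illustration of that equivalence, the sketch below maps both Kekulé forms of an isolated six-membered ring onto one aromatic form. The representation (a tuple of bond orders walked around the ring) and the function name are assumptions made for this example; it deliberately ignores everything beyond strict single/double alternation.

```python
AROMATIC = "a"


def perceive_benzenoid(ring_bond_orders: tuple) -> tuple:
    """Return an all-aromatic ring if the six bonds strictly alternate 1,2,1,2,..."""
    n = len(ring_bond_orders)
    alternating = n == 6 and all(
        {ring_bond_orders[i], ring_bond_orders[(i + 1) % n]} == {1, 2}
        for i in range(n)
    )
    if alternating:
        return tuple(AROMATIC for _ in ring_bond_orders)
    return tuple(ring_bond_orders)


kekule_a = (1, 2, 1, 2, 1, 2)
kekule_b = (2, 1, 2, 1, 2, 1)
print(kekule_a == kekule_b)
print(perceive_benzenoid(kekule_a) == perceive_benzenoid(kekule_b))
```

The two input tuples differ, but after perception they compare equal, which is all that canonicalisation and matching need.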
  • 12. slippery slope • Unfortunately the problem with aromaticity is that it's a useful physical property of a molecule with uses beyond cheminformatics! • Conjugated structures behave differently to saturated systems, and hence the degree to which a pi-system is shared and resonance stabilized is useful in computational chemistry.
  • 13. aromaticity and QSAR • Name: Furan • XLogP Test: #96 • Exptl: 1.34

      Original XLogP
      Atom type   Contribution   Subtotal
      R-O-R           0.327        0.327
      R=CHX          -0.166       -0.332
      R=CHR           0.236        0.472
      H               0.046        0.184
      Total:                       0.651

      Aromatic Furan XLogP
      Atom type   Contribution   Subtotal
      R-O-R           0.327        0.327
      R-CH-X          0.142        0.284
      R-CH-R          0.281        0.562
      H               0.046        0.184
      Total:                       1.357
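The two totals on this slide are plain sums of per-atom contributions. The sketch below reproduces them; the atom counts for furan (one ring oxygen, two carbons next to it, two further carbons, four hydrogens) are inferred from the subtotals on the slide, and the function name is hypothetical.

```python
def additive_logp(contributions: dict, counts: dict) -> float:
    """Sum contribution * count over all atom types present."""
    return sum(contributions[atom_type] * n for atom_type, n in counts.items())


# Furan typed with the original (non-aromatic) atom types.
original = {"R-O-R": 0.327, "R=CHX": -0.166, "R=CHR": 0.236, "H": 0.046}
original_counts = {"R-O-R": 1, "R=CHX": 2, "R=CHR": 2, "H": 4}

# Furan typed with aromatic atom types.
aromatic = {"R-O-R": 0.327, "R-CH-X": 0.142, "R-CH-R": 0.281, "H": 0.046}
aromatic_counts = {"R-O-R": 1, "R-CH-X": 2, "R-CH-R": 2, "H": 4}

print(round(additive_logp(original, original_counts), 3))
print(round(additive_logp(aromatic, aromatic_counts), 3))
```

Only the atom typing changes between the two runs, yet the prediction moves from 0.651 to 1.357 against an experimental value of 1.34.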
  • 14. aromaticity • The upshot is that different notions of aromaticity make sense in different contexts, with no one universally accepted [acceptable] answer. – Aromaticity is one of those unpleasant topics that is simultaneously simple and impossibly complicated. Since neither experimental nor theoretical chemists can agree with each other about a definition, it’s necessary to pick something arbitrary and stick to it. This is the approach taken in the RDKit.
  • 15. aromaticity models • A better approach is to acknowledge the different (conflicting) definitions of aromaticity and support them much like atom types; using the Tripos aromaticity model for XLogP and handling Sybyl mol2 files, the Daylight aromaticity model for SMILES strings, the MDL aromaticity model for MACCS keys, the CACTVS aromaticity model for PubChem fingerprints and so on.
  • 16. quinonoid forms • Unfortunately, quinonoid forms make even the simplest aromaticity models (MDL’s) more complex than just alternating single and double bonds around a six-membered ring.
  • 17. longer paths • MDL’s aromaticity model doesn’t consider azulene to be aromatic.
  • 18. anti-aromaticity • Some conjugated ring systems perhaps shouldn’t be considered equivalent...
  • 20. avoid sssr for aromaticity SMILES: C12P3P1P23 SMILES: P12C3P1P23
  • 21. you don’t want sssr anyway!
  • 22. aromaticity models by atom type • π electrons contributed to the Hückel 4n+2 calculation:

      ToolKit    *CH=*  *O*  *N*  *C(=O)*  *As=*  *Te*  *B*  *B=*  *S(=O)*
      MDL        1      N    N    N        N      N     N    N     N
      Tripos     1      N    N    N        N      N     N    N     N
      Merck      1      2    2    N        N      N     N    N     N
      Daylight   1      2    2    0        1      N     N    N     2
      OEChem     1      2    2    0        1      2     N    N     2
      RDKit      1      2    2    0        N      2     N    1     2
      ChemAxon   1      2    2    0        1      N     N    N     2
      OpenBabel  1      2    2    0        N      N     N    N     2
      Indigo     1      2    2    N        1      2!    0    1     N
      CDK        1      2    2    N        1      N     N    N     N
      Derwent    1      1?
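A table like this can drive a pluggable aromaticity test. The sketch below is a simplification under stated assumptions: it treats "N" as meaning the atom type cannot take part in an aromatic ring, it looks at a single ring in isolation, and it applies only the 4n+2 electron count (the real models add further constraints, such as ring size for MDL). The dictionary and function names are made up for this example; the numbers are the Daylight and MDL rows of the slide.

```python
# None stands for the "N" entries: atom type assumed not allowed in an aromatic ring.
DAYLIGHT = {
    "*CH=*": 1, "*O*": 2, "*N*": 2, "*C(=O)*": 0, "*As=*": 1,
    "*Te*": None, "*B*": None, "*B=*": None, "*S(=O)*": 2,
}
MDL = {
    "*CH=*": 1, "*O*": None, "*N*": None, "*C(=O)*": None, "*As=*": None,
    "*Te*": None, "*B*": None, "*B=*": None, "*S(=O)*": None,
}


def is_huckel_aromatic(ring_atom_types: list, model: dict) -> bool:
    """Apply the 4n+2 rule to the pi electrons contributed by each ring atom."""
    total = 0
    for atom_type in ring_atom_types:
        electrons = model.get(atom_type)
        if electrons is None:
            return False
        total += electrons
    return total >= 2 and (total - 2) % 4 == 0


benzene = ["*CH=*"] * 6
furan = ["*CH=*"] * 4 + ["*O*"]
cyclobutadiene = ["*CH=*"] * 4

for name, ring in (("benzene", benzene), ("furan", furan), ("cyclobutadiene", cyclobutadiene)):
    print(name, is_huckel_aromatic(ring, DAYLIGHT), is_huckel_aromatic(ring, MDL))
```

Keeping the per-atom contributions in data rather than in code is one way to support several conflicting models side by side.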
  • 23. valence models • To make life more interesting, the cheminformatics file formats that support implicit hydrogens (Daylight's and MDL's) allow the number of implicit hydrogens on an atom to be omitted to save space, relying instead on the default number of implicit hydrogens for a given atom's environment.
  • 24. example valence model • For example, Daylight’s valence model for SMILES is that all aromatic nitrogens have no implicit hydrogens by default. • If a hydrogen is present, it has to be specified as “[nH]”. • This means that “[n]” should never occur in a generated/canonical SMILES string.
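A valence model of this kind can be sketched as a small lookup. The aromatic-nitrogen rule below is the one stated on the slide; the table of normal valences for aliphatic atoms is an assumption taken from the usual description of the SMILES organic subset, and the function names are hypothetical. Aromatic atoms other than nitrogen are deliberately left out.

```python
# Assumed normal valences for the SMILES organic subset (aliphatic atoms).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def default_implicit_h(symbol: str, bond_order_sum: int, aromatic: bool = False) -> int:
    """Hydrogens implied when an unbracketed atom gives no explicit H count."""
    if aromatic:
        if symbol == "N":
            return 0  # slide 24: aromatic nitrogens default to no hydrogens
        raise NotImplementedError("only aromatic nitrogen is covered in this sketch")
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0


def aromatic_nitrogen_token(h_count: int) -> str:
    """Write an uncharged aromatic nitrogen; '[n]' is never needed."""
    return "n" if h_count == 0 else f"[nH{h_count if h_count > 1 else ''}]"


print(default_implicit_h("C", 0))                 # methane carbon written as "C"
print(default_implicit_h("N", 2, aromatic=True))  # pyridine-type nitrogen
print(aromatic_nitrogen_token(0), aromatic_nitrogen_token(1))
```

Because the default for aromatic nitrogen is zero, a writer only needs brackets when a hydrogen is actually present, which is why "[n]" should not appear in generated SMILES.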
  • 25. the “mdlbench.sdf” benchmark • To test the fidelity of MDL valence implementations in the SD file readers, we evaluate the SMILES generated from a standard reference test file. • This SD file contains 10,208 connection tables: – 114 different elements plus “D” and “T” – 11 charge states (from -4 to +6 inclusive) – 8 environments (minimum valences from 0 to 7) – 116*11*8 = 10208
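The size of the benchmark follows directly from the three axes listed above. A minimal sketch of the enumeration, using atomic numbers as stand-ins for the 114 element symbols (an assumption; the actual file layout and ordering are not described on the slide):

```python
from itertools import product

# 114 elements (placeholder labels Z1..Z114) plus the hydrogen isotopes "D" and "T".
symbols = [f"Z{z}" for z in range(1, 115)] + ["D", "T"]
charges = range(-4, 7)       # 11 charge states, -4 to +6 inclusive
min_valences = range(0, 8)   # 8 environments, minimum valence 0 to 7

cases = list(product(symbols, charges, min_valences))
print(len(symbols), len(charges), len(min_valences), len(cases))
```

Each tuple corresponds to one connection table in the test file, giving 116 × 11 × 8 = 10208 cases.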
  • 26. mdlbench.sdf examples • The correct valence is specified by MDL/ISIS. • Neutral carbon should be four valent. – Everyone agrees on this (hopefully). • +1 Nitrogen cation should be four valent. • +1 Carbon cation should be three valent. • -4 Boron should have a valence of 1. – OEChem says 0, OpenBabel says 3, RDKit says 7...
  • 27. mdlbench.sdf results

      ToolKit              Failures  Incorrect  Correct  Recall  Precision
      OEChem v1.9               264         22     9922  97.20%     99.78%
      CDK v1.4.13               264        486     9458  95.11%     95.11%
      OpenBabel v2.3.90         176        668     9364  91.73%     93.34%
      CACTVS v3.407             352        511     9339  91.49%     94.81%
      MDL Direct v8.0           968         22     9218  90.30%     99.76%
      ChemAxon v5.10            440        685     9083  88.98%     92.99%
      Pipeline Pilot v9.0       968        243     8997  88.14%     97.37%
      ChemDraw v12.0            704        548     8956  87.74%     94.23%
      GGA Indigo v1.1.4        2797        184     7227  70.80%     97.52%
      MOE v2011.10             4221        936     5051  49.48%     84.37%
      RDKit v2012_09           4095       4723     1390  13.62%     22.74%
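The slide does not define its two percentage columns. The sketch below uses definitions inferred from the numbers: recall as correct answers over all 10,208 test cases, and precision as correct answers over the answers actually produced (correct plus incorrect). Treat both formulas as assumptions.

```python
TOTAL_CASES = 10208


def recall(correct: int, total: int = TOTAL_CASES) -> float:
    """Share of all test cases answered correctly."""
    return correct / total


def precision(correct: int, incorrect: int) -> float:
    """Share of produced answers (failures excluded) that are correct."""
    return correct / (correct + incorrect)


rows = {
    "OEChem v1.9": (22, 9922),      # (incorrect, correct)
    "RDKit v2012_09": (4723, 1390),
}
for name, (incorrect, correct) in rows.items():
    print(f"{name}: recall {recall(correct):.2%}, precision {precision(correct, incorrect):.2%}")
```

Under these definitions most rows of the table are reproduced to two decimal places. The CDK row is an exception: its counts give a recall of about 92.65% rather than the listed 95.11%, so that figure may be a slip on the slide.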
  • 28. mdlbench.sdf summary • Although many toolkits can read MDL SD files, they don’t all agree on the semantics. • Different readers, sometimes from the same company, can generate different SMILES from the exact same MOL file. • Fortunately, most compounds encountered in the pharmaceutical industry fall into the widely understood “well-behaved” subset.
  • 29. a little about performance • Time to find molecules containing chlorine in the 250,251 SMILES of the NCI August 2000 data set.

      ToolKit                    Time (secs)  Unfair  Count
      UNIX grep                         0.06    0.06  43509
      OpenEye OEChem v1.8                4.7     2.5  43509
      Daylight Toolkit v4.95             8.0          43509
      OELib CVS (2002-04-01)            32.8    3.25  43509
      ChemAxon JChem v5.10              39.4          43509
      GGA Indigo Toolkit v1.1.4         41.6          43508*
      RDKit v2011_03_2                 109.6          42692*
      OpenBabel v2.3.1                 148.0          43509
      CDK v1.4.13                       1278          43501*
      PerlMol v0.35                     1775          43509
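The grep baseline works because, in SMILES, the two characters "Cl" only ever spell chlorine. A minimal sketch of the same text-level test follows; the sample lines are invented, and the function assumes one SMILES per line, optionally followed by whitespace and a name.

```python
def count_containing_chlorine(lines) -> int:
    """Count records whose SMILES field contains the substring 'Cl'."""
    count = 0
    for line in lines:
        fields = line.split()
        # Look only at the SMILES field so that names cannot cause false hits.
        if fields and "Cl" in fields[0]:
            count += 1
    return count


sample = [
    "CCCl chloroethane",
    "c1ccccc1 benzene",
    "[Na+].[Cl-] salt",
    "CC(C)O Clearly-not-chlorine",
    "ClCCl",
]
print(count_containing_chlorine(sample))
```

A toolkit has to parse every molecule before it can answer the same question, so the gap to grep is a rough measure of parsing overhead.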
  • 30. conclusions • There is a one-to-one mapping between MDL connection tables and SMILES, and perfect portable round-tripping should be possible. • A major step towards this is to separate the chemistry from the computer science in cheminformatics toolkits to permit flexible models of valence and aromaticity.
  • 31. acknowledgements • Sorel Muresan and Plamen Petrov, AstraZeneca, Mölndal, Sweden. • Richard Hall, Astex Pharmaceuticals, Cambridge, UK. • Markus Sitzmann, NCI, Washington DC, USA. • Daniel Lowe and Noel O’Boyle, NextMove Software, Cambridge, UK. • Thank you for your time. Questions?