5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany

Chemistry Space Analysis
How many small molecules are there currently?
Since the early 2000s, the number of databases "publishing" small molecules has grown enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank). What is the overlap?
There are many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms).
The number of chemical structure identifiers keeps growing (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …).

Chemical Identifier Resolver
The resolver links a chemical structure to the following identifiers and representations: NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, chemical formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image.

Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers.
Allows one to convert a given structure identifier into another representation or structure identifier.
First beta release: July 2009
Current release (beta 4): April 2011

Chemical Identifier Resolver (NCI/CADD Web Resources)
The resolver is usable through a simple URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
Example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8
MIME type of the response: text/plain
XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
If a request is not resolvable, the service responds with an HTTP 404 status.
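The URL pattern on this slide can be assembled programmatically. The following is a minimal Python sketch that assumes only the URL scheme shown above; the helper name build_resolver_url is hypothetical, no request is sent, and how the service treats percent-encoded characters is not stated in the slides.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier, representation, xml=False):
    """Build <base>/<identifier>/<representation>[/xml].

    Hypothetical helper: only the URL layout comes from the slide.
    """
    # Percent-encode the identifier so that spaces or slashes in a name
    # stay inside a single path segment (an assumption, not a documented rule).
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


print(build_resolver_url("Tamiflu", "cas"))
print(build_resolver_url("Tamiflu", "cas", xml=True))
```

If such a URL is fetched with urllib.request.urlopen, a non-resolvable request would surface as urllib.error.HTTPError with code 404, which matches the behaviour described on the slide.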

Chemical Identifier Resolver (NCI/CADD Public Web Resources)
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
Accepted as "identifier": chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, FDA UNII
Available as "representation":
/smiles
/names, /iupac_name
/cas
/inchi, /stdinchi
/inchikey, /stdinchikey
/ficts, /ficus, /uuuuu
/image
/file, /sdf
/mw, /monoisotopic_mass
/formula
/twirl, /3d
/urls
/chemspider_id
/pubchem_sid
/chemnavigator_sid

Chemical Identifier Resolver (NCI/CADD Web Resources)
HTTP request: an identifier (e.g. CAS number, chemical name) and the requested representation (e.g. InChI, GIF image)
Step 1: detection of the identifier type
If the identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation (CACTVS)
If the identifier is a hashed structure representation (e.g. InChIKey), a trivial name, etc.: database lookup in the NCI/CADD Chemical Structure Database (CSDB)
HTTP response: the requested representation with the matching MIME type
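The identifier-type detection step can be illustrated with pattern matching. This is only a sketch under stated assumptions: the slides do not describe how the resolver actually detects types, so the function names, the order of the checks, and the returned labels are hypothetical. The InChIKey layout (14-10-1 uppercase letters) and the CAS check-digit rule are the publicly documented ones.

```python
import re

INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# NCI/CADD identifier: 16 hex digits, a tag, and an optional version/checksum suffix.
NCICADD = re.compile(r"^[0-9A-F]{16}-(FICTS|FICuS|uuuuu)(-\d{2}-[0-9A-F]{2})?$")
CAS = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(text):
    """Validate the CAS check digit: weighted digit sum (from the right) mod 10."""
    match = CAS.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def guess_identifier_type(text):
    """Hypothetical classifier; not the resolver's real detection logic."""
    text = text.strip()
    if text.startswith("InChI="):
        return "InChI"
    if INCHIKEY.match(text):
        return "InChIKey"
    if NCICADD.match(text):
        return "NCI/CADD Identifier"
    if cas_checksum_ok(text):
        return "CAS Registry Number"
    return "name or other"


for example in ("204255-11-8", "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "Tamiflu"):
    print(example, "->", guess_identifier_type(example))
```

A full-structure identifier (InChI here) would go to the calculation branch, while the hashed identifiers, CAS numbers, and names would go to the lookup branch described above.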


Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
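A response in this XML format can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on the sample response from the slide (pasted in as a string, so nothing is fetched); the function name results_by_resolver is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the slide (attribute whitespace tidied).
SAMPLE = """<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>"""


def results_by_resolver(xml_text):
    """Map each resolver name to the list of item strings it returned."""
    root = ET.fromstring(xml_text)
    return {
        data.get("resolver"): [item.text.strip() for item in data.findall("item")]
        for data in root.findall("data")
    }


for resolver, items in results_by_resolver(SAMPLE).items():
    print(resolver, items)
```

Grouping by the resolver attribute makes it easy to see whether the different name resolvers agree on the structure, as they do in this sample.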

Chemical Structure Database (CSDB) of the Chemical Identifier Resolver
ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~330 international chemistry suppliers
PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, …
Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, …
Currently:
~150 chemical structure databases
~120 million structure records
~81.6 million unique structures by NCI/CADD FICuS Identifier
~84 million unique structures by Std. InChIKey

NCI/CADD Structure Identifiers FICTS, FICuS, uuuuu

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
The identifiers are based on hashcodes calculated by the chemoinformatics toolkit CACTVS.
CACTVS hashcodes:
represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
have a high sensitivity to structural features of a compound
change if the connectivity changes
Example: 9850FD9F9E2B4E25 (hashcode of the structure drawn on the slide)

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB, database)
Structure normalization: calculation of a set of parent structures with different sensitivity to chemical features
Parent structures: FICTS, FICuS, uuuuu (representation of chemical structures on different levels)
Hashcode calculation (E_HASHISY) on each parent structure
Result: NCI/CADD Identifier

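NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Sensitivity of each identifier to chemical features (a capital letter in the tag means sensitive, "u" means not sensitive):
         Fragments      Isotopes       Charges        Tautomers      Stereo
FICTS    sensitive      sensitive      sensitive      sensitive      sensitive
FICuS    sensitive      sensitive      sensitive      not sensitive  sensitive
uuuuu    not sensitive  not sensitive  not sensitive  not sensitive  not sensitive
Identifier format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
Example (structure drawn on the slide, with Na+ counterion):
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27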
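The identifier layout <hashcode>-<tag>-<version>-<checksum> can be taken apart with a few lines of Python. This is a minimal sketch: the class and function names are hypothetical, and the checksum is only extracted, not verified, because the slides do not say how it is computed.

```python
from typing import NamedTuple

TAGS = ("FICTS", "FICuS", "uuuuu")


class NciCaddIdentifier(NamedTuple):
    hashcode: int   # value of the 16 hex digits (fits in 64 bits, unsigned)
    tag: str        # FICTS, FICuS, or uuuuu
    version: str
    checksum: str   # kept as text; its algorithm is not given in the slides


def parse_identifier(text):
    """Split an NCI/CADD identifier into its four parts (hypothetical helper)."""
    parts = text.replace(" ", "").split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated parts, got {len(parts)}")
    hex_digits, tag, version, checksum = parts
    if len(hex_digits) != 16 or tag not in TAGS:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(int(hex_digits, 16), tag, version, checksum)


def format_hashcode(value):
    """Render a 64-bit unsigned value as 16 uppercase hex digits."""
    return f"{value:016X}"


parsed = parse_identifier("4A122D094098B50D-FICTS-01-1D")
print(parsed.tag, format_hashcode(parsed.hashcode), parsed.version, parsed.checksum)
```

Storing the hashcode as an integer reflects the 64-bit unsigned representation mentioned earlier; format_hashcode restores the 16-digit form used in the identifiers.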

[Figure: histidine and its variant representations: charged form, tautomer, isotope, salt, stereoisomers, and "errors"]

FICTS identifiers for histidine and its variants (charged form, tautomer, isotope, salt, stereoisomers, "errors"), in slide order:
A3DAE0788050DDE4-FICTS
E5F83F10C5DB080A-FICTS
B2FDA68AEDA06DB9-FICTS
9850FD9F9E2B4E25-FICTS
E5F83F10C5DB080A-FICTS
E92E4BA2869F3611-FICTS
8A7AD1EB498CC76A-FICTS
6C16DE2351F9FF50-FICTS
9850FD9F9E2B4E25-FICTS

FICuS identifiers for histidine and its variants (charged form, tautomer, isotope, salt, stereoisomers, "errors"), in slide order:
A3DAE0788050DDE4-FICuS
E5F83F10C5DB080A-FICuS
B2FDA68AEDA06DB9-FICuS
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
9850FD9F9E2B4E25-FICuS
9850FD9F9E2B4E25-FICuS

uuuuu identifiers for histidine and its variants (charged form, tautomer, isotope, stereoisomers, salt, "errors"): all nine entries on the slide share the hashcode 9850FD9F9E2B4E25 (9850FD9F9E2B4E25-uuuuu).

Std. InChIKeys for histidine and its variants (charged form, tautomer, isotope, stereoisomers, salt, "errors"), in slide order:
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-UHFFFAOYSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M
UHPNKBYGGMJTIM-UHFFFAOYSA-M
HNDVDQJCIGZPNO-UHFFFAOYSA-N
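How strongly each identifier level merges the histidine variants can be counted directly from the values on the preceding slides. The sketch below copies those values (hashcodes without the tag suffix) and counts the distinct ones per level; the variable names are illustrative, and which value belongs to which drawn structure is not needed for the count.

```python
# Hashcodes as they appear on the FICTS, FICuS, and uuuuu slides (nine entries each).
ficts = ["A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
         "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
         "8A7AD1EB498CC76A", "6C16DE2351F9FF50", "9850FD9F9E2B4E25"]
ficus = ["A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
         "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
         "8A7AD1EB498CC76A", "9850FD9F9E2B4E25", "9850FD9F9E2B4E25"]
uuuuu = ["9850FD9F9E2B4E25"] * 9
inchikeys = ["HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
             "HNDVDQJCIGZPNO-RXMQYKEDSA-N", "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
             "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
             "UHPNKBYGGMJTIM-UHFFFAOYSA-M", "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
             "HNDVDQJCIGZPNO-UHFFFAOYSA-N"]

summary = {
    "FICTS": len(set(ficts)),
    "FICuS": len(set(ficus)),
    "uuuuu": len(set(uuuuu)),
    "Std. InChIKey": len(set(inchikeys)),
    # The first InChIKey block encodes the connectivity layer only.
    "Std. InChIKey, first block": len({key.split("-")[0] for key in inchikeys}),
}
for level, count in summary.items():
    print(f"{level}: {count} distinct values among 9 entries")
```

The fewer distinct values a level produces, the more of the variant drawings it treats as the same compound.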

NCI/CADD Chemical Structure Database: Structure Normalization
119.8 million original structure records in CSDB
83.1 million FICTS parent structures
81.6 million FICuS parent structures
76.2 million uuuuu parent structures
The FICuS and uuuuu levels are tautomer-invariant.
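The reduction achieved by each normalization level can be expressed as a share of the original records. A short Python sketch using only the counts stated on the slide (the variable names are illustrative):

```python
ORIGINAL_RECORDS = 119.8e6  # original structure records in CSDB

parent_structures = {
    "FICTS": 83.1e6,
    "FICuS": 81.6e6,
    "uuuuu": 76.2e6,
}

# Percentage of the original record count that remains at each level.
shares = {
    level: 100 * count / ORIGINAL_RECORDS
    for level, count in parent_structures.items()
}

for level, share in shares.items():
    print(f"{level}: {share:.1f}% of the original record count")
```

The step from FICTS to FICuS reflects records that differ only by tautomeric form, and the step to uuuuu reflects the remaining differences in fragments, isotopes, charges, and stereo.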

Tautomer Analysis How much “chemical space” is “just generated” by drawing tautomers?

NCI/CADD Chemical Structure Database: Tautomer Analysis
CACTVS generates all formal tautomers for a given organic compound (prototropic tautomerism):
a rule set of 21 transforms is encoded as (CACTVS-extended) SMIRKS
the rule set is systematically applied to the original structure (and to all tautomers that have been generated in previous steps)
tautomer generation is limited to 1000 SMIRKS transform operations per structure
all tautomers are ranked by a scoring function
the highest-ranked tautomer is defined as the canonical tautomer
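The enumeration scheme described above (apply every rule to every structure found so far, stop at an operation cap, then rank) can be sketched as a generic worklist loop. This is a toy illustration, not CACTVS code: structures are plain strings, the two transforms stand in for SMIRKS rules, and the scoring function is invented for the example.

```python
from collections import deque


def enumerate_tautomers(start, transforms, max_operations=1000):
    """Apply all transforms to the start structure and to every product found,
    stopping once max_operations transform applications have been made."""
    seen = {start}
    queue = deque([start])
    operations = 0
    while queue and operations < max_operations:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                break
            operations += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen


def canonical_tautomer(tautomers, score):
    """Pick the highest-scoring tautomer; sorting first makes ties deterministic."""
    return max(sorted(tautomers), key=score)


# Toy stand-ins for transform rules: flip the first "keto" or "enol" token.
def keto_to_enol(s):
    return [s.replace("keto", "enol", 1)] if "keto" in s else []


def enol_to_keto(s):
    return [s.replace("enol", "keto", 1)] if "enol" in s else []


toy_rules = [keto_to_enol, enol_to_keto]
found = enumerate_tautomers("keto-keto", toy_rules)
print(sorted(found))
print(canonical_tautomer(found, score=lambda s: s.count("keto")))
```

With a cap in place, the returned set may be incomplete for structures with very many tautomers, which is why an enumeration of this kind can end up non-exhaustive for some compounds.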

NCI/CADD Chemical Structure Database: Tautomer Analysis
21 SMIRKS transform rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids

NCI/CADD Chemical Structure Database: Tautomer Analysis
70.6 million FICuS parent structures (2009 DB version)
Starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS.
This generated 680 million tautomers.
For 1.7% of the FICuS parent structures the enumeration was not exhaustive.

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0, against the number of database releases (frequency), y-axis from 0 to 90]
Average: ~0.3% of original structure records
Database releases labelled in the plot, in slide order: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider

NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: occurrence of "tautomerism-critical" molecules within each individual database release (%), x-axis from 0.5 to 24.5, against the number of database releases (frequency), y-axis from 0 to 30]
Measured as the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict
Average: ~9.5% of FICuS parent structures

Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
HPMBP is used in liquid membranes (selective removal of metal ions).
Selectivity and efficiency depend on the tautomeric form of HPMBP.
The tautomeric form depends on the solvent and on the concentration of HPMBP.
He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947

Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers; one of them is the canonical tautomer by CACTVS.
5 tautomers have a potential stereo center on atoms or bonds (marked R/S or E/Z on the slide).

Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned: 4551-69-3, 33064-14-1, 127117-31-1
Reference counts shown on the slide, in slide order: 859 references, 49 references, 3 references
Stereo annotations shown: (no stereo), (Z)

Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences in databases indexed in CSDB (each group belongs to one of the tautomeric forms drawn on the slide):
12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
1 database (no stereo): ChemDB
16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
2 databases (S): ChemSpider, ZINC
3 databases (R): ChemSpider, ECOTOX, ZINC

Scaffold Analysis (NCI/CADD Chemical Structure Database)
Scaffold definitions used:
molecular scaffold tree (level 1, level 2, …): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
[Figure: example structure with its scaffold tree levels, archetype scaffold, and simple scaffold]

Scaffold Analysis (NCI/CADD Chemical Structure Database)
Input: 76.2 million structures (CSDB uuuuu compound set)
Scaffold types: molecular scaffold tree (level 1, level 2, …), archetype scaffold, simple scaffold
Scaffold counts shown on the slide: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds

Scaffold Analysis (NCI/CADD Chemical Structure Database)
Molecular scaffold tree of the 76.2 million CSDB uuuuu compound set: 8.1 million scaffolds
[Chart: number of unique scaffolds per hierarchy level; x-axis: hierarchy level (1 to 10); left y-axis: number of unique scaffolds (in millions, 0 to 8.0); right y-axis: number of unique structures (in millions, 0 to 80.0)]

Multilevel Neighborhoods of Atoms (MNA) (NCI/CADD Chemical Structure Database)
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
MNA level 1 descriptors of the example structure: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2 descriptors of the example structure:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))

Multilevel Neighborhoods of Atoms (MNA) (NCI/CADD Chemical Structure Database)
Input: 76.2 million structures (CSDB uuuuu compound set)
MNA level 1: 13,426 unique MNAs, 1.3 billion relationships, ~17 MNAs per uuuuu parent structure
MNA level 2: 918,516 unique MNAs, 2.3 billion relationships, ~30 MNAs per uuuuu parent structure
Surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider.
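The per-structure averages follow from dividing the relationship counts by the size of the compound set. A brief Python check with the rounded figures from the slide; pairing 1.3 billion with level 1 and 2.3 billion with level 2 is an assumption based on which average each count reproduces.

```python
STRUCTURES = 76.2e6  # uuuuu parent structures in CSDB

# Structure-to-MNA relationships per level (level assignment is an assumption).
relationships = {
    "MNA level 1": 1.3e9,
    "MNA level 2": 2.3e9,
}

per_structure = {
    level: count / STRUCTURES for level, count in relationships.items()
}

for level, average in per_structure.items():
    print(f"{level}: about {average:.0f} MNAs per uuuuu parent structure")
```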

Chemical Structure Web Services (NCI/CADD Web Resources)
[Diagram: the Chemical Identifier Resolver is reached over HTTP and draws on NCI/CADD web services, the NCI/CADD Chemical Structure Database (CSDB), CACTVS, external (web) services, and other software packages, e.g. OPSIN]

Chemical Identifier Resolver (NCI/CADD Web Resources)
Sites and tools shown on the slide:
IUPHAR DATABASE: http://www.iuphar-db.org
http://www.akosgmbh.eu/globalsearch/index.htm
CACTVS: http://www.xemistry.com
gChem Virtual Molecular Model Kit: http://chemagic.com/web_molecules/script_page_large.aspx
Symyx Draw Resolver: http://www.symyx.com/
webel.py - A Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
avogadro.openmolecules.net/

Chemical Structure Lookup Service II Work in progress …

Acknowledgments
ChemNavigator: Scott Hutton, Tad Hurst
University of Cambridge: Daniel Lowe, Peter Murray-Rust
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
CADD Group, CBL, NCI: Igor Filippov
ChemSpider: Antony Williams, Valery Tkachenko
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov

Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
http://cactus.nci.nih.gov/blog

Acknowledgments - Software
Python Web Framework
Python SQL library
Javascript library
Peter Ertl
CACTVS
ChemWriter
