NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space
Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]
[1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS
[2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
Chemistry Space Analysis
Chemical Identifier Resolver
Links a chemical structure with: NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, Chemical Formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image
Chemical Identifier Resolver (NCI/CADD Web Resources)
http://cactus.nci.nih.gov/chemical/structure
Works as a resolver for different chemical structure identifiers: it converts a given structure identifier into another representation or structure identifier.
First beta release: July 2009
Current release (beta 4): April 2011
Chemical Identifier Resolver (NCI/CADD Web Resources)
URL scheme: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation" (MIME type: text/plain)
Example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8
XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
Chemical Identifier Resolver (NCI/CADD Public Web Resources)
http://cactus.nci.nih.gov/chemical/structure
Accepted "identifier" types: chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, FDA UNII
Available "representation" values: /smiles; /names, /iupac_name; /cas; /inchi, /stdinchi; /inchikey, /stdinchikey; /ficts, /ficus, /uuuuu; /image; /file, /sdf; /mw, /monoisotopic_mass; /formula; /twirl, /3d; /urls; /chemspider_id; /pubchem_sid; /chemnavigator_sid
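The URL scheme on these slides can be sketched as a small helper. This is a minimal illustration, not part of the talk: the function name `resolver_url` and the decision to percent-encode the identifier are assumptions, and no request is actually sent.

```python
from urllib.parse import quote

# Base URL as given on the slides.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier, representation, xml=False):
    """Build a resolver URL of the form BASE/identifier/representation[/xml].

    Percent-encoding the identifier is an assumption made here so that
    names containing spaces or slashes still form a valid URL path segment.
    """
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


print(resolver_url("Tamiflu", "cas"))
print(resolver_url("L-alanin", "smiles", xml=True))
```

The first call reproduces the Tamiflu example URL from the slide; fetching it (for instance with `urllib.request.urlopen`) would be a separate step.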
Chemical Identifier Resolver (NCI/CADD Web Resources): how a request is processed
HTTP request: "identifier" (e.g. CAS number, chemical name) and "representation" (e.g. InChI, GIF image)
Step 1: detection of the identifier type
Step 2a: if the identifier is a full structure representation (e.g. SMILES, InChI), the structure is obtained directly
Step 2b: if the identifier is a hashed structure representation (e.g. InChIKey), a trivial name etc., the structure is found by database lookup in the NCI/CADD Chemical Structure Database (CSDB)
Step 3: calculation of the requested structure representation (CACTVS)
HTTP response: the representation, delivered with the matching MIME type
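The "detection of the identifier type" step can be illustrated with a few pattern checks. This is a hypothetical sketch, not the resolver's actual logic: the function names and returned labels are made up here, and the NCI/CADD identifier pattern is inferred from the example identifiers shown in this talk.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")
# Pattern inferred from the identifier examples in this talk (assumption).
NCICADD_RE = re.compile(r"[0-9A-F]{16}-(FICTS|FICuS|uuuuu)-\d{2}-[0-9A-F]{2}")


def cas_check_digit_ok(cas):
    """Validate a CAS number with the usual weighted check-digit rule."""
    m = CAS_RE.fullmatch(cas)
    if not m:
        return False
    digits = (m.group(1) + m.group(2))[::-1]  # rightmost digit gets weight 1
    total = sum(int(d) * w for w, d in enumerate(digits, start=1))
    return total % 10 == int(m.group(3))


def guess_identifier_type(text):
    """Return a rough label for the identifier type (labels are hypothetical)."""
    text = text.strip()
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.fullmatch(text):
        return "inchikey"
    if NCICADD_RE.fullmatch(text):
        return "ncicadd_identifier"
    if cas_check_digit_ok(text):
        return "cas_number"
    # Telling a SMILES string from a name needs a chemistry toolkit,
    # so both fall into one bucket in this sketch.
    return "name_or_structure"


print(guess_identifier_type("204255-11-8"))
print(guess_identifier_type("HNDVDQJCIGZPNO-UHFFFAOYSA-N"))
```

Hashed identifiers such as InChIKeys cannot be converted back to a structure by calculation, which is why they go down the database lookup branch.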
Chemical Identifier Resolver (NCI/CADD Web Resources): XML response with explicit resolver selection
Request: http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir
Response:
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
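A response of this shape can be read with the standard library XML parser. The sketch below works on a copy of the sample response from the slide rather than a live request; the function name `results_by_resolver` is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the slide (entities and spacing cleaned up).
SAMPLE_RESPONSE = """\
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
"""


def results_by_resolver(xml_text):
    """Map each resolver name to the list of item values it returned."""
    root = ET.fromstring(xml_text)
    return {
        data.get("resolver"): [item.text.strip() for item in data.findall("item")]
        for data in root.findall("data")
    }


results = results_by_resolver(SAMPLE_RESPONSE)
for resolver, items in results.items():
    print(resolver, items)
```

Grouping by the `resolver` attribute makes it easy to see whether the different name resolvers agree on the structure, as they do in this example.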
Chemical Identifier Resolver: Chemical Structure Database (CSDB)
Currently: ~150 chemical structure databases, ~120 million structure records
~81.6 million unique structures by NCI/CADD FICuS Identifier
~84 million unique structures by Std. InChIKey
Sources of the records: ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%
FICTS, FICuS, uuuuu
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
[Figure: structure drawing labeled with the hashcode 9850FD9F9E2B4E25]
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB)
-> structure normalization
-> parent structure (SDF, SMILES, database)
-> hashcode calculation (E_HASHISY)
-> NCI/CADD Identifier (FICTS, FICuS, uuuuu)
NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
Each identifier type (FICTS, FICuS, uuuuu) is either sensitive or not sensitive to five structure features: Fragments, Isotopes, Charges, Stereo, Tautomers.
Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
Examples:
4A122D094098B50D-FICTS-01-1D
0E26B623DF7FAD30-FICuS-01-70
9850FD9F9E2B4E25-uuuuu-01-27
[Figure: example structure drawn as a sodium salt (Na+)]
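The identifier format above can be split into its four parts with a regular expression. This is a minimal sketch: the field widths (16 hex digits, 2-digit version, 2 hex digit checksum) are inferred from the three examples on the slide, the class and function names are hypothetical, and the checksum is not verified because the slides do not say how it is computed.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    hashcode: str   # CACTVS E_HASHISY hashcode
    tag: str        # FICTS, FICuS, or uuuuu
    version: str
    checksum: str   # kept as-is; not validated in this sketch


# Field widths are assumptions based on the slide examples.
IDENTIFIER_RE = re.compile(
    r"([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})"
)


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into named fields."""
    match = IDENTIFIER_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(*match.groups())


parsed = parse_identifier("4A122D094098B50D-FICTS-01-1D")
print(parsed.hashcode, parsed.tag, parsed.version, parsed.checksum)
```

Two records describe the same parent structure at a given sensitivity level when their hashcode and tag both match.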
Histidine example
[Figure: histidine shown next to related structure records: charged form, tautomer, isotope, salt, stereoisomers, and "errors"]

Histidine example: FICTS identifiers of the structure records
A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS

Histidine example: FICuS identifiers of the structure records
A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS

Histidine example: uuuuu identifiers of the structure records
9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu

Histidine example: Std. InChIKeys of the structure records
HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, UHPNKBYGGMJTIM-UHFFFAOYSA-M, HNDVDQJCIGZPNO-UHFFFAOYSA-N
NCI/CADD Chemical Structure Database: Structure Normalization
119.8 million original structure records in CSDB
83.1 million FICTS parent structures
81.6 million FICuS parent structures (tautomer-invariant)
76.2 million uuuuu parent structures
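The reduction from original records to parent structures amounts to counting distinct identifiers at each sensitivity level. The sketch below applies that idea to the nine hashcodes listed on the histidine example slides; the lists follow the order in which the labels appear in the slide text and are not aligned record by record, which does not matter for counting distinct values. The function name is an assumption.

```python
# Hashcodes as listed on the histidine example slides (tag suffix omitted).
FICTS_HASHES = [
    "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
    "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
    "8A7AD1EB498CC76A", "6C16DE2351F9FF50", "9850FD9F9E2B4E25",
]
FICUS_HASHES = [
    "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
    "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
    "8A7AD1EB498CC76A", "9850FD9F9E2B4E25", "9850FD9F9E2B4E25",
]
UUUUU_HASHES = ["9850FD9F9E2B4E25"] * 9


def count_parent_structures(hashes):
    """Number of distinct parent structures among a list of record hashcodes."""
    return len(set(hashes))


for level, hashes in [("FICTS", FICTS_HASHES),
                      ("FICuS", FICUS_HASHES),
                      ("uuuuu", UUUUU_HASHES)]:
    print(level, len(hashes), "records ->",
          count_parent_structures(hashes), "parent structures")
```

On this toy set the nine records collapse to 7, 6, and 1 parent structures, mirroring on a small scale the 119.8 million to 83.1, 81.6, and 76.2 million reduction reported for CSDB.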
Tautomer Analysis How much “chemical space” is “just generated” by drawing tautomers?
NCI/CADD Chemical Structure Database: Tautomer Analysis
NCI/CADD Chemical Structure Database: Tautomer Analysis
Tautomer transformation rules:
rule 1: 1.3 (thio)keto/(thio)enol
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 aromatic heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: ketene/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxime/nitroso
rule 17: oxime/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids
NCI/CADD Chemical Structure Database: Tautomer Analysis
70.6 million FICuS parent structures (2009 DB version)
Starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS.
Generated 680 million tautomers.
For 1.7% of the FICuS parent structures the enumeration was not exhaustive.
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: tautomeric overlap within each individual database release (%), 0.0 to 2.0, against the number of database releases, 0 to 90]
Average: ~0.3% of original structure records
Database releases labeled on the histogram: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider
NCI/CADD Chemical Structure Database: Tautomer Analysis
[Histogram: occurrence of "tautomerism-critical" molecules within each individual database release (%), 0.5 to 24.5, against the number of database releases, 0 to 30]
Plotted value: percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict
Average: ~9.5% of FICuS parent structures
Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Reference: He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947
Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
CACTVS generates 7 tautomers; one of them is the canonical tautomer by CACTVS.
5 tautomers have a potential stereo center on atoms or bonds (R/S or E/Z).
[Figure: the 7 tautomer structures]
Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
3 tautomers have CAS Registry Numbers assigned: 4551-69-1, 33064-14-1, 127117-31-1
Reference counts shown: 859 references, 49 references, 3 references
Stereo annotations shown: (no stereo), (Z)
[Figure: the 7 tautomer structures with R/S and E/Z markers]
Example for a Tautomer "Conflict": HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
Occurrences of the tautomers in databases indexed in CSDB (database lists matched to the counts by their length):
16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
3 databases (R): ChemSpider, ECOTOX, ZINC
2 databases (S): ChemSpider, ZINC
1 database (no stereo): ChemDB
[Figure: the tautomer structures with R/S and E/Z markers]
Scaffold Analysis
NCI/CADD Chemical Structure Database: Scaffold Analysis
Scaffold definitions used:
molecular scaffold tree (level 1, level 2, ...): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
[Figure: example structure with its scaffolds]
NCI/CADD Chemical Structure Database: Scaffold Analysis
Input: CSDB uuuuu compound set, 76.2 million structures
Scaffold types: molecular scaffold tree, archetype scaffold, simple scaffold
Scaffold counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds
[Plot: molecular scaffold tree (8.1 million scaffolds). Number of unique scaffolds per hierarchy level (levels 1 to 10, 0 to 8.0 million) and number of unique structures (0 to 80.0 million)]
Atom Neighborhoods
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670.
MNA level 1 descriptors of the example structure:
HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
MNA level 2 descriptors of the example structure:
C(C(CC-H)C(CC-C)-H(C))
C(C(CC-H)C(CN-H)-H(C))
C(C(CC-H)C(CN-H)-C(C-O-O))
C(C(CC-H)N(CC)-H(C))
C(C(CC-C)N(CC)-H(C))
N(C(CN-H)C(CN-H))
-H(C(CC-H))
-H(C(CN-H))
-H(-O(-H-C))
-C(C(CC-C)-O(-H-C)-O(-C))
-O(-H(-O)-C(C-O-O))
-O(-C(C-O-O))
[Figure: example structure]
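The level 1 strings above read as "atom symbol followed by the symbols of its direct neighbors". The sketch below reproduces them under two assumptions that are not stated on the slide: the example molecule is taken to be nicotinic acid (which yields exactly the nine level 1 strings listed), and neighbors are ordered H, C, N, O to match the slide's spelling. It is a simplified illustration, not the published MNA algorithm, and it does not cover level 2.

```python
from collections import defaultdict

# Neighbor ordering chosen to match the strings on the slide (assumption).
ELEMENT_ORDER = {"H": 0, "C": 1, "N": 2, "O": 3}


def mna_level1(symbols, bonds):
    """Return one level 1 string per atom: own symbol + sorted neighbor symbols."""
    neighbors = defaultdict(list)
    for a, b in bonds:
        neighbors[a].append(symbols[b])
        neighbors[b].append(symbols[a])
    return [
        symbols[i] + "".join(sorted(neighbors[i], key=ELEMENT_ORDER.__getitem__))
        for i in range(len(symbols))
    ]


# Nicotinic acid with explicit hydrogens (assumed example molecule).
# Atoms 0-5: pyridine ring (N, C, C, C, C, C); 6: carboxyl C; 7: =O; 8: -OH oxygen.
SYMBOLS = ["N", "C", "C", "C", "C", "C", "C", "O", "O", "H", "H", "H", "H", "H"]
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),   # ring
    (2, 6), (6, 7), (6, 8),                           # carboxylic acid group
    (1, 9), (3, 10), (4, 11), (5, 12), (8, 13),       # hydrogens
]

descriptors = mna_level1(SYMBOLS, BONDS)
print(sorted(set(descriptors)))
```

Collecting the distinct strings over all structures in a database gives the kind of "unique MNAs" count reported for the uuuuu compound set.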
NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA)
Input: CSDB uuuuu compound set, 76.2 million structures
Unique MNAs: level 1: 13,426; level 2: 918,516
MNA-to-structure relationships: 1.3 billion (~17 MNAs per uuuuu parent structure) and 2.3 billion (~30 MNAs per uuuuu parent structure)
Surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
Chemical Structure Web Services (NCI/CADD Web Resources)
[Diagram: the Chemical Identifier Resolver communicates over HTTP with NCI/CADD web services, which are backed by CACTVS and the NCI/CADD Chemical Structure Database (CSDB), and with external (web) services and other software packages, e.g. OPSIN]
Chemical Identifier Resolver (NCI/CADD Web Resources): sites and tools shown
IUPHAR DATABASE: http://www.iuphar-db.org
http://www.akosgmbh.eu/globalsearch/index.htm
CACTVS: http://www.xemistry.com
gChem Virtual Molecular Model Kit: http://chemagic.com/web_molecules/script_page_large.aspx
Symyx Draw Resolver: http://www.symyx.com/
webel.py, a Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
avogadro.openmolecules.net/
Chemical Structure Lookup Service II Work in progress …
Acknowledgments
ChemNavigator: Scott Hutton, Tad Hurst
University of Cambridge: Daniel Lowe, Peter Murray-Rust
Noel O'Boyle (University College Cork, Ireland)
Richard Apodaca (Metamolecular)
Hans-Juergen Himmler
CADD Group, CBL, NCI: Igor Filippov
ChemSpider: Antony Williams, Valery Tkachenko
Thanks to all database providers!
Our web site: http://cactus.nci.nih.gov
http://cactus.nci.nih.gov/chemical/structure Chemical Identifier Resolver NCI/CADD Web Resources http://cactus.nci.nih.gov/blog
Acknowledgments - Software Python Web Framework Python SQL library Javascript library Peter Ertl CACTVS ChemWriter
 

Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 

5th Meeting on U.S. Government Chemical Databases and Open Chemistry Talk

  • 1. NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space. Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]. [1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS. [2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
  • 3. Chemical Identifier Resolver. Identifiers and representations linked to a chemical structure: NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, Chemical Formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image
  • 4. Chemical Identifier Resolver (NCI/CADD Web Resources): http://cactus.nci.nih.gov/chemical/structure. Works as a resolver for different chemical structure identifiers and allows one to convert a given structure identifier into another representation or structure identifier. First beta release: July 2009; current release (beta 4): April 2011
  • 6. Chemical Identifier Resolver (NCI/CADD Public Web Resources), http://cactus.nci.nih.gov/chemical/structure. Accepted “identifier” types: chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, FDA UNII. Available “representation” methods: /smiles; /names, /iupac_name; /cas; /inchi, /stdinchi; /inchikey, /stdinchikey; /ficts, /ficus, /uuuuu; /image; /file, /sdf; /mw, /monoisotopic_mass; /formula; /twirl, /3d; /urls; /chemspider_id; /pubchem_sid; /chemnavigator_sid
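The URL scheme on this slide (base address, then the identifier, then the representation) can be sketched in a few lines of Python. This is a minimal sketch, not the resolver's own client code: the function name `resolver_url` is hypothetical, and percent-encoding the identifier is an assumption made so that characters common in SMILES strings and names stay inside one path segment.

```python
from urllib.parse import quote

# Base address as given on the slides.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build BASE_URL/<identifier>/<representation>, optionally with /xml appended."""
    # Assumption: the identifier is percent-encoded so that characters such as
    # '/', '#', '=' or spaces do not break the path into extra segments.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


if __name__ == "__main__":
    print(resolver_url("Tamiflu", "cas"))
    print(resolver_url("C[C@H](N)C(O)=O", "stdinchikey", xml=True))
```

The representation is passed without its leading slash (for example `cas` or `stdinchikey`); no network request is made here, the sketch only assembles the address.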
  • 7. Chemical Identifier Resolver (NCI/CADD Web Resources), request flow. An http request carries an identifier and the requested representation. The resolver first detects the identifier type. If the identifier is a full structure representation (e.g. SMILES, InChI), the requested structure representation is calculated from the structure. If the identifier is a hashed structure representation (e.g. InChIKey), a trivial name, etc. (e.g. CAS number, chemical name), the structure is obtained by database lookup. The http response returns the representation (e.g. InChI, GIF image) with the corresponding MIME type. Components shown: CACTVS and the NCI/CADD Chemical Structure Database (CSDB)
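The branch between "calculate" and "database lookup" can be illustrated with a small routing sketch. The patterns and function names below are hypothetical and are not the resolver's real detection logic, which the slides do not show; SMILES detection in particular is left out, so unrecognized strings are only reported as undecided.

```python
import re

# Hypothetical patterns for three identifier types named on the slides.
PATTERNS = [
    ("InChI", re.compile(r"^InChI=1S?/")),
    ("InChIKey", re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")),
    ("CAS number", re.compile(r"^\d{2,7}-\d{2}-\d$")),
]

# Types that already contain the full structure and need no lookup.
FULL_STRUCTURE = {"InChI"}


def classify(identifier: str) -> str:
    """Return the first matching identifier type, or 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


def route(identifier: str) -> tuple:
    """Decide how the structure would be obtained for this identifier."""
    kind = classify(identifier)
    if kind in FULL_STRUCTURE:
        return kind, "calculate"
    if kind == "unknown":
        # Could be a SMILES string or a chemical name; this sketch does not decide.
        return kind, "try structure parsers, then name lookup"
    return kind, "database lookup"


if __name__ == "__main__":
    for example in ("204255-11-8", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
                    "InChI=1S/CH4/h1H4", "L-alanin"):
        print(example, "->", route(example))
```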
  • 9. Chemical Identifier Resolver (NCI/CADD Web Resources), XML output. Request: http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir Response: <request string="L-alanin" representation="smiles"> <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)"> <item id="1">C[C@H](N)C(O)=O</item> </data> <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)"> <item id="1">C[C@H](N)C(O)=O</item> </data> <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)"> <item id="1">C[C@H](N)C(O)=O</item> </data> </request>
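A response of this shape can be read with Python's standard XML parser. The sketch below works on a copy of the slide's sample response embedded as a string, so it makes no network call; the helper name `results_by_resolver` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the slide (entities decoded, whitespace tidied).
SAMPLE = """\
<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>
"""


def results_by_resolver(xml_text: str) -> dict:
    """Map each resolver name to the list of item values it returned."""
    root = ET.fromstring(xml_text)
    return {
        data.get("resolver"): [item.text.strip() for item in data.findall("item")]
        for data in root.findall("data")
    }


if __name__ == "__main__":
    results = results_by_resolver(SAMPLE)
    for resolver, items in results.items():
        print(resolver, items)
    # All three resolvers agree when only one distinct answer is present.
    print(len({tuple(v) for v in results.values()}) == 1)
```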
  • 13. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Workflow: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure (SDF, SMILES, database) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier
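  • 15. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Identifier format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>. The tag states which structure features the identifier is sensitive or not sensitive to: Fragments, Isotopes, Charges, Tautomers, Stereo. FICTS is sensitive to all five, FICuS is not sensitive to tautomers, and uuuuu is not sensitive to any of them. Example (histidine sodium salt, Na+): 4A122D094098B50D-FICTS-01-1D, 0E26B623DF7FAD30-FICuS-01-70, 9850FD9F9E2B4E25-uuuuu-01-27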
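The identifier layout `<hashcode>-<tag>-<version>-<checksum>` lends itself to a small parser. The following is a sketch under assumptions: the field widths (16 hexadecimal digits, two-digit version, two-character checksum) are inferred from the three examples on the slide, the names `NciCaddIdentifier` and `parse_identifier` are hypothetical, and the checksum is only split off, not verified, because the slides do not say how it is computed.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    hashcode: str   # CACTVS E_HASHISY hashcode, 16 hex digits in the slide examples
    tag: str        # FICTS, FICuS or uuuuu
    version: str    # "01" in the slide examples
    checksum: str   # two characters; algorithm not given on the slides


# Assumed field widths, inferred from the slide examples only.
_PATTERN = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")


def parse_identifier(text: str) -> NciCaddIdentifier:
    """Split an NCI/CADD identifier into its four fields."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(*match.groups())


if __name__ == "__main__":
    for example in ("4A122D094098B50D-FICTS-01-1D",
                    "0E26B623DF7FAD30-FICuS-01-70",
                    "9850FD9F9E2B4E25-uuuuu-01-27"):
        print(parse_identifier(example))
```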
  • 16. Histidine example. Structure variants shown: charged form, tautomer, isotope, salt, stereoisomers, “errors”. [structure drawings]
  • 17. Histidine example, FICTS identifiers of the variants (charged form, tautomer, isotope, salt, stereoisomers, “errors”): A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS. [structure drawings]
  • 18. Histidine example, FICuS identifiers of the variants (charged form, tautomer, isotope, salt, stereoisomers, “errors”): A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS. [structure drawings]
  • 19. Histidine example, uuuuu identifiers of the variants (charged form, tautomer, isotope, stereoisomers, salt, “errors”): 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu. [structure drawings]
  • 20. Histidine example, Std. InChIKeys of the variants (charged form, tautomer, isotope, stereoisomers, salt, “errors”): HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, UHPNKBYGGMJTIM-UHFFFAOYSA-M, HNDVDQJCIGZPNO-UHFFFAOYSA-N. [structure drawings]
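The Standard InChIKeys on this slide can be compared block by block. A Standard InChIKey has three hyphen-separated parts: a 14-letter first block hashed from the molecular skeleton, a 10-letter second block covering the remaining layers (such as stereochemistry and isotopes) plus flag characters, and a final protonation character. The sketch below groups the distinct keys from the slide by first block; the function name `group_by_first_block` is hypothetical.

```python
from collections import defaultdict

# Distinct Standard InChIKeys listed on the slide.
KEYS = [
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
    "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
]


def group_by_first_block(keys):
    """Group (second block, protonation flag) pairs under each first block."""
    groups = defaultdict(set)
    for key in keys:
        first, second, protonation = key.split("-")
        groups[first].add((second, protonation))
    return dict(groups)


if __name__ == "__main__":
    for first, variants in group_by_first_block(KEYS).items():
        print(first, len(variants), sorted(variants))
```

Four keys share the first block `HNDVDQJCIGZPNO` and differ only in the second block, while the keys starting with `UHPNKBYGGMJTIM` form a separate group, so a search on the first block alone would not connect the two groups.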
  • 21. NCI/CADD Chemical Structure Database, Structure Normalization: 119.8 million original structure records in CSDB
  • 22. NCI/CADD Chemical Structure Database, Structure Normalization: 119.8 million original structure records in CSDB; 83.1 million FICTS parent structures
  • 23. NCI/CADD Chemical Structure Database, Structure Normalization: 119.8 million original structure records in CSDB; 83.1 million FICTS parent structures; 81.6 million FICuS parent structures
  • 24. NCI/CADD Chemical Structure Database, Structure Normalization: 119.8 million original structure records in CSDB; 83.1 million FICTS parent structures; 81.6 million FICuS parent structures; 76.2 million uuuuu parent structures
  • 25. NCI/CADD Chemical Structure Database, Structure Normalization: 119.8 million original structure records in CSDB; 83.1 million FICTS parent structures; 81.6 million FICuS parent structures; 76.2 million uuuuu parent structures; label shown: tautomer-invariant
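The four counts on these slides can be put in proportion with a little arithmetic. The sketch below uses only the numbers from the slides (in millions) and reports, for each normalization level, its share of the original records and the relative drop from the previous level; the function name `reduction_table` is hypothetical.

```python
# Counts in millions, as stated on the slides.
COUNTS_MILLIONS = {
    "original records": 119.8,
    "FICTS": 83.1,
    "FICuS": 81.6,
    "uuuuu": 76.2,
}


def reduction_table(counts):
    """Return (name, count, % of first entry, % drop from previous entry) rows."""
    rows = []
    base = previous = next(iter(counts.values()))
    for name, value in counts.items():
        share = 100.0 * value / base
        drop = 100.0 * (previous - value) / previous
        rows.append((name, value, share, drop))
        previous = value
    return rows


if __name__ == "__main__":
    for name, value, share, drop in reduction_table(COUNTS_MILLIONS):
        print(f"{name:17s} {value:6.1f} M  {share:5.1f}% of records  "
              f"{drop:4.1f}% fewer than previous level")
```

On these numbers, the step from FICTS to FICuS parent structures removes a little under 2% of the FICTS set.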
  • 26. Tautomer Analysis How much “chemical space” is “just generated” by drawing tautomers?
  • 29. NCI/CADD Chemical Structure Database, Tautomer Analysis (2009 DB version): 70.6 million FICuS parent structures. Starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS. This generated 680 million tautomers; for 1.7% of the FICuS parent structures the enumeration was not exhaustive
  • 30. NCI/CADD Chemical Structure Database, Tautomer Analysis. [Histogram: tautomeric overlap within each individual database release (%) vs. frequency (number of database releases)] Average: ~0.3% of original structure records
  • 31. NCI/CADD Chemical Structure Database, Tautomer Analysis. [Histogram: tautomeric overlap within each individual database release (%) vs. frequency (number of database releases)] Average: ~0.3% of original structure records. Databases labeled: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider
  • 32. NCI/CADD Chemical Structure Database, Tautomer Analysis. [Histogram: occurrence of “tautomerism-critical” molecules within each individual database release (%) vs. frequency (number of database releases)] Measured as the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict. Average: ~9.5% of FICuS parent structures
  • 34. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5). CACTVS generates 7 tautomers; the slide marks the canonical tautomer by CACTVS. 5 tautomers have a potential stereo center on atoms or bonds (R/S or E/Z). [structure drawings]
  • 35. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5). 3 tautomers have CAS Registry Numbers assigned: 4551-69-1, 33064-14-1, 127117-31-1, with 859, 49, and 3 references. Stereo annotations shown: (no stereo), (Z). [structure drawings]
  • 36. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5). Occurrences in databases indexed in CSDB: 6 databases; 16 databases (no stereo); 3 databases (R); 2 databases (S); 12 databases; 1 database (no stereo). [structure drawings]
  • 37. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5). Occurrences in databases, by tautomer. 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC. 1 database (no stereo): ChemDB. 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma. 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma. 2 databases (S): ChemSpider, ZINC. 3 databases (R): ChemSpider, ECOTOX, ZINC. [structure drawings]
  • 39. NCI/CADD Chemical Structure Database, Scaffold Analysis. Scaffold types: molecular scaffold tree with hierarchy levels (level 1, level 2, ...) (Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58); archetype scaffold and simple scaffold (Bemis et al. J. Med. Chem. 1996, 39, 2887-2893). [example structure and scaffold drawings]
  • 40. NCI/CADD Chemical Structure Database, Scaffold Analysis: CSDB uuuuu compound set, 76.2 million
  • 41. NCI/CADD Chemical Structure Database, Scaffold Analysis: CSDB uuuuu compound set, 76.2 million. Scaffold types: molecular scaffold tree (level 1, level 2, ...), archetype scaffold, simple scaffold. Counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds. [scaffold drawings]
  • 42. NCI/CADD Chemical Structure Database, Scaffold Analysis: CSDB uuuuu compound set, 76.2 million; molecular scaffold tree, 8.1 million scaffolds. [Chart: number of unique scaffolds (in millions) and number of unique structures (in millions) per hierarchy level, levels 1 to 10]
  • 44. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database. Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670. Example descriptors, MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC. MNA level 2: C(C(CC-H)C(CC-C)-H(C)), C(C(CC-H)C(CN-H)-H(C)), C(C(CC-H)C(CN-H)-C(C-O-O)), C(C(CC-H)N(CC)-H(C)), C(C(CC-C)N(CC)-H(C)), N(C(CN-H)C(CN-H)), -H(C(CC-H)), -H(C(CN-H)), -H(-O(-H-C)), -C(C(CC-C)-O(-H-C)-O(-C)), -O(-H(-O)-C(C-O-O)), -O(-C(C-O-O)). [structure drawing]
  • 45. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database: CSDB uuuuu compound set, 76.2 million
  • 46. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database: CSDB uuuuu compound set, 76.2 million. Unique MNAs: level 1: 13,426; level 2: 918,516
  • 47. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database: CSDB uuuuu compound set, 76.2 million. Unique MNAs: level 1: 13,426; level 2: 918,516. Relationships: 1.3 billion (~17 MNAs per uuuuu parent structure) and 2.3 billion (~30 MNAs per uuuuu parent structure)
  • 48. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database: CSDB uuuuu compound set, 76.2 million. Unique MNAs: level 1: 13,426; level 2: 918,516. Relationships: 1.3 billion (~17 MNAs per uuuuu parent structure) and 2.3 billion (~30 MNAs per uuuuu parent structure). Surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
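The per-structure averages on these slides follow from dividing each relationship total by the size of the uuuuu compound set. The sketch below redoes that division with the slide's rounded figures; the function name `mnas_per_structure` is hypothetical, and which MNA level each total belongs to is not recoverable from the transcript, so the code does not label them.

```python
# Figures as stated on the slides (rounded).
UUUUU_STRUCTURES = 76.2e6
RELATIONSHIP_TOTALS = (1.3e9, 2.3e9)


def mnas_per_structure(relationships: float, structures: float) -> float:
    """Average number of MNA relationships per parent structure."""
    return relationships / structures


if __name__ == "__main__":
    for total in RELATIONSHIP_TOTALS:
        average = mnas_per_structure(total, UUUUU_STRUCTURES)
        print(f"{total / 1e9:.1f} billion relationships: about {average:.0f} per structure")
```

The division pairs the 1.3 billion total with the average of about 17 and the 2.3 billion total with the average of about 30.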
  • 49. Chemical Structure Web Services (NCI/CADD Web Resources). Components shown: Chemical Identifier Resolver and further NCI/CADD web services, reached via http; NCI/CADD Chemical Structure Database (CSDB); CACTVS; external (web) services; other software packages, e.g. OPSIN
  • 50. Chemical Identifier Resolver (NCI/CADD Web Resources). Sites and tools shown: IUPHAR DATABASE (http://www.iuphar-db.org); http://www.akosgmbh.eu/globalsearch/index.htm; CACTVS (http://www.xemistry.com); gChem Virtual Molecular Model Kit (http://chemagic.com/web_molecules/script_page_large.aspx); Symyx Draw Resolver (http://www.symyx.com/); webel.py, a Cinfony module (http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html); avogadro.openmolecules.net/
  • 51. Chemical Structure Lookup Service II Work in progress …
  • 52. Chemical Structure Lookup Service II Work in progress …
  • 53. Acknowledgments. ChemNavigator: Scott Hutton, Tad Hurst. University of Cambridge: Daniel Lowe, Peter Murray-Rust. Noel O'Boyle (University College Cork, Ireland). Richard Apodaca (Metamolecular). Hans-Juergen Himmler. CADD Group, CBL, NCI: Igor Filippov. ChemSpider: Antony Williams, Valery Tkachenko. Thanks to all database providers! Our web site: http://cactus.nci.nih.gov
  • 54. Chemical Identifier Resolver (NCI/CADD Web Resources): http://cactus.nci.nih.gov/chemical/structure and http://cactus.nci.nih.gov/blog
  • 55. Acknowledgments - Software Python Web Framework Python SQL library Javascript library Peter Ertl CACTVS ChemWriter

Editor's Notes

  1. All of these structures yield the same uuuuu identifier, i.e. you would find all of them regardless of which one you used as the query.