ICCS9 2011 Talk
ICCS9 2011 Talk Presentation Transcript

  • 1. NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space. Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]. [1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS; [2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
  • 2. Chemistry Space Analysis
    • how many small molecules are there currently?
    • since the early 2000s: enormous increase in the number of databases containing small molecules, e.g. PubChem, ChemSpider, ChEMBL, DrugBank; what is the overlap?
    • many ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms)
    • growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)
  • 3. Chemical Identifier Resolver (diagram): a chemical structure and the identifiers and representations connected to it: NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, Chemical Formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image
  • 4. Chemical Identifier Resolver (NCI/CADD Web Resources): http://cactus.nci.nih.gov/chemical/structure
    • works as a resolver for different chemical structure identifiers
    • allows one to convert a given structure identifier into another representation or structure identifier
    • first beta release: July 2009; current release (beta 4): April 2011
  • 5. Chemical Identifier Resolver (NCI/CADD Web Resources)
    • it is usable by a simple URL API: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation" (MIME type: text/plain)
    • example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8
    • XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
    • if a request is not resolvable: HTTP 404 status message
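The URL scheme on this slide can be exercised with a few lines of Python. This is a minimal sketch, not the authors' client: the function names `build_resolver_url` and `resolve` are hypothetical, and percent-encoding the identifier is an assumption made here for safety with names that contain spaces.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_resolver_url(identifier, representation, xml=False):
    """Build <BASE_URL>/<identifier>/<representation>[/xml] (hypothetical helper)."""
    # Assumption: the identifier is percent-encoded before it is put into the path.
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10):
    """Return the text/plain response body, or None if the resolver answers HTTP 404."""
    try:
        with urlopen(build_resolver_url(identifier, representation), timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # the slide states that unresolvable requests yield 404
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` would issue the example request from the slide; whether the service still answers in the same way today is not something this sketch can guarantee.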
  • 6. Chemical Identifier Resolver (NCI/CADD Public Web Resources): http://cactus.nci.nih.gov/chemical/structure
    • "identifier" types accepted by the resolver: chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, FDA UNII
    • "representation" methods: /smiles; /names, /iupac_name; /cas; /inchi, /stdinchi; /inchikey, /stdinchikey; /ficts, /ficus, /uuuuu; /image; /file, /sdf; /mw, /monoisotopic_mass; /formula; /twirl, /3d; /urls; /chemspider_id; /pubchem_sid; /chemnavigator_sid
  • 7. Chemical Identifier Resolver (NCI/CADD Web Resources): processing of a request
    • http request: identifier and representation
    • detection of the identifier type
    • identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation
    • identifier is a hashed structure representation (e.g. InChIKey), trivial name etc.: database lookup
    • http response with MIME type: structure (e.g. InChI, GIF image) or identifier (e.g. CAS number, chemical name)
    • components: CACTVS and the NCI/CADD Chemical Structure Database (CSDB)
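The "detection of the identifier type" step could look roughly like the following sketch. The resolver's real detection logic is not given on the slide, so the regular expressions, the function names, and the routing labels ("calculate", "lookup", "undecided") are all assumptions for illustration. The CAS check uses the standard CAS check-digit rule.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
# Assumed pattern for NCI/CADD identifiers, with optional version and checksum.
NCICADD_RE = re.compile(r"^[0-9A-F]{16}-(FICTS|FICuS|uuuuu)(-\d{2}-[0-9A-F]{2})?$")


def is_valid_cas(text):
    """Check the CAS layout and its check digit."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = (match.group(1) + match.group(2))[::-1]  # right to left
    total = sum(int(d) * (i + 1) for i, d in enumerate(digits))
    return total % 10 == int(match.group(3))


def detect_identifier_type(identifier):
    """Return (type, route); a hypothetical stand-in for the resolver's detection."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return "inchi", "calculate"  # full structure representation
    if INCHIKEY_RE.match(text):
        return "inchikey", "lookup"  # hashed representation
    if NCICADD_RE.match(text):
        return "ncicadd_identifier", "lookup"
    if is_valid_cas(text):
        return "cas", "lookup"
    # Telling SMILES from chemical names needs a structure parser; left open here.
    return "name_or_smiles", "undecided"
```

Full structure representations go to the calculation branch and hashed or registry-style identifiers to the database lookup, mirroring the two branches on the slide.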
  • 8. Chemical Identifier Resolver: Chemical Structure Database (CSDB)
    • ChemNavigator iResearch Library: compilation of commercially available screening compounds from ~300 international chemistry suppliers
    • PubChem database, including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, …
    • Commercial sources / others: Asinex, Comgenex, eMolecules, ChEMBL, …
    • currently: ~150 chemical structure databases, ~120 million structure records, ~81.6 million unique structures by NCI/CADD FICuS Identifier, ~84 million unique structures by Std. InChIKey
    • composition: ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%
  • 9.
    • NCI/CADD Structure Identifiers
    FICTS, FICuS, uuuuu
  • 10.
    • based on hashcodes calculated by the chemoinformatics toolkit CACTVS
    • CACTVS hashcodes:
      • represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
      • high sensitivity to structural features of a compound
      • change if connectivity changes
    NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Example: hashcode 9850FD9F9E2B4E25 [structure drawing omitted]
  • 11. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
    • workflow: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB, database) → structure normalization → parent structure → hashcode calculation (E_HASHISY) → NCI/CADD Identifier (FICTS, FICuS, uuuuu)
    • calculation of a set of parent structures with different sensitivity to chemical features
    • representation of chemical structures on different levels
  • 12. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
    • adjustable levels of sensitivity (each feature is either sensitive or un-sensitive):
      • Fragments: sensitive, or keep only largest organic fragment (example counterion on the slide: Na+)
      • Isotopes: sensitive, or ignore isotope labels
      • Charges: sensitive, or uncharge
      • Tautomers: sensitive, or find canonical tautomer
      • Stereochemistry: sensitive, or discard stereo information
    [structure drawings omitted]
  • 13. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Sensitivity matrix (sensitive / un-sensitive) for Fragments, Isotopes, Charges, Tautomers, Stereochemistry [structure drawings omitted]
  • 14. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. FICTS: sensitive to Fragments (F), Isotopes (I), Charges (C), Tautomers (T) and Stereochemistry (S); representation of the exact drawing; the example structures compare as different (≠) [structure drawings omitted]
  • 15. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. FICuS: sensitive to Fragments (F), Isotopes (I), Charges (C) and Stereochemistry (S), un-sensitive to Tautomers (u); comes closest to how a chemist perceives a compound; tautomers compare as equal (=), the other example structures as different (≠) [structure drawings omitted]
  • 16. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. uuuuu: un-sensitive to Fragments, Isotopes, Charges, Tautomers and Stereochemistry (u u u u u); closely related forms of the same compound; all example structures compare as equal (=) [structure drawings omitted]
  • 17. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
    • format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
    • tag: FICTS, FICuS or uuuuu (sensitive / not sensitive to Fragments, Isotopes, Charges, Tautomers, Stereo)
    • examples: 4A122D094098B50D-FICTS-01-1D, 0E26B623DF7FAD30-FICuS-01-70, 9850FD9F9E2B4E25-uuuuu-01-27
    [structure drawing omitted]
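The identifier layout above is regular enough to split mechanically. The sketch below is an illustration only: the class and function names are hypothetical, and the checksum is kept as an opaque string because the slides do not say how it is computed.

```python
from dataclasses import dataclass

TAGS = ("FICTS", "FICuS", "uuuuu")
FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereo")


@dataclass(frozen=True)
class NciCaddIdentifier:
    hashcode: int   # 64-bit unsigned CACTVS hashcode (E_HASHISY)
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str   # not verified here: the algorithm is not given in the slides


def parse_identifier(text):
    """Split <hashcode>-<tag>-<version>-<checksum> into its four fields."""
    parts = text.replace(" ", "").split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dash-separated fields, got {len(parts)}")
    hash_hex, tag, version, checksum = parts
    if len(hash_hex) != 16:
        raise ValueError("hashcode must have 16 hexadecimal digits")
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag}")
    return NciCaddIdentifier(int(hash_hex, 16), tag, version, checksum)


def sensitivity(tag):
    """Map each feature to True (sensitive) or False ('u' = un-sensitive)."""
    return {feature: letter != "u" for feature, letter in zip(FEATURES, tag)}
```

For example, `sensitivity("FICuS")` marks only the tautomer feature as un-sensitive, which matches the description of the FICuS level in the earlier slides.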
  • 18. Example: histidine and related forms: charged form, tautomer, isotope, salt (Na+), stereoisomers, “errors” [structure drawings omitted]
  • 19. FICTS identifiers shown next to the histidine forms (charged form, tautomer, isotope, salt, stereoisomers, “errors”): A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS [structure drawings omitted]
  • 20. FICuS identifiers shown next to the same forms: A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS [structure drawings omitted]
  • 21. uuuuu identifiers shown next to the same forms: all forms carry 9850FD9F9E2B4E25-uuuuu (one label on the slide reads 9850FD9F9E2B4E25-FICuS) [structure drawings omitted]
  • 22. Std. InChIKeys shown next to the same forms: HNDVDQJCIGZPNO-UHFFFAOYSA-N (four labels), HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M (two labels) [structure drawings omitted]
  • 23. NCI/CADD Chemical Structure Database: Structure Normalization. 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures → 81.6 million FICuS parent structures → 76.2 million uuuuu parent structures
  • 24. NCI/CADD Chemical Structure Database: Structure Normalization (label added: tautomer-invariant). 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures → 81.6 million FICuS parent structures → 76.2 million uuuuu parent structures
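To make the size of each normalization step concrete, the counts from the slide can be turned into relative reductions. This is plain arithmetic on the four numbers quoted above; the helper name `reductions` is hypothetical.

```python
# Counts in millions, as quoted on the slide.
LEVELS = [
    ("original structure records", 119.8),
    ("FICTS parent structures", 83.1),
    ("FICuS parent structures", 81.6),
    ("uuuuu parent structures", 76.2),
]


def reductions(levels):
    """Percentage by which each level shrinks relative to the previous one."""
    result = []
    for (_, previous), (name, current) in zip(levels, levels[1:]):
        result.append((name, round(100.0 * (previous - current) / previous, 1)))
    return result


for name, percent in reductions(LEVELS):
    print(f"{name}: {percent}% fewer than the level before")
```

The step from original records to FICTS parents removes roughly 31% of the entries, while the tautomer-insensitive FICuS step removes a further 2% or so.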
  • 25. Tautomer Analysis How much “chemical space” is “just generated” by drawing tautomers?
  • 26.
    • CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
    • rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
    • rule set is systematically applied to the original structure (and all tautomers that have been generated in previous steps)
    • tautomer generation is limited to 1000 SMIRKS transform operations/structure
    • all tautomers are ranked by a scoring function
    • the highest ranked tautomer is defined as the canonical tautomer
    NCI/CADD Chemical Structure Database Tautomer Analysis
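The enumeration procedure in this slide (apply every rule to the original structure and to every tautomer found so far, stop after a fixed budget of transform operations, then rank) can be sketched generically. This is not CACTVS code: structures are plain strings, the transforms and the scoring function are toy stand-ins, and what counts as one "operation" is an assumption.

```python
from collections import deque


def enumerate_tautomers(start, transforms, score, max_operations=1000):
    """Breadth-first application of transform rules with an operation budget.

    Returns (ranked forms, canonical form, exhaustive flag).
    """
    seen = {start}
    queue = deque([start])
    operations = 0
    exhaustive = True
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if operations >= max_operations:
                exhaustive = False      # budget used up before all rules were tried
                queue.clear()
                break
            operations += 1             # assumption: one rule application = one operation
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    # Highest score first; ties broken alphabetically to keep the result deterministic.
    ranked = sorted(seen, key=lambda s: (-score(s), s))
    return ranked, ranked[0], exhaustive


# Toy rules on string labels, standing in for SMIRKS transforms.
def keto_to_enol(s):
    return [s.replace("keto", "enol", 1)] if "keto" in s else []


def enol_to_keto(s):
    return [s.replace("enol", "keto", 1)] if "enol" in s else []


forms, canonical, complete = enumerate_tautomers(
    "keto-keto", [keto_to_enol, enol_to_keto], score=lambda s: s.count("keto")
)
```

With a very small `max_operations` the same call returns a partial set and `exhaustive` is `False`, which corresponds to the case where the enumeration is cut off at the operation limit.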
  • 27. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • 21 SMIRKS transform rules:
      • rule 1: 1.3 (thio)keto/(thio)enol
      • rule 2: 1.5 (thio)keto/(thio)enol
      • rule 3: simple (aliphatic) imine
      • rule 4: special imine
      • rule 5: 1.3 aromatic heteroatom H shift
      • rule 6: 1.3 heteroatom H shift
      • rule 7: 1.5 (aromatic) heteroatom H shift (1)
      • rule 8: 1.5 aromatic heteroatom H shift (2)
      • rule 9: 1.7 (aromatic) heteroatom H shift
      • rule 10: 1.9 (aromatic) heteroatom H shift
      • rule 11: 1.11 (aromatic) heteroatom H shift
      • rule 12: furanones
      • rule 13: keten/ynol exchange
      • rule 14: ionic nitro/aci-nitro
      • rule 15: pentavalent nitro/aci-nitro
      • rule 16: oxim/nitroso
      • rule 17: oxim/nitroso via phenol
      • rule 18: cyanic/iso-cyanic acids
      • rule 19: formamidinesulfinic acids
      • rule 20: isocyanides
      • rule 21: phosphonic acids
  • 28. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • rule 1: 1.3 (thio)keto/(thio)enol (1.3 keto/enol): [O,S,Se,Te;X1:1]=[C;z{1-2}:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][#6;z{1-2}:2]=[C,cz{0-1}R{0-1}:3]
    • rule 6: 1.3 heteroatom H shift: [N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
    [atom-mapped example drawings for both rules omitted]
  • 29. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • 72.0 million FICTS parent structures: 8.6% change tautomeric form during FICuS normalization
    • 70.6 million FICuS parent structures:
      • 98.5% have a one-to-one relationship to a single FICTS parent structure
      • 1.5% have a one-to-many relationship to several FICTS parent structures (“conflict”)
    • structure counts are based on the 2009 version of CSDB (103.9 million structure records)
  • 30. NCI/CADD Chemical Structure Database: Tautomer Analysis. [Histogram: tautomeric overlap within each individual database release (%), 0.0 to 2.0, vs. number of database releases (frequency)]; average: ~0.3% of original structure records
  • 31. NCI/CADD Chemical Structure Database: Tautomer Analysis. [Same histogram with database labels]; average: ~0.3% of original structure records; labeled databases: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider
  • 32. NCI/CADD Chemical Structure Database: Tautomer Analysis. [Histogram: occurrence of “tautomerism-critical” molecules within each individual database release (%), 0.5 to 24.5, vs. number of database releases (frequency)]; plotted quantity: percentage of FICuS parent structures in each database release occurring somewhere in CSDB with a conflict; average: ~9.5% of FICuS parent structures
  • 33. HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5)
    • HPMBP is used in liquid membranes (selective removal of metal ions)
    • selectivity and efficiency depend on the tautomeric form of HPMBP
    • the tautomeric form depends on solvent and concentration of HPMBP
    • reference: He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947
    Example for a Tautomer “Conflict” [structure drawing omitted]
  • 34. HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5): Example for a Tautomer “Conflict”
    • CACTVS generates 7 tautomers; one is marked as the canonical tautomer by CACTVS
    • 5 have potential stereo centers on atoms or bonds (R/S, E/Z)
    [structure drawings omitted]
  • 35. HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5): Example for a Tautomer “Conflict”
    • 3 have CAS Registry Numbers assigned: 4551-69-1, 33064-14-1, 127117-31-1
    • reference counts shown: 859 references, 49 references, 3 references
    • stereo labels shown: (no stereo), (Z)
    [structure drawings omitted]
  • 36. HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5): Example for a Tautomer “Conflict”
    • occurrences in databases indexed in CSDB, per drawn form: 6 databases; 16 databases (no stereo); 3 databases (R); 2 databases (S); 12 databases; 1 database (no stereo)
    [structure drawings omitted]
  • 37. HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5): Example for a Tautomer “Conflict”
    • occurrences in databases, per drawn form:
      • 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC
      • 1 database (no stereo): ChemDB
      • 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma
      • 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma
      • 3 databases (R): ChemSpider, ECOTOX, ZINC
      • 2 databases (S): ChemSpider, ZINC
    [structure drawings omitted]
  • 38. NCI/CADD Chemical Structure Database: Tautomer Analysis (70.6 million FICuS parent structures)
    • starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS
    • how many tautomers are generated?
    • how often is each rule applied (type of tautomerism)?
    • how many tautomers per structure?
    • generated 680 million tautomers; for 1.7% of the FICuS parent structures the enumeration was not exhaustive
  • 39. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • usage of SMIRKS rules (1/2), as tautomer rule: count of generated tautomers (%):
      • rule 1: 1.3 (thio)keto/(thio)enol: 173,002,712 (25.4)
      • rule 2: 1.5 (thio)keto/(thio)enol: 11,541,452 (1.7)
      • rule 3: simple (aliphatic) imine: 35,917,415 (5.3)
      • rule 4: special imine: 4,306,155 (0.6)
      • rule 5: 1.3 aromatic heteroatom H shift: 25,678,446 (3.8)
      • rule 6: 1.3 heteroatom H shift: 250,453,882 (36.8)
      • rule 7: 1.5 (aromatic) heteroatom H shift (1): 27,542,770 (4.0)
      • rule 8: 1.5 aromatic heteroatom H shift (2): 26,819 (<0.1)
      • rule 9: 1.7 (aromatic) heteroatom H shift: 57,242,472 (8.4)
      • rule 10: 1.9 (aromatic) heteroatom H shift: 5,061,731 (0.7)
      • rule 11: 1.11 (aromatic) heteroatom H shift: 1,374,235 (0.2)
      • rule 12: furanones: 17,860,604 (2.6)
  • 40. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • usage of SMIRKS rules (2/2), as tautomer rule: count of generated tautomers (%):
      • rule 13: keten/ynol exchange: 57,989 (<0.1)
      • rule 14: ionic nitro/aci-nitro: 428,266 (<0.1)
      • rule 15: pentavalent nitro/aci-nitro: 129 (<0.1)
      • rule 16: oxim/nitroso: 505,695 (<0.1)
      • rule 17: oxim/nitroso via phenol: 131,502 (<0.1)
      • rule 18: cyanic/iso-cyanic acids: 181 (<0.1)
      • rule 19: formamidinesulfinic acids: 1,392 (<0.1)
      • rule 20: isocyanides: 229 (<0.1)
      • rule 21: phosphonic acids: 54,926 (<0.1)
  • 41. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • number of tautomers per structure, as FICuS structures with N tautomers: count (%):
      • no tautomers: 9,756,186 (13.8)
      • one tautomer: 10,721,845 (15.2)
      • 2-10 tautomers: 33,532,284 (47.5)
      • 11-25 tautomers: 10,870,312 (15.4)
      • 25-50 tautomers: 2,622,587 (3.7)
      • 51-100 tautomers: 1,136,066 (1.6)
      • 101-200 tautomers: 565,199 (0.8)
      • 201-300 tautomers: 104,875 (0.1)
      • 301-400 tautomers: 35,144 (<0.1)
      • 401-500 tautomers: 17,241 (<0.1)
      • 501-600 tautomers: 4,323 (<0.1)
      • 601-700 tautomers: 1,400 (<0.1)
      • 701-800 tautomers: 362 (<0.1)
      • 801-832 tautomers: 3 (<0.1)
  • 42. NCI/CADD Chemical Structure Database: Tautomer Analysis
    • number of tautomers per structure (same distribution as on slide 41): many minor tautomeric forms (but you find them in databases)
  • 43. Tautomer Analysis: Tanimoto Similarities of Tautomers
    • canonical tautomer vs. generated tautomers (680 million tautomer set)
    • fingerprint: PubChem/CACTVS E_SCREEN bitvector (881 bits)
    • Tanimoto index range: count (%):
      • >0.9-1.0: 310,725,465 (45.6)
      • >0.8-0.9: 214,747,976 (31.5)
      • >0.7-0.8: 111,954,384 (16.4)
      • >0.6-0.7: 36,448,651 (5.3)
      • >0.5-0.6: 6,304,436 (0.9)
      • >0.4-0.5: 369,331 (<0.1)
      • >0.3-0.4: 6,580 (<0.1)
      • >0.2-0.3: 6 (<0.1)
      • >0.0-0.2: 0 (0.0)
    • ~23% below 0.8 Tanimoto similarity (although the same molecule)
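For reference, the Tanimoto index on bit-vector fingerprints is the number of bits set in both vectors divided by the number of bits set in either. The sketch below represents a fingerprint as a set of "on" bit positions (an assumption made for brevity; the 881-bit E_SCREEN vector itself is not reproduced), and it re-derives the "~23% below 0.8" figure from the counts on the slide.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index of two fingerprints given as collections of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / union


# Counts per Tanimoto range, copied from the slide (upper bound of range -> count).
COUNTS = {
    1.0: 310_725_465, 0.9: 214_747_976, 0.8: 111_954_384, 0.7: 36_448_651,
    0.6: 6_304_436, 0.5: 369_331, 0.4: 6_580, 0.3: 6, 0.2: 0,
}

total = sum(COUNTS.values())
below_08 = sum(count for upper, count in COUNTS.items() if upper <= 0.8)
share_below_08 = 100.0 * below_08 / total  # close to the ~23% quoted on the slide
```

The bins sum to about 680 million tautomers, consistent with the size of the generated set quoted above.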
  • 44. Scaffold Analysis
  • 45. Scaffold Analysis (NCI/CADD Chemical Structure Database): scaffold types
    • molecular scaffold tree (level 1, level 2, …): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58
    • archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
    • simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893
    [example structure drawings omitted]
  • 46. Scaffold Analysis (NCI/CADD Chemical Structure Database)
    • CSDB uuuuu compound set: 76.2 million
    • scaffold types: molecular scaffold tree (level 1, level 2, …), archetype scaffold, simple scaffold
    • scaffold counts shown: 8.1 million scaffolds, 6.8 million scaffolds, 0.8 million scaffolds
    [structure drawings omitted]
  • 47. Scaffold Analysis (NCI/CADD Chemical Structure Database)
    • CSDB uuuuu compound set: 76.2 million; molecular scaffold tree: 8.1 million scaffolds
    • [Chart: number of unique scaffolds per hierarchy level; axes: Hierarchy Level (1 to 10), Number of Unique Scaffolds (in millions, 0 to 8.0), Number of unique structures (in millions, 0 to 80.0)]
  • 48. Scaffold Analysis (NCI/CADD Chemical Structure Database): three example scaffolds
    • 2,281 uuuuu parent structures; 5,334 structure records in 64 databases
    • 2,726 uuuuu parent structures; 6,007 structure records in 66 databases
    • 744,469 uuuuu parent structures; 1,069,046 structure records in 66 databases
    [scaffold drawings with substituent positions R1 to R10 and per-position counts omitted]
  • 49. Atom Neighborhoods
  • 50. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database
    • reference: Filimonov D., Poroikov V., Borodina Yu., Gloriozova T. J. Chem. Inf. Comput. Sci., 1999, 39 (4), 666-670
    • MNA level 1 (example): HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC
    • MNA level 2 (example): C(C(CC-H)C(CC-C)-H(C)), C(C(CC-H)C(CN-H)-H(C)), C(C(CC-H)C(CN-H)-C(C-O-O)), C(C(CC-H)N(CC)-H(C)), C(C(CC-C)N(CC)-H(C)), N(C(CN-H)C(CN-H)), -H(C(CC-H)), -H(C(CN-H)), -H(-O(-H-C)), -C(C(CC-C)-O(-H-C)-O(-C)), -O(-H(-O)-C(C-O-O)), -O(-C(C-O-O))
    [structure drawing omitted]
  • 51. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database
    • CSDB uuuuu compound set: 76.2 million
    • unique MNAs: level 1: 13,426; level 2: 918,516
    • 1.3 billion and 2.3 billion relationships (~17 and ~30 per uuuuu parent structure)
  • 52. Multilevel Neighborhoods of Atoms (MNA), NCI/CADD Chemical Structure Database
    • CSDB uuuuu compound set: 76.2 million
    • unique MNAs: level 1: 13,426; level 2: 918,516
    • 1.3 billion and 2.3 billion relationships (~17 and ~30 per uuuuu parent structure)
    • 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
  • 53. Chemical Structure Web Services: Indexing Chemical Space (diagram): Chemical Identifier Resolver (http), NCI/CADD web services, NCI/CADD Chemical Structure Database (CSDB), CACTVS, external web services, other software packages (e.g. OPSIN)
  • 54. http://cactus.nci.nih.gov/chemical/structure Chemical Identifier Resolver NCI/CADD Web Resources http://cactus.nci.nih.gov/blog
  • 55. Acknowledgments
    • ChemNavigator: Scott Hutton, Tad Hurst
    • CADD Group, CBL, NCI: Igor Filippov
    • University of Cambridge: Daniel Lowe, Peter Murray-Rust
    • Noel O'Boyle (University College Cork, Ireland)
    • Richard Apodaca (Metamolecular)
    • Hans-Juergen Himmler
    • Thanks to all database providers!
    • Our web site: http://cactus.nci.nih.gov
  • 56. Acknowledgments - Software CACTVS Python Web Framework Python SQL Library Peter Ertl (Novartis) ChemWriter Javascript library