ACS San Francisco 2010 CINF Talk

Transcript

  • 1. NCI/CADD: Open-access chemical structure web platform. Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]. [1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS; [2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany
  • 2. NCI/CADD Public Web Services Enhanced NCI Database Browser http://cactus.nci.nih.gov/ncidb2 web service for NCI/DTP’s Open NCI Database
    • first release 1998, updated 2001
    • ~250,000 structure records
    • ~60 million data points
    Chemical Structure Lookup Service http://cactus.nci.nih.gov/lookup
    • first release 2006, updated 2008
    • ~74 million structure records (~46 million unique structures)
    • structure lookup in over 100 databases
  • 3. NCI/CADD Public Web Services
    • OSRA http://cactus.nci.nih.gov/osra/ converts graphical representations of chemical structures in journal articles, patent documents, textbooks, trade magazines, etc. into SMILES
    • Online SMILES Translator http://cactus.nci.nih.gov/translate/
    • GIF Creator for Chemical Structures http://cactus.nci.nih.gov/gifcreator/
    • PROSIT: Online Pseudorotation Tool Version 2 http://cactus.nci.nih.gov/prosit/
  • 4. http://cactus.nci.nih.gov
  • 5. New Web Services
  • 6. Chemical Structure Representations: chemical structure, NCI/CADD Identifiers, InChI/InChIKey, ChemSpider ID, PubChem SID/CID, chemical names, CAS Registry Number, NSC number, FDA UNII, ChemNavigator SID, SMILES, SD File, Chemical Formula, ChEBI ID, PDB Ligand ID, MRV, CML, SYBYL Line Notation, GIF image
  • 7. http://cactus.nci.nih.gov/chemical/structure Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier. Chemical Identifier Resolver NCI/CADD Web Resources
  • 8. Chemical Identifier Resolver http://cactus.nci.nih.gov/chemical/structure
    • first beta release: July 2009
    • second beta release: Nov. 2009
    • third beta release: April/May 2010 (beta versions will continue through 2010)
    • 3.0 million requests since July 1, 2009 (~11,000/day)
  • 9.
    • it is usable by a simple URL API:
    http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
    example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8 (MIME type: text/plain)
    XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml
    • if a request is not resolvable: HTTP 404 status
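The URL API on this slide can be exercised from a script. The following is a minimal sketch, assuming the URL pattern shown above; the helper names `build_url` and `resolve` are hypothetical and not part of the service. An HTTP 404 is mapped to `None`, following the slide's statement about unresolvable requests.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, xml=False):
    """Build <base>/<identifier>/<representation>[/xml].

    The identifier is percent-encoded so that characters such as '/', '#'
    or '=' (common in SMILES) do not break the URL path.
    """
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    return url + "/xml" if xml else url


def resolve(identifier, representation, timeout=10):
    """Return the plain-text answer, or None if the request is not resolvable."""
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # documented behaviour for unresolvable requests
            return None
        raise
```

With network access, `resolve("Tamiflu", "cas")` would issue the example request from the slide.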
  • 10. Chemical Identifier Resolver: from HTTP request to HTTP response
    • HTTP request: "identifier" (e.g. CAS number, chemical name) and "representation" (e.g. InChI, GIF image)
    • detection of the identifier type
    • identifier is a full structure representation (e.g. SMILES, InChI): calculation of the requested structure representation
    • identifier is a hashed structure representation (e.g. InChIKey), chemical name, etc.: database lookup of the structure
    • HTTP response with the matching MIME type
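The split between "calculate" and "database lookup" could be sketched as below. The regular expressions and the `classify` function are illustrative assumptions, not the resolver's actual detection logic; they only show how a full structure representation (InChI) can be told apart from hashed or registry identifiers (InChIKey, CAS number) by pattern.

```python
import re

# Assumed patterns for illustration only.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")  # 14-10-1 letter blocks
CAS_RE = re.compile(r"^\d{2,7}-\d{2}-\d$")                # e.g. 204255-11-8


def classify(identifier):
    """Return (identifier type, strategy) for a raw identifier string."""
    text = identifier.strip()
    if text.startswith("InChI="):
        # Full structure representation: the structure can be computed directly.
        return ("inchi", "calculate")
    if INCHIKEY_RE.match(text):
        # Hashed representation: the structure must be looked up.
        return ("inchikey", "lookup")
    if CAS_RE.match(text):
        return ("cas", "lookup")
    # Strings such as "CCO" may be a SMILES string or a chemical name.
    return ("name_or_smiles", "ambiguous")
```

The last branch is deliberately left ambiguous, since a short string can be valid under more than one identifier type.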
  • 11. "Chemical Structure Web Engine" [architecture diagram]. Components shown: Chemical Structure Web Engine; NCI/CADD web services, including the Chemical Identifier Resolver; NCI/CADD Chemical Structure Database (CSDB); CACTVS; external web services (HTTP); other software packages
  • 12.
    • number of structure records: 103.9 million
    • number of unique structures:
      • Std. InChIKey : ~73.0 million
      • FICuS : ~70.6 million
      • uuuuu : ~65.3 million
    • from the set of ~83.6 million unique structures we have derived about 10 million additional scaffold-type structures (for future structure searches); thus:
    • for lookup "identifier → structure" available:
      • ~92.9 million Standard InChIKeys
      • ~93.3 million NCI/CADD Identifiers
      • ~70 million chemical names linked to ~16 million structures
    • union set of unique structures (Std. InChIKey, FICuS, uuuuu): ~83.6 million
  • 13.
    • ChemNavigator iResearch Library: compilation of commercially available screening compounds from ~300 international chemistry suppliers
    • PubChem database including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …
    • Commercial Sources / others Asinex, Comgenex, …
    as of March 2010: 140 chemical structure databases; 103.9 million structure records; ~70.6 million unique structures by FICuS. Sources: ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%
  • 14.
    • based on hashcodes calculated by the chemoinformatics toolkit CACTVS
    • CACTVS hashcodes:
      • represent a chemical structure uniquely as 16-digit hexadecimal number (64-bit unsigned)
      • have a high sensitivity to structural features of a compound
      • change if connectivity changes
    NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Example hashcode: 9850FD9F9E2B4E25 [structure drawing omitted]
  • 15. Hashcodes of related forms of the example compound [structure drawings omitted]. Forms shown: charged form, tautomers, isotope, stereoisomers, salt, "errors". Hashcodes shown: A3DAE0788050DDE4, 3ECEF579D7DF025A, E92E4BA2869F3611, 8A7AD1EB498CC76A, 6C16DE2351F9FF50, 9850FD9F9E2B4E25, 8F7A1DE5A733F0E0, 60525E1AF41497B6, B2FDA68AEDA06DB9
  • 16. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Workflow: input structure (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure (MDL SDF, SMILES, database) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier
  • 17.
    • adjustable levels of sensitivity:
    NCI/CADD Structure Identifiers: Structure Normalization [example structure drawings omitted]
      • Fragments: sensitive, or un-sensitive (keep only largest organic fragment)
      • Isotopes: sensitive, or un-sensitive (ignore isotope labels)
      • Charges: sensitive, or un-sensitive (uncharge)
      • Tautomers: sensitive, or un-sensitive (find canonical tautomer)
      • Stereochemistry: sensitive, or un-sensitive (discard stereo information)
  • 18. NCI/CADD Structure Identifiers: Structure Normalization. Each of the five features (Fragments, Isotopes, Charges, Tautomers, Stereochemistry) is either sensitive or un-sensitive [structure drawings omitted]
  • 19. NCI/CADD Structure Identifiers: Structure Normalization. FICTS identifier: representation of the exact drawing. Fragments (F), Isotopes (I), Charges (C), Tautomers (T), and Stereochemistry (S) are all sensitive [pairwise structure comparisons (= / ≠) omitted]
  • 20. NCI/CADD Structure Identifiers: Structure Normalization. FICuS identifier: comes closest to how a chemist perceives a compound. Fragments (F), Isotopes (I), Charges (C), and Stereochemistry (S) are sensitive; Tautomers are un-sensitive (u) [pairwise structure comparisons (= / ≠) omitted]
  • 21. NCI/CADD Structure Identifiers: Structure Normalization. uuuuu identifier: closely related forms of the same compound. Fragments, Isotopes, Charges, Tautomers, and Stereochemistry are all un-sensitive (u) [pairwise structure comparisons (=) omitted]
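The identifier suffixes follow a simple letter scheme: one position each for Fragments, Isotopes, Charges, Tautomers, and Stereochemistry, with "u" marking an un-sensitive feature. A small sketch of that scheme, combined with the 16-digit hexadecimal form of the 64-bit hashcode, is given below. The function names are hypothetical, and the assumption that combinations other than FICTS, FICuS, and uuuuu follow the same pattern is mine; the slides show only those three.

```python
# Feature letters in the order used by the identifier names on the slides.
FEATURES = ("F", "I", "C", "T", "S")


def identifier_label(sensitive):
    """Map a set of sensitive feature letters to a label such as 'FICuS'."""
    return "".join(letter if letter in sensitive else "u" for letter in FEATURES)


def format_identifier(hashcode, sensitive):
    """Render a 64-bit unsigned hashcode as 16 hex digits plus the label."""
    if not 0 <= hashcode < 2**64:
        raise ValueError("hashcode must fit into 64 bits (unsigned)")
    return f"{hashcode:016X}-{identifier_label(sensitive)}"
```

For example, `format_identifier(0x9850FD9F9E2B4E25, {"F", "I", "C", "S"})` produces the FICuS form of the example hashcode used in the talk.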
  • 22. FICTS identifiers of the example forms (charged form, tautomers, isotope, salt, stereoisomers, "errors") [structure drawings omitted]: A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS
  • 23. FICuS identifiers of the example forms (charged form, tautomers, isotope, salt, stereoisomers, "errors") [structure drawings omitted]: A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS
  • 24. uuuuu identifiers of the example forms (charged form, tautomers, isotope, stereoisomers, salt, "errors") [structure drawings omitted]: 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu
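Counting the distinct hashcodes on slides 22 to 24 makes the effect of decreasing sensitivity concrete. The lists below are copied from those slides in the order they appear; only the hashcode part is compared, and the helper `count_distinct` is a hypothetical name.

```python
# Hashcodes of the nine example forms, as listed on slides 22 (FICTS),
# 23 (FICuS) and 24 (uuuuu).
FICTS = [
    "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
    "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
    "8A7AD1EB498CC76A", "6C16DE2351F9FF50", "9850FD9F9E2B4E25",
]
FICUS = [
    "A3DAE0788050DDE4", "E5F83F10C5DB080A", "B2FDA68AEDA06DB9",
    "9850FD9F9E2B4E25", "E5F83F10C5DB080A", "E92E4BA2869F3611",
    "8A7AD1EB498CC76A", "9850FD9F9E2B4E25", "9850FD9F9E2B4E25",
]
UUUUU = ["9850FD9F9E2B4E25"] * 9


def count_distinct(hashcodes):
    """Number of different compounds the nine drawings map to at one level."""
    return len(set(hashcodes))


for label, codes in (("FICTS", FICTS), ("FICuS", FICUS), ("uuuuu", UUUUU)):
    print(label, count_distinct(codes))
```

The nine drawings fall into 7 groups at the FICTS level, 6 at the FICuS level, and a single group at the uuuuu level.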
  • 25. NCI/CADD Chemical Structure Database
    • NCI/CADD:RID: structure records, 103.5 million
    • NCI/CADD:CID: compounds (structures unique by CACTVS HASHISY), 83.6 million
    • FICTS associations: ~72.0 million
    • FICuS associations: ~70.6 million
    • uuuuu associations: ~65.3 million
    • ~130 million linkouts to original database records
    • linked to:
    • StdInChI[Key]
    • chemical names
    • chemical formula
    • properties
    • etc.
  • 26. Chemical Identifier Resolver http://cactus.nci.nih.gov/chemical/structure
    • "identifier": chemical names, CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID/CID, FDA UNII, ChemSpider ID, ChemNavigator SID, Chemical Formula
    • "representation": /smiles, /names, /iupac_name, /cas, /inchi, /stdinchi, /inchikey, /stdinchikey, /ficts, /ficus, /uuuuu, /image, /file, /sdf, /mw, /monoisotopic_mass, /formula, /twirl, /3d, /urls, /unii, /chemspider_id, /pubchem_sid, /chemnavigator_sid
  • 27. Standard InChIKey
    • can resolve ~93.0 million Standard InChIKeys into a full structure representation:
      • http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles returns: CCO
      • http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles returns: CCO, CC[OH2+]
      • http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles returns: C(C(O)([2H])[2H])[2H], CC(O)([2H])[2H], C(CO)([2H])([2H])[2H], CC[17OH], C(CO)[2H], [14CH3]CO, CCO
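The three requests on this slide differ only in how many blocks of the Standard InChIKey are sent. A minimal sketch that derives the shorter lookup keys from a full key follows; the function names are hypothetical, and the URL pattern is the one shown on the slide.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def inchikey_lookup_keys(std_inchikey):
    """Return the full key, the key without its last block, and the first block."""
    blocks = std_inchikey.strip().split("-")
    if len(blocks) != 3:
        raise ValueError("expected an InChIKey with three hyphen-separated blocks")
    return ["-".join(blocks[:n]) for n in (3, 2, 1)]


def smiles_urls(std_inchikey):
    """Build one /smiles request URL per lookup key, most specific first."""
    return [
        f"{BASE_URL}/{quote(key, safe='')}/smiles"
        for key in inchikey_lookup_keys(std_inchikey)
    ]
```

Going from the full key to the first block alone widens the match, as the growing result lists on the slide show for ethanol.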
  • 28. File Representation
    http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/file?format=sdf
    • available formats:
      • alc: Alchemy format
      • cdxml: CambridgeSoft ChemDraw XML format
      • cerius: MSI Cerius II format
      • charmm: Chemistry at HARvard Macromolecular Mechanics file format
      • cif: Crystallographic Information File
      • cml: Chemical Markup Language
      • ctx: Gasteiger Clear Text format
      • gjf: Gaussian input data file
      • gromacs: GROMACS file format
      • hyperchem: HyperChem file format
      • jme: Java Molecule Editor format
      • maestro: Schroedinger MacroModel structure file format
      • mol: Symyx molecule file
      • sybyl2/mol2: Tripos Sybyl MOL2 format
      • mrv: ChemAxon MRV format
      • pdb: Protein Data Bank
      • sdf: Symyx Structure Data Format
      • sdf3000: Symyx Structure Data Format 3000
      • sln: SYBYL Line Notation
      • smiles: SMILES
      • xyz: xyz file format
  • 29. Structure Image Generation
    • http://cactus.nci.nih.gov/chemical/structure/buckyball/image?height=300&width=300&bgcolor=black&bondcolor=white
    • http://cactus.nci.nih.gov/chemical/structure/aspirin/image?height=200&width=200&symbolfontsize=7&footer="Aspirin"
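Both the file and the image representation take options as a URL query string. The sketch below builds such URLs with the standard library; `build_url` is a hypothetical helper, and the parameter names (`format`, `height`, `width`, `bgcolor`, `bondcolor`) are the ones that appear on slides 28 and 29.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, **params):
    """Build a resolver URL, appending keyword arguments as a query string."""
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    if params:
        # urlencode keeps the keyword order and escapes the values.
        url += "?" + urlencode(params)
    return url


image_url = build_url(
    "buckyball", "image", height=300, width=300, bgcolor="black", bondcolor="white"
)
sdf_url = build_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "file", format="sdf")
```

Keeping the options as keyword arguments avoids hand-assembling `&`-separated strings when several rendering options are combined.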
  • 30. TwirlyMol, implemented by Noel O'Boyle (University College Cork, Ireland): simple JavaScript that allows you to render a rotatable/zoomable 3D representation of a molecule in your web browser. No plugin is needed, only a modern browser. Browsers listed on the slide: Chrome, Safari, FF3.5/3.6, FF3.0, FF2.0, IE8, IE7, IE6 [per-browser support markers not recoverable]
  • 31.
    • simple viewer:
    http://cactus.nci.nih.gov/chemical/structure/restasis/twirl
    • embed into a web page:
    <div id="canvas" height="400" width="400"></div>
    <script src="http://cactus.nci.nih.gov/chemical/structure/restasis/twirl_cached/canvas"></script>
  • 32. restasis
  • 33. http://www.coronene.com/blog/ http://chemical-quantum-images.blogspot.com http://baoilleach.blogspot.com/ TwirlyMol Chemical Identifier Resolver
  • 34. Ambiguous Identifiers
    • e.g. the string "CCO" can be resolved as
      • SMILES string of "ethanol"
      • abbreviation for "Carboxymethylthio-3-(3-Chlorphenyl)-1,2,4-Oxadiazol)"
    • name a specific resolver module:
      • http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name?resolver=smiles returns: ethanol
      • http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name?resolver=name returns: 2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid
  • 35. Ambiguous Identifiers
    • e.g. the string "CCO" can be resolved as
      • SMILES string of "ethanol"
      • abbreviation for "Carboxymethylthio-3-(3-Chlorphenyl)-1,2,4-Oxadiazol)"
    XML format: http://cactus.nci.nih.gov/chemical/structure/CCO/iupac_name/xml
    <?xml version="1.0" encoding="UTF-8"?>
    <request string="CCO" representation="iupac_name">
      <data id="1" resolver="smiles" string_class="SMILES String">
        <item id="1">ethanol</item>
      </data>
      <data id="2" resolver="name" string_class="Chemical Name">
        <item id="1">2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid</item>
      </data>
    </request>
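A client can use the XML form to see every interpretation of an ambiguous identifier at once. The sketch below parses the response shown on this slide with Python's standard `xml.etree.ElementTree`; the sample document is retyped from the slide, and `interpretations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Response for /CCO/iupac_name/xml, retyped from the slide.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<request string="CCO" representation="iupac_name">
  <data id="1" resolver="smiles" string_class="SMILES String">
    <item id="1">ethanol</item>
  </data>
  <data id="2" resolver="name" string_class="Chemical Name">
    <item id="1">2-[[3-(3-chlorophenyl)-1,2,4-oxadiazol-5-yl]sulfanyl]acetic acid</item>
  </data>
</request>
"""


def interpretations(xml_text):
    """Map each resolver module to the list of item values it returned."""
    # Encode to bytes so the parser honours the declared UTF-8 encoding.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {
        data.get("resolver"): [(item.text or "").strip() for item in data.findall("item")]
        for data in root.findall("data")
    }


for resolver, values in interpretations(SAMPLE).items():
    print(resolver, values)
```

Each `data` element corresponds to one resolver module, so the dictionary keys line up with the values accepted by the `resolver=` query parameter.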
  • 36. Database URL Lookup
    • get the URL of the original structure records:
    http://cactus.nci.nih.gov/chemical/structure/restasis/urls/xml
    <?xml version="1.0" encoding="UTF-8"?>
    <request string="restasis" representation="urls">
      <data id="1" resolver="name" string_class="Chemical Name">
        <item id="1" classification="exact" database="ChemSpider" publisher="ChemSpider">http://chemspider.com/structure.4939506</item>
        <item id="2" classification="exact" database="ChemSpider" publisher="PubChem">http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?sid=43028058</item>
        <item id="3" classification="exact" database="NLM ChemIDplus" publisher="NLM">http://chem.sis.nlm.nih.gov/chemidplus/direct.jsp?result=advanced&amp;regno=059865133
        […]
      </data>
    </request>
  • 37.
    • get available names:
    http://cactus.nci.nih.gov/chemical/structure/CC(=O)Oc1ccccc1C(O)=O/names/xml
    <?xml version="1.0" encoding="UTF-8"?>
    <request string="CC(=O)Oc1ccccc1C(O)=O" representation="names">
      <data id="1" resolver="smiles" string_class="SMILES String" description="CC(=O)Oc1ccccc1C(O)=O">
        <item id="1" classification="PUBCHEM_IUPAC_NAME">2-acetyloxybenzoic acid</item>
        <item id="2" classification="PUBCHEM_IUPAC_OPENEYE_NAME">2-Acetoxybenzoic acid</item>
        <item id="3" classification="PUBCHEM_GENERIC_REGISTRY_NAME">50-78-2</item>
        <item id="4" classification="PUBCHEM_GENERIC_REGISTRY_NAME">11126-35-5</item>
        <item id="5" classification="PUBCHEM_GENERIC_REGISTRY_NAME">11126-37-7</item>
        <item id="6" classification="PUBCHEM_GENERIC_REGISTRY_NAME">2349-94-2</item>
        <item id="7" classification="PUBCHEM_GENERIC_REGISTRY_NAME">26914-13-6</item>
        <item id="8" classification="PUBCHEM_SUBSTANCE_SYNONYM">NCGC00090977-04</item>
        <item id="9" classification="PUBCHEM_SUBSTANCE_SYNONYM">KBioSS_002272</item>
        <item id="10" classification="PUBCHEM_SUBSTANCE_SYNONYM">SBB015069</item>
        <item id="11" classification="PUBCHEM_SUBSTANCE_SYNONYM">Aspirin</item>
        <item id="12" classification="PUBCHEM_SUBSTANCE_SYNONYM">D00109</item>
        […]
  • 38. /chemical/structure Blog: http://cactus.nci.nih.gov/blog
  • 39. In Development: http://cactus.nci.nih.gov/TEST_chemical/structure
  • 40.
    • manipulates the structure created from the identifier
    • new representation is calculated after structure manipulation
    "Chemical Operators": http://cactus.nci.nih.gov/chemical/structure/operator:identifier/representation
    operators: tautomers, canonical_tautomer, addh, removeh, nostereo, rings, …
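An operator is written as a prefix in front of the identifier, separated by a colon. The sketch below composes such a URL; `operator_url` is a hypothetical helper, the operator names are taken from the slide, and since the feature is described as in development, the pattern may differ from what a released service accepts.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"

# Operator names listed on the slide (the slide's list ends with an ellipsis).
KNOWN_OPERATORS = {
    "tautomers", "canonical_tautomer", "addh", "removeh", "nostereo", "rings",
}


def operator_url(operator, identifier, representation):
    """Build <base>/<operator>:<identifier>/<representation>."""
    if operator not in KNOWN_OPERATORS:
        raise ValueError(f"operator not listed on the slide: {operator}")
    return f"{BASE_URL}/{operator}:{quote(identifier, safe='')}/{representation}"
```

For instance, `operator_url("tautomers", "guanine", "smiles")` asks for the SMILES strings of the guanine tautomers.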
  • 41. Tautomers "Chemical Operator": http://cactus.nci.nih.gov/chemical/structure/tautomers:guanine/"representation" [drawings of the generated guanine tautomers omitted]
  • 42.
    • (hopefully) there will be many resolvers from different providers with different backgrounds:
      • publishers
      • commercial databases
      • free sources and databases: ChemSpider, PubChem, ChEBI, …
    • Std. InChI[Key] is the perfect tool to interlink the resolvers
    • ChemSpider and NCI/CADD are working on a test protocol for a federated InChI/InChIKey resolver
    IUPAC InChI/InChIKey Resolver
  • 43. IUPAC InChI/InChIKey Resolver IUPAC Root Resolver Resolver 1 Resolver 2 Resolver 3 Resolver 3.1 Resolver 3.2 Resolver 3.3 Clients Chemical Identifier Resolver
  • 44. http://cactus.nci.nih.gov/chemical/structure Chemical Identifier Resolver NCI/CADD Web Resources http://cactus.nci.nih.gov/blog
  • 45. Acknowledgments: ChemNavigator; Scott Hutton; Tad Hurst; CADD Group, CBL, NCI; Igor Filippov; Noel O'Boyle; Hans-Juergen Himmler (Akos). Thanks to all database providers! Our web site: http://cactus.nci.nih.gov
  • 46. Users
    • webel.py - A Cinfony module: http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html
    • IUPHAR DATABASE: http://www.iuphar-db.org
    • http://www.akosgmbh.eu/globalsearch/index.htm
    • avogadro.openmolecules.net/
    • CACTVS: http://www.xemistry.com
    • in silico toxicology: http://www.in-silico.ch/
    • Symyx Draw Resolver: http://www.symyx.com/