ACS Salt Lake City 2009 CINF Talk (InChI Symposium)

Transcript

  • 1. InChI/InChIKey vs. NCI/CADD Structure Identifiers: A Comparison. Markus Sitzmann, Computer-Aided Drug Design Group (NCI/CADD), Laboratory of Medicinal Chemistry, NCI-Frederick, NIH, DHHS
  • 2. The Adaption and Use of the IUPAC InChI/InChIKey. Chemical Structure Lookup Service: NCI/CADD Identifiers (FICTS, FICuS, uuuuu) and Std. InChI/InChIKey; 74 million structure records, 46 million unique structures
  • 3. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures
    • based on hashcodes calculated by the chemoinformatics toolkit CACTVS
    • CACTVS hashcodes:
      • represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)
      • have a high sensitivity to structural features of a compound
      • change if connectivity changes
    • example hashcode: 9850FD9F9E2B4E25 [structure drawing]
  • 4. [Figure: drawings of the example compound and related forms (charged form, tautomers, isotope, stereoisomers, salt, “errors”), each with its own hashcode. Hashcodes shown: A3DAE0788050DDE4, 3ECEF579D7DF025A, E92E4BA2869F3611, 8A7AD1EB498CC76A, 6C16DE2351F9FF50, 9850FD9F9E2B4E25, 8F7A1DE5A733F0E0, 60525E1AF41497B6, B2FDA68AEDA06DB9]
  • 5. NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures. Workflow: input structure (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure (MDL SDF, SMILES, database) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier
  • 6. NCI/CADD Structure Identifiers: Structure Normalization
    • adjustable levels of sensitivity; each feature is either sensitive or un-sensitive:
      • Fragments: un-sensitive = keep only largest organic fragment
      • Isotopes: un-sensitive = ignore isotope labels
      • Charges: un-sensitive = uncharge
      • Tautomers: un-sensitive = find canonical tautomer
      • Stereochemistry: un-sensitive = discard stereo information
    • [structure drawings]
  • 7. NCI/CADD Structure Identifiers: Structure Normalization. Fragments, Isotopes, Charges, Tautomers, Stereochemistry: each sensitive or un-sensitive [structure drawings]
  • 8. NCI/CADD Structure Identifiers: Structure Normalization. FICTS identifier: representation of the exact drawing. Fragments (F), Isotopes (I), Charges (C), Tautomers (T) and Stereochemistry (S) are all sensitive [structure drawings with =/≠ comparisons]
  • 9. NCI/CADD Structure Identifiers: Structure Normalization. FICuS identifier: comes closest to how a chemist perceives a compound. Fragments (F), Isotopes (I), Charges (C) and Stereochemistry (S) are sensitive; Tautomers are un-sensitive (u) [structure drawings with =/≠ comparisons]
  • 10. NCI/CADD Structure Identifier: Structure Normalization. uuuuu identifier: closely related forms of the same compound. Fragments, Isotopes, Charges, Tautomers and Stereochemistry are all un-sensitive (u u u u u) [structure drawings, all compared as =]
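    The tag letters on slides 8 to 10 can be read position by position. A minimal sketch, assuming that each of the five positions holds either the feature's initial letter (sensitive) or `u` (un-sensitive), as the slides suggest; the function name `tag_sensitivity` is illustrative and not part of CACTVS or any NCI/CADD tool:

```python
# Feature order follows the slide layout: F, I, C, T, S.
FEATURES = ("Fragments", "Isotopes", "Charges", "Tautomers", "Stereochemistry")


def tag_sensitivity(tag: str) -> dict:
    """Map a five-letter tag such as 'FICuS' to {feature: is_sensitive}."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"tag must have {len(FEATURES)} letters: {tag!r}")
    result = {}
    for feature, letter in zip(FEATURES, tag):
        if letter == feature[0]:
            result[feature] = True    # identifier is sensitive to this feature
        elif letter == "u":
            result[feature] = False   # feature is normalized away
        else:
            raise ValueError(f"unexpected letter {letter!r} for {feature}")
    return result


if __name__ == "__main__":
    for tag in ("FICTS", "FICuS", "uuuuu"):
        print(tag, tag_sensitivity(tag))
```

    Under this reading, FICuS differs from FICTS only in the Tautomers position.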
  • 11. NCI/CADD Structure Identifier: Structure Normalization
    • input structure
    • correct structure: add hydrogen atoms, correct functional groups, correct metal atom bonds
    • get largest fragment & uncharge: delete complex center, get largest organic fragment, delete radical center, uncharge structure
    • define canonical resonance form/protonation state
    • normalize or discard stereo information
    • define canonical tautomer
    • discard isotope labels
    • parent structures: uuuuu, uuuuS, uuuTu, uuuTS, FICuu, FICuS, FICTS, FICTu
  • 12. NCI/CADD Structure Identifier format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>
    • 9850FD9F9E2B4E25-FICTS-01-57
    • 9850FD9F9E2B4E25-FICuS-01-78
    • 9850FD9F9E2B4E25-uuuuu-01-27
    • [structure drawing]
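    The four-part layout on slide 12 can be split mechanically. The following is a sketch only: it assumes a 16-digit uppercase hexadecimal hashcode, a five-letter tag whose positions are `F/I/C/T/S` or `u`, and two-digit version and checksum fields, all inferred from the three examples on the slide. The checksum algorithm is not given in the talk, so the sketch does not verify it; `NciCaddIdentifier` and `parse_identifier` are illustrative names.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    hashcode: int   # CACTVS E_HASHISY value as a 64-bit unsigned integer
    tag: str        # e.g. "FICTS", "FICuS", "uuuuu"
    version: str    # two digits in the slide examples
    checksum: str   # two digits in the slide examples (not verified here)


# Field widths are assumptions taken from the examples on the slide.
_PATTERN = re.compile(
    r"^([0-9A-F]{16})-([Fu][Iu][Cu][Tu][Su])-(\d{2})-(\d{2})$"
)


def parse_identifier(text: str) -> NciCaddIdentifier:
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized NCI/CADD identifier: {text!r}")
    hex_hash, tag, version, checksum = match.groups()
    return NciCaddIdentifier(int(hex_hash, 16), tag, version, checksum)


if __name__ == "__main__":
    ident = parse_identifier("9850FD9F9E2B4E25-FICuS-01-78")
    print(f"{ident.hashcode:016X}", ident.tag, ident.version, ident.checksum)
```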
  • 13. FICTS identifiers for the drawings from slide 4 (charged form, tautomers, isotope, salt, stereoisomers, “errors”), as listed on the slide: A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS [structure drawings]
  • 14. FICuS identifiers for the same drawings, as listed on the slide: A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS [structure drawings]
  • 15. uuuuu identifiers for the same drawings, as listed on the slide: 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu [structure drawings]
  • 16. Std. InChIKeys for the same drawings (charged form, tautomers, isotope, stereoisomers, salt, “errors”), as listed on the slide: HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, UHPNKBYGGMJTIM-UHFFFAOYSA-M [structure drawings]
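    The InChIKeys on slide 16 share a first block when only stereo or isotope information differs. A small sketch of grouping keys by that first block follows; it relies on the Standard InChIKey layout of a 14-letter first block, a 10-letter second block and a single protonation letter, separated by hyphens. The helper names are illustrative.

```python
from collections import defaultdict


def split_inchikey(key: str) -> tuple:
    """Return (first_block, second_block, protonation_flag) of an InChIKey."""
    parts = key.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    return tuple(parts)


def group_by_first_block(keys) -> dict:
    """Collect the distinct full keys that share each first block."""
    groups = defaultdict(set)
    for key in keys:
        first_block, _, _ = split_inchikey(key)
        groups[first_block].add(key.strip())
    return dict(groups)


if __name__ == "__main__":
    # Keys copied from slide 16.
    slide_keys = [
        "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
        "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
        "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
        "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
        "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
    ]
    for block, members in group_by_first_block(slide_keys).items():
        print(block, len(members))
```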
  • 17. Structure Normalization: Tautomers. Canonical tautomer? [structure drawings of tautomeric forms]
  • 18. Tautomers (Structure Normalization)
    • CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)
    • rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS
    • types of tautomerism covered:
      • 1.3, 1.5 keto/enol
      • imine/enamine
      • imine/amine
      • lactam/lactim
      • 1.3, 1.5, 1.7, 1.11 hydrogen atom shift on (aromatic) heteroatoms
      • ketene/ynol
      • nitro/aci-nitro
      • nitroso/oxime
      • special cases: cyanic/isocyanic acid, phosphonic acid, formamidinesulfonic acid, isocyanide, furanones
      • and more …
  • 19. Tautomers (Structure Normalization): 21 SMIRKS transforms, examples:
    • transform: 1.3 keto-enol
    • [O,S,Se,Te;X1:1]=[Cx1:2][CX4R{0-2}:3] [#1:4] >> [#1:4] [O,S,Se,Te;X2:1][Cx1,cx1:2]=[C,cx1,cx0:3]
    • transform: 1.3 heteroatom H shift
    • [N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2] [N,n,S,O,Se,Te:3] [#1:4] >> [#1:4] [N,n,S,O,Se,Te:1] [NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]
    • transform: 1.5 heteroatom H shift
    • [nX2,NX2,S,O,Se,Te:1]=[C,c,nX2,NX2:6][C,c:5]=[C,c,nX2:2] [N,n,S,s,O,o,Se,Te:3] [#1:4] >> [#1:4] [N,n,S,O,Se,Te:1] [C,c,nX2,NX2:6]=[C,c:5][C,c,nX2:2]=[NX2,S,O,Se,Te:3]
  • 20. Tautomers (Structure Normalization): guanine
    • FICTS identifiers of the drawn tautomers, as listed on the slide: A6199E68A788F2F5-FICTS, 959B273B619C709F-FICTS, 61248C4A7D045A47-FICTS, 675R4FCC50F45026-FICTS, 0B345B47F6625113-FICTS, 181CA9BCE3EF47F4-FICTS, 1AD375920BE60DAD-FICTS, 67196F0B20B1D934-FICTS, BCCDA7D0CDACF120-FICTS, CE8F480C11DBFC4F-FICTS, D46A1E6500B06AB6-FICTS, D979CF9770AC0BA5-FICTS, 56FFE8B5619FB01-FICTS, F802E527EC5C61BF-FICTS, EF060DA9D97091DE-FICTS
    • FICuS identifier: BCCDA7D0CDACF120-FICuS
    • Std. InChIKey: UYTPUPDQBNUYGX-UHFFFAOYSA-N
    • [structure drawings of the guanine tautomers]
  • 21. Tautomerism & Stereochemistry (Structure Normalization): methyl propenyl ketone [structure drawings: Z and E isomers]
  • 22. Tautomerism & Stereochemistry (Structure Normalization): methyl propenyl ketone; the Z and E isomers share a tautomer [structure drawings]
  • 23. Tautomerism & Stereochemistry (InChI/InChIKey - NCI/CADD Identifier comparison): methyl propenyl ketone. 76D03F08ACDF6C0C-FICuS. FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation. [structure drawings: Z isomer, E isomer, tautomer]
  • 24. Tautomerism & Stereochemistry (InChI/InChIKey - NCI/CADD Identifier comparison): methyl propenyl ketone. 76D03F08ACDF6C0C-FICuS. FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation.
    • InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3+ (LABTWGUMFABVFG-ONEGZZNKSA-N)
    • InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4,6H,1H2,2H3/b5-4- (LYGWZVOQSCPYDG-PLNGDYQASA-N)
    • InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3- (LABTWGUMFABVFG-ARJAWSKDSA-N)
    • InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3 (LABTWGUMFABVFG-UHFFFAOYSA-N)
    • [structure drawings: Z isomer, E isomer, tautomer]
  • 25. Tautomerism & Stereochemistry (InChI/InChIKey - NCI/CADD Identifier comparison): methyl propenyl ketone. FICTS “sees” four different structures: 821D8C17ACE5040E-FICTS, 6EB4AA2BAA11965F-FICTS, 1677645190718885-FICTS, 76D03F08ACDF6C0C-FICTS [structure drawings: Z isomer, E isomer, tautomer]
  • 26. Charges in Resonance Systems (Structure Normalization). Canonical resonance structure? Problem: uncharging different resonance drawings gives different results (uncharge ≠ uncharge); different protonation states. Hashcodes shown: F3A27F03AE77A722, F3A27F03AE77A722, 62FADCB01F197FC9, 2E011EE4519F7920 [structure drawings]
  • 27. Charges in Resonance Systems (Structure Normalization)
    • generation of all formal resonance structures for a given (charged) organic compound
    • rule set of 14 transforms encoded as (CACTVS-extended) SMIRKS:
      • shifting of charges: 5 rules
      • recombination of charges: 5 rules
      • separation of charges: 4 rules
    • [structure drawings of resonance forms]
  • 28. Charges in Resonance Systems (Structure Normalization): münchnones (no plausible unpolarized resonance structure can be drawn). Transform labels shown between the drawn resonance forms: 1.2 shift, 1.2 recombination, 1.2 recombination, separation (pentavalent N atom), 1.3 shift, 1.3 shift, 1.3 recombination, 1.3 shift, 1.3 shift, 1.3 shift, 1.3 shift. IUYUGWCTOLFFCL-UHFFFAOYSA-N, F68AC07DE0D3379F-FICuS [structure drawings]
  • 29. »Chemical Structure Lookup Service« Database (InChI/InChIKey - NCI/CADD Identifier comparison): 74 million structure records (~46 million unique structures)
    • PubChem database (including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider … ): ~47%
    • ChemNavigator iResearch Library (compilation of commercially available screening compounds from ~250 international chemistry suppliers): ~43%
    • Commercial Sources / Others (Asinex, Comgenex, … ): ~10%
  • 30. Unique Structure Counts (InChI/InChIKey - NCI/CADD Identifier comparison)
    • structure records registered in CSLS: 74.2 million
    • successful calculation of:
      • Standard InChI/InChIKey: 73.8 million records
      • NCI/CADD Structure Identifiers: 73.7 million records
    • compound sets (unique chemical structure sets):
      • Standard InChI/InChIKey: 48,027,940
      • FICTS Identifier: 48,023,835
      • FICuS Identifier: 46,715,521
      • Standard InChIKey (first block): 43,055,589
      • uuuuu Identifier: 41,671,010
    • Standard InChI/InChIKeys were calculated by stdinchi-1 (Linux i386 executable) from the original SD file records
  • 31. Detailed Comparison (InChI/InChIKey - NCI/CADD Identifier comparison): original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique)
  • 32. Detailed Comparison (InChI/InChIKey - NCI/CADD Identifier comparison): original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique). Step 1: conflicts?
  • 33. Detailed Comparison (InChI/InChIKey - NCI/CADD Identifier comparison): original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique); Standard InChI/InChIKey calculated by CACTVS from FICuS compound structure. Step 1: conflicts? Step 2: same InChI/InChIKey?
  • 34. Detailed Comparison, step 1 (InChI/InChIKey - NCI/CADD Identifier comparison): no conflicts between Std. InChI/InChIKey and FICuS
    • bar labels: all structure records; FICuS linked to a single InChI/InChIKey; both linked to a single structure record; both linked to multiple structure records
    • values as extracted (structure records, million records): 62.3, 34.4, 27.9, (46.9%), (38.0%), 73.7, (84.5%)
  • 35. Detailed Comparison, step 1 (InChI/InChIKey - NCI/CADD Identifier comparison): conflicts between Std. InChI/InChIKey and FICuS
    • bar labels: all structure records; FICuS is linked to multiple InChI/InChIKeys or vice versa; one FICuS is linked to multiple InChI/InChIKeys; one InChI/InChIKey is linked to multiple FICuS
    • values as extracted (structure records, million records): 10.4, 3.6, 6.8, (4.6%), (9.3%), (84.5%), 73.7
  • 36. Detailed Comparison, step 1 (InChI/InChIKey - NCI/CADD Identifier comparison): conflicts between Std. InChI/InChIKey and FICuS
    • bar labels: all structure records; FICuS is linked to multiple InChI/InChIKeys or vice versa; one FICuS is linked to multiple InChI/InChIKeys; one InChI/InChIKey is linked to multiple FICuS
    • values as extracted (structure records, million records): 10.4, 3.6, 6.8, (4.6%), (9.3%), (84.5%), 73.7
    • number of InChIKeys first block, values as extracted: 0.9 and 2.3, (1.2%) and (3.1%)
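    The two conflict types on slides 35 and 36 (one FICuS linked to several InChIKeys, and one InChIKey linked to several FICuS) can be found with two reverse indexes. This is a sketch under the assumption that each structure record has already been reduced to a `(ficus, inchikey)` pair; the function name and the toy identifiers in the example are made up for illustration and do not come from the talk's data.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Return (ficus_to_many_keys, key_to_many_ficus) for (ficus, inchikey) pairs."""
    keys_per_ficus = defaultdict(set)
    ficus_per_key = defaultdict(set)
    for ficus, inchikey in pairs:
        keys_per_ficus[ficus].add(inchikey)
        ficus_per_key[inchikey].add(ficus)
    # A conflict is any identifier linked to more than one partner identifier.
    ficus_to_many_keys = {f: k for f, k in keys_per_ficus.items() if len(k) > 1}
    key_to_many_ficus = {k: f for k, f in ficus_per_key.items() if len(f) > 1}
    return ficus_to_many_keys, key_to_many_ficus


if __name__ == "__main__":
    # Hypothetical toy records, not real identifiers.
    records = [
        ("F1", "K1"), ("F1", "K1"),   # consistent: no conflict
        ("F2", "K2"), ("F2", "K3"),   # one FICuS, two InChIKeys
        ("F3", "K4"), ("F4", "K4"),   # one InChIKey, two FICuS
    ]
    one_ficus_many_keys, one_key_many_ficus = find_conflicts(records)
    print(one_ficus_many_keys)
    print(one_key_many_ficus)
```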
  • 37. Detailed Comparison, step 2: same InChI/InChIKey? (InChI/InChIKey - NCI/CADD Identifier comparison)
    • compounds (unique structures, million records) for FICuS, FICTS, uuuuu: 46.7, 48.0, 41.6
    • InChI changes (compounds): 6.4 (13.7%), 3.8 (7.9%), 11.9 (28.6%)
    • structure records (million records): all records 73.7; InChI changes, values as extracted: 9.3, 4.6, (29.7%), 21.9, (6.2%), (12.7%)
  • 38. Detailed Comparison, step 2: same InChI/InChIKey? (InChI/InChIKey - NCI/CADD Identifier comparison)
    • compounds (unique structures, million records) for FICuS, FICTS, uuuuu: 46.7, 48.0, 41.6
    • InChI changes (compounds): 6.4 (13.7%), 3.8 (7.9%), 11.9 (28.6%)
    • structure records (million records): all records 73.7; InChI changes, values as extracted: 9.3, 4.6, (29.7%), 21.9, (6.2%), (12.7%)
    • vs. InChIKey first block, values as extracted: 3.2, 6.3, (7.6%), (8.4%)
  • 39. Detailed Comparison (InChI/InChIKey - NCI/CADD Identifier comparison): compound classification, occurrence in FICuS set vs. occurrence in FICuS subset (InChI changes)
    • classes: (formal) tautomer count > 1; (formal) tautomer count > 3; (formal) tautomer count > 10; full stereo; contains metal atoms; metal complexes; salt; has resonance charges; inorganic
    • values as extracted: 14.5%, 18.5%, 28.9%, 16.9%, 34.5%, 52.1%, 18.6%, 52.1%, 33.9%, 56.4%, 25.4%, 5.5%, 25.7%, 0.8%, 0.2%, 1.0%, 0.2%, 0.1%
  • 40. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215 [structure drawing]
    • FICuS: 12 different structure records are linked to this structure
    • Std. InChI/InChIKey (stdinchi-1): calculates 3 different strings/keys for these 12 structure records (all have the same connectivity layer/first block)
    • all 3 of these StdInChI/InChIKeys differ from the StdInChI/InChIKey calculated after FICuS normalization (including connectivity layer/first block)
  • 41. ChemBlock A3422/0145215 [structure drawings: Z and E isomers]
  • 42. ChemBlock A3422/0145215 [structure drawings: Z and E isomers; tautomer]
  • 43. ChemBlock A3422/0145215 [structure drawings: Z and E isomers; tautomer; tautomeric interconversion?]
  • 44. ChemBlock A3422/0145215 [structure drawings: Z and E isomers; tautomer; tautomeric interconversion? (twice); S and R forms]
  • 45. ChemBlock A3422/0145215 [structure drawings: Z and E isomers; tautomer; tautomeric interconversion? (twice); S and R forms]
  • 46. ChemBlock A3422/0145215: How many structures? Records shown: ZINC04685909, ChemBlock A3422/0145215, ChemNavigator 47748165, NIST MS-Lib 1967005690, ChemNavigator 34903393, ChemNavigator 65635274 [structure drawings: Z and E isomers; tautomer; tautomeric interconversion? (twice); S and R forms]
  • 47. ChemBlock A3422/0145215: How many structures? InChIKey A, InChIKey B, InChIKey C: same connectivity layer/block. FICuS parent structure. [structure drawings: Z and E isomers; tautomer; tautomeric interconversion? (twice); S and R forms]
  • 48. Dithiazinine (InChI/InChIKey - NCI/CADD Identifier comparison) [structure drawing: original structure]
  • 49. Dithiazinine (InChI/InChIKey - NCI/CADD Identifier comparison) [structure drawings: original structure; best representation]
  • 50. Dithiazinine (InChI/InChIKey - NCI/CADD Identifier comparison) [structure drawings: original structure; best representation; InChI and FICuS representations with Z/E double-bond labels]
  • 51. The Adaption and Use of the IUPAC InChI/InChIKey. Chemical Structure Lookup Service (http://cactus.nci.nih.gov/lookup): NCI/CADD Identifiers (FICTS, FICuS, uuuuu) and Std. InChI/InChIKey; 74 million structure records, 46 million unique structures
  • 52. Web Service: Chemical Structure REST Service (beta)
    • URL scheme: http://cactus.nci.nih.gov/chemical/structure/{identifier}/{method}
    • returns plain text/gif image; if the structure identifier is not resolvable: HTTP 404 status code
    • examples:
      • http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
      • http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names
      • http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/ficus
      • http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/stdinchi
      • http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/image
      • http://cactus.nci.nih.gov/chemical/structure/ethanol/stdinchikey
      • http://cactus.nci.nih.gov/chemical/structure/64-17-5/stdinchikey
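    The URL scheme on slide 52 can be exercised with a few lines of Python. This is a hedged sketch: the base URL, the method names and the 404 behaviour are taken from the slide, which describes a beta service, so current availability and responses are assumptions. The function names `structure_url` and `resolve` are illustrative.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

# Base URL as shown on the slide (beta service at the time of the talk).
BASE = "http://cactus.nci.nih.gov/chemical/structure"


def structure_url(identifier: str, method: str) -> str:
    """Build {BASE}/{identifier}/{method}, percent-encoding unsafe characters."""
    # '=' is kept so that 'InChIKey=...' identifiers stay readable.
    return f"{BASE}/{quote(identifier, safe='=')}/{quote(method, safe='')}"


def resolve(identifier: str, method: str, timeout: float = 10.0):
    """Fetch a plain-text result; return None if the identifier is not resolvable."""
    try:
        with urllib.request.urlopen(structure_url(identifier, method), timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:   # the slide states 404 for unresolvable identifiers
            return None
        raise


if __name__ == "__main__":
    print(structure_url("ethanol", "stdinchikey"))
    print(structure_url("64-17-5", "stdinchikey"))
```

    `resolve` is intended for the text-returning methods only; the `image` method returns a GIF according to the slide and would need the raw bytes instead.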
  • 53. Acknowledgments
    • ChemNavigator: Scott Hutton, Tad Hurst
    • CADD Group, LMC, NCI: Marc Nicklaus, Igor V. Filippov
    • CACTVS, Xemistry GmbH: Wolf-Dietrich Ihlenfeldt
    • Thanks to all database providers
    • Our web site: http://cactus.nci.nih.gov