EUGM 2013 - David Deng, Daniel Bonniot (ChemAxon) - What’s New with Naming

ChemAxon’s Naming provides reliable conversion between English names and chemical structures. It is the underlying technology used in ChemAxon’s chemical text mining tool D2S (Document-to-Structure), JChem for SharePoint, and Chemicalize.org. This presentation highlights the latest enhancements, including Chinese chemical name recognition to accommodate the fast-growing Chinese scientific literature, custom corporate ID to structure conversion using a web service, and database indexing of structures from document repositories.

Published in: Technology, Business
Transcript of "EUGM 2013 - David Deng, Daniel Bonniot (ChemAxon) - What’s New with Naming"

  1. What’s New with Naming?
     Wei Deng (David), Daniel Bonniot
     ChemAxon European UGM, May 2013
  2. Naming
     • Structure to Name
     • Name to Structure
     • Document to Structure
     • Document to Database
     • JChem for SharePoint
  3. Structure to “Name” (S2N)
     • 2-(acetyloxy)benzoic acid
     • Aspirin
  4. Structure to “Name” (S2N)
     • 2-(acetyloxy)benzoic acid
     • Aspirin
     • 50-78-2, 11126-35-5, 11126-37-7, 2349-94-2, 26914-13-6, 98201-60-6
  5. “Name” to Structure (N2S)
     • 2-(acetyloxy)benzoic acid
     • Aspirin
     • Acetylsalicylate
     • Easprin …
     • 50-78-2
  6. Document to Structure (D2S)
     • O=C(Oc1ccccc1C(=O)O)C
     • InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
     • 2-(acetyloxy)benzoic acid
     • Aspirin
     • Acetylsalicylate
     • Easprin …
     • 50-78-2
  7. Document to Structure
     • Extract chemical information from documents
       – Names, SMILES, InChI, CAS number …
       – Embedded objects
       – Structure images
         Support: OSRA currently; multiple OSR engines (CLiDE, Imago…) in 6.1
       – Works with scanned non-searchable PDF
       – Returns structures and their locations in the document
       – Correct OCR errors
       – Supported formats: PDF, text, XML, HTML, MS Office document (doc, docx, ppt, pptx, xls, xlsx), OpenOffice
     Diagram: non-searchable chemical patent documents → D2S → structure (text + image) + location
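
CAS numbers, one of the identifier types listed for extraction on slide 7, end in a check digit, which gives a text-mining step a cheap way to reject false matches. The sketch below is illustrative only: the slides do not say whether D2S performs this check, and the names `CAS_PATTERN`, `is_valid_cas`, and `find_cas_numbers` are hypothetical.

```python
import re

# A CAS number is 2-7 digits, 2 digits, and 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate):
    """Return True if the candidate has CAS syntax and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... counting from the right; the check digit
    # is the weighted sum modulo 10.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas_numbers(text):
    """Return every substring of text that is a valid CAS number (hypothetical helper)."""
    return [m.group() for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group())]
```

For instance, `find_cas_numbers("Aspirin (50-78-2), not 50-78-3")` keeps only the first candidate, because the second one fails the checksum.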
  8. OCR Error Correction
     (2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate
     Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → ?-benzyl-?-[3-(?H-tetrazol-5-yl)phenyl]propanamide
  9. OCR Error Correction
     (2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate
     Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
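
The two OCR examples on slides 8 and 9 can be reproduced with a simple generate-and-validate loop: try plausible character substitutions and keep the first candidate that a name checker accepts. This is a minimal sketch and not ChemAxon's algorithm, which the slides do not describe. The confusion table is read off the two examples, and `VOCABULARY` with `looks_valid` is a toy stand-in for a real name-to-structure parser.

```python
import itertools
import re

# Hypothetical confusion pairs, taken from the two examples on the slides.
CONFUSIONS = {"rn": "m", "0": "o", "1": "l", "l": "1", "Λr": "N"}
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CONFUSIONS, key=len, reverse=True))
)

# Toy stand-in for a real name parser: the letter runs we accept as valid.
VOCABULARY = {
    "R", "N", "H", "methylsulfanyl", "hydroxybutanedioate",
    "benzyl", "tetrazol", "yl", "phenyl", "propanamide",
}


def looks_valid(name):
    """Accept a name only if every run of letters is a known fragment."""
    tokens = re.findall(r"[^\W\d_]+", name)
    return bool(tokens) and all(token in VOCABULARY for token in tokens)


def candidates(text):
    """Yield every string obtained by fixing any subset of the confusable spots."""
    pieces, pos = [], 0
    for match in PATTERN.finditer(text):
        pieces.append([text[pos:match.start()]])               # untouched text
        pieces.append([match.group(), CONFUSIONS[match.group()]])  # keep or replace
        pos = match.end()
    pieces.append([text[pos:]])
    for combo in itertools.product(*pieces):
        yield "".join(combo)


def correct(text):
    """Return the first candidate the validator accepts, or None."""
    return next((c for c in candidates(text) if looks_valid(c)), None)
```

The number of candidates doubles with every confusable position, so a practical system would need to prune the search; the sketch enumerates exhaustively only because the example names are short.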
  10. Document to Structure Demo
  11. Document to Database
  12. JChem for SharePoint
      • SharePoint 2010 and 2007
        – Sketch, import/export, store structures
        – Structure search
        – Calculate properties and naming
        – Filtering and sorting
      • New improvement
        – Index and search chemical information in documents
          • Text
          • Embedded structure object
        – Connect SharePoint to your chemical database
  13. Free Online Service Chemicalize.org
      • Extract
      • Interactively display
      • Calculate
      • Search
      Recently reviewed in J. Chem. Inf. Model., 2012, 52 (2), pp 613–615
  14. Webpage - Chemicalized
  15. PDF File - Chemicalized
  16. Growing IP Filing in China
      The Economist, Dec 2012
  17. Name to Structure (N2S, Chinese)
      • 2-(乙酰氧基)苯甲酸
      • 阿司匹林
      • 2-(acetyloxy)benzoic acid
      • Aspirin
      • Acetylsalicylate
      • Easprin …
      • 50-78-2
  18. In Fact, Even without CN2S…
      • 阿司匹林
      Diagram labels: N2S, CN2S, Customized Dictionary
  19. Customized Dictionary
      • A SMILES file “custom_names.smi”
      • Default location: ChemAxon DIR, e.g. in Windows 7 C:\Users\USERNAME\chemaxon
      • Format: SMILES, Tab, ANY text string
      Example: c1ccccc1 CXN000001
  20. Custom Dictionary Demo
  21. Customized Dictionary
      • A SMILES file “custom_names.smi”
      • Default location: ChemAxon DIR, e.g. in Windows 7 C:\Users\USERNAME\chemaxon
      • Format: SMILES, Tab, ANY text string
      • From Version 6.0, a custom web service can also be used
      Example: c1ccccc1 CXN000001
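
The dictionary format on slides 19 and 21 (a SMILES string, a Tab, then any text) is simple enough to read with a few lines of code. The sketch below only illustrates the file layout; it is not how ChemAxon's tools load the file. `load_custom_names` is a hypothetical helper, and skipping blank lines and rejecting lines without a Tab are assumptions. The sample pairs the benzene entry from the slide with the Chinese name for aspirin from slide 18 and the SMILES from slide 6.

```python
import io


def load_custom_names(lines):
    """Map each custom name to its SMILES from 'SMILES<Tab>name' lines."""
    names = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        smiles, tab, name = line.partition("\t")
        if not tab or not name:
            raise ValueError(f"expected 'SMILES<Tab>name', got {line!r}")
        names[name] = smiles
    return names


# In-memory stand-in for a custom_names.smi file.
SAMPLE = "c1ccccc1\tCXN000001\nO=C(Oc1ccccc1C(=O)O)C\t阿司匹林\n"
DICTIONARY = load_custom_names(io.StringIO(SAMPLE))
```

Because the name is everything after the first Tab, corporate IDs such as CXN000001 and names in any script can share one file.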
  22. Naming Web Service Demo
  23. Dictionary is Limited
      • 2-(乙酰氧基)苯甲酸
      Diagram labels: N2S, Customized Dictionary
  24. The Real CN2S
      • 2-(acetyloxy)benzoic acid
      • 2-(乙酰氧基)苯甲酸
      Diagram label: CN2S
  25. Mapping Chinese Names to English
      Chinese  English      Chinese  English
      甲       meth-        苯       benzene
      乙       eth-         吲哚     indole
      丙       prop-        硝基     nitro-
      丁       buta-        盐       salt
      氧       oxy-         环       cyclo-
      氢       hydrogen-    二       bi-
      硫       thio-        三       tri-
      酸       acid         顺       cis-
      醇       -ol          反       trans-
      酯       ester        异       iso-
  26. The Real CN2S
      2-(乙酰 氧基) 苯甲酸
      2-(acetyl oxy) benzoic acid
  27. The Challenges
      1. Chinese texts have no spaces
      2. Ester & Salt
         乙酸乙酯 = Ethyl Acetate
  28. The Challenges
      3. English: name alterations
         丁烷 → buta + ane → butane
      4. Chinese: many characters have different meanings
         盐 = salt
         酸 = acid
         盐酸 = hydrochloric acid
  29. The Challenges
      5. Chinese names are usually abbreviated
         苯 = benzene
         苯基 = phenyl
      There are many other challenges; overall solution:
      make our N2S more tolerant to mistakes
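
Two of the challenges above, text without spaces and characters whose meaning depends on their neighbours, can be illustrated with greedy longest-match segmentation against a fragment dictionary. This is a swapped-in textbook technique used as a sketch, not the CN2S implementation, which the slides do not detail. The fragment table holds only pairs shown on slides 25 to 29, and `segment` is a hypothetical helper.

```python
# Chinese fragment -> English fragment, using only pairs shown on the slides.
FRAGMENTS = {
    "乙酰": "acetyl",
    "氧基": "oxy",
    "苯甲酸": "benzoic acid",
    "苯基": "phenyl",
    "苯": "benzene",
    "盐酸": "hydrochloric acid",
    "盐": "salt",
    "酸": "acid",
}
MAX_LEN = max(len(key) for key in FRAGMENTS)


def segment(name):
    """Translate a name fragment by fragment, always taking the longest match."""
    out, i = [], 0
    while i < len(name):
        # Try the longest possible fragment first, then shorter ones.
        for length in range(min(MAX_LEN, len(name) - i), 0, -1):
            piece = name[i:i + length]
            if piece in FRAGMENTS:
                out.append(FRAGMENTS[piece])
                i += length
                break
        else:
            # No fragment starts here: pass locants and punctuation through.
            out.append(name[i])
            i += 1
    return out
```

With this table, `segment("盐酸")` yields hydrochloric acid as one fragment instead of salt followed by acid, and `segment("2-(乙酰氧基)苯甲酸")` reproduces the split shown on slide 26. The sketch does not reorder fragments, so ester and salt names such as 乙酸乙酯 (ethyl acetate) would need extra handling.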
  30. Accuracy Result
      • Test set: 38,600 Chinese names + CAS number
      • Contains unusual, incorrect, ambiguous names, radicals, inorganic salts
      • Conversion rate = 58–78%
      • Accuracy = 91%
      • Look for another test set from Chinese patents
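
The slide reports a conversion rate and an accuracy without defining them. One plausible reading, stated here as an assumption, is that the conversion rate is the share of names for which any structure is produced, and accuracy is the share of those converted names whose structure matches the reference. The sketch below computes both under that reading; `conversion_metrics` and the toy records are hypothetical and are not the test set from the talk.

```python
def conversion_metrics(results):
    """Return (conversion_rate, accuracy) for (predicted, expected) pairs.

    predicted is None when no structure was produced. Assumed definitions:
    conversion_rate = converted / total, accuracy = correct / converted.
    """
    total = len(results)
    converted = [(pred, exp) for pred, exp in results if pred is not None]
    correct = sum(1 for pred, exp in converted if pred == exp)
    conversion_rate = len(converted) / total if total else 0.0
    accuracy = correct / len(converted) if converted else 0.0
    return conversion_rate, accuracy


# Toy records; structures are compared as plain strings, which assumes both
# sides use the same canonical form.
TOY_RESULTS = [
    ("c1ccccc1", "c1ccccc1"),                            # converted, correct
    ("O=C(Oc1ccccc1C(=O)O)C", "O=C(Oc1ccccc1C(=O)O)C"),  # converted, correct
    ("CCO", "CCN"),                                      # converted, wrong
    (None, "CC(=O)O"),                                   # not converted
]
RATE, ACCURACY = conversion_metrics(TOY_RESULTS)
```

Under these definitions the two numbers are independent: a parser can convert few names while still getting most of the converted ones right.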
  31. Version History
                N2S     D2S: Text only   D2S: Text + Image   D2S: OCR + Image
      English   5.1     5.2.5            5.6                 5.8
      Chinese   5.12    5.12             5.12                6.x
  32. Chinese D2S Demo
  33. Chinese Patent Document
      <p id="d66" num="067">将0.76g(2.9mmol)5-乙氧基-4-甲基-2-苯氧基羰基-2,4-<br/>二氢-3H-1,2,4-三唑-3-酮溶于40ml乙腈中,在室温(约20℃)、搅<br/>拌下,以每次少量的方式与0.75g(3.2mmol)4-甲氧基羰基-2-甲<br/>基噻吩-3-磺酰胺和0.49g(3.2mmol)1,8-二氮杂二环[5.4.0]十<br/>一碳-7-烯(DBU)混合。将该反应混合物在室温搅拌12小时,然后减<br/>压浓缩。将残余物置于二氯甲烷中,依次用1N盐酸和水洗涤,用硫<br/>酸钠干燥,并过滤。将滤液在水泵真空下浓缩,将残余物用异丙醇蒸<br/>煮,通过抽滤分离出所得结晶产物。<br/></p>
  34. Upcoming in 6.1: Chinese D2S
      <p id="d66" num="067">将0.76g(2.9mmol)5-乙氧基-4-甲基-2-苯氧基羰基-2,4-<br/>二氢-3H-1,2,4-三唑-3-酮溶于40ml乙腈中,在室温(约20℃)、搅<br/>拌下,以每次少量的方式与0.75g(3.2mmol)4-甲氧基羰基-2-甲<br/>基噻吩-3-磺酰胺和0.49g(3.2mmol)1,8-二氮杂二环[5.4.0]十<br/>一碳-7-烯(DBU)混合。将该反应混合物在室温搅拌12小时,然后减<br/>压浓缩。将残余物置于二氯甲烷中,依次用1N盐酸和水洗涤,用硫<br/>酸钠干燥,并过滤。将滤液在水泵真空下浓缩,将残余物用异丙醇蒸<br/>煮,通过抽滤分离出所得结晶产物。<br/></p>
  35. CN2S on Chinese Patents Demo
  36. One Door → Many Other Doors
      • Ácido 2-(acetiloxi)-benzoico
      • 2-Acetoxybenzoesäure
      • Acide 2-acétyloxybenzoïque
      • 2-(乙酰氧基)苯甲酸
      • 2-アセトキシ安息香酸
      • 2-아세톡시-벤조산
      • 2-ацетилоксибензойная кислота
      Diagram labels: CN2S, JN2S, KN2S, RN2S, FN2S, GN2S, SN2S
  37. The Other Half
      2-(乙酰氧基)苯甲酸
      Diagram label: CS2N
  38. But English and Chinese Names are Different, Really
      (1S,3S)-1-bromo-1-chloro-3-ethyl-3-methylcyclohexane
      Fundamental Organic Chemistry I, Qiyi Xing et al., Ed. 3, Page 44
      (1S,3S)-1-甲基-1-乙基-3-氯-3-溴环己烷
      基础有机化学(上),邢其毅等,第三版,第44页
  39. Acknowledgement
      Daniel Bonniot
      邓巍 Wei Deng (David)
      Diagram label: N2S
  40. Please Help Us Make it Better
      Evaluators Welcome
