Advances in Automatic Chemical Spelling Correction

Presented by Roger Sayle at the ACS National Meeting Philadelphia Fall 2012

  1. Advances in Automatic Chemical Spelling Correction
     Roger Sayle and Daniel Lowe, NextMove Software, Cambridge, UK
     ACS National Meeting, Philadelphia, USA, 19th August 2012
  2. Example spelling errors
     • Sample misspellings of pyrimidine in US patent grants since the beginning of this year.

       Incorrect Name   US Patent No.   Issue Date
       pryimidine       8093264         10th January 2012
       pyrmidine        8097728         17th January 2012
       pyrimdine        8114996         14th February 2012
       pyrimidne        8129897         6th March 2012
       pyrimidinc       8148339         3rd April 2012
       pyridmidine      8158627         8th May 2012
  3. Previous work
     • Roger Sayle, Paul Hongxing Xie and Sorel Muresan, “Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction”, Journal of Chemical Information and Modeling, Vol. 52, No. 1, pp. 51-62, 2012.
     • G.H. Kirby, M.R. Lord, J.D. Rayner, “Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 6. (Semi)automatic name correction”, Journal of Chemical Information and Computer Sciences, Vol. 31, pp. 153-160, 1991.
  4. String edit distance
     • The Levenshtein Distance (Levenshtein 1965) is the minimum number of edits (insertions, deletions or substitutions) required to transform one string into another.
     • The Damerau-Levenshtein Distance is an extension of Levenshtein Distance to include transposition of two adjacent characters.
     • The distances can be efficiently computed with dynamic programming using the Needleman-Wunsch-Sellers alignment algorithm (bioinformatics).
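The two distances defined on this slide can be sketched with a standard dynamic-programming table. This is a minimal illustration, not the CaffeineFix implementation: the function name `edit_distance` and its `transpositions` flag are hypothetical, all edits cost 1, and the transposition variant is the restricted ("optimal string alignment") form of Damerau-Levenshtein.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Unit-cost edit distance between a and b, computed row by row."""
    prev2 = None                      # row i-2, needed only for transpositions
    prev = list(range(len(b) + 1))    # row 0: insert every character of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                     # column 0: delete every character of a
        for j, cb in enumerate(b, start=1):
            best = min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (0 if ca == cb else 1),  # match or substitution
            )
            # Swap of two adjacent characters (Damerau extension).
            if (transpositions and i > 1 and j > 1
                    and ca == b[j - 2] and a[i - 2] == cb):
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        prev2, prev = prev, cur
    return prev[len(b)]


print(edit_distance("pyrmidine", "pyrimidine"))                       # 1
print(edit_distance("chlorofrom", "chloroform"))                      # 2
print(edit_distance("chlorofrom", "chloroform", transpositions=True)) # 1
```

Keeping only the last two or three rows is enough to obtain the distance; a full table is needed only if the alignment itself has to be recovered.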
  5. Needleman-Wunsch-Sellers
     [Figure: dynamic-programming alignment matrix of “cloramphenical” against “chloramphenicol”]
  6. Needleman-Wunsch-Sellers
     [Figure: dynamic-programming alignment matrix of “cloramphenical” against “chloramphenicol”]
  7. Further complications
     • Edit operations can have their own specific penalties.
     • The latest implementation supports transpositions, to catch spelling mistakes such as “chlorofrom”.
     • Some dictionaries to match against have tens of millions of entries, others are infinite.
     • The start and end of the input aren’t known in free text and are assigned based on the quality of the match.
     • Correct nesting of parentheses and brackets needs to be enforced as part of the matching process.
     • In summary: it’s mind-bogglingly complicated.
  8. Dictionaries as automata
     • Nitrogen-containing heterocycles as minimal DFA:
       – Pyrrole, Pyrazole, Imidazole, Pyridine, Pyridazine, Pyrimidine, Pyrazine
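One way to picture a dictionary stored as a minimal DFA is to build a trie and then merge states that accept the same set of suffixes. The sketch below does this for the seven heterocycle names; it is an illustration under assumptions, not how CaffeineFix builds its automata, and the names `build_trie`, `minimise`, `count_states` and `accepts` are hypothetical.

```python
END = "$"  # marker key for accepting states (assumed absent from the words)

def build_trie(words):
    """Nested dicts: each key is a character, each value the next state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def minimise(node, registry):
    """Merge equivalent states bottom-up; returns the canonical state."""
    for ch in list(node):
        if ch != END:
            node[ch] = minimise(node[ch], registry)
    # Two states are equivalent if they have the same finality and the
    # same outgoing edges to the same (already canonical) children.
    signature = tuple(sorted(
        (ch, True if ch == END else id(child)) for ch, child in node.items()
    ))
    return registry.setdefault(signature, node)

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(c for ch, c in node.items() if ch != END)
    return len(seen)

def accepts(root, word):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node

WORDS = ["pyrrole", "pyrazole", "imidazole", "pyridine",
         "pyridazine", "pyrimidine", "pyrazine"]
trie = build_trie(WORDS)
states_before = count_states(trie)
dfa = minimise(trie, {})          # note: rewrites the trie in place
states_after = count_states(dfa)
print(states_before, "trie states ->", states_after, "DFA states")
```

Shared endings such as “azole”, “idine” and “azine” collapse into single paths, which is why the minimal automaton is smaller than the trie while accepting exactly the same words.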
  9. Example IUPAC-like grammar
     • More generally still, CaffeineFix FSMs can represent formal grammars, i.e. infinite dictionaries.

         alk    := “meth” | “eth” | “prop” | “but”
         parent := alk “ane”
         subst  := “bromo” | “chloro” | “fluoro”
         locant := “1” | “2” | “3” | “4”    /* any digit */
         prefix := [ prefix “-” ] [ locant “-” ] subst
                 | [ prefix ] subst
         name   := [ prefix [ “-” ] ] parent
  10. IUPAC-like grammar examples
      • methane
      • chloroethane
      • 2-bromo-propane
      • chloro-bromo-methane
      • 1-fluoro-2-chloro-ethane
      • chlorofluoromethane
      • 4-bromomethane
      • 1-chloro-1-chloro-1-chloro-methane
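Because the toy grammar is regular (its only recursion is on the left of `prefix`), it can be rewritten as a regular expression and checked against the example names. The translation below is my own reading of the grammar, assuming that a locant is any single digit and that the rule's `loc` means `locant`; the constant names are hypothetical.

```python
import re

SUBST = r"(?:bromo|chloro|fluoro)"
LOCANT = r"[0-9]"                        # the slide's comment says "any digit"
UNIT = rf"(?:{LOCANT}-)?{SUBST}"         # substituent with optional locant
# A prefix is one unit followed by further substituents, each either joined
# by a hyphen (with optional locant) or appended directly with no locant.
PREFIX = rf"{UNIT}(?:-{UNIT}|{SUBST})*"
PARENT = r"(?:meth|eth|prop|but)ane"
NAME = re.compile(rf"(?:{PREFIX}-?)?{PARENT}")

EXAMPLES = [
    "methane", "chloroethane", "2-bromo-propane", "chloro-bromo-methane",
    "1-fluoro-2-chloro-ethane", "chlorofluoromethane", "4-bromomethane",
    "1-chloro-1-chloro-1-chloro-methane",
]
for name in EXAMPLES:
    print(name, bool(NAME.fullmatch(name)))
```

The `*` in `PREFIX` plays the role of the backward edges in the automaton: a name such as `"chloro" * 50 + "methane"` is accepted, so the set of matched words is infinite. As the examples show, the grammar is purely syntactic and happily accepts chemically meaningless names such as 4-bromomethane.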
  11. Representing grammars as DFAs
      • Backward edges allow matching an infinite number of words.
  12. Current IUPAC grammar FSM
      • As of July 2012, the current CaffeineFix grammar contains nearly 1.2 million edges.
      • This grammar covers...
        – 99.15% (232144/234142) names in the NCI00 database.
        – 95.28% (67995/71367) names in the Maybridge catalogue.
        – 95.27% (45890/48167) names in the Keyorganics catalogue.
      • These figures are comparable to name-to-structure conversion rates on these names.
  13. Spelling correction examples
      • lH-ben zimidazole → 1H-benzimidazole
      • triphenylposhine → triphenylphosphine
      • 4- (2-ADAMANTYLCARBAM0YL) -5-TERT-BUTYL- PYRAZOL-1-YL] BENZOIC ACID → 4-(2-adamantylcarbamoyl)-5-tert-butyl-pyrazol-1-yl]benzoic acid
      • didec-2-ene → dodec-2-ene
      • spiro[2.2]hexane → spiro[2.3]hexane
  14. Low cost “frequent” edit ops
      • A number of common corrections are so frequent as to be given a lower (free) cost.
        1. Deletion of whitespace.
        2. Deletion of a hyphen (where not anticipated).
        3. Substitution of “l” (lower case el) for “1” (one).
        4. Substitution of “I” (upper case eye) for “l” (el) or “1” (one).
        5. Substitution of “rn” by “m”.
        6. Substitution of “1” (one) by “l” (el).
        7. Substitution of “φ” by “rp” [OCR artifact].
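The effect of discounted edit operations can be sketched as a weighted edit distance from the text as found to a dictionary entry. Everything specific here is an assumption for illustration: the function name `correction_cost`, the weight `LOW = 0.25` (the slide says only "lower (free)"), the direction of each confusable pair, and the choice to cover items 1 to 6 but not the “φ” rule.

```python
LOW = 0.25    # hypothetical discounted cost for the "frequent" operations
FULL = 1.0    # cost of any other edit

# (character in the text, character in the dictionary entry)
CONFUSABLE = {("l", "1"), ("1", "l"), ("I", "l"), ("I", "1")}

def correction_cost(text: str, word: str) -> float:
    """Cheapest cost of turning `text` into the dictionary entry `word`."""
    n, m = len(text), len(word)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:   # delete a character of the text
                cost = LOW if text[i - 1] in " -" else FULL
                d[i][j] = min(d[i][j], d[i - 1][j] + cost)
            if j > 0:   # insert a character missing from the text
                d[i][j] = min(d[i][j], d[i][j - 1] + FULL)
            if i > 0 and j > 0:   # match or substitute
                a, b = text[i - 1], word[j - 1]
                cost = 0.0 if a == b else LOW if (a, b) in CONFUSABLE else FULL
                d[i][j] = min(d[i][j], d[i - 1][j - 1] + cost)
            if i > 1 and j > 0 and text[i - 2:i] == "rn" and word[j - 1] == "m":
                d[i][j] = min(d[i][j], d[i - 2][j - 1] + LOW)  # "rn" read for "m"
    return d[n][m]

print(correction_cost("lH-ben zimidazole", "1H-benzimidazole"))   # 0.5
print(correction_cost("Prealburnin", "Prealbumin"))               # 0.25
print(correction_cost("triphenylposhine", "triphenylphosphine"))  # 2.0
```

With unit costs the first example would be two edits away from its correction; discounting the “l”/“1” confusion and the stray space lets it rank ahead of genuinely different names at the same raw distance.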
  15. Handle with care
      • Alas, introducing automatic spelling correction (fuzzy matching) to entity recognition often requires the introduction of white word lists to avoid problems.
        – herein → heroin
        – aspiring → aspirin
        – cranium → uranium
        – ability → abilify
      • More aggressive correction leads to more problems:
        – “be that the line” → methantheline
        – “park on a zone” → parconazole
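A white list can be applied as a simple guard in front of the fuzzy matcher. In this sketch Python's `difflib.get_close_matches` stands in for the real matcher, and the names `DICTIONARY`, `WHITELIST`, `correct` and the 0.8 cutoff are illustrative assumptions rather than anything taken from CaffeineFix.

```python
import difflib

DICTIONARY = ["heroin", "aspirin", "uranium", "abilify"]
WHITELIST = {"herein", "aspiring", "cranium", "ability"}  # ordinary English words

def correct(token, dictionary=DICTIONARY, whitelist=WHITELIST):
    """Return the closest dictionary entry, unless the token is white-listed."""
    if token in dictionary or token in whitelist:
        return token                     # exact hit, or a word to leave alone
    matches = difflib.get_close_matches(token, dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else token

print(correct("herein", whitelist=set()))  # heroin (the unwanted correction)
print(correct("herein"))                   # herein (protected by the white list)
print(correct("asprin"))                   # aspirin
```

The white list only protects words someone has thought to add, which is why the slide's multi-word cases, produced by more aggressive matching, are harder to guard against.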
  16. Benchmarking and analysis
      • To quantify the benefits of automatic spelling correction to “real world” chemical text mining we analysed the first 28 weeks of US patent grants from 2012.
      • This corresponds to 145,473 documents, issued between 3rd January 2012 and 10th July 2012.
      • A total of 4,061,670 IUPAC-like systematic names were identified, with 1,816,317 unique patent/name pairs.
      • OPSIN interprets 3,722,399 of these names and 1,647,402 of the unique pairs.
  17. Total molecule entities
      [Figure: two pie charts, “All Extracted Molecule Entities” and “OPSIN Interpreted Molecule Entities”]

        Category   All extracted   Share   OPSIN interpreted   Share
        Correct    3411665         84%     3200193             86%
        Simple     552533          14%     434487              12%
        D=1        97472           2%      87719               2%

      • Using correction retrieves ~19% more entities and ~16% more OPSIN recognizable names.
  18. Unique molecule entries
      [Figure: two pie charts, “All Extracted Molecule Entities” and “OPSIN Interpreted Molecule Entities”]

        Category   All extracted   Share   OPSIN interpreted   Share
        Correct    1430931         79%     1343755             82%
        Simple     310359          17%     236330              14%
        D=1        75027           4%      67317               4%

      • The effect is more pronounced with patent/compound pairs, with a +27% improvement over no correction.
  19. Influence on N2S software (OPSIN)

        Interpretation           Count    Fraction
        Not Interpretable        116216   17.88%
        Before not After         11779    1.81%
        After not Before         270453   41.61%
        Same Before/After        181693   27.95%
        Different Before/After   69864    10.75%
        Total                    650005   100.00%

      • Although some valid names are lost by correction, overall the effect is overwhelmingly positive.
  20. Break-down of edit operations

        Edit Operation   Count    Fraction
        Deletion         392324   47.50%
        Insertion        232493   28.15%
        Substitution     198438   24.03%
        Transposition    2670     0.32%
        Total            825925   100.00%
  21. Drug dictionary entities
      [Figure: two pie charts, “All Drug Dictionary Entities” and “Unique Drug Dictionary Entities”]

        Category   All entities   Share   Unique entities   Share
        Correct    1013711        93%     395482            92%
        Simple     71684          7%      30687             7%
        D=1        6141           0%      4119              1%

      • Some improvement is seen with drug dictionaries, but there’s little benefit in fixing simple OCR issues.
  22. Target dictionary entries
      • Simple correction of ChEMBL protein target names
      • prostaglandin H- 2 synthase- 1 → prostaglandin H2 synthase 1
      • Alanine amino-transferase → Alanine aminotransferase
      • acetyl cholinesterase → acetylcholinesterase
      • cyclooxy-genase-2 → cyclooxygenase-2
      • MAP kinase ERK-2 → MAP kinase ERK2
      • HEC-GLCNAC-6-ST → HEC-GLCNAC6ST
      • Herne Oxygenase → Heme Oxygenase
      • Prealburnin → Prealbumin
      • p110- delta → p110delta
  23. Non-word spelling correction
      • Automatic correction technology can also be applied to entities other than words or IUPAC-like chemical nomenclature.
      • For example, Chemical Abstracts Service’s registry numbers.
  24. CAS registry number grammar
      [Figure: finite state machine for the CAS registry number grammar]
      • Two to seven digits, followed by a hyphen, two digits, a hyphen and a final check digit
        – e.g. 7732-18-5
      • Regular expression: (([1-9]\d{2,5})|([5-9]\d))-\d\d-\d
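The regular expression on this slide can be tried directly with Python's `re` module, assuming each `d` in the transcript stands for the digit class `\d`. The names `CAS_SHAPE` and `looks_like_cas` are hypothetical. As written, the expression admits two to six leading digits; the sketch uses it unchanged rather than guessing at a longer bound.

```python
import re

# The slide's expression, with the digit class written as \d.
CAS_SHAPE = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")

def looks_like_cas(text: str) -> bool:
    """True if the whole string has the shape of a CAS registry number."""
    return CAS_SHAPE.fullmatch(text) is not None

print(looks_like_cas("7732-18-5"))   # True
print(looks_like_cas("50-00-0"))     # True: two leading digits, first is 5-9
print(looks_like_cas("12-34-5"))     # False: two leading digits must start 5-9
print(looks_like_cas("7732-185"))    # False: second hyphen missing
```

This checks shape only; a string such as 7732-18-8 passes even though its check digit is wrong.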
  25. CAS check digit calculation
      • More generally, CaffeineFix’s finite state machines can do limited processing...
      • The final check digit of a CAS number is calculated by series term summation modulo 10.
      • Working leftwards from the digit before the check digit: the last digit times 1, the previous digit times 2, the one before that times 3, and so on, then taking the sum modulo 10.
      • The CAS number for water is 7732-18-5.
      • The checksum 5 is calculated as (1x8 + 2x1 + 3x2 + 4x3 + 5x7 + 6x7) mod 10 = 5.
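The check digit rule described above is short enough to write out in full. This is a plain-Python sketch of the arithmetic, not the finite-state formulation used on the slides; the function names are hypothetical.

```python
def cas_check_digit(body: str) -> int:
    """Check digit for the part of a CAS number before the final hyphen.

    The rightmost digit is weighted 1, the next 2, and so on; the check
    digit is the weighted sum modulo 10. Hyphens are ignored.
    """
    digits = [int(c) for c in body if c.isdigit()]
    return sum(w * d for w, d in enumerate(reversed(digits), start=1)) % 10

def has_valid_check_digit(cas: str) -> bool:
    """Compare the last character with the digit computed from the rest."""
    body, _, check = cas.rpartition("-")
    return check.isdigit() and len(check) == 1 and cas_check_digit(body) == int(check)

# Water: 1x8 + 2x1 + 3x2 + 4x3 + 5x7 + 6x7 = 105, and 105 mod 10 = 5.
print(cas_check_digit("7732-18"))          # 5
print(has_valid_check_digit("7732-18-5"))  # True
print(has_valid_check_digit("7732-18-8"))  # False
```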
  26. FSM for matching CAS check digits
      [Figure: finite state machine for matching CAS check digits]
  27. CAS number correction example
      • 7732-18-8? Did you mean...
        – 7732-18-5
        – 7732-11-8
        – 77328-18-8
        – 7733-18-8
        – 77342-18-8
        – 77392-18-8
        – 71732-18-8
        – 76732-18-8
        – 97732-18-8
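The suggestion list on this slide can be reproduced by brute force: generate every string one digit edit away from the input and keep those that have the right shape and a valid check digit. This is a generate-and-test sketch under assumptions, not the FSM-based search the slides describe; the helper names are hypothetical and the shape test reuses the earlier slide's regular expression with `d` read as `\d`.

```python
import re

CAS_SHAPE = re.compile(r"(([1-9]\d{2,5})|([5-9]\d))-\d\d-\d")
DIGITS = "0123456789"

def is_valid_cas(cas: str) -> bool:
    """Right shape, and the last digit matches the weighted sum modulo 10."""
    if not CAS_SHAPE.fullmatch(cas):
        return False
    digits = [int(c) for c in cas[:-2] if c.isdigit()]
    total = sum(w * d for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(cas[-1])

def one_edit_away(text: str) -> set:
    """All strings reachable by one digit insertion, substitution or deletion."""
    out = set()
    for i in range(len(text) + 1):
        for d in DIGITS:
            out.add(text[:i] + d + text[i:])           # insertion
            if i < len(text):
                out.add(text[:i] + d + text[i + 1:])   # substitution
        if i < len(text):
            out.add(text[:i] + text[i + 1:])           # deletion
    out.discard(text)
    return out

def suggest(cas: str) -> list:
    return sorted(c for c in one_edit_away(cas) if is_valid_cas(c))

for candidate in suggest("7732-18-8"):
    print(candidate)
```

For 7732-18-8 this yields the same nine candidates as the slide: three substitutions and six insertions, with no deletion or edit to the hyphens surviving the shape and checksum tests.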
  28. Take home message
      • Adding advanced automatic chemical spelling correction to an annotation pipeline typically improves recall by about 20-40%.
        – Andrew Hinton, “Benchmarking ChemAxon’s Name-to-Structure batch tool on Patent Data”, 2011 ChemAxon EUGM, Budapest.
        – Sorel Muresan, “Automated Spelling Correction to Improve Recall Rates of Name-to-Structure Tools for Chemical Text Mining”, 2011 ChemAxon EUGM.
  29. Acknowledgements
      • Daniel Lowe, NextMove Software.
      • Sorel Muresan and Paul Hongxing Xie, AstraZeneca.
      • Thank you for your time.
      • Any questions?