NLP: Levenshtein Edit Distance & Skip Trie

Published in: Science, Technology, Business

  1. 1. Natural Language Processing Levenshtein Edit Distance (LED) & Skip Trie Matching (STM) Vladimir Kulyukin www.vkedco.blogspot.com
  2. 2. Outline ● Levenshtein Edit Distance (LED) – Definition – Recursive Computation – Dynamic Programming Computation ● Skip Trie – Background – Trie & Skip Trie – Skip Trie Matching
  3. 3. Levenshtein Edit Distance
  4. 4. Minimum Edit Distance ● Suppose we have two strings: source and target ● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target ● Each operation has a cost ● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target
  5. 5. Strings as Prefix Sequences Any string can be viewed as a sequence of prefixes 1. s = '', then the prefix sequence is '' 2. s = 'a', then the prefix sequence is <'', 'a'> 3. s = 'ab', then the prefix sequence is <'', 'a', 'ab'> In general, if s = c1 c2 ...cn , then the prefix sequence is <'', 'c1 ', 'c1 c2 ', ..., s>
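The prefix view above can be sketched in a few lines of Python. The function name `prefix_sequence` is made up for this illustration and does not come from the slides.

```python
def prefix_sequence(s):
    """Return every prefix of s, from the empty string up to s itself."""
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence(""))    # ['']
print(prefix_sequence("ab"))  # ['', 'a', 'ab']
```

A string of length n always has n + 1 prefixes, which is why the cost tables on the following slides need one extra row and one extra column for the empty prefix.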
  6. 6. Definition ● Levenshtein edit distance (LED) is a metric, one of the best known, that measures similarity between two character sequences ● The metric is named after Vladimir Levenshtein, who introduced it in 1965 ● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) to transform source to target
  7. 7. Edit Operations (AKA Edits) ● The standard edit operations, aka edits, are insertion, deletion, & substitution ● Assume pt and ps are legal positions in target and source, respectively ● Insertion – a character at position pt in target is inserted into source at position ps ● Deletion – a character is deleted from source at position ps ● Substitution – a character at position pt in target is substituted for a character at position ps in source
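As a rough illustration of the three edits, the helpers below apply each one to a Python string. The function names are hypothetical. The positions are 0-based, which is an assumption of this sketch (the worked examples on the slides count positions from 1).

```python
def insert_char(source, ps, ch):
    """Insert ch into source so that it ends up at index ps."""
    return source[:ps] + ch + source[ps:]


def delete_char(source, ps):
    """Delete the character at index ps of source."""
    return source[:ps] + source[ps + 1:]


def substitute_char(source, ps, ch):
    """Replace the character at index ps of source with ch."""
    return source[:ps] + ch + source[ps + 1:]


print(delete_char("abc", 1))                 # ac
print(substitute_char("polassium", 2, "t"))  # potassium
```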
  8. 8. Edit Costs ● The standard edit operations have associated costs ● The costs are application dependent, and are typically positive integers ● For example, the costs of insertion, deletion, and substitution can all be set to 1 ● In some contexts, the cost of substitution is set to 2 (substitution can be viewed as a deletion followed by an insertion)
  9. 9. String Transformation Cost CT(s1, s2) = numerical cost of transforming source string s1 to target string s2
  10. 10. Tabulating Transformation Costs [Table: the columns are the TARGET prefixes '', c1, c2, c3, c4, c5, …, cn; the rows are the SOURCE prefixes '', c1, c2, …, cm]
  11. 11. CT('', '')
  12. 12. CT('','c1')
  13. 13. CT('','c1c2')
  14. 14. CT('', 'c1c2c3')
  15. 15. CT('c1', '')
  16. 16. CT('c1c2', '')
  17. 17. CT('c1c2c3', '')
  18. 18. CT('c1...cm', 'c1...cn')
  19. 19. Transforming Empty Source to Target Row '' of the table holds 0, i1, i2, i3, i4, i5, …, in. The only way to transform an empty source to some target is to insert 0 or more characters into it (ik is the cost of inserting k characters)
  20. 20. Transforming Source to Empty Target Column '' of the table holds 0, d1, d2, d3, …, dm. The only way to transform some source to an empty target is to delete 0 or more corresponding characters from it (dk is the cost of deleting k characters)
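The two boundary cases above can be sketched as a table initializer. This is only an illustration: the name `init_cost_table` is made up, and it assumes that every insertion costs `ins_cost` and every deletion costs `del_cost`, so that ik = k * ins_cost and dk = k * del_cost.

```python
def init_cost_table(source, target, ins_cost=1, del_cost=1):
    """Build an (m+1) x (n+1) table with only row 0 and column 0 filled in."""
    m, n = len(source), len(target)
    ct = [[None] * (n + 1) for _ in range(m + 1)]
    for c in range(n + 1):
        ct[0][c] = c * ins_cost  # empty source -> first c target characters
    for r in range(m + 1):
        ct[r][0] = r * del_cost  # first r source characters -> empty target
    return ct


for row in init_cost_table("ab", "abc"):
    print(row)
```

The remaining `None` cells are the ones that the recursive and dynamic programming computations have to fill in.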
  21. 21. Examples
  22. 22. Example 01 Let insertion cost = deletion cost = substitution cost = 1. Let source = '' and target = 'ab'. How can we transform source to target?
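  23. 23. Example 01 Table row '' with columns '', a, b (no costs filled in yet)
  24. 24. Example 01 Table row '' with columns '', a, b: 0
  25. 25. Example 01 Table row '' with columns '', a, b: 0, 1
  26. 26. Example 01 Table row '' with columns '', a, b: 0, 1, 2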
  27. 27. Example 01 - insert 'a' at position 1 in source at cost 1; - insert 'b' at position 2 in source at cost 1; So, LED('', 'ab') = 2.
  28. 28. Example 02 Let insert cost = delete cost = substitute cost = 1. Let source = 'ab' and target = ''. How can we transform source to target?
  29. 29. Example 02 - Delete 'a' at position 1 in source at cost 1; - Delete 'b' at position 2 in source at cost 1; So, LED('ab', '') = 2.
  30. 30. Example 03 Let insert cost = delete cost = substitute cost = 1. Let source = 'abc' and target = 'ac'. - match 'a' at position 1 in source with 'a' at position 1 in target at cost 0; - delete 'b' at position 2 in source at cost 1; - match 'c' at position 3 in source with 'c' at position 2 in target at cost 0. So, LED('abc', 'ac') = 1.
  31. 31. Recursive LED Algorithm
  32. 32. Specification LevEdDist(source, target, ins_cost, del_cost, sub_cost) - source – source string - target – target string - ins_cost – cost of insertion - del_cost – cost of deletion - sub_cost – cost of substitution LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns a sequence of edits that converts source to target, together with the Levenshtein distance, i.e., the total cost of those edits
  33. 33. Pseudo Code
      LED(source_str, target_str, edit_ops, edit_cost, ins_cost=1, del_cost=1, sub_cost=1):
          # 1. compute lengths of source and target strings
          target_len, source_len = len(target_str), len(source_str)
          # 2. edit_ops is a list of edit operations; work on a copy so that
          #    the caller's list is not destructively modified
          edit_ops_copy = copy(edit_ops)
          if source_len == 0:
              # 3. if source is empty, insert all target characters into it
              for c in target_str:
                  edit_ops_copy.append(new InsOper(c, ins_cost))
              return edit_cost + target_len * ins_cost, edit_ops_copy
          if target_len == 0:
              # 4. if target is empty, delete all characters from source
              for c in source_str:
                  edit_ops_copy.append(new DelOper(c, del_cost))
              return edit_cost + source_len * del_cost, edit_ops_copy
  34. 34. Recursion ● If the character at position source_len-1 in source is the same as the character at position target_len-1 in target, set the current cost to 0 (this is a character match, which can be viewed as substituting the character in the source with the same character from the target) ● Match is a zero-cost substitution ● If these characters are not the same, compute the costs of deletion, insertion and substitution, and choose the minimum cost
  35. 35. Pseudo Code: Three Recursive Calls
      # choose deletion and recurse
      dc_cost, dc_edit_ops = LED(source_str[0:source_len-1], target_str, edit_ops, edit_cost,
                                 ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
      # choose insertion and recurse
      ic_cost, ic_edit_ops = LED(source_str, target_str[0:target_len-1], edit_ops, edit_cost,
                                 ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
      # choose substitution and recurse
      sc_cost, sc_edit_ops = LED(source_str[0:source_len-1], target_str[0:target_len-1], edit_ops, edit_cost,
                                 ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)
  36. 36. Pseudo Code: Choosing Minimal Edit Sequence
      # a match is a zero-cost substitution
      is_match = target_str[target_len-1] == source_str[source_len-1]
      # add the cost of the last edit to the cost of each recursive call
      dc_total = dc_cost + del_cost
      ic_total = ic_cost + ins_cost
      sc_total = sc_cost + (0 if is_match else sub_cost)
      min_cost = min(dc_total, ic_total, sc_total)
      if min_cost == dc_total:
          edit_ops_copy = copy(dc_edit_ops)
          # add a new delete operator
          edit_ops_copy.append(new DelOper(source_str[source_len-1], del_cost))
      elif min_cost == ic_total:
          edit_ops_copy = copy(ic_edit_ops)
          # add a new insertion operator
          edit_ops_copy.append(new InsOper(target_str[target_len-1], ins_cost))
      else:
          edit_ops_copy = copy(sc_edit_ops)
          if is_match:
              # if the characters are the same, then there is a match
              edit_ops_copy.append(new MatchOper(target_str[target_len-1], source_str[source_len-1], 0))
          else:
              edit_ops_copy.append(new SubOper(target_str[target_len-1], source_str[source_len-1], sub_cost))
      return min_cost, edit_ops_copy
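For readers who want to run the recursion, here is a compact Python sketch of the same idea. It is not the slides' code: the `edit_ops` and `edit_cost` accumulator arguments are dropped, edits are plain tuples rather than operator objects, and the function name `led` is chosen for this illustration. Like the slides' version, it takes exponential time and is meant for short strings only.

```python
def led(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) for transforming source into target."""
    if not source:  # empty source: insert every target character
        return len(target) * ins_cost, [("ins", ch) for ch in target]
    if not target:  # empty target: delete every source character
        return len(source) * del_cost, [("del", ch) for ch in source]

    s_last, t_last = source[-1], target[-1]
    # Three recursive calls: deletion, insertion, substitution/match.
    dc_cost, dc_ops = led(source[:-1], target, ins_cost, del_cost, sub_cost)
    ic_cost, ic_ops = led(source, target[:-1], ins_cost, del_cost, sub_cost)
    sc_cost, sc_ops = led(source[:-1], target[:-1], ins_cost, del_cost, sub_cost)

    if s_last == t_last:  # a match is a zero-cost substitution
        last_cost, last_op = 0, ("match", s_last)
    else:
        last_cost, last_op = sub_cost, ("sub", s_last, t_last)

    candidates = [
        (dc_cost + del_cost, dc_ops + [("del", s_last)]),
        (ic_cost + ins_cost, ic_ops + [("ins", t_last)]),
        (sc_cost + last_cost, sc_ops + [last_op]),
    ]
    # Pick the cheapest of the three edit sequences.
    return min(candidates, key=lambda pair: pair[0])


print(led("", "ab")[0])     # 2
print(led("ab", "")[0])     # 2
print(led("abc", "ac")[0])  # 1
```

The three printed values correspond to Examples 01 through 03 above.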
  37. 37. LED Computation with Dynamic Programming
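  38. 38. Computing CT(r, c) 1. Construct an (m+1) x (n+1) table CT, with one row per source prefix and one column per target prefix 2. Fill row 0 with insertion costs 3. Fill column 0 with deletion costs 4. Then CT[r, c] = min{ CT[r-1, c-1] + (0 if source character r matches target character c, else sub_cost), CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost } 5. CT[m, n] is the final (and minimal!) cost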
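The table-filling steps can be sketched as follows. The function name `led_table` is an assumption of this example. The diagonal term is taken to cost 0 when the two characters match and `sub_cost` otherwise, in line with the rule that a match is a zero-cost substitution.

```python
def led_table(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill and return the (m+1) x (n+1) cost table CT."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):  # row 0: insertions only
        ct[0][c] = c * ins_cost
    for r in range(1, m + 1):  # column 0: deletions only
        ct[r][0] = r * del_cost
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(
                ct[r - 1][c - 1] + diag,   # match or substitution
                ct[r - 1][c] + del_cost,   # deletion
                ct[r][c - 1] + ins_cost,   # insertion
            )
    return ct


table = led_table("abc", "ac")
for row in table:
    print(row)
print(table[-1][-1])  # 1
```

Each cell is computed once from three already-filled neighbours, so the work is proportional to the number of cells, (m+1) * (n+1), instead of growing exponentially as in the recursive version.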
  39. 39. Side Notes ● LED is a minimal distance ● LED is a correct minimal distance ● LED can be computed with only 2 rows of the table in memory ● An optimal sequence of edits can be recovered from the CT table
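The two-row remark can be illustrated with the sketch below. The function name `led_two_rows` is made up here, and the code again assumes uniform per-character costs. It returns only the distance: recovering an optimal edit sequence needs the full table, because the two-row version discards the rows it would have to trace back through.

```python
def led_two_rows(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Compute the edit distance while keeping only two table rows."""
    # prev is row r-1 of the table; it starts as row 0 (insertions only).
    prev = [c * ins_cost for c in range(len(target) + 1)]
    for r, s_ch in enumerate(source, start=1):
        curr = [r * del_cost]  # column 0 of row r (deletions only)
        for c, t_ch in enumerate(target, start=1):
            diag = 0 if s_ch == t_ch else sub_cost
            curr.append(min(
                prev[c - 1] + diag,      # match or substitution
                prev[c] + del_cost,      # deletion
                curr[c - 1] + ins_cost,  # insertion
            ))
        prev = curr
    return prev[-1]


print(led_two_rows("abc", "ac"))  # 1
```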
  40. 40. Skip Trie & Skip Trie Matching
  41. 41. Motivation ● According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970 ● Mismanaged diets are estimated to account for 30-35% of cancer and diabetes cases ● A major contributor to the increased caloric intake is the consumer's inability (and sometimes unwillingness) to read & understand nutrition labels ● Nutrition information is rarely available to blind and visually impaired individuals
  42. 42. Critical Barriers ● Manual nutrition intake recording is time-consuming and error-prone, especially on smartphones ● Automated, real-time nutrition information extraction & analysis is weak or nonexistent ● Nutrition decision support – is not context-sensitive; – does not couple consumers with dieticians; – is not integrated with PHRs or ODLs
  43. 43. Persuasive NUTrition Management System (PNUTS)
  44. 44. R&D Road to PNUTS: RoboCart (2003-05), ShopTalk (2006-08), ShopMobile I (2008-10), ShopMobile II (2010-12), PNUTS (2013-Now)
  45. 45. PNUTS Architecture [Diagram labels: Nutritionist Coach Cloud Consumer/Patient Inference Engine OCR Image Analysis]
  46. 46. Vision-Based Nutrition Information Extraction in PNUTS [Diagram labels: Line Segmentor Nutrition Label Localizer TEXT Image Table Lines OCR]
  47. 47. OCR Engine Accuracy Evaluation ● Two hundred images of nutrition label text chunks ● Three categories used to categorize accuracy: – Complete: OCRed characters are identical to image text – Partial: at least one OCRed character is missing or misrecognized – Garbled: either empty string is returned or all OCRed characters are misrecognized
  48. 48. OCR Engine Accuracy
                             Complete      Partial      Garbled
      Tesseract on Device    146 (73%)     36 (18%)     18 (9%)
      GOCR on Device         42 (21%)      23 (11.5%)   135 (67.5%)
      Tesseract on Server    158 (79%)     23 (11.5%)   19 (9.5%)
      GOCR on Server         58 (28.99%)   56 (28%)     90 (45%)
  49. 49. OCR Engine Speed in Milliseconds
                             Run 1    Run 2    Run 3    Run 4    Run 5    AVG/Sample   AVG/Image
      Tesseract on Device    128238   101438   101643   109678   103205   110439.6     552.1
      GOCR on Device         50349    47746    48964    52450    48247    49019.6      245
      Tesseract on Server    38958    38061    37850    9891     39032    38289.6      191
      GOCR on Server         21253    20842    20195    21182    20520    20763.3      103.8
  50. 50. OCR Error Types ● Error Classification (Kukich 1992) – Non-words: 'polassium' vs. 'potassium' – Real-words: 'fats' vs. 'facts' ● State of the Art Error Correction: – N-Gram – Levenshtein Edit Distance (LED) – Both algorithms are implemented in Apache Lucene
  51. 51. Big O Analysis ● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input ● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup
  52. 52. Skip Trie Matching
  53. 53. Trie Data Structure ● Tries are popular on mobile platforms for word completion due to space efficiency ● Worst-case lookup is O(n) where n is the length of the input string ● Efficient storage compared to hash table
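A minimal trie sketch makes the lookup bound concrete. The class and method names below are illustrative and are not taken from the slides or from any particular library.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # Follows at most len(word) child links, hence O(n) lookup.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie(["ACID", "ACIDS", "FAT"])
print(trie.contains("ACID"))  # True
print(trie.contains("ACI"))   # False
```

Words that share a prefix, such as 'ACID' and 'ACIDS', share the nodes of that prefix, which is where the space saving comes from.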
  54. 54. Skip Trie Matching ● Skip Trie Matching (STM) algorithm is based on the idea that the trie data structure can be used to find closest dictionary matches to misspelled words ● It is assumed that the dictionary of words is stored as a trie ● The only parameter in STM is the skip distance – a non-negative integer that defines the maximum number of misrecognized characters allowed in a misspelled word
  55. 55. STM Basic Steps ● Process the input string character by character ● At the trie's current node, find the child character that matches the input's current character ● If a match is found, recurse to that node and consume the input's character ● If no match is found, recurse on each child node after incrementing the skip distance and without consuming the input's current character ● Details and pseudocode are in Kulyukin, Vanka & Wang (2013), listed in the references
  56. 56. STM Example Suppose that the OCR engine recognizes the string 'ACID' as 'ACIR' and the trie dictionary has the word 'ACID' as a character path.
  57. 57. STM Example
  58. 58. STM Example
  59. 59. STM Example
  60. 60. STM Example
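The 'ACIR' example can be reproduced with the sketch below. It follows the spirit of the basic steps but is not the authors' pseudocode, and all names in it are hypothetical. It also makes one explicit assumption: when no child matches, taking a skip steps past the mismatched input character as well, so every suggestion has the same length as the input.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True
    return root


def skip_trie_match(node, text, max_skips, prefix=""):
    """Return dictionary words reachable from node with at most max_skips skips."""
    if not text:
        return [prefix] if node.is_word else []
    ch, rest = text[0], text[1:]
    child = node.children.get(ch)
    if child is not None:
        # Greedy step: a matching child is followed and no alternative is tried.
        return skip_trie_match(child, rest, max_skips, prefix + ch)
    if max_skips == 0:
        return []
    suggestions = []
    for key, other in node.children.items():
        # Assumption of this sketch: a skip pairs the mismatched input
        # character with the child's character and uses up one skip.
        suggestions.extend(skip_trie_match(other, rest, max_skips - 1, prefix + key))
    return suggestions


root = build_trie(["ACID", "ACHE", "FAT"])
print(skip_trie_match(root, "ACIR", 1))  # ['ACID']
print(skip_trie_match(root, "ACIR", 0))  # []
```

With a skip distance of 1, the path A, C, I is matched exactly, the final 'R' fails to match, and the single skip reaches 'D'. With a skip distance of 0 only exact dictionary words are returned.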
  61. 61. Back to Big O Analysis ● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input ● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup ● STM – O(n log|Σ|), where n is the size of the input and |Σ| is the size of the alphabet
  62. 62. LED, N-Gram, STM Accuracy & Speed The results in the table below were obtained on a sample of 600 texts OCRed with Tesseract
                                    STM    N-Gram   LED
      Run Time (in milliseconds)    20     51       51
      Recall                        15%    9%       8%
  63. 63. STM Limitations ● Since STM is greedy, it cannot find all possible suggestions (not a limitation if a vocabulary is limited but a limitation in general) ● Current implementation finds matches only of the same length as the misspelled input ● STM cannot correct real-word errors
  64. 64. Conclusions ● On the tested samples OCRed with Tesseract, STM ran faster and was more accurate than Apache Lucene's implementations of N-Gram & LED ● On the tested samples, Tesseract was more accurate than GOCR ● On the tested samples, GOCR ran faster than Tesseract
  65. 65. References
      1. Levenshtein, V. (1966). "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Soviet Physics Doklady 10: 707–10.
      2. Kukich, K. (1992). "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.
      3. Kulyukin, V., Vanka, A., Wang, H. (2013). "Skip Trie Matching: A Greedy Algorithm for Real-Time OCR Error Correction on Smartphones." International Journal of Digital Information and Wireless Communication (IJDIWC) 3(3): 56-65. ISSN: 2225-658X.
