The Switchabalizer - our journey from spell checker to homophone correcter

Presentation given at Open Data Bay Area by Oskar Singer on using Common Crawl and NLP techniques to improve grammar and spelling correction, specifically homophones.

  1. The Switchabalizer: Our journey from spell checker to homophone correcter. Oskar Singer, July 23, 2014. Outline: Introduction, The Problem, First Attempt, Second Attempt, Conclusion.
  5. How I got here: I am a rising senior in the UMass Amherst CS program, specializing in machine learning and natural language processing. Last summer, I interned at Lexalytics, an Amherst/Boston-based text analytics company, where I worked with the head of software engineering on this project. Lexalytics often uses Common Crawl, and it was a great option for a training data set.
  8. Motivation: Lexalytics provides sentiment analysis software. Sentiment analysis relies heavily on sentence parsing and part-of-speech tagging. Misspellings and misused words can do serious damage to accuracy on those two tasks.
  14. Approach: We employed an open-source spell checker called Hunspell. Hunspell gives an unranked list of correction suggestions, so we took the argmax of a home-baked scoring function that penalized string edit distance, penalized keyboard distance, and rewarded high word frequencies, which were harvested from Common Crawl data.
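
The ranking step described on this slide can be sketched as follows. This is a minimal illustration, not the scoring function from the talk: the weights, the QWERTY coordinate table, the per-position keyboard distance, and the log-frequency reward are all assumptions, and the candidate list is passed in as plain strings rather than fetched from Hunspell.

```python
import math

# Hypothetical QWERTY layout, used only to approximate physical key distance.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of key-to-key distances over aligned positions (a crude proxy)."""
    total = 0.0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(typed, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Reward frequent words; penalize edit and keyboard distance.

    The weights are illustrative assumptions, not values from the talk.
    """
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(typed, candidate)
            - w_key * keyboard_distance(typed, candidate))


def best_suggestion(typed, suggestions, freq):
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda c: score(typed, c, freq))
```

Here `freq` stands in for word counts harvested from a corpus such as Common Crawl; with a very frequent candidate like "the", the frequency reward can outweigh a larger edit distance.
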
  16. Failure: Hunspell had an error rate of 216%.
  19. What happened? How is an error rate above 100% possible? Two reasons: Hunspell missed all the mistakes, and Hunspell made false corrections.
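
The slides do not define the error rate. One reading consistent with the two reasons given is that errors remaining after correction (missed mistakes plus newly introduced false corrections) are divided by the number of mistakes originally present. Under that assumed definition, the rate exceeds 100% whenever the checker misses everything and also breaks words that were fine. The counts below are hypothetical and are not the talk's data.

```python
def error_rate(missed, false_corrections, true_mistakes):
    """Errors left after correction, relative to the original mistakes.

    This definition is an assumption made for illustration only.
    """
    return (missed + false_corrections) / true_mistakes


# Hypothetical counts: 50 real mistakes, all 50 missed, 25 good words broken.
print(f"{error_rate(50, 25, 50):.0%}")
```
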
  22. What happened? Hunspell was a poor choice for a couple of reasons: its vocabulary is not appropriate or flexible enough for the Twitter domain, and it can’t detect correctly spelled words that are used out of context.
  24. What happened? Twitter’s vocabulary of abbreviations and acronyms is constantly growing, and Hunspell’s internal dictionary is not prepared for this.
  27. What happened? Example: "ur". What was Hunspell’s correction? "Ur" (the ancient Sumerian city-state).
  30. What happened? When the issue is misuse rather than misspelling, Hunspell completely ignores the problem. Specifically, commonly misused homophones were a huge problem in our data. Examples: two/too/2/to; their/there/they’re; your/you’re.
  35. Addressing misuse: How do we capture the idea of misuse? Context. How can we capture context? Rule set? Probabilistic approach!
  36. Probability model: A Bayes network, conditioned on the preceding and succeeding words. It assumes these two words are independent given the target word. It does not use a bag-of-words approach (it considers position).
  38. Probability model: conditional probability of preceding or succeeding words.
      $$P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)},$$
      where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$ is the number of occurrences of a sequence of words.
      $$P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)},$$
      where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word.
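
The two count ratios above can be estimated directly from unigram and bigram counts. The sketch below assumes whitespace tokenization and lowercasing, neither of which is specified in the slides, and the function names are made up for illustration.

```python
from collections import Counter


def count_ngrams(sentences):
    """Collect unigram and bigram counts from an iterable of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0
```

For example, in a corpus where "to" appears three times and is followed by "the" twice, `p_suc("the", "to", ...)` evaluates to 2/3.
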
  41. Probability model: conditional probability of both words.
      $$P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$$
      $$\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$$
      The first equation holds because of our assumption of independence between the preceding and succeeding words. There is a missing term in the scoring function that I will address in the Future Work section.
  44. Switchable sets: Only certain groups of words should be compared, e.g. "too" should not be scored against "their". Comparable switchables are grouped into switchable sets, and each switchable is mapped to its switchable set.
  45. Picking the word: the final equation.
      $$S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$$
      $$v^{*} = \operatorname*{arg\,max}_{v \in V_{w_j}} S(w_i, v, w_k)$$
      where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$, $V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^{*}$ is the ideal switchable.
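
Putting the switchable-set lookup and the final equation together gives roughly the following sketch. The set contents come from the examples in the talk, but the probability floor (used to avoid taking the log of zero), the tie-breaking by sorted order, and all function names are assumptions added for illustration.

```python
import math
from collections import Counter

SWITCHABLE_SETS = [
    {"to", "too", "two"},
    {"their", "there", "they're"},
    {"your", "you're"},
]
# Map each switchable to its switchable set.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}
FLOOR = 1e-9  # assumed stand-in for unseen bigrams; not from the talk


def train(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def score(w_i, w_j, w_k, unigrams, bigrams):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    n = unigrams[w_j]
    if n == 0:
        return 2 * math.log(FLOOR)
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    return math.log(max(p_pre, FLOOR)) + math.log(max(p_suc, FLOOR))


def pick(w_i, w_j, w_k, unigrams, bigrams):
    """Return the argmax over w_j's switchable set, or w_j if it has none."""
    candidates = SET_OF.get(w_j)
    if candidates is None:
        return w_j
    return max(sorted(candidates),
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))
```

Because the candidates come from the set that `w_j` maps to, the result does not depend on which member of the set appeared in the text.
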
  48. First pass: What about common misspellings that intersect with switchables? Example: "ur". Should we put them in the switchable sets?
  52. First pass: My opinion: no! Realistically, it’s probably okay, but I opted for a more elegant solution: replace all common misspellings with something from the appropriate switchable set. The model’s results are agnostic to the switchable that activates it.
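
A first pass of this kind can be as small as a lookup table applied before scoring. In the sketch below only "ur" comes from the talk; the other table entries, the choice of which set member each maps to, and the function name are hypothetical.

```python
# Map common misspellings to any member of the appropriate switchable set.
# Which member is chosen should not matter, since the model rescored the
# whole set afterwards. Entries other than "ur" are hypothetical examples.
COMMON_MISSPELLINGS = {
    "ur": "your",
    "thier": "their",
    "youre": "you're",
}


def first_pass(tokens):
    """Replace known misspellings so the switchable model can take over."""
    return [COMMON_MISSPELLINGS.get(token.lower(), token) for token in tokens]
```

For instance, `first_pass("ur dog is cute".split())` yields `["your", "dog", "is", "cute"]`, after which the context model decides between "your" and "you're".
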
  55. Testing: Assume Wikipedia has correct usage of all switchables. Replace target words in Wikipedia articles with words from their switchable set. Run the Switchabalizer on the corrupted articles.
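
The test procedure can be sketched as a corruption step plus an error count over the switchable positions. Two details here are assumptions rather than facts from the talk: each target word is always replaced by a different member of its set, and the error rate is the fraction of switchable positions that do not match the clean text after correction.

```python
import random

SWITCHABLE_SETS = [
    {"to", "too", "two"},
    {"their", "there", "they're"},
    {"your", "you're"},
]
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}


def corrupt(tokens, rng):
    """Swap every switchable for a different member of its set."""
    corrupted = []
    for token in tokens:
        group = SET_OF.get(token)
        if group:
            corrupted.append(rng.choice(sorted(group - {token})))
        else:
            corrupted.append(token)
    return corrupted


def error_rate(clean, restored):
    """Fraction of switchable positions where the restored text is wrong."""
    targets = [i for i, token in enumerate(clean) if token in SET_OF]
    if not targets:
        return 0.0
    wrong = sum(1 for i in targets if restored[i] != clean[i])
    return wrong / len(targets)
```

A full run would corrupt the clean tokens, pass the result through the corrector, and call `error_rate(clean, corrected)`; passing a seeded `random.Random` keeps the corruption reproducible.
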
  57. Results: How did we do? 20% error.
  59. Future work: ideal scoring function.
      $$S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$$
      I forgot the $P(w_j)$ term in the factorization of the joint distribution, so the implemented score was the conditional distribution, which is a slightly poor fit. Remember this for reimplementation!
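
Adding the prior back is a one-term change to the score. The sketch below estimates $P(w_j)$ as the unigram count divided by the total token count, which is an assumed (if conventional) estimator; the probability floor and the function name are also assumptions.

```python
import math
from collections import Counter


def ideal_score(w_i, w_j, w_k, unigrams, bigrams, floor=1e-9):
    """log P(w_j) + log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    total = sum(unigrams.values())
    n = unigrams[w_j]
    if n == 0 or total == 0:
        return 3 * math.log(floor)
    prior = n / total
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    return (math.log(prior)
            + math.log(max(p_pre, floor))
            + math.log(max(p_suc, floor)))


# Toy counts where both candidates fit the context equally well, so only
# the prior separates them.
unigrams = Counter({"to": 8, "too": 2})
bigrams = Counter({
    ("went", "to"): 4, ("to", "the"): 4,
    ("went", "too"): 1, ("too", "the"): 1,
})
```

With these toy counts the two conditional terms are identical for "to" and "too", so the score without the prior would tie, while the prior favors the more frequent "to".
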
  63. Future work: Testing conditions were not ideal because the test data is not the target data and the mistakes are contrived. Somebody make a labeled test set, then tune the algorithm to it!
  67. Future work: Here are some ideas I had for future experiments: use a discriminative model like maximum entropy; consider higher-order neighbor words; implement for other languages.
  72. Start coding! Anyone else can do this too: it is a straightforward probability model, it takes 25-50 lines of Python, and the data is freely accessible from Common Crawl. Go learn about ML and NLP! Get your hands dirty and add your own mods! Find new problems and try new solutions!
  73. Thank you, Common Crawl! Thanks so much to Lisa, Stephen, Grace and the rest of the team for providing such a fantastic resource and bringing me down to San Francisco to present!
