Presentation given at Open Data Bay Area by Oskar Singer on using Common Crawl and NLP techniques to improve grammar and spelling correction, specifically homophones.
The Switchabalizer - our journey from spell checker to homophone correcter
Introduction
The Problem
First Attempt
Second Attempt
Conclusion
The Switchabalizer
Our journey from spell checker to homophone correcter
Oskar Singer
July 23, 2014
Oskar Singer The Switchabalizer
How I got here
I am a rising senior in the UMass Amherst CS program specializing
in machine learning and natural language processing.
Last summer, I interned at an Amherst/Boston-based text
analytics company called Lexalytics
I worked with Lexalytics’ head of software engineering on this
project
Lexalytics often uses CommonCrawl, and it was a great option for
a training data set
Motivation
Lexalytics provides sentiment analysis software
Sentiment analysis relies heavily on sentence parsing and
part-of-speech tagging
Misspellings and misusage can do serious damage to accuracy for
those two tasks
Approach
We employed an open-source spell-checker called Hunspell
Hunspell gives an unranked list of correction suggestions
So we took the argmax of a home-baked scoring function that:
penalized string edit distance
penalized keyboard distance
rewarded high word frequencies, which were harvested from
CommonCrawl data
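The slides do not give the actual scoring function, so the following is only a minimal sketch of what such a ranking could look like. The weights, the QWERTY coordinate table, the crude position-by-position keyboard distance, and all function names are assumptions made for illustration; `suggestions` stands in for whatever list the spell checker returns, and `freq` for word counts harvested from a corpus.

```python
import math

# Assumed QWERTY layout, used only to approximate physical key distance.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def keyboard_distance(a, b):
    """Sum of key-to-key distances over aligned positions that differ (a crude proxy)."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total

def score(word, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Reward frequent candidates; penalize edit and keyboard distance (assumed weights)."""
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(word, candidate)
            - w_key * keyboard_distance(word, candidate))

def best_suggestion(word, suggestions, freq):
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda c: score(word, c, freq))
```

With toy counts such as `{"the": 1000, "thy": 5, "tie": 50}`, the input `"thr"` is ranked toward `"the"` because it is both frequent and one adjacent key away.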
What Happened?
How is this possible? Two reasons:
Hunspell missed all the mistakes
Hunspell made false corrections
Hunspell was a poor choice for a couple of reasons:
Hunspell’s vocabulary is not appropriate or flexible enough for
the Twitter domain
Hunspell can’t detect correctly spelled words that are used out of
context
Twitter’s vocabulary of abbreviations and acronyms is constantly
growing
Hunspell’s internal dictionary is not prepared for this
Example: ur
What was Hunspell’s correction?
Ur (the ancient Sumerian city-state)
When the issue is misuse rather than misspelling, Hunspell
completely ignores the problem
Specifically, commonly misused homophones were a huge problem
in our data
Examples: two/too/2/to; their/there/they’re; your/you’re
Addressing Misusage
How do we capture the idea of misuse?
Context
How can we capture context?
Rule set?
Probabilistic approach!
Probability Model
Bayes network
Conditioned on the preceding and succeeding words
Assumes these two words are independent
Does not use bag-of-words approach (considers position)
Conditional Probability of Preceding or Succeeding Words
$$P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)},$$
where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\cdot)$
is the number of occurrences of a sequence of words
$$P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)},$$
where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word
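As a rough illustration of the two ratios above, the counts can be gathered in one pass over tokenized text. This is a sketch only: whitespace tokenization, lowercasing, and the function names `count_ngrams`, `p_pre`, and `p_suc` are assumptions, not details from the talk.

```python
from collections import Counter

def count_ngrams(sentences):
    """Count unigrams and adjacent word pairs in an iterable of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumed tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j); 0.0 if w_j was never seen."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0

def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j); 0.0 if w_j was never seen."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0
```

For example, after counting `["i want to go", "me too", "to go home"]`, the word "to" occurs twice, is followed by "go" both times, and is preceded by "want" once.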
Conditional Probability of Both Words
$$P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$$
$$\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$$
The first equation holds because of our assumption of
independence between the preceding and succeeding words (given the middle word wj)
There is a missing term in the scoring function that I will address
in the Future Work section
Switchable Sets
Only certain groups should be compared, e.g. "too" should not be
scored against "their"
Comparable switchables are grouped into switchable sets
Each switchable is mapped to its switchable set
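One simple way to represent this mapping is a dictionary from each word to the set it belongs to. The sketch below is an assumption about the data structure, not the talk's implementation; the groups come from the homophone examples given earlier, with "2" left out, and the names `SWITCHABLE_SETS`, `SWITCHABLE_SET_OF`, and `candidates` are hypothetical.

```python
# Illustrative switchable sets built from the homophone examples in the talk.
SWITCHABLE_SETS = [
    {"to", "too", "two"},
    {"their", "there", "they're"},
    {"your", "you're"},
]

# Map every switchable to the set it belongs to, so lookup is a single dict access.
SWITCHABLE_SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}

def candidates(word):
    """Return the words that may be compared against `word` (empty set if none)."""
    return SWITCHABLE_SET_OF.get(word.lower(), set())
```

Because every member of a group points at the same set, "too" is only ever scored against "to" and "two", never against "their".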
Picking the Word
The Final Equation
$$S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$$
$$v^* = \operatorname*{argmax}_{v \in V_{w_j}} S(w_i, v, w_k)$$
where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$,
$V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^*$ is the
ideal switchable
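The selection step could be sketched as follows. The count tables are assumed to be `collections.Counter` objects keyed by word and by word pair, and the `eps` floor for unseen pairs is an assumption added so that `log(0)` never occurs; the talk does not say how unseen pairs were handled.

```python
import math
from collections import Counter

def log_prob(count, total, eps=1e-9):
    """Log of count/total, floored at eps (assumed) to avoid log(0)."""
    if total == 0:
        return math.log(eps)
    return math.log(max(count / total, eps))

def score(w_i, v, w_k, unigrams, bigrams):
    """S(w_i, v, w_k) = log P(pre(w_i) | v) + log P(suc(w_k) | v)."""
    return (log_prob(bigrams[(w_i, v)], unigrams[v])
            + log_prob(bigrams[(v, w_k)], unigrams[v]))

def pick_switchable(w_i, w_k, switchable_set, unigrams, bigrams):
    """Return the member of the switchable set with the highest score in this context."""
    return max(switchable_set, key=lambda v: score(w_i, v, w_k, unigrams, bigrams))
```

Note that the word actually written in the text does not appear in the arguments: only its switchable set and its two neighbors matter.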
First Pass
What about common misspellings that intersect with switchables?
Example: "ur"
Should we put them in the switchable sets?
My opinion: no!
Realistically, it's probably okay. I opted for a more elegant solution
Replace all common misspellings with something from the
appropriate switchable set
The model’s results are agnostic to the switchable that activates it
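A first pass of this kind might look like the sketch below. The lookup table and its entries are hypothetical examples, not the list used in the talk; since the model ignores which switchable triggered it, any member of the appropriate set works as the replacement.

```python
# Hypothetical table of common misspellings mapped to any member of the right switchable set.
MISSPELLING_TO_SWITCHABLE = {
    "ur": "your",
    "2": "to",
    "thier": "their",
}

def first_pass(tokens):
    """Replace known misspellings so the homophone model sees a real switchable."""
    return [MISSPELLING_TO_SWITCHABLE.get(tok.lower(), tok) for tok in tokens]
```

For instance, `first_pass(["ur", "dog", "is", "2", "cute"])` yields a token list in which "ur" and "2" have become switchables, and the probability model then decides which member of each set fits the context.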
Testing
Assume Wikipedia has correct usage of all switchables
Replace target words in Wikipedia articles with words from their
switchable set
Run the Switchabalizer on the corrupted articles
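The corruption step and a simple accuracy measure could be sketched like this. Drawing the replacement uniformly from the whole switchable set (so the original word may occasionally be kept) and scoring only the corrupted positions are assumptions; the function names are hypothetical.

```python
import random

def corrupt(tokens, switchable_set_of, rng):
    """Swap each switchable for a random member of its set; return new tokens and target indices."""
    corrupted, targets = [], []
    for idx, tok in enumerate(tokens):
        group = switchable_set_of.get(tok)
        if group:
            corrupted.append(rng.choice(sorted(group)))  # sorted for reproducibility
            targets.append(idx)
        else:
            corrupted.append(tok)
    return corrupted, targets

def accuracy(original, corrected, targets):
    """Fraction of target positions where the corrected word matches the original."""
    if not targets:
        return 0.0
    return sum(original[i] == corrected[i] for i in targets) / len(targets)
```

The original article serves as the gold standard: after running the corrector on the corrupted tokens, `accuracy` compares its output with the untouched text at the positions that were eligible for corruption.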
Future Work
Ideal Scoring Function
$$S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$$
I forgot the P(wj) term in the factorization of the joint distribution,
which resulted in a slightly ill-fitting conditional distribution.
Remember this for reimplementation!
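A reimplementation with the prior included might look like the following sketch. Estimating P(wj) as the word's share of all unigram counts and flooring unseen events at `eps` are both assumptions, as is the function name.

```python
import math
from collections import Counter

def ideal_score(w_i, v, w_k, unigrams, bigrams, eps=1e-9):
    """log P(v) + log P(pre(w_i) | v) + log P(suc(w_k) | v), with an assumed eps floor."""
    total = sum(unigrams.values())
    n_v = unigrams[v]
    if total == 0 or n_v == 0:
        return 3 * math.log(eps)
    prior = n_v / total  # assumed estimate of P(v)
    p_pre = max(bigrams[(w_i, v)] / n_v, eps)
    p_suc = max(bigrams[(v, w_k)] / n_v, eps)
    return math.log(prior) + math.log(p_pre) + math.log(p_suc)
```

The prior matters when two switchables fit the context equally well under the conditional terms alone: the more frequent word then wins, which the conditional-only score cannot express.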
Testing conditions were not ideal because:
Test data is not target data
Mistakes are contrived
Somebody make a labeled test set, then tune the algorithm to it!
Here are some ideas I had for future experiments:
Use a discriminative model like maximum entropy
Consider higher-order neighbor words
Implement for other languages
Start Coding!
Anyone else can do this too!
Straightforward probability model
25-50 lines of Python
Freely accessible data from CommonCrawl!
Go learn about ML and NLP! Get your hands dirty and add your
own mods! Find new problems and try new solutions!
Thank You, CommonCrawl!
Thanks so much to Lisa, Stephen, Grace and the rest of the team
for providing such a fantastic resource and bringing me down to
San Francisco to present!