Introduction
The Problem
First Attempt
Second Attempt
Conclusion
The Switchabalizer
Our journey from spell checker to homophone correcter
Oskar Singer
July 23, 2014
Oskar Singer The Switchabalizer
How I got here
I am a rising senior in the UMass Amherst CS program specializing
in machine learning and natural language processing.
Last summer, I interned at an Amherst/Boston-based text
analytics company called Lexalytics
I worked with Lexalytics’ head of software engineering on this
project
Lexalytics often uses CommonCrawl, and it was a great option for
a training data set
Motivation
Lexalytics provides sentiment analysis software
Sentiment analysis relies heavily on sentence parsing and
part-of-speech tagging
Misspellings and misusage can do serious damage to accuracy for
those two tasks
The Approach
The Weaknesses
Approach
We employed an open-source spell-checker called Hunspell
Hunspell gives an unranked list of correction suggestions
So we took the argmax of a home-baked scoring function that:
penalized string edit distance
penalized keyboard distance
rewarded high word frequencies, which were harvested from
CommonCrawl data
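A minimal sketch of such a scoring function, assuming the suggestion list arrives as plain strings. The weights, the QWERTY coordinates, and the names `score` and `best_suggestion` are illustrative assumptions, not the values or code used in the project.

```python
import math

# Rough QWERTY layout; the half-key row offsets are an approximation (assumed).
_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
_KEY_POS = {ch: (r, c + 0.5 * r)
            for r, row in enumerate(_ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of key-to-key distances over aligned positions (letters only)."""
    total = 0.0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca in _KEY_POS and cb in _KEY_POS:
            (r1, c1), (r2, c2) = _KEY_POS[ca], _KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(word, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Higher is better. The weights are placeholders, not tuned values."""
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(word, candidate)
            - w_key * keyboard_distance(word, candidate))


def best_suggestion(word, suggestions, freq):
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda c: score(word, c, freq))


# Invented frequencies, for illustration only.
freq = {"the": 1000, "then": 200, "tie": 5}
print(best_suggestion("teh", ["tie", "then", "the"], freq))  # the
```

Taking the log of the frequency keeps very common words from swamping the two distance penalties.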
Failure
Hunspell had an error rate of 216%
What Happened?
How is this possible? Two reasons:
Hunspell missed all the mistakes
Hunspell made false corrections
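One way an error rate can exceed 100%, sketched under the assumption that errors are counted as missed mistakes plus false corrections and divided by the number of real mistakes. The slides do not give the exact metric, and the counts below are made up purely to show the arithmetic.

```python
def error_rate(missed, false_corrections, true_mistakes):
    """Errors made by the checker, relative to the real mistakes present."""
    return (missed + false_corrections) / true_mistakes


# Made-up counts: every real mistake is missed, and the checker also
# "fixes" more correct words than there were mistakes to begin with.
rate = error_rate(missed=100, false_corrections=116, true_mistakes=100)
print(f"{rate:.0%}")  # 216%
```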
What Happened?
Hunspell was a poor choice for a couple of reasons:
Hunspell’s vocabulary is not appropriate or flexible enough for
the Twitter domain
Hunspell can’t detect correctly spelled words that are used out of
context
What Happened?
Twitter’s vocabulary of abbreviations and acronyms is constantly
growing
Hunspell’s internal dictionary is not prepared for this
What Happened?
Example: ur
What was Hunspell’s correction?
Ur (the ancient Sumerian city-state)
What Happened?
When the issue is misuse rather than misspelling, Hunspell
completely ignores the problem
Specifically, commonly misused homophones were a huge problem
in our data
Examples: two/too/2/to; their/there/they’re; your/you’re
Brainstorm
The Approach
Testing and Results
Addressing Misusage
How do we capture the idea of misuse?
Context
How can we capture context?
Rule set?
Probabilistic approach!
Probability Model
Bayes network
Conditioned on the preceding and succeeding words
Assumes these two words are independent
Does not use bag-of-words approach (considers position)
Probability Model
Conditional Probability of Preceding or Succeeding Words
$$P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)},$$
where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$
is the number of occurrences of a sequence of words
$$P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)},$$
where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word
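These two ratios can be estimated from unigram and bigram counts. The sketch below assumes the corpus is already split into sentences of tokens; the helper names `count_ngrams`, `p_pre`, and `p_suc` and the three-sentence corpus are made up for illustration.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0


corpus = [
    "i want to go there".split(),
    "they want to eat too".split(),
    "i want two apples".split(),
]
unigrams, bigrams = count_ngrams(corpus)
print(p_pre("want", "to", unigrams, bigrams))  # 1.0: both "to" follow "want"
print(p_suc("go", "to", unigrams, bigrams))    # 0.5: one of two "to" precedes "go"
```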
Probability Model
Conditional Probability of Both Words
$$P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$$
$$\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$$
The first equation holds because of our assumption of
independence between the preceding and succeeding words
There is a missing term in the scoring function that I will address
in the Future Work section
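A worked instance of the two identities, using natural logarithms and made-up probabilities of 0.02 and 0.1. Summing logs gives the same ranking as multiplying probabilities, and it avoids floating-point underflow when the factors are very small.

```latex
\begin{align*}
P(\mathrm{pre}(w_i) \mid w_j) &= 0.02, \qquad P(\mathrm{suc}(w_k) \mid w_j) = 0.1 \\
P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) &= 0.02 \times 0.1 = 0.002 \\
\log 0.02 + \log 0.1 &\approx -3.912 + (-2.303) = -6.215 \approx \log 0.002
\end{align*}
```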
Switchable Sets
Only certain groups should be compared, e.g. "too" should not be
scored against "their"
Comparable switchables are grouped into switchable sets
Each switchable is mapped to its switchable set
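The mapping from a switchable to its set can be sketched as a dictionary lookup. The three sets are the examples given earlier in the talk; the names `SWITCHABLE_SETS`, `SET_OF`, and `candidates` are illustrative.

```python
SWITCHABLE_SETS = [
    {"to", "too", "two", "2"},
    {"their", "there", "they're"},
    {"your", "you're"},
]

# Map each switchable to the set that contains it.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}


def candidates(word):
    """Return the words that may replace `word`, or just `word` itself."""
    return SET_OF.get(word.lower(), {word})


print(sorted(candidates("too")))  # ['2', 'to', 'too', 'two']
print(sorted(candidates("cat")))  # ['cat']
```

Because every member points at the same set, "too" is only ever scored against "to", "two", and "2", never against "their".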
Picking the Word
The Final Equation
$$S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$$
$$v^{*} = \arg\max_{v \in V_{w_j}} S(w_i, v, w_k)$$
where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$,
$V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^{*}$ is the
ideal switchable
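A minimal sketch of the argmax step, assuming unigram and bigram counts are already available as `Counter` objects. The toy counts are invented, and two behaviors are assumptions of this sketch rather than anything stated in the talk: unseen contexts score negative infinity (no smoothing), and the original word is kept when no candidate scores.

```python
import math
from collections import Counter


def log_prob(count, total):
    # Unseen contexts get -inf so they can never win the argmax (no smoothing).
    return math.log(count / total) if count and total else float("-inf")


def score(w_i, w_j, w_k, unigrams, bigrams):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    return (log_prob(bigrams[(w_i, w_j)], unigrams[w_j])
            + log_prob(bigrams[(w_j, w_k)], unigrams[w_j]))


def pick(w_i, w_j, w_k, switchable_set, unigrams, bigrams):
    """Return v*, the best-scoring member of the set; keep w_j if none scores."""
    best = max(sorted(switchable_set),
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))
    if score(w_i, best, w_k, unigrams, bigrams) == float("-inf"):
        return w_j  # assumption: leave the word alone without evidence
    return best


# Invented toy counts, for illustration only.
unigrams = Counter({"to": 50, "too": 10, "two": 10})
bigrams = Counter({("want", "to"): 20, ("to", "go"): 15, ("want", "two"): 2})

print(pick("want", "too", "go", {"to", "too", "two"}, unigrams, bigrams))  # to
```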
First Pass
What about common misspellings that intersect with switchables?
Example: "ur"
Should we put them in the switchable sets?
First Pass
My opinion: no!
Realistically, it's probably okay, but I opted for a more elegant solution
Replace all common misspellings with something from the
appropriate switchable set
The model’s results are agnostic to the switchable that activates it
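The first pass can be sketched as a plain lookup that runs before the switchable model. Only "ur" comes from the talk; the other table entries and the name `first_pass` are hypothetical.

```python
# Hypothetical normalization table: each common misspelling maps to some
# member of the switchable set it belongs with.
MISSPELLINGS = {"ur": "your", "thier": "their", "youre": "you're"}


def first_pass(tokens, misspellings=MISSPELLINGS):
    """Replace known misspellings before the switchable model runs."""
    return [misspellings.get(t.lower(), t) for t in tokens]


print(first_pass("is that ur car".split()))  # ['is', 'that', 'your', 'car']
```

Mapping "ur" to "your" rather than "you're" is an arbitrary choice here. The model then scores every member of that set, so either starting point leads to the same pick.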
Testing
Assume Wikipedia has correct usage of all switchables
Replace target words in Wikipedia articles with words from their
switchable set
Run the Switchabalizer on corrupted articles
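The test procedure might be sketched as follows. The talk does not say how replacement words were chosen or how errors were counted, so this version assumes every switchable is swapped for a different member of its set and that error is measured only at those positions. The names `corrupt` and `error_rate` are illustrative.

```python
import random


def corrupt(tokens, set_of, rng):
    """Swap every switchable for a random different member of its set."""
    out = []
    for t in tokens:
        others = sorted(set_of.get(t, {t}) - {t})
        out.append(rng.choice(others) if others else t)
    return out


def error_rate(original, corrected, set_of):
    """Fraction of switchable positions where the corrected text is wrong."""
    targets = [i for i, t in enumerate(original) if t in set_of]
    wrong = sum(original[i] != corrected[i] for i in targets)
    return wrong / len(targets) if targets else 0.0


sets = [{"to", "too", "two"}, {"their", "there"}]
set_of = {w: s for s in sets for w in s}
original = "they went to their house".split()
corrupted = corrupt(original, set_of, random.Random(0))

# A corrector that changes nothing gets every corrupted position wrong.
print(error_rate(original, corrupted, set_of))  # 1.0
```

In a real run, the corrector's output on `corrupted` would take the place of `corrupted` in the last call.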
Results
How did we do?
20% error
Future Work
Call to Action
Future Work
Ideal Scoring Function
$$\begin{aligned} S(w_i, w_j, w_k) &= \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) \\ &= \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big) \end{aligned}$$
Forgot the $P(w_j)$ term in the factorization of the joint distribution,
which left a conditional distribution that does not quite fit the task.
Remember this for reimplementation!
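Spelled out step by step, the factorization uses the chain rule and then the independence assumption. The last line assumes the prior is estimated as a relative frequency, with $N$ (a symbol introduced here, not in the talk) standing for the total number of tokens in the corpus.

```latex
\begin{align*}
P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k))
  &= P(w_j)\, P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) \\
  &= P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j) \\
S(w_i, w_j, w_k)
  &= \log P(w_j) + \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j) \\
  &= \log \frac{\#(w_j)}{N} + \log \frac{\#(w_i\, w_j)}{\#(w_j)} + \log \frac{\#(w_j\, w_k)}{\#(w_j)}
\end{align*}
```

The extra $\log P(w_j)$ term favors switchables that are more frequent overall, which the conditional-only score ignores.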
Future Work
Testing conditions were not ideal because:
Test data is not target data
Mistakes are contrived
Somebody make a labeled test set, then tune the algorithm to it!
Future Work
Here are some ideas I had for future experiments:
Use a discriminative model like maximum entropy
Consider higher-order neighbor words
Implement for other languages
Start Coding!
Anyone else can do this too!
Straightforward probability model
25-50 lines of Python
Freely accessible data from CommonCrawl!
Go learn about ML and NLP! Get your hands dirty and add your
own mods! Find new problems and try new solutions!
Thank You, CommonCrawl!
Thanks so much to Lisa, Stephen, Grace and the rest of the team
for providing such a fantastic resource and bringing me down to
San Francisco to present!
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 

The Switchabalizer - our journey from spell checker to homophone correcter

  • 1. The Switchabalizer: Our journey from spell checker to homophone correcter. Oskar Singer, July 23, 2014. Outline: Introduction, The Problem, First Attempt, Second Attempt, Conclusion.
  • 2-5. How I got here: I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics. I worked with Lexalytics' head of software engineering on this project. Lexalytics often uses CommonCrawl, and it was a great option for a training data set.
  • 6-8. Motivation: Lexalytics provides sentiment analysis software. Sentiment analysis relies heavily on sentence parsing and part-of-speech tagging. Misspellings and misusage can do serious damage to accuracy for those two tasks.
  • 9-14. Approach: We employed an open-source spell-checker called Hunspell. Hunspell gives an unranked list of correction suggestions, so we took the argmax of a home-baked scoring function that: penalized string edit distance; penalized keyboard distance; rewarded high word frequencies, which were harvested from CommonCrawl data.
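The slides describe the ranking function only by its three ingredients, so the following is a minimal sketch rather than the actual implementation. The weights, the QWERTY coordinate table, the way keyboard distance is aggregated, and all function names are assumptions made for illustration; the suggestion list stands in for what a spell-checker would return, and no Hunspell call is made.

```python
import math

# Hypothetical QWERTY layout: letter -> (row, column) position.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of key-to-key distances over aligned positions that differ (a crude assumption)."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(typed, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Higher is better: penalize both distances, reward log word frequency."""
    return (
        -w_edit * edit_distance(typed, candidate)
        - w_key * keyboard_distance(typed, candidate)
        + w_freq * math.log(freq.get(candidate, 0) + 1)
    )


def best_suggestion(typed, suggestions, freq):
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda cand: score(typed, cand, freq))


# Made-up frequencies, for illustration only.
freq = {"the": 1000, "thee": 5, "tie": 50}
print(best_suggestion("teh", ["thee", "tie", "the"], freq))
```

With these made-up frequencies the frequent word wins, because all three candidates are the same edit distance from the typed string.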
  • 15-16. Failure: Hunspell had an error rate of 216%.
  • 17-19. What Happened? How is this possible? Two reasons: Hunspell missed all the mistakes; Hunspell made false corrections.
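The slides do not say how the error rate was computed. One reading consistent with the two reasons given is that every missed mistake and every false correction counts as an error, measured against the number of real mistakes; under that assumption the rate can exceed 100%. The counts below are hypothetical and chosen only to show the arithmetic.

```python
def error_rate(missed, false_corrections, true_mistakes):
    """Errors made by the checker, relative to the number of real mistakes (assumed definition)."""
    return (missed + false_corrections) / true_mistakes


# Hypothetical counts: all 100 real mistakes missed, plus 116 false corrections.
rate = error_rate(missed=100, false_corrections=116, true_mistakes=100)
print(f"{rate:.0%}")
```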
  • 20-22. What Happened? Hunspell was a poor choice for a couple of reasons: Hunspell's vocabulary is not appropriate or flexible enough for the Twitter domain; Hunspell can't detect correctly spelled words that are out of context.
  • 23-24. What Happened? Twitter's vocabulary of abbreviations and acronyms is constantly growing. Hunspell's internal dictionary is not prepared for this.
  • 25-27. What Happened? Example: "ur". What was Hunspell's correction? "Ur" (the ancient Sumerian city-state).
  • 28-30. What Happened? When the issue is misuse rather than misspelling, Hunspell completely ignores the problem. Specifically, commonly misused homophones were a huge problem in our data. Examples: two/too/2/to; their/there/they're; your/you're.
  • 31-35. Addressing Misusage: How do we capture the idea of misuse? Context. How can we capture context? Rule set? Probabilistic approach!
  • 36. Probability Model: A Bayes network, conditioned on the preceding and succeeding words. It assumes these two words are independent given the word between them. It does not use a bag-of-words approach (it considers position).
  • 37-38. Probability Model, Conditional Probability of Preceding or Succeeding Words: $P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)}$, where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\cdot)$ is the number of occurrences of a sequence of words. $P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)}$, where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word.
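The two count-based estimates can be sketched with unigram and bigram counters. This is an illustrative sketch: the function names and the toy sentence are assumptions, and a real run would count over a large corpus such as the CommonCrawl data mentioned in the talk.

```python
from collections import Counter


def count_ngrams(tokens):
    """Return (unigram counts, bigram counts) for a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0


tokens = "i want to go to the park too".split()
unigrams, bigrams = count_ngrams(tokens)
print(p_pre("want", "to", unigrams, bigrams))  # one of the two "to" tokens follows "want"
print(p_suc("the", "to", unigrams, bigrams))   # one of the two "to" tokens precedes "the"
```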
  • 39-41. Probability Model, Conditional Probability of Both Words: $P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$, so $\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$. The first equation holds because of our assumption of independence between the preceding and succeeding words. There is a missing term in the scoring function that I will address in the Future Work section.
  • 42-44. Switchable Sets: Only certain groups should be compared, e.g. "too" should not be scored against "their". Comparable switchables are grouped into switchable sets. Each switchable is mapped to its switchable set.
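The mapping from each switchable to its set could look like the sketch below. The three sets are taken from the homophone examples earlier in the talk (leaving out the "2" spelling); the actual sets used in the project are not given in the slides, and the names `SWITCHABLE_SETS`, `SET_OF`, and `candidates` are hypothetical.

```python
# Illustrative switchable sets, based on the examples in the slides.
SWITCHABLE_SETS = [
    {"to", "too", "two"},
    {"their", "there", "they're"},
    {"your", "you're"},
]

# Map every switchable to the set it belongs to.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}


def candidates(word):
    """Return the switchable set for `word`, or just the word itself if it has none."""
    return SET_OF.get(word, {word})


print(sorted(candidates("too")))
print(sorted(candidates("dog")))
```

Because lookups go through `SET_OF`, "too" is only ever compared with "to" and "two", never with "their".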
  • 45. Picking the Word, The Final Equation: $S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$ and $v^* = \arg\max_{v \in V_{w_j}} S(w_i, v, w_k)$, where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$, $V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^*$ is the ideal switchable.
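Putting the score and the argmax together, a minimal sketch might look like this. The add-alpha smoothing is an assumption that is not in the slides; it is there only so the logarithm stays finite for unseen bigrams. The toy corpus, the single switchable set, and the function names are likewise illustrative.

```python
import math
from collections import Counter

SWITCHABLE_SET = {"to", "too", "two"}  # illustrative set


def train(tokens):
    """Count unigrams and bigrams in a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def score(w_i, w_j, w_k, unigrams, bigrams, alpha=1.0):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j).

    Add-alpha smoothing (an assumption, not from the slides) keeps the
    logarithm finite when a bigram was never observed.
    """
    denom = unigrams[w_j] + alpha * len(unigrams)
    p_pre = (bigrams[(w_i, w_j)] + alpha) / denom
    p_suc = (bigrams[(w_j, w_k)] + alpha) / denom
    return math.log(p_pre) + math.log(p_suc)


def pick(w_i, w_j, w_k, unigrams, bigrams):
    """v* = argmax over the switchable set of S(w_i, v, w_k)."""
    if w_j not in SWITCHABLE_SET:
        return w_j  # not a switchable: leave it alone
    return max(sorted(SWITCHABLE_SET),
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))


corpus = ("we want to go home . it is too late . "
          "i have two dogs . we want to go now").split()
unigrams, bigrams = train(corpus)
print(pick("want", "too", "go", unigrams, bigrams))
```

In this toy corpus "want to go" is the only observed context of the three, so the misused "too" is switched to "to".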
  • 46-48. First Pass: What about common misspellings that intersect with switchables? Example: "ur". Should we put them in the switchable sets?
  • 49-52. First Pass: My opinion: no! Realistically, it's probably okay. I opted for a more elegant solution: replace all common misspellings with something from the appropriate switchable set. The model's results are agnostic to the switchable that activates it.
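The first pass amounts to a lookup-and-replace before scoring. The sketch below assumes a small hand-written table; its entries and the name `FIRST_PASS` are hypothetical, since the slides give only "ur" as an example.

```python
# Hypothetical table: common misspelling -> some member of the appropriate switchable set.
FIRST_PASS = {"ur": "your", "thier": "their"}


def first_pass(tokens):
    """Replace known misspellings so the model sees a real switchable."""
    return [FIRST_PASS.get(tok, tok) for tok in tokens]


print(first_pass("is that ur dog".split()))
```

Which member of the set the table maps to should not matter, because the model then rescores every member of that switchable set in context.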
  • 53-55. Testing: Assume Wikipedia has correct usage of all switchables. Replace target words in Wikipedia articles with words from their switchable set. Run the Switchabalizer on the corrupted articles.
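The corruption-and-evaluation loop could be sketched as follows. The slides do not say how replacements were chosen or how error was measured, so picking a different member of the set uniformly at random and scoring the fraction of switchable positions left wrong are both assumptions; the sets and names are illustrative.

```python
import random

SWITCHABLE_SETS = [{"to", "too", "two"}, {"their", "there", "they're"}, {"your", "you're"}]
SET_OF = {w: s for s in SWITCHABLE_SETS for w in s}


def corrupt(tokens, rng):
    """Replace each switchable with a different, randomly chosen member of its set."""
    return [rng.choice(sorted(SET_OF[t] - {t})) if t in SET_OF else t for t in tokens]


def error_rate(original, corrected):
    """Fraction of switchable positions where the corrected text differs from the original."""
    positions = [i for i, t in enumerate(original) if t in SET_OF]
    if not positions:
        return 0.0
    wrong = sum(original[i] != corrected[i] for i in positions)
    return wrong / len(positions)


original = "i want to go there".split()
corrupted = corrupt(original, random.Random(0))
print(corrupted)
print(error_rate(original, corrupted))  # every switchable was changed, so this is 1.0
```

A corrector would be run on `corrupted`, and its output passed to `error_rate` in place of `corrupted`.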
  • 56-57. Results: How did we do? 20% error.
  • 58-59. Future Work, Ideal Scoring Function: $S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$. I forgot the $P(w_j)$ term in the factorization of the joint distribution, which resulted in a slightly unfitting conditional distribution. Remember this for reimplementation!
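A reimplementation that includes the prior term might look like the sketch below. As before, the add-alpha smoothing and the function name are assumptions made so the example runs on a tiny corpus; only the three-factor form of the score comes from the slide.

```python
import math
from collections import Counter


def ideal_score(w_i, w_j, w_k, unigrams, bigrams, alpha=1.0):
    """log P(w_j) + log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j), with add-alpha smoothing."""
    total = sum(unigrams.values())
    vocab = len(unigrams)
    log_prior = math.log((unigrams[w_j] + alpha) / (total + alpha * vocab))
    denom = unigrams[w_j] + alpha * vocab
    log_pre = math.log((bigrams[(w_i, w_j)] + alpha) / denom)
    log_suc = math.log((bigrams[(w_j, w_k)] + alpha) / denom)
    return log_prior + log_pre + log_suc


tokens = "a to b a to b a too b".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(ideal_score("a", "to", "b", unigrams, bigrams))
print(ideal_score("a", "too", "b", unigrams, bigrams))
```

The prior rewards the more frequent switchable, which the purely conditional score cannot do when two candidates fit the context equally well.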
  • 60-63. Future Work: Testing conditions were not ideal because: test data is not target data; mistakes are contrived. Somebody make a labeled test set, then tune the algorithm to it!
  • 64-67. Future Work: Here are some ideas I had for future experiments: use a discriminative model like maximum entropy; consider higher order neighbor words; implement for other languages.
  • 68-72. Start Coding! Anyone else can do this too! Straightforward probability model. 25-50 lines of Python. Freely accessible data from CommonCrawl! Go learn about ML and NLP! Get your hands dirty and add your own mods! Find new problems and try new solutions!
  • 73. Thank You, CommonCrawl! Thanks so much to Lisa, Stephen, Grace and the rest of the team for providing such a fantastic resource and bringing me down to San Francisco to present!