The Switchabalizer - our journey from spell checker to homophone correcter

Introduction
The Problem
First Attempt
Second Attempt
Conclusion
The Switchabalizer
Our journey from spell checker to homophone correcter
Oskar Singer
July 23, 2014
Oskar Singer The Switchabalizer
How I got here
I am a rising senior in the UMass Amherst CS program specializing
in machine learning and natural language processing.
Last summer, I interned at an Amherst/Boston-based text
analytics company called Lexalytics
I worked with Lexalytics’ head of software engineering on this
project
Lexalytics often uses CommonCrawl, and it was a great option for
a training data set
Motivation
Lexalytics provides sentiment analysis software
Sentiment analysis relies heavily on sentence parsing and
part-of-speech tagging
Misspellings and misusage can do serious damage to accuracy for
those two tasks
The Approach
The Weaknesses
Approach
We employed an open-source spell-checker called Hunspell
Hunspell gives an unranked list of correction suggestions
So we took the argmax of a home-baked scoring function that:
penalized string edit distance
penalized keyboard distance
rewarded high word frequencies, which were harvested from
CommonCrawl data
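A minimal sketch of what such a scoring function could look like. The weights, the QWERTY coordinate table, the position-by-position keyboard comparison, and the frequency dictionary are all assumptions made for illustration; the talk does not give the actual formula. The suggestion list is passed in as a plain list, so no Hunspell call is shown.

```python
import math

# Hypothetical QWERTY layout as (row, column) coordinates; not from the talk.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of key-to-key distances over aligned characters (a crude proxy)."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(typo, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Reward frequent candidates, penalize edit and keyboard distance."""
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(typo, candidate)
            - w_key * keyboard_distance(typo, candidate))


def best_suggestion(typo, suggestions, freq):
    """Argmax of the score over the unranked suggestion list."""
    return max(suggestions, key=lambda s: score(typo, s, freq))
```

Here `freq` stands in for word counts harvested from a corpus such as CommonCrawl, and `suggestions` for the list returned by the spell checker.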
Failure
Hunspell had an error rate of 216%
What Happened?
How is this possible? Two reasons:
Hunspell missed all the mistakes
Hunspell made false corrections
Hunspell was a poor choice for a couple reasons:
Hunspell’s vocabulary is not appropriate or flexible enough for
Twitter domain
Hunspell can’t detect correctly spelled words that are out of
context
Twitter’s vocabulary of abbreviations and acronyms is constantly
growing
Hunspell’s internal dictionary is not prepared for this
Example: ur
What was Hunspell’s correction?
Ur (the ancient Sumerian city-state)
When the issue is misuse rather than misspelling, Hunspell
completely ignores the problem
Specifically, commonly misused homophones were a huge problem
in our data
Examples: two/too/2/to; their/there/they’re; your/you’re
Brainstorm
The Approach
Testing and Results
Addressing Misusage
How do we capture the idea of misuse?
Context
How can we capture context?
Rule set?
Probabilistic approach!
Probability Model
Bayes network
Conditioned on the preceding and succeeding words
Assumes these two words are independent
Does not use bag-of-words approach (considers position)
Conditional Probability of Preceding or Succeeding Words
$$P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)},$$
where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$
is the number of occurrences of a sequence of words
$$P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)},$$
where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word
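The two ratios above can be estimated directly from unigram and bigram counts. This is a sketch under assumptions: the token list is a toy stand-in for a real corpus, the function names are illustrative, and returning 0.0 for an unseen w_j is a choice made here, not something the talk specifies.

```python
from collections import Counter


def count_ngrams(tokens):
    """Return (unigram counts, bigram counts) for a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j)."""
    n = unigrams[w_j]
    return bigrams[(w_i, w_j)] / n if n else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j)."""
    n = unigrams[w_j]
    return bigrams[(w_j, w_i)] / n if n else 0.0
```

For example, in the toy sentence "i like your dog and your cat", the word "your" occurs twice and is preceded by "like" once, so the estimate of P(pre(like) | your) is 1/2.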
Conditional Probability of Both Words
$$P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$$
$$\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$$
The first equation holds because of our assumption of
independence between the preceding and succeeding words
There is a missing term in the scoring function that I will address
in the Future Work section
Switchable Sets
Only certain groups should be compared, e.g. "too" should not be
scored against "their"
Comparable switchables are grouped into switchable sets
Each switchable is mapped to its switchable set
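One way to represent this mapping is sketched below. The groups are taken from the examples listed earlier in the talk; the names `SWITCHABLE_SETS`, `SET_OF`, and `candidates` are hypothetical, as is the choice to return a word on its own when it is not a switchable.

```python
# Groups taken from the talk's examples; the data structure is an assumption.
SWITCHABLE_SETS = [
    frozenset({"two", "too", "2", "to"}),
    frozenset({"their", "there", "they're"}),
    frozenset({"your", "you're"}),
]

# Map each switchable to the set it belongs to.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}


def candidates(word):
    """Return the switchable set for a word, or just the word itself."""
    return SET_OF.get(word.lower(), frozenset({word}))
```

With this lookup, "too" is only ever compared against "two", "2", and "to", never against "their".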
Picking the Word
The Final Equation
$$S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$$
$$v^{*} = \operatorname*{argmax}_{v \in V_{w_j}} S(w_i, v, w_k)$$
where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$,
$V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^{*}$ is the
ideal switchable
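The score and argmax above can be sketched in a few lines. This is an illustration under assumptions, not the original implementation: the small constant `eps` that avoids log(0) for unseen bigrams is an addition made here, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def train(tokens):
    """Collect unigram and bigram counts from a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def score(w_i, w_j, w_k, unigrams, bigrams, eps=1e-9):
    """S(w_i, w_j, w_k) = log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    n = unigrams[w_j]
    if n == 0:
        return float("-inf")  # candidate never seen in training data
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    # eps is an assumed floor so unseen bigrams do not produce log(0)
    return math.log(p_pre + eps) + math.log(p_suc + eps)


def pick(w_i, w_k, switchable_set, unigrams, bigrams):
    """Return the v in the switchable set that maximizes S(w_i, v, w_k)."""
    return max(sorted(switchable_set),
               key=lambda v: score(w_i, v, w_k, unigrams, bigrams))
```

Note that `pick` does not take the observed word as input, only its switchable set and its two neighbors.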
First Pass
What about common misspellings that intersect with switchables?
Example: "ur"
Should we put them in the switchable sets?
My opinion: no!
Realistically, it's probably okay. I opted for a more elegant solution
Replace all common misspellings with something from the
appropriate switchable set
The model’s results are agnostic to the switchable that activates it
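A sketch of how such a first pass might look. Only the "ur" example comes from the talk; the other table entries and the names `COMMON_MISSPELLINGS` and `first_pass` are hypothetical. Because the model rescores every member of the switchable set, it should not matter which member a misspelling is mapped to.

```python
# Hypothetical table: each common misspelling maps to any member of the
# appropriate switchable set. Only "ur" is taken from the talk.
COMMON_MISSPELLINGS = {
    "ur": "your",
    "thier": "their",
    "yur": "your",
}


def first_pass(tokens):
    """Replace known misspellings so the switchable model can take over."""
    return [COMMON_MISSPELLINGS.get(tok.lower(), tok) for tok in tokens]
```

After this pass, "ur" enters the model as "your", and the model then chooses between "your" and "you're" from context.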
Testing
Assume Wikipedia has correct usage of all switchables
Replace target words in Wikipedia articles with words from their
switchable set
Run the Switchabalizer on corrupted articles
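The testing procedure above can be sketched as a corrupt-then-score loop. The details here are assumptions: the talk does not say how replacements were drawn, so this version picks uniformly from the switchable set (which may leave the original word in place), and error is measured only at the switchable positions.

```python
import random


def corrupt(tokens, set_of, rng):
    """Swap every switchable token for a random member of its set.

    set_of maps each switchable to its set (a hypothetical structure).
    Returns the corrupted tokens and the indices that were eligible.
    """
    corrupted, positions = [], []
    for idx, tok in enumerate(tokens):
        group = set_of.get(tok)
        if group:
            corrupted.append(rng.choice(sorted(group)))
            positions.append(idx)
        else:
            corrupted.append(tok)
    return corrupted, positions


def error_rate(original, corrected, positions):
    """Fraction of switchable positions where the correction is wrong."""
    if not positions:
        return 0.0
    wrong = sum(original[i] != corrected[i] for i in positions)
    return wrong / len(positions)
```

The corrector's output on the corrupted tokens would be passed as `corrected`, with the untouched article as `original`.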
Results
How did we do?
20% error
Future Work
Call to Action
Future Work
Ideal Scoring Function
$$S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$$
I forgot the $P(w_j)$ term in the factorization of the joint distribution,
which resulted in a slightly unfitting conditional distribution.
Remember this for reimplementation!
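A sketch of the joint score with the prior restored. Estimating P(w_j) as the unigram count over the total token count is an assumption, as are the `eps` floor and the function name; the talk only states that the term belongs in the factorization.

```python
import math
from collections import Counter


def ideal_score(w_i, w_j, w_k, unigrams, bigrams, eps=1e-9):
    """log P(w_j) + log P(pre(w_i)|w_j) + log P(suc(w_k)|w_j)."""
    total = sum(unigrams.values())
    n = unigrams[w_j]
    if n == 0 or total == 0:
        return float("-inf")  # candidate never seen in training data
    prior = n / total                      # assumed estimate of P(w_j)
    p_pre = bigrams[(w_i, w_j)] / n
    p_suc = bigrams[(w_j, w_k)] / n
    # eps is an assumed floor so unseen bigrams do not produce log(0)
    return math.log(prior) + math.log(p_pre + eps) + math.log(p_suc + eps)
```

The prior matters when the members of a switchable set differ a lot in frequency: without it, a rare word that always appears in one context can outscore a common word that appears in many.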
Testing conditions were not ideal because:
Test data is not target data
Mistakes are contrived
Somebody make a labeled test set, then tune the algorithm to it!
Here are some ideas I had for future experiments:
Use a discriminative model like maximum entropy
Consider higher order neighbor words
Implement for other languages
Start Coding!
Anyone else can do this too!
Straightforward probability model
25-50 lines of Python
Freely accessible data from CommonCrawl!
Go learn about ML and NLP! Get your hands dirty and add your
own mods! Find new problems and try new solutions!
Thank You, CommonCrawl!
Thanks so much to Lisa, Stephen, Grace and the rest of the team
for providing such a fantastic resource and bringing me down to
San Francisco to present!
Oskar Singer The Switchabalizer
1 of 73

Recommended

Passive 1 by
Passive 1Passive 1
Passive 1AFC_73
38 views3 slides
Measuring the impact of Google Analytics by
Measuring the impact of Google AnalyticsMeasuring the impact of Google Analytics
Measuring the impact of Google AnalyticsDomino Data Lab
4.3K views32 slides
Common Crawl: An Open Repository of Web Data by
Common Crawl: An Open Repository of Web DataCommon Crawl: An Open Repository of Web Data
Common Crawl: An Open Repository of Web Datahuguk
4.7K views19 slides
Mining a Large Web Corpus by
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web CorpusRobert Meusel
12.5K views21 slides
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012 by
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012Amazon Web Services
2.9K views29 slides
Building a Scalable Web Crawler with Hadoop by
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
34.9K views17 slides

More Related Content

Recently uploaded

information by
informationinformation
informationkhelgishekhar
9 views4 slides
DU Series - Day 4.pptx by
DU Series - Day 4.pptxDU Series - Day 4.pptx
DU Series - Day 4.pptxUiPathCommunity
106 views28 slides
UiPath Document Understanding_Day 3.pptx by
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptxUiPathCommunity
105 views25 slides
Building trust in our information ecosystem: who do we trust in an emergency by
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergencyTina Purnat
100 views18 slides
IETF 118: Starlink Protocol Performance by
IETF 118: Starlink Protocol PerformanceIETF 118: Starlink Protocol Performance
IETF 118: Starlink Protocol PerformanceAPNIC
297 views22 slides
Is Entireweb better than Google by
Is Entireweb better than GoogleIs Entireweb better than Google
Is Entireweb better than Googlesebastianthomasbejan
12 views1 slide

Recently uploaded(10)

UiPath Document Understanding_Day 3.pptx by UiPathCommunity
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptx
UiPathCommunity105 views
Building trust in our information ecosystem: who do we trust in an emergency by Tina Purnat
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergency
Tina Purnat100 views
IETF 118: Starlink Protocol Performance by APNIC
IETF 118: Starlink Protocol PerformanceIETF 118: Starlink Protocol Performance
IETF 118: Starlink Protocol Performance
APNIC297 views
Marketing and Community Building in Web3 by Federico Ast
Marketing and Community Building in Web3Marketing and Community Building in Web3
Marketing and Community Building in Web3
Federico Ast12 views
PORTFOLIO 1 (Bret Michael Pepito).pdf by brejess0410
PORTFOLIO 1 (Bret Michael Pepito).pdfPORTFOLIO 1 (Bret Michael Pepito).pdf
PORTFOLIO 1 (Bret Michael Pepito).pdf
brejess04108 views
How to think like a threat actor for Kubernetes.pptx by LibbySchulze1
How to think like a threat actor for Kubernetes.pptxHow to think like a threat actor for Kubernetes.pptx
How to think like a threat actor for Kubernetes.pptx
LibbySchulze15 views

Featured

Getting into the tech field. what next by
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
5.5K views22 slides
Google's Just Not That Into You: Understanding Core Updates & Search Intent by
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
6.2K views99 slides
How to have difficult conversations by
How to have difficult conversations How to have difficult conversations
How to have difficult conversations Rajiv Jayarajah, MAppComm, ACC
4.7K views19 slides
Introduction to Data Science by
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceChristy Abraham Joy
82.2K views51 slides
Time Management & Productivity - Best Practices by
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
169.7K views42 slides
The six step guide to practical project management by
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
36.6K views27 slides

Featured(20)

Getting into the tech field. what next by Tessa Mero
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero5.5K views
Google's Just Not That Into You: Understanding Core Updates & Search Intent by Lily Ray
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray6.2K views
Time Management & Productivity - Best Practices by Vit Horky
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky169.7K views
The six step guide to practical project management by MindGenius
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius36.6K views
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright... by RachelPearson36
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson3612.6K views
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present... by Applitools
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools55.5K views
12 Ways to Increase Your Influence at Work by GetSmarter
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter401.6K views
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G... by DevGAMM Conference
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
DevGAMM Conference3.6K views
Barbie - Brand Strategy Presentation by Erica Santiago
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
Erica Santiago25.1K views
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well by Saba Software
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Saba Software25.2K views
Introduction to C Programming Language by Simplilearn
Introduction to C Programming LanguageIntroduction to C Programming Language
Introduction to C Programming Language
Simplilearn8.4K views
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr... by Palo Alto Software
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
The Pixar Way: 37 Quotes on Developing and Maintaining a Creative Company (fr...
Palo Alto Software88.4K views
9 Tips for a Work-free Vacation by Weekdone.com
9 Tips for a Work-free Vacation9 Tips for a Work-free Vacation
9 Tips for a Work-free Vacation
Weekdone.com7.2K views
How to Map Your Future by SlideShop.com
How to Map Your FutureHow to Map Your Future
How to Map Your Future
SlideShop.com275.1K views
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -... by AccuraCast
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
Beyond Pride: Making Digital Marketing & SEO Authentically LGBTQ+ Inclusive -...
AccuraCast3.4K views

The Switchabalizer - our journey from spell checker to homophone corrrecter

  • 1. Introduction The Problem First Attempt Second Attempt Conclusion The Switchabalizer Our journey from spell checker to homophone correcter Oskar Singer July 23, 2014 Oskar Singer The Switchabalizer
  • 2. Introduction The Problem First Attempt Second Attempt Conclusion How I got here I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Oskar Singer The Switchabalizer
  • 3. Introduction The Problem First Attempt Second Attempt Conclusion How I got here I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics Oskar Singer The Switchabalizer
  • 4. Introduction The Problem First Attempt Second Attempt Conclusion How I got here I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics I worked with Lexalytics’ head of software engineering on this project Oskar Singer The Switchabalizer
  • 5. Introduction The Problem First Attempt Second Attempt Conclusion How I got here I am a rising senior in the UMass Amherst CS program specializing in machine learning and natural language processing. Last summer, I interned at an Amherst/Boston-based text analytics company called Lexalytics I worked with Lexalytics’ head of software engineering on this project Lexalytics often uses CommonCrawl, and it was a great option for a training data set Oskar Singer The Switchabalizer
  • 6. Introduction The Problem First Attempt Second Attempt Conclusion Motivation Lexalytics provides sentiment analysis software Oskar Singer The Switchabalizer
  • 7. Introduction The Problem First Attempt Second Attempt Conclusion Motivation Lexalytics provides sentiment analysis software Sentiment analysis relies heavily in sentence parsing and part-of-speech tagging Oskar Singer The Switchabalizer
  • 8. Introduction The Problem First Attempt Second Attempt Conclusion Motivation Lexalytics provides sentiment analysis software Sentiment analysis relies heavily in sentence parsing and part-of-speech tagging Misspellings and misusage can do serious damage to accuracy for those two tasks Oskar Singer The Switchabalizer
  • 14. First Attempt: Approach. We employed an open-source spell-checker called Hunspell. Hunspell gives an unranked list of correction suggestions, so we took the argmax of a home-baked scoring function that (1) penalized string edit distance, (2) penalized keyboard distance, and (3) rewarded high word frequencies, which were harvested from CommonCrawl data.
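The slides do not give the scoring function itself, so the following is only a minimal sketch of how such a ranking could look. The weights `w_edit`, `w_key`, and `w_freq`, the QWERTY grid, and the position-wise keyboard distance are all assumptions made for illustration, and the suggestion list is passed in by hand rather than obtained from Hunspell.

```python
import math

# Hypothetical QWERTY grid; real keyboards are staggered, this is a rough proxy.
KEY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(KEY_ROWS) for c, ch in enumerate(row)}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def keyboard_distance(a, b):
    """Sum of grid distances between differing characters at the same position."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (r1, c1), (r2, c2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(r1 - r2, c1 - c2)
    return total


def score(typo, candidate, freq, w_edit=1.0, w_key=0.5, w_freq=1.0):
    """Reward log word frequency; penalize edit and keyboard distance.

    The weights are illustrative assumptions, not values from the talk.
    """
    return (w_freq * math.log(freq.get(candidate, 0) + 1)
            - w_edit * edit_distance(typo, candidate)
            - w_key * keyboard_distance(typo, candidate))


def best_suggestion(typo, suggestions, freq):
    """Argmax of the score over an unranked suggestion list."""
    return max(suggestions, key=lambda s: score(typo, s, freq))
```

In this sketch `freq` would be a word-count table built from a corpus such as CommonCrawl, and `suggestions` would be whatever the spell-checker returns for the misspelled token.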
  • 16. Failure: Hunspell had an error rate of 216%.
  • 19. What Happened? How is this possible? Two reasons: Hunspell missed all the mistakes, and Hunspell made false corrections.
  • 22. What Happened? Hunspell was a poor choice for a couple of reasons: its vocabulary is not appropriate or flexible enough for the Twitter domain, and it can’t detect correctly spelled words that are used out of context.
  • 24. What Happened? Twitter’s vocabulary of abbreviations and acronyms is constantly growing, and Hunspell’s internal dictionary is not prepared for this.
  • 27. What Happened? Example: "ur". What was Hunspell’s correction? "Ur" (the ancient Sumerian city-state).
  • 30. What Happened? When the issue is misuse rather than misspelling, Hunspell completely ignores the problem. Specifically, commonly misused homophones were a huge problem in our data. Examples: two/too/2/to; their/there/they’re; your/you’re.
  • 35. Second Attempt: Addressing Misusage. How do we capture the idea of misuse? Context. How can we capture context? Rule set? Probabilistic approach!
  • 36. Probability Model: A Bayes network, conditioned on the preceding and succeeding words. It assumes these two words are independent (given the word between them). It does not use a bag-of-words approach (it considers position).
  • 38. Probability Model: Conditional Probability of Preceding or Succeeding Words. $P(\mathrm{pre}(w_i) \mid w_j) = \frac{\#(w_i\, w_j)}{\#(w_j)}$, where $\mathrm{pre}(w)$ is the event that $w$ is the preceding word and $\#(\ast)$ is the number of occurrences of a sequence of words. $P(\mathrm{suc}(w_i) \mid w_j) = \frac{\#(w_j\, w_i)}{\#(w_j)}$, where $\mathrm{suc}(w)$ is the event that $w$ is the succeeding word.
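The two ratios above can be estimated directly from unigram and bigram counts. The sketch below assumes the corpus is already tokenized into a flat list of words; the function names are illustrative and do not come from the talk.

```python
from collections import Counter


def count_ngrams(tokens):
    """Return (unigram counts, bigram counts) for a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def p_pre(w_i, w_j, unigrams, bigrams):
    """P(pre(w_i) | w_j) = #(w_i w_j) / #(w_j); 0.0 if w_j was never seen."""
    return bigrams[(w_i, w_j)] / unigrams[w_j] if unigrams[w_j] else 0.0


def p_suc(w_i, w_j, unigrams, bigrams):
    """P(suc(w_i) | w_j) = #(w_j w_i) / #(w_j); 0.0 if w_j was never seen."""
    return bigrams[(w_j, w_i)] / unigrams[w_j] if unigrams[w_j] else 0.0


tokens = "i went to the store and to the park too".split()
unigrams, bigrams = count_ngrams(tokens)
half = p_pre("went", "to", unigrams, bigrams)   # "went to" occurs once, "to" twice
```

Unseen pairs get probability zero here, so a real implementation would need some form of smoothing before taking logarithms.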
  • 41. Probability Model: Conditional Probability of Both Words. $P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = P(\mathrm{pre}(w_i) \mid w_j) \times P(\mathrm{suc}(w_k) \mid w_j)$ and $\log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) = \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j)$. The first equation holds because of our assumption of independence between the preceding and succeeding words. There is a missing term in the scoring function that I will address in the Future Work section.
  • 44. Switchable Sets: Only certain groups should be compared, e.g. "too" should not be scored against "their". Comparable switchables are grouped into switchable sets. Each switchable is mapped to its switchable set.
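One simple way to represent this mapping is a dictionary from each word to the set that contains it. The sets below reuse the homophone examples given earlier in the talk; the names `SWITCHABLE_SETS`, `SET_OF`, and `candidates` are illustrative assumptions.

```python
# Groups of words that may be swapped for one another (from the talk's examples).
SWITCHABLE_SETS = [
    {"to", "too", "two", "2"},
    {"their", "there", "they're"},
    {"your", "you're"},
]

# Map every switchable to the set it belongs to.
SET_OF = {word: group for group in SWITCHABLE_SETS for word in group}


def candidates(word):
    """Return the switchable set for a word, or just the word if it has none."""
    return SET_OF.get(word.lower(), {word})
```

With this layout, "too" is only ever compared against members of its own set, never against "their".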
  • 45. Picking the Word: The Final Equation. $S(w_i, w_j, w_k) = \log P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j)$ and $v^* = \arg\max_{v \in V_{w_j}} S(w_i, v, w_k)$, where $S(w_i, w_j, w_k)$ is the score for the sequence of words $w_i\, w_j\, w_k$, $V_{w_j}$ is the switchable set corresponding to $w_j$, and $v^*$ is the ideal switchable.
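Put together, the selection step could be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the add-alpha smoothing in `log_prob` is not mentioned in the talk and is included only to avoid taking the log of zero, and the tiny training corpus is invented.

```python
import math
from collections import Counter

SWITCHABLE_SETS = [{"to", "too", "two"}, {"their", "there", "they're"}, {"your", "you're"}]
SET_OF = {w: s for s in SWITCHABLE_SETS for w in s}


def train(tokens):
    """Count unigrams and bigrams in a tokenized corpus."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def log_prob(pair_count, word_count, vocab_size, alpha=1.0):
    """Smoothed log of #(pair) / #(word). Smoothing is an assumption of this sketch."""
    return math.log((pair_count + alpha) / (word_count + alpha * vocab_size))


def score(w_i, w_j, w_k, unigrams, bigrams):
    """S(w_i, w_j, w_k) = log P(pre(w_i) | w_j) + log P(suc(w_k) | w_j)."""
    v = len(unigrams)
    return (log_prob(bigrams[(w_i, w_j)], unigrams[w_j], v)
            + log_prob(bigrams[(w_j, w_k)], unigrams[w_j], v))


def pick(w_i, w_j, w_k, unigrams, bigrams):
    """Argmax of the score over the switchable set of the middle word."""
    options = SET_OF.get(w_j, {w_j})
    return max(options, key=lambda v: score(w_i, v, w_k, unigrams, bigrams))


corpus = "i want to go home . it is too late . i have two dogs . we want to go out".split()
unigrams, bigrams = train(corpus)
fixed = pick("want", "too", "go", unigrams, bigrams)
```

Because the argmax runs over the whole switchable set, the word actually typed only determines which set is searched, not which member wins.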
  • 48. First Pass: What about common misspellings that intersect with switchables? Example: "ur". Should we put them in the switchable sets?
  • 52. First Pass: My opinion: no! Realistically, it’s probably okay, but I opted for a more elegant solution: replace all common misspellings with something from the appropriate switchable set. The model’s results are agnostic to the switchable that activates it.
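A first pass of this kind can be as small as a lookup table applied before scoring. The table entries below are hypothetical examples (only "ur" appears in the talk), and which member of the set each misspelling maps to does not matter, since the model rescores the whole set afterwards.

```python
# Hypothetical table: common misspelling -> any member of the right switchable set.
COMMON_MISSPELLINGS = {
    "ur": "your",
    "thier": "their",
    "u're": "you're",
}


def first_pass(tokens):
    """Replace known misspellings so the switchable model gets triggered."""
    return [COMMON_MISSPELLINGS.get(tok.lower(), tok) for tok in tokens]


cleaned = first_pass(["is", "that", "ur", "car"])
```

After this step every token that should be reconsidered is already a switchable, so the sets themselves stay free of non-words.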
  • 55. Testing: Assume Wikipedia has correct usage of all switchables. Replace target words in Wikipedia articles with words from their switchable set. Run the Switchabalizer on the corrupted articles.
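The corruption procedure can be sketched as below. Two details are assumptions, since the slides do not specify them: each switchable is replaced by a uniformly random member of its set (possibly itself), and the error rate is taken to be the fraction of corrupted positions that the corrector fails to restore.

```python
import random


def corrupt(tokens, set_of, rng):
    """Replace each switchable with a random member of its set.

    Returns the corrupted tokens and the positions that were touched.
    """
    corrupted, positions = [], []
    for idx, tok in enumerate(tokens):
        group = set_of.get(tok)
        if group:
            corrupted.append(rng.choice(sorted(group)))  # sorted for reproducibility
            positions.append(idx)
        else:
            corrupted.append(tok)
    return corrupted, positions


def error_rate(original, corrected, positions):
    """Fraction of touched positions where the corrected word differs from the original."""
    if not positions:
        return 0.0
    wrong = sum(original[i] != corrected[i] for i in positions)
    return wrong / len(positions)


groups = [{"to", "too", "two"}, {"their", "there", "they're"}]
set_of = {w: g for g in groups for w in g}
clean = "they went to their house".split()
noisy, touched = corrupt(clean, set_of, random.Random(0))
```

The corrector under test would then be run on `noisy`, and its output compared with `clean` at the `touched` positions.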
  • 57. Results: How did we do? 20% error.
  • 59. Future Work: Ideal Scoring Function. $S(w_i, w_j, w_k) = \log P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k)) = \log\big(P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j)\big)$. I forgot the $P(w_j)$ term in the factorization of the joint distribution, which resulted in a slightly unfitting conditional distribution. Remember this for reimplementation!
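Written out step by step, the factorization adds a single prior term to the score. The count-based estimate $P(w_j) \approx \#(w_j)/N$, with $N$ the total number of tokens in the training corpus, is an assumption of this sketch and is not stated in the talk.

```latex
\begin{align*}
P(w_j, \mathrm{pre}(w_i), \mathrm{suc}(w_k))
  &= P(w_j)\, P(\mathrm{pre}(w_i), \mathrm{suc}(w_k) \mid w_j) \\
  &= P(w_j)\, P(\mathrm{pre}(w_i) \mid w_j)\, P(\mathrm{suc}(w_k) \mid w_j) \\
S_{\text{ideal}}(w_i, w_j, w_k)
  &= \log P(w_j) + \log P(\mathrm{pre}(w_i) \mid w_j) + \log P(\mathrm{suc}(w_k) \mid w_j) \\
  &\approx \log \frac{\#(w_j)}{N}
     + \log \frac{\#(w_i\, w_j)}{\#(w_j)}
     + \log \frac{\#(w_j\, w_k)}{\#(w_j)}
\end{align*}
```

The second line uses the independence assumption. The extra $\log P(w_j)$ term favors switchables that are more frequent overall, which the conditional-only score ignores.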
  • 63. Future Work: Testing conditions were not ideal because the test data is not the target data and the mistakes are contrived. Somebody make a labeled test set, then tune the algorithm to it!
  • 67. Future Work: Here are some ideas I had for future experiments: use a discriminative model like maximum entropy; consider higher-order neighbor words; implement for other languages.
  • 72. Start Coding! Anyone else can do this too! Straightforward probability model. 25-50 lines of Python. Freely accessible data from CommonCrawl! Go learn about ML and NLP! Get your hands dirty and add your own mods! Find new problems and try new solutions!
  • 73. Thank You, CommonCrawl! Thanks so much to Lisa, Stephen, Grace and the rest of the team for providing such a fantastic resource and bringing me down to San Francisco to present!