Spell checking using n-gram language models
Raphael Bouskila
Motivation
 Direct application
 Input correction
 Indirect application
 ASR post-processing improvement
 ASR performance metric
Motivation
 L. Zhuang, F. Zhou, J. D. Tygar. Keyboard Acoustic Emanations
Revisited. Proceedings of the 12th ACM Conference on
Computer and Communications Security, November 2005.
Theory
 Shannon’s noisy channel model
 C. Shannon. A mathematical theory of communication.
Bell System Technical Journal 27 (3), pp. 379-423, 1948.
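Applied to spelling correction, the noisy channel model is usually written as a Bayes decision rule. The notation here is an assumption for illustration and does not come from the slides: x is the observed (possibly misspelled) string, w is a candidate intended word, and V is the vocabulary.

```latex
\hat{w} \;=\; \arg\max_{w \in V} P(w \mid x) \;=\; \arg\max_{w \in V} P(x \mid w)\, P(w)
```

P(x | w) is the channel (error) model and P(w) is the language model; in this talk the language model role is played by the n-gram model.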
Theory
 Classical Damerau errors (1964)
 Substitution
 [ALPHABET] → [ALPHSBET]
 Deletion
 [ALPHABET] → [ALPHBET]
 Insertion
 [ALPHABET] → [ALPHAABET]
 Transposition
 [ALPHABET] → [ALPHBAET]
 F. J. Damerau. A technique for computer detection and
correction of spelling errors. Communications of the ACM 7 (3),
pp. 171-176, 1964.
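The four error types can be made concrete by enumerating every string that is exactly one error away from a word. This is a minimal sketch, assuming a lowercase a-z alphabet; the function name `single_edits` is hypothetical and not part of the talk.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """Return the set of strings exactly one Damerau error away from `word`."""
    # Every way to cut the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

    # Substitution: replace the first character of the right part.
    substitutions = {
        left + c + right[1:]
        for left, right in splits if right
        for c in alphabet if c != right[0]
    }
    # Deletion: drop the first character of the right part.
    deletions = {left + right[1:] for left, right in splits if right}
    # Insertion: add one character at the cut.
    insertions = {left + c + right for left, right in splits for c in alphabet}
    # Transposition: swap the first two characters of the right part.
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1 and right[0] != right[1]
    }
    return substitutions | deletions | insertions | transpositions


if __name__ == "__main__":
    edits = single_edits("alphabet")
    for typo in ("alphsbet", "alphbet", "alphaabet", "alphbaet"):
        print(typo, typo in edits)
```

Intersecting this set with a dictionary gives the candidate corrections for a single-error model.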
Theory
 Levenshtein distance (1966)
 Lecture 6 (DTW word alignment)
 Assign a cost to each Damerau error
 Not all models consider transposition (plain Levenshtein distance
covers only substitution, deletion and insertion)
 V. Levenshtein. Binary codes capable of correcting
deletions, insertions and reversals. Soviet Physics
Doklady 10, pp. 707-710, 1966.
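A dynamic-programming sketch of this cost assignment follows. It assumes unit costs by default and uses the restricted "optimal string alignment" treatment of transposition; passing `swap=None` gives plain Levenshtein distance. The function and parameter names are hypothetical.

```python
def edit_distance(source, target, sub=1, dele=1, ins=1, swap=1):
    """Weighted edit distance from `source` to `target`.

    `swap` is the cost of transposing two adjacent characters;
    use swap=None to disable transpositions (plain Levenshtein).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * dele          # delete every source character
    for j in range(1, cols):
        dist[0][j] = j * ins           # insert every target character

    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            best = min(
                dist[i - 1][j - 1] + (0 if same else sub),  # match or substitute
                dist[i - 1][j] + dele,                      # delete from source
                dist[i][j - 1] + ins,                       # insert into source
            )
            if (swap is not None and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + swap)  # adjacent transposition
            dist[i][j] = best
    return dist[-1][-1]


if __name__ == "__main__":
    print(edit_distance("alphabet", "alphbaet"))             # transposition allowed
    print(edit_distance("alphabet", "alphbaet", swap=None))  # transposition not allowed
```

With transpositions enabled, each of the four Damerau examples above is at distance 1 from ALPHABET; without them, the transposed form costs two substitutions.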
Implementation
 Test data creation: typofy.pl
 Single-error model
 Word spacing not affected
 Key locality not considered
Implementation
 Test data creation: typofy.pl
raph@nexus:~/asr$ ./typofy.pl --help
Plaintext typo-fier, by Raphael Bouskila <ralian@gmail.com>
Version: 0.1, April 1 2007
Usage: typofy.pl [-i|-iz INPUTFILE] [-e ERROR_RATE] [-d]
Takes a standard format text file and inserts random typos.
If input file is specified as '-iz inputfile',
the program unzips and reads a zipped input file.
If no input file is specified it uses the
file "typotext" in the current directory.
Error rate can be specified as a probability between 0 and 1.
Debug output is produced with -d.
Output is to standard output.
Implementation
 Typofication
raph@nexus:~/asr$ cat stuff2.txt
two narrow gauge railroads from china enter the city from the northeast
and northwest
some maps use bands of color to indicate different intervals of value
origins or causes of spontaneous mutation are not yet completely clear
unusually high levels of radiation were detected in many european
countries
raph@nexus:~/asr$ ./typofy.pl -e 0.30 -i stuff2.txt
to narrow gauge railroads from china enter he ciyt from the norteast and
nsrthwest
some map zse bands of oclor tj indicateh different intervals of valu
origins or causes of spontaneous mutatio are not yet copmletely slear
unusually igh leves ofb raiation were wetected in many euroiean countries
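The behaviour shown in the transcript can be approximated in a few lines. This is a sketch and not the author's Perl script: it assumes the error rate is a per-word probability, that each affected word receives exactly one Damerau error, and that replacement letters are drawn uniformly from a-z (no key locality). The names `typofy` and `typofy_word` are hypothetical.

```python
import random
import string

LETTERS = string.ascii_lowercase


def typofy_word(word, rng):
    """Apply one random Damerau error to a non-empty word."""
    kinds = ["substitute", "insert"]
    if len(word) > 1:
        # Never delete the only character, so word spacing stays intact.
        kinds += ["delete", "transpose"]
    kind = rng.choice(kinds)

    if kind == "insert":
        pos = rng.randrange(len(word) + 1)
        return word[:pos] + rng.choice(LETTERS) + word[pos:]
    if kind == "transpose":
        # Swapping two identical letters leaves the word unchanged.
        pos = rng.randrange(len(word) - 1)
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    pos = rng.randrange(len(word))
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    others = [c for c in LETTERS if c != word[pos]]
    return word[:pos] + rng.choice(others) + word[pos + 1:]


def typofy(text, error_rate, seed=None):
    """Corrupt each word with probability `error_rate`; spaces and line breaks are kept."""
    rng = random.Random(seed)
    out_lines = []
    for line in text.splitlines():
        words = [
            typofy_word(w, rng) if w and rng.random() < error_rate else w
            for w in line.split(" ")
        ]
        out_lines.append(" ".join(words))
    return "\n".join(out_lines)


if __name__ == "__main__":
    print(typofy("some maps use bands of color to indicate value", 0.30, seed=7))
```

Fixing the seed makes a test set reproducible, which helps when comparing correction strategies on the same noisy input.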
Implementation
 Source corpus: Wall Street Journal database
 Dictionary lookup
 4989-word dictionary
 N-gram language model (110 MB)
 Backoff trigram model
 1639687 bigrams
 2684151 trigrams
Implementation
 FSM word alignment
 Suggests n-best corrections
 Corrections sorted via n-gram perplexity (log-probability) score
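The ranking step can be sketched independently of how the candidates are produced. For candidate sentences of the same length, lower perplexity and higher total log-probability give the same ordering, so the sketch sorts on log-probability. The scoring function is passed in as a parameter; `toy_log_prob` and its numbers are placeholders invented for the example, not the talk's language model.

```python
def rank_corrections(context, candidates, log_prob, n_best=3):
    """Return the `n_best` candidates, most probable first, given the left context."""
    scored = [(log_prob(word, context), word) for word in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:n_best]]


# Placeholder bigram scores (log10), invented for this example.
TOY_BIGRAMS = {
    ("the", "city"): -0.9,
    ("the", "kitty"): -2.8,
    ("the", "cite"): -3.5,
}


def toy_log_prob(word, context):
    """Look up a bigram score, with a flat penalty for unseen pairs."""
    previous = context[-1] if context else "<s>"
    return TOY_BIGRAMS.get((previous, word), -6.0)


if __name__ == "__main__":
    # Candidate corrections for the typo "ciyt" after "enter the".
    candidates = ["cite", "city", "kitty", "cist"]
    print(rank_corrections(["enter", "the"], candidates, toy_log_prob))
```

Passing the scorer as an argument keeps the ranking code unchanged if the toy table is swapped for a real backoff trigram model.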
Issues
 Out-of-vocabulary errors
 [PLEISTOCENE] → ?
 WSJ dictionary: about 5,000 words
 Typical human vocabulary: 38,000 headwords; hundreds of
thousands of total words
 http://www.worldwidewords.org/articles/howmany.htm
 In-vocabulary errors
 [THUS] → [THIS]
 Assign greater weight to n-gram score
 Some other grammar/context checking model
 Noisiness of channel
 Not as much of an issue with single-error model
 Can still affect results by decreasing effectiveness of context
clues
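One way to read "assign greater weight to n-gram score" for in-vocabulary errors is as a log-linear combination of the channel model and the language model. The formulation below is an assumption, not taken from the slides: x_i is the typed word, C(x_i) is its candidate set (including x_i itself, so that a valid word such as THUS can still be replaced), and λ is a tunable weight.

```latex
\hat{w}_i \;=\; \arg\max_{w \in C(x_i)}
\Big[ \log P(x_i \mid w) \;+\; \lambda \, \log P(w \mid w_{i-2}, w_{i-1}) \Big],
\qquad \lambda \ge 1
```

Larger values of λ let context override the typed form more often, which catches more in-vocabulary errors at the risk of "correcting" words that were typed as intended.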
Performance
 TBA
 Research possibilities:
 Correction success vs. channel noisiness
 Multiple-error model
 Non-letter error model (space, caps lock, etc.)
 Key locality clues
 Grammar clues (e.g. Chomsky CFG model)
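As an illustration of what key locality clues might look like, the sketch below makes a substitution cheaper when the two keys are neighbours on a QWERTY keyboard. The layout grid ignores the physical stagger between rows, and the cost values are arbitrary placeholders; nothing here comes from the talk's implementation.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to its (row, column) position in the simplified grid.
KEY_POSITION = {
    key: (row, col)
    for row, keys in enumerate(QWERTY_ROWS)
    for col, key in enumerate(keys)
}


def substitution_cost(typed, intended, near=0.5, far=1.0):
    """Cost of having typed `typed` when `intended` was meant."""
    if typed == intended:
        return 0.0
    row_a, col_a = KEY_POSITION[typed]
    row_b, col_b = KEY_POSITION[intended]
    # Keys count as neighbours if they are at most one step apart in the grid.
    adjacent = abs(row_a - row_b) <= 1 and abs(col_a - col_b) <= 1
    return near if adjacent else far


if __name__ == "__main__":
    print(substitution_cost("s", "a"))  # neighbouring keys
    print(substitution_cost("p", "a"))  # keys far apart
```

A cost function of this kind could replace the flat substitution cost in an edit-distance computation.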
Thanks!
 Prof. Rose
 Providing ultra-massive language model
 Many explanatory discussions
 Second Cup Coffee Co.
 Substitute for sleep
