This document discusses spell checking using n-gram language models. It motivates the approach both for correcting typed input directly and for improving speech recognition. It describes Shannon's noisy channel model and the common spelling error types, then presents an implementation that uses a trigram language model trained on Wall Street Journal text to suggest corrections for artificially generated typos. Issues addressed include out-of-vocabulary errors and the way channel noise weakens contextual clues. Performance results are to be determined, along with potential research directions such as handling multiple errors or using grammatical clues.
3. Motivation
L. Zhuang, F. Zhou, D. Tygar. Keyboard Acoustic Emanations
Revisited. Proceedings of the 12th ACM Conference on
Computer and Communications Security, November 2005.
5. Theory
Shannon’s noisy channel model
C. Shannon. A mathematical theory of communication.
Bell System Technical Journal 27 (3), pp. 379-423, 1948.
6. Theory
Classical Damerau errors (1964)
Substitution
[ALPHABET] [ALPHSBET]
Deletion
[ALPHABET] [ALPHBET]
Insertion
[ALPHABET] [ALPHAABET]
Transposition
[ALPHABET] [ALPHBAET]
F.J. Damerau. A technique for computer detection and
correction of spelling errors. Communications of the ACM 7 (3),
pp. 171-176, 1964.
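The four classical error types above can be sketched in a few lines of Python (a minimal illustration of the error model, not code from the deck, whose own tooling is the Perl script shown later):

```python
import random

def damerau_errors(word, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Apply each of Damerau's four single-error types to `word`."""
    i = random.randrange(len(word) - 1)  # error position (last skipped so transposition is valid)
    return {
        "substitution":  word[:i] + random.choice(alphabet) + word[i+1:],
        "deletion":      word[:i] + word[i+1:],
        "insertion":     word[:i] + random.choice(alphabet) + word[i:],
        "transposition": word[:i] + word[i+1] + word[i] + word[i+2:],
    }
```

For example, applied to ALPHABET, the deletion variant drops one letter and the insertion variant adds one, exactly as in the slide's examples.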
7. Theory
Levenshtein distance (1966)
Lecture 6 (DTW word alignment)
Assign cost to each Damerau error
Not all models consider transposition
V. Levenshtein. Binary codes capable of correcting
deletions, insertions and reversals. Soviet Physics –
Doklady 10, pp. 707-710, 1966.
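The distance in question can be sketched as the standard dynamic-programming recurrence, here with unit cost per Damerau error and the optional adjacent-transposition case that, as noted above, not all models include:

```python
def damerau_levenshtein(a, b):
    """Edit distance with unit-cost substitution, deletion, insertion,
    and adjacent transposition (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # cost of deleting i chars
    for j in range(len(b) + 1):
        d[0][j] = j                      # cost of inserting j chars
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,          # deletion
                          d[i][j-1] + 1,          # insertion
                          d[i-1][j-1] + cost)     # substitution
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)  # transposition
    return d[len(a)][len(b)]
```

Each of the four slide-6 examples is distance 1 from ALPHABET under this measure.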
8. Implementation
Test data creation: typofy.pl
Single-error model
Word spacing not affected
Key locality not considered
9. Implementation
Test data creation: typofy.pl
raph@nexus:~/asr$ ./typofy.pl --help
Plaintext typo-fier, by Raphael Bouskila <ralian@gmail.com>
Version: 0.1, April 1 2007
Usage: typofy.pl [-i|-iz INPUTFILE] [-e ERROR_RATE] [-d]
Takes a standard format text file and inserts random typos.
If input file is specified as '-iz inputfile',
the program unzips and reads a zipped input file.
If no input file is specified it uses the
file "typotext" in the current directory.
Error rate can be specified as a probability between 0 and 1.
Debug output is produced with -d.
Output is to standard output.
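The typofication step can be approximated in Python as follows. This is a hypothetical re-implementation of the single-error model described on slide 8, not the actual typofy.pl source: at most one Damerau error per word, word spacing untouched, key locality ignored.

```python
import random
import string

def typofy_line(line, error_rate=0.30, rng=random):
    """Single-error model: corrupt each word with probability
    `error_rate`; spaces are never touched, key locality is ignored."""
    out = []
    for word in line.split():
        if len(word) > 1 and rng.random() < error_rate:
            kind = rng.choice(["sub", "del", "ins", "trans"])
            # error position (last position skipped so transposition is valid)
            i = rng.randrange(len(word) - 1)
            if kind == "sub":
                word = word[:i] + rng.choice(string.ascii_lowercase) + word[i+1:]
            elif kind == "del":
                word = word[:i] + word[i+1:]
            elif kind == "ins":
                word = word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
            else:
                word = word[:i] + word[i+1] + word[i] + word[i+2:]
        out.append(word)
    return " ".join(out)
```

With an error rate of 0.30 this produces output of the same flavor as the sample typofication on the next slide.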
10. Implementation
Typofication
raph@nexus:~/asr$ cat stuff2.txt
two narrow gauge railroads from china enter the city from the northeast
and northwest
some maps use bands of color to indicate different intervals of value
origins or causes of spontaneous mutation are not yet completely clear
unusually high levels of radiation were detected in many european
countries
raph@nexus:~/asr$ ./typofy.pl -e 0.30 -i stuff2.txt
to narrow gauge railroads from china enter he ciyt from the norteast and
nsrthwest
some map zse bands of oclor tj indicateh different intervals of valu
origins or causes of spontaneous mutatio are not yet copmletely slear
unusually igh leves ofb raiation were wetected in many euroiean countries
11. Implementation
Source corpus: Wall Street Journal database
Dictionary lookup
4,989-word dictionary
N-gram language model (110 MB)
Backoff trigram model
1,639,687 bigrams
2,684,151 trigrams
12. Implementation
FSM word alignment
Suggests n-best corrections
Corrections sorted via n-gram perplexity (log-probability) score
13. Issues
Out-of-vocabulary errors
[PLEISTOCENE] ?
WSJ corpus: 5,000 words
Typical human vocabulary: 38,000 headwords; hundreds of thousands of total words
http://www.worldwidewords.org/articles/howmany.htm
In-vocabulary errors
[THUS] [THIS]
Assign greater weight to n-gram score
Some other grammar/context checking model
Noisiness of channel
Not as much of an issue with single-error model
Can still affect results by decreasing effectiveness of context clues
14. Performance
TBA
Research possibilities:
Correction success vs. channel noisiness
Multiple-error model
Non-letter error model (space, caps lock, etc.)
Key locality clues
Grammar clues (e.g. Chomsky CFG model)
15. Thanks!
Prof. Rose
Providing ultra-massive language model
Many explanatory discussions
Second Cup Coffee Co.
Substitute for sleep