This document presents an overview of spell checking techniques in natural language processing. It discusses how spell checkers work by scanning text, comparing words against a dictionary, and using language-dependent algorithms. Two categories of spelling errors are described: real-word errors, in which the mistyped word is itself a valid dictionary word, and non-word errors, in which it is not. Techniques for error detection include dictionary lookup and n-gram comparison using the Jaccard coefficient. The Levenshtein distance and Jaccard coefficient algorithms are then explained and shown to produce suggestions by measuring the similarity between source and target words. The presentation concludes that these algorithms filter dictionary words and provide accurate suggestions to correct spelling mistakes in text.
3. Abstract
In this project, I propose a simple, flexible, and efficient spell-checker editor application based on edit-distance scores and supervised learning. I integrate the Levenshtein distance (LD) and Jaccard coefficient algorithms to achieve the required target; both algorithms measure the similarity between two strings, which we will refer to as the source string (s) and the target string (t). My approach is to design an NLP-based text editor with an automatic spell checker that suggests corrections for the user's mistakes. I use a novel scoring scheme to integrate the words retrieved by each spelling approach and calculate an overall score for each matched word. From the overall scores, we can rank the possible matches. These algorithms require a training data set, which is simply data in the form of a dictionary. While content is being written in the editor, the backend process runs tokenization and distance calculation, filters the results further using the algorithms mentioned above, and finally suggests appropriate results.
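The scoring pipeline described in the abstract can be sketched as follows. The abstract does not specify the exact scoring scheme, so the equal weights, the LD normalisation, and the tiny word list below are illustrative assumptions, not the project's actual implementation.

```python
# Sketch of the combined scoring idea: blend a normalised edit-distance
# similarity with the bigram Jaccard coefficient, then rank dictionary
# words by the overall score. Weights and word list are assumptions.

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def bigrams(word):
    """Set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(s, t):
    a, b = bigrams(s), bigrams(t)
    return len(a & b) / len(a | b) if a | b else 1.0

def overall_score(s, t, w_ld=0.5, w_j=0.5):
    """Weighted blend of LD similarity and Jaccard coefficient."""
    max_len = max(len(s), len(t)) or 1
    ld_sim = 1 - levenshtein(s, t) / max_len  # normalise LD into [0, 1]
    return w_ld * ld_sim + w_j * jaccard(s, t)

dictionary = ["cat", "cut", "cot", "dog"]
misspelling = "caat"
ranked = sorted(dictionary, key=lambda w: overall_score(misspelling, w),
                reverse=True)
print(ranked[0])  # "cat" scores highest
```

Normalising the edit distance by the longer string's length puts both signals on a comparable 0–1 scale before blending.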
4. Introduction
Spell check is the process of detecting, and sometimes providing
suggestions for, incorrectly spelled words in a text.
In computing, a spell checker is an application program that flags words
in a document that may not be spelled correctly.
A spell checker may be a stand-alone tool capable of operating on a block of
text, or part of a larger application such as a word processor or an
electronic dictionary.
A basic spell checker carries out the following processes:
It scans the text and extracts the words contained in it.
It then compares each word against a known list of correctly spelled words (i.e.
a dictionary).
An additional step is a language-dependent algorithm for handling
morphology.
5. Spelling errors can be divided into two categories:
Real-word errors
Non-word errors
Real-word errors: error words that are nevertheless acceptable words in
the dictionary.
Non-word errors: error words that cannot be found in the
dictionary.
Real-word errors are complex to provide suggestions for, because the words
themselves are valid, so they might not be flagged or suggested.
6. 2. ERROR DETECTION TECHNIQUES
A. Dictionary Lookup Technique:
In this technique, every word of the input text is checked for its presence
in the dictionary.
If the word is present in the dictionary, then it is a correct
word.
Otherwise it is put into the list of error words.
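The dictionary lookup step above can be sketched in a few lines. The word list is a stand-in; a real checker would load a full dictionary file.

```python
# Minimal sketch of dictionary-lookup error detection: scan the text,
# extract the words, and flag every word absent from the dictionary.

DICTIONARY = {"this", "is", "a", "simple", "test"}  # illustrative word list

def detect_errors(text):
    """Return the words of `text` that are not in the dictionary."""
    words = text.lower().split()
    return [w for w in words if w not in DICTIONARY]

print(detect_errors("This is a simpel test"))  # ['simpel']
```

Whitespace splitting is the simplest possible tokenizer; a real editor would also strip punctuation before the lookup.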
7. B. ALGORITHMS FOR ERROR WORDS
1. N-grams Based Technique using the Jaccard coefficient
String 1 – “statistics”
String 2 – “statistical”
If n is set to 2 (bigrams are being extracted), then the similarity of the two
strings is calculated as follows.
Initially, the two strings are split into n-grams:
statistics → st ta at ti is st ti ic cs (9 bigrams)
statistical → st ta at ti is st ti ic ca al (10 bigrams)
Coefficient = |A ∩ B| / |A ∪ B|
where A and B are the sets of bigrams of the two strings. Here
A ∩ B = {st, ta, at, ti, is, ic} has 6 elements and A ∪ B has 9, so the
coefficient is 6/9 ≈ 0.67.
8. 2. The Levenshtein Algorithm (LD)
Levenshtein distance (LD) is a measure of the similarity between two strings,
which we will refer to as the source string (s) and the target string (t). The
distance is the number of deletions, insertions, or substitutions required to
transform s into t. For example:
If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are
needed; the strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change
"s" to "n") is sufficient to transform s into t.
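The definition above can be implemented with the standard dynamic-programming table, shown here as a short sketch that reproduces the two examples.

```python
def levenshtein(s, t):
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i   # delete all i characters of s
    for j in range(n + 1):
        dp[0][j] = j   # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "tent"))  # 1
```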
9. The Levenshtein distance algorithm has been used in:
Spell checking
Speech recognition
DNA analysis
Plagiarism detection
Operations: insertion, deletion, substitution
In this algorithm, a cost is calculated by comparing the source word with each
target word, and the lowest-cost word is suggested to the user.
e.g. Cat → Cut
Here there is one substitution of a single letter.
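The lowest-cost suggestion step can be sketched as: compute the edit distance from the typed word to every dictionary word and return the cheapest matches first. The small word list is an illustrative assumption.

```python
def levenshtein(s, t):
    """Compact two-row edit distance (insert, delete, substitute)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def suggest(word, dictionary, k=3):
    """Rank dictionary words by edit-distance cost, lowest cost first."""
    return sorted(dictionary, key=lambda w: levenshtein(word, w))[:k]

print(suggest("cut", ["cat", "cot", "dog", "cart"]))  # ['cat', 'cot', 'cart']
```

Because Python's sort is stable, ties on cost keep the dictionary's original order; a real editor might break ties by word frequency instead.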
13. Conclusion
In this presentation we have seen error detection and correction
techniques. The words suggested to the end user are based
on two algorithms: the Jaccard coefficient and the
Levenshtein distance.
These algorithms filter the dictionary
words and provide exact suggestions to the user, so that
the text the user enters in the editor is error free and
does not contain any spelling mistakes.