Spell checker for Kannada OCR

IDL - International Digital Library of Technology & Research
Volume 1, Issue 4, April 2017. Available at: www.dbpublications.org
International e-Journal for Technology and Research, 2017
Copyright@IDL-2017
Sharathkumar S
Assistant Professor
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
skumars@sit.ac.in
Suma S, Sneha N
UG scholars,
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
suma.sinchu.ds@gmail.com
snehan769@gmail.com
Abstract: A spell checker is an application program that helps process natural language in machine-readable form. Checking and correcting spelling is a basic necessity in any language and is tedious to do by hand, so software is needed to do it. A spell checker is a set of programs that detects wrongly spelled words and replaces each with the most likely correct word. The challenge addressed here is doing this for the Kannada language. In software systems, Kannada words are typed in several formats, because Kannada text can be written with many different fonts.
In this paper, we describe some techniques used by a spell checker for the Kannada language. We use natural language processing (NLP), the field of computer science concerned with the interaction between human (i.e., natural) languages and computers. Modern NLP algorithms are usually based on machine learning.
Keywords: Spell checker, NLP, OCR, Dictionary Lookup
I. INTRODUCTION
Kannada is a Dravidian language spoken predominantly by the people of Karnataka and neighboring states. It has roughly forty million native speakers and a total of 50.8 million speakers according to the 2001 census. Spell checking is a critical problem in NLP. A spell checker is an important tool, tightly coupled with other components, in software such as OCR systems, word processors and even translators.
1.1 Error Analyzer
A linguistic error analyzer is a tool that studies the types and causes of language errors.
Errors may be classified as: conceptualization errors (i.e., thinking), phoneme-to-grapheme mapping errors (i.e., writing), typing errors, OCR-generated errors, and errors generated by a speech recognizer.
Conceptualization errors
Errors that occur due to one's way of thinking.
Ex: may be wrongly written as .
Phoneme to grapheme mapping errors
Errors that occur while writing dictated words.
Ex: may be wrongly written as .
Typing errors
Errors that occur while typing, by pressing the wrong key.
Ex: may be wrongly typed as .
OCR generated errors
Errors that occur when OCR recognizes a character incorrectly.
Ex: may be wrongly recognized as .
Errors generated by speech recognizer
Errors that occur due to wrong pronunciation of words or wrong recognition of words by the speech recognizer.
Ex: may be wrongly recognized as .
1.2 Optical Character Recognition (OCR)
Optical character recognition (OCR) is a technique for moving text from paper form to electronic form. OCR is required to convert an image, written text or e-text into a machine-readable format; its input can be a plain document, an image, etc. Sources for OCR include bank statements, ATM transactions, e-statements, mailing documents, etc.
For tasks such as speech-to-text and image-to-text conversion, and vice versa, the text is analyzed in digitized form, so that it can be easily edited, stored and accessed via an open-access system. OCR is a field of research spanning NLP, machine learning, artificial intelligence and computer vision.
Today there is a need for flexible and accurate OCR systems that can recognize any type of font and accept a variety of digital image inputs.

OCR Errors
Due to noise, the following errors may occur:
• Reject error: the machine reading process may not be able to recognize a character.
• Substitution error: OCR may recognize a character incorrectly.
• Character fusion: two or more character images merge to appear as a single connected component.
• Character fragmentation: a character image is fragmented into more than one sub-image.
1.3 Spell Checker
A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools or as part of larger applications such as search engines and translators. A simple spell checker performs the following tasks:
• Scanning the text and extracting the words it contains.
• Matching the typed words, including those with special symbols, hyphens, etc., against correctly written words. This is the important step.
• Handling morphology, which requires a language-dependent algorithm. Even English requires the spell checker to handle related word forms such as plurals and verbal forms, and these steps are more complicated for other languages.
Related Work
The literature survey reveals that most research on Kannada spell checkers focuses on normal text, while some efforts have been made for OCR text in other languages such as Punjabi and Hindi. No work on OCR spell checkers has been reported for Kannada directly. Some works on OCR spell checkers in other languages and on Kannada spell checkers are reported here.
This review discusses common spell checking approaches and the problems that may occur during the spell checking process. There are two common approaches for implementing a spell checker: the dictionary lookup method and the N-gram approach.
One such Kannada spell checker works as follows. The text is first divided into a series of separate words. A morphological analyzer then uses separate dictionaries to find the root word and the suffix that follows it. A mapping function is needed to establish the relationship between the different root words and their suffixes. The validity of a word is checked using the morphological analyzer. The type of error is then identified: correct root and incorrect suffix, incorrect root and correct suffix, or correct root and matching suffix. Each type of error is handled individually, and the incorrect words are made visible to the user. The misspelled words are then corrected with the user's help by offering suitable suggestions. The drawback of this system is that it fails on OCR output text: it cannot efficiently handle special OCR cases such as character fusion and character fragmentation.
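The root-and-suffix check described above can be illustrated with a minimal Python sketch. The function name classify_word, the Romanized root list and the suffix list are all hypothetical, and the sketch assumes any listed root may take any listed suffix, so the mapping function between roots and suffixes is not modeled. A real morphological analyzer would be considerably more involved.

```python
def classify_word(word, roots, suffixes):
    """Classify a word by trying every split into root + suffix."""
    if word in roots:
        return "valid"  # a bare root with no suffix
    result = "unknown"
    for i in range(1, len(word)):
        root, suffix = word[:i], word[i:]
        if root in roots and suffix in suffixes:
            return "valid"
        if root in roots:
            # A known root takes priority over a known suffix.
            result = "correct root, incorrect suffix"
        elif suffix in suffixes and result == "unknown":
            result = "incorrect root, correct suffix"
    return result


# Hypothetical Romanized entries, for illustration only.
roots = {"rAma", "kAdu"}
suffixes = {"nu", "ge"}

for w in ["rAmanu", "rAmanx", "rAmxnu"]:
    print(w, "->", classify_word(w, roots, suffixes))
```

Trying every split keeps the sketch short; it is quadratic in the word length at worst, which is acceptable for single words.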
Dictionary lookup method
The dictionary lookup method compares the words in the input file with the correct words in a dictionary. It helps in checking letters that OCR finds ambiguous. For a large dictionary, however, it leads to size overhead, the calculation of probabilities becomes complex, and the cost of searching increases.
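As a rough illustration of dictionary lookup, the following Python sketch flags every token that is absent from a word list. The function name find_misspelled and the three-word Romanized dictionary are assumptions made for this example. A hash-based set is used so that each lookup takes roughly constant time.

```python
def find_misspelled(tokens, dictionary):
    """Return the tokens that are not present in the dictionary."""
    return [t for t in tokens if t not in dictionary]


# Hypothetical Romanized dictionary, for illustration only.
dictionary = {"rAmanu", "kAdig.e", "hodanu"}

print(find_misspelled(["rAmanu", "kAdige", "hodanu"], dictionary))
```

Storing the dictionary as a set addresses the search cost mentioned above, but not the memory overhead of a very large word list.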
N-gram approach
An N-gram is a contiguous sequence of N items from a text, where the items can be phonemes, graphemes, letters, words or even pairs of words.
Unigrams, bigrams and trigrams are the common varieties. An N-gram model is a probabilistic language model: it predicts the next item from the preceding N-1 items, in the form of a Markov model of order N-1.
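A character-level bigram model (N = 2) is perhaps the simplest instance of this idea. The Python sketch below estimates the probability of the next letter given the previous one from raw counts. The function names and the three Romanized training strings are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count how often each character follows another."""
    counts = defaultdict(Counter)
    for word in words:
        for prev, nxt in zip(word, word[1:]):
            counts[prev][nxt] += 1
    return counts


def next_prob(counts, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, *)."""
    following = counts.get(prev)
    if not following:
        return 0.0  # prev was never seen with a successor
    return following[nxt] / sum(following.values())


# Hypothetical Romanized training strings, for illustration only.
model = train_bigrams(["avanu", "avaLu", "avaru"])
print(next_prob(model, "a", "v"))
print(next_prob(model, "v", "a"))
```

A letter pair with a very low probability under such a model could be treated as a hint that OCR misrecognized a character.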
Design

Romanization
Romanization is the process of converting written text from a specific writing system to Roman script. Romanization includes the following methods:
• Transliteration, for representing written text
• Transcription, for representing the spoken word
• A combination of both transliteration and transcription
Ex: 1. in Romanized format is written as 'avanu'.
2. in Romanized format is written as 'snEha'.
In this tool we read the input file line by line, and each line is Romanized to Roman script.
Ex: is Romanized to Roman script as 'rAmanu kAdig.e hodanu'.
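To make the transliteration step concrete, here is a minimal Python sketch. It assumes a scheme in which every Kannada consonant carries an inherent 'a' that is replaced by a following vowel sign or removed by the virama. The mapping tables are a small hypothetical subset chosen to match Romanized forms such as 'avanu' and 'snEha'; they are not the tool's actual tables, and the function name romanize is likewise an assumption.

```python
# Hypothetical subset of a Kannada-to-Roman mapping (assumed scheme).
VOWELS = {"\u0C85": "a", "\u0C86": "A", "\u0C87": "i", "\u0C89": "u"}
CONSONANTS = {"\u0CA8": "n", "\u0CAE": "m", "\u0CB0": "r",
              "\u0CB5": "v", "\u0CB8": "s", "\u0CB9": "h"}
VOWEL_SIGNS = {"\u0CBE": "A", "\u0CBF": "i", "\u0CC1": "u", "\u0CC7": "E"}
VIRAMA = "\u0CCD"  # suppresses the inherent vowel of a consonant


def romanize(text):
    """Transliterate Kannada text to Roman script, character by character."""
    out = []
    inherent = False  # True while the last piece ends in an inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch] + "a")
            inherent = True
        elif ch in VOWEL_SIGNS and inherent:
            # The vowel sign replaces the inherent 'a'.
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[ch]
            inherent = False
        elif ch == VIRAMA and inherent:
            # The virama removes the inherent 'a'.
            out[-1] = out[-1][:-1]
            inherent = False
        else:
            # Independent vowels are mapped; anything else passes through.
            out.append(VOWELS.get(ch, ch))
            inherent = False
    return "".join(out)


print(romanize("\u0C85\u0CB5\u0CA8\u0CC1"))        # a + va + na + u-sign
print(romanize("\u0CB8\u0CCD\u0CA8\u0CC7\u0CB9"))  # sa + virama + na + E-sign + ha
```

Characters outside the tables, such as spaces and punctuation, are passed through unchanged, so a whole line can be processed in one call.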
Tokenization
Tokenization is the process of splitting text into a set of tokens, i.e., meaningful elements such as words, phrases and symbols.
Ex: Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.
In our project, the Romanized text is given as input to this module, which produces a list of tokenized words for comparison.
Ex: Input: rAmanu kAdig.e hodanu
Output: ['rAmanu', 'kAdig.e', 'hodanu']
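The tokenization example above can be reproduced with a short Python sketch, assuming that tokens are separated by whitespace. The function name tokenize is hypothetical, and punctuation attached to a word is deliberately left in place, since the Romanized form 'kAdig.e' contains a dot.

```python
def tokenize(line):
    """Split a Romanized line into tokens on runs of whitespace."""
    return line.split()


print(tokenize("rAmanu kAdig.e hodanu"))
```

Calling split() with no arguments also discards leading, trailing and repeated whitespace, so blank lines yield an empty list.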
Comparison
Each word is compared with a standard dictionary, and its validity is checked using the minimum edit distance algorithm.
Minimum Edit Distance
The minimum edit distance used here is the Levenshtein distance, which measures how dissimilar two strings or words are. It is the minimum number of substitution, insertion and deletion operations needed to convert one string into the other. For spelling correction, the misspelled word is compared with the words in a standard dictionary, and the word with the lowest distance is chosen as the most suitable correction.
Ex: For the two strings INTENTION and EXECUTION, the minimum edit distance is 5, i.e., a minimum of 5 operations is required to change INTENTION into EXECUTION (delete I, substitute N with E, substitute T with X, insert C, substitute N with U).
Dictionary words whose edit distance from a given misspelled word is less than or equal to 3 are suggested as corrections.
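The distance computation and the suggestion rule can be sketched in Python as follows. This is a standard dynamic-programming formulation with a cost of 1 for each insertion, deletion and substitution, not necessarily the implementation used in the tool. The function names and the small Romanized dictionary are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b with unit operation costs."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=3):
    """Return dictionary words within max_distance, nearest first."""
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


print(edit_distance("INTENTION", "EXECUTION"))  # 5
# Hypothetical Romanized dictionary, for illustration only.
print(suggest("hodnu", ["hodanu", "rAmanu", "kAdig.e"]))
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of one word rather than with the product of both lengths.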
Results

Conclusion
In this project we implemented a spell checker for Kannada OCR. Through it we learnt various tools for implementing a spell checker, understood the problems that occur during text processing, and learnt how to tackle them. Although there were many problems in Kannada text processing, we arrived at a workable way to implement a spell checker for Kannada OCR. As this project is a first attempt at implementing a spell checker for Kannada OCR, we hope it serves as a platform for beginners to understand the various aspects of spell checking for Kannada OCR.
Future Work
Several directions for future work can be proposed. Some improve performance, while others build on the work done here. We believe the following can be pursued:
• The methods can be improved to achieve better efficiency.
• A larger dictionary with a much larger set of words can be used.
• Root words and affixes can be separated to improve performance.
• The work can be extended to handle semantic errors as well.
• The work can be extended by applying a multi-threaded approach in the spell checker tool.
