IDL - International Digital Library Of
Technology & Research
Volume 1, Issue 4,April 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
IDL - International Digital Library 1 | P a g e Copyright@IDL-2017
Spell checker for Kannada OCR
Sharathkumar S
Assistant Professor
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
skumars@sit.ac.in
Suma S, Sneha N
UG scholars,
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
suma.sinchu.ds@gmail.com
snehan769@gmail.com
Abstract: A spell checker is an application program that processes natural-language text in machine-readable form. Checking and correcting spelling is a basic but tedious task in any language, so spell checker software is needed to do it. A spell checker is a set of programs that identifies a wrongly used word and replaces it with the most probable correct word. The challenge here is doing this for the Kannada language. In software systems, Kannada words are typed in several formats, since Kannada has many fonts in which text can be written.
In this paper, we describe some techniques used by a spell checker for the Kannada language. We use natural language processing (NLP), the field of computer science concerned with the interaction between human (i.e., natural) languages and computers. Modern NLP algorithms are usually based on machine learning.
Keywords: Spell checker, NLP, OCR, dictionary lookup
I. INTRODUCTION
Kannada is a Dravidian language spoken predominantly by the people of Karnataka and neighboring states. It has roughly forty million native speakers and a total of 50.8 million speakers according to the 2001 census. Spell checking is a critical problem in NLP. A spell checker is an important, often tightly coupled, component of software such as OCR systems, word processors and even translators.
1.1 Error Analyzer
A linguistic error analyzer is a tool that studies the types and causes of language errors.
Errors may be classified as: conceptualization errors (i.e., thinking), phoneme-to-grapheme mapping errors (i.e., writing), typing errors, OCR-generated errors, and errors generated by a speech recognizer.
Conceptualization errors
Errors that occur due to one's way of thinking.
Ex: [Kannada word] may be wrongly written as [Kannada word].
Phoneme-to-grapheme mapping errors
Errors that occur while writing dictated words.
Ex: [Kannada word] may be wrongly written as [Kannada word].
Typing errors
Errors that occur while typing, when a wrong key is pressed.
Ex: [Kannada word] may be wrongly typed as [Kannada word].
OCR-generated errors
Errors that occur when the OCR recognizes a character incorrectly.
Ex: [Kannada word] may be wrongly recognized as [Kannada word].
Errors generated by a speech recognizer
Errors that occur due to wrong pronunciation of words or wrong recognition of words by the speech recognizer.
Ex: [Kannada word] may be wrongly recognized as [Kannada word].
1.2 Optical Character Recognition (OCR)
Optical character recognition is a technique for moving text from paper form to electronic form. To convert an image, written text or e-text into a machine-readable format we require OCR; the input can be a plain document, an image, etc. Sources for OCR include bank statements, ATM transactions, e-statements, mailing documents, etc.
For tasks such as speech-to-text and image-to-text conversion (and vice versa), the text is analyzed in digitized form so that it can be easily edited, stored and accessed, for example through an open-access system. OCR is a field of research spanning NLP, machine learning, artificial intelligence and computer vision.
Today, an OCR system needs to be flexible enough to recognize any type of font and to accept various kinds of digital image input, so that it produces accurate output for the input supplied.
OCR Errors
Due to noise, the following errors may occur:
• Reject error
The machine reading process may not be able to recognize a character.
• Substitution error
The OCR may recognize a character incorrectly.
• Character fusion
Two or more character images merge to appear as a single connected component.
• Character fragmentation
A character image is fragmented into more than one sub-image.
1.3 Spell Checker
A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools or as part of larger applications such as search engines and translators. A simple spell checker performs the following tasks:
• Scanning the text and extracting the words it contains.
• Matching the typed words, including those with special symbols, hyphens, etc., against correctly written words. This is the important step.
• Handling morphology, which requires a language-dependent algorithm. Even English needs this for related word forms such as plurals and verbal forms, so handling these steps for other languages is a complicated issue.
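The morphology step can be illustrated with a minimal sketch for English, assuming a toy lexicon and a short suffix list (both hypothetical, chosen only for illustration). A word is accepted if it is in the lexicon directly or after removing one known suffix. A real morphological analyzer, especially for Kannada, would need far richer rules.

```python
# Toy lexicon and suffix list: illustrative assumptions, not a real resource.
LEXICON = {"check", "checker", "spell", "word", "process"}
SUFFIXES = ("ing", "ed", "es", "s")


def is_known(word, lexicon=LEXICON, suffixes=SUFFIXES):
    """Return True if the word, or the word minus one suffix, is in the lexicon."""
    word = word.lower()
    if word in lexicon:
        return True
    for suffix in suffixes:
        # Only strip the suffix if a non-empty stem remains.
        if word.endswith(suffix) and len(word) > len(suffix):
            if word[: -len(suffix)] in lexicon:
                return True
    return False


for candidate in ["checkers", "processes", "spelling", "chekcer"]:
    print(candidate, is_known(candidate))
```

Stripping a single suffix keeps the lexicon small, at the price of accepting some invalid forms (for example a stem joined to a suffix it never takes).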
Related Work
The literature survey reveals that most research on Kannada spell checkers focuses on normal text, while some efforts have been made for OCR text in other languages such as Punjabi and Hindi. However, no work on OCR spell checkers has been reported for Kannada directly. Some works on OCR spell checkers in other languages and on Kannada spell checkers are reported here.
This review discusses common spell checking approaches and the problems that may occur during the spell checking process. There are two common approaches to implementing a spell checker: the dictionary lookup method and the N-gram approach.
First, the bulk of the text is divided into a series of separate words. A built-in morphological analyzer then uses separate dictionaries to look up the root word and the suffix that follows it. A mapping function is needed to establish the relationship between the different varieties of root words and their suffixes. The validity of a word is checked using the morphological analyzer. For an incorrect word, the type of error has to be identified, viz. correct root and incorrect suffix, incorrect root and correct suffix, and correct root and matching suffix. These errors are handled individually, and the incorrect words are made visible to the user. The misinterpreted words are then corrected with the help of the user, who is given suitable suggestions. The drawback of this system is that it fails when applied to OCR output text: it cannot efficiently handle special cases in OCR such as character fusion and character fragmentation.
Dictionary lookup method
The dictionary lookup method compares the words in the input file with the correct words in a dictionary. It is an advantage for OCR output because it helps to inspect letters that are ambiguous. On a large scale, however, it leads to size overhead, and both the calculation of probabilities and the cost of searching become more complex.
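A minimal sketch of dictionary lookup, assuming the dictionary is available as one word per line and that the input is already tokenized. The function names and the sample words are hypothetical. A Python set is used so that each lookup takes constant time on average, which addresses the search cost but not the memory overhead mentioned above.

```python
def load_dictionary(lines):
    """Build a set of valid words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def find_non_words(tokens, dictionary):
    """Return the tokens that are not in the dictionary, in input order."""
    return [tok for tok in tokens if tok not in dictionary]


# Hypothetical sample data in Romanized form.
dictionary = load_dictionary(["rAmanu\n", "kAdig.e\n", "hodanu\n", "\n"])
print(find_non_words(["rAmanu", "kAdig.e", "hodamu"], dictionary))
```

In practice `load_dictionary` could be given an open file object, since iterating over a file yields its lines.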
N-gram approach
An N-gram is a contiguous sequence of N items taken from a text; the items can be phonemes, graphemes, letters, words or even pairs of words.
Unigrams, bigrams and trigrams are the common varieties. In general, an N-gram model is a probabilistic language model that predicts the next item from the preceding n-1 items, i.e., it is a Markov model of order n-1.
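As an illustration, the following sketch builds a character bigram model (N = 2) and estimates the probability of the next character given the previous one by relative frequency. The boundary markers '^' and '$', the function names and the training words are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def train_bigrams(words):
    """Count character bigrams; '^' and '$' mark the start and end of a word."""
    counts = defaultdict(Counter)
    for word in words:
        padded = "^" + word + "$"
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def bigram_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    following = counts.get(prev)
    if not following:
        return 0.0
    return following[nxt] / sum(following.values())


counts = train_bigrams(["avanu", "rAmanu"])
print(bigram_probability(counts, "a", "n"))  # 'a' is followed by 'n' twice, 'v' once
```

A sequence whose bigrams have very low probability can be treated as a likely misrecognition, without needing a full word dictionary.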
Design
Romanization
Romanization is the process of converting written text from a specific writing system to Roman script. Romanization includes the following methods:
• Transliteration, for representing written text
• Transcription, for representing the spoken word
• A combination of both transliteration and transcription
Ex: 1. [Kannada word] in Romanized format is written as 'avanu'.
2. [Kannada word] in Romanized format is written as 'snEha'.
In this tool we read the input file line by line, and each line is Romanized into Roman (English) script.
Ex: [Kannada sentence] is Romanized as 'rAmanu kAdig.e hodanu'.
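A minimal sketch of rule-based transliteration from Kannada Unicode text to Roman script. The mapping tables are hypothetical and deliberately partial: they cover only the characters needed for 'avanu' and 'snEha'. The scheme (a capital letter for a long vowel) is inferred from the examples above and is an assumption, not the authors' actual tool. The sketch relies on a property of the script: a consonant carries an inherent 'a' unless it is followed by a vowel sign or a virama.

```python
# Partial, illustrative tables (Unicode Kannada block, U+0C80 to U+0CFF).
VOWELS = {"\u0c85": "a"}            # KANNADA LETTER A
CONSONANTS = {
    "\u0ca8": "n",                  # KANNADA LETTER NA
    "\u0cb5": "v",                  # KANNADA LETTER VA
    "\u0cb8": "s",                  # KANNADA LETTER SA
    "\u0cb9": "h",                  # KANNADA LETTER HA
}
VOWEL_SIGNS = {
    "\u0cc1": "u",                  # KANNADA VOWEL SIGN U
    "\u0cc7": "E",                  # KANNADA VOWEL SIGN EE (long e)
}
VIRAMA = "\u0ccd"                   # KANNADA SIGN VIRAMA (suppresses the inherent 'a')


def romanize(text):
    """Transliterate Kannada text; characters outside the tables pass through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == VIRAMA:
                i += 1              # consonant cluster: no vowel is emitted
            elif nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])
                i += 1              # the vowel sign replaces the inherent 'a'
            else:
                out.append("a")     # inherent vowel
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(romanize("\u0c85\u0cb5\u0ca8\u0cc1"))        # avanu
print(romanize("\u0cb8\u0ccd\u0ca8\u0cc7\u0cb9"))  # snEha
```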
Tokenization
Tokenization is the process of splitting text into a set of tokens, which are meaningful elements such as words, phrases and symbols.
Ex: Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.
In our project the Romanized text is given as input to this module, which returns a list of tokenized words for comparison.
Ex: Input: rAmanu kAdig.e hodanu
Output: ['rAmanu', 'kAdig.e', 'hodanu']
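The tokenization step can be sketched as follows, assuming tokens are separated by whitespace and that only leading and trailing punctuation should be removed, so that a token such as 'kAdig.e' keeps its internal dot. The function name and the punctuation set are hypothetical.

```python
def tokenize(line):
    """Split a line on whitespace and strip punctuation from the token edges."""
    tokens = []
    for raw in line.split():
        tok = raw.strip(".,;:!?\"'()")
        if tok:
            tokens.append(tok)
    return tokens


print(tokenize("rAmanu kAdig.e hodanu"))
# ['rAmanu', 'kAdig.e', 'hodanu']
```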
Comparison
Each word is compared with a standard dictionary, and its validity is checked using the minimum edit distance algorithm.
Minimum Edit Distance
The minimum edit distance used here is the Levenshtein distance: the minimum number of substitution, insertion and deletion operations needed to convert one string into another. Two words are compared, and the distance indicates how similar or dissimilar they are. For spelling correction, a word is compared with the words of a standard dictionary, and the dictionary word with the lowest distance is chosen as the most suitable correction.
Ex: For the two strings INTENTION and EXECUTION, the minimum edit distance is 5, i.e., a minimum of 5 operations is required to change INTENTION into EXECUTION.
The dictionary words whose edit distance from a given misspelled word is less than or equal to 3 are offered as suggestions.
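A minimal sketch of the Levenshtein distance using dynamic programming, with unit cost for substitution, insertion and deletion (the cost model under which INTENTION and EXECUTION are 5 apart), followed by a suggestion function that applies the threshold of 3. The function names and the sample dictionary are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(target) + 1))   # distance from "" to each prefix of target
    for i, s_ch in enumerate(source, start=1):
        curr = [i]                        # distance from source[:i] to ""
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=3):
    """Return dictionary words within max_distance of word, closest first."""
    scored = sorted((edit_distance(word, cand), cand) for cand in dictionary)
    return [cand for dist, cand in scored if dist <= max_distance]


print(edit_distance("INTENTION", "EXECUTION"))  # 5
print(suggest("rAmamu", ["hodanu", "rAmanu"]))
```

Comparing the word against every dictionary entry is linear in the dictionary size; this is adequate for a sketch but would be slow for a large dictionary.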
Results
Conclusion
In this project we have implemented a spell checker for Kannada OCR. Through this project we learnt various tools for implementing a spell checker, came to understand the problems that occur during text processing, and learnt how to tackle them. Although there were many problems in processing Kannada text, we found a workable way to implement a spell checker for Kannada OCR. As this project is a first attempt at implementing a spell checker for Kannada OCR, we hope it serves as a platform for beginners to understand the various aspects of a spell checker for Kannada OCR.
Future Work
Several directions for future work can be proposed; some involve improving performance, while others can be built on top of the work done here. Some of the work we believe can be performed:
• The methods can be improved to achieve better efficiency.
• A larger dictionary with a much larger set of words can be used.
• Methods for separating the root word and the affix can be used to improve performance.
• The work can be extended to handle semantic errors as well.
• The work can be extended by applying a multi-threaded approach in the spell checker tool.
References
[1] Rajeshakara Murthy S, Ramakanth Kumar P, "A non-word Kannada spell checker using morphological analyzer and dictionary lookup method", International Journal of Engineering Sciences & Emerging Technologies, Volume 2, Issue 2, June 2012.
[2] Kazem Taghva, Eric Stofsky, "OCR Spell: An Interactive Spelling Correction System for OCR Errors in Text".
[3] Yogomaya Mohapatra, Ashis Kumar Mishra, Anil Kumar Mishra, "Spell Checker for OCR", International Journal of Computer Science and Information Technologies, Vol. 4(1), 2013, 91-97.
[4] Youssef Bassil, Mohammed Alwani, "OCR Post-processing Error Correction Algorithm Using Google's Online Spelling Suggestion", Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 2012.
[5] Chandrakala H. T., Thippeswamy G., "A comprehensive survey on OCR techniques for Kannada script".
More Related Content

What's hot

5. message authentication and hash function
5. message authentication and hash function5. message authentication and hash function
5. message authentication and hash functionChirag Patel
 
DNA based Cryptography_Final_Review
DNA based Cryptography_Final_ReviewDNA based Cryptography_Final_Review
DNA based Cryptography_Final_ReviewRasheed Karuvally
 
Matching techniques
Matching techniquesMatching techniques
Matching techniquesNagpalkirti
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler DesignMuhammed Afsal Villan
 
Software quality assurance
Software quality assuranceSoftware quality assurance
Software quality assuranceAman Adhikari
 
Block Cipher and its Design Principles
Block Cipher and its Design PrinciplesBlock Cipher and its Design Principles
Block Cipher and its Design PrinciplesSHUBHA CHATURVEDI
 
Syntax directed translation
Syntax directed translationSyntax directed translation
Syntax directed translationAkshaya Arunan
 
Confidentiality using Symmetric Encryption
Confidentiality using Symmetric EncryptionConfidentiality using Symmetric Encryption
Confidentiality using Symmetric EncryptionJay Nagar
 
MD-5 : Algorithm
MD-5 : AlgorithmMD-5 : Algorithm
MD-5 : AlgorithmSahil Kureel
 
Security services and mechanisms
Security services and mechanismsSecurity services and mechanisms
Security services and mechanismsRajapriya82
 
2. public key cryptography and RSA
2. public key cryptography and RSA2. public key cryptography and RSA
2. public key cryptography and RSADr.Florence Dayana
 
Double DES & Triple DES
Double DES & Triple DESDouble DES & Triple DES
Double DES & Triple DESHemant Sharma
 

What's hot (20)

5. message authentication and hash function
5. message authentication and hash function5. message authentication and hash function
5. message authentication and hash function
 
DNA based Cryptography_Final_Review
DNA based Cryptography_Final_ReviewDNA based Cryptography_Final_Review
DNA based Cryptography_Final_Review
 
Cryptography
CryptographyCryptography
Cryptography
 
Ch08
Ch08Ch08
Ch08
 
Matching techniques
Matching techniquesMatching techniques
Matching techniques
 
Lexical analysis - Compiler Design
Lexical analysis - Compiler DesignLexical analysis - Compiler Design
Lexical analysis - Compiler Design
 
Software quality assurance
Software quality assuranceSoftware quality assurance
Software quality assurance
 
Pgp
PgpPgp
Pgp
 
Block Cipher and its Design Principles
Block Cipher and its Design PrinciplesBlock Cipher and its Design Principles
Block Cipher and its Design Principles
 
Syntax directed translation
Syntax directed translationSyntax directed translation
Syntax directed translation
 
Kmp
KmpKmp
Kmp
 
Cryptography
CryptographyCryptography
Cryptography
 
Confidentiality using Symmetric Encryption
Confidentiality using Symmetric EncryptionConfidentiality using Symmetric Encryption
Confidentiality using Symmetric Encryption
 
String matching algorithm
String matching algorithmString matching algorithm
String matching algorithm
 
MD-5 : Algorithm
MD-5 : AlgorithmMD-5 : Algorithm
MD-5 : Algorithm
 
Hash Function
Hash FunctionHash Function
Hash Function
 
Security services and mechanisms
Security services and mechanismsSecurity services and mechanisms
Security services and mechanisms
 
Finite Automata
Finite AutomataFinite Automata
Finite Automata
 
2. public key cryptography and RSA
2. public key cryptography and RSA2. public key cryptography and RSA
2. public key cryptography and RSA
 
Double DES & Triple DES
Double DES & Triple DESDouble DES & Triple DES
Double DES & Triple DES
 

Similar to Spell checker for Kannada OCR

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfNohaGhoweil
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-PShubham Kumar
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianWaqas Tariq
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text MiningSushanti Acharya
 
Langauage model
Langauage modelLangauage model
Langauage modelc sharada
 
Comparison Analysis of Post- Processing Method for Punjabi Font
Comparison Analysis of Post- Processing Method for Punjabi FontComparison Analysis of Post- Processing Method for Punjabi Font
Comparison Analysis of Post- Processing Method for Punjabi FontIRJET Journal
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET Journal
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShashank Shisodia
 
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)jennifer steffan
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...IJCI JOURNAL
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
 
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov ModelsContextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
 
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGESSCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGEScscpconf
 
Natural Language Processing: A comprehensive overview
Natural Language Processing: A comprehensive overviewNatural Language Processing: A comprehensive overview
Natural Language Processing: A comprehensive overviewBenjaminlapid1
 
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay  A Free On-Line Web Spell Checking Service For QuechuaAllin Qillqay  A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay A Free On-Line Web Spell Checking Service For QuechuaAndrea Porter
 
IRJET - Storytelling App for Children with Hearing Impairment using Natur...
IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...
IRJET - Storytelling App for Children with Hearing Impairment using Natur...IRJET Journal
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficultiesijtsrd
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxSHIBDASDUTTA
 

Similar to Spell checker for Kannada OCR (20)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
EasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdfEasyChair-Preprint-7375.pdf
EasyChair-Preprint-7375.pdf
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
Language Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and PersianLanguage Identifier for Languages of Pakistan Including Arabic and Persian
Language Identifier for Languages of Pakistan Including Arabic and Persian
 
Natural Language Processing using Text Mining
Natural Language Processing using Text MiningNatural Language Processing using Text Mining
Natural Language Processing using Text Mining
 
Langauage model
Langauage modelLangauage model
Langauage model
 
Comparison Analysis of Post- Processing Method for Punjabi Font
Comparison Analysis of Post- Processing Method for Punjabi FontComparison Analysis of Post- Processing Method for Punjabi Font
Comparison Analysis of Post- Processing Method for Punjabi Font
 
IRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & AutocorrectionIRJET- Vernacular Language Spell Checker & Autocorrection
IRJET- Vernacular Language Spell Checker & Autocorrection
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
 
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
Bantu Spell Checker and Corrector using Modified Edit Distance Algorithm (MEDA)
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov ModelsContextual Analysis for Middle Eastern Languages with Hidden Markov Models
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
 
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGESSCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES
 
Natural Language Processing: A comprehensive overview
Natural Language Processing: A comprehensive overviewNatural Language Processing: A comprehensive overview
Natural Language Processing: A comprehensive overview
 
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay  A Free On-Line Web Spell Checking Service For QuechuaAllin Qillqay  A Free On-Line Web Spell Checking Service For Quechua
Allin Qillqay A Free On-Line Web Spell Checking Service For Quechua
 
IRJET - Storytelling App for Children with Hearing Impairment using Natur...
IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...IRJET -  	  Storytelling App for Children with Hearing Impairment using Natur...
IRJET - Storytelling App for Children with Hearing Impairment using Natur...
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
Unit 5f.pptx
Unit 5f.pptxUnit 5f.pptx
Unit 5f.pptx
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx
 

Recently uploaded

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoĂŁo Esperancinha
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 

Recently uploaded (20)

young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 

Spell checker for Kannada OCR

IDL - International Digital Library of Technology & Research
Volume 1, Issue 4, April 2017. Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Copyright@IDL-2017

Sharathkumar S
Assistant Professor
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
skumars@sit.ac.in

Suma S, Sneha N
UG scholars,
Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
suma.sinchu.ds@gmail.com
snehan769@gmail.com

Abstract: A spell checker is an application program that helps machines process natural language text effectively. Checking and correcting spelling is a basic necessity in any language, and it is tedious to do by hand, so spell checker software is required. A spell checker is a set of programs that detects a wrongly spelled word and replaces it with the most probable correct word. The challenge addressed here is doing this for the Kannada language. In a software system, Kannada words may be typed in several formats, since Kannada can be written with many different fonts. In this paper, we describe some techniques used by a spell checker for the Kannada language. We use natural language processing (NLP), the field of computer science concerned with the interaction between human (i.e., natural) languages and computers. Modern NLP algorithms are usually based on machine learning.

Keywords: Spell checker, NLP, OCR, Dictionary Lookup

I. INTRODUCTION

Kannada is a Dravidian language spoken predominantly by the people of Karnataka and neighboring states. It has roughly forty million native speakers and a total of 50.8 million speakers according to the 2001 census. Spell checking is a critical problem in NLP. A spell checker is an important component of various software systems such as OCR, word processors and even translators.

1.1 Error Analyzer

A linguistic error analyzer is a tool which studies the types and causes of language errors. Errors may be classified as conceptualization errors (i.e., thinking), phoneme-to-grapheme mapping errors (i.e., writing), typing errors, OCR-generated errors, and errors generated by a speech recognizer.

Conceptualization errors
Errors that occur due to one's way of thinking. Ex: […] may be wrongly written as […].

Phoneme-to-grapheme mapping errors
Errors that occur while writing dictated words. Ex: […] may be wrongly written as […].

Typing errors
Errors that occur while typing, by pressing a wrong key. Ex: […] may be wrongly typed as […].

OCR-generated errors
Errors that occur when the OCR recognizes a character incorrectly. Ex: […] may be wrongly recognized as […].

Errors generated by a speech recognizer
Errors that occur due to wrong pronunciation of words or wrong recognition of words by the speech recognizer. Ex: […] may be wrongly recognized as […].

1.2 Optical Character Recognition (OCR)

Optical character recognition is a technique for moving text from paper form to electronic form. An OCR system is required to convert an image, written text or e-text into a machine-readable format; its input can be a plain document, an image, etc. Sources for OCR include bank statements, ATM transactions, e-statements and mailing documents. For tasks such as speech to text, image to text and vice versa, the text is analyzed in digitized form so that it can be easily edited, stored and accessed via an open-access system. OCR is a field of research in NLP, machine learning, artificial intelligence and computer vision. Today, an accurate OCR system needs to be flexible enough to recognize any type of font and to support various digital image inputs, so that properly supplied inputs produce more accurate outputs.
OCR Errors

Due to noise, the following errors may occur:
• Reject error: the machine reading process may not be able to recognize a character.
• Substitution error: the OCR may recognize a character incorrectly.
• Character fusion: two or more character images merge and appear as a single connected component.
• Character fragmentation: a character image is fragmented into more than one sub-image.

1.3 Spell Checker

A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools, or they can be part of larger applications such as search engines and translators. A simple spell checker performs the following tasks:
• Scanning the text and extracting the words it contains.
• Matching the typed words, including those with special symbols, hyphens, etc., against the correctly written words. This is the important step.
• Handling morphology, which requires a language-dependent algorithm. Even English requires this for related word forms such as plurals and verbal forms, so carrying out these steps for other languages is a complicated issue.

Related Work

The literature survey reveals that most research on Kannada spell checkers focuses on normal text, while some efforts have been made for OCR text in other languages such as Punjabi and Hindi. No work on OCR spell checkers has been reported for Kannada directly. Some works on OCR spell checkers in other languages and on Kannada spell checkers are reported here. This review discusses common spell checking approaches and the problems that may occur during the spell checking process.
There are two common approaches to implementing a spell checker: the dictionary lookup method and the N-gram approach. First, the bulk text is divided into a series of separate words. A morphological analyzer then uses separate dictionaries to obtain the root word and the suffix that follows it. A mapping function is necessary to establish the relationship between the different root words and their suffixes. The validity of a word is checked using the morphological analyzer. When a word is incorrect, the type of error has to be identified, viz. correct root and incorrect suffix, incorrect root and correct suffix, and correct root and matching suffix. Each of these error types is handled individually, and the incorrect words are brought to the user's attention. The misspelled words are then corrected with the help of the user, who is given suitable suggestions. The drawback of this system is that it fails when applied to OCR output text: it cannot efficiently handle special cases in OCR such as character fusion and character fragmentation.

Dictionary lookup method

The dictionary lookup method compares the words in the input file with the correct words in the dictionary. It is useful on OCR output for inspecting ambiguous letters, but at large scale it leads to size overhead, the calculation of probabilities becomes complex, and the cost of searching grows.

N-gram approach

An N-gram is a sequence of N consecutive items from a text, where the items may be phonemes, graphemes, letters, words or even pairs of words. Unigrams, bigrams and trigrams are its varieties. In general, an N-gram model is a probabilistic language model that predicts the next item in a sequence using a Markov model of order n-1.

Design
Romanization

Romanization is the process of converting written text from another writing system to Roman script. Romanization includes the following methods:
• Transliteration, for representing written text
• Transcription, for representing the spoken word
• A combination of both transliteration and transcription

Ex:
1. […] in Romanized format is written as 'avanu'.
2. […] in Romanized format is written as 'snEha'.

In this tool we read the input file line by line, and each line is Romanized to English.
Ex: […] is Romanized to English as 'rAmanu kAdig.e hodanu'.

Tokenization

Tokenization is the process of breaking text into a set of tokens, i.e., meaningful elements such as words, phrases and symbols.
Ex:
Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.

In our project the Romanized text is given as input to this module, which returns a list of tokenized words for comparison.
Ex:
Input: rAmanu kAdig.e hodanu
Output: ['rAmanu', 'kAdig.e', 'hodanu']

Comparison

Each word is compared with a standard dictionary, and its validity is checked using the minimum edit distance algorithm.

Minimum Edit Distance

The minimum edit distance (Levenshtein distance) between two strings or words is the minimum number of substitution, insertion and deletion operations needed to convert one string into the other. For spelling correction, each word is automatically compared against the words of the standard dictionary, and the dictionary word with the lowest distance to the given word is chosen as the most suitable correction.
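The Tokenization, Comparison and Minimum Edit Distance steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this project: the function names (`edit_distance`, `tokenize`, `suggest`, `check_line`), the whitespace tokenizer, the three-word toy dictionary and the `max_distance` parameter are all hypothetical.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: unit cost for insertion, deletion and substitution."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting a[:i] to the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tokenize(line: str) -> List[str]:
    """Split a Romanized line on whitespace (assumed tokenizer)."""
    return line.split()


def suggest(word: str, dictionary: List[str], max_distance: int = 3) -> List[str]:
    """Return dictionary words within max_distance of word, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


def check_line(line: str, dictionary: List[str]) -> Dict[str, List[str]]:
    """Map every token that is not in the dictionary to its suggestions."""
    known = set(dictionary)
    return {token: suggest(token, dictionary)
            for token in tokenize(line) if token not in known}


if __name__ == "__main__":
    # Toy dictionary of Romanized words, for illustration only.
    toy_dictionary = ["rAmanu", "kAdig.e", "hodanu"]
    print(edit_distance("INTENTION", "EXECUTION"))             # 5
    print(check_line("rAmanu kAdige hodanu", toy_dictionary))  # {'kAdige': ['kAdig.e']}
```

The two-row dynamic programming table keeps memory proportional to the length of the second string. Scanning the whole dictionary for every unknown word is simple but slow for a large dictionary, so a real system would likely need an index or a pruning step.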
Ex: For the two strings INTENTION and EXECUTION, the minimum edit distance is 5, i.e., a minimum of 5 operations are required to change INTENTION into EXECUTION.

For a given misspelled word, the dictionary words whose edit distance from it is less than or equal to 3 are suggested.

Results
Conclusion

In this project we have implemented a spell checker for Kannada OCR. Through this project we learnt various tools for implementing a spell checker, understood the problems that occur during text processing, and learnt how to tackle them. Although there were many problems in processing Kannada text, we arrived at a workable way to implement a Kannada spell checker for Kannada OCR. As this project is a first attempt at implementing a spell checker for Kannada OCR, we hope it serves as a platform for beginners to understand the various aspects of a spell checker for Kannada OCR.

Future Work

Several directions for future work can be proposed; some involve improving performance, while others build on top of the work done here. The following are some that we believe can be pursued:
• The methods can be improved to achieve better efficiency.
• A larger dictionary with a much larger set of words can be used.
• The methods can be used to separate the root word and the affix to improve performance.
• The work can be extended to semantic errors as well.
• The work can be extended by applying a multi-threaded approach in the spell checker tool.

References

[1] Rajeshakara Murthy S, Ramakanth Kumar P, "A non-word Kannada spell checker using morphological analyzer and dictionary lookup method", International Journal of Engineering Sciences & Emerging Technologies, Volume 2, Issue 2, June 2012.
[2] Kazem Taghva, Eric Stofsky, "OCR Spell: An Interactive Spelling Correction System for OCR Errors in Text".
[3] Yogomaya Mohapatra, Ashis Kumar Mishra, Anil Kumar Mishra, "Spell Checker for OCR", International Journal of Computer Science and Information Technologies, Vol. 4(1), 2013, 91-97.
[4] Youssef Bassil, Mohammed Alwani, "OCR Post-processing Error Correction Algorithm Using Google's Online Spelling Suggestion", Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 2012.
[5] Chandrakala H. T., Thippeswamy G., "A comprehensive survey on OCR techniques for Kannada script".