In most of our official papers and school textbooks, English words are interspersed within Indian-language text. There is therefore a need for an Optical Character Recognition (OCR) system that can recognize such bilingual documents and store them for future use. In this paper we present an OCR system developed for the recognition of printed documents in an Indian language, Oriya, together with the Roman script. For this purpose, it is necessary to separate the different scripts before feeding them to their individual OCR systems. First, the skew is corrected, followed by segmentation. We propose line-wise script differentiation, emphasizing the upper and lower matras that are associated with Oriya but absent in English. A horizontal histogram is used to distinguish lines belonging to different scripts. After separation, the different scripts are sent to their individual recognition engines.
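The horizontal-histogram idea can be sketched as follows: for a binarized text line, row-wise ink counts show whether ink falls in the upper and lower zones of the line, where Oriya matras appear but Roman text largely does not. The zone boundaries and function names below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def horizontal_histogram(line_img):
    """Row-wise ink counts for a binarized text line (1 = ink)."""
    return line_img.sum(axis=1)

def zone_ink_fractions(line_img, upper=0.25, lower=0.75):
    """Fraction of ink falling in the upper and lower zones of the line.

    Oriya lines tend to carry ink in these zones (upper/lower matras),
    while plain Roman lowercase text concentrates in the middle zone.
    The 25%/75% zone boundaries here are assumed for illustration.
    """
    hist = horizontal_histogram(line_img)
    total = hist.sum()
    if total == 0:
        return 0.0, 0.0
    h = len(hist)
    up = hist[: int(h * upper)].sum() / total
    low = hist[int(h * lower):].sum() / total
    return up, low
```

A line whose upper/lower fractions exceed some threshold would be routed to the Oriya engine, the rest to the Roman engine.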
An exhaustive font and size invariant classification scheme for ocr of devana... (ijnlc)
The document presents a classification scheme for recognizing Devanagari characters that is invariant to font and size. It identifies the basic symbols that commonly appear in the middle zone of Devanagari text across different fonts and sizes. Through an analysis of over 465,000 words from various sources, it finds that 345 symbols account for 99.97% of text and aims to classify these into groups based on structural properties like the presence or absence of vertical bars. The proposed classification scheme is validated on 25 fonts and 3 sizes to demonstrate its font and size invariance.
A PREPROCESSING MODEL FOR HAND-WRITTEN ARABIC TEXTS BASED ON VORONOI DIAGRAMS (ijcsit)
In this paper, a preprocessing model for hand-written Arabic text based on Voronoi Diagrams (VDs) is presented and discussed. The proposed VD-based preprocessing model consists of five stages: a preparatory stage, page segmentation, thinning, baseline estimation, and slant correction. In the preparatory stage, the text image is converted via VDs into a group of geometrical forms consisting of edges and vertices, which are used by the other stages of the model. This stage consists of four main processes: binarization, edge extraction and contour tracking, sampling, and point-VD construction. The second stage is page segmentation based on the VD area. In the third stage, an efficient method for text structuring (that is, thinning) is presented. In the fourth stage, a novel VD-based baseline estimation method is presented. In the fifth stage, an efficient technique for slant detection and correction is proposed and discussed.
An Optical Character Recognition for Handwritten Devanagari Script (IJERA Editor)
Optical Character Recognition is the process of recognizing characters from scanned documents, and many OCR systems are now available in the market. But most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not a sufficient amount of work on Indian-language scripts like Devanagari, so this paper presents a review of optical character recognition for handwritten Devanagari script.
The use of the internet, with its technologies and applications, has increased rapidly, so protecting text from illegal use is much needed. Text watermarking is used for this purpose. Arabic text has many characteristics, such as the existence of diacritics, the kashida (extension character), and points above or below its letters. Each Arabic letter can take different shapes with different Unicode code points. These characteristics are utilized in the watermarking process. In this paper, several methods in the area of Arabic text watermarking are discussed, along with their advantages and disadvantages. A comparison of these methods is made in terms of capacity, robustness, and imperceptibility.
Script Identification of Text Words from a Tri-Lingual Document Using Voting ... (CSCJournals)
In a multi-script environment, the majority of documents may contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper develops a model to identify and separate text words of Kannada, Hindi and English scripts in a printed tri-lingual document. The proposed method is trained to learn the distinct features of each script, and a binary tree classifier is used to classify the input text image. Experimentation involved 1500 text words for learning and 1200 text words for testing, carried out on both a manually created data set and a scanned data set. The results are very encouraging and prove the efficacy of the proposed model: the average success rate is 99% for the manually created data set and 98.5% for the data set constructed from scanned document images.
This paper presents a new multi-tier holistic approach for recognizing Urdu text written in Nastaliq script. It first identifies special ligatures like dots, tay, hamza and mad from base ligatures. It then associates the special ligatures with neighboring base ligatures. Features are extracted from the ligatures and special ligature-base ligature associations. These features are input to a neural network that recognizes the ligatures in three steps: 1) identifying special ligatures, 2) associating them with base ligatures, and 3) recognizing the base ligatures. The system was tested on 200 ligatures with 100% accuracy for ligatures in its training set and closest match classification for new ligatures.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Font type identification of Hindi printed document (eSAT Journals)
This document summarizes an approach for identifying the font type in Hindi printed documents. It discusses preprocessing steps like noise removal, binarization, and skew correction. Features are extracted from text including typographical features like word height and width, and structural features like thickness of vertical lines. Five Hindi font types are classified using these extracted features, achieving 96-100% accuracy on test documents. Future work could involve word-level font identification and improving image quality before feature extraction and classification.
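As a rough illustration of the kind of typographical and structural features the approach extracts (word height and width, thickness of vertical lines), the sketch below computes simple proxies from a binarized word image. The function name, the 0.8 ink threshold, and the feature definitions are assumptions for illustration, not the paper's exact formulas.

```python
import numpy as np

def typographic_features(word_img):
    """Bounding-box height/width and a crude vertical-stroke thickness
    estimate for a binarized word image (1 = ink)."""
    rows = np.flatnonzero(word_img.any(axis=1))
    cols = np.flatnonzero(word_img.any(axis=0))
    if rows.size == 0:
        return {"height": 0, "width": 0, "stroke": 0}
    height = int(rows[-1] - rows[0] + 1)
    width = int(cols[-1] - cols[0] + 1)
    # Thickness proxy: longest run of consecutive columns that are
    # mostly ink over the full word height (a vertical bar).
    col_ink = word_img.sum(axis=0)
    full = col_ink >= 0.8 * height
    best = run = 0
    for f in full:
        run = run + 1 if f else 0
        best = max(best, run)
    return {"height": height, "width": width, "stroke": best}
```

Feature vectors like this, collected per word, would then feed a classifier over the five font types.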
A SURVEY OF LANGUAGE-DETECTION, FONT-DETECTION AND FONT-CONVERSION SYSTEMS FOR... (IJCI JOURNAL)
A large amount of digitally stored data in Indian languages is in ASCII-based font formats. ASCII has a 128-character set and is therefore unable to represent all the characters necessary to deal with the variety of scripts available worldwide. Moreover, these ASCII-based fonts do not share a single standard mapping between character codes and individual characters for a particular Indian script, unlike English-language fonts, which are based on the standard ASCII mapping. Therefore, the fonts for a particular script must be available on the system to accurately represent data in that script, and the conversion of data from one font into another is a difficult task. The non-standard ASCII-based fonts also pose problems for searching texts in Indian languages available on the web. There are 25 official languages in India, and the amount of digital text available in ASCII-based fonts is much larger than the text available in the standard ISCII (Indian Script Code for Information Interchange) or Unicode formats. This paper discusses the work done in the field of font detection (identifying the font of a given text) and font converters (converting ASCII-format text into the corresponding Unicode text).
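The core of most table-driven font converters can be sketched as a per-character lookup. The mapping below is hypothetical (each legacy font needs its own table, and real converters also handle multi-character sequences and glyph reordering), but the lookup-and-replace structure is the common skeleton.

```python
# Hypothetical legacy-code -> Unicode table for illustration only;
# 'k' and 'g' standing in for whatever byte codes a real font uses.
SAMPLE_MAP = {
    "k": "\u0915",   # assumed mapping to Devanagari letter KA
    "g": "\u0917",   # assumed mapping to Devanagari letter GA
}

def to_unicode(legacy_text, table=SAMPLE_MAP):
    """Replace each legacy character by its Unicode equivalent,
    leaving unmapped characters (spaces, digits) untouched."""
    return "".join(table.get(ch, ch) for ch in legacy_text)
```

Font detection, in this picture, amounts to deciding which table to apply before conversion.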
Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Match... (CSCJournals)
Offline optical character recognition (OCR) for different languages has been developed over recent years; since 1965, the US postal service has used such systems to automate its services. The range of applications in this area is increasing day by day because of its utility in almost all major areas of government as well as the private sector, and the technique has been very useful in creating a paper-free environment in many major organizations, as far as backups of their previous file records are concerned. This system has been proposed for offline character recognition of isolated characters of the Urdu language, as Urdu forms words by combining isolated characters. Urdu is a cursive language, with connected characters making up words. A major area of utility for an Urdu OCR will be digitizing the large amount of literary material already stocked in libraries. Urdu is widely spoken in more than three large countries, including Pakistan, India and Bangladesh, and a great deal of work has been done in Urdu poetry and literature up to the recent century; an OCR for Urdu will play an important role in moving that work from physical libraries to electronic libraries. Moreover, most of the material already placed on the internet is in the form of images containing text, which takes a lot of space to transfer and is hard to read online, so an Urdu OCR is a must. The system is a training-based system. It consists of image preprocessing, line and character segmentation, and creation of an XML file for training. The recognition stage takes the XML file and the image to be recognized, segments it, creates chain codes for the character images, and matches them against those already stored in the XML file. The system has been implemented and achieves 89% recognition accuracy at a rate of 15 characters/sec.
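The chain codes the abstract mentions can be sketched generically: an ordered sequence of boundary points is encoded as 8-direction Freeman codes, and two characters are compared by their code sequences. This is a minimal sketch of the idea, not the system's actual contour tracing or matching.

```python
# 8-direction Freeman chain code between successive boundary points,
# using the convention x increases rightward, y increases downward.
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(points):
    """Encode an ordered boundary as a list of direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes
```

Matching against the trained XML file would then reduce to comparing stored code sequences with the query's.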
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT (cscpconf)
This document discusses the segmentation of printed Bangla characters without modifiers for optical character recognition systems. It begins with an introduction to OCR systems and Bangla script. The main steps of an OCR system are then outlined, with a focus on the segmentation step. Line, word and character segmentation algorithms are described in detail along with figures to illustrate the steps. The goal is to properly segment individual characters for recognition.
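The line-segmentation step described above is commonly implemented with a horizontal projection profile: rows with no ink separate consecutive text lines. The sketch below is a simplified version under that assumption, not the paper's exact algorithm (which also handles word and character segmentation).

```python
import numpy as np

def segment_lines(page):
    """Split a binarized page (1 = ink) into (start, end) row ranges
    by finding blank gaps in the horizontal projection profile."""
    profile = page.sum(axis=1)
    lines, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i                      # line begins at first inked row
        elif v == 0 and start is not None:
            lines.append((start, i))       # line ends at first blank row
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

Word segmentation applies the same idea to the vertical profile of each extracted line.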
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators (sipij)
The document presents a model for segmenting offline Arabic handwritten words using morphological operators. It has three main stages: pre-processing, segmentation, and evaluation. In pre-processing, morphological operators like dilation and erosion are used to connect gaps in words caused by pen lifting or scanning. Segmentation uses connected component analysis to bound words into sub-words. The model is tested on over 1,100 images from the IESK-ArDB database, achieving 88% accuracy in segmenting words into sub-words by connecting small gaps. This compares favorably to previous related works on Arabic handwriting segmentation.
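The dilation step that bridges small gaps can be sketched with a plain NumPy implementation (a square structuring element built from shifted copies of the image). Dilation followed by erosion, i.e. closing, reconnects strokes broken by pen lifts; this is a generic sketch, not the model's exact operator choices.

```python
import numpy as np

def dilate(img, k=1):
    """Binary dilation with a (2k+1)x(2k+1) square structuring element.

    Each output pixel is the OR of its neighborhood, so gaps narrower
    than the element are filled in, bridging broken strokes.
    """
    padded = np.pad(img, k)
    h, w = img.shape
    out = np.zeros_like(img)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out
```

Connected-component analysis on the closed image then yields the sub-word bounding boxes the model evaluates.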
A survey on Script and Language identification for Handwritten document images (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
The document describes an OCR system for recognition of Urdu text written in Nastaliq font. It discusses the characteristics of Urdu script, existing approaches for cursive script recognition, and the methodology used in the system. The system employs two holistic approaches - a multi-tier approach using neural networks and a multi-stage classification approach combining multiple classifiers. Results show the 20 most frequent ligatures identified from analyzing BBC Urdu news text, and feature vectors extracted for segmentation.
Current state of the art POS tagging for Indian languages – a study (iaemedu)
The document summarizes research on part-of-speech tagging for several Indian languages. It discusses rule-based, stochastic, and neural network approaches to POS tagging and surveys work done on Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri, and Assamese. The accuracy of POS taggers for Indian languages ranges from 71-94% depending on the language and approach used, compared to 93-98% for Western languages, due to limited annotated data and morphological complexity in Indian languages.
Current state of the art POS tagging for Indian languages – a study (IAEME Publication)
This document summarizes current research on part-of-speech tagging for several Indian languages, including Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri, and Assamese. It describes different techniques used for POS tagging such as rule-based, stochastic, and neural approaches. It then reviews specific work done for each of the listed languages, describing different studies that have applied techniques like HMM, CRF, SVM, and achieved accuracies ranging from 69% to 98% depending on the language and approach used.
A New Method for Identification of Partially Similar Indian Scripts (CSCJournals)
In this paper, the texture symmetry/non-symmetry factor is exploited to obtain the script texture using Bi-Wavelants, which give the factor of symmetry/non-symmetry in terms of the third cumulant, while the Bi-spectra give the quadratically coupled frequencies. The envelope of the Bi-spectra (Bi-Wavelant) provides an accurate characterization of the symmetry/non-symmetry of the script texture. Classification is performed by an SVM trained on the roots of the envelope, found using the Newton-Raphson technique. The method successfully identified 8 Indian scripts: Devanagari, Urdu, Gujarati, Telugu, Assamese, Gurmukhi, Kannada, and Bangla. It can segment any kind of document with very good results, and the identification results are excellent.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dimension Reduction for Script Classification - Printed Indian Documents (ijait)
Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and selecting a script-specific OCR in a multilingual environment. This paper provides a comparative study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods. For a given script we extracted features such as Gray Level Co-occurrence Matrix (GLCM) and Scale Invariant Feature Transform (SIFT) features. The features are extracted globally from a given text block, which does not require any complex and reliable segmentation of the document image into lines and characters. The extracted features are reduced using the various dimension reduction techniques, and the reduced features are fed into a Nearest Neighbor classifier. The proposed scheme is thus efficient and can be used for many practical applications which require processing large volumes of data. The scheme has been tested on 10 Indian scripts, found to be robust to the scanning process and relatively insensitive to changes in font size, and achieves good classification accuracy on a large testing data set.
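The PCA-plus-Nearest-Neighbor pipeline can be sketched in a few lines: project centered feature vectors onto the leading principal axes (via SVD), then classify a query by its nearest training point in the reduced space. The function names and the tiny toy data are illustrative, not the paper's setup.

```python
import numpy as np

def pca_fit(X, d):
    """Mean and top-d principal axes via SVD of the centered matrix."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]

def pca_transform(X, mu, components):
    """Project rows of X onto the principal axes."""
    return (X - mu) @ components.T

def nn_classify(train, labels, query):
    """1-nearest-neighbour label in the reduced space."""
    dists = np.linalg.norm(train - query, axis=1)
    return labels[int(np.argmin(dists))]
```

PLS and SIR would slot in as drop-in replacements for `pca_fit`/`pca_transform`, which is what makes the comparison in the paper clean.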
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
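The character n-gram features used in steps 1) and 2) can be sketched as follows; padding the word with boundary markers, a common convention, lets the classifier see word-initial and word-final patterns. This is a generic sketch, not the paper's exact feature set.

```python
def char_ngrams(word, n=2):
    """Character n-grams with boundary markers ('^' start, '$' end),
    a standard feature for word-level language identification."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

A random forest or SVM would then be trained on counts of these n-grams alongside the contextual and gazetteer features the abstract lists.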
Quality estimation of machine translation outputs through stemming (ijcsa)
Machine translation is a challenging problem for Indian languages. Every day we see new machine translators being developed, but high-quality automatic translation is still a very distant dream, and a correctly translated sentence in Hindi is rarely found. In this paper, we focus on the English-Hindi language pair: in order to select the correct MT output we present a ranking system which employs machine learning techniques and morphological features. No human intervention is required in the ranking. We have also validated our results by comparing them with human ranking.
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. Part-of-speech tagging means assigning a part-of-speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate tag to each of its words. In this article I survey the different work done on Odia POS tagging.
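The task itself can be made concrete with the simplest stochastic tagger, the most-frequent-tag baseline, which the surveyed stochastic approaches (HMM, CRF, SVM) improve upon. The toy training data and the NOUN fallback below are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from tagged training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, sentence, default="NOUN"):
    """Tag each word with its most frequent training tag,
    falling back to a default tag for unseen words."""
    return [(w, model.get(w, default)) for w in sentence]
```

Real taggers add context (neighboring tags and words), which is exactly where HMMs and CRFs come in.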
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...IJCI JOURNAL
A large amount of data in Indian languages stored digitally is in ASCII-based font formats. ASCII has 128
character-set, therefore it is unable to represent all the characters necessary to deal with the variety of
scripts available worldwide. Moreover, these ASCII-based fonts are not based on a single standard
mapping between the character-codes and the individual characters, for a particular Indian script, unlike
the English language fonts based on the standard ASCII mapping. Therefore, it is required that the fonts for
a particular script must be available on the system to accurately represent the data in that script. Also, the
conversion of data in one font into another is a difficult task. The non-standard ASCII-based fonts also pose
problems in performing search on texts in Indian languages available over web. There are 25 official
languages in India, and the amount of digital text available in ASCII-based fonts is much larger than the
text available in the standard ISCII (Indian Script Code for Information Interchange) or Unicode formats.
This paper discusses the work done in the field of font-detection (to identify the font of the given text) and
font-converters (to convert the ASCII-format text into the corresponding Unicode text).
Optical Character Recognition System for Urdu (Naskh Font)Using Pattern Match...CSCJournals
The offline optical character recognition (OCR) for different languages has been developed over the recent years. Since 1965, the US postal service has been using this system for automating their services. The range of the applications under this area is increasing day by day, due to its utility in almost major areas of government as well as private sector. This technique has been very useful in making paper free environment in many major organizations as far as the backup of their previous file record is concerned. Our this system has been proposed for the Offline Character Recognition for Isolated Characters of Urdu language, as Urdu language forms words by combining Isolated Characters. Urdu is a cursive language, having connected characters making words. The major area of utility for Urdu OCR will be digitizing of a lot of literature related material already stocked in libraries. Urdu language is famous and spoken in more than 3 big countries including Pakistan, India and Bangladesh. A lot of work has been done in Urdu poetry and literature up to the recent century. Creation of OCR for Urdu language will make an important role in converting all those work from physical libraries to electronic libraries. Most of the stuff already placed on internet is in the form of images having text, which took a lot of space to transfer and even read online. So the need of an Urdu OCR is a must. The system is of training system type. It consists of the image preprocessing, line and character segmentation, creation of xml file for training purpose. While Recognition system includes taking xml file, the image to be recognized, segment it and creation of chain codes for character images and matching with already stored in xml file. The system has been implemented and it has 89% recognition accuracy with a 15 char/sec recognition rate.
The document discusses key concepts in the Interface Definition Language (IDL) used in Common Object Request Broker Architecture (CORBA). It covers IDL ground rules like case sensitivity, syntax, comments, and modules. It also describes various IDL data types including primitive types like boolean, char, floating point, integer types and constructed types like enumerations, structures, unions, and interfaces. Methods, parameters, direction qualifiers like in, out, inout are explained. The document also discusses one-way methods and how they allow non-blocking remote procedure calls.
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT (cscpconf)
This document discusses the segmentation of printed Bangla characters without modifiers for optical character recognition systems. It begins with an introduction to OCR systems and Bangla script. The main steps of an OCR system are then outlined, with a focus on the segmentation step. Line, word and character segmentation algorithms are described in detail along with figures to illustrate the steps. The goal is to properly segment individual characters for recognition.
Off-Line Arabic Handwritten Words Segmentation using Morphological Operators (sipij)
The document presents a model for segmenting offline Arabic handwritten words using morphological operators. It has three main stages: pre-processing, segmentation, and evaluation. In pre-processing, morphological operators like dilation and erosion are used to connect gaps in words caused by pen lifting or scanning. Segmentation uses connected component analysis to bound words into sub-words. The model is tested on over 1,100 images from the IESK-ArDB database, achieving 88% accuracy in segmenting words into sub-words by connecting small gaps. This compares favorably to previous related works on Arabic handwriting segmentation.
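The gap-bridging step the abstract describes can be sketched with plain NumPy; the 1 x k horizontal structuring element and the toy one-row stroke below are illustrative assumptions, not the paper's actual operators:

```python
import numpy as np

def dilate_h(img, k=3):
    """Binary dilation with a 1 x k horizontal structuring element."""
    pad = k // 2
    p = np.pad(img, ((0, 0), (pad, pad)), constant_values=0)
    out = np.zeros_like(img)
    for dx in range(k):
        out |= p[:, dx:dx + img.shape[1]]
    return out

def erode_h(img, k=3):
    """Binary erosion with the same element (border padded with 1s so
    stroke endpoints survive the erosion)."""
    pad = k // 2
    p = np.pad(img, ((0, 0), (pad, pad)), constant_values=1)
    out = np.ones_like(img)
    for dx in range(k):
        out &= p[:, dx:dx + img.shape[1]]
    return out

def close_gaps(img, k=3):
    """Morphological closing (dilation then erosion): bridges gaps
    narrower than the structuring element, of the kind caused by pen
    lifting or scanning."""
    return erode_h(dilate_h(img, k), k)

# A one-pixel gap inside a stroke is bridged by closing.
stroke = np.array([[1, 1, 0, 1, 1]], dtype=np.uint8)
print(close_gaps(stroke).tolist())  # [[1, 1, 1, 1, 1]]
```

A production system would use a 2-D structuring element and a library implementation, but the closing idea is the same.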
A survey on Script and Language identification for Handwritten document images (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
The document describes an OCR system for recognition of Urdu text written in Nastaliq font. It discusses the characteristics of Urdu script, existing approaches for cursive script recognition, and the methodology used in the system. The system employs two holistic approaches - a multi-tier approach using neural networks and a multi-stage classification approach combining multiple classifiers. Results show the 20 most frequent ligatures identified from analyzing BBC Urdu news text, and feature vectors extracted for segmentation.
Current state of the art pos tagging for indian languages – a study (iaemedu)
The document summarizes research on part-of-speech tagging for several Indian languages. It discusses rule-based, stochastic, and neural network approaches to POS tagging and surveys work done on Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri, and Assamese. The accuracy of POS taggers for Indian languages ranges from 71-94% depending on the language and approach used, compared to 93-98% for Western languages, due to limited annotated data and morphological complexity in Indian languages.
Current state of the art pos tagging for indian languages – a study (IAEME Publication)
This document summarizes current research on part-of-speech tagging for several Indian languages, including Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri, and Assamese. It describes different techniques used for POS tagging such as rule-based, stochastic, and neural approaches. It then reviews specific work done for each of the listed languages, describing different studies that have applied techniques like HMM, CRF, SVM, and achieved accuracies ranging from 69% to 98% depending on the language and approach used.
A New Method for Identification of Partially Similar Indian Scripts (CSCJournals)
In this paper, the texture symmetry/non-symmetry factor has been exploited to capture the script texture by using Bi-Wavelants, which give the symmetry/non-symmetry factor in terms of the third cumulant, while the Bi-spectra give the quadratically coupled frequencies. The envelope of the Bi-spectra (Bi-Wavelant) provides an accurate description of the symmetry/non-symmetry of the script texture. Classification is performed by an SVM trained on the roots of the envelope, found using the Newton-Raphson technique. The method successfully identified 8 Indian scripts: Devanagari, Urdu, Gujrati, Telugu, Assamese, Gurmukhi, Kannada, and Bangla. It can segment any kind of document with very good results, and the identification results are excellent.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Dimension Reduction for Script Classification - Printed Indian Documents (ijait)
Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and the selection of a script-specific OCR in a multilingual environment. This paper provides a comparative study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods. For a given script we extract features such as Gray Level Co-occurrence Matrix (GLCM) and Scale Invariant Feature Transform (SIFT) features. The features are extracted globally from a given text block, which avoids any complex segmentation of the document image into lines and characters. The extracted features are reduced using the various dimension reduction techniques, and the reduced features are fed into a Nearest Neighbor classifier. The proposed scheme is thus efficient and can be used for many practical applications that require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to the scanning process and relatively insensitive to changes in font size. The proposed system achieves good classification accuracy on a large test data set.
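As a rough illustration of the pipeline described above (reduce the features, then classify with a nearest neighbour), here is a minimal NumPy sketch; the 64-dimensional toy features and two-class setup are invented for the example and stand in for the real GLCM/SIFT descriptors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 64-dimensional texture feature vectors for
# two script classes, 20 samples each, separated along one axis.
a = rng.normal(size=(20, 64)); a[:, 0] += 4.0
b = rng.normal(size=(20, 64)); b[:, 0] -= 4.0
X = np.vstack([a, b])
y = np.array([0] * 20 + [1] * 20)

def pca_fit(X, n_components):
    """Mean and top principal directions via SVD of the centred data."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

def nn_classify(train_Z, train_y, z):
    """1-nearest-neighbour label for one reduced query vector."""
    return train_y[np.argmin(np.linalg.norm(train_Z - z, axis=1))]

mean, comps = pca_fit(X, n_components=2)   # reduce 64 -> 2 dimensions
Z = pca_transform(X, mean, comps)
print(nn_classify(Z, y, Z[0]))             # a training point maps to its own label: 0
```

PLS and SIR differ in how the projection is chosen (both use the labels), but they slot into the same reduce-then-classify structure.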
The document describes a machine learning approach for language identification, named entity recognition, and transliteration on query words. It discusses:
1) Using supervised machine learning classifiers like random forest, decision trees, and SVMs along with contextual, character n-gram, and gazetteer features for language identification of Hindi-English and Bangla-English words.
2) Applying an IOB tagging scheme and features like character n-grams, context words, and typographic properties for named entity recognition and classification.
3) A statistical machine transliteration model that segments, aligns, and maps source and target language transliteration units based on context and probabilities learned from parallel training data.
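The character n-gram features mentioned in point 1 are easy to picture; a minimal sketch, assuming simple `^`/`$` boundary markers (the paper's actual feature set also includes context words and gazetteer flags):

```python
from collections import Counter

def char_ngrams(word, n=2):
    """Character n-grams with boundary markers, a standard feature for
    word-level language identification."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def featurize(word):
    """Sparse bag-of-bigrams representation of one query word, of the
    kind fed to a random forest, decision tree, or SVM classifier."""
    return Counter(char_ngrams(word))

print(char_ngrams("ghar"))  # ['^g', 'gh', 'ha', 'ar', 'r$']
```

In practice the counts for each word are combined with the same features computed over the neighbouring query words to capture context.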
Quality estimation of machine translation outputs through stemming (ijcsa)
Machine translation is a challenging problem for Indian languages. Every day we see new machine translators being developed, but obtaining a high-quality automatic translation is still a very distant dream. A correctly translated sentence for the Hindi language is rarely found. In this paper we focus on the English-Hindi language pair, and in order to pick out the correct MT output we present a ranking system that employs machine learning techniques and morphological features. The ranking requires no human intervention. We have also validated our results by comparing them with human ranking.
Abstract
Part-of-speech tagging plays an important role in developing natural language processing software. Part-of-speech tagging means assigning a part-of-speech tag to each word of a sentence: the tagger takes a sentence as input and assigns the appropriate part-of-speech tag to each word of that sentence. In this article I survey the different work that has been done on Odia POS tagging.
________________________________________________
The document provides an outline for a student script, including the structure for scene headings with a big idea and subpoints for what happens at the beginning and next in each scene. It also gives an example outline from the video "Responsibility at Riverside School" with three scenes about a student making an excuse for missing homework, getting in trouble with the principal who calls their mom, and the teacher rewarding another student and explaining why the first student didn't get a reward.
This document provides instructions for a group project where students will research various aspects of William Shakespeare and the Elizabethan era to create an advertising flyer for Shakespeare. Students are divided into groups of 4 and each assigned a role - historian, detective, playwright, or peasant. They will then conduct research on their topic and collaborate to design a flyer using Microsoft Publisher that highlights interesting facts about Shakespeare's life, times and works.
How to process a miscellaneous payment through your PayClix dashboard (Andrew Harrison)
This presentation provides a step-by-step guide on how to process a miscellaneous payment for a customer within your PayClix dashboard. This powerpoint will help increase your PayClix knowledge and help you get more out of PayClix.
This document discusses various physicochemical principles relevant to pharmacy, including expressions of concentration, calculations, equivalent weight, tonicity, and buffers. It covers ratios, percentages, parts, molar and equivalent solutions, milliequivalents, adjusting tonicity, and buffer systems and applications. The document is intended as a guide for pharmacists and students to understand various concentration expressions and physicochemical concepts.
Spray drying is a method to dry solutions or slurries using hot air in a chamber. Liquid enters the top and is sprayed by pressure jets into a mist of droplets that are dried by the hot air and fall to the bottom as particles. Different types of spray driers exist with modifications like liquid and air entry points. Spray drying is useful for heat-sensitive products and can dry items like milk and enzymes, and particles can be encapsulated. Levigation and pulverization by intervention are processes to reduce substances to fine powders using solvents that are later removed. Elutriation separates particles by size using upward fluid flow, with gravitational elutriation using sedimentation and centrifugal elutriation using rotation.
The document provides instructions for an English class to write a script based on a fable they read. It describes the classroom as uncomfortable and cold with 28 students, 18 girls and 10 boys. The teacher asks what a script is and a student answers in Portuguese. The class is then told the fable will be turned into a movie and they must write the script, describing the setting, characters, and adding new thrilling events.
This document contains dialogues for various everyday situations like ordering drinks, checking into a hotel, buying coffee, clothes and gifts. It includes sample exchanges for asking directions, ordering meals and checking out of a hotel. The dialogues provide example phrases and questions for common activities and transactions to help learn English for travel and daily interactions.
Charlotte Mew wrote the poem "The Trees Are Down" in reaction to the felling of plane trees in Euston Square Gardens in the 1920s. The poem opens with a quotation from Revelation lamenting the cutting down of the trees. Mew depicts the felling of the trees and her sense of loss and desolation at their removal. She personifies the trees and expresses sympathy for their fate. The poem reflects Mew's affinity for nature and her belief that trees held symbolic value that did not deserve destruction, regardless of the reason for their felling.
Respiration is regulated by both nervous and chemical mechanisms. The nervous mechanism involves respiratory centers in the medulla oblongata and pons that collect sensory information and control respiratory muscles. There are four respiratory centers - the inspiratory and expiratory centers in the medulla, and the pneumotaxic and apneustic centers in the pons. The chemical mechanism involves central and peripheral chemoreceptors that detect changes in blood oxygen, carbon dioxide, and hydrogen ion levels and stimulate the respiratory centers.
This document provides examples of different radio script formats including the BBC scene style, BBC cue style, and U.S. radio drama format. It was written by Matt Carless and contains information on the stylistic conventions for writing radio scripts in both the BBC and American formats.
This document summarizes key concepts from a course on process instrumentation and control. It discusses servo and regulatory control, distinguishing between responses to setpoint changes versus disturbances. It also contrasts continuous versus batch processes. Continuous processes have constant inputs and outputs, while batch processes involve discrete batches undergoing separate processing stages. The document provides examples and compares characteristics of batch and continuous processes. Finally, it defines self-regulating versus non-self-regulating processes, using a water tank example to illustrate inherent feedback in self-regulating systems.
Gluconeogenesis is the production of glucose from non-carbohydrate sources through a complex series of metabolic pathways. It occurs primarily in the liver and kidney cytosol and produces approximately 1 kg of glucose per day, which is essential for brain function and as an energy source for muscles. The major precursors for gluconeogenesis are lactate, pyruvate, amino acids, glycerol, and propionate derived from the breakdown of proteins, fats, and certain metabolites. The pathways involved closely mirror glycolysis except for a few irreversible steps that are bypassed by alternative enzyme-catalyzed reactions in order to synthesize glucose from these precursors.
The circulatory system transports oxygen, nutrients, hormones, metabolic waste and more throughout the body. It consists of the cardiovascular system of the heart and blood vessels, and the lymphatic system. Blood is composed of plasma, red blood cells, white blood cells, and platelets. The heart pumps blood through two circuits - pulmonary circulation to the lungs and systemic circulation to the body. Blood pressure and pH are regulated to maintain homeostasis.
The document discusses the tricarboxylic acid (TCA) cycle, also known as the Krebs cycle or citric acid cycle. It provides three key points:
1. The TCA cycle is the final common pathway for the oxidation of carbohydrates, fats, and proteins to produce energy in the form of ATP. It involves the oxidation of acetyl-CoA to carbon dioxide and generates reduced cofactors NADH and FADH2 that feed into the electron transport chain.
2. The cycle occurs in the mitochondrial matrix and is tightly regulated. It generates 3 NADH and 1 FADH2 per acetyl-CoA molecule oxidized, which can ultimately produce around 12 ATP
The story is narrated by a man on the day before his execution for killing his wife. He describes how he grew to hate his black cat Pluto after initially loving it. In a drunken state, he hangs Pluto, which fills him with remorse. Later, a similar black cat appears and the man kills his wife in a fit of madness and hides her body behind a wall. The cat's cries lead to the discovery of the body and the man's arrest.
Penicillins by Dr. Panchumarthy Ravisankar M.Pharm., Ph.D. (Dr. Ravi Sankar)
The document discusses penicillins, including their:
1) Historical background of discovery by Alexander Fleming in 1928 from Penicillium notatum.
2) Classification based on structure, spectrum, source and pharmacological activity, with penicillins classified under the beta-lactam class.
3) Structures of different penicillins such as penicillin G, penicillin V, methicillin, and isoxazolyl penicillins.
The poem laments the felling of great plane trees at the end of the garden. The narrator expresses a deep connection to the trees, feeling their lives were intertwined through changing seasons and weather. As the trees are cut down amid the noisy work and laughter of men, the narrator is filled with sadness, comparing the loss of the trees to half the spring being gone. The poem references Revelation and an angel's cry of "Hurt not the trees," emphasizing the sacredness of nature.
This document outlines requirements for a College Management System that allows staff and students to access and share information. The system would include modules for campus information, administration, departments, staff, and students. It would allow users to view and change profiles, attendance records, marks and more. The system is intended to be accessed via login and password on the college intranet. It would use HTML, Oracle database, Tomcat web server, and Java technologies. Future enhancements could include online exam modules and a facility for teachers to upload lecture videos. The goal of the system is to provide appropriate information to users and help reduce wasted time by centralizing college information and services.
This document provides an overview of the physiology of the gastrointestinal system. It begins with an introduction and then covers the functions and contents of various parts of the GI tract, including the mouth, stomach, small intestine, large intestine, liver, gallbladder and pancreas. Specific topics discussed include the role of saliva, gastric juice, pancreatic juice, and bile in digestion and absorption of nutrients. The functions of digestive enzymes and how carbohydrates, proteins and fats are broken down are also explained.
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES (cscpconf)
This document presents a technique for identifying scripts (Tamil, English, Hindi) and numerals from multilingual document images using a rule-based classifier. Words are segmented and the first character of each word is represented as a 9-bit vector based on features like density, shape, and transitions. A rule-based classifier containing rules derived from training data is used to classify the script of each character. The technique aims to automatically categorize multilingual documents before applying optical character recognition and requires minimal preprocessing with high accuracy.
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH (ijcseit)
This document summarizes a research paper on Devnagari document segmentation using a histogram approach. It discusses challenges in segmenting the Devnagari script used for several Indian languages. A simple algorithm is proposed using horizontal and vertical histograms to segment documents into lines, words and characters. The algorithm achieves near 100% accuracy for line segmentation but lower accuracy for word and character segmentation due to complexities in the Devnagari script. Future work is needed to improve character segmentation handling connected and modified characters.
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O... (IJCSEA Journal)
India is a multilingual, multi-script country. In every state of India there are two languages: the state's local language and English. For example in Andhra Pradesh, a state in India, a document may contain text words in English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text words to the OCRs of the individual scripts. In this paper we introduce a simple and efficient script identification technique for Kannada, English and Hindi text words of a printed document. The proposed approach is based on the horizontal and vertical projection profiles for the discrimination of the three scripts, with features extracted from the horizontal projection profile of each text word. We analysed 700 different words of Kannada, English and Hindi in order to extract the discriminating features and to develop the knowledge base. The proposed system is tested on 100 different document images containing more than 1000 text words of each script, and classification rates of 98.25%, 99.25% and 98.87% are achieved for Kannada, English and Hindi respectively.
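A projection-profile feature of the kind used here can be sketched as follows; the three-zone split and the toy image are illustrative assumptions rather than the paper's exact feature set:

```python
import numpy as np

def horizontal_profile(word_img):
    """Row-wise ink counts of a binary word image (1 = ink)."""
    return word_img.sum(axis=1)

def zone_features(word_img):
    """Fraction of ink in the top, middle and bottom thirds of the
    word -- the kind of profile-based cue that separates scripts with
    headlines or upper matras from plain Roman text."""
    h = horizontal_profile(word_img).astype(float)
    total = float(h.sum()) or 1.0
    third = len(h) // 3
    return (h[:third].sum() / total,
            h[third:2 * third].sum() / total,
            h[2 * third:].sum() / total)

# Toy "word": all ink concentrated in the top rows, as under a headline.
img = np.zeros((9, 12), dtype=np.uint8)
img[0:3] = 1
print(zone_features(img))  # (1.0, 0.0, 0.0)
```

A knowledge base built from many labelled words would store typical zone distributions per script and match new words against them.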
International Journal of Image Processing (IJIP) Volume (3) Issue (3) (CSCJournals)
This document describes an optical character recognition system for Urdu text written in Naskh font. It discusses the challenges of recognizing characters in the Urdu language due to its cursive nature. The proposed system uses a training-based approach where character images are preprocessed, segmented, and used to create an XML training file containing chain codes. For recognition, an input image is segmented and converted to chain codes which are matched against the training file. The system achieves 89% accuracy recognizing 15 characters per second. Developing an Urdu OCR could help digitize literature and make texts more accessible online while reducing storage requirements.
This document summarizes research on script and language identification techniques for handwritten document images. It discusses the challenges of identifying scripts in handwritten versus printed documents. It then reviews various approaches that have been used, which are generally categorized as structure-based (using features like character geometry) or visual appearance-based (using texture features). Specific techniques discussed include those using linear discriminant analysis, K-nearest neighbors, neural networks, support vector machines, Gabor filters, and steerable pyramids. The document analyzes and compares the performance of different methods. It notes that better pre-processing and larger, standardized datasets are still needed to fully evaluate the techniques.
An Empirical Study on Identification of Strokes and their Significance in Scr... (IJMER)
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document discusses font and size identification in Telugu printed documents. It provides background on the Telugu script, which contains a large number of compound characters formed from combinations of vowels and consonants. The document then discusses the need for font and size identification as a preprocessing step for optical character recognition (OCR) systems to improve accuracy. It presents an approach using zonal analysis and connected component analysis to extract features from text images like aspect ratio and pixel ratio to identify the font and size by comparing to a database. Results showed this approach could accurately identify different fonts and sizes in Telugu text images.
Devnagari document segmentation using histogram approach (Vikas Dongre)
Document segmentation is one of the critical phases in machine recognition of any language. Correct segmentation of individual symbols decides the accuracy of the character recognition technique. It is used to decompose an image of a sequence of characters into sub-images of individual symbols by segmenting lines and words. Devnagari is the most popular script in India; it is used for writing the Hindi, Marathi, Sanskrit and Nepali languages, and Hindi is the third most popular language in the world. Devnagari documents consist of vowels, consonants and various modifiers, hence proper segmentation of Devnagari words is challenging. A simple histogram based approach to segment Devnagari documents is proposed in this paper, and various challenges in segmentation of the Devnagari script are also discussed.
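The histogram-based line segmentation that such approaches rely on reduces to scanning the horizontal projection for runs of rows containing ink; a minimal sketch (the ink threshold and the toy page are assumptions):

```python
import numpy as np

def segment_lines(page, min_ink=1):
    """Split a binary page image (1 = ink) into text lines by finding
    runs of rows whose horizontal-histogram value reaches min_ink."""
    hist = page.sum(axis=1)
    lines, start = [], None
    for r, ink in enumerate(hist):
        if ink >= min_ink and start is None:
            start = r                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, r))       # a text line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(hist)))
    return lines

page = np.zeros((10, 8), dtype=np.uint8)
page[1:3] = 1   # first text line
page[5:8] = 1   # second text line
print(segment_lines(page))  # [(1, 3), (5, 8)]
```

Word and character segmentation repeat the same idea with vertical histograms inside each line, which is where connected and modified Devnagari characters make things harder.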
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA... (ijnlc)
The comprehension of entire handwritten documents is a challenging problem that comprises several difficult tasks. Given a handwritten document, its layout first needs to be analysed to isolate the different content types. These content types can then be directed to specific subsystems, including writer-style, image, or table recognizers. Research in automatic writer identification has mainly centred on the statistical approach. This has led to the selection and extraction of statistical features such as run-length distributions, slant distribution, entropy, and edge-hinge distribution. The edge-hinge distribution outperforms all the other statistical features. It is a feature that describes the changes in direction of a writing stroke in handwritten text. The edge-hinge distribution is extracted by means of a window that is slid over an edge-detected version of the offline scanned images. Whenever the central pixel of the window is on, the two edge fragments (i.e. connected sequences of pixels) emerging from this central pixel are considered. Their directions are measured and stored as pairs. A joint probability distribution is obtained from an extensive
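Restated in code, an edge-hinge style feature boils down to a joint histogram over the direction codes of the two edge fragments meeting at each edge pixel. This simplified sketch considers only the immediate 8-neighbours rather than a sliding window with a configurable leg length:

```python
import numpy as np
from itertools import combinations

# 8-neighbour offsets (dy, dx) and their direction codes 0..7.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def hinge_histogram(edges):
    """Simplified edge-hinge feature: for every edge pixel, count each
    pair of directions toward its edge-pixel neighbours.  The
    normalised joint histogram characterises stroke curvature."""
    h, w = edges.shape
    hist = np.zeros((8, 8))
    for y in range(h):
        for x in range(w):
            if not edges[y, x]:
                continue
            dirs = [d for d, (dy, dx) in enumerate(OFFSETS)
                    if 0 <= y + dy < h and 0 <= x + dx < w
                    and edges[y + dy, x + dx]]
            for a, b in combinations(dirs, 2):
                hist[a, b] += 1
    total = hist.sum()
    return hist / total if total else hist

# A diagonal stroke: every interior pixel sees one neighbour up-left
# (code 0) and one down-right (code 4), so all mass lands in (0, 4).
print(hinge_histogram(np.eye(4, dtype=bool))[0, 4])  # 1.0
```

The full feature uses longer legs (several pixels per fragment), which makes the histogram far more discriminative between writers.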
Review of research on devnagari character recognition (Vikas Dongre)
This document summarizes research on Devnagari character recognition. It begins with an abstract discussing the progress of English character recognition and the need for further research on Indian languages like Devnagari. The document then reviews the stages of Devnagari optical character recognition systems, including pre-processing, segmentation, feature extraction, recognition, and post-processing. It discusses challenges in Devnagari recognition due to features of the script like connected characters. The document also reviews common techniques used at each stage of recognition systems and provides directions for future research.
Language Identification from a Tri-lingual Printed Document: A Simple Approach (IJERA Editor)
In a multi-script, multilingual country like India, a document may contain text lines in more than one script/language. For such a multi-script environment, a multilingual Optical Character Recognition (OCR) system is needed to read the multi-script documents. To make a multilingual OCR system successful, it is necessary to identify the different script regions of the document before feeding it to the OCRs of the individual languages. In this context, this paper addresses the prioritized requirements of a particular region, Andhra Pradesh, a state in India, where any document, including official ones, may contain text in three languages: Telugu, Hindi and English. The objective of this paper is therefore to develop a system that accurately identifies and separates Telugu, Hindi and English text lines from a printed multilingual document, and groups the portions of the document in other languages into a separate category, OTHERS. The proposed method is developed by thoroughly understanding the nature of the top and bottom profiles of the printed text lines. Experimentation involved 900 text lines for learning and 900 text lines for testing, and the performance turned out to be 95.67%.
Preprocessing Phase for Offline Arabic Handwritten Character Recognition (Editor IJCATR)
This document discusses preprocessing techniques for offline Arabic handwritten character recognition. It begins with an introduction to optical character recognition approaches and the challenges of recognizing Arabic text due to its unique characteristics. The general phases of OCR systems are then described, including preprocessing, segmentation, feature extraction, and classification. The proposed preprocessing phase is then discussed in detail, including algorithms for binarization, dot removal, and thinning applied to isolated Arabic character images from a dataset. The goal of preprocessing is to simplify images prior to feature extraction and recognition.
Fragmentation of Handwritten Touching Characters in Devanagari Script (Zac Darcy)
Character segmentation of handwritten words is a difficult task because of different writing styles and complex structural features. Segmentation of handwritten text in Devanagari script is an uphill task. The occurrence of the header line, overlapped characters in the middle zone, and half characters make the segmentation process more difficult. Sometimes, interline space and noise also make line fragmentation a difficult task. Without separating the touching characters it is difficult to identify them, hence fragmentation of the touching characters in a word is necessary. So we devised a technique in which the first step is preprocessing of a word; we then identify the joint points, form bounding boxes around all vertical and horizontal lines, and finally fragment the touching characters on the basis of their height and width.
Handwritten character recognition (HCR) is the ability of a computer to receive and interpret handwritten input. It is one of the active and challenging research areas in the field of pattern recognition, the process of taking in raw data and performing an action based on the category of the pattern; HCR is one of its well-known applications. Handwriting recognition, especially for Indian languages, is still at an infant stage because not much work has been done on it. This paper discusses an approach to recognizing Kannada vowels using chain code features. Kannada is a South Indian language. For any recognition system, an important part is feature extraction, and a proper feature extraction method can increase the recognition rate. In this paper, a chain code based feature extraction method is investigated for developing an HCR system. A chain code works on 4-neighborhood or 8-neighborhood methods; it is a sequence of direction codes of a character relative to a starting point, often used in image processing. Here the 8-neighborhood method has been implemented, which allows generation of eight different codes for each character. These codes are used as features of the character image, which are later used for training and testing K-Nearest Neighbor (KNN) classifiers. The level of accuracy reached 100%.
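A minimal 8-neighbourhood chain-code tracer along the lines described above might look like this; it assumes a thinned, one-pixel-wide open stroke and a known starting point (handling branches and closed loops is out of scope for the sketch):

```python
import numpy as np

# Direction codes for the 8-neighbourhood, counter-clockwise from east
# (row index grows downward, so (-1, 1) is the north-east move).
DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
        (0, -1), (1, -1), (1, 0), (1, 1)]

def chain_code(img, start):
    """Trace a one-pixel-wide stroke from `start`, emitting the
    8-direction code of each move until no new ink pixel remains."""
    codes, prev, cur = [], None, start
    while True:
        for d, (dy, dx) in enumerate(DIRS):
            nxt = (cur[0] + dy, cur[1] + dx)
            if (nxt != prev
                    and 0 <= nxt[0] < img.shape[0]
                    and 0 <= nxt[1] < img.shape[1]
                    and img[nxt]):
                codes.append(d)
                prev, cur = cur, nxt
                break
        else:
            return codes  # no continuation found: end of stroke

# An L-shaped stroke: down the left edge, then along the bottom row.
img = np.zeros((3, 3), dtype=bool)
img[0, 0] = img[1, 0] = img[2, 0] = img[2, 1] = img[2, 2] = True
print(chain_code(img, (0, 0)))  # [6, 6, 0, 0]
```

The resulting code sequence (here: south, south, east, east) is the kind of directional feature vector a KNN classifier can be trained on.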
Fuzzy rule based classification and recognition of handwritten hindiIAEME Publication
This document summarizes a research paper that proposes a fuzzy rule-based system for classifying and recognizing handwritten Hindi words. The system works in six stages: preprocessing, segmentation, normalization, classification, feature extraction, and recognition. In the classification stage, characters are classified into seven classes based on the presence, position, length, connectivity, and number of junction points of their vertical bars. Experimental results on 450 words written by 30 people showed the system achieved a classification and recognition rate of 92.02%.
Fuzzy rule based classification and recognition of handwritten hindiIAEME Publication
This document describes a fuzzy rule-based system for classifying and recognizing handwritten Hindi words. The system works in six stages: preprocessing, segmentation, normalization, classification, feature extraction, and recognition. Preprocessing includes binarization, thinning, slant correction, dilation, erosion, and filtering to prepare images for further processing. Classification uses fuzzy if-then rules based on the presence and position of vertical bars to classify characters into seven classes. Feature extraction identifies curves, lines, junction points and endpoints. The system was tested on 450 words written by 30 people, achieving a recognition rate of 92.02%.
Script Identification In Trilingual Indian DocumentsCSCJournals
This paper presents a research work in identification of script from trilingual Indian documents. This paper proposes a classification algorithm based on structural and contour features. The proposed system identifies the script of languages like English, Tamil and Hindi. 300 word images of the above mentioned three scripts were tested and 98.6% accuracy was obtained. Performance comparison with various existing methods is discussed.
IRJET - A Survey on Recognition of Strike-Out Texts in Handwritten DocumentsIRJET Journal
This document summarizes research on recognizing strike-out or crossed-out text in handwritten documents. It discusses 4 papers that studied this problem in different languages like English, Bengali, and Devanagari. The first paper used an LSTM model to recognize struck-out English text with 11% character error rate. The second used HMMs to recognize French text crossed out with lines or waves, achieving 46-92% accuracy. The third used graph algorithms and SVM to detect and remove Bengali strike-outs, obtaining 95% accuracy. The fourth used decision trees to classify English forensic documents, identifying 48% of crossed texts correctly. Overall the document reviews approaches to strike-out text recognition and removal.
Segmentation of Handwritten Text in Gurmukhi ScriptCSCJournals
Character segmentation is an important preprocessing step for text recognition. The size and shape of characters generally play an important role in the process of segmentation. But for any optical character recognition (OCR) system, the presence of touching characters in textual as well handwritten documents further decreases correct segmentation as well as recognition rate drastically. Because one can not control the size and shape of characters in handwritten documents so the segmentation process for the handwritten document is too difficult. We tried to segment handwritten text by proposing some algorithms, which were implemented and have shown encouraging results. Algorithms have been proposed to segment the touching characters. These algorithms have shown a reasonable improvement in segmenting the touching handwritten characters in Gurmukhi script.
Similar to A Novel Approach for Bilingual (English - Oriya) Script Identification and Recognition in a Printed Document (20)
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
How to Setup Warehouse & Location in Odoo 17 InventoryCeline George
In this slide, we'll explore how to set up warehouses and locations in Odoo 17 Inventory. This will help us manage our stock effectively, track inventory levels, and streamline warehouse operations.
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...PECB
Denis is a dynamic and results-driven Chief Information Officer (CIO) with a distinguished career spanning information systems analysis and technical project management. With a proven track record of spearheading the design and delivery of cutting-edge Information Management solutions, he has consistently elevated business operations, streamlined reporting functions, and maximized process efficiency.
Certified as an ISO/IEC 27001: Information Security Management Systems (ISMS) Lead Implementer, Data Protection Officer, and Cyber Risks Analyst, Denis brings a heightened focus on data security, privacy, and cyber resilience to every endeavor.
His expertise extends across a diverse spectrum of reporting, database, and web development applications, underpinned by an exceptional grasp of data storage and virtualization technologies. His proficiency in application testing, database administration, and data cleansing ensures seamless execution of complex projects.
What sets Denis apart is his comprehensive understanding of Business and Systems Analysis technologies, honed through involvement in all phases of the Software Development Lifecycle (SDLC). From meticulous requirements gathering to precise analysis, innovative design, rigorous development, thorough testing, and successful implementation, he has consistently delivered exceptional results.
Throughout his career, he has taken on multifaceted roles, from leading technical project management teams to owning solutions that drive operational excellence. His conscientious and proactive approach is unwavering, whether he is working independently or collaboratively within a team. His ability to connect with colleagues on a personal level underscores his commitment to fostering a harmonious and productive workplace environment.
Date: May 29, 2024
Tags: Information Security, ISO/IEC 27001, ISO/IEC 42001, Artificial Intelligence, GDPR
-------------------------------------------------------------------------------
Find out more about ISO training and certification services
Training: ISO/IEC 27001 Information Security Management System - EN | PECB
ISO/IEC 42001 Artificial Intelligence Management System - EN | PECB
General Data Protection Regulation (GDPR) - Training Courses - EN | PECB
Webinars: https://pecb.com/webinars
Article: https://pecb.com/article
-------------------------------------------------------------------------------
For more information about PECB:
Website: https://pecb.com/
LinkedIn: https://www.linkedin.com/company/pecb/
Facebook: https://www.facebook.com/PECBInternational/
Slideshare: http://www.slideshare.net/PECBCERTIFICATION
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
Communicating effectively and consistently with students can help them feel at ease during their learning experience and provide the instructor with a communication trail to track the course's progress. This workshop will take you through constructing an engaging course container to facilitate effective communication.
Constructing Your Course Container for Effective Communication
A Novel Approach for Bilingual (English - Oriya) Script Identification and Recognition in a Printed Document
Sanghamitra Mohanty & Himadri Nandini Das Bebartta
International Journal of Image Processing (IJIP), Volume (4): Issue (2) 175
A Novel Approach for Bilingual (English - Oriya) Script
Identification and Recognition in a Printed Document
Sanghamitra Mohanty sangham1@rediffmail.com
Faculty/Department of Computer Science
Utkal University Bhubaneswar,
751004, India
Himadri Nandini Das Bebartta himadri_nandini@yahoo.co.in
Scholar/ Department of Computer Science
Utkal University Bhubaneswar,
751004, India
Abstract
In most of our official papers and school text books, it is observed that English
words are interspersed within the Indian languages. So there is a need for an
Optical Character Recognition (OCR) system which can recognize these
bilingual documents and store them for future use. In this paper we present an
OCR system developed for the recognition of printed documents in an Indian
language, Oriya, together with the Roman script. For this purpose, it is
necessary to separate the different scripts before feeding them to their
individual OCR systems. First we correct the skew, followed by segmentation.
Here we propose line-wise script differentiation. We emphasize the upper and
lower matras associated with Oriya and absent in English. We have used the
horizontal histogram for distinguishing lines belonging to different scripts. After
separation, the different scripts are sent to their individual recognition engines.
Keywords: Script separation, Indian script, Bilingual (English-Oriya) OCR, Horizontal profiles
1. INTRODUCTION
Researchers have devoted considerable effort to pattern recognition for decades.
Within the pattern recognition field, Optical Character Recognition is the oldest sub-field and
has achieved considerable success in the recognition of monolingual scripts. In India,
there are 24 official (Indian constitution accepted) languages. Two or more of these languages
may be written in one script. Twelve different scripts are used for writing these languages. Under
the three-language formula, some Indian documents are written in three languages, namely
English, Hindi and the state official language. One of the important tasks in machine learning is
the electronic reading of documents. All official documents, magazines and reports can be
converted to electronic form using a high-performance Optical Character Recognizer (OCR). In
the Indian scenario, documents are often bilingual or multilingual in nature. English, being the
link language in India, is used in most of the important official documents, reports, magazines and
technical papers in addition to an Indian language. Monolingual OCRs fail in such contexts and
there is a need to extend the operation of current monolingual systems to bilingual ones. This
paper describes one such system, which handles both the Oriya and Roman scripts. Recognition of
bilingual documents can be approached by the following method, i.e. recognition via script
identification. OCR of such a document page can be performed by developing a script separation
scheme to identify the different scripts present in the document pages and then running the
individual OCR developed for each script's alphabet. Development of a generalized OCR system
for Indian languages is more difficult than developing a single-script OCR, because of the large
number of characters in each Indian script's alphabet. The script separation option is therefore
simpler for a country like India with its many scripts. There are many pieces of work on script
identification from a single document. Spitz [1]
developed a method to separate Han-based and Latin-based scripts. He used the optical
density distribution of characters and frequently occurring word shape characteristics for this
purpose. Recently, using fractal-based texture features, Tan [5] described an automatic method
for identification of Chinese, English, Greek, Russian, Malayalam and Persian text. Ding et al. [3]
proposed a method for separating two classes of scripts: European (comprising Roman and
Cyrillic scripts) and Oriental (comprising Chinese, Japanese and Korean scripts). Dhanya and
Ramakrishnan [9] proposed a Gabor filter based technique for word-wise segmentation from
bilingual documents containing English and Tamil scripts. Using cluster-based templates, an
automatic script identification technique has been described by Hochberg et al. [4]. Wood et al.
[2] described an approach using filtered pixel projection profiles for script separation. Pal and
Chaudhuri [6] proposed a line-wise script identification scheme from tri-language (triplet)
documents. Later, Pal et al. [7] proposed a generalized scheme for line-wise script identification
from a single document containing all the twelve Indian scripts. Pal et al. [8] also proposed some
work on word-wise identification from Indian script documents.
All the above pieces of work deal with script separation from printed documents. In the
proposed scheme, the document's noise is first removed, which we perform at the binarization
stage, and then the skew is detected and corrected. Using the horizontal projection profile the
document is segmented into lines. The line height differs between the two scripts. Along with this
property, another distinguishing property between the Roman and Oriya scripts is that each line
contains a larger number of Roman characters than Oriya characters. Based on these
features we compute a threshold value by dividing the line height of each line by the number
of characters in the line. After obtaining this value we send each line to its respective
classifier. The classifier which we have used is the Support Vector Machine. Figure 1 below
shows the entire process carried out for the recognition of our bilingual document.
FIGURE 1: The Schematic Representation of the Bilingual OCR system.
In Section 2 we describe the properties of the Oriya script. Section 3 covers a brief
description of binarization and skew correction. Section 4 describes segmentation. In
Section 5 we describe the major portion of our work, which focuses on script identification.
Section 6 analyzes the further cases that we have studied for bilingual script
differentiation. Section 7 describes the feature extraction part, which has been achieved through
Support Vector Machines. Section 8 discusses the results that we have obtained.
2. PROPERTIES OF ORIYA SCRIPT
The complex Oriya alphabet consists of 268 symbols (13 vowels, 36 consonants, 10
digits and 210 conjuncts), among which around 90 characters are difficult to recognize because
they occupy special sizes. The symbols are shown in the character set of the Oriya language.
The components of the characters can be classified into:
(a) Main component: either a vowel or a consonant symbol.
(b) Vowel modifier: a character can also have a vowel modifier, which modifies the consonant.
When the vowel modifier does not touch the main component, it forms a separate component,
which lies to the left, right, top or bottom of the main component and hence does not lie within the
line.
(c) Consonant modifier: a symbol can be composed of two or more consonants, i.e. the main
component and one or more consonant modifiers or half consonants. Spatially, the consonant
modifier can be at the bottom or top of the main component, and hence lie above or below the
line. Combinations of more than two and up to four consonants and vowels are found. These are
called conjuncts or yuktas. The basic characters of Oriya script are shown in Fig. 2.1, Fig. 2.2 and
Fig. 2.3.
FIGURE 2.1: 52 Oriya Vowels and Consonants.
FIGURE 2.2: Some of the Oriya Yuktas (Conjuncts).
FIGURE 2.3: 10 Oriya Digits.
From the above figure it can be noted that, out of the 52 basic characters, 37 have a
convex shape at the upper part. The writing style of the script is from left to right. The concept of
upper/lower case is absent in Oriya script. A consonant, or a vowel following a consonant,
sometimes takes a compound orthographic shape, which we call a compound character or
conjunct. Compound characters can be combinations of consonant and consonant, as well as
consonant and vowel.
3. BINARIZATION AND SKEW CORRECTION
Binarization
The input to an OCR system comes from a scanner or a camera. After this we need to binarize the
image. Image enhancement is performed using the spatial domain method, which operates on the
aggregate of pixels composing an image. A spatial domain process is denoted by the expression
O(x, y) = T[I(x, y)], where I(x, y) is the input image, O(x, y) is the processed image and T is an
operator on I. The operator T is applied at each location (x, y) to yield the output. The effect of
this transformation is to produce an image of higher contrast than the original by
darkening the levels below 'm' and brightening the levels above 'm' in the original image. Here 'm'
is the threshold value we take for brightening and darkening the original image. T(r) produces
a two-level (binary) image [10].
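As an illustrative sketch (the paper gives no code; the function name and the default value of m are our assumptions), the two-level transformation T above can be written as a global threshold:

```python
import numpy as np

def binarize(image: np.ndarray, m: int = 128) -> np.ndarray:
    """Global thresholding T: levels below the threshold 'm' are darkened
    to 0 and levels at or above 'm' are brightened to 255, producing a
    two-level (binary) image of higher contrast than the original."""
    return np.where(image < m, 0, 255).astype(np.uint8)
```

In practice m may be chosen per document, e.g. from the gray-level histogram.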
Skew Correction
Detecting the skew of a document image and correcting it are important issues in building a
practical document reader. For skew correction we have implemented Baird's algorithm, a
horizontal-profile-based algorithm. For skew detection, the horizontal profiles are computed at
angles close to the expected orientation. For each angle a measure is made of the variation in the
bin heights along the profile, and the angle with the maximum variation gives the skew angle.
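A minimal sketch of this Baird-style measure, assuming a binary image with ink pixels set to 1 (the function names, bin count and angle range are our illustrative choices, not from the paper):

```python
import numpy as np

def profile_variance(pixels: np.ndarray, angle_deg: float, n_bins: int = 64) -> float:
    """Variance of the horizontal projection profile after rotating the
    black-pixel coordinates by angle_deg."""
    theta = np.radians(angle_deg)
    ys, xs = pixels.nonzero()
    proj = ys * np.cos(theta) - xs * np.sin(theta)  # row coordinate after rotation
    hist, _ = np.histogram(proj, bins=n_bins)
    return hist.var()

def detect_skew(pixels: np.ndarray, angles=np.arange(-5, 5.1, 0.5)) -> float:
    """Return the candidate angle whose profile variance is maximal:
    text lines align with the bins at the true skew angle."""
    return max(angles, key=lambda a: profile_variance(pixels, a))
```

At the correct angle each text line falls into few histogram bins, so the profile is peaky and its variance is maximal.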
4. SEGMENTATION
Several approaches have also been taken for segmentation of a script line-wise, word-wise and
character-wise. A new algorithm for segmentation of handwritten text in Gurmukhi script has
been given by Sharma and Singh [11]. A new intelligent segmentation technique for functional
Magnetic Resonance Imaging (fMRI) has been implemented using an Echo State Neural Network
(ESN) by D. Suganthi and S. Purushothaman [12]. A simple segmentation approach for
unconstrained cursive handwritten words in conjunction with a neural network has been
presented by Khan and Muhammad [13]. The major challenge in our work is the separation of
lines for script identification. The result of line segmentation, which is shown later, takes
into consideration the upper and lower matras of the line, and this gives the differences in line
height used for distinguishing the scripts. One more feature which we have considered for
line-wise identification of the different scripts is the horizontal projection profile, which looks at
the intensity of pixels in different zones. The horizontal projection profile is the sum of black
pixels along every row of the image. For both of the above methods we discuss the output in the
script identification part; here we discuss the concepts only.
The purpose of analyzing the text line detection of an image is to identify the physical regions in
the image and their characteristics. A maximal region in an image is the maximal homogeneous
area of the image. The property of homogeneity in the case of a text image refers to the type of
region, such as text block, graphic, text line, word, etc. So we define segmentation as follows.
A segmentation of a text line image is a set of mutually exclusive and collectively exhaustive sub-
regions of the text line image. Given a text line image I, a segmentation is defined as
S = {R1, R2, ..., Rn}, such that
R1 ∪ R2 ∪ ... ∪ Rn = I, and
Ri ∩ Rj = ϕ for i ≠ j.
Typical top-down approaches proceed by dividing a text image into smaller regions using the
horizontal and vertical projection profiles. The X-Y Cut algorithm starts by dividing a text image
into sections based on valleys in the projection profiles. The algorithm repeatedly partitions the
image by alternately projecting the regions of the current segmentation onto the horizontal and
vertical axes. An image is recursively split horizontally and vertically until a final criterion, where a
split is impossible, is met. Projection-profile-based techniques are extremely sensitive to the skew
of the image. Hence extreme care has to be taken while scanning images, or a reliable skew
correction algorithm has to be applied before the segmentation process.
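The horizontal pass of this idea (splitting a page into lines at valleys of the row-wise profile) can be sketched as follows; this is our illustrative code, assuming a deskewed binary page with ink pixels set to 1:

```python
import numpy as np

def segment_lines(binary: np.ndarray) -> list[tuple[int, int]]:
    """Split a binary page into text lines using the horizontal projection
    profile: rows whose black-pixel sum is zero are treated as valleys."""
    profile = binary.sum(axis=1)  # sum of ink pixels along every row
    lines, start = [], None
    for row, count in enumerate(profile):
        if count > 0 and start is None:
            start = row                   # a new line begins
        elif count == 0 and start is not None:
            lines.append((start, row))    # [start, end) row range of one line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines
```

The full X-Y Cut would apply the same valley test alternately to the vertical profile of each resulting region.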
5. SCRIPT IDENTIFICATION
In a script, a text line may be partitioned into three zones. The upper zone denotes the portion
above the mean-line, the middle zone (busy zone) covers the portion of basic (and compound)
characters below the mean-line, and the lower zone is the portion below the base-line. Thus we
define an imaginary line, where most of the uppermost (lowermost) points of characters of a
text line lie, referred to as the mean-line (base-line). An example of zoning is shown in Figure 3,
and Figures 4a and 4b show a word each of Oriya and English and their corresponding projection
profiles respectively. Here the mean-line along with the base-line partitions the text line into three zones.
FIGURE 3: Line Showing The Upper, Middle and Lower Zone.
For example, from Figure 4 shown below, we can observe that the percentage of pixels in the
lower zone is greater for Oriya characters than for English characters.
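This per-zone comparison can be sketched as below; the function and the way the mean-line and base-line row indices are supplied are our illustrative assumptions (in practice they would be estimated from the horizontal profile):

```python
import numpy as np

def zone_densities(line_img: np.ndarray, mean_line: int, base_line: int):
    """Fraction of a line's ink falling in the upper, middle and lower zones,
    given the mean-line and base-line row indices of the line image."""
    total = line_img.sum() or 1          # avoid division by zero on blank lines
    upper = line_img[:mean_line].sum() / total
    middle = line_img[mean_line:base_line].sum() / total
    lower = line_img[base_line:].sum() / total
    return upper, middle, lower
```

A high lower-zone fraction then suggests an Oriya line (lower matras), a low one a Roman line.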
In this approach, script identification is first performed at the line level and this knowledge is used
to select the OCR to be employed. Individual OCRs have been developed for Oriya [14] as well
as English, and these are used for further processing. Such an approach allows the Roman
and Oriya characters to be handled independently of each other. In most Indian languages, a
text line may be partitioned into three zones. We call the uppermost and lowermost boundary
lines of a text line the upper-line and lower-line.
FIGURE 4: The Three Zones of (a) Oriya Word and (b) English Word.
For script recognition, features are identified based on the following observations. From the
above projection profiles we can observe that:
1. The number of Oriya characters present in a line is comparatively smaller than the number of
Roman characters.
2. All the upper case letters in the Roman script extend into the upper zone and middle zone, while
the lower case letters occupy the middle, lower and upper zones.
3. The Roman script has very few downward extensions (only for g, p, q, j and y) and a low
range of pixel density in the lower zone, whereas most Oriya lines contain lower matras and have a
high range of pixel density.
4. Only a few Roman characters (considering the lower case letters) have upward
extensions (only b, d, f, h, l and t) and the pixel density range is low, whereas most
Oriya lines contain upper vowel markers (matras) and have a high range of pixel
density.
5. The upper portion of most Oriya characters is convex in nature and touches the mean-line,
whereas the Roman script is dominated by vertical and slanted strokes.
Considering the above features for distinction, we have tried to separate the scripts on the
basis of line height. Figure 5 shows the different lines extracted for the individual scripts. Here
we have considered the upper and lower matras of the Oriya characters. We have observed that,
for a certain threshold value of the line height, lines containing English have a
line height less than the threshold value and Oriya lines have a height greater than the
threshold value.
FIGURE 5: The Lines Shown with Their Upper and Lower Matras.
Line Number    Line Height
     1             109
     2             101
     3              98
     4             105
     5             105
     6              72
     7              77
     8              76
     9              71
    10              74
    11              83
    12              77
    13              64
TABLE 1: Line Height for the Different Line Numbers.
For each of the lines shown above, the number of characters present in the line has been
calculated. Then a threshold value 'R' for both scripts has been calculated by dividing the line
height of each line by the number of characters present in the line. Thus, R can be written as
R = Line height / Number of characters
The values we obtained are shown in Table 2. From these values we can see that for
Oriya script the value lies above 3.0 and for Roman it is below 3.0. Based on these values the
scripts have been separated.
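The line-wise decision rule above can be sketched as follows (the function name and the default threshold of 3.0, taken from the observation just stated, are illustrative):

```python
def classify_line(line_height: float, num_chars: int, threshold: float = 3.0) -> str:
    """Line-wise script decision using R = line height / number of characters;
    Oriya lines were observed to give R above the threshold, Roman lines below it."""
    r = line_height / num_chars
    return "Oriya" if r > threshold else "Roman"
```

For example, line 1 in Table 1 (height 109) with 33 characters gives R ≈ 3.3 and is routed to the Oriya OCR.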
Line Number    R = Line Height / Number of Characters
     1             3.3
     2             3.06
     3             3.5
     4             3.08
     5             3.08
     6             8
     7             1.08
     8             1.08
     9             1.24
    10             1.08
    11             1.25
    12             1.08
    13             2
TABLE 2: The Ratio Obtained after Dividing Line Height with Number of Characters.
We have taken nearly fifteen hundred printed documents for comparing the outputs and deriving
a conclusion. The above table and figure are presented for one of the documents used in our
experiment.
6. FEATURE EXTRACTION
The two essential sub-stages of the recognition phase are feature extraction and classification. The
feature extraction stage analyzes a text segment and selects a set of features that can be used to
uniquely identify the text segment. The derived features are then used as input to the character
classifier. The classification stage is the main decision-making stage of an OCR system and uses
the extracted features as input to identify the text segment according to preset rules.
The performance of the system largely depends upon the type of classifier used. Classification is
usually accomplished by comparing the feature vectors corresponding to the input text/character
with the representatives of each character class, using a distance metric. The classifier used
by our system is the Support Vector Machine (SVM).
Support Vector Machines
Support vector machines were originally formulated for two-class classification problems [15]. But
since the decision functions of two-class support vector machines are directly determined to
maximize the generalization ability, the extension to multiclass problems is not unique. There are
roughly three ways to solve this problem: one-against-all, pairwise, and all-at-once
classification. The original formulation by Vapnik [15] is one-against-all classification, in which one
class is separated from the remaining classes. With this formulation, however, unclassifiable
regions exist. Instead of discrete decision functions, Vapnik [16, p. 438] proposed to use
continuous decision functions: we classify a datum into the class with the maximum value of the
decision functions. In pairwise classification, the n-class problem is converted into n(n − 1)/2
two-class problems. Kreßel [18] showed that with this formulation unclassifiable regions are
reduced, but they still remain. To resolve unclassifiable regions for pairwise classification, Platt,
Cristianini and Shawe-Taylor [19] proposed decision-tree-based pairwise classification called the Decision
Directed Acyclic Graph (DDAG). Kijsirikul and Ussivakul [20] proposed the same method and
called it the Adaptive Directed Acyclic Graph (ADAG). The problem with DDAGs and ADAGs is that
the generalization regions depend on the tree structure [17]. Abe and Inoue [21] extended one-
against-all fuzzy support vector machines to pairwise classification. In the all-at-once formulation we
need to determine all the decision functions at once [22, 23], [16, pp. 437–440], but this results in
simultaneously solving a problem with a larger number of variables than in the above-mentioned
methods. With decision trees, the unclassifiable regions are assigned to the classes associated
with leaf nodes. Thus, if class pairs with low generalization ability are assigned to the leaf nodes,
the associated decision functions are used for classification, and the classes that are difficult to
separate are classified using the decision functions determined for these classes. As a measure
to estimate the generalization ability, we can use any measure developed for estimating
the generalization ability of two-class problems, because in DDAGs and ADAGs the decision
functions for all the class pairs need to be determined in advance. We explain two-class
support vector machines, pairwise SVMs, and DDAGs.
Two-class Support Vector Machines
Let m-dimensional inputs xi (i = 1, …, M) belong to Class 1 or Class 2, with associated labels yi
= 1 for Class 1 and yi = −1 for Class 2. Let the decision function be

D(x) = wᵀx + b,

where w is an m-dimensional vector, b is a scalar, and

yi (wᵀxi + b) ≥ 1 − ξi   for i = 1, …, M.

Here the ξi are nonnegative slack variables. The distance between the separating hyperplane D(x) =
0 and the training datum, with ξi = 0, nearest to the hyperplane is called the margin. The hyperplane
D(x) = 0 with the maximum margin is called the optimal separating hyperplane. To determine the
optimal separating hyperplane, we minimize

(1/2) ‖w‖² + C Σᵢ ξi

subject to the constraints

yi (wᵀxi + b) ≥ 1 − ξi   for i = 1, …, M,

where C is the margin parameter that determines the trade-off between maximization of the
margin and minimization of the classification error. The data that satisfy the equality in the
constraints are called support vectors.
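As an illustrative sketch of the optimization above, the following fragment trains a two-class soft-margin linear SVM by subgradient descent on the primal objective (1/2)‖w‖² + C Σ max(0, 1 − yi(wᵀxi + b)). This is a toy implementation under the stated formulation, not the system used in the paper; the data, learning rate, and epoch count are invented for the example.

```python
import numpy as np

# Toy two-class soft-margin linear SVM via subgradient descent on the primal.
# X, lr, and epochs are made-up example values, not the paper's settings.
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    X, y = np.asarray(X, float), np.asarray(y, float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # points with nonzero slack (hinge-loss violators)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    # classify by the sign of the decision function D(x) = w.x + b
    return np.where(np.asarray(X, float) @ w + b >= 0, 1, -1)

# Toy linearly separable data with labels +1 / -1.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
```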
To enhance separability, the input space is mapped into a high-dimensional dot-product space
called the feature space. Let the mapping function be g(x). If the dot
product in the feature space is expressed by H(x, x′) = g(x)ᵀg(x′), then H(x, x′) is called a kernel
function, and we do not need to treat the feature space explicitly. The kernel functions used in this
study are as follows:
1. Dot product kernels:
H(x, x′) = xᵀx′
2. Polynomial kernels:
H(x, x′) = (xᵀx′ + 1)ᵈ,
where d is an integer.
3. Radial basis function kernels:
H(x, x′) = exp(−γ ‖x − x′‖²),
where the γ value is taken as 1/(number of dimensions of the input vectors).
To simplify notations, in the following we discuss support vector machines with the dot product
kernel.
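The three kernels above can be written down directly. In this sketch the polynomial offset (+1) follows the common form given above, NumPy is an assumed dependency, and γ defaults to 1/(input dimension) as stated in the text.

```python
import numpy as np

# The three kernel functions described above; x and z are feature vectors.
def dot_kernel(x, z):
    # H(x, z) = x.z
    return np.dot(x, z)

def poly_kernel(x, z, d=2):
    # H(x, z) = (x.z + 1)^d, with integer degree d
    return (np.dot(x, z) + 1) ** d

def rbf_kernel(x, z, gamma=None):
    # H(x, z) = exp(-gamma * ||x - z||^2); gamma = 1 / input dimension by default
    x, z = np.asarray(x, float), np.asarray(z, float)
    if gamma is None:
        gamma = 1.0 / x.size
    return np.exp(-gamma * np.sum((x - z) ** 2))
```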
Pairwise Support Vector Machines
In pairwise support vector machines, we determine the decision functions for all the
combinations of class pairs. In determining the decision function for a class pair, we use the training
data for the corresponding two classes. Thus, in each training run the number of training data is
considerably reduced compared to one-against-all support vector machines, which use all the
training data. But the number of decision functions is n(n − 1)/2, compared to n for one-against-all
support vector machines, where n is the number of classes [25].
FIGURE 6: Unclassifiable Regions by the Pairwise Formulation.
Let the decision function for class i against class j, with the maximum margin, be

Dij(x) = wijᵀx + bij,

where wij is an m-dimensional vector, bij is a scalar, and Dij(x) = −Dji(x). The regions
Ri = {x | Dij(x) > 0 for all j ≠ i} do not overlap, and if x is in Ri we classify x into class i. If x is not
in any Ri (i = 1, …, n), we classify x by voting. Namely, for the input vector x we calculate

Di(x) = Σⱼ≠ᵢ sign(Dij(x)),

where

sign(v) = 1 for v > 0, and −1 otherwise,

and classify x into the class arg maxᵢ Di(x). If x ∈ Ri, then Di(x) = n − 1 and Dk(x) < n − 1 for k ≠ i,
so x is classified into class i. But if no Di(x) equals n − 1, the maximum may be attained for plural
i's; in this case x is unclassifiable. If the decision functions for a three-class problem are as shown
in Figure 6, the shaded region is unclassifiable, since Di(x) = 1 for i = 1, 2, and 3.
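The voting rule just described can be sketched as follows. The decision values in `toy` are made-up numbers for a three-class example, not learned functions; they merely satisfy the antisymmetry Dij(x) = −Dji(x).

```python
# Pairwise voting: Di(x) = sum over j != i of sign(Dij(x)),
# then classify x into argmax_i Di(x).
def sign(v):
    return 1 if v > 0 else -1

def pairwise_vote(decision, n, x):
    scores = [sum(sign(decision(i, j, x)) for j in range(n) if j != i)
              for i in range(n)]
    return max(range(n), key=lambda i: scores[i]), scores

# Toy three-class decision values: class 0 beats classes 1 and 2.
toy = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): -0.3}
def D(i, j, x):
    return toy[(i, j)] if (i, j) in toy else -toy[(j, i)]

winner, scores = pairwise_vote(D, 3, None)
# Class 0 scores n - 1 = 2 votes, so x is classified into class 0.
```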
Decision-tree Based Support Vector Machines: Decision Directed Acyclic Graphs
Figure 7 shows the decision tree for the three classes shown in Figure 6.
FIGURE 7: Decision-Tree-Based Pairwise Classification.
In Figure 7, each branch indicates that x does not belong to one of the two classes compared at
that node. As the top-level classification we can choose any pair of classes. Except at the leaf
nodes, if Dij(x) > 0 we consider that x does not belong to class j, and if Dij(x) < 0, not to class i.
Then, if D12(x) > 0, x does not belong to Class 2; thus it belongs to either Class 1 or Class 3, and
the next classification pair is Classes 1 and 3. The generalization regions become as shown in
Figure 8. The unclassifiable regions are resolved, but clearly the generalization regions depend
on the tree structure.
FIGURE 8: Generalization Region by Decision-Tree-Based Pairwise Classification.
Classification by a DDAG is executed by list processing. In list processing, first we generate a list
with class numbers as elements. Then, we calculate the decision function, for the input x,
corresponding to the first and the last elements. Let these classes be i and j and Dij (x) > 0. We
delete the element j from the list. We repeat the above procedure until one element is left. Then
we classify x into the class corresponding to the remaining element. For Figure 7, we generate the
list {1, 3, 2}. If D12(x) > 0, we delete element 2 from the list, obtaining {1, 3}. Then, if D13(x) > 0,
we delete element 3. Since only 1 is left in the list, we classify x into Class 1.
Training a DDAG is the same as for conventional pairwise support vector machines: we need to
determine n(n − 1)/2 decision functions for an n-class problem. The advantage of DDAGs is that
classification is faster than with conventional pairwise support vector machines or pairwise fuzzy
support vector machines: in a DDAG, classification can be done by calculating only (n − 1)
decision functions [24].
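The list-processing procedure above can be sketched directly. The `toy` decision values are invented so that the example reproduces the {1, 3, 2} walk-through in the text; note that only n − 1 = 2 decision functions are evaluated.

```python
# DDAG classification by list processing, as described above.
# decision(i, j, x) returns Dij(x), with Dij(x) = -Dji(x).
def ddag_classify(decision, classes, x):
    lst = list(classes)
    while len(lst) > 1:
        i, j = lst[0], lst[-1]
        if decision(i, j, x) > 0:
            lst.remove(j)   # x does not belong to class j
        else:
            lst.remove(i)   # x does not belong to class i
    return lst[0]

# Toy decision values matching the text's example.
toy = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): -0.5}
def D(i, j, x):
    return toy[(i, j)] if (i, j) in toy else -toy[(j, i)]

# List {1, 3, 2}: D12 > 0 deletes 2 -> {1, 3}; D13 > 0 deletes 3 -> Class 1.
result = ddag_classify(D, [1, 3, 2], None)
```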
We have used a DDAG support vector machine as the recognition engine of our OCR system.
Below we show the types of samples used for training and testing and the accuracy rates
obtained for the training characters.
7. RESULTS
A corpus for Oriya OCR consisting of a database of machine-printed Oriya characters has been
developed. Samples for both scripts were collected, mainly from laser-printed documents, books,
and newspapers containing variable font styles and sizes. A scanning resolution of 300 dpi was
employed for the digitization of all documents. Figure 9 and Figure 10 show some sample
characters of various fonts of both the Oriya and Roman scripts used in the experiment.
We have performed experiments with different types of images, such as normal, bold, thin, small,
big, etc., having varied sizes of Oriya and Roman characters. The training and testing set
comprises more than 10,000 samples. We considered grayscale images for the collection of the
samples. This database can be utilized for document analysis, recognition, and examination. The
training set consists of binary images of 297 Oriya letters and 52 English letters, including both
lower- and upper-case forms. We have kept the same data files for testing and training across all
the different classifiers in order to compare their results. In most documents the occurrence of
Roman characters is small compared to that of Oriya characters; for this reason, we have
collected more training samples for Oriya characters than for English.
FIGURE 9: Samples of machine printed Oriya Characters Used for Training.
FIGURE 10: Samples of Machine Printed Roman Characters Used For Training.
The table below shows the effect on accuracy of different character sizes and image types for
Oriya characters.

Image type and size   Accuracy percentage
Bold and small        92.78%
Bold and big          99.8%
Normal and small      96.98%
Normal and bold       97.12%

TABLE 2: Effect on Accuracy by Considering Different Character Sizes with Different Types of the Images
used for Oriya Characters.
Table 3 below shows the recognition accuracy for Roman characters in normal, large, bold, and
small fonts; it is observed that large sizes give better accuracy than the other fonts.

Size of the samples   Accuracy percentage
Large                 92.13%
Normal                87.78%
Normal and small      88.26%
Normal and bold       90.89%

TABLE 3: Recognition Accuracy for Roman Characters with Different Font Styles.
Regarding the effect of character size and image type on accuracy, bold and big Oriya characters
give the highest accuracy rate, at nearly 99.8%. The accuracy rate decreases for thin and
small characters.
Figure 11 shows an example of a typical bilingual document used in our work. As discussed in
the script identification section, almost all of the Oriya lines are associated with lower and upper
matras.
In the test image of Figure 11, the upper portion contains the English script and the lower half
contains the Oriya script. After the different scripts are sent to their respective classifiers, the final
recognition result is shown in Figure 12.
FIGURE 11: Sample of an Image Tested in Bilingual OCR.
For the image shown above, the corresponding output is shown in Figure 12.
FIGURE 12: Output of the Bilingual Document after Recognition.
8. CONCLUSION
A novel method for script separation has been presented in this paper. In this work we distinguish
between English and Oriya text lines through horizontal projection profiles of pixel intensity in
different zones, along with the line height and the number of characters present in each line.
Separating the scripts is preferred because training both scripts in a single recognition system
decreases the accuracy rate: some Oriya characters may be confused with some Roman
characters, and a similar problem arises during post-processing. We have recognized the Oriya
and Roman scripts with two separate training sets using Support Vector Machines, and the
recognized characters are finally inserted into a single editor. Improved accuracy is always
desired, and we are trying to achieve it by improving each processing task: preprocessing, feature
extraction, sample generation, classifier design, multiple-classifier combination, etc. Selecting
features and designing classifiers jointly also leads to better classification performance. We are
also trying to apply multiple classifiers to increase the overall accuracy of the OCR system, as it is
difficult to optimize performance using a single classifier with a large feature vector set. The
present OCR system deals with clean machine-printed text with minimal noise, where the input is
printed in a non-italic, non-decorative regular font at standard sizes. In future this work can be
extended to a bilingual OCR dealing with degraded, noisy machine-printed and italic text, and to
handwritten text. A post-processor for both scripts can also be developed to increase the overall
accuracy. In consideration of the above problems, steps are being taken to further refine our
bilingual OCR.
ACKNOWLEDGEMENT
We are thankful to DIT, MCIT for its support, and to our colleague Mr. Tarun Kumar Behera for his
cooperation.
9. REFERENCES
1. A. L. Spitz. “Determination of the Script and Language Content of Document Images”. IEEE
Trans. on PAMI, 235-245, 1997
2. J. Ding, L. Lam, and C. Y. Suen. “Classification of Oriental and European Scripts by using
Characteristic Features”. In Proceedings of 4th ICDAR, pp. 1023-1027, 1997
3. D. Dhanya, A. G. Ramakrishnan, and P. B. Pati. “Script Identification in Printed Bilingual
Documents”. Sadhana, 27(1): 73-82, 2002
4. J. Hochberg, P. Kelly, T. Thomas, and L. Kerns. “Automatic script Identification from
Document Images using Cluster-Based Templates” IEEE Trans. on PAMI, 176-181, 1997
5. T. N. Tan. “Rotation Invariant Texture Features and their use in Automatic Script
Identification”. IEEE Trans. On PAMI, 751-756, 1998
6. S. Wood, X. Yao, K. Krishnamurthi, and L. Dang. “Language Identification for Printed Text
Independent of Segmentation”. In Proc. Int’l Conf. on Image Processing, 428-431, 1995
7. U. Pal and B. B. Chaudhuri. “Script Line Separation from Indian Multi-Script Documents”.
IETE Journal of Research, 49, 3-11, 2003
8. U. Pal, S. Sinha, and B. B. Chaudhuri. “Multi-Script Line identification from Indian
Documents”. In Proceedings 7th ICDAR, 880-884, 2003
9. S. Chanda, U. Pal, “English, Devnagari and Urdu Text Identification”. Proc. International
Conference on Cognition and Recognition, 538-545, 2005
10. S. Mohanty, H. N. Das Bebartta, and T. K. Behera. “An Efficient Bilingual Optical Character
Recognition (English-Oriya) System for Printed Documents”. Seventh International
Conference on Advances in Pattern Recognition, ICAPR, 398-401, 2009
11. R. K. Sharma, Dr. A. Singh, “Segmentation of Handwritten Text in Gurmukhi Script”.
Computers & Security, 2(3):12-17, 2009
12. D. Suganthi, Dr. S. Purushothaman, “fMRI Segmentation Using Echo State Neural Network”.
Computers & Security, 2(1):1-9, 2009
13. A. R. Khan, D. Muhammad, “A Simple Segmentation Approach for Unconstrained Cursive
Handwritten Words in Conjunction with the Neural Network”. Computers & Security, 2(3):29-
35, 2009
14. S. Mohanty and H. K. Behera. “A Complete OCR Development System for Oriya Script”.
Proceedings of SIMPLE’04, IIT Kharagpur, 2004
15. B. V. Dasarathy. “Nearest Neighbor Pattern Classification Techniques”. IEEE Computer
Society Press,New York, 1991
16. V. N. Vapnik. “The Nature of Statistical Learning Theory”. Springer-Verlag, London, UK, 1995.
17. V. N. Vapnik. “Statistical Learning Theory”. John Wiley & Sons, New York, 1998.
18. S. Abe. “Analysis of multiclass support vector machines”. In Proceedings of International
Conference on Computational Intelligence for Modelling Control and Automation
(CIMCA’2003), Vienna, Austria, 2003
19. U. H.-G. Kreßel. “Pairwise classification and support vector machines”. In B. Schölkopf, C.
J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector
Learning, pages 255-268. The MIT Press, Cambridge, MA, 1999
20. J. C. Platt, N. Cristianini, and J. Shawe-Taylor. “Large margin DAGs for multiclass
classification”. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural
Information Processing Systems 12, pages 547-553. The MIT Press, Cambridge, MA, 2000
21. B. Kijsirikul and N. Ussivakul. “Multiclass support vector machines using adaptive directed
acyclic Graph”. In Proceedings of International Joint Conference on Neural Networks (IJCNN
2002), 980–985, 2002
22. S. Abe and T. Inoue. “Fuzzy support vector machines for multiclass problems”. In
Proceedings of the Tenth European Symposium on Artificial Neural Networks (ESANN 2002),
116–118, Bruges, Belgium, 2002
23. K. P. Bennett. “Combining support vector and mathematical programming methods for
classification”. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel
Methods: Support Vector Learning, pages 307-326. The MIT Press, Cambridge, MA, 1999
24. J. Weston and C. Watkins. “Support vector machines for multi-class pattern recognition”. In
Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN’99),
pages 219-224, 1999
25. F. Takahashi and S. Abe. “Optimizing Directed Acyclic Graph Support vector Machines”.
ANNPR , Florence (Italy), September 2003