This paper presents a new multi-tier holistic approach for recognizing Urdu text written in the Nastaliq script. It first separates special ligatures such as dots, tay, hamza, and mad from base ligatures, and then associates each special ligature with its neighboring base ligature. Features extracted from the ligatures and from the special-ligature/base-ligature associations are input to a neural network that recognizes the ligatures in three steps: 1) identifying special ligatures, 2) associating them with base ligatures, and 3) recognizing the base ligatures. The system was tested on 200 ligatures, achieving 100% accuracy on ligatures in its training set and closest-match classification for unseen ligatures.
An exhaustive font and size invariant classification scheme for ocr of devana... (ijnlc)
The main challenge in any Optical Character Recognition (OCR) system is dealing with multiple fonts and sizes. OCR of Indian languages must also handle a huge number of conjunct characters whose shapes change drastically across fonts, and separating a conjunct character into its constituent symbols leads to segmentation errors. The proposed approach handles both of these problems in the context of the Devanagari script. An attempt is made to identify all possible connected symbols of Devanagari in the middle zone (a consonant, vowel, half consonant, or conjunct consonant, henceforth referred to as a basic symbol) without segmenting the conjunct characters. From a study of 469,580 words drawn from a variety of sources, it was found that only 345 symbols occur frequently in the middle zone, covering 99.97% of the text. These are then classified into 16 classes on the basis of structural properties that are invariant across fonts and sizes. To validate the proposed classification scheme, results are presented on 25 fonts and three sizes.
The Heuristic Extraction Algorithms for Freeman Chain Code of Handwritten Cha... (Waqas Tariq)
Handwriting character recognition (HCR) is the ability of a computer to receive and interpret handwritten input. Among the many representation schemes used in HCR is the Freeman chain code (FCC), a sequence of direction codes tracing a character from a starting point, which is widely used in image processing. The main problem with representing a character using FCC is that the code depends on the starting point. Unfortunately, FCC extraction using one continuous route, and minimizing the length of the chain code extracted from a thinned binary image (TBI), have not been widely explored. To address this, heuristic algorithms are proposed for extracting an FCC that correctly represents the character. This paper proposes two such algorithms, one randomized and one enumeration-based. The randomized algorithm makes random choices, while the enumeration-based algorithm enumerates all candidate solutions. The algorithms are measured by route length and computation time. The experiments use chain-code representations derived from established previous work on the Center of Excellence for Document Analysis and Recognition (CEDAR) dataset, which consists of 126 upper-case letter characters. The results show that the route lengths of the two algorithms are similar, but the enumeration-based algorithm takes more computation time than the randomized one because it considers all branches in the route walk.
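The basic chain-code walk over a thinned stroke can be sketched as follows. This is a minimal illustrative reconstruction, not either of the paper's two heuristics: it greedily steps to the first unvisited neighbouring stroke pixel, so the resulting code (and its length) depends on the starting point, which is exactly the problem the paper studies.

```python
# A minimal sketch (not the paper's algorithm) of extracting a Freeman chain
# code from a thinned binary image by greedily walking unvisited stroke pixels.
# Direction codes: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE.

DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def freeman_chain_code(img, start):
    """Walk the stroke from `start`, recording one direction code per step."""
    visited = {start}
    code = []
    r, c = start
    while True:
        for d, (dr, dc) in enumerate(DIRS):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(img) and 0 <= nc < len(img[0])
                    and img[nr][nc] == 1 and (nr, nc) not in visited):
                code.append(d)
                visited.add((nr, nc))
                r, c = nr, nc
                break
        else:
            return code  # no unvisited neighbour: the route ends

# A tiny L-shaped stroke: down the left edge, then right along the bottom.
img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]]
print(freeman_chain_code(img, (0, 0)))  # [6, 6, 0, 0]
```

Starting from a different end of the stroke yields a different code sequence, which is why the paper searches over routes rather than fixing one.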
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT (cscpconf)
Optical Character Recognition (OCR) is one of the fundamental research areas of image processing and pattern recognition. The accuracy of an OCR system depends on proper segmentation of the characters. This paper is concerned with the segmentation of printed Bangla characters without modifiers for an OCR system. The basic steps needed for developing an OCR system are also discussed.
OCR-THE 3 LAYERED APPROACH FOR DECISION MAKING STATE AND IDENTIFICATION OF TE... (ijaia)
Optical Character Recognition (OCR) is the digitization of handwritten, typewritten, or printed text into machine-encoded form, and it supports a wide range of everyday applications: OCR is used successfully in finance, legal services, banking, health care, and home appliances. India is a multicultural country with many literary and scripted traditions. Telugu, a southern Indian language, is a syllabic language: each script symbol represents a complete syllable and may be formed with conjunct consonants. Recognizing mixed conjunct consonants is harder than recognizing normal consonants because of variation in written strokes and conjunct mixing at the pre- and post-consonant level. This paper proposes a layered methodology to recognize characters, conjunct consonants, and mixed conjunct consonants, and presents an efficient classification of handwritten and printed conjunct consonants. An Advanced Fuzzy Logic controller takes text, written or printed, collected as images from scanned files or a digital camera; the image is processed by examining its intensity against a quality ratio, characters are extracted according to that quality, and character orientation, alignment, thickness, and base and print ratios are then checked. The input characters are classified in two ways: normal consonants and conjunct consonants. The digitized text is divided into three layers: the middle layer holds normal consonants, while the top and bottom layers hold the marks of mixed conjunct consonants. Recognition starts from the middle layer and then checks the top and bottom layers: a character is treated as a conjunct consonant when symbolic marks are detected in the top or bottom layer of the base character, and as a normal consonant otherwise. Post-processing is applied to all three layers, concentrating on the readability and compatibility of the recognized text; if readability fails, the process is repeated. The recognition process includes slant correction, thinning, normalization, segmentation, feature extraction, and classification, and the pre-processing, segmentation, character recognition, and post-processing modules of the algorithm are discussed. The main objectives of this paper are to develop classification and identification of different prototypes for written and printed consonants, conjunct consonants, and symbols based on a three-layered approach over different measurable areas using fuzzy logic, and to determine suitable features for handwritten character recognition.
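The three-layer decomposition described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the zone boundaries here are fixed fractional heights, whereas the paper derives them from measurable areas, and the ink test stands in for the fuzzy-logic classification.

```python
# A minimal sketch of splitting a character image into top, middle, and
# bottom zones and checking the outer zones for conjunct marks. The split
# fractions are illustrative, not values from the paper.

def split_zones(img, top_frac=0.25, bottom_frac=0.75):
    """Split rows of a binary character image into three horizontal zones."""
    h = len(img)
    t, b = int(h * top_frac), int(h * bottom_frac)
    return img[:t], img[t:b], img[b:]

def has_ink(zone):
    return any(any(row) for row in zone)

def classify(img):
    """Conjunct if any mark appears above or below the base character."""
    top, middle, bottom = split_zones(img)
    if has_ink(top) or has_ink(bottom):
        return "conjunct consonant"
    return "normal consonant"

# An 8-row character with ink only in the middle zone.
char = [[0] * 4] * 2 + [[0, 1, 1, 0]] * 4 + [[0] * 4] * 2
print(classify(char))  # normal consonant
```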
FREEMAN CODE BASED ONLINE HANDWRITTEN CHARACTER RECOGNITION FOR MALAYALAM USI... (acijjournal)
Handwritten character recognition is the conversion of handwritten text into machine-readable and editable form; online character recognition deals with live conversion of characters as they are written. Malayalam is a language spoken by millions of people in the state of Kerala and the union territories of Lakshadweep and Pondicherry in India. It is written mostly in the clockwise direction and consists of loops and curves. The method trains a simple three-layer neural network using the backpropagation algorithm. Freeman codes represent each character as a feature vector; these feature vectors are the network's inputs during the training and testing phases. The output is the character expressed in Unicode format.
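A three-layer network trained by backpropagation on chain-code features can be sketched as follows. Everything here is illustrative, not the paper's configuration: the layer sizes, learning rate, and the two toy "characters" (normalized direction histograms) are assumptions made for the sake of a runnable example.

```python
# A minimal sketch of a three-layer network trained with backpropagation on
# Freeman-code histogram features. Sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8 inputs (one per chain-code direction), 6 hidden units, 2 classes.
W1 = rng.normal(0, 0.5, (8, 6))
W2 = rng.normal(0, 0.5, (6, 2))

def train_step(x, t, lr=0.5):
    """One backpropagation update; returns the squared error before it."""
    global W1, W2
    h = sigmoid(x @ W1)           # hidden activations
    y = sigmoid(h @ W2)           # output activations
    d2 = (y - t) * y * (1 - y)    # output-layer deltas
    d1 = (d2 @ W2.T) * h * (1 - h)  # hidden-layer deltas
    W2 -= lr * np.outer(h, d2)
    W1 -= lr * np.outer(x, d1)
    return float(((y - t) ** 2).sum())

# Two toy "characters": normalized direction histograms with one-hot targets.
data = [(np.array([4, 0, 0, 0, 4, 0, 0, 0]) / 8.0, np.array([1.0, 0.0])),
        (np.array([0, 0, 4, 0, 0, 0, 4, 0]) / 8.0, np.array([0.0, 1.0]))]

for epoch in range(2000):
    for x, t in data:
        train_step(x, t)
```

After training, the index of the most active output unit would be mapped to the corresponding Unicode character.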
An Optical Character Recognition for Handwritten Devanagari Script (IJERA Editor)
Optical Character Recognition is the process of recognizing characters from scanned documents, and many OCR systems are now available on the market. Most of these systems, however, work on Roman, Chinese, Japanese, or Arabic characters; there is not a sufficient body of work on Indian scripts such as Devanagari. This paper therefore presents a review of optical character recognition for handwritten Devanagari script.
OCR-THE 3 LAYERED APPROACH FOR CLASSIFICATION AND IDENTIFICATION OF TELUGU HA... (csandit)
Optical Character Recognition (OCR) is the digitization of handwritten, typewritten, or printed text into machine-encoded form, and it supports a wide range of everyday applications: OCR is used successfully in finance, legal services, banking, health care, and home appliances. India is a multicultural country with many literary and scripted traditions. Telugu, a southern Indian language, is a syllabic language: each script symbol represents a complete syllable and may be formed with conjunct consonants. Recognizing mixed conjunct consonants is harder than recognizing normal consonants because of variation in written strokes and conjunct mixing at the pre- and post-consonant level. This paper proposes a layered methodology to recognize characters, conjunct consonants, and mixed conjunct consonants, and presents an efficient classification of handwritten and printed conjunct consonants. An Advanced Fuzzy Logic controller takes text, written or printed, collected as images from scanned files or a digital camera; the image is processed by examining its intensity against a quality ratio, characters are extracted according to that quality, and character orientation, alignment, thickness, and base and print ratios are then checked. The input characters are classified in two ways: normal consonants and conjunct consonants. The digitized text is divided into three layers: the middle layer holds normal consonants, while the top and bottom layers hold the marks of mixed conjunct consonants. Recognition starts from the middle layer and then checks the top and bottom layers: a character is treated as a conjunct consonant when symbolic marks are detected in the top or bottom layer of the base character, and as a normal consonant otherwise. Post-processing is applied to all three layers, concentrating on the readability and compatibility of the recognized text; if readability fails, the process is repeated. The recognition process includes slant correction, thinning, normalization, segmentation, feature extraction, and classification, and the pre-processing, segmentation, character recognition, and post-processing modules of the algorithm are discussed. The main objectives of this paper are to develop classification and identification of different prototypes for written and printed consonants, conjunct consonants, and symbols based on a three-layered approach over different measurable areas using fuzzy logic, and to determine suitable features for handwritten character recognition.
Handwritten Character Recognition: A Comprehensive Review on Geometrical Anal... (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Segmentation of Handwritten Chinese Character Strings Based on improved Algor... (ijeei-iaes)
The Liu algorithm has attracted attention for its high accuracy in segmenting Japanese postal addresses, but its complexity and difficult implementation have hindered its popularization and application. In this paper, the author applies the principles of the Liu algorithm to handwritten Chinese character segmentation, according to the characteristics of handwritten Chinese characters and based on a deep study of the algorithm. The author also puts forward judgment criteria for classifying segmentation blocks and for the adhering (touching) modes of handwritten Chinese characters. During segmentation, a text image is viewed as a sequence of Connected Components (CCs), each made up of several horizontal runs of black pixels. The author determines whether these parts should be merged into a segment by analyzing the connected components, then segments the image according to the adhering mode based on an analysis of outline edges, and finally cuts the text image into character segments. Experimental results show that the improved Liu algorithm obtains high segmentation accuracy and produces satisfactory segmentation results.
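The connected-component view described above can be sketched as follows. This is a generic 8-connected labelling pass, not the Liu algorithm itself; the merge criteria and adhering-mode analysis of the paper would operate on the components this step produces.

```python
# A minimal sketch: group black pixels of a binary text image into
# 8-connected components, the raw units later merged into segments.
from collections import deque

def connected_components(img):
    """Return a list of components, each a set of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] == 1 and not seen[r][c]:
                comp, q = set(), deque([(r, c)])
                seen[r][c] = True
                while q:
                    cr, cc = q.popleft()
                    comp.add((cr, cc))
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            nr, nc = cr + dr, cc + dc
                            if (0 <= nr < h and 0 <= nc < w
                                    and img[nr][nc] == 1 and not seen[nr][nc]):
                                seen[nr][nc] = True
                                q.append((nr, nc))
                comps.append(comp)
    return comps

img = [[1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1]]
print(len(connected_components(img)))  # 2
```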
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it will be lost if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text via optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word-spotting technique based on codes for matching word images in Devanagari script: shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
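The retrieval step described above can be sketched as follows. The integer codes and document index here are illustrative stand-ins, not the paper's actual shape-coding scheme; the point is only that once words are reduced to code sequences, matching can be done with a standard sequence distance.

```python
# A minimal sketch: rank indexed word images by edit distance between their
# integer shape-code sequences and a query's code sequence.

def edit_distance(a, b):
    """Levenshtein distance between two integer code sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical index: document id -> shape-code sequence of a word image.
index = {"doc1": [3, 1, 4, 1, 5],
         "doc2": [2, 7, 1, 8, 2],
         "doc3": [3, 1, 4, 2, 5]}
query = [3, 1, 4, 1, 5]
ranked = sorted(index, key=lambda d: edit_distance(query, index[d]))
print(ranked[0])  # doc1
```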
The presentation describes an algorithm for recognizing Devanagari characters; Devanagari is the script in which Hindi is written. The algorithm automatically segments characters from an image of Devanagari text and then recognizes them. To extract individual characters, the image is segmented several times using vertical and horizontal projections: lines are first separated from the document using the horizontal projection, and each line is then split into words using its vertical projection. A further step particular to Devanagari is required: the header line is removed by taking the horizontal projection of each word, after which the characters can be extracted from the vertical projection of the word without the header line.
The algorithm uses a Kohonen neural network for the recognition task. After the characters are separated from the image, each character matrix is downsampled to a fixed size to make recognition size-independent. The matrix is then fed to the input neurons of the Kohonen network, and the winning neuron identifies the recognized character. This mapping is established during the training phase of the network: random weights are first assigned from the input to the output neurons, and for each training sample the winning neuron is the one producing the maximum output. The weights of the winning neuron are then adjusted so that it responds to that pattern more strongly the next time.
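The winner-take-all training just described can be sketched as follows. The network size, learning rate, and the two toy downsampled "characters" are illustrative assumptions, not values from the presentation.

```python
# A minimal sketch of Kohonen-style training: the winning output neuron is
# the one with the largest response, and only its weights are nudged toward
# the input pattern.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.random((4, 16))          # 4 output neurons, 16 input pixels

def winner(x):
    """Index of the neuron with the maximum output for pattern x."""
    return int(np.argmax(weights @ x))

def train(x, lr=0.3):
    """Move the winning neuron's weights toward the input pattern."""
    w = winner(x)
    weights[w] += lr * (x - weights[w])
    return w

# Two distinct 4x4 downsampled "characters", flattened to 16-vectors.
a = np.array([1.0] * 8 + [0.0] * 8)
b = np.array([0.0] * 8 + [1.0] * 8)
for _ in range(20):
    train(a)
    train(b)
```

After a few passes, distinct patterns end up with distinct winning neurons, so the winner's index can be used as the recognized character's label.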
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo... (iosrjce)
Segmentation plays a major role in processing script documents for feature extraction, and many researchers are working to make the segmentation process both simple and efficient. This paper proposes a simple technique for both line and word segmentation of a script document. The main objective of the technique is to recognize the spaces that separate two text lines; a similar procedure is followed for word segmentation. In this work, three different scanned documents were taken as input images for both the line and word segmentation techniques. The results were outstanding, with 100% accuracy for both line and word segmentation. Evaluation results show that the method outperforms several competing methods.
Recognition of Words in Tamil Script Using Neural Network (IJERA Editor)
This paper proposes word recognition using a neural network. The recognition process starts by partitioning the document image into lines, words, and characters, and then capturing local features of the segmented characters. After the characters are classified, the word image is transformed into a unique code based on the character codes; this code can describe any form of word, including words with mixed styles and different sizes. The sequence of character codes of a word forms the input pattern, and the word code is the target value of the pattern. A neural network is trained on the word patterns; the trained network is then tested with word patterns, and a word is recognized or rejected based on the network's error value. Experiments conducted on a local database to evaluate the word-recognition system yielded good accuracy. The method can be applied to word recognition in any language, since training is based only on the unique codes of the characters and words belonging to that language.
Handwritten character recognition is one of the most challenging and active areas of research in the field of pattern recognition. HCR research is mature for languages such as Chinese and Japanese, but the problem is much more complex for Indian languages, and more complicated still for South Indian languages due to their large character sets and the presence of vowel modifiers and compound characters. This paper provides an overview of important contributions and advances in offline as well as online handwritten character recognition of Malayalam script.
Rule based algorithm for handwritten characters recognition (Randa Elanwar)
Presentation of a Master's dissertation.
Contents:
Rule-based Algorithm for Off-line Isolated Handwritten character recognition
Rule-based Algorithm for On-line Arabic Cursive Handwriting Segmentation and Recognition
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE... (ijistjournal)
Named entity recognition (NER) is an application of Natural Language Processing and is regarded as a subtask of information retrieval. NER is the process of detecting Named Entities (NEs) in a document and categorizing them into named-entity classes such as organization, person, location, sport, river, city, country, and quantity. A lot of NER work has been accomplished for English, but comparatively little success has so far been achieved for the Indian languages. This paper discusses NER, the various approaches to it, performance metrics, the challenges of NER in the Indian languages, and finally some results achieved by performing NER in Hindi by aggregating approaches such as rule-based heuristics and a Hidden Markov Model (HMM).
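The HMM side of such an approach can be sketched as follows: Viterbi decoding of named-entity tags for a word sequence. The tiny tag set and hand-set probabilities are illustrative, not trained values from the paper, and the example sentence is hypothetical.

```python
# A minimal sketch of HMM-based NE tagging with Viterbi decoding.
# Toy tag set and hand-set probabilities; a real system estimates these
# from an annotated corpus.

tags = ["O", "PER", "LOC"]
start = {"O": 0.6, "PER": 0.2, "LOC": 0.2}
trans = {"O":   {"O": 0.8, "PER": 0.1, "LOC": 0.1},
         "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
         "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2}}
emit = {"O":   {"went": 0.5, "to": 0.5},
        "PER": {"mohan": 0.9, "went": 0.05, "to": 0.05},
        "LOC": {"delhi": 0.9, "went": 0.05, "to": 0.05}}

def viterbi(words):
    """Most likely tag sequence for `words` under the toy HMM."""
    v = [{t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: v[-1][p] * trans[p][t])
            ptr[t] = best
            col[t] = v[-1][best] * trans[best][t] * emit[t].get(w, 1e-6)
        v.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["mohan", "went", "to", "delhi"]))  # ['PER', 'O', 'O', 'LOC']
```

In the aggregated system, rule-based heuristics (gazetteers, suffix patterns) would typically constrain or override these statistical tag decisions.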
A Comprehensive Study On Handwritten Character Recognition System (iosrjce)
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
The presentation will describe an algorithm through which one can recognize Devanagari Characters. Devanagari is the script in which Hindi is represented. This algorithm
could automatically segment character from the image of Devenagari text and then recognize them.
For extracting the individual characters from the image of Devanagari text, algorithm segmented the image several
times using the vertical and horizontal projection.
The algorithm starts with first segmenting the lines separately from the document by taking horizontal projection and then the line
into words by taking vertical projection of the line. Another step which is particular to the separation of
Devanagari characters was required and was done by first removing the header line by finding horizontal projection
of each word. The characters can then be extracted by vertical projection of the word without the header line.
Algorithm uses a Kohonen Neural Netowrk for the recognition task. After the separation of the characters from the
image, the image matrix was then downsampled to bring it down to a fixed size so as to make the recognition
size independent. The matrix can then be fed as input neurons to the Kohonen Neural Network and the winning neuron is
found which identifies the recognized the character. This information in Kohonen Neural Network was stored
earlier during the training phase of the neural network. For this, we first assigned random weights from input neurons
to output neurons and then for each training set, the winning neuron was calculated by finding the maximum
output produced by the neurons. The wights for this winning neuron were then adjusted so that it responds to this
pattern more strongly the next time.
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo...iosrjce
Segmentation technique plays a major role in scripting the documents for extraction of various
features. Many researchers are doing various research works in this field to make the segmenting process
simple as well as efficient. In this paper a simple segmentation technique for both the line and word
segmentation of a script document has been proposed. The main objective of this technique is to recognize the
spaces that separate two text lines.For the Word segmentation technique also similar procedure is followed. In
this work ,three different scanned document have been taken as input images for both line and word
segmentation techniques. The results found were outstanding with average accuracy for both line and word. It
provides 100% accuracy for line segmentation and 100% for line segmentation as well. Evaluation results show
that our method outperforms several competing methods.
Recognition of Words in Tamil Script Using Neural NetworkIJERA Editor
In this paper, word recognition using neural network is proposed. Recognition process is started with the partitioning of document image into lines, words, and characters and then capturing the local features of segmented characters. After classifying the characters, the word image is transferred into unique code based on character code. This code ideally describes any form of word including word with mixed styles and different sizes. Sequence of character codes of the word form input pattern and word code is a target value of the pattern. Neural network is used to train the patterns of the words. Trained network is tested with word patterns and is recognized or unrecognized based on the network error value. Experiments have been conducted with a local database to evaluate the performance of the word recognizing system and obtained good accuracy. This method can be applied for any language word recognition system as the training is based on only unique code of the characters and words belonging to the language.
Handwritten character recognition is one of the most challenging and ongoing areas of research in the
field of pattern recognition. HCR research is matured for foreign languages like Chinese and Japanese but
the problem is much more complex for Indian languages. The problem becomes even more complicated for
South Indian languages due to its large character set and the presence of vowels modifiers and compound
characters. This paper provides an overview of important contributions and advances in offline as well as
online handwritten character recognition of Malayalam scripts.
Rule based algorithm for handwritten characters recognitionRanda Elanwar
Presentation of Master Dissertation
Content:
Rule-based Algorithm for Off-line Isolated Handwritten character recognition
Rule-based Algorithm for On-line Arabic Cursive Handwriting Segmentation and Recognition
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
Named entity recognition (NER) is one of the applications of Natural Language Processing and is regarded as the subtask of information retrieval. NER is the process to detect Named Entities (NEs) in a document and to categorize them into certain Named entity classes such as the name of organization, person, location, sport, river, city, country, quantity etc. In English, we have accomplished lot of work related to NER. But, at present, still we have not been able to achieve much of the success pertaining to NER in the Indian languages. The following paper discusses about NER, the various approaches of NER, Performance Metrics, the challenges in NER in the Indian languages and finally some of the results that have been achieved by performing NER in Hindi by aggregating approaches such as Rule based heuristics and Hidden Markov Model (HMM).
A Comprehensive Study On Handwritten Character Recognition Systemiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A survey of named entity recognition in assamese and other indian languagesijnlc
Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
Optical Character Recognition System for Urdu (Naskh Font)Using Pattern Match...CSCJournals
The offline optical character recognition (OCR) for different languages has been developed over the recent years. Since 1965, the US postal service has been using this system for automating their services. The range of the applications under this area is increasing day by day, due to its utility in almost major areas of government as well as private sector. This technique has been very useful in making paper free environment in many major organizations as far as the backup of their previous file record is concerned. Our this system has been proposed for the Offline Character Recognition for Isolated Characters of Urdu language, as Urdu language forms words by combining Isolated Characters. Urdu is a cursive language, having connected characters making words. The major area of utility for Urdu OCR will be digitizing of a lot of literature related material already stocked in libraries. Urdu language is famous and spoken in more than 3 big countries including Pakistan, India and Bangladesh. A lot of work has been done in Urdu poetry and literature up to the recent century. Creation of OCR for Urdu language will make an important role in converting all those work from physical libraries to electronic libraries. Most of the stuff already placed on internet is in the form of images having text, which took a lot of space to transfer and even read online. So the need of an Urdu OCR is a must. The system is of training system type. It consists of the image preprocessing, line and character segmentation, creation of xml file for training purpose. While Recognition system includes taking xml file, the image to be recognized, segment it and creation of chain codes for character images and matching with already stored in xml file. The system has been implemented and it has 89% recognition accuracy with a 15 char/sec recognition rate.
Preprocessing Phase for Offline Arabic Handwritten Character RecognitionEditor IJCATR
—In this paper we reviewed the importance issues of the optical character recognition, gives more emphases for OCR and its phases. We discuss the main characteristics of Arabic language, furthermore it focused on the pre-processing phase of the character recognition system. We described and implemented the algorithms of binarization, dots removing and thinning which will be used for feature extraction phase. The algorithms are tested using 47,988 isolated character sample taken from SUST/ ALT dataset and achieved better results. The pre-processing phase developed by using MATLAB software
Handwriting character recognition (HCR) is the ability of a computer to receive and interpret handwritten input. Handwritten Character Recognition is one of the active and challenging research areas in the field of Pattern Recognition. Pattern recognition is a process that taking in raw data and making an action based on the category of the pattern. HCR is one of the well-known applications of pattern recognition. Handwriting recognition especially for Indian languages is still in infant stage because not much work has been done it. This paper discuss about an idea to recognize Kannada vowels using chain code features. Kannada is a South Indian language. For any recognition system, an important part is feature extraction. A proper feature extraction method can increase the recognition ratio. In this paper, a chain code based feature extraction method is investigated for developing HCR system. Chain code is working based on 4-neighborhood or 8–neighborhood methods. Chain code is a sequence of code directions of a character and connection to a starting point which is often used in image processing. In this paper, 8–neighborhood method has been implemented which allows generation of eight different codes for each character. These codes have been used as features of the character image, which have been later on used for training and testing for K-Nearest Neighbor (KNN) classifiers. The level of accuracy reached to 100%.
International Journal of Research in Engineering and Science is an open access peer-reviewed international forum for scientists involved in research to publish quality and refereed papers. Papers reporting original research or experimentally proved review work are welcome. Papers for publication are selected through peer review to ensure originality, relevance, and readability.
Wavelet Packet Based Features for Automatic Script IdentificationCSCJournals
In a multi script environment, an archive of documents having the text regions printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in seven scripts, to categorize them for further processing. The South Indian documents printed in the seven scripts - Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English are considered here. The document images are decomposed through the Wavelet Packet Decomposition using the Haar basis function up to level two. The texture features are extracted from the sub bands of the wavelet packet decomposition. The Shannon entropy value is computed for the set of sub bands and these entropy values are combined to use as the texture features. Experimentation conducted involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier. The average success rate is found to be 99.68%.
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISONIJCSEA Journal
The massive grow of the modern information retrieval system (IRS), especially in natural languages
becomes more difficult. The search in Arabic languages, as natural language, is not good enough yet. This
paper will try to build similar thesaurus based on Arabic language in two mechanisms, the first one is full
word mechanisms and the other is stemmed mechanisms, and then to compare between them.
The comparison made by this study proves that the similar thesaurus using stemmed mechanisms get more
better results than using traditional in the same mechanisms and similar thesaurus improved more the
recall and precision than traditional information retrieval system at recall and precision levels.
An effective approach to offline arabic handwriting recognitionijaia
Segmentation is the most challenging part of the Arabic handwriting recognition, due to the unique
characteristics of Arabic writing that allows the same shape to denote different characters. In this paper,
an off-line Arabic handwriting recognition system is proposed. The processing details are presented in
three main stages. Firstly, the image is skeletonized to one pixel thin. Secondly, transfer each diagonally
connected foreground pixel to the closest horizontal or vertical line. Finally, these orthogonal lines are
coded as vectors of unique integer numbers; each vector represents one letter of the word. In order to
evaluate the proposed techniques, the system has been tested on the IFN/ENIT database, and the
experimental results show that our method is superior to those methods currently available.
Off line system for the recognition of handwritten arabic charactercsandit
Recognition of handwritten Arabic text awaits accurate recognition solutions. There are many
difficulties facing a good handwritten Arabic recognition system such as unlimited variation in
human handwriting, similarities of distinct character shapes, and their position in the word. The
typical Optical Character Recognition (OCR) systems are based mainly on three stages,
preprocessing, features extraction and recognition.
In this paper, we present an efficient approach for the recognition of off-line Arabic handwritten
characters which is based on structural, Statistical and Morphological features from the main
body of the character and also from the secondary components. Evaluation of the accuracy of
the selected features is made. The system was trained and tested with CENPRMI dataset. The
proposed algorithm obtained promising results in terms of accuracy (success rate of 100% for
some letters at average 88%). In Comparable with other related works we find that our result is
the highest among others.
Technology selection for a given problem is often a tough ask. This is immensely useful comparative analysis betweeen Greenplum, Vectorwise and Amazon Redshift.
Steve Jobs has not only revolutionized high tech industry by introducing ground breaking products, but he has also showed us the way for managing organization and personal life.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Online aptitude test management system project report.pdfKamal Acharya
The purpose of on-line aptitude test system is to take online test in an efficient manner and no time wasting for checking the paper. The main objective of on-line aptitude test system is to efficiently evaluate the candidate thoroughly through a fully automated system that not only saves lot of time but also gives fast results. For students they give papers according to their convenience and time and there is no need of using extra thing like paper, pen etc. This can be used in educational institutions as well as in corporate world. Can be used anywhere any time as it is a web based application (user Location doesn’t matter). No restriction that examiner has to be present when the candidate takes the test.
Every time when lecturers/professors need to conduct examinations they have to sit down think about the questions and then create a whole new set of questions for each and every exam. In some cases the professor may want to give an open book online exam that is the student can take the exam any time anywhere, but the student might have to answer the questions in a limited time period. The professor may want to change the sequence of questions for every student. The problem that a student has is whenever a date for the exam is declared the student has to take it and there is no way he can take it at some other time. This project will create an interface for the examiner to create and store questions in a repository. It will also create an interface for the student to take examinations at his convenience and the questions and/or exams may be timed. Thereby creating an application which can be used by examiners and examinee’s simultaneously.
Examination System is very useful for Teachers/Professors. As in the teaching profession, you are responsible for writing question papers. In the conventional method, you write the question paper on paper, keep question papers separate from answers and all this information you have to keep in a locker to avoid unauthorized access. Using the Examination System you can create a question paper and everything will be written to a single exam file in encrypted format. You can set the General and Administrator password to avoid unauthorized access to your question paper. Every time you start the examination, the program shuffles all the questions and selects them randomly from the database, which reduces the chances of memorizing the questions.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Water billing management system project report.pdfKamal Acharya
Our project entitled “Water Billing Management System” aims is to generate Water bill with all the charges and penalty. Manual system that is employed is extremely laborious and quite inadequate. It only makes the process more difficult and hard.
The aim of our project is to develop a system that is meant to partially computerize the work performed in the Water Board like generating monthly Water bill, record of consuming unit of water, store record of the customer and previous unpaid record.
We used HTML/PHP as front end and MYSQL as back end for developing our project. HTML is primarily a visual design environment. We can create a android application by designing the form and that make up the user interface. Adding android application code to the form and the objects such as buttons and text boxes on them and adding any required support code in additional modular.
MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software. It is a stable ,reliable and the powerful solution with the advanced features and advantages which are as follows: Data Security.MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Multitier holistic Approach for urdu Nastaliq Recognition
1. A Multi-tier Holistic approach for Urdu Nastaliq Recognition
Syed. Afaq Husain* and Syed. Hassan Amin**
Faculty of Computer Science and Engineering
Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology
Topi, 23460, Dist. Swabi, NWFP, PAKISTAN
Email:* syed_a_h@giki.edu.pk_ , **shassan@giki.edu.pk
Abstract
Character recognition is an active area of research
with numerous applications including web publishing,
document analysis and text to speech conversion. In this
paper, we present a new approach for the off-line
recognition of cursive Urdu Text. This methodology has
been developed for the Noori Nastaliq Script [Ahmed 1].
Word (Ligature) based identification has been adopted
instead of character based identification. A multi-tier
holistic approach has been utilized to recognize ligatures
from a pre-defined ligature set. Initially, the special
ligatures (Dots, Tay, Hamza & Mad) are identified from
the base ligatures. These special ligatures are associated to
the most probable neighboring base ligature in the second
step. Finally, the above information along with some other
RTS invariant features of base ligature is presented to the
Feed Forward Back Propagation neural network to
perform the final recognition task.
Keywords: OCR, Urdu Character Recognition, Noori
Nastaliq, Ligature based identification, Back-propagation
Neural Network.
1. Objective
Urdu is the national language of Pakistan. It is a
language that is understood by over 300 million people
belonging to Pakistan, India and Bangladesh. Due to its
historical database of literature, there is a need to devise
automatic systems for conversion of this literature into
electronic form that may be accessible on the world-wide-
web. The suggested Urdu Text recognition system
endeavors to convert scanned Urdu documents
automatically into computerized text files in UZT format.
The Diacritics (Aerab) and punctuation have been
ignored in the current version of the system, however may
be classified as another category of symbols. Multi-Font
and multi-lingual support has also been ignored for
simplification.
2. Introduction
Urdu character set is based on the Arabic
character set. It is a cursive language even in its printed
form. In the past, a lot of research has been done on
automatic recognition of text written in languages based
on Roman [Guyon],[Ha], Chinese text [Guo],[Ding],
Arabic [Amin1] and Persian [Khorsheed3] but no serious
research has ever been published on Urdu text recognition.
Arabic and Persian, which are based on similar basic
characters and writing styles as Urdu, have seen quite
worthwhile research in the past decade. However, those
solutions are not valid to Urdu due to a number of inherent
differences in the script and styles of Urdu text. Nasakh
and Nastaliq are the two most popular writing styles
(scripts) in Urdu and both have their own unique features
that make them different and more complicated than their
close counterparts. The following chart (Table 1)
represents a view of the comparative complexities of Urdu
Script as compared to some other languages.
Like Arabic, recognizing Urdu script presents
challenges of cursive orthography and context sensitive
letter shape [Khorsheed2]. However, in contrast to Arabic
text, in which connected characters follows a base line, the
joined characters in Nastaliq and Nasakh are positioned
according to their preceding, pro-ceding as well as a
vertical justification of the ligature.
Table 1: Comparative features of some languages
The word recognition strategies are generally
classified into three categories, namely Holistic Approach,
Analytic Approach and Feature Sequence Matching.
[Shridher]. However, some researchers regard the
Sequence matching techniques to be a form of Holistic
approach. The analytic approach tries to segment the word
into characters before the recognition task while the
holistic approaches tries to recognize the word or its sub-
part (ligature) as a whole. [Khorsheed1]. The first
approach segment Urdu words into characters, and second
approach segment words into symbols. These symbols
may be character, ligature or possibly a fraction of
character.
In this paper, we present an approach to
recognize commonly used ligatures from Noori Nastaliq
Script developed by Ahmad Mirza Jamil [Ahmed1].
Nastaliq is one of the most beautiful and one of the most
complex scripts. The script was originally created by the
Characteristics Urdu Arabic Latin Hebrew Hindi
H Justification R L R L L R R L L R
V-Justification Centre Base No No Top
Cursive Yes Yes No No Yes
Diacritics Yes Yes No No Yes
# Vowels 2 2 5 11 -
# Letters 37 28 26 22 40
Letter Shapes 1-28 1-4 2 1 1
Complementary
Characters
5 3- - - -
2. calligrapher Mir Ali Tabrezi. The attempts to mechanize
Urdu script didn’t bear any success for a long time, and as
a result a typewriter that could type in the Nastaliq style, is
not available even today. There are two approaches to
computerizing Nastaliq i.e. Ligature based approach (more
glyphs) and character based approach (more rules). For
example, the word has three ligatures or separate
shapes , and . Noori Nastaliq describes about
20000 ligatures that are required to write almost all words
contained in the Urdu dictionary. Since, the ligature based
recognition is dependent on the ligatures used for training
it has the context information due to which it has a higher
performance. However, it has the disadvantage that adding
new ligatures into the system would require re-training of
the system. E.g. the. Urdu word Computer is one ligature
that is not in the formal dictionary of ligatures though it is
widely written in Urdu text.
3. Character Recognition Schemes
The problem of Urdu text recognition is closely
related to Arabic text recognition. Arabic Text
Recognition Systems generally have following stages:
image acquisition, preprocessing, segmentation, feature
extraction, classification and recognition [Khorsheed3].
The Arabic Text Recognition Systems are further
divided into Segmentation based and Segmentation-free
systems. Here we briefly describe approaches into Arabic
Text Recognition, with the view that these give valuable
insight into problem of Urdu Text Recognition [Bunke].
3.1 Segmentation Free Systems
In these systems, the word is recognized as a
whole without trying to segment and recognize characters
or primitives [7]. One approach for such systems is to
calculate a single feature vector for each word; this feature
vector is then used to recognize the word.
3.2 Segmentation Based Systems
In segmentation-based systems, each word is
divided into a number of subparts. These systems fall into
four categories: isolated/pre-segmented characters,
segmenting a word into characters, segmenting a word into
primitives, and integrated recognition and segmentation.
Such systems are either impractical, because they only
recognize digits and isolated characters, or they suffer a
low recognition rate because of segmentation errors
[Khorsheed2].
4. Ligature Identification System
In our proposed system, after preprocessing, the
text is segmented into a number of ligatures ordered from
right to left and top to bottom. A ligature at this stage is
defined as any connected set of characters. These ligatures
also include the special symbols used in Urdu, namely Tau,
Mad, Dots, Hamza and Ha. A number of features are
calculated and fed into a feed-forward backpropagation
neural network that separates special ligatures from base
ligatures. Each special ligature is then associated with a
base ligature, and the association contributes to the feature
vector of that base ligature, aiding its recognition. This
final feature vector is fed into a second feed-forward
backpropagation neural network that recognizes the base
ligatures.
Figure 1: Stages of Urdu Character Recognition
4.1 Preprocessing
The preprocessing stage involves smoothing,
skew detection and correction, document decomposition,
slant normalization, etc.
4.2 Segmentation
In document image analysis, four commonly used
segmentation algorithms are connected component
labeling, X-Y tree decomposition, run-length smearing,
and Hough Transform.
We apply connected component labeling to the
image of Urdu text. This technique assigns a distinct label
to each connected component of the binary image; the
labels are usually the natural numbers from 1 to the
number of connected components in the input image. The
algorithm scans the image from left to right and top to
bottom. On the first line containing black pixels, a unique
label is assigned to each contiguous run of black pixels.
For each subsequent black pixel, the pixels in its eight-
neighborhood are examined; if any of them has already
been labeled, the same label is assigned to the current
pixel, otherwise a new label is assigned. The procedure
continues to the bottom of the image [Khorsheed3].
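The two-pass flavor of this scheme can be sketched as follows (a minimal illustration, not the authors' implementation: the image is a 0/1 matrix, and label equivalences found during the scan are resolved with union-find):

```python
def label_components(img):
    """Two-pass connected-component labeling with 8-connectivity.

    img: 2D list of 0/1 values (1 = black pixel).
    Returns a 2D list of labels (0 = background) and the component count.
    """
    rows, cols = len(img), len(img[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # union-find over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_label = 1
    # First pass: assign provisional labels, record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            # Labels already assigned among the neighbours scanned so far.
            neighbours = [labels[r + dr][c + dc]
                          for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1))
                          if 0 <= r + dr and 0 <= c + dc < cols
                          and labels[r + dr][c + dc]]
            if neighbours:
                labels[r][c] = min(neighbours)
                for n in neighbours:
                    union(labels[r][c], n)
            else:
                labels[r][c] = next_label
                parent[next_label] = next_label
                next_label += 1
    # Second pass: replace provisional labels with canonical 1..N labels.
    canonical = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                if root not in canonical:
                    canonical[root] = len(canonical) + 1
                labels[r][c] = canonical[root]
    return labels, len(canonical)
```

Each resulting label then corresponds to one candidate ligature (base or special) for the later stages.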
4.3 Feature Extraction I
In this stage, we extract only those features that
help in the recognition of special ligatures, see figure.
These features are solidity, number of holes, axis ratio,
eccentricity, moments, normalized segment length,
curvature, and the ratio of bounding box width to height.
(Figure 1: Preprocessing → Segmentation → Feature Extraction I → Special Ligature Identification → Feature Extraction II → Ligature Identification)
4.3.1 Solidity
Solidity is a scalar quantity, defined as the
proportion of the pixels in the convex hull that are also in
the region. It is computed as
Solidity = Ligature Area / Convex Hull Area
where
Ligature Area = ∑∑ f(x, y) for all (x, y) in the binary image of the ligature,
Convex Hull Area = ∑∑ f(x, y) for all (x, y) in the convex hull of the ligature.
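As an illustration, solidity can be computed from the foreground pixel coordinates alone. This is a hedged sketch, not the authors' code: the convex hull is built over pixel centres with Andrew's monotone chain, and the hull area is counted with a point-in-polygon test; practical systems typically use a library routine.

```python
def solidity(points):
    """Solidity = ligature area / convex-hull area (the formula above).

    points: list of (x, y) foreground pixel coordinates.
    Ligature area is the foreground pixel count; hull area is the number
    of pixel centres inside the convex hull of the foreground pixels.
    """
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    pts = sorted(set(points))
    if len(pts) < 3:  # degenerate region: hull coincides with the region
        return 1.0
    # Andrew's monotone-chain convex hull (counter-clockwise order).
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    # Count pixel centres inside (or on) the hull polygon.
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    hull_pixels = sum(
        1
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
        if all(cross(hull[i], hull[(i + 1) % len(hull)], (x, y)) >= 0
               for i in range(len(hull)))
    )
    return len(pts) / hull_pixels
```

A fully convex region gives solidity 1; concave shapes such as a U-shaped stroke score lower.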
4.3.2 Axes Ratio
It is the ratio of the major axis to the minor axis
of the best-fit ellipse of the ligature:
Axis Ratio = a/b
where a and b are the lengths of the semi-major and
semi-minor axes of the best-fit ellipse.
4.3.3 Eccentricity
It is the ratio of the distance between the foci of the
best-fit ellipse to the length of its major axis:
Eccentricity = distance between foci / 2a
where 2a is the length of the major axis.
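One common way to obtain the best-fit ellipse (an assumption; the paper does not specify the fitting method) is from the second-order central moments of the foreground pixels, whose covariance eigenvalues give the squared semi-axes up to a common scale factor that cancels in both ratios:

```python
import math

def ellipse_features(points):
    """Axis ratio and eccentricity of the moment-based best-fit ellipse."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Central second-order moments (covariance of pixel coordinates).
    mxx = sum((x - cx) ** 2 for x, _ in points) / n
    myy = sum((y - cy) ** 2 for _, y in points) / n
    mxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Eigenvalues of the covariance matrix: larger -> major axis.
    common = math.sqrt((mxx - myy) ** 2 + 4 * mxy ** 2)
    l1 = (mxx + myy + common) / 2
    l2 = (mxx + myy - common) / 2
    a, b = math.sqrt(l1), math.sqrt(l2)
    axis_ratio = a / b                              # a/b, as above
    eccentricity = math.sqrt(1 - (b / a) ** 2)      # = 2c / 2a
    return axis_ratio, eccentricity
```

A circular blob yields axis ratio 1 and eccentricity 0; elongated strokes approach eccentricity 1.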
4.3.4 Moment based features
These refer to certain functions of moments that
are invariant to geometric transformations such as
translation, scaling, and rotation [6]. Such features are
useful for identifying objects with distinctive shapes
regardless of their location, size and orientation.
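For concreteness, here is a sketch of the first two Hu moment invariants, the classic translation-, scale- and rotation-invariant moment functions; this is an assumption, as the paper does not name the specific invariants it uses:

```python
def hu_invariants(points):
    """First two Hu moment invariants of a set of foreground pixels."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def mu(p, q):  # central moment of order (p, q)
        return sum((x - cx) ** p * (y - cy) ** q for x, y in points)

    def eta(p, q):  # normalized central moment
        return mu(p, q) / mu(0, 0) ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Translating or rotating the point set leaves both values unchanged, which is exactly the property that makes them useful for shape recognition.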
4.3.5 Normalized Length Feature
First, the normalized length L(i) of segment i is
calculated relative to the lengths of the other segments in
the same word. The normalized length of the ligature is
then calculated as
Normalized Length = ∑ L(i)
4.3.6 Curvature Feature:
In a similar fashion, the curvature of a segment is
first measured by dividing the Euclidean distance between
the two feature points of that segment by its actual (traced)
length:
C(i) = (Euclidean distance between endpoints) / segment length
This value equals zero when the segment is a loop and 1
when the segment is a straight line. The curvature feature
of the ligature is then the sum of the curvature values of
all of its segments:
Curvature Feature = ∑ C(i)
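Given each segment's endpoints and traced length, both C(i) and the summed curvature feature follow directly. This is a minimal sketch; the tuple representation of a segment is an assumption for illustration:

```python
import math

def curvature_feature(segments):
    """Curvature feature of a ligature, as defined above.

    segments: list of ((x1, y1), (x2, y2), length) tuples, where length
    is the traced (arc) length between the segment's two feature points.
    C(i) = 0 for a loop (coincident endpoints), 1 for a straight line.
    """
    def c(seg):
        (x1, y1), (x2, y2), length = seg
        return math.hypot(x2 - x1, y2 - y1) / length

    return sum(c(s) for s in segments)
```

For example, a straight 3-4-5 segment contributes 1.0, while a closed loop of any length contributes 0.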
4.3.7 Number of Holes:
This feature gives the total number of holes in a ligature.
If the feature points of the ligature are considered as a set
of vertices V, and the segments as a set of edges E, of a
graph G(V, E), then the total number of holes in the
ligature can be found using graph theory as follows:
Number of Holes = E - Est
Here,
E = number of edges in G,
Est = number of edges in a spanning tree of G.
A connected graph with N vertices has N - 1 edges in its
spanning tree.
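The count E - Est can be computed with a union-find pass that builds a spanning forest; counting spanning edges directly also handles a ligature graph that happens not to be fully connected (a sketch, not the authors' code):

```python
def number_of_holes(num_vertices, edges):
    """Holes (independent cycles) in the ligature graph G(V, E).

    Equals E - Est, where Est is the number of edges in a spanning
    forest of G (V - 1 for a single connected component).
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    spanning_edges = 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:          # edge joins two components: spanning edge
            parent[rv] = ru
            spanning_edges += 1
    return len(edges) - spanning_edges
```

A triangle (3 vertices, 3 edges) has one hole; any tree has none.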
4.4 Special Ligature Identification
For identifying special ligatures, a feed-forward
backpropagation neural network with 15 input, 25 hidden
and 25 output neurons is used. The feature vectors
obtained from the Feature Extraction I stage are fed to
this network, which classifies each ligature as either a
special ligature or a base ligature.
Figure 2: Some special ligatures
4.5 Feature Extraction II
In this stage, we associate special ligatures with
base ligatures. Each special ligature is associated with the
base ligature whose centroid-to-centroid distance is
minimum: a number of lines are grown from the centre of
each special ligature, and when one of these lines touches
a base ligature, the special ligature is associated with that
base ligature. This association adds twenty new features
to the feature vector of the base ligature.
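The centroid-to-centroid criterion can be sketched as a nearest-centroid search (illustrative only; the line-growing refinement described above is omitted):

```python
import math

def associate(special_centroids, base_centroids):
    """Map each special ligature to the index of the base ligature
    whose centroid is nearest (centroid-to-centroid criterion)."""
    def nearest(sc):
        return min(range(len(base_centroids)),
                   key=lambda i: math.dist(sc, base_centroids[i]))

    return [nearest(sc) for sc in special_centroids]
```

For instance, with base centroids at (0, 0) and (10, 0), a dot centred at (1, 1) is assigned to the first base ligature and one at (9, -1) to the second.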
4.6 Ligature Identification
In this stage, the final feature vector, consisting of
34 features, is fed into a feed-forward backpropagation
neural network. The network architecture consists of 34
input, 65 hidden and 45 output neurons.
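The forward pass of such a 34-65-45 network can be sketched with NumPy. Random weights stand in for the trained ones, and sigmoid activations are an assumption, since the paper does not state the activation function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 34 inputs, 65 hidden, 45 output neurons.
W1 = rng.standard_normal((65, 34)) * 0.1
b1 = np.zeros(65)
W2 = rng.standard_normal((45, 65)) * 0.1
b2 = np.zeros(45)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(features):
    """Map a 34-feature vector to 45 output-class activations."""
    hidden = sigmoid(W1 @ features + b1)
    return sigmoid(W2 @ hidden + b2)

scores = forward(rng.standard_normal(34))
predicted_class = int(np.argmax(scores))  # index of the recognized ligature
```

In the actual system the weights would of course come from backpropagation training on the labeled ligature set.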
5. Results
The system was trained on a set of two hundred
carefully selected ligatures. Testing was done on bitmap
images of Urdu text rendered in Nastaliq font using a text
editor.
This simplified the problem by avoiding the
preprocessing otherwise required to remove noise
introduced during image acquisition. The training set
contained the simpler and more commonly used ligatures.
The performance of the system on images
containing only trained ligatures was 100%. However, in
cases where images contained additional ligatures, these
were classified to the closest match in the training set; no
rejection class was used.
6. Conclusion
In this paper, we have presented a method for
recognition of cursive Urdu text written in Nastaliq script.
The system is currently trained on a small number of
ligatures but can be expanded for practical use. Our
approach avoids segmentation errors by using a
segmentation-free approach. By using multiple classes of
features, we have increased the number of ligatures that
can be distinguished.
7. Future Directions
A number of directions are under consideration
for enhancing the system for practical use, namely:
1. Increasing the number of ligatures used for training.
2. Adding special characters, numerals and Aerab for
recognition as special ligatures.
3. Recognizing intonation marks in the document.
4. Adding multilingual support to the system.
References
1. [Ahmed] Ahmad Mirza Jamil, "Noori Nastaliq, Computerized Urdu Calligraphy", Elite Publishers, 1982.
2. [Amin] A. Amin and S. Al-Fedaghi, "Machine recognition of printed Arabic text utilizing a natural language morphology", Int. J. of Man-Machine Studies 35(6), 1991, 768-788.
3. [Badr] Badr Al-Badr, Robert M. Haralick, "Segmentation-free word recognition with application to Arabic", IJDAR 1(3):147-166, 1998.
4. [Bunke] H. Bunke, P. Wang, "Handbook of Character Recognition and Document Image Analysis", World Scientific, 2000.
5. [Ding] X. Q. Ding, Y. S. Wu, "Recognition of multi-font printed Chinese characters", CCIPP/CLCS, Toronto, Canada, 1988.
6. [Guo] H. Guo, X. Q. Ding, "The development of a high-performance Chinese/English bilingual OCR system", Proc. CMIN '95, Beijing, China, March 1995, 248-253.
7. [Guyon] I. Guyon, J. Bromley, N. Matic, et al., "A neural network system for recognizing on-line handwriting", Models of Neural Networks, Springer Verlag, 1996.
8. [Ha] J. Y. Ha, S. C. Oh, J. H. Kim, and Y. B. Kwon, "Unconstrained handwritten word recognition with interconnected hidden Markov models", 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, May 1993, 455-460.
9. [Khorsheed1] Mohammad S. Khorsheed, William F. Clocksin, "Structural features of cursive Arabic script", Proc. of the 10th British Machine Vision Conference, University of Nottingham, UK, September 1999.
10. [Khorsheed2] M. S. Khorsheed, "Off-Line Arabic Character Recognition: A Review".
11. [Khorsheed3] Mohammad S. Khorsheed, "Automatic recognition of words in Arabic manuscripts", PhD Dissertation, Churchill College, University of Cambridge, June 2000.
12. [Shridher] N. Shridher, F. Kimura, "Segmentation-based cursive handwriting recognition", Handbook of Character Recognition and Document Image Analysis, World Scientific, 1997, 126-127.
13. [Trier] Øivind Due Trier, Anil K. Jain, and Torfinn Taxt, "Feature Extraction Methods for Character Recognition - A Survey", Pattern Recognition, Vol. 29, No. 4, pp. 641-662, 1996.