Script Identification of Text Words from a Tri-Lingual Document Using Voting ... (CSCJournals)
In a multi-script environment, the majority of documents may contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper proposes a model to identify and separate text words of Kannada, Hindi and English scripts from a printed tri-lingual document. The proposed method is trained to learn the distinct features of each script thoroughly, and a binary tree classifier is used to classify the input text image. Experimentation involved 1500 text words for learning and 1200 text words for testing, carried out on both a manually created data set and a data set built from scanned documents. The results are very encouraging and demonstrate the efficacy of the proposed model: the average success rate is 99% for the manually created data set and 98.5% for the data set constructed from scanned document images.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it will remain inaccessible if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text through optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching word images of the Devanagari script. Shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
An Optical Character Recognition for Handwritten Devanagari Script (IJERA Editor)
Optical Character Recognition is the process of recognizing characters from scanned documents, and many OCR systems are now available in the market. However, most of these systems work for Roman, Chinese, Japanese and Arabic characters. There is not yet a sufficient body of work on Indian scripts such as Devanagari, so this paper presents a review of optical character recognition for handwritten Devanagari script.
Angular Symmetric Axis Constellation Model for Off-line Odia Handwritten Char... (IJAAS Team)
Optical character recognition is one of the emerging research topics in the field of image processing, and it has an extensive area of application in pattern recognition. Odia handwritten script is an active research concern because Odia is one of the oldest and most widely used languages of the state of Odisha, India. Odia characters are usually handwritten and are captured by a scanner into machine-readable form. Several recognition techniques have evolved for various languages, but the writing pattern of Odia characters is highly curved in appearance, which makes recognition more difficult. In this article we present a novel approach to Odia character recognition based on an angle-based symmetric axis feature extraction technique that yields high recognition accuracy. This empirical model generates unique angle-based boundary points on every skeletonised character image, and these points are interconnected to extract row and column symmetry axes. We extract a feature matrix consisting of the mean distance and mean angle of the row axis and the mean distance and mean angle of the column axis, measured from the centre of the image to the midpoint of the symmetric axis. The system applies 10-fold validation to random forest (RF) and SVM classifiers on the feature matrix. For simulation we considered a standard database of 200 images for each of 47 Odia characters and 10 Odia numerals. The simulations of SVM and RF yield accuracy rates of 96.3% and 98.2% on the NIT Rourkela Odia character database, and 88.9% and 93.6% on the ISI Kolkata Odia numeral database.
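As a rough illustration of the evaluation protocol above, the sketch below runs 10-fold cross-validation of random forest and SVM classifiers on a stand-in feature matrix; the data shapes, feature values and class count are placeholders, not the paper's databases.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Random vectors stand in for the four angular symmetric-axis features
# (row/column mean distance and mean angle); 4 balanced classes stand in
# for the 47 Odia character classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.repeat(np.arange(4), 50)

for name, clf in [("RF", RandomForestClassifier(random_state=0)), ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold validation as in the paper
    print(name, round(scores.mean(), 3))
```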
A survey on Script and Language identification for Handwritten document images (iosrjce)
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormously, and it is therefore essential to extract events from it intelligently. Advances in natural language processing (NLP) enable information extraction that locates and classifies content in news data according to predefined categories such as person name, place name, organization name, date and time. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level performance on specific documents such as newspapers by effectively identifying entities. The purpose of this research is to introduce an NER system for Bengali news data that identifies occurrences of specified entities in running text based on regular expressions and Bengali grammar. To this end, I have designed and evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden Markov Model (HMM) based approach for developing an NER system from Bengali news data.
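To make the HMM approach concrete, here is a minimal Viterbi-decoding sketch that tags tokens as entity ("NE") or other ("O"). All probabilities and the capitalisation heuristic are invented for the example; they are not values or features from the thesis.

```python
import math

# Toy HMM: states, start, and transition probabilities are illustrative only.
states = ["NE", "O"]
start_p = {"NE": 0.4, "O": 0.6}
trans_p = {"NE": {"NE": 0.4, "O": 0.6},
           "O":  {"NE": 0.2, "O": 0.8}}

def emit_p(state, token):
    # Crude emission model: capitalised tokens are more likely entities. For
    # Bengali script, which has no capitalisation, one would use gazetteer or
    # suffix features instead; this is purely illustrative.
    if token[:1].isupper():
        return 0.7 if state == "NE" else 0.2
    return 0.1 if state == "NE" else 0.9

def viterbi(tokens):
    # V[t][s] is the best log-probability of any path ending in state s at t.
    V = [{s: math.log(start_p[s]) + math.log(emit_p(s, tokens[0])) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({}); back.append({})
        for s in states:
            prev, score = max(((p, V[t - 1][p] + math.log(trans_p[p][s]))
                               for p in states), key=lambda x: x[1])
            V[t][s] = score + math.log(emit_p(s, tokens[t]))
            back[t][s] = prev
    # Trace back the best state sequence from the final position.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["Dhaka", "is", "the", "capital"]))  # ['NE', 'O', 'O', 'O']
```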
Natural language processing is an engineering field concerned with the design and implementation of computer systems whose main function is to analyze, understand, interpret and produce natural language. A literature review of Turkish natural language studies was conducted using the documentation method: scientific research and thesis studies on natural language processing in Turkey were examined by scanning the thesis pages of the Council of Higher Education. The studies obtained from this review were evaluated by subject, and criteria were determined and presented for morphological analysis studies, syntactic analysis studies, semantic analysis studies and problem analysis. Yilmaz Ince, E., "Turkish Natural Language Processing Studies", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-1, December 2019. URL: https://www.ijtsrd.com/papers/ijtsrd29831.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/29831/turkish-natural-language-processing-studies/yilmaz-ince-e
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, with the number of sentences given as a limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents: scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficient and by IDF (Inverse Document Frequency) are combined with TF (Term Frequency) to extract keywords, which are then used for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and a summary is generated with the number of sentences specified by the user.
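The GSS coefficient named above can be estimated from document counts, as in the sketch below. The way TF, IDF and GSS are combined in `keyword_score` is an assumption for illustration; the abstract only says the three scores are combined.

```python
import math

def gss(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """GSS(t, c) = P(t, c) * P(~t, ~c) - P(t, ~c) * P(~t, c), from doc counts.

    n_tc        : docs in category c containing term t
    n_t_notc    : docs outside c containing t
    n_nott_c    : docs in c not containing t
    n_nott_notc : docs outside c not containing t
    """
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    return (n_tc * n_nott_notc - n_t_notc * n_nott_c) / (n * n)

def keyword_score(tf, df, n_docs, gss_value, alpha=0.5):
    # Hypothetical combination of TF, IDF and GSS scores (the weighting is
    # an assumption, not the paper's formula).
    idf = math.log(n_docs / (1 + df))
    return tf * (alpha * gss_value + (1 - alpha) * idf)

print(gss(30, 5, 10, 55))  # 0.16: term strongly indicative of the category
```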
A statistical model for gist generation: a case study on Hindi news articles (IJDKP)
Every day, a huge number of news articles are reported and disseminated on the Internet. By generating the gist of an article, a reader can go through its main topics instead of reading the whole article, which takes much more time. An ideal system would understand the document and generate the appropriate theme(s) directly from the results of that understanding. In the absence of a natural language understanding system, an appropriate surrogate must be designed. Gist generation is a difficult task because it requires both maximizing the text content captured in a short summary and maintaining the grammaticality of the text. In this paper we present a statistical approach to generating the gist of a Hindi news article. The experimental results are evaluated using the standard measures of precision, recall and F1 for different statistical models and their combination, on articles both before and after pre-processing.
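For reference, the named measures are short to compute; the sketch below evaluates a gist as a set of selected sentences against a reference set, which is one plausible framing (an assumption here) of how precision, recall and F1 apply.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for selected summary sentences."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # correctly selected sentences
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1({"s1", "s2", "s3"}, {"s2", "s3", "s4"}))
# (0.666..., 0.666..., 0.666...)
```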
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach (cscpconf)
Punjabi text classification is the process of assigning predefined classes to unlabelled text documents. Because of the dramatic increase in the amount of content available in digital form, text classification has become an urgent need for managing digital data efficiently and accurately. Until now, no Punjabi text classifier has been available for Punjabi text documents. Therefore, in this paper, existing classification algorithms such as Naïve Bayes and centroid-based techniques are applied to Punjabi text classification, and one new approach is proposed for Punjabi text documents that combines Naïve Bayes (to extract the relevant features and thereby reduce dimensionality) with ontology-based classification (which acts as the text classifier using the extracted features). These algorithms are evaluated on 184 Punjabi news articles on sports, classifying the documents into 7 classes: ਕ੍ਰਿਕਟ (krikaṭ), ਹਾਕੀ (hākī), ਕਬੱਡੀ (kabḍḍī), ਫੁਟਬਾਲ (phuṭbāl), ਟੈਨਿਸ (ṭainis), ਬੈਡਮਿੰਟਨ (baiḍmiṇṭan), ਓਲੰਪਿਕ (ōlmpik).
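The centroid-based baseline mentioned above is compact enough to sketch: build one mean tf-idf vector per class and assign documents by cosine similarity. The toy English documents and labels below are placeholders for the Punjabi news articles.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder training data; a real system would use Punjabi sports articles.
docs = ["goal scored in the match", "wicket fell early", "fast bowler strikes"]
labels = ["football", "cricket", "cricket"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# One centroid per class: the mean tf-idf vector of its training documents.
centroids = {c: X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
             for c in set(labels)}

def classify(text):
    v = vec.transform([text]).toarray()[0]
    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return a @ b / (na * nb) if na and nb else 0.0
    # Assign the class whose centroid is most cosine-similar to the document.
    return max(centroids, key=lambda c: cos(v, centroids[c]))

print(classify("another wicket for the bowler"))  # cricket
```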
A study on the approaches of developing a named entity recognition tool (eSAT Publishing House)
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually, these tags mark syntactic classes like noun or verb, and sometimes include additional information such as case markers (number, gender etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and relies on a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It requires a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used today because of its accuracy.
Malayalam is a language of the Dravidian family, inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they were not built for Malayalam, which affects the accuracy of Malayalam POS tagging.
My proposed approach uses dictionary entries along with adjacent-tag information, and the algorithm uses multithreading. Tagging is done using the probability of occurrence of the sentence structure together with the dictionary entry.
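A minimal sketch of the dictionary-plus-adjacent-tag idea follows. The lexicon, tag set and bigram counts are invented for illustration (with romanised stand-ins for Malayalam words), and the multithreading the paper mentions is omitted.

```python
# Hypothetical lexicon mapping words to candidate tags, and corpus-style
# bigram counts over adjacent tags; a real tagger would load a Malayalam
# dictionary and statistics derived from a tagged corpus.
lexicon = {"kutti": {"NOUN"}, "odi": {"VERB"}, "nalla": {"ADJ", "ADV"}}
bigram = {("<s>", "ADJ"): 5, ("<s>", "ADV"): 1,
          ("ADJ", "NOUN"): 8, ("ADV", "NOUN"): 1,
          ("NOUN", "VERB"): 9}

def tag(words):
    prev, out = "<s>", []
    for w in words:
        candidates = lexicon.get(w, {"NOUN"})  # unknown words default to NOUN
        # Pick the dictionary tag that most often follows the previous tag.
        best = max(candidates, key=lambda t: bigram.get((prev, t), 0))
        out.append((w, best))
        prev = best
    return out

print(tag(["nalla", "kutti", "odi"]))
# [('nalla', 'ADJ'), ('kutti', 'NOUN'), ('odi', 'VERB')]
```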
Recognition of Words in Tamil Script Using Neural Network (IJERA Editor)
In this paper, word recognition using a neural network is proposed. The recognition process starts by partitioning the document image into lines, words, and characters, and then capturing the local features of the segmented characters. After the characters are classified, the word image is transformed into a unique code based on the character codes. This code describes any form of the word, including words with mixed styles and different sizes. The sequence of character codes of a word forms the input pattern, and the word code is the target value of the pattern. A neural network is used to train the patterns of the words. The trained network is tested with word patterns, which are recognized or rejected based on the network error value. Experiments were conducted with a local database to evaluate the performance of the word recognition system, and good accuracy was obtained. This method can be applied to a word recognition system for any language, as the training is based only on the unique codes of the characters and words belonging to the language.
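The character-to-word coding step might look like the sketch below. The integer table is hypothetical: the paper derives character codes from its classifier rather than from a fixed lookup like this.

```python
# Hypothetical code table for a few Tamil characters; in the paper the codes
# come from the character classification stage.
char_code = {"த": 1, "ம": 2, "ி": 3, "ழ": 4}

def word_code(chars):
    """Concatenate per-character codes into one word-level code string."""
    return "-".join(str(char_code[c]) for c in chars)

# The sequence of character codes would be the network's input pattern and
# the word code its target value.
print(word_code(["த", "ம", "ி", "ழ"]))  # 1-2-3-4
```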
The paper addresses the automation of an epigraphist's task of reading and deciphering inscriptions. The automation steps are pre-processing, segmentation, feature extraction and recognition. Pre-processing involves enhancement of degraded ancient document images, achieved through spatial filtering methods followed by binarization of the enhanced image. Segmentation is carried out using the Drop Fall and Water Reservoir approaches to obtain sampled characters. Next, Gabor and zonal features are extracted from the sampled characters and stored as feature vectors for training. An Artificial Neural Network (ANN) is trained with these feature vectors and later used for classification of new test characters. Finally, the classified characters are mapped to their modern forms. The system showed good results when tested on nearly 150 samples of ancient Kannada epigraphs from the Ashoka and Hoysala periods, with an average recognition accuracy of 80.2% for the Ashoka period and 75.6% for the Hoysala period.
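The Gabor feature stage of such a pipeline can be sketched with OpenCV's Gabor kernels, as below; the filter-bank parameters and the mean/std summary are illustrative assumptions, not the paper's settings.

```python
import cv2
import numpy as np

def gabor_features(char_img, orientations=4):
    """Mean and std of Gabor filter responses at several orientations."""
    feats = []
    for k in range(orientations):
        theta = k * np.pi / orientations
        kern = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                  lambd=10.0, gamma=0.5, psi=0)
        resp = cv2.filter2D(char_img.astype(np.float32), -1, kern)
        feats += [resp.mean(), resp.std()]
    return np.array(feats)

img = np.zeros((32, 32), np.uint8)
cv2.line(img, (4, 4), (28, 28), 255, 2)  # stand-in for a segmented character
print(gabor_features(img).shape)  # (8,): 4 orientations x 2 statistics
```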
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION (ijnlc)
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date and time. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naïve Bayes (MNB) classifier. The methodology adopted in this paper is based on feature extraction from the training corpus using term frequency and inverse document frequency, fitting them with a tf-idf vectorizer. The paper discusses the various issues in developing the proposed model, along with the details of implementation and performance evaluation. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens. The model achieves precision, recall and F1-measure of 83%, 79% and 81% respectively.
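Since the abstract names a tf-idf vectorizer feeding a Multinomial Naïve Bayes classifier, the pipeline can be sketched directly with scikit-learn. The tiny romanised training samples are placeholders for the 95,170-token Kannada corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder samples labelled with entity classes; the paper trains on a
# labelled Kannada token corpus.
train_texts = ["bengaluru nagara", "ramesh avaru", "january tingalu"]
train_labels = ["LOCATION", "PERSON", "DATE"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["mysuru nagara"]))  # likely ['LOCATION']
```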
A language independent approach to develop UrduIR system (csandit)
This is the era of information technology, and the most important thing today is getting the right information at the right time. More and more data repositories are being made available online, and information retrieval systems or search engines are used to access the electronic information available on the internet. These systems depend on the available tools and techniques for efficient retrieval of information content in response to user queries. Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web as e-data, but access to these repositories remains low because efficient search engines and retrieval systems supporting these languages are very limited. We have developed a language independent system to facilitate efficient retrieval of information available in the Urdu language, which can be used for other languages as well. The system achieves a precision of 0.63 and a recall of 0.8.
An Information Retrieval System is an effective process that helps a user trace relevant information through Natural Language Processing (NLP). In this research paper, we present an algorithmic Bengali Information Retrieval System (BIRS) that is grounded mathematically and statistically. The paper demonstrates two algorithms for lemmatizing Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), and compares them with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system using Hobbs' algorithm to obtain the correct expression of the information. As the question answering stage, TF-IDF and cosine similarity are used to find the most accurate answer from the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We have also developed corpora of Bengali root words, synonyms and stop words, and gathered 672 articles from the popular Bengali newspaper 'The Daily Prothom Alo' as our document collection. To test the system, we created 19,335 questions from this collection and obtained accurate answers for 97.22% of them.
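Of the techniques listed, edit-distance lemmatization is easy to sketch: map an inflected word to the nearest entry in a root-word corpus. The three-entry root list below is a placeholder for the paper's corpus.

```python
def edit_distance(a, b):
    # Standard Levenshtein distance with a rolling one-row table.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (ca != cb)) # substitution
    return dp[-1]

roots = ["করা", "খাওয়া", "যাওয়া"]  # placeholder root-word corpus

def lemmatize(word):
    """Return the root word closest to the inflected form."""
    return min(roots, key=lambda r: edit_distance(word, r))

print(lemmatize("করেছি"))  # nearest root by edit distance, here "করা"
```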
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
Suitability of naïve bayesian methods for paragraph level text classification... (ijaia)
The amount of data available online is growing very rapidly, so organizing and categorizing data has become an obvious need. Information Retrieval (IR) techniques assist users in obtaining relevant information. IR in the Indian context is very relevant, as several blogs and news publications in Indian languages are available online. This work examines the suitability of Naïve Bayesian methods for paragraph-level text classification in the Kannada language. The Naïve Bayesian methods are among the most basic algorithms for text categorization tasks. We apply dimensionality reduction using a minimum term frequency threshold together with stop-word identification and elimination. The results show that the Naïve Bayesian multinomial model outperforms the simple Naïve Bayesian approach in paragraph classification tasks.
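The dimensionality-reduction step described above (minimum term frequency plus stop-word elimination) amounts to vocabulary pruning before training; the sketch below shows one way to do it, with an illustrative threshold and romanised stand-ins for Kannada stop words.

```python
from collections import Counter

# Hypothetical stop list and threshold; the paper would use Kannada stop
# words and a tuned minimum term frequency.
stop_words = {"mattu", "ondu", "idu"}
MIN_TF = 2

def reduced_vocabulary(paragraphs):
    """Keep terms that occur at least MIN_TF times and are not stop words."""
    counts = Counter(tok for p in paragraphs for tok in p.split())
    return {t for t, c in counts.items()
            if c >= MIN_TF and t not in stop_words}

paras = ["chitra mattu sangeetha", "chitra kathe", "idu kathe"]
print(sorted(reduced_vocabulary(paras)))  # ['chitra', 'kathe']
```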
Dimension Reduction for Script Classification - Printed Indian Documents (ijait)
Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and the selection of a script-specific OCR in a multilingual environment. This paper provides a comparative study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods. For a given script we extracted different features, namely Gray Level Co-occurrence Matrix (GLCM) and Scale Invariant Feature Transform (SIFT) features. The features are extracted globally from a given text block, which does not require any complex and reliable segmentation of the document image into lines and characters. The extracted features are reduced using the various dimension reduction techniques, and the reduced features are fed into a nearest neighbor classifier. The proposed scheme is thus efficient and can be used for many practical applications which require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to the scanning process and relatively insensitive to changes in font size. The proposed system achieves good classification accuracy on a large test data set.
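One of the three compared pipelines, PCA followed by a nearest neighbor classifier, can be sketched with scikit-learn as below; random vectors stand in for the GLCM/SIFT block features and the 10 script labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data: 60 text blocks with 128-dim features, 10 script labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 128))
y = rng.integers(0, 10, size=60)

# Reduce to 20 dimensions with PCA, then classify with 1-nearest-neighbor.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
print(model.predict(X[:3]))  # reproduces training labels under 1-NN
```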
Language Identification from a Tri-lingual Printed Document: A Simple Approach (IJERA Editor)
In a multi-script, multilingual country like India, a document may contain text lines in more than one script or language. For such a multi-script environment, a multilingual Optical Character Recognition (OCR) system is needed to read multi-script documents. For a multilingual OCR system to be successful, it is necessary to identify the different script regions of the document before feeding it to the OCRs for the individual languages. In this context, this paper addresses the requirements of a particular region, Andhra Pradesh, a state in India, where any document, including official ones, may contain text in three languages: Telugu, Hindi and English. The objective of this paper is therefore to develop a system that accurately identifies and separates Telugu, Hindi and English text lines from a printed multilingual document, and groups any portion of the document in other languages into a separate category, OTHERS. The proposed method is developed by thoroughly analysing the nature of the top and bottom profiles of printed text lines. Experimentation involved 900 text lines for learning and 900 text lines for testing, and the performance achieved is 95.67%.
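The top and bottom profiles that drive the method can be computed per column of a binarized line image, roughly as below; the tiny array is a stand-in for a scanned text line.

```python
import numpy as np

def profiles(line_img):
    """For each column of a binary line image (1 = ink), return the first
    and last ink rows; -1 marks columns with no ink."""
    h, w = line_img.shape
    top = np.full(w, -1)
    bottom = np.full(w, -1)
    for x in range(w):
        ys = np.flatnonzero(line_img[:, x])
        if ys.size:
            top[x], bottom[x] = ys[0], ys[-1]
    return top, bottom

line = np.zeros((6, 5), dtype=np.uint8)
line[2, 1:4] = 1   # a short horizontal stroke
line[3, 2] = 1     # a descender in the middle column
top, bottom = profiles(line)
print(top.tolist(), bottom.tolist())  # [-1, 2, 2, 2, -1] [-1, 2, 3, 2, -1]
```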
Wavelet Packet Based Features for Automatic Script Identification (CSCJournals)
In a multi-script environment, archives of documents with text regions printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script of documents printed in seven scripts, so as to categorize them for further processing. South Indian documents printed in the seven scripts Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English are considered here. The document images are decomposed through Wavelet Packet Decomposition using the Haar basis function up to level two, and texture features are extracted from the sub-bands of the decomposition. The Shannon entropy is computed for each sub-band, and these entropy values are combined into the texture feature vector. Experimentation involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier, and the average success rate is found to be 99.68%.
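A sketch of the wavelet-packet entropy features with PyWavelets follows: decompose to level two with the Haar basis and take the Shannon entropy of each sub-band. The histogram-based entropy estimate and the random stand-in image are assumptions for illustration.

```python
import numpy as np
import pywt

def shannon_entropy(band, bins=32):
    """Shannon entropy of a sub-band's coefficient histogram, in bits."""
    hist, _ = np.histogram(band, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

img = np.random.default_rng(0).random((64, 64))  # stand-in for a text block
wp = pywt.WaveletPacket2D(data=img, wavelet="haar", maxlevel=2)
features = [shannon_entropy(node.data) for node in wp.get_level(2)]
print(len(features))  # 16 sub-bands at level two -> 16-dim feature vector
```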
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short and important text from a source document. In this paper, we study various summarization techniques. Plenty of techniques have been developed for English and other Indian languages, but very little effort has been made for Hindi. Here we discuss various techniques in terms of features such as time and memory consumption, efficiency, accuracy, ambiguity and redundancy.
An Empirical Study on Identification of Strokes and their Significance in Scr... (IJMER)
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSDIJERA Editor
With the rapid growth of multilingual information on the Internet, Cross Language Information Retrieval (CLIR) is becoming the need of the day. It helps users query in their native language and retrieve information in any language. But the performance of CLIR is poor compared to monolingual retrieval due to lexical ambiguity, mismatching of query terms and out-of-vocabulary words. In this paper, we propose an algorithm for improving the performance of a Marathi-English CLIR system. The system first finds possible translations of the input query in the target language, disambiguates them, and then gives the English queries to a search engine for relevant document retrieval. The disambiguation is based on an unsupervised corpus-based method which uses an English dictionary as an additional resource. The experiment is performed on the FIRE 2011 (Forum of Information Retrieval Evaluation) dataset using the “Title” and “Description” fields as inputs. The experimental results show that the proposed approach improves the performance of the Marathi-English CLIR system with a good precision level.
AN APPORACH FOR SCRIPT IDENTIFICATION IN PRINTED TRILINGUAL DOCUMENTS USING T...ijaia
In this work, we review the effectiveness of texture features for script classification. A Rectangular White Space analysis algorithm is used to analyze and identify heterogeneous layouts of document images. The texture features, namely colour texture moments, the local binary pattern (LBP), and the responses of Gabor, LM-filter, S-filter and R-filter banks, are extracted, and combinations of these are considered in the classification. A probabilistic neural network and a nearest-neighbour classifier are used for classification. To corroborate the adequacy of the proposed strategy, an experiment was conducted on our own data set. To study the effect on classification accuracy, we varied the database size, and the results show that combining multiple features vastly improves performance.
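One of the texture cues named above, the local binary pattern, can be sketched with scikit-image as follows; the parameters (8 neighbours, radius 1, uniform patterns) are illustrative defaults, not the paper's settings.

import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.neighbors import KNeighborsClassifier

def lbp_histogram(gray_img: np.ndarray, P: int = 8, R: int = 1) -> np.ndarray:
    """Normalised histogram of uniform LBP codes, a common texture feature."""
    lbp = local_binary_pattern(gray_img, P, R, method='uniform')
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# A nearest-neighbour classifier (one of the two classifiers the abstract
# names) can then be fit on stacked histograms, e.g.:
# knn = KNeighborsClassifier(n_neighbors=1).fit(train_feats, train_labels)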
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGEScscpconf
Identification of scripts in a multi-script document is one of the important steps in the design of an OCR system for successful analysis and recognition. Most optical character recognition (OCR) systems can recognize at most a few scripts, but for large archives of document images containing different scripts, there must be some way to automatically categorize the documents before applying the proper OCR to them. Much work has already been reported in this area; in the Indian context, though some results have been reported, the task is still in its infancy. This paper presents research on the identification of Tamil, English and Hindi scripts at the word level, irrespective of font face and size. It also identifies English numerals in multilingual document images. The proposed technique uses a document vectorization method that generates vectors from nine zones segmented over the characters, based on their shape, density and transition features. The script is then determined using rule-based classifiers and their sub-classifiers, which contain sets of classification rules derived from the vectors. The proposed system identifies scripts in document images even when they suffer from noise and other kinds of distortion. Results from experiments, simulations, and human inspection show that the proposed technique identifies scripts and numerals with minimal pre-processing and high accuracy. In future, it can also be extended to other scripts.
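The nine-zone vectorization step can be pictured with a short sketch; the 3x3 grid and the density feature follow the abstract, while the exact shape and transition features are omitted here and the function is our simplification.

import numpy as np

def nine_zone_densities(word_img: np.ndarray) -> np.ndarray:
    """Ink density in each cell of a 3x3 grid over a binary word image
    (a simplified stand-in for the zone-wise shape/density/transition vector)."""
    h, w = word_img.shape
    feats = []
    for i in range(3):
        for j in range(3):
            zone = word_img[i * h // 3:(i + 1) * h // 3,
                            j * w // 3:(j + 1) * w // 3]
            feats.append(float(zone.mean()))   # fraction of ink pixels in the zone
    return np.array(feats)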
A decision tree based word sense disambiguation system in manipuri languageacijjournal
This paper presents a first attempt at building a word sense disambiguation system for the Manipuri language. The paper discusses related attempts made for Manipuri, followed by the proposed plan. A database consisting of 650 sentences was collected in Manipuri in the course of the study. Conventional positional and context-based features are proposed to capture the sense of words that are ambiguous and have multiple senses. The proposed work is expected to predict the senses of polysemous words with high accuracy with the help of suitable knowledge acquisition techniques. The system produces an accuracy of 71.75%.
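A hedged sketch of the kind of positional and context features the abstract mentions, fed to a decision tree; the feature names, window size, and the placeholder training instances below are our assumptions, not the authors' setup.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def context_features(tokens, i, window=2):
    """Positional and surrounding-word features for the ambiguous token."""
    feats = {"word": tokens[i], "rel_pos": i / max(len(tokens) - 1, 1)}
    for k in range(1, window + 1):
        feats[f"left_{k}"] = tokens[i - k] if i - k >= 0 else "<PAD>"
        feats[f"right_{k}"] = tokens[i + k] if i + k < len(tokens) else "<PAD>"
    return feats

# Hypothetical training instances: (sentence tokens, target index, sense label).
train = [(["a", "bank", "loan"], 1, "FINANCE"), (["river", "bank", "side"], 1, "RIVER")]
vec = DictVectorizer()
X = vec.fit_transform(context_features(t, i) for t, i, _ in train)
clf = DecisionTreeClassifier().fit(X, [s for _, _, s in train])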
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents in other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO and the hybrid PSO+K-means algorithm are applied to clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
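The synset enrichment idea can be shown in miniature; this sketch uses NLTK's English WordNet as a stand-in, since the paper works with Nepali and the corresponding WordNet resources differ.

from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def bag_of_synsets(tokens):
    """Map each token to the name of its first WordNet synset, so that
    synonyms collapse to the same feature before term weighting."""
    return [wn.synsets(t)[0].name() if wn.synsets(t) else t for t in tokens]

print(bag_of_synsets(["car", "automobile"]))  # both map to 'car.n.01'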
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation: it conflates all word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
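As a toy illustration of the simplest family surveyed, suffix-stripping stemming, consider the sketch below; the suffix list is invented purely for illustration and is not a real linguistic resource for any Indic language.

# Hypothetical suffix list for illustration only; real Indic stemmers use
# curated, language-specific suffix tables.
SUFFIXES = sorted(["ongi", "achi", "ata", "ani", "e", "i"], key=len, reverse=True)

def light_stem(word: str, min_stem: int = 3) -> str:
    """Strip the longest matching suffix while keeping a minimum stem length."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word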
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA...ijnlc
The comprehension of entire handwritten documents is a challenging problem that encompasses several difficult tasks. Given a handwritten document, its layout must first be analysed to isolate the different content types. These content types can then be routed to specific subsystems, including writer-style, image, or table recognizers. Research in automatic writer identification has mainly centred on the statistical approach. This has led to the selection and extraction of statistical features such as run-length distributions, slant distribution, entropy, and edge-hinge distribution. The edge-hinge distribution outperforms all other statistical features; it is a feature that characterises the changes in direction of a writing stroke in handwritten text. The edge-hinge distribution is extracted by means of a window that is slid over the edges detected in offline scanned images. Whenever the central pixel of the window is on, the two edge fragments (i.e. connected sequences of pixels) emerging from this central pixel are considered, and their directions are measured and stored as pairs. A joint probability distribution is obtained from an extensive…
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGESijistjournal
Named Entity Recognition is a primary task in Natural Language Processing. Named Entity Recognition is a sub-task of information extraction that identifies and classifies proper nouns into predefined categories such as person, location, organization, time, date, etc. In this document the major focus is on NER approaches, and the work done so far for various languages to identify named entities is discussed. The authors present a comparative study of named entity recognition and find that the CRF approach proves best for identifying named entities in Indian languages.
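Since the study finds CRFs best for Indian-language NER, here is a minimal hedged sketch using sklearn-crfsuite; the features and the two-sentence placeholder training set are illustrative assumptions, not the authors' configuration.

import sklearn_crfsuite

def token_features(sent, i):
    """Simple orthographic and context features for the token at position i."""
    w = sent[i]
    return {"lower": w.lower(), "is_title": w.istitle(), "suffix3": w[-3:],
            "prev": sent[i - 1].lower() if i > 0 else "<BOS>"}

# Hypothetical training data: tokenised sentences with BIO labels.
sents = [["Mohan", "lives", "in", "Delhi"], ["Sita", "visited", "Agra"]]
labels = [["B-PER", "O", "O", "B-LOC"], ["B-PER", "O", "B-LOC"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X)[0])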
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing requirement on the Internet for users to access information expressed in languages other than their own has led to CLIR becoming established as a major topic in IR.
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) along with TF (Term Frequency) for extracting keywords and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
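A small sketch of how a GSS coefficient can be combined with TF and IDF for keyword scoring, along the lines the abstract outlines; combining the three scores by simple multiplication is our assumption, not necessarily the authors' formula.

import math

def gss(n_tc, n_tnc, n_ntc, n_ntnc):
    """GSS (Galavotti-Sebastiani-Simi) coefficient from document counts:
    term present/absent (t / not-t) in category c versus other categories."""
    n = n_tc + n_tnc + n_ntc + n_ntnc
    return (n_tc / n) * (n_ntnc / n) - (n_tnc / n) * (n_ntc / n)

def keyword_score(tf, df, n_docs, gss_value):
    """Assumed combination: TF x IDF x GSS."""
    return tf * math.log(n_docs / (1 + df)) * gss_value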
A Novel approach for Document Clustering using Concept ExtractionAM Publications
In this paper we present a novel approach to extract the concept from a document and to cluster a set of documents according to the concepts extracted from each of them. We transform the corpus into a vector space using term frequency-inverse document frequency, calculate the cosine distance between each pair of documents, and then cluster them using the K-means algorithm. We also use multidimensional scaling to reduce the dimensionality of the corpus. The result is a grouping of documents that are most similar to each other with respect to their content and genre.
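The pipeline this abstract describes (TF-IDF vectors, cosine distances, multidimensional scaling, then K-means) translates almost line for line into scikit-learn; the toy corpus and cluster count below are placeholders, not the paper's data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

docs = ["graphs model relationships in data",
        "image retrieval from document collections",
        "knowledge graphs power recommendations"]          # placeholder corpus

tfidf = TfidfVectorizer().fit_transform(docs)              # corpus -> vector space
dist = cosine_distances(tfidf)                             # pairwise cosine distance
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)           # reduce dimensionality
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(coords)        # cluster the documents
print(labels)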
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI with Open AI's advanced natural language processing capabilities as a test automation solution.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at a smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating the uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
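For readers who want to try the Python binding before the webinar, here is a minimal sketch with the pypowsybl package, assuming it exposes the bundled IEEE 14-bus example network and an AC load-flow runner as its documentation describes:

import pypowsybl as pp

# Load the bundled IEEE 14-bus test network and run an AC power flow.
network = pp.network.create_ieee14()       # example network shipped with pypowsybl
results = pp.loadflow.run_ac(network)      # AC load-flow computation
print(results[0].status)                   # convergence status of the main component
print(network.get_buses().head())          # per-bus data as a pandas DataFrame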
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Thanuja C, Shreedevi G R / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 4, Jul-Aug 2013, pp. 1329-1335

Content Based Image Retrieval System for Kannada Query Image from Multilingual Document Image Collection

Thanuja C¹, Shreedevi G R²
¹,² Department of Information Science, Sambhram Institute of Technology, Bangalore-97
ABSTRACT
In a multilingual country like India, a document may contain text words in more than one language. It is quite natural that documents produced in the border regions of Karnataka may also be printed in the regional languages of the neighbouring states, such as Telugu, Tamil, Malayalam and Urdu. An electronic searching system for Kannada text, based on content, is needed to access such multilingual documents, so it is necessary to identify the different language portions of the document before retrieval. The objective of this paper is to propose a visual-clue-based procedure to identify Kannada text in a multilingual document that contains Hindi, English and Malayalam text portions along with Kannada.
Keywords - Content based image retrieval, Correlation Coefficients, Feature Extraction, Multilingual Document image, Script Identification.
I. INTRODUCTION
Large electronic collections of historical prints, writings, manuscripts and books exist in Indian languages that need search options over images. The objective of language identification is to translate human-identifiable documents into machine-identifiable codes. For example, heritage inscriptions being digitized may contain more than one language. Such collections can be made available to large communities through electronic media.

Identification of the language in a document image is of primary importance for retrieval. Language identification may seem an elementary issue for humans in the real world, but it is difficult for a machine, primarily because different scripts (a script can be a common medium for several languages) are made up of differently shaped patterns that produce different character sets. A document containing text information in more than one language is called a multilingual document. For such multilingual documents, it is essential to identify the language of each text portion before retrieval. Language identification is a vision application problem. Humans generally identify the language in a document using visible characteristic features, such as texture, horizontal lines and vertical lines, which are visually perceivable and appeal to visual sensation. This human visual perception capability has motivated the development of the proposed system. In this context, this paper attempts to simulate the human visual system to identify the type of language based on visual clues, without reading the contents of the document. There is a need for easy and efficient access to such documents.
The search procedures available for the text domain could be applied if these document images were converted into textual representations using recognizers. However, this is an infeasible solution due to the unavailability of efficient and robust OCRs for Indian languages. Addressing this problem, the paper proposes efficient recognition and retrieval of Kannada text from multilingual document images, based on the visual features of the text.
II. RELATED WORK
2.1 SCRIPT IDENTIFICATION
From the literature survey, it is evident that some work has been carried out on script identification. Peake and Tan [1997] proposed a method for automatic script and language identification from document images using multiple-channel (Gabor) filters and gray-level co-occurrence matrices for seven languages: Chinese, English, Greek, Korean, Malayalam, Persian and Russian. Tan [1998] developed a rotation-invariant texture feature extraction method for automatic script identification for six languages: Chinese, Greek, English, Russian, Persian and Malayalam. In the context of Indian languages, some research on language identification has been reported [1997, 1997, 2005, and 2003]. Pal and Choudhuri [2001] proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Kannada, Kashmiri, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with any one of the other scripts.

Santanu Choudhuri et al. [2000] proposed a method for identification of Indian languages by combining a Gabor-filter-based technique and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. Basavaraj Patil and Subbareddy developed a character script class identification system for machine-printed bilingual documents in English and Kannada scripts using a probabilistic neural network. Pal and Choudhuri [1997] proposed an automatic separation of Bangla, Devanagari and Roman words in multilingual multi-script Indian documents. Nagabhushan et al. [2003] proposed a fuzzy statistical approach to Kannada vowel recognition based on invariant moments. Pal et al. [2005] suggested a word-wise script identification model for a document containing English, Devanagari and Telugu text. Chanda and Pal [2005] proposed an automatic technique for word-wise identification of Devanagari, English and Urdu scripts from a single document. Spitz [1997] proposed a technique for distinguishing Han- and Latin-based scripts on the basis of the spatial relationships of features related to the character structures.
Pal et al. [2003] developed a script identification technique for Indian languages employing new features based on the water reservoir principle, contour tracing, jump discontinuity, and left and right profiles. Ramachandra et al. [2002] proposed a method based on rotation-invariant texture features using a multichannel Gabor filter for identifying six Indian languages (Bengali, Kannada, Malayalam, Oriya, Telugu and Marathi). Hochberg et al. [2006] presented a system that automatically identifies the script from document images using cluster-based templates. Gopal et al. presented a scheme to identify different Indian scripts through hierarchical classification using features extracted from the responses of a multichannel log-Gabor filter.

With all this in mind, we propose a system that accurately identifies and separates the Kannada portions of a document by separating out the Hindi, English and Malayalam text, as our intention is to identify only Kannada in multilingual document images. The system identifies Kannada with the help of a knowledge base, since the main aim is to focus only on Kannada text.
2.2 CONTENT BASED IMAGE RETRIEVAL
In this section, we look at the literature on indexing and retrieval techniques used for search in large image databases. The topic overlaps with databases, pattern recognition, content based image retrieval, digital libraries, document image processing and information retrieval.

A number of approaches have been proposed in recent years for efficient search and retrieval of document images. There are essentially two classes of techniques for searching document image collections. The first approach is to convert the images into text and then apply a search engine. In recognition-based search and retrieval, the document images are passed through an optical character recognizer (OCR) to obtain text documents, which are then processed by a text search engine to build an index; the text index makes document retrieval efficient. Taghva et al. built a search engine for documents obtained after recognition of images, where searching is based on similarity calculation between the query words and the database words, and similar words are identified from the correct terms by applying a mutual information measure. There have also been attempts to retrieve complete documents (rather than searching for words) by considering information from word neighbourhoods (such as n-grams) to improve the search in the presence of OCR errors. Word spotting, the second class of techniques, searches and locates words in document images by treating a collection of documents as a collection of word images. The words are clustered and the clusters annotated to enable indexing and searching over the documents; this involves segmenting each document into its lines and then into words. The word spotting approach has been extended to searching queried words in printed document images of newspapers and books, and a dynamic time warping (DTW) based word-spotting algorithm for indexing and retrieval of online documents has also been reported.

The remainder of the paper describes our current development effort in more detail. Section 3 describes the visual features of the four languages considered in the experiment. Section 4 gives the overall system architecture. Section 5 briefs word-level segmentation. Section 6 describes the supportive knowledge base for script identification. Section 7 details the implementation and experimental results of the developed system, and Section 8 concludes the paper.
III. VISUAL DISCRIMINATING FEATURES OF KANNADA, HINDI, ENGLISH AND MALAYALAM TEXT
Feature extraction is an integral part of any recognition system. Its aim is to describe the pattern by means of a minimum number of features or attributes that are effective in discriminating pattern classes. The algorithms presented in this paper are inspired by the simple observation that every language defines a finite set of text patterns, each having a distinct visual appearance. The character shape descriptors take into account any feature that appears distinct for the language, and hence every language can be identified by its visual discriminating features. The presence and absence of the discriminating features of Kannada, Hindi, English and Malayalam text words are given in Table-1.
3.1. FEATURES OF KANNADA TEXT
It can be seen that most Kannada characters have horizontal line-like structures. The Kannada character set has 50 basic characters, of which the first 14 are vowels and the remaining are consonants. A consonant combined with a vowel forms a modified compound character, resulting in more than one component and a much larger size than the corresponding basic character. A document in Kannada is thus a collection of basic and compound characters of equal and unequal size, with some characters having more than one component. A typical Kannada word with the results of vertical and horizontal erosion is shown below.

[Figure: a typical Kannada word image with its vertical and horizontal erosion results]
3.2 FEATURES OF HINDI TEXT
In Hindi, many characters have a horizontal line at the upper part. This line is called the sirorekha in Devanagari; we shall call it the head-line. When two or more characters sit side by side to form a word, the head-line segments mostly join one another, producing a single component within each text word and one continuous head-line per word. Since the characters are connected through their head-line portions, a Hindi word appears as a single component and cannot be segmented further into blocks, which can be used as a visual discriminating feature to recognize Hindi. Most Hindi characters also have vertical line-like structures. Because two or more characters are connected through their head-lines, the width of the block is much larger than the height of the text line. A typical Hindi word with the results of vertical and horizontal erosion is shown below.

[Figure: a typical Hindi word image with its vertical and horizontal erosion results]
3.3. FEATURES OF ENGLISH TEXT
It has been found that a distinct characteristic of most English characters is the existence of vertical line-like structures and uniformly sized characters, each having only one component (except "i" and "j" in lower case). A typical English word with the results of vertical and horizontal erosion is shown below.

[Figure: a typical English word image with its vertical and horizontal erosion results]
3.4. FEATURES OF MALAYALAM TEXT
In the Malayalam language, many characters have a horizontal line, which can be used as a visual discriminating feature to recognize the language. Most Malayalam characters also have vertical line-like structures. A typical Malayalam word with the results of vertical and horizontal erosion is shown below.

[Figure: a typical Malayalam word image with its vertical and horizontal erosion results]
Table-1: Presence and absence of the discriminating features of Kannada, Hindi, English and Malayalam text words (Yes = feature present, No = feature absent).

Language     Horizontal lines   Vertical lines
Kannada      Yes                No
Hindi        Yes                Yes
English      Yes                Yes
Malayalam    Yes                Yes
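Table-1 amounts to a small rule base. A hedged sketch of the decision it supports follows (the thresholds are placeholders; the actual density values come from the knowledge base described in Section VI):

def classify_by_line_features(h_density: float, v_density: float,
                              h_thresh: float = 0.1, v_thresh: float = 0.1) -> str:
    """Apply Table-1: Kannada is the only class with horizontal line
    structures but no vertical ones; the remaining scripts need the
    knowledge-base densities of Section VI to be separated further."""
    has_horizontal = h_density > h_thresh
    has_vertical = v_density > v_thresh
    if has_horizontal and not has_vertical:
        return "Kannada"
    return "Hindi/English/Malayalam"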
IV. SYSTEM ARCHITECTURE
Figure 1 shows the architecture of the system, which accepts a textual query from the user. The textual query is first converted to an image by rendering; features are extracted from the images; the Kannada text is then recognized, and a search is carried out to retrieve the relevant multilingual documents.
Fig 1. System Architecture
V. SEGMENTATION
Pre-processing and dilation produce connected components in which the characters of a word merge into a single group of pixels. Each such group of pixels is treated as a single word and segmented. A document's word images marked for segmentation are shown in Fig. 2.
Fig 2. Segmentation
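A minimal sketch of this dilation-plus-connected-components word segmentation using OpenCV; the kernel size and iteration count are assumptions to be tuned per document, not values from the paper.

import cv2
import numpy as np

def segment_words(binary_img: np.ndarray) -> list:
    """Dilate so the characters of each word fuse into one pixel group,
    then return the bounding boxes of the resulting connected components."""
    kernel = np.ones((3, 9), np.uint8)                     # wider than tall: joins characters
    dilated = cv2.dilate(binary_img.astype(np.uint8), kernel, iterations=2)
    n, _, stats, _ = cv2.connectedComponentsWithStats(dilated)
    boxes = [tuple(stats[i, :4]) for i in range(1, n)]     # skip background (label 0)
    return sorted(boxes, key=lambda b: (b[1], b[0]))       # rough reading order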
VI. SUPPORTIVE KNOWLEDGE BASE FOR SCRIPT IDENTIFICATION
A knowledge base plays an important role in the recognition of any pattern; it is a repository of derived information. A supportive knowledge base is constructed for each specific class of patterns, which helps during decision making to arrive at a conclusion. In the present method, the vertical-line and horizontal-line densities of segmented word images of the four languages (Kannada, Hindi, English and Malayalam) are computed empirically using a sufficient data set. Erosion is used to obtain the features of the languages. Based on the experimental results, a supportive knowledge base is constructed from the densities of the vertical and horizontal lines of the text words of each language. The densities of the two visual features for each word image of the four languages are computed through extensive experimentation and stored in the knowledge base for later use during decision making. A summary plot of the values from the knowledge base is given below.
Fig 3. Plot generated from Knowledgebase
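The erosion-based density features stored in the knowledge base can be sketched as follows; the structuring-element length is an assumption, since the paper derives its actual values and thresholds experimentally.

import cv2
import numpy as np

def line_densities(word_img: np.ndarray, length: int = 5):
    """Horizontal and vertical line density of a binary word image, measured
    as the fraction of ink that survives erosion by line-shaped elements."""
    v_se = cv2.getStructuringElement(cv2.MORPH_RECT, (1, length))  # 1 wide, tall
    h_se = cv2.getStructuringElement(cv2.MORPH_RECT, (length, 1))  # long, 1 tall
    ink = float(word_img.sum()) + 1e-9
    h_density = cv2.erode(word_img, h_se).sum() / ink
    v_density = cv2.erode(word_img, v_se).sum() / ink
    return h_density, v_density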
VII. IMPLEMENTATION
The system accepts a textual query from the user. The results of the search are pages from the document image collection with the query word highlighted. An efficient mechanism for retrieving a Kannada word from a large multilingual document image collection is presented in this paper. It involves (i) pre-processing, (ii) query image formulation, and (iii) matching and retrieval.
7.1 PRE-PROCESSING
Pre-processing prepares the source image for the recognition of Kannada text. The source image is converted to a binary image; this conversion helps in performing the morphological operation, which here consists of repeated dilations of the image. Dilation helps in differentiating two words delimited by a space. Kannada text is then identified by extracting vertical and horizontal line features from the multilingual document images across all four languages, and the coordinates of each word in the document image are recorded.
7.2 QUERY IMAGE
A query image has to be formulated from the query word given as input to the system. The English text entered is translated to Kannada and converted to an image by rendering. This query image is converted to a binary image, which enables comparison of the source image with the query image.
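Query rendering and binarization might look like the following Pillow sketch; the Kannada font file name and the binarization threshold are assumptions, not details from the paper.

import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_query(text: str, font_path: str = "NotoSansKannada-Regular.ttf",
                 size: int = 32, thresh: int = 128) -> np.ndarray:
    """Render the (translated) Kannada query string and binarize it (ink = 1)."""
    font = ImageFont.truetype(font_path, size)   # font file is an assumption
    x0, y0, x1, y1 = font.getbbox(text)
    img = Image.new("L", (x1 - x0 + 10, y1 - y0 + 10), 255)
    ImageDraw.Draw(img).text((5 - x0, 5 - y0), text, font=font, fill=0)
    return (np.array(img) < thresh).astype(np.uint8)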
7.3 MATCHING AND RETRIEVAL
This is the stage where the documents matching the search criteria are retrieved. The correlation coefficient of the images is used to match the query word with the source image. If the matching score between the query and source images exceeds the threshold, the words match and the word is highlighted in the document.
7.4. CORRELATION MATCHING
The correlation coefficient between two variables (here, image matrices) is defined as the covariance of the two variables divided by the product of their standard deviations. The definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables.

Pearson's correlation coefficient, when applied to a population, is commonly represented by the Greek letter ρ (rho) and may be referred to as the population correlation coefficient or the population Pearson correlation coefficient. The formula for ρ is:

\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}

Pearson's correlation coefficient, when applied to a sample, is commonly represented by the letter r and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. We obtain a formula for r by substituting sample estimates of the covariance and variances into the formula above:

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample correlation coefficient is

r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{s_X}\right) \left(\frac{Y_i - \bar{Y}}{s_Y}\right)

where (X_i - \bar{X})/s_X, \bar{X} and s_X are the standard score, sample mean, and sample standard deviation, respectively.
The absolute value of both the sample and population Pearson correlation coefficients is less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line, or to a bivariate distribution entirely supported on a line. A key mathematical property of the correlation coefficient is that it is invariant (up to a sign) under separate changes in location and scale of the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c and d are constants, without changing the correlation coefficient (this holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation.
The Pearson correlation can also be expressed in terms of uncentered moments. Since \mu_X = E(X), \sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X), and likewise for Y, and since E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y), the correlation can also be written as

\rho = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}

An equivalent alternative formula for the sample Pearson correlation coefficient is

r = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{(n - 1)\, s_X s_Y}
The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.

More generally, note that (X_i - \bar{X})(Y_i - \bar{Y}) is positive if and only if X_i and Y_i lie on the same side of their respective means. Thus the correlation coefficient is positive if X_i and Y_i tend to be simultaneously greater than, or simultaneously less than, their respective means.
For uncentered data, the correlation coefficient corresponds to the cosine of the angle between the two possible regression lines y = g_X(x) and x = g_Y(y). For centered data (i.e., data shifted by the sample mean so as to have an average of zero), the correlation coefficient can also be viewed as the cosine of the angle between the two vectors of samples drawn from the two random variables (see Fig. 4).

Fig 4. Regression lines for y = g_X(x) [red] and x = g_Y(y) [blue]
Both the uncentered (non-Pearson-compliant) and centered correlation coefficients can be determined for a dataset.
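Put together, the matching step reduces to computing r between the query image and each equal-sized candidate word image; below is a minimal NumPy sketch (ours, not the authors' code; resizing candidates to the query size and the 0.7 threshold are illustrative assumptions).

import numpy as np

def pearson_match(query: np.ndarray, candidate: np.ndarray) -> float:
    """Sample Pearson correlation between two equal-sized binary word images."""
    q = query.ravel().astype(float) - query.mean()
    c = candidate.ravel().astype(float) - candidate.mean()
    denom = np.sqrt((q ** 2).sum() * (c ** 2).sum())
    return float((q * c).sum() / denom) if denom else 0.0

# A candidate word is declared a match when, e.g., pearson_match(q, c) > 0.7.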
The experimental results of the proposed system are given below. Figure 5 shows script identification based on the knowledge base: only the Kannada text has been marked for searching. After script identification, the query word is searched only within this highlighted area, i.e., the Kannada text.
Fig 5. Script identification
Figure 6 shows the final result of the search. The query word "ARJI" is highlighted in the document, and this document is returned to the user as the result.
Fig 6. Result of search
VIII. CONCLUSION
In this paper, we have presented a word-wise identification model to identify Kannada text words in Indian multilingual machine-printed documents that also contain Hindi, English and Malayalam text words. The proposed method is based on visual discriminating features, which serve as useful clues for language identification. The method accurately identifies and separates the Kannada portions from the Hindi, English and Malayalam text. The experimental results show that the method effectively identifies and separates the Kannada language portions of the document, which in turn helps in document image retrieval.
The system can be enhanced in a number of directions. One could handle combinations of different fonts in a single document collection. Searching and retrieval in documents with a larger number of languages that share similar features is challenging. The system could also be extended to character-level identification.
REFERENCES
[1] P. Nagabhushan, Radhika M. Pai, "Modified Region Decomposition Method and Optimal Depth Decision Tree in the Recognition of Non-uniform Sized Characters – An Experimentation with Kannada Characters", Pattern Recognition Letters, 20, 1467-1475, (1999).
[2] A. Balasubramanian, Million Meshesha, C.V. Jawahar, "Retrieval from Document Image Collections", Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad - 500 032, India.
[3] A.L. Spitz, "Determination of the Script and Language Content of Document Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, 235-245, 1997.
[4] Ashwin T V, "A Font and Size Independent OCR for Printed Kannada using SVM", M.E. Project Report, Dept. of Electrical Engg., Indian Institute of Science, Bangalore, 2000.
[5] G.S. Peake, T.N. Tan, "Script and Language Identification from Document Images", Proc. Eighth British Machine Vision Conference, 2, 230-233, (1997).
[6] J. Hochberg, P. Kelly, T. Thomas, L. Kerns, "Automatic Script Identification from Document Images using Cluster-based Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, 176-181, 1997; Gopal Datt Joshi, Saurabh Garg, Jayanthi Sivaswamy, "Script Identification from Indian Documents", DAS 2006, LNCS 3872, 255-267, 2006.
[7] Koichi Ito, Ayumi Morita, Takafumi Aoki, Tatsuo Higuchi, Hiroshi Nakajima, Koji Kobayashi, "A Fingerprint Recognition Algorithm Using Phase-Based Image Matching for Low-Quality Fingerprints".
[8] M.C. Padma, P. Nagabhushan, "Horizontal and Vertical Linear Edge Features as Useful Clues in the Discrimination of Multilingual (Kannada, Hindi and English) Machine Printed Documents", Proc. National Workshop on Computer Vision, Graphics and Image Processing (WVGIP), Madurai, 204-209, (2002).
[9] M.C. Padma, P. Nagabhushan, "Identification and Separation of Text Words of Kannada, Hindi and English Languages through Discriminating Features", Proc. 2nd National Conference on Document Analysis and Recognition, Mandya, Karnataka, 252-260, (2003).
[10] M.C. Padma, P. Nagabhushan, "Study of the Applicability of Horizontal and Vertical Projections and Segmentation in Language Identification of Kannada, Hindi and English Documents", Proc. National Conference NCCIT, Kilakarai, Tamilnadu, 93-102, (2001).
[11] P. Nagabhushan, S.A. Angadi, B.S. Anami, "A Fuzzy Statistical Approach to Kannada Vowel Recognition based on Invariant Moments", Proc. 2nd National Conference, NCDAR, Mandya, 275-285, (2003).
[12] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Pearson Education Publications, India, 2002.
[13] Ramachandra Manthalkar, P.K. Biswas, "An Automatic Script Identification Scheme for Indian Languages", NCC, 2002.
[14] S. Basavaraj Patil, N.V. Subba Reddy, "Character Script Class Identification System using Probabilistic Neural Network for Multi-script Multilingual Document Processing", Proc. National Conference on Document Analysis and Recognition, Mandya, Karnataka, 1-8.
[15] S. Chanda, U. Pal, "English, Devanagari and Urdu Text Identification", Proc. International Conference on Document Analysis and Recognition, 538-545, (2005).
[16] Santanu Choudhury, Gaurav Harit, Shekar Madnani, R.B. Shet, "Identification of Scripts of Indian Languages by Combining Trainable Classifiers", ICVGIP 2000, Dec. 20-22, Bangalore, India.
[17] T.N. Tan, "Rotation Invariant Texture Features and their Use in Automatic Script Identification", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7), 751-756, (1998).
[18] U. Pal, B.B. Choudhuri, "Automatic Separation of Words in Multilingual Multi-script Indian Documents", Proc. 4th International Conference on Document Analysis and Recognition, 576-579, (1997).
[19] U. Pal, B.B. Choudhuri, "Automatic Identification of English, Chinese, Arabic, Devanagari and Bangla Script Line", Proc. 6th International Conference on Document Analysis and Recognition, 790-794, (2001).
[20] U. Pal, B.B. Choudhuri, "OCR in Bangla: an Indo-Bangladeshi Language", IEEE, no. 2, 1051-4651, (1994).
[21] U. Pal, B.B. Choudhuri, "Script Line Separation from Indian Multi-Script Documents", Proc. 5th International Conference on Document Analysis and Recognition (IEEE Comput. Soc. Press), 406-409, (1999).
[22] U. Pal, S. Sinha, B.B. Choudhuri, "Multi-Script Line Identification from Indian Documents", Proc. 7th International Conference on Document Analysis and Recognition (ICDAR 2003), vol. 2, 880-884, 2003.
[23] U. Pal, S. Sinha, B.B. Choudhuri, "Word-wise Script Identification from a Document containing English, Devanagari and Telugu Text", Proc. 2nd National Conference on Document Analysis and Recognition, Karnataka, India, 213-220, (2003).