Automatic identification of the script in a given document image facilitates many important applications, such as automatic archiving of multilingual documents, searching online archives of document images, and the selection of script-specific OCR in a multilingual environment. This paper provides a comparative study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods.
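As a rough illustration of how such a comparison might be set up, the following is a minimal scikit-learn sketch: the feature matrix and script labels are synthetic stand-ins, PCA and PLS are shown (SIR has no scikit-learn implementation; third-party packages such as `sliced` provide it), and k-NN is an assumed downstream classifier, not necessarily the one used in the paper.

```python
# Sketch: comparing PCA (unsupervised) and PLS (supervised) as dimension-
# reduction front ends for script classification. X and y are stand-ins
# for features extracted from document images and their script labels.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

X, y = make_classification(n_samples=600, n_features=64, n_informative=10,
                           n_classes=3, random_state=0)   # synthetic data
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# PCA: project onto directions of maximal variance, ignoring labels.
pca = PCA(n_components=10).fit(Xtr)
knn_pca = KNeighborsClassifier().fit(pca.transform(Xtr), ytr)
print("PCA + kNN accuracy:", knn_pca.score(pca.transform(Xte), yte))

# PLS: supervised projection; one-hot labels make the latent directions
# maximize covariance with class membership.
Y = LabelBinarizer().fit_transform(ytr)
pls = PLSRegression(n_components=10).fit(Xtr, Y)
knn_pls = KNeighborsClassifier().fit(pls.transform(Xtr), ytr)
print("PLS + kNN accuracy:", knn_pls.score(pls.transform(Xte), yte))
```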
Script Identification of Text Words from a Tri-Lingual Document Using Voting ... (CSCJournals)
In a multi-script environment, the majority of documents may contain text printed in more than one script/language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper proposes a model to identify and separate text words of Kannada, Hindi and English scripts from a printed tri-lingual document. The proposed method is trained to learn the distinct features of each script, and a binary tree classifier is used to classify the input text image. Experimentation involved 1500 text words for learning and 1200 text words for testing, carried out on both a manually created data set and a scanned data set. The results are very encouraging and demonstrate the efficacy of the proposed model: the average success rate is 99% for the manually created data set and 98.5% for the data set constructed from scanned document images.
Wavelet Packet Based Features for Automatic Script Identification (CSCJournals)
In a multi-script environment, archives commonly hold documents whose text regions are printed in different scripts. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of documents printed in seven scripts and to categorize them for further processing. South Indian documents printed in seven scripts - Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English - are considered. The document images are decomposed by Wavelet Packet Decomposition using the Haar basis function up to level two, and texture features are extracted from the sub-bands: the Shannon entropy is computed for each sub-band, and these entropy values are combined to form the texture feature vector. Experimentation involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier; the average success rate is found to be 99.68%.
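A minimal sketch of the feature extraction described above, assuming PyWavelets; the input image is a random stand-in for a binarized text block, and the exact entropy normalization used in the paper may differ.

```python
# Sketch: level-2 Haar wavelet packet decomposition of a text block,
# with one Shannon entropy value per sub-band as the texture feature.
import numpy as np
import pywt

def wavelet_packet_entropy(img, level=2):
    wp = pywt.WaveletPacket2D(data=img, wavelet='haar', maxlevel=level)
    feats = []
    for node in wp.get_level(level):            # all sub-bands at that level
        c = np.abs(node.data).ravel()
        p = c / (c.sum() + 1e-12)               # normalize to a distribution
        feats.append(-np.sum(p * np.log2(p + 1e-12)))   # Shannon entropy
    return np.array(feats)

img = np.random.rand(64, 64)                    # stand-in text block
print(wavelet_packet_entropy(img).shape)        # (16,) sub-band entropies
# These vectors would then feed a k-NN classifier, e.g.
# sklearn.neighbors.KNeighborsClassifier, as in the paper.
```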
A survey on Script and Language identification for Handwritten document images (iosrjce)
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it would be lost if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text through optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of Devanagari script. Shape information is utilised to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid Approach (cscpconf)
Punjabi text classification is the process of assigning predefined classes to unlabelled text documents. Because of the dramatic increase in the amount of content available in digital form, text classification has become an urgent need for managing digital data efficiently and accurately. Until now, no text classifier has been available for Punjabi text documents. Therefore, in this paper, existing classification algorithms such as Naïve Bayes and centroid-based techniques are applied to Punjabi text classification, and a new approach is proposed for Punjabi text documents that combines Naïve Bayes (to extract the relevant features and thereby reduce dimensionality) with ontology-based classification (which acts as the text classifier using the extracted features). These algorithms are evaluated on 184 Punjabi news articles on sports, classifying the documents into 7 classes: ਕ੍ਰਿਕਟ (krikaṭ), ਹਾਕੀ (hākī), ਕਬੱਡੀ (kabḍḍī), ਫੁਟਬਾਲ (phuṭbāl), ਟੈਨਿਸ (ṭainis), ਬੈਡਮਿੰਟਨ (baiḍmiṇṭan), ਓਲੰਪਿਕ (ōlmpik).
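A minimal sketch of the two-stage idea in scikit-learn; chi-squared scoring stands in for the paper's Naive-Bayes-based feature selection, a Multinomial NB classifier stands in for the ontology-based stage, and the toy corpus and labels are hypothetical.

```python
# Sketch: feature selection for dimensionality reduction, then a
# classifier over the reduced bag-of-words. Documents/labels are toys.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = ["ਕ੍ਰਿਕਟ ਮੈਚ ਜਿੱਤ", "ਹਾਕੀ ਟੀਮ ਖੇਡ", "ਕਬੱਡੀ ਕੱਪ ਮੁਕਾਬਲਾ"]   # toy corpus
labels = ["krikat", "haki", "kabaddi"]

clf = Pipeline([
    ("bow", CountVectorizer()),            # bag-of-words over Punjabi text
    ("select", SelectKBest(chi2, k=3)),    # keep only informative terms
    ("nb", MultinomialNB()),               # stand-in final classifier
]).fit(docs, labels)

print(clf.predict(["ਕ੍ਰਿਕਟ ਸਕੋਰ"]))
```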
DISCRIMINATION OF ENGLISH TO OTHER INDIAN LANGUAGES (KANNADA AND HINDI) FOR O... (IJCSEA Journal)
India is a multilingual, multi-script country. In every state of India two languages are in use: the state's local language and English. For example, in Andhra Pradesh, a state in India, a document may contain text words in English and Telugu script. For Optical Character Recognition (OCR) of such a bilingual document, it is necessary to identify the script before feeding the text words to the OCRs of the individual scripts. In this paper, we introduce a simple and efficient script identification technique for Kannada, English and Hindi text words of a printed document. The proposed approach is based on horizontal and vertical projection profiles for the discrimination of the three scripts; the features are extracted from the horizontal projection profile of each text word. We analysed 700 different words of Kannada, English and Hindi in order to extract the discriminating features and to develop the knowledge base. The proposed system is tested on 100 different document images containing more than 1000 text words of each script, and classification rates of 98.25%, 99.25% and 98.87% are achieved for Kannada, English and Hindi respectively.
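The raw material for such features is easy to compute; here is a small sketch, assuming a binarized word image (ink = 1) as input, with the image itself a random stand-in.

```python
# Sketch: horizontal and vertical projection profiles of a word image.
import numpy as np

def projection_profiles(word_img):
    """word_img: 2-D binary array for one segmented text word."""
    horizontal = word_img.sum(axis=1)   # ink pixels per row
    vertical = word_img.sum(axis=0)     # ink pixels per column
    return horizontal, vertical

word = (np.random.rand(24, 80) > 0.7).astype(int)   # stand-in word image
h, v = projection_profiles(word)
# e.g. the row of the maximum of `h` (the headline/shirorekha row in
# Devanagari-style scripts) is one candidate discriminating feature.
print(int(np.argmax(h)), h.shape, v.shape)
```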
AN EFFECTIVE ARABIC TEXT CLASSIFICATION APPROACH BASED ON KERNEL NAIVE BAYES ... (ijaia)
With the growing volume of electronic documents used in many applications, a fast and accurate text classification method is very important. Arabic text classification is one of the most challenging topics, probably because Arabic words have great variation in meaning, in addition to problems specific to the Arabic language. Many studies have shown that the Naive Bayes (NB) classifier is relatively robust, easy to implement, fast, and accurate in many fields, including text classification. However, non-linear classification problems and strong violations of the independence assumptions can lead to very poor NB performance. In this paper, we first preprocess the Arabic documents to tokenize only the Arabic words. Second, we convert those words into vectors using the term frequency-inverse document frequency (TF-IDF) technique. Third, we propose an efficient approach based on a Kernel Naive Bayes (KNB) classifier to address the non-linearity problem of Arabic text classification. Finally, experimental results and a performance evaluation on our collected Arabic topic-mining corpus are presented, showing the effectiveness of the proposed KNB classifier against other baseline classifiers.
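There is no stock "Kernel Naive Bayes" estimator in scikit-learn, so the following is only a loose sketch of the pipeline shape: TF-IDF vectors, dimensionality reduction, and one kernel density estimate per class as a stand-in for the kernel step; the English placeholder documents stand in for the Arabic corpus.

```python
# Sketch: TF-IDF -> reduced vectors -> per-class KDE, predicting by
# class log-density plus log-prior (a kernel-based Bayes stand-in).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KernelDensity

docs = ["sports news text today", "economy market report",
        "sports match result", "market prices rise"]   # placeholder corpus
labels = np.array([0, 1, 0, 1])

X = TruncatedSVD(n_components=2).fit_transform(
        TfidfVectorizer().fit_transform(docs))          # step 2: vectors

kdes, priors = {}, {}
for c in np.unique(labels):                             # step 3 stand-in
    kdes[c] = KernelDensity(bandwidth=0.5).fit(X[labels == c])
    priors[c] = np.log(np.mean(labels == c))

def predict(Xnew):
    scores = np.column_stack([kdes[c].score_samples(Xnew) + priors[c]
                              for c in sorted(kdes)])
    return np.array(sorted(kdes))[scores.argmax(axis=1)]

print(predict(X))
```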
Language Identifier for Languages of Pakistan Including Arabic and Persian (Waqas Tariq)
A language recognizer/identifier/guesser is a basic application used to identify the language of a text document. It simply takes a file as input and, after processing its text, decides the language of the document using three stages, LIJ-I, LIJ-II and LIJ-III. LIJ-I alone yields poor accuracy; it is strengthened by LIJ-II and boosted to a higher level of accuracy by LIJ-III. The system also calculates digram (character-pair) probabilities and average accuracy percentages. LIJ-I considers the complete character set of each language, while LIJ-II considers only the differences between character sets. A Java-based language recognizer is developed and presented in detail in this paper.
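The paper's tool is Java-based and its LIJ stages are not specified here; the following Python sketch only illustrates the underlying digram-probability idea, with one-line toy training samples per language.

```python
# Sketch: build a character-digram log-probability profile per language,
# then pick the language whose profile best explains the input text.
from collections import Counter
import math

def bigram_profile(text):
    pairs = Counter(text[i:i+2] for i in range(len(text) - 1))
    total = sum(pairs.values())
    return {bg: math.log(n / total) for bg, n in pairs.items()}

def score(text, profile, floor=math.log(1e-6)):
    # unseen digrams get a small floor probability
    return sum(profile.get(text[i:i+2], floor) for i in range(len(text) - 1))

profiles = {                          # toy training samples per language
    "urdu": bigram_profile("یہ ایک مثال ہے"),
    "persian": bigram_profile("این یک نمونه است"),
}
sample = "یہ مثال"
print(max(profiles, key=lambda lang: score(sample, profiles[lang])))
```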
A Novel Approach for Bilingual (English - Oriya) Script Identification and Re... (CSCJournals)
In most of our official papers and school textbooks, English words are interspersed within the Indian-language text, so there is a need for an Optical Character Recognition (OCR) system that can recognize such bilingual documents and store them for future use. In this paper we present an OCR system developed for the recognition of an Indian language, Oriya, together with Roman script in printed documents. For this purpose, it is necessary to separate the different scripts before feeding them to their individual OCR systems. First, we correct the skew, followed by segmentation; we then propose line-wise script differentiation. We rely on the upper and lower matras associated with Oriya and absent in English, and use the horizontal histogram to distinguish lines belonging to different scripts. After separation, the scripts are sent to their individual recognition engines.
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG... (ijnlc)
This article introduces a methodology for analyzing sentiment in Arabic text using a foreign lexical resource. Our method leverages a resource available in another language, such as SentiWordNet in English, to compensate for the limited resources of Arabic. The knowledge taken from the external resource is injected into the feature model while the machine-learning-based classifier is trained. The first step of our method builds the bag-of-words (BOW) model of the Arabic text. The second step calculates polarity scores using machine translation and the English SentiWordNet; the scores for each text are added to the model as three values for objective, positive, and negative polarity. The last step trains the ML classifier on that model to predict the sentiment of Arabic text. Our method improves performance over the BOW baseline in most cases and appears to be a viable approach to sentiment analysis of Arabic text where resources are limited.
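A minimal sketch of the three steps; the translation-plus-SentiWordNet lookup is abstracted into a hypothetical `polarity_scores()` helper (the actual MT system and lexicon files are not specified here), and logistic regression is an assumed stand-in for the paper's ML classifier.

```python
# Sketch: BOW features augmented with three injected polarity scores.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def polarity_scores(arabic_text):
    """Hypothetical helper: translate to English, look up SentiWordNet,
    and return aggregate (objective, positive, negative) scores."""
    return np.array([0.5, 0.3, 0.2])            # placeholder values

docs = ["نص إيجابي", "نص سلبي"]                  # toy Arabic texts
y = [1, 0]

bow = CountVectorizer().fit_transform(docs).toarray()   # step 1: BOW
sw = np.vstack([polarity_scores(d) for d in docs])      # step 2: scores
X = np.hstack([bow, sw])                                # inject into model
clf = LogisticRegression().fit(X, y)                    # step 3: train
print(clf.predict(X))
```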
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. Given this huge collection of data, there is a need to group (cluster) documents for speedy information retrieval. Document clustering collects documents into groups such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the text representation and the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is a bag of terms, which is often unsatisfactory because it does not exploit semantics. In this paper, texts are instead represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO and hybrid PSO+K-means algorithms are applied to cluster text in the Nepali language, and experimental evaluation is performed using intra-cluster and inter-cluster similarity.
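A small sketch of the synset-enrichment idea using NLTK's English WordNet as a stand-in (a Nepali WordNet would be needed to reproduce the paper; `nltk.download('wordnet')` is required once). In the hybrid variant, PSO would search for good initial centroids, which can be passed to K-means via its `init=` parameter.

```python
# Sketch: enrich each bag of terms with WordNet synonyms, then cluster.
from nltk.corpus import wordnet as wn          # needs nltk wordnet data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def enrich(tokens):
    """Append WordNet synonyms to the bag of terms."""
    out = list(tokens)
    for t in tokens:
        for syn in wn.synsets(t)[:2]:          # a couple of senses per term
            out.extend(l.name() for l in syn.lemmas())
    return " ".join(out)

docs = [["bank", "money"], ["river", "water"], ["loan", "cash"]]
X = TfidfVectorizer().fit_transform(enrich(d) for d in docs)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)   # cluster ids; PSO-found centroids could seed KMeans(init=...)
```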
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query, which lets users express their information need in their native language. The machine-translation-based (MT-based) approach to CLIR uses existing machine translation techniques to translate queries automatically. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION (ijnlc)
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date, and time. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naïve Bayes (MNB) classifier. The methodology is based on feature extraction from the training corpus using term frequency and inverse document frequency, fitting them with a TF-IDF vectorizer. The paper discusses the various issues in developing the proposed model, along with the details of implementation and performance evaluation. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens; the model achieves Precision, Recall and F1-measure of 83%, 79% and 81% respectively.
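A minimal sketch in the spirit of that model: a token-level Multinomial NB tagger over TF-IDF character n-grams; the Kannada tokens, tags, and feature set are toy assumptions, not the paper's corpus or features.

```python
# Sketch: TF-IDF vectorizer + Multinomial NB for token-level NERC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

tokens = ["ಬೆಂಗಳೂರು", "ರಾಮ", "ಸೋಮವಾರ", "ಪುಸ್ತಕ"]       # toy tokens
tags = ["LOCATION", "PERSON", "DATE", "O"]             # toy labels

nerc = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("mnb", MultinomialNB()),
]).fit(tokens, tags)

print(nerc.predict(["ಬೆಂಗಳೂರು"]))                     # -> ['LOCATION']
```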
ATAR: Attention-based LSTM for Arabizi transliteration (IJECEIAES)
A non-standard romanization of Arabic script, known as Arabizi, is widely used in Arabic online and SMS/chat communities. However, since state-of-the-art tools and applications for Arabic NLP expect Arabic to be written in Arabic script, handling content written in Arabizi requires special attention, either by building customized tools or by transliterating it into Arabic script. The latter approach is the more common one, and this work presents two significant contributions in this direction. The first is to collect and publicly release the first large-scale "Arabizi to Arabic script" parallel corpus, focusing on the Jordanian dialect and consisting of more than 25k pairs carefully created and inspected by native speakers to ensure the highest quality. Second, we present ATAR, an ATtention-based LSTM model for ARabizi transliteration. Training and testing this model on our dataset yields impressive accuracy (79%) and BLEU score (88.49).
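A minimal PyTorch sketch of an attention-based LSTM transliterator of this general shape; the vocabulary sizes, hidden size, and dot-product attention are assumptions, not ATAR's actual architecture or training setup.

```python
# Sketch: character-level encoder-decoder LSTM with dot-product attention.
import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hid=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hid)
        self.tgt_emb = nn.Embedding(tgt_vocab, hid)
        self.enc = nn.LSTM(hid, hid, batch_first=True)
        self.dec = nn.LSTMCell(hid + hid, hid)   # input: char emb + context
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt):
        enc_out, (h, c) = self.enc(self.src_emb(src))    # (B, S, H)
        h, c = h[0], c[0]
        logits = []
        for t in range(tgt.size(1)):                     # teacher forcing
            # dot-product attention over encoder states
            attn = torch.softmax((enc_out @ h.unsqueeze(2)).squeeze(2), dim=1)
            ctx = (attn.unsqueeze(2) * enc_out).sum(dim=1)        # (B, H)
            h, c = self.dec(torch.cat([self.tgt_emb(tgt[:, t]), ctx], 1),
                            (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, T, vocab)

model = Seq2SeqAttn(src_vocab=40, tgt_vocab=50)
src = torch.randint(0, 40, (2, 7))    # toy Arabizi character ids
tgt = torch.randint(0, 50, (2, 6))    # toy Arabic character ids
print(model(src, tgt).shape)          # torch.Size([2, 6, 50])
```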
Character Recognition System for Modi Script (ijceronline)
Optical character recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-readable text, and character recognition is one of the oldest fields of research since the advent of computers. OCR systems have been developed for several languages such as English, Devanagari (Hindi) and Bangla, but little work has been done for Modi script. Modi script came into use in Maharashtra in the 17th century and was widely used there for writing up to 1950; it carries a large body of literature by various philosophers and has its own historical importance. Our objective is to convert that literature into computer-readable form using optical character recognition, so the proposed system will be beneficial for extracting the literature written in Modi script. This work elaborates the algorithm required for a Modi script OCR. After an initial segmentation stage, we use Affine Moment Invariants to calculate the moments of each character; these moments are used to train the system and build the database. Classification is then performed using fuzzy logic.
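For illustration, the first affine moment invariant (Flusser and Suk) can be computed directly from central moments; this sketch uses OpenCV's moment routine on a synthetic blob, while a full system would compute several such invariants per segmented Modi character.

```python
# Sketch: first affine moment invariant I1 from central moments.
import cv2
import numpy as np

def first_affine_invariant(binary_char):
    m = cv2.moments(binary_char.astype(np.uint8), binaryImage=True)
    # I1 = (mu20*mu02 - mu11^2) / mu00^4, invariant under affine maps
    return (m["mu20"] * m["mu02"] - m["mu11"] ** 2) / (m["m00"] ** 4)

char = np.zeros((32, 32), np.uint8)
cv2.circle(char, (16, 16), 8, 1, -1)            # stand-in "character"
print(first_affine_invariant(char))
```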
Language Identification from a Tri-lingual Printed Document: A Simple Approach (IJERA Editor)
In a multi-script, multilingual country like India, a document may contain text lines in more than one script/language. In such an environment, a multilingual Optical Character Recognition (OCR) system is needed to read multi-script documents, and to make it successful the different script regions of the document must be identified before the document is fed to the OCRs of the individual languages. In this context, this paper addresses the requirements of a particular region, Andhra Pradesh, a state in India, where any document, including official ones, may contain text in three languages: Telugu, Hindi and English. The objective is therefore a system that accurately identifies and separates Telugu, Hindi and English text lines in a printed multilingual document, and groups portions of the document in other languages into a separate category, OTHERS. The proposed method is developed by thoroughly analysing the top and bottom profiles of printed text lines. Experimentation involved 900 text lines for learning and 900 text lines for testing; the performance achieved is 95.67%.
A New Method for Identification of Partially Similar Indian Scripts (CSCJournals)
In this paper, the texture symmetry/non-symmetry factor is exploited to characterize script texture using Bi-Wavelants, which give the factor of symmetry/non-symmetry in terms of the third cumulant, while the bi-spectrum gives the quadratically coupled frequencies. The envelope of the bi-spectrum (Bi-Wavelant) provides an accurate measure of the symmetry/non-symmetry of the script texture. Classification is performed by an SVM trained on the roots of the envelope, found using the Newton-Raphson technique. The method successfully identifies eight Indian scripts: Devanagari, Urdu, Gujarati, Telugu, Assamese, Gurmukhi, Kannada, and Bangla. It can segment any kind of document with very good results, and the identification results are excellent.
SCRIPTS AND NUMERALS IDENTIFICATION FROM PRINTED MULTILINGUAL DOCUMENT IMAGES (cscpconf)
Identification of the scripts in a multi-script document is one of the important steps in the design of an OCR system for successful analysis and recognition. Most optical character recognition (OCR) systems can recognize at most a few scripts, but for large archives of document images containing different scripts, there must be some way to automatically categorize the documents before applying the proper OCR to them. Much work has already been reported in this area; in the Indian context, though some results have been reported, the task is still in its infancy. This paper presents research on the identification of Tamil, English and Hindi scripts at the word level, irrespective of font face and size, and also identifies English numerals in multilingual document images. The proposed technique applies a document vectorization method that generates vectors from nine zones segmented over the characters, based on their shape, density and transition features. The script is then determined using rule-based classifiers and their sub-classifiers, whose classification rules are derived from the vectors. The proposed system identifies scripts from document images even when they suffer from noise and other kinds of distortion. Experiments and simulations show that the proposed technique identifies scripts and numerals with minimal pre-processing and high accuracy; in future, it can be extended to other scripts.
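The density part of such a nine-zone vector is straightforward to sketch; the shape and transition features would be appended in the same way, and the rule set itself is not reproduced here.

```python
# Sketch: 3x3 zone segmentation of a character image, one ink density
# per zone, giving a 9-dimensional vector for the rule-based classifier.
import numpy as np

def zone_densities(char_img, grid=3):
    h, w = char_img.shape
    zones = []
    for i in range(grid):
        for j in range(grid):
            block = char_img[i*h//grid:(i+1)*h//grid,
                             j*w//grid:(j+1)*w//grid]
            zones.append(block.mean())          # fraction of ink pixels
    return np.array(zones)

char = (np.random.rand(30, 30) > 0.6).astype(int)   # stand-in character
print(zone_densities(char))
```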
AN APPROACH FOR SCRIPT IDENTIFICATION IN PRINTED TRILINGUAL DOCUMENTS USING T... (ijaia)
In this work, we review the effectiveness of texture features for script classification. A rectangular white-space analysis algorithm is used to analyze and identify heterogeneous layouts of document images. The texture features, namely color texture moments, the Local Binary Pattern (LBP) and the responses of Gabor, LM-filter, S-filter and R-filter banks, are extracted, and combinations of these are considered in the classification. A probabilistic neural network and a Nearest Neighbor classifier are used for classification. To corroborate the adequacy of the proposed strategy, experiments were conducted on our own data set. To study the effect on classification accuracy we vary the database size, and the results show that combining multiple features vastly improves performance.
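Of the features named above, LBP is the simplest to sketch; this assumes scikit-image, uses a random block as a stand-in for a document region, and omits the other filter banks and the classifier stage.

```python
# Sketch: uniform LBP histogram as a texture feature for one block.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_block, P=8, R=1):
    lbp = local_binary_pattern(gray_block, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist                                  # P+2 uniform-LBP bins

block = np.random.rand(64, 64)                   # stand-in document block
print(lbp_histogram(block))
```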
An Empirical Study on Identification of Strokes and their Significance in Scr... (IJMER)
An Optical Character Recognition for Handwritten Devanagari Script (IJERA Editor)
Optical Character Recognition is the process of recognizing characters from scanned documents, and many OCR systems are now available on the market. However, most of these systems work for Roman, Chinese, Japanese and Arabic characters; there is not a sufficient body of work on Indian scripts such as Devanagari, so this paper presents a review of optical character recognition for handwritten Devanagari script.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short, salient text from a source document. In this paper, we study various summarization techniques. Plenty of techniques have been developed for English and other Indian languages, but very little effort has been made for Hindi. Here we discuss various techniques with respect to features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA... (ijnlc)
The comprehension of entire handwritten documents is a challenging problem comprising several difficult tasks. Given a handwritten document, its layout must first be analyzed to isolate the different content types; these content types can then be routed to specific systems, including writer-style, image, or table recognizers. Research in automatic writer identification has mainly centred on the statistical approach, which has led to the selection and extraction of statistical features such as run-length distributions, slant distribution, entropy, and the edge-hinge distribution. The edge-hinge distribution outperforms all other statistical features. It characterizes the changes in direction of a writing stroke in handwritten text, and it is extracted by means of a window slid over an edge-detected offline scanned image. Whenever the central pixel of the window is on, the two edge fragments (i.e. connected sequences of pixels) emerging from this central pixel are considered, and their directions are measured and stored as pairs. A joint probability distribution is obtained from a large set of such samples.
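A deliberately simplified sketch of that idea: at every edge pixel, take pairs of directions toward its edge-pixel neighbours and accumulate a joint histogram. The real edge-hinge feature follows longer edge fragments, so this only illustrates the bookkeeping, not the full method.

```python
# Simplified edge-hinge sketch: joint histogram of direction pairs at
# edge pixels of a binary edge map (e.g. from cv2.Canny(...) > 0).
import numpy as np

DIRS = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]

def edge_hinge(edges):
    hist = np.zeros((8, 8))
    h, w = edges.shape
    for y, x in zip(*np.nonzero(edges)):
        nbrs = [k for k, (dy, dx) in enumerate(DIRS)
                if 0 <= y+dy < h and 0 <= x+dx < w and edges[y+dy, x+dx]]
        for a in nbrs:                        # all unordered direction pairs
            for b in nbrs:
                if a < b:
                    hist[a, b] += 1
    return hist / max(hist.sum(), 1)          # joint probability distribution

edges = np.random.rand(48, 48) > 0.9          # stand-in edge map
print(edge_hinge(edges).sum())                # ~1.0
```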
Recognition of Words in Tamil Script Using Neural Network (IJERA Editor)
In this paper, word recognition using a neural network is proposed. The recognition process starts with partitioning the document image into lines, words, and characters, and then capturing the local features of the segmented characters. After the characters are classified, the word image is transformed into a unique code based on the character codes. This code ideally describes any form of word, including words with mixed styles and different sizes. The sequence of character codes of the word forms the input pattern, and the word code is the target value of the pattern. A neural network is used to train the word patterns; the trained network is tested with word patterns, which are recognized or rejected based on the network error value. Experiments have been conducted with a local database to evaluate the performance of the word recognition system, and good accuracy was obtained. This method can be applied to a word recognition system for any language, as the training is based only on the unique codes of the characters and words of that language.
The paper addresses the automation of the task of an epigraphist in reading and deciphering inscriptions. The automation steps include pre-processing, segmentation, feature extraction and recognition. Pre-processing involves enhancement of the degraded ancient document images, achieved through spatial filtering methods, followed by binarization of the enhanced image. Segmentation is carried out using Drop Fall and Water Reservoir approaches to obtain sampled characters. Next, Gabor and zonal features are extracted for the sampled characters and stored as feature vectors for training. An Artificial Neural Network (ANN) is trained with these feature vectors and later used for the classification of new test characters; finally, the classified characters are mapped to their modern forms. The system showed good results when tested on nearly 150 samples of ancient Kannada epigraphs from the Ashoka and Hoysala periods, achieving an average recognition accuracy of 80.2% for the Ashoka period and 75.6% for the Hoysala period.
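A small sketch of the Gabor-plus-ANN stage, assuming OpenCV and scikit-learn with random stand-in character images; the filter-bank parameters are illustrative, and the Drop Fall / Water Reservoir segmentation and zonal features are not reproduced.

```python
# Sketch: Gabor filter bank responses summarized per orientation,
# feeding a small MLP as the ANN classifier.
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

def gabor_features(char_img, thetas=(0, 45, 90, 135)):
    feats = []
    for t in thetas:
        k = cv2.getGaborKernel((9, 9), 2.0, np.deg2rad(t), 8.0, 0.5)
        resp = cv2.filter2D(char_img.astype(np.float32), -1, k)
        feats += [resp.mean(), resp.std()]   # energy summary per direction
    return feats

X = [gabor_features(np.random.rand(32, 32)) for _ in range(20)]  # stand-ins
y = np.arange(20) % 2                        # toy character labels
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(X, y)
print(clf.predict(X[:4]))
```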
Angular Symmetric Axis Constellation Model for Off-line Odia Handwritten Char... (IJAAS Team)
Optical character recognition is one of the emerging research topics in image processing, with an extensive area of application in pattern recognition. Odia handwritten script is an area of particular research concern because Odia is one of the oldest and most popular languages of the state of Odisha, India. Odia characters are usually handwritten and are captured by a scanner into machine-readable form. Several recognition techniques have evolved for various languages, but the writing pattern of Odia characters is predominantly curved, which makes recognition more difficult. In this article we present a novel approach to Odia character recognition based on an angle-based symmetric-axis feature extraction technique that yields high recognition accuracy. This empirical model generates unique angle-based boundary points on each skeletonised character image. These points are interconnected in order to extract row and column symmetry axes, and we extract a feature matrix containing the mean distance and mean angle of the rows, and the mean distance and mean angle of the columns, measured from the centre of the image to the midpoint of the symmetric axis. The system applies 10-fold validation to a random forest (RF) classifier and an SVM over the feature matrix. We considered a standard database of 200 images for each of the 47 Odia characters and 10 Odia numerals for simulation. The SVM and RF yield accuracy rates of 96.3% and 98.2% on the NIT Rourkela Odia character database, and 88.9% and 93.6% on the ISI Kolkata Odia numeral database.
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo... (iosrjce)
Segmentation plays a major role in processing document scripts for the extraction of various features, and many researchers are working to make the segmentation process simple as well as efficient. In this paper, a simple technique for both line and word segmentation of a script document is proposed. The main objective of the technique is to recognize the spaces that separate two text lines; a similar procedure is followed for word segmentation. In this work, three different scanned documents were taken as input images for both the line and word segmentation techniques. The results were outstanding, with 100% average accuracy for both line and word segmentation. Evaluation results show that our method outperforms several competing methods.
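The gap-finding idea is easy to sketch: blank rows of the horizontal projection separate text lines, and the same test on the vertical projection of each line separates words. The synthetic page below is a stand-in.

```python
# Sketch: split a binary page (ink = 1) into lines at empty projection rows.
import numpy as np

def segment_lines(page):
    """Returns (start, end) row spans of text lines."""
    ink = page.sum(axis=1) > 0
    spans, start = [], None
    for r, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            spans.append((start, r))
            start = None
    if start is not None:
        spans.append((start, len(ink)))
    return spans

page = np.zeros((20, 40), int)
page[2:5, 5:30] = 1
page[9:12, 3:25] = 1                  # two synthetic text lines
print(segment_lines(page))            # [(2, 5), (9, 12)]
```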
STRUCTURAL FEATURES FOR RECOGNITION OF HAND WRITTEN KANNADA CHARACTER BASED O... (ijcseit)
Research in image processing involves many active areas; among these, recognition of handwritten characters holds much promise and is a challenging one. The idea is to enable the computer to intelligibly recognize handwritten input. In this paper, a new method that uses structural features and a Support Vector Machine (SVM) classifier for the recognition of handwritten Kannada characters is presented. With the proposed method, average recognition accuracies of 89.84% and 85.14% are obtained for handwritten Kannada vowels and consonants respectively, in spite of the inherent variations.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Macroeconomics- Movie Location
This will be used as part of your Personal Professional Portfolio once graded.
Objective:
Prepare a presentation or a paper using research, basic comparative analysis, data organization and application of economic information. You will make an informed assessment of an economic climate outside of the United States to accomplish an entertainment industry objective.
This presentation was provided by Steph Pollock of The American Psychological Association’s Journals Program, and Damita Snow, of The American Society of Civil Engineers (ASCE), for the initial session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session One: 'Setting Expectations: a DEIA Primer,' was held June 6, 2024.
The simplified electron and muon model, Oscillating Spacetime: The Foundation...RitikBhardwaj56
Discover the Simplified Electron and Muon Model: A New Wave-Based Approach to Understanding Particles delves into a groundbreaking theory that presents electrons and muons as rotating soliton waves within oscillating spacetime. Geared towards students, researchers, and science buffs, this book breaks down complex ideas into simple explanations. It covers topics such as electron waves, temporal dynamics, and the implications of this model on particle physics. With clear illustrations and easy-to-follow explanations, readers will gain a new outlook on the universe's fundamental nature.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Introduction to AI for Nonprofits with Tapp NetworkTechSoup
Dive into the world of AI! Experts Jon Hill and Tareq Monaur will guide you through AI's role in enhancing nonprofit websites and basic marketing strategies, making it easy to understand and apply.
A review of the growth of the Israel Genealogy Research Association Database Collection for the last 12 months. Our collection is now passed the 3 million mark and still growing. See which archives have contributed the most. See the different types of records we have, and which years have had records added. You can also see what we have for the future.
DIMENSION REDUCTION FOR SCRIPT CLASSIFICATION- PRINTED INDIAN DOCUMENTS
International Journal of Advanced Information Technology (IJAIT) Vol. 7, No. 1/2/3, June 2017
DOI : 10.5121/ijait.2017.7301
DIMENSION REDUCTION FOR SCRIPT
CLASSIFICATION- PRINTED INDIAN DOCUMENTS
Hamsaveni L1, Pradeep C2 and Chethan H K3
1 Department of Studies in Computer Science, Manasagangotri, University of Mysore, Mysore
2 Department of Computer Science and Engineering, Rajarajeswari College of Engineering, Bangalore
3 Department of Computer Science and Engineering, Maharaja Institute of Technology, Mysore
ABSTRACT
Automatic identification of a script in a given document image facilitates many important applications such
as automatic archiving of multilingual documents, searching online archives of document images and for
the selection of script specific OCR in a multilingual environment. This paper provides a comparison study
of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR)
and principal component analysis (PCA), and evaluates the relative performance of classification
procedures incorporating those methods. For a given script, we extract Gray Level Co-occurrence Matrix (GLCM) and Scale Invariant Feature Transform (SIFT) features. The features are extracted globally from a given text block, which avoids any complex and unreliable segmentation of the document image into lines and characters. The extracted features are reduced using various dimension reduction techniques, and the reduced features are fed into a Nearest Neighbour classifier. The proposed scheme is thus efficient and can be used for many practical applications which require processing large volumes of data. The scheme has been tested on 10 Indian scripts and found to be robust to the scanning process and relatively insensitive to changes in font size. The proposed system achieves good classification accuracy on a large testing data set.
KEYWORDS
SIFT, GLCM, PLS, SIR, PCA, Nearest Neighbour
1. INTRODUCTION
Document image analysis has been an active research area for a few decades, facilitating the establishment of paperless offices across the world. The process of converting textual symbols present on printed and/or handwritten paper to a machine-understandable format is known as optical character recognition (OCR), which is the core of the field of document image analysis. OCR technology for Indian documents is at an emerging stage, and most Indian OCR systems can read documents written in only a single script. As per the trilingual formula of the Indian constitution [1], every state Government has to produce official documents containing the national language (Hindi), the official language (English) and the state (or regional) language. According to the three-language policy adopted by most Indian states, the documents produced in the Indian state of Karnataka are composed of texts in the regional language Kannada, the national language Hindi and the widely used world language English. Moreover, the majority of documents found in most private and Government sectors of Indian states are tri-lingual (a document having text in three languages). So, there is a growing demand to automatically process these tri-lingual documents in every state in India, including Karnataka.
Monolingual OCR systems cannot process such multi-script documents without human involvement to delineate the different script zones of multi-lingual pages before activating the script-specific OCR engine. The need for such manual involvement increases expense and, crucially, delays the overall image-to-text conversion. Thus, incoming document images must be forwarded automatically to the appropriate OCR engine based on knowledge of the constituent scripts. In view of this, identification of script and/or language is one of the elementary tasks of multi-script document processing. A script recognizer, therefore, simplifies the task of OCR by enhancing the recognition accuracy and reducing the computational complexity.
2. PREVIOUS WORK
Existing works on automatic script identification are classified into either local or global approaches. Local approaches extract features from a list of connected components such as lines, words and characters in the document images, and hence are well suited to documents where the script type differs at line or word level. In contrast, global approaches analyse regions comprising at least two lines and hence do not require fine segmentation. Global approaches are applicable to documents where the whole document, a paragraph or a set of text lines is in one script only. The script identification task is simplified and performed faster with the global approach than with the local one. Ample work has been reported in the literature on both Indian and non-Indian scripts using local and global approaches.
2.1 Local approaches on Indian scripts
Pal and Chaudhuri [2] have proposed an automatic technique for separating text lines of 12 Indian scripts (English, Hindi, Bangla, Gujarati, Tamil, Kashmiri, Malayalam, Oriya, Punjabi, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with any one of the other scripts. This method works only when the triplet type of the document is known. The script identification technique explored by Pal [3] uses a binary tree classifier for 12 Indian scripts with a large set of features. Basavaraj Patil and Subbareddy [4] have proposed a neural network based system for script identification of Kannada, Hindi and English. Dhandra et al. [5] have exploited discriminating features (aspect ratio, strokes, eccentricity, etc.) as a tool for determining the script at word level in bi-lingual documents containing Kannada, Tamil and Devanagari together with English numerals. A method to automatically separate text lines of Roman, Devanagari and Telugu scripts has been proposed by Pal et al. [6]. Lijun et al. [7] have developed a method for Bangla and English script identification based on the analysis of connected component profiles. Vipin et al. [8] have presented an approach to automatically identify Kannada, Hindi and English using a set of features, viz. cavity analysis, end point analysis, corner point analysis, line based analysis and Kannada base character analysis. Word-wise script identification systems for Indian scripts have been discussed in [24].
2.2 Global approaches on Indian scripts
An adequate amount of work has been reported in the literature using global approaches. S. Chaudhury et al. [9] have proposed a method for the identification of Indian languages by combining a Gabor filter based technique and a direction distance histogram classifier, considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. G. D. Joshi et al. [10] have presented a script identification technique for 10 Indian scripts using a set of features extracted from log-Gabor filters. Dhanya et al. [11] have used Linear Support Vector Machine (LSVM), K-Nearest Neighbour (K-NN) and Neural Network (NN) classifiers on Gabor-based and zoning features to classify Tamil and English scripts. Hiremath and Shivashankar [12] have proposed a novel approach for script identification of South Indian scripts using wavelet based co-occurrence histogram features. Ramachandra and Biswas [13] have proposed a method based on rotation invariant texture features using multi-channel Gabor filters for identifying the Indian languages Bengali, Kannada, Malayalam, Oriya, Telugu and Marathi. S. R. Kunte and S. Samuel [14] have suggested a neural approach to on-line script recognition for the Telugu language employing wavelet features. Nagabhushan et al. [15] have presented an intelligent pin code script identification methodology based on texture analysis using modified invariant moments. Peeta et al. [16] have presented a technique using Gabor filters for script identification in Indian bilingual documents.
2.3 Local and global approaches on non-Indian scripts
A substantial amount of work has also been carried out on non-Indian languages. Spitz [17] has proposed a system which relies on specific, well-defined pixel structures for script identification. Such features include the locations and numbers of upward concavities in the script image, the optical density of connected components, and the frequency and combination of relative character heights. This approach has been shown to be successful in distinguishing Asian languages (Japanese, Chinese, and Korean) from European languages (English, French, German, and Russian). Wood et al. [18] have proposed a projection profile method to determine Roman, Russian, Arabic, Korean and Chinese characters. Hochberg et al. [19] have presented a method for automatically identifying the script of a binary document image using cluster-based text symbol templates. In Ding et al. [20], a method that uses a combined analysis of several discriminating statistical features to classify Oriental and European scripts is presented. Tan et al. [21] have proposed a rotation invariant texture feature extraction method for automatic script and language identification from document images using multiple channel (Gabor) filters and gray level co-occurrence matrices for seven languages: Chinese, English, Greek, Korean, Malayalam, Persian and Russian. A. Busch et al. [22] have presented the use of texture features (gray level co-occurrence matrix and Gabor energy features) for determining the script of a document image. B. Kumar et al. [23] have used topological and structural features with a rule based classifier for line based multi-script identification.
It can be seen from the references cited above that ample work has been done in the area of document script/language identification. Even though a considerable amount of work has been carried out on Indian script identification, few attempts address all the languages. So, intensive work needs to be done in this field as the demand is increasing, and the existing methods have to be improved to reach a stage of satisfactory practical application. It is in this direction that this work proposes a model that automatically identifies the languages in a given document. We propose a classification scheme which uses a global approach and demonstrate its ability to classify 10 Indian language scripts. In section 3, we describe the preprocessing scheme. Feature extraction is presented in section 4. The various dimension reduction techniques are discussed in section 5. Results of the scheme tested over a large data set are presented in section 6.
3. PREPROCESSING
Our scheme first segments the text area from the document image by removing the upper, lower, left and right blank regions. After this stage, we have an image which has textual and non-textual regions. This is then binarised after removing the graphics and pictures (at present the removal of non-textual information is performed manually, though page segmentation algorithms such as [12] could readily be employed to perform this automatically). Text blocks of a predefined size (100×200 pixels) are then extracted. It should be noted that a text block may contain lines with different font sizes and variable spaces between lines, words and characters. Numerals may also appear in the text.
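To make these steps concrete, the following is a minimal Python sketch, assuming OpenCV and NumPy are available; the binarisation method (Otsu) and the tiling strategy are illustrative assumptions, since the paper does not specify them.

    import cv2
    import numpy as np

    BLOCK_H, BLOCK_W = 100, 200   # predefined block size from Section 3

    def extract_text_blocks(image_path):
        gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Binarise with Otsu's threshold (an assumed choice of method);
        # text pixels become foreground.
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Remove the blank margins by cropping to the bounding box of the ink.
        ys, xs = np.nonzero(binary)
        text = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        # Tile the remaining text area into fixed-size 100 x 200 blocks.
        blocks = []
        for r in range(0, text.shape[0] - BLOCK_H + 1, BLOCK_H):
            for c in range(0, text.shape[1] - BLOCK_W + 1, BLOCK_W):
                blocks.append(text[r:r + BLOCK_H, c:c + BLOCK_W])
        return blocks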
4. FEATURE EXTRACTION
Feature extraction is a necessary step for any classification task. For image object classification, the use of texture and shape features has proved to be quite effective in many applications. There are many ways of calculating texture feature descriptors; GLCM is one of them, and many descriptors can be obtained from the computed co-occurrence matrix. SIFT based descriptors describe a given object with respect to a set of interest points which are invariant to scale, translation, partial occlusion and clutter. These feature descriptors have been used successfully for object recognition, robotic mapping, etc.
In our work, for each script, we computed 4 texture features: contrast, homogeneity, correlation and energy. For each object, the SIFT algorithm generates a feature vector of 128 elements, so each image object is represented by a feature vector of 132 elements.
4.1 GLCM Based Texture Feature Descriptors
Texture features based on the spatial co-occurrence of pixel values are probably the most widely used texture feature descriptors, having been used in several application domains such as analysis of remotely sensed images and image segmentation. Co-occurrence texture features are extracted from an image in two steps. First, the pairwise spatial co-occurrences of pixels separated by a given offset and angle are computed and stored in a gray level co-occurrence matrix. Second, the GLCM is used to compute a set of scalar quantities that characterize different aspects of the underlying texture. We have worked with four GLCM based descriptors, namely Contrast, Correlation, Homogeneity and Energy [26].
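As an illustration, the four descriptors can be computed with scikit-image as sketched below; the offset distance and angle set are assumptions, since the paper does not state which spatial offsets were used.

    import numpy as np
    from skimage.feature import graycomatrix, graycoprops

    def glcm_features(block):
        # Build the gray level co-occurrence matrix for offset distance 1
        # at four angles (the offsets are illustrative assumptions).
        glcm = graycomatrix(block.astype(np.uint8), distances=[1],
                            angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],
                            levels=256, symmetric=True, normed=True)
        # Average each scalar descriptor over the four angles:
        # 4 values per text block, as described above.
        return np.array([graycoprops(glcm, prop).mean()
                         for prop in ('contrast', 'correlation',
                                      'homogeneity', 'energy')])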
4.2 SIFT Feature Descriptors
In computer vision, SIFT is used to detect and describe local features in an image. SIFT features are used for reliable matching between different views of the same object; the extracted features are invariant to scale and orientation and are partially invariant to illumination changes. SIFT feature extraction is a four step process. In the first step, the locations of potential interest points are computed by finding the extrema of a set of Difference of Gaussian (DoG) filters applied to the image at different scales. Interest points located in areas of low contrast or along edges are then discarded. After that, an orientation is assigned to each remaining point based on local image gradients. Finally, a local image feature based on image gradients is calculated in the neighbouring region of each key point. Every feature is defined on the 4 × 4 neighbourhood of a key point and is a vector of 128 elements [27].
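A sketch of the per-block SIFT computation with OpenCV follows; pooling the keypoint descriptors by averaging into a single 128-element vector is our assumption, as the paper does not describe how per-keypoint descriptors are combined into the 132-element block vector.

    import cv2
    import numpy as np

    def sift_features(block):
        # Detect keypoints and compute one 128-element descriptor per keypoint.
        sift = cv2.SIFT_create()
        _, descriptors = sift.detectAndCompute(block.astype(np.uint8), None)
        if descriptors is None:
            # No keypoints found in this block: fall back to a zero vector.
            return np.zeros(128)
        # Pool the per-keypoint descriptors into a single 128-element vector
        # by averaging (an assumption; the paper does not specify the pooling).
        return descriptors.mean(axis=0)

    # The per-block feature vector of 132 elements described in Section 4:
    # features = np.concatenate([glcm_features(block), sift_features(block)])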
5. DIMENSION REDUCTION
The extracted features are reduced using various dimension reduction techniques. One way to achieve dimension reduction is to transform the large number of original variables (features) into a new set of variables (components), which are uncorrelated and ordered so that the first few account for most of the variation in the data. The K new variables (components) can then replace the initial p variables (features), thereby reducing the data from the high p-dimensional space to a lower K-dimensional one. PCA, PLS and SIR are three such methods for dimension reduction. To describe them, let X be the n × p matrix of n samples and p features, y be the n × 1 vector of response values, and SX be the p × p covariance matrix.
5.1 Principal Component Analysis
PCA is a well-known method of dimension reduction (Jolliffe, [30]). The basic idea of PCA is to
reduce the dimensionality of a data set, while retaining as much as possible the variation present
in the original predictor variables. This is achieved by transforming the p original variables X =
[x1, x2, …, xp] to a new set of K predictor variables, T = [t1, t2, …, tK], which are linear
combinations of the original variables. In mathematical terms, PCA sequentially maximizes the
variance of a linear combination of the original predictor variables,
uK = arg max u'u=1 Var(Xu) (1)
subject to the constraint ui'SX uj = 0, for all 1 ≤ i < j. The orthogonality constraint ensures that the linear combinations are uncorrelated, i.e. Cov(Xui, Xuj) = 0, i ≠ j. These linear combinations
ti = Xui (2)
are known as the principal components (PCs) (Massey, [31]). Geometrically, these linear
combinations represent the selection of a new coordinate system obtained by rotating the original
system. The new axes represent the directions with maximum variability and are ordered in terms
of the amount of variation of the original data they account for. The first PC accounts for as much
of the variability as possible, and each succeeding component accounts for as much of the
remaining variability as possible. Computation of the principal components reduces to the
solution of an eigenvalue-eigenvector problem. The projection vectors (or called the weighting
vectors) u can be obtained by eigenvalue decomposition on the covariance matrix SX,
SX ui = λi ui (3)
where λi is the i-th eigenvalue in the descending order for i=1,…,K, and ui is the corresponding
eigenvector. The eigenvalue λi measures the variance of the i-th PC and the eigenvector ui
provides the weights (loadings) for the linear transformation (projection). The maximum number
of components K is determined by the number of nonzero eigenvalues, which is the rank of SX, and K ≤ min(n, p). The computational cost of PCA, determined by the number of original predictor variables p and the number of samples n, is of the order of min(np² + p³, pn² + n³). In other words, the cost is O(pn² + n³) when p > n.
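The following minimal NumPy sketch mirrors equation (3): an eigendecomposition of the covariance matrix SX followed by projection onto the leading K eigenvectors.

    import numpy as np

    def pca_components(X, K):
        Xc = X - X.mean(axis=0)                # centre the predictors
        SX = np.cov(Xc, rowvar=False)          # p x p covariance matrix
        eigvals, eigvecs = np.linalg.eigh(SX)  # eigh since SX is symmetric
        order = np.argsort(eigvals)[::-1]      # eigenvalues in descending order
        U = eigvecs[:, order[:K]]              # loadings u1, ..., uK
        return Xc @ U                          # scores T = [t1, ..., tK]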
5.2 Partial Least Squares
The objective of constructing components in PLS is to maximize the covariance between the
response variable y and the original predictor variables X,
wK = arg max w'w=1 Cov(Xw, y) (4)
subject to the constraint wi'SX wj = 0, for all 1 ≤ i < j. The central task of PLS is to obtain the vectors of optimal weights wi (i = 1,…,K) to form a small number of components that best predict the response variable y. Note that PLS is a “supervised” method because it uses information on both X and y in constructing the components, while PCA is an “unsupervised” method that utilizes the X data only.
To derive the components [t1, t2, …, tK], PLS decomposes X and y to produce a bilinear representation of the data (Martens and Naes, [38]):
X = t1w'1 + t2w'2 + ... + tKw'K + E (5)
and
y = t1q1 + t2q2 + ... + tKqK + F (6)
where w’s are vectors of weights for constructing the PLS components t=Xw, q’s are scalars, and
E and F are the residuals. The idea of PLS is to estimate w and q by regression. Specifically, PLS
fits a sequence of bilinear models by least squares, thus given the name partial least squares
(Wold, [32],[33],[34]).
At each step i (i=1,…,K), the vector wi is estimated in such a way that the PLS component, ti, has
maximal sample covariance with the response variable y subject to being uncorrelated with all
previously constructed components. The first PLS component t1 is obtained based on the
covariance between X and y. Each subsequent component ti (i=2,…,K), is computed using the
residuals of X and y from the previous step, which account for the variations left by the previous
components. As a result, the PLS components are uncorrelated and ordered (Garthwaite, [35];
Helland, [36], [37]).
The maximum number of components, K, is less than or equal to the smaller dimension of X, i.e.
K ≤ min(n,p). The first few PLS components account for most of the covariation between the
original predictors and the response variable and thus are usually retained as the new predictors.
The computation of PLS is simple and a number of algorithms are available (Martens and Naes,
[38]). In this study, we used a standard PLS algorithm (Denham, [39]).
Like PCA, PLS reduces the complexity of the analysis by constructing a small number of components, which can be used to replace the large number of original feature measurements. Moreover, because they are obtained by maximizing the covariance between the components and the response variable, the PLS components are generally more predictive of the response variable than the principal components.
The number of components K to be used in the class prediction model is considered to be a meta-parameter and must be estimated in the application, as discussed later. PLS is computationally very efficient, with a cost of only O(np); i.e. the number of calculations required by PLS is a linear function of n and p. Thus it is much faster than the other two methods (PCA and SIR).
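For illustration, a minimal NIPALS-style sketch of the bilinear fitting in equations (5) and (6) for a single response vector is given below; for the multi-class script labels used here, the response would need a numeric or indicator coding, which the paper does not detail.

    import numpy as np

    def pls_components(X, y, K):
        # Centre the data; E and f hold the running residuals of X and y.
        E = X - X.mean(axis=0)
        f = y - y.mean()
        T = np.zeros((X.shape[0], K))
        for i in range(K):
            w = E.T @ f                    # direction maximising Cov(Ew, f)
            w /= np.linalg.norm(w)         # enforce w'w = 1, as in equation (4)
            t = E @ w                      # i-th PLS component
            p_i = E.T @ t / (t @ t)        # X loadings for equation (5)
            q_i = (f @ t) / (t @ t)        # y loading (scalar) for equation (6)
            E -= np.outer(t, p_i)          # deflate X by the fitted part
            f -= q_i * t                   # deflate y
            T[:, i] = t
        return T                           # components [t1, ..., tK]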
5.3 Sliced Inverse Regression
SIR, one of the sufficient dimension reduction methods (Li, [40]; Duan and Li, [41]; Cook, [42]), is a supervised approach, which utilizes response information in achieving dimension reduction.
The idea of SIR is simple. Conventional regression models deal with the forward regression
function, E(y|X), which is a p-dimensional problem and difficult to estimate when p is large. SIR
is based on the inverse regression function,
η(y) = E(X | y) (7)
which consists of p one-dimensional regressions and is easier to deal with. The SIR directions v
can be obtained as the solution of the following optimization problem,
vK = arg max v'v=1 [v' Cov(E(X | y)) v] / [v' SX v] (8)
subject to the constraint vi'SX vj = 0, for all 1 ≤ i < j. Algebraically, the SIR components ti = Xvi (i = 1,…,K) are linear combinations of the p original predictor variables defined by the weighting vectors vi. Geometrically, SIR projects the data from the high p-dimensional space to a much lower K-dimensional space spanned by the projection vectors v. The projection vectors v are derived in such a way that the first few represent the directions of maximum variability between the response variable and the SIR components. Computation of vi is straightforward. Let Sη = Cov(E(X | y)) be the covariance matrix of the inverse regression function defined in (7), and recall that SX is the variance-covariance matrix of X. The vectors vi (i = 1,…,K) can be obtained by spectral decomposition of Sη with respect to SX,
Sη vi = λi SX vi (9)
where λi is the i-th eigenvalue in descending order for i = 1,…,K, vi is the corresponding eigenvector, and vi'SX vi = 1.
SIR is implemented by an appropriate discretization of the response. Let T(y) be a discretization of the range of y. SIR computes Cov(E(X | T(y))), the covariance matrix of the slice means of X, which can be thought of as the between-slice covariance for the subpopulations of X defined by T(y). Usually, if the response is continuous, one divides its range into H slices; if the response is categorical, one simply considers its categories. In class prediction problems, the number of classes G is a natural choice for H, i.e. H = G. The maximum number of SIR components is H minus one, i.e. K ≤ min(H−1, n, p). As discussed before, K is considered to be a meta-parameter and may be estimated by cross-validation. The cost of computing SIR directions using the standard algorithm is O(np² + p³), which is quite expensive compared to the cost of PLS. We used a standard SIR algorithm (Härdle et al., [43]) in this study.
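A minimal sketch of SIR for a categorical response, solving equation (9) as a generalised eigenproblem, could look as follows; it assumes SX is nonsingular (in practice a small ridge term may be added when p is large relative to n).

    import numpy as np
    from scipy.linalg import eigh

    def sir_components(X, y, K):
        # Centre the predictors and form the p x p covariance matrix SX.
        Xc = X - X.mean(axis=0)
        SX = np.cov(Xc, rowvar=False)
        # For a categorical response the slices are simply the classes;
        # Cov(E(X | y)) is the weighted covariance of the slice means.
        classes, counts = np.unique(y, return_counts=True)
        means = np.stack([Xc[y == g].mean(axis=0) for g in classes])
        S_eta = (means.T * (counts / len(y))) @ means
        # Solve the generalised eigenproblem S_eta v = lambda SX v of equation (9).
        eigvals, eigvecs = eigh(S_eta, SX)
        order = np.argsort(eigvals)[::-1]
        V = eigvecs[:, order[:K]]          # at most H - 1 useful directions
        return Xc @ V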
6. EXPERIMENTS AND RESULTS
6.1 Data Collection
At present, in India, standard databases of Indian scripts are unavailable. Hence, data for training and testing the classification scheme were collected from different sources, including regional newspapers available online [24] and scanned document images in a digital library [25].
6.1.1 Indian Language Scripts
India has 18 official languages, which include Assamese, Bangla, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Rajasthani, Sanskrit, Tamil, Telugu and Urdu. Not all Indian languages have unique scripts; some of them share the same script. For example, Hindi, Marathi, Rajasthani, Sanskrit and Nepali are written using the Devanagari script; Assamese and Bangla are written using the Bangla script; Urdu and Kashmiri share the same script; and Telugu and Kannada use the same script. In all, ten different scripts are used to write these 18 languages: Bangla, Devanagari, Roman (English), Gurumukhi, Gujarati, Malayalam, Oriya, Tamil, Kannada and Urdu. Image blocks of these scripts are shown in Fig. 1. The dataset consists of 10 classes of scripts, with 100 images of each.
6.2 Nearest Neighbour (NN)
One of the simplest classifiers, and the one we used, is the Nearest Neighbour classifier [28][29]. The term nearest is taken to mean the smallest Euclidean distance in the n-dimensional feature space. The classifier takes a test sample in feature vector form and finds the Euclidean distance between it and the vector representation of each training example. The training sample closest to the test sample is termed its nearest neighbour. Since this training sample is in some sense the one most similar to the test sample, it makes sense to allocate its class label to the test sample. This exploits the ‘smoothness’ assumption that samples near each other are likely to have the same class.
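A minimal sketch of this 1-NN rule in NumPy:

    import numpy as np

    def nearest_neighbour_predict(train_X, train_y, test_X):
        predictions = []
        for x in test_X:
            # Euclidean distance from the test sample to every training sample.
            distances = np.linalg.norm(train_X - x, axis=1)
            # Assign the class label of the closest training sample.
            predictions.append(train_y[np.argmin(distances)])
        return np.array(predictions)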
6.3 Results
We have performed experiments with different types of images, such as normal, bold, thin, small and big fonts. The experimentation has been conducted with training samples varying from 30 to 70 percent of the database, and we report the accuracies obtained in all cases. The results obtained for the SIFT, GLCM and combined (SIFT + GLCM) features are shown in Figure 1, Figure 2 and Figure 3 respectively. From the figures we can see that the combination of GLCM and SIFT gives a good classification accuracy of 93%.
7. CONCLUSION
In this paper, we have proposed a Nearest Neighbour based script classification method using GLCM and SIFT features. Specifically, we compared three dimension reduction methods (PLS, SIR, PCA) and examined the relative performance of classification procedures incorporating those methods. We found that PLS and SIR were both effective in dimension reduction and were more effective than PCA. The PLS and SIR based classification procedures performed consistently better than the PCA based procedure in prediction accuracy. These empirical results are consistent with the analysis of the techniques: PLS and SIR construct new predictors using information on the response variable while PCA does not, so PLS and SIR components are more likely to be good predictors than those from PCA. Considering predictive accuracy, we conclude that the SIR based procedure provides the best performance among the three classification procedures.
REFERENCES
[1]. U. Pal, S. Sinha and B. B.Chaudhri, (2003) “Multi-Script Line Identification from Indian
Documents”, Proceedings of International Conference on Document Analysis and Recognition, pp.
880-884.
[2]. Pal U., Chaudhuri B.B., (1999), “Script line separation from Indian multi-script document”, Proc.
5th Int. Conf. on Document Analysis and Recognition (IEEE Comput. Soc. Press), 406–409.
[3]. Pal U. and Chaudhuri B.B., (2003), “Script line identification from Multi script documents”, IETE
journal Vol. 49, No 1, 3-11.
[4]. Basavaraj Patil S. and Subbareddy N.V., (2002), “Neural network based system for script
identification in Indian documents”, Sadhana Vol. 27, Part 1, 83–97.
[5]. Dhandra B.V., Nagabhushan P., Mallikarjun Hangarge, Ravindra Hegadi, Malemath V.S., (2006),
“Script Identification Based on Morphological Reconstruction in Document Images”, The 18th
International Conference on Pattern Recognition (ICPR'06), Vol.No. 11-3, 950-953.
[6]. Pal U., Chaudhuri B.B., (1999), “Script line separation from Indian multi-script document”, Proc. 5th Int. Conf. on Document Analysis and Recognition (IEEE Comput. Soc. Press), 406–409.
[7]. Lijun Zhou, Yue Lu and Chew Lim Tan, (2006),” Bangla/English Script Identification based on
Analysis of Connected component Profiles”, Proc. 7th IAPR workshop on Document Analysis
System, New land, 234-254.
[8]. Vipin Gupta, G.N. Rathna, K.R. Ramakrishnan, (2006), “ A Novel Approach to Automatic
Identification of Kannada, English and Hindi Words from a Trilingual Document”, Int. conf. on
Signal and Image Processing, Hubli, pp. 561-566.
[9]. Santanu Chaudhury, Gaurav Harit, Shekar Madnani, Shet R.B., (2000),” Identification of scripts of
Indian languages by Combining trainable classifiers”, Proc. of ICVGIP, India.
[10]. Gopal Datt Joshi, Saurabh Garg, and Jayanthi Sivaswamy, (2006), “Script Identification from Indian
Documents”, H. Bunke and A.L. Spitz (Eds.): DAS 2006, LNCS 3872, 255–267.
[11]. Dhanya D., Ramakrishnan A.G. and Pati P.B., (2002), “Script identification in printed bilingual
documents, Sadhana”, vol. 27, 73-82.
[12]. Hiremath P S and S Shivashankar, (2008), “Wavelet Based Co-occurrence Histogram Features for
Texture Classification with an Application to Script Identification in a Document Image”, Pattern
Recognition Letters 29, pp 1182-1189.
[13]. Srinivas Rao Kunte R. and Sudhakar Samuel R.D., (2002), A Neural Approach in On-line Script
Recognition for Telugu Language Employing Wavelet Features, National Workshop on Computer
Vision, Graphics and Image Processing (WVGIP), 188-191.
[14]. Peeta Basa Pati, S. Sabari Raju, Nishikanta Pati and A. G. Ramakrishnan, (2004 ), “Gabor filters for
Document analysis in Indian Bilingual Documents”, 0-7803-8243-9/04/ IEEE, IClSlP, pp. 123- 126.
[15]. Spitz A. L., (1994), Script and language determination from document images, Proc. of the 3rd
Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, 229-235.
[16]. Wood S. L.; Yao X.; Krishnamurthy K. and Dang L., (1995): Language identification for printed text
independent of segmentation, Proc. Int. Conf. on Image Processing, 428–431, IEEE 0- 8186-7310-
9/95.
[17]. Hochberg J., Kerns L., Kelly P. and Thomas T., (1997), Automatic script identification from images
using cluster based templates, IEEE Trans. Pattern Anal. Machine Intell. Vol. 19, No. 2, 176–181.
[18]. Ding J., Lam L. and Suen C. Y., (1997), Classification of oriental and European Scripts by using
Characteristic features, Proc. 4th ICDAR , 1023-1027.
[19]. Tan T. N., (1998): Rotation invariant texture features and their use in automatic script identification,
IEEE Trans. Pattern Anal. Machine Intell. PAMI, Vol.20, No. 7, 751–756.
[20]. Andrew Busch; Wageeh W. Boles and Sridha Sridharan, (2005), Texture for Script Identification,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 11, pp. 1720-1732.
[21]. B. Kumar, A. Bera and T. Patnaik, (2012), “Line Based Robust Script Identification for Indian
Languages”, International Journal of Information and Electronics Engineering, vol. 2, no. 2 ,pp. 189-
192.
[22]. R. Rani, R. Dhir and G. S. Lehal, (2013), “Modified Gabor Feature Extraction Method for Word
Level Script Identification- Experimentation with Gurumukhi and English Scripts”, International
Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 6, no. 5, pp. 25-38.
[23]. A. K. Jain and Y. Zhong., (1996), Page segmentation using texture analysis. Pattern Recognition 29,
743–770.
[24]. http://www.samachar.com/.
[25]. Digital Library of India. http://dli.iiit.ac.in/
[26]. R. M. Haralick, K. Shanmugam, and I. Dinstein, (1973), “Textural Features for Image Classification”, IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-3, no. 6.
[27]. Lowe, D. G., (2004), “Distinctive Image Features from Scale-Invariant Keypoints”, International
Journal of Computer Vision, 60, 2, pp. 91-110.
[28]. Hall P., Park B.U., Samworth R.J., (2008), “Choice of neighbor order in nearest-neighbor classification”. Annals of Statistics, 36(5), 2135–2152.
[29]. Bremner D., Demaine E., Erickson J., Iacono J., Langerman S., Morin P., Toussaint G., (2005), “Output-sensitive algorithms for computing nearest-neighbor decision boundaries”. Discrete and Computational Geometry, 33(4), 593–604.
[30]. Jolliffe, I.T. (1986), “Principal Component Analysis”. Springer, New York.
[31]. Massey, W.F. (1965).,”Principal Components regression in exploratory statistical research”. Journal
of American Statistical Association, 60, 234-246.
[32]. Wold, H. (1966), “Nonlinear estimation by iterative least squares procedures”. In Research Papers in
Statistics, ed. F.N. David, pp. 411-444. Wiley, New York.
[33]. Wold, H. (1973), “Nonlinear iterative partial least squares (NIPALS) modeling: some recent
developments”. In Multivariate Analysis III, ed. P. Krishnaiah, pp. 383-407, Academic Press, New
York.
[34]. Wold, H. (1982), “Soft modeling: the basic design and some extensions”. In Systems under Indirect Observation: Causality-Structure-Prediction, ed. K. G. Joreskog and H. Wold, Vol. II, Ch. 1, pp. 1-54, North-Holland, Amsterdam.
[35]. Garthwaite, P.H. (1994), “An interpretation of partial least squares”. Journal of American Statistical
Association, 89, 122-127.
[36]. Helland, I.S. (1988),” On the structure of partial least squares”. Communications in Statistics:
Simulation and Computation, 17, 581-607.
[37]. Helland, I.S. (1990), “Partial least squares regression and statistical models”. Scandinavian Journal
of Statistics, 17, 97-114.
[38]. Martens, H. and Naes, T. (1989), “Multivariate Calibration”. Wiley, New York.
[39]. Denham, M.C. (1995), “Implementing partial least squares. Statistics and Computing”, 5, 191-202.
[40]. Li, K.C. (1991), “Sliced inverse regression for dimension reduction”. Journal of American Statistical
Association, 86, 316-342.
[41]. Duan, N. and Li, K.C. (1991), “Slicing regression: a link-free regression method”. The Annals of Statistics, 19, 505-530.
[42]. Cook, R.D. (1998), “Regression Graphics”. John Wiley & Sons, New York.
[43]. Härdle, W., Klinke, S. and Turlach, B.A. (1995), “XploRe: an Interactive Statistical Computing
Environment”, Springer-Verlag, New York.
[44]. Bijalwan, Vishwanath, et al. (2014),"KNN based Machine Learning Approach for Text and
Document Mining." International Journal of Database Theory and Application 7.1, 61-70.
[45]. Kumari, Pinki, and Abhishek Vaish. (2015),"Brainwave based user identification system: A pilot
study in robotics environment." Robotics and Autonomous Systems 65, 15-23.
[46]. Kumari, Pinki, and Abhishek Vaish. (2014), "Brainwave's energy feature extraction using wavelet
transform." Electrical, Electronics and Computer Science (SCEECS), IEEE Students' Conference on.
IEEE.
[47]. Bijalwan V. et al., (2014), “Machine learning approach for text and document mining”, arXiv preprint arXiv:1406.1580.
[48]. Kumari, Pinki, and Abhishek Vaish. (2011),"Instant Face detection and attributes recognition."
International Journal of Advanced Computer Science and Applications (IJACSA-ISSN 2156 5570).