India is home to many languages, owing to its cultural and geographical diversity. The official and regional languages of India play an important role in communication among the people of the country. The Constitution of India allows each Indian state to choose its own official language for state-level official communication; as of May 2008, the Eighth Schedule lists 22 official languages. The amount of textual data available in electronic form in the various Indian regional languages is constantly increasing, so classifying text documents by language has become essential. The objective of this work is the representation and categorization of Indian-language text documents using text mining techniques. A South Indian language corpus covering Kannada, Tamil and Telugu has been created, and several text mining techniques, namely the naive Bayes classifier, the k-nearest-neighbor classifier and the decision tree, have been applied for text categorization. Little work has been done on text categorization in Indian languages, and the task is challenging because these languages are morphologically very rich. In this paper an attempt has been made to categorize Indian-language text using text mining algorithms.
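The naive Bayes classifier named above is standard; as a minimal, self-contained sketch (the toy strings and label names below are illustrative, not drawn from the corpus the abstract describes), a naive Bayes language categorizer over character bigrams might look like:

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams are a simple, script-agnostic feature for language ID."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesLanguageID:
    """Multinomial naive Bayes over character bigrams with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(char_ngrams(doc))
        self.vocab = {g for c in self.classes for g in self.counts[c]}
        return self

    def predict(self, doc):
        def log_post(c):
            # log P(c) + sum of log P(gram | c), with add-one smoothing
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / sum(self.prior.values()))
            for g in char_ngrams(doc):
                score += math.log((self.counts[c][g] + 1) / total)
            return score
        return max(self.classes, key=log_post)
```

In the actual work the training documents would be drawn from the Kannada, Tamil and Telugu corpora in their native scripts; character n-grams carry over unchanged to those scripts.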
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDU IR SYSTEM (cscpconf)
This is the era of Information Technology, and the most important thing today is getting the right information at the right time. More and more data repositories are being made available online, and information retrieval systems, or search engines, are used to access this electronic information. These systems depend on the available tools and techniques for efficient retrieval of content in response to users' queries. Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web in electronic form, but access to these repositories remains poor because efficient search engines and retrieval systems supporting these languages are scarce. We have developed a language-independent system, applied here to Urdu but usable for other languages as well, that facilitates efficient retrieval of this information. The system achieves a precision of 0.63 and a recall of 0.8.
A language independent approach to develop Urdu IR system (csandit)
This is the era of Information Technology, and the most important thing today is getting the right information at the right time. More and more data repositories are being made available online, and information retrieval systems, or search engines, are used to access this electronic information. These systems depend on the available tools and techniques for efficient retrieval of content in response to users' queries. Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web in electronic form, but access to these repositories remains poor because efficient search engines and retrieval systems supporting these languages are scarce. We have developed a language-independent system, applied here to Urdu but usable for other languages as well, that facilitates efficient retrieval of this information. The system achieves a precision of 0.63 and a recall of 0.8.
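Precision and recall figures like the 0.63 and 0.8 quoted above are computed from relevance judgments over retrieved results; a minimal sketch (the document IDs below are invented for illustration, not from the paper's test collection):

```python
def precision_recall(retrieved, relevant):
    """Precision = fraction of retrieved docs that are relevant;
    recall = fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three relevant ones gives precision 0.5 and recall 2/3.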
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
Online text documents grow rapidly in number with the growth of the World Wide Web, and several text mining applications have come into existence to manage such a huge amount of text. Applications such as search engines, text categorization, summarization and topic detection are based on feature extraction, and extracting keywords or features manually is an extremely time-consuming and difficult task, so an automated keyword- or feature-extraction process is needed. This paper proposes a new domain keyword extraction technique that adds a new weighting method on top of conventional TF-IDF. Term frequency-inverse document frequency is widely used to express a document's feature weights, but it cannot reflect the distribution of terms across documents, and therefore cannot reflect a term's significance or the differences between categories. The proposed method adds a new weight to the original TF-IDF to express the differences between domains, so that the extracted features represent the content of the text better and have better discriminating power.
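The conventional weight the paper builds on can be sketched as follows. The extra domain factor shown here is only a guess at the paper's idea (boosting terms whose occurrences concentrate in one domain), since the abstract does not give the exact formula:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Classic TF-IDF per document: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

def domain_weight(term, docs_by_domain):
    """Illustrative domain factor: fraction of the term's occurrences that
    fall in its single most frequent domain (1.0 = fully domain-specific)."""
    per_domain = {d: sum(doc.count(term) for doc in docs)
                  for d, docs in docs_by_domain.items()}
    total = sum(per_domain.values())
    return max(per_domain.values()) / total if total else 0.0
```

A combined score would multiply the TF-IDF weight by the domain factor, promoting keywords that discriminate between domains.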
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A NOVEL APPROACH OF CLASSIFICATION TECHNIQUES FOR CLIR (cscpconf)
Recent and continuing advances in online information systems are creating many opportunities, and also new problems, in information retrieval. Gathering information in different natural languages is a difficult task that often requires huge resources. Cross-language information retrieval (CLIR) is the retrieval of information for a query written in the user's native language. This paper deals with various classification techniques that can be used to solve the problems encountered in CLIR.
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SUPPORT VECTOR MACHINE (csandit)
Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems such as handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm that performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using a Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
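In the multi-label setting each review can carry several labels at once. The paper's structured SVM scores whole label subsets jointly; the sketch below is a much simpler stand-in (one-vs-rest perceptrons, one per label) that only illustrates the multi-label output shape, not the structured model itself:

```python
class OneVsRestPerceptron:
    """Multi-label classifier: one perceptron per label, so a document can
    receive any subset of labels. (A stand-in for a structured SVM, which
    would score label subsets jointly.)"""

    def __init__(self, labels, dim, epochs=20):
        self.labels = labels
        self.w = {lab: [0.0] * dim for lab in labels}
        self.epochs = epochs

    def fit(self, X, Y):
        for _ in range(self.epochs):
            for x, y in zip(X, Y):
                for lab in self.labels:
                    target = 1 if lab in y else -1
                    score = sum(wi * xi for wi, xi in zip(self.w[lab], x))
                    if target * score <= 0:  # mistake-driven update
                        self.w[lab] = [wi + target * xi
                                       for wi, xi in zip(self.w[lab], x)]
        return self

    def predict(self, x):
        """Return the set of labels whose score is positive."""
        return {lab for lab in self.labels
                if sum(wi * xi for wi, xi in zip(self.w[lab], x)) > 0}
```

The label names below ("battery", "screen") are invented aspect labels for illustration.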
An unsupervised approach to develop IR system: the case of Urdu (ijaia)
Web search engines are among the best gifts of Information and Communication Technologies to mankind. Without search engines it would be almost impossible to access efficiently the information available on the web today; they play a vital role in the accessibility and usability of internet-based information systems. As the number of internet users grows day by day, so does the amount of information available on the web. But access to this information is not uniform across language communities. Besides English and the European languages, which constitute about 60% of the information available on the web, a wide range of information exists on the internet in other languages, and in the past few years the amount of information available in Indian languages has also increased. Yet outside English and a few European languages there are hardly any tools and techniques for efficient retrieval of this information; for the Indian languages in particular, research is still at a preliminary stage and sufficient tools and techniques are not available.
Indian languages are resource-poor in terms of IR test data collections, so our main focus was on developing a data set for Urdu IR, along with training and testing data for a stemmer.
We have developed a language-independent system to facilitate efficient retrieval of information available in Urdu, which can be used for other languages as well; the system gives a precision of 0.63 and a recall of 0.8. As a first step we developed an unsupervised stemmer for Urdu [1], stemming being very important in information retrieval.
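The abstract does not describe the stemmer's algorithm. A common unsupervised approach, sketched here as an assumption rather than the paper's exact method, is to learn frequent word endings from a corpus and strip the longest learned suffix (English words are used below purely for illustration; the real system operates on Urdu):

```python
from collections import Counter

def learn_suffixes(words, max_len=3, min_count=2):
    """Count word endings across the corpus; endings that recur at least
    min_count times are taken to be grammatical suffixes."""
    suffixes = Counter()
    for w in words:
        for k in range(1, min(max_len, len(w) - 1) + 1):
            suffixes[w[-k:]] += 1
    return {s for s, c in suffixes.items() if c >= min_count}

def stem(word, suffixes):
    """Strip the longest learned suffix, keeping a stem of length >= 2."""
    for k in range(min(3, len(word) - 2), 0, -1):
        if word[-k:] in suffixes:
            return word[:-k]
    return word
```

Because the suffix inventory is induced purely from word-ending frequencies, the same code applies to any language with concatenative suffixing, which is what makes such a stemmer language-independent.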
Farthest Neighbor Approach for Finding Initial Centroids in K-Means (Waqas Tariq)
Text document clustering is gaining popularity in the knowledge discovery field as a way of effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data, and clustering is one of its most widely studied problems. Clustering is an unsupervised learning method that aims to find groups of objects in the data that are similar with respect to some predefined criterion. For partitioning-based clustering algorithms the initial centroids are traditionally chosen at random, and both the accuracy of the clusters and the efficiency of such algorithms depend on the centroids chosen. In this work we propose a variant method in which the initial centroids are chosen using farthest neighbors. In our experiments the k-means algorithm is applied with initial centroids chosen by farthest neighbors, and the results show that both the accuracy of the clusters and the efficiency of k-means improve compared with the traditional random choice of initial centroids.
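One natural reading of "initial centroids chosen by farthest neighbors" is farthest-point initialization, sketched below; the details (starting from the first point, Euclidean distance) are assumptions, as the abstract does not spell them out:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_neighbor_init(points, k):
    """Pick the first centroid arbitrarily, then repeatedly pick the point
    farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

def k_means(points, k, iters=50):
    """Standard k-means refinement starting from the farthest-neighbor seeds."""
    centroids = farthest_neighbor_init(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k),
                         key=lambda i: dist(p, centroids[i]))].append(p)
        # recompute each centroid as the mean of its cluster (keep old if empty)
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
                     for cl, c in zip(clusters, centroids)]
    return centroids, clusters
```

Because the seeds are spread out by construction, they tend to land in different true clusters, which is the intuition behind the reported accuracy improvement over random seeding.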
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUAL... (kevig)
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, and improving it improves the overall performance of those applications. In this paper deep learning techniques are used to perform NER on Hindi text, since Hindi NER has received far less attention than English NER; this scarcity of resources is a barrier for resource-poor languages. Researchers have attacked the problem with rule-based, machine-learning-based and hybrid approaches, and deep-learning-based algorithms are now being developed on a large scale as an innovative route to advanced NER models. We devise a novel architecture that combines a residual-network design with a Bidirectional Long Short-Term Memory (BiLSTM) network and fastText word-embedding layers, using pre-trained word embeddings to represent the words of an annotated corpus in which the NER tags are defined. Development of an NER system for Indian languages is a comparatively difficult task. We run experiments comparing NER results with ordinary embedding layers and with fastText embedding layers, and analyse the effect of different batch sizes when training the deep learning models. With this approach we report state-of-the-art F1 scores.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (ijsc)
The internet has caused humongous growth in the number of documents available online. Summaries can help readers find the right information, and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect its content and act as indices for it. In this work we present a method to produce extractive summaries of documents in the Kannada language, with the number of sentences given as a limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques to obtain features from documents, combining scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficient and IDF (inverse document frequency) methods together with TF (term frequency) to extract keywords, which are then used to rank sentences for the summary. In the current implementation, a document from a given category is selected from our database and a summary is generated with the number of sentences specified by the user.
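The GSS coefficient mentioned above has a standard definition, GSS(t, c) = P(t, c)·P(t̄, c̄) − P(t, c̄)·P(t̄, c), with probabilities estimated as document-level co-occurrence frequencies. A minimal sketch of it, plus sentence ranking by keyword score (how the paper combines scores and ranks sentences is only summarized in the abstract, so the combination here is illustrative):

```python
def gss(term, category, docs):
    """GSS coefficient: P(t,c)*P(~t,~c) - P(t,~c)*P(~t,c).
    `docs` is a list of (word_set, category) pairs."""
    n = len(docs)
    p_tc = sum(1 for w, c in docs if term in w and c == category) / n
    p_tn = sum(1 for w, c in docs if term in w and c != category) / n
    p_nc = sum(1 for w, c in docs if term not in w and c == category) / n
    p_nn = sum(1 for w, c in docs if term not in w and c != category) / n
    return p_tc * p_nn - p_tn * p_nc

def rank_sentences(sentences, keyword_scores, limit):
    """Score each sentence by the sum of its keywords' scores and keep the
    top `limit` sentences, preserving their original order."""
    scored = sorted(range(len(sentences)),
                    key=lambda i: -sum(keyword_scores.get(w, 0.0)
                                       for w in sentences[i].split()))
    return [sentences[i] for i in sorted(scored[:limit])]
```

The toy English sentences in the test are illustrative; the paper's implementation works on Kannada text.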
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUES (ijcseit)
As information access across languages increases, so does the importance of systems that support query-based searching over multilingual content. Gathering information in different natural languages is a difficult task that requires huge resources such as databases and digital libraries. Cross-language information retrieval (CLIR) enables searching multilingual document collections using the user's native language, and it can be supported by various data mining techniques. This paper deals with the data mining techniques that can be used to solve the problems encountered in CLIR.
"Analysis of Different Text Classification Algorithms: An Assessment" (ijtsrd)
Classification of information has become a significant research area. Text classification is the process of sorting documents into predefined categories based on their content: the automated assignment of natural-language texts to predefined classes. It is a basic requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, for example by answering questions, producing summaries or extracting data. In this paper we study various classification algorithms. Classification is the process of separating data into groups that can act either dependently or independently. Our main aim is to present an analysis of classification algorithms such as k-NN, Naïve Bayes, Decision Tree, Random Forest and Support Vector Machine (SVM) using RapidMiner, and to find which algorithm is most suitable for users. Adarsh Raushan | Prof. Ankur Taneja | Prof. Naveen Jain, "Analysis of Different Text Classification Algorithms: An Assessment", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-1, December 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29869.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/29869/analysis-of-different-text-classification-algorithms-an-assessment/adarsh-raushan
A comparative analysis of particle swarm optimization and k-means algorithm for clustering of text documents (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly, and with such a huge collection of data there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering partitions a collection into groups such that the documents within each group are similar to one another and dissimilar to the documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory because it does not exploit semantics. Here, texts are instead represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. K-means, PSO and hybrid PSO+K-means are applied to clustering text in the Nepali language, and the experimental evaluation uses intra-cluster similarity and inter-cluster similarity.
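PSO-based clustering can be sketched as follows: each particle encodes a full set of k candidate centroids and is pulled toward its personal best and the swarm's global best, with fitness measured by within-cluster scatter. The inertia and acceleration constants below are conventional defaults, not the paper's settings:

```python
import random

def sse(centroids, points):
    """Fitness: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((p[d] - c[d]) ** 2 for d in range(len(p)))
                   for c in centroids) for p in points)

def pso_centroids(points, k, n_particles=10, iters=40, seed=0):
    """PSO over candidate centroid sets."""
    rng = random.Random(seed)
    dim = len(points[0])
    pos = [[list(rng.choice(points)) for _ in range(k)]
           for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in p] for p in pos]           # personal bests
    gbest = min(pbest, key=lambda c: sse(c, points))   # global best
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (0.7 * vel[i][j][d]
                                    + 1.5 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + 1.5 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            if sse(pos[i], points) < sse(pbest[i], points):
                pbest[i] = [c[:] for c in pos[i]]
        gbest = min(pbest, key=lambda c: sse(c, points))
    return gbest
```

The hybrid PSO+K-means variant would hand the returned centroids to k-means as its initial seeds, letting PSO's global search and k-means' fast local refinement complement each other.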
Document Classification Using KNN with Fuzzy Bags of Word Representations (uthi)
Abstract — Text classification assigns documents to classes depending on their words, phrases and word combinations according to declared syntaxes. It has many applications, for example in artificial intelligence and in maintaining data organized by category. Certain keywords, called topics, are selected to classify a given document; using these topics the main idea of the document can be identified, so selecting them well is an important step. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet. The TF-IDF algorithm selects the important words by which a document can be classified, and WordNet is used to find the similarity between these candidate words; the words with the highest similarity are taken as topics (keywords). In our experiments we used the TF-IDF model to find similar words for classifying documents. A decision tree gives better accuracy for text classification than other algorithms, but here we use a fuzzy system to classify natural-language text by topic. A fuzzy classifier is necessary for this task because a given text can cover several topics to different degrees; traditional classifiers are inappropriate, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We applied it to classify news articles, and the results obtained are promising. The dimensionality of the document vector is very important in text classification, and it can be reduced using clustering based on fuzzy logic. Depending on similarity we can classify documents and group them into clusters according to their topics.
Once the clusters are formed, documents can be accessed and stored easily; we find the similarity between words and summarize them as topics, which are then used to classify the documents.
Abstract — Generally, computer systems are operated in English only, so a person who does not know English, or the structure of a query language, cannot use them. This paper proposes a new approach for accessing a database easily without knowing English: the database is accessed with the help of natural languages such as Hindi and Marathi. Natural language processing (NLP) is the study of mathematical and computational modeling of various aspects of language and the development of a wide range of systems. NLP holds great promise for making computer interfaces easier to use, since people can talk to the computer in their own language rather than learn a specialized language of computer commands. Keywords: NLP, mathematical modeling, computational modeling
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining in the contemporary literature. The need for efficient clustering is observed across many sectors, including consumer segmentation, categorization, shared filtering, document management, and indexing. Before clustering can be adapted to the text environment, the task itself must be studied. Conventional approaches typically emphasize quantitative information, where the selected features are numbers; efforts have also been made toward efficient clustering of categorical information, where the selected features take nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. It also details prominent clustering models along with the pros and cons of each, and surveys the latest developments in clustering for social networks and associated environments.
Candidate Link Generation Using Semantic Pheromone Swarm (kevig)
Requirements tracing of natural-language artifacts consists of document parsing, candidate link generation, evaluation and analysis. Candidate link generation checks whether a high-level artifact has been fulfilled by a low-level artifact. Requirements traceability is an important activity for ensuring software quality in the early stages of the Software Development Life Cycle (SDLC). The existing system does not consider the semantic relatedness between terms, so its candidate link generation is not effective. The proposed system implements a hybrid technique combining semantic ranking with a pheromone swarm. Simple swarm agents are free to operate on their own, choosing their search path randomly based on the environment, whereas a pheromone swarm agent's choice of which term to select or which path to take is influenced by pheromone markings on the inspected object. A semantic graph is constructed using the semantic relatedness between pairs of terms, computed from the highest-valued path connecting each pair. Performance is evaluated for the simple, pheromone and semantic pheromone swarm techniques, and the semantic pheromone swarm provides better results than the other two.
A comparative study on term weighting methods for automated Telugu text categorization (IJDKP)
Automatic text categorization refers to the process of assigning one or more categories automatically from a predefined set. Text categorization is challenging in Indian languages because they are rich in morphology, with a large number of word forms and large feature spaces. This paper investigates the performance of different classification approaches under different term weighting schemes in order to decide which is most applicable to the Telugu text classification problem. We investigate different term weighting methods on a Telugu corpus in combination with Naive Bayes (NB), Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) classifiers.
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
Abstract
There are many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns is open for future research , especially in the field of text mining. As most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy(words have multiple meanings) and synonymy(multiple words have same meaning). People have held hypothesis that pattern-based approaches should perform better than the term-based, but many experiments does not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
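The pattern deploying and pattern matching steps can be sketched minimally: deploying collapses discovered patterns (term sets with supports) into term-level weights, and matching scores a document against those weights. The patterns and supports below are invented for illustration, not taken from the paper:

```python
from collections import Counter

# Discovered patterns (term sets) with supports, as in a pattern taxonomy
# model; the values here are hypothetical.
patterns = [({"data", "mining"}, 3), ({"text", "mining"}, 2)]

def deploy(patterns):
    # Pattern deploying: spread each pattern's support evenly over its terms.
    weights = Counter()
    for terms, support in patterns:
        for t in terms:
            weights[t] += support / len(terms)
    return weights

def score(doc_tokens, weights):
    # Pattern matching: score a document by the deployed term weights it hits.
    return sum(weights[t] for t in doc_tokens if t in weights)
```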
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...IJERA Editor
Although publicly accessible databases of speech documents exist, the time and effort required to keep them up to date is often burdensome. To help identify the speaker of a speech when its text is available, text-mining tools from the machine learning discipline can be applied to this process. Here, we describe and evaluate document classification algorithms, i.e. a combination of text mining and classification. This task asked participants to design classifiers for identifying documents containing speech-related information in the mainstream literature, and evaluated them against one another. The proposed system uses a k-nearest neighbour classification approach and compares its performance for different values of k.
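The sensitivity to k that the abstract evaluates is easy to demonstrate: with a pure-Python k-nearest-neighbour vote on toy feature vectors (made-up data, not the paper's corpus), the same query can receive different labels for different k:

```python
from collections import Counter

# Toy labelled points standing in for document feature vectors (hypothetical).
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"), ((0.2, 0.1), "A"),
         ((1.0, 1.0), "B"), ((0.4, 0.45), "B")]

def knn(query, k):
    # Sort training points by squared Euclidean distance to the query,
    # then take a majority vote among the k nearest.
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]
```

For the query (0.45, 0.5), k=1 answers "B" (the single closest point) while k=3 answers "A" (two of the three nearest are A), which is the kind of behaviour the study measures.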
Facial Feature Recognition Using Biometricsijbuiiir1
Face recognition is one of the few biometric methods that possess the merits of both high accuracy and low intrusiveness. Biometrics require no physical interaction on behalf of the user and allow passive identification in one-to-many environments. Passwords and PINs are hard to remember and can be stolen or guessed; cards, tokens, keys and the like can be misplaced, forgotten, purloined or duplicated; magnetic cards can become corrupted and unreadable. However, an individual's biological traits cannot be misplaced, forgotten, stolen or forged.
Partial Image Retrieval Systems in Luminance and Color Invariants : An Empiri...ijbuiiir1
Surface color is one of the most important characteristics in camera-based recognition and classification of objects. However, the color of an object sometimes differs considerably due to differences in illumination and surface conditions, and such variations impede the use of diverse color features. Characterization of object color is nevertheless possible with a tool known as color invariants, which disregards factors such as illumination and surface conditions. In this research proposal, the estimation procedure for color invariants of RGB images has been analysed. Object color is an important descriptor for finding matching objects in applications based on image matching and search, such as template-based object searching and Content-Based Image Retrieval (CBIR). But it has often been observed that the apparent color of different objects varies significantly due to illumination, surface conditions and observation (Finlayson et al., 1996).
Similar to Indian Language Text Representation and Categorization Using Supervised Learning Algorithm
Farthest Neighbor Approach for Finding Initial Centroids in K- MeansWaqas Tariq
Text document clustering is gaining popularity in the knowledge discovery field for effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data, and clustering is a widely studied data mining problem in the text domain. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. In this work we propose a variant method for finding initial centroids: instead of the traditional random choice used by partitioning-based clustering algorithms, the initial centroids are chosen by using farthest neighbors. The accuracy of the clusters and the efficiency of partition-based clustering algorithms depend on the initial centroids chosen. In the experiment, the k-means algorithm is applied and its initial centroids are chosen by using farthest neighbors. Our experimental results show that the accuracy of the clusters and the efficiency of the k-means algorithm are improved compared to the traditional way of choosing initial centroids.
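The farthest-neighbor seeding idea can be sketched in a few lines: start from one point, then repeatedly add the point farthest from every centroid chosen so far (a greedy "farthest-first" traversal; the toy 2-D points are invented for illustration):

```python
def farthest_first_centroids(points, k):
    # Seed with the first point, then repeatedly pick the point whose
    # distance to its nearest chosen centroid is largest.
    centroids = [points[0]]
    while len(centroids) < k:
        def dist_to_nearest(p):
            return min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        centroids.append(max(points, key=dist_to_nearest))
    return centroids

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
seeds = farthest_first_centroids(pts, 3)
```

On these points the seeds land in three well-separated regions, which is the property that helps k-means converge to better clusters than random seeding.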
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of the NLP application. In this paper deep learning techniques are used to perform NER on Hindi text, since Hindi NER has not been studied as thoroughly as English NER. This is a barrier for resource-scarce languages, as many resources are not readily available. Researchers have used various techniques such as rule-based, machine-learning-based and hybrid approaches to solve this problem. Deep-learning-based algorithms are now being developed on a large scale as an innovative approach to advanced NER models. In this paper we devise a novel architecture based on residual networks for a Bidirectional Long Short Term Memory (BiLSTM) model with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We have carried out various experiments to compare the results of NER with normal embedding and fastText embedding layers, and to analyse the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with the said approach in terms of F1 score.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents ijsc
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a limit on the number of sentences. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents: we combine scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficient and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting keywords, and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
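The GSS coefficient used above measures how strongly a term indicates a category: GSS(t, c) = P(t, c)·P(t̄, c̄) − P(t, c̄)·P(t̄, c). A minimal sketch of scoring candidate keywords by combining TF, IDF and GSS follows; the tiny labelled corpus is a made-up stand-in for categorized Kannada documents, and the additive combination is one hypothetical choice, not necessarily the paper's exact formula:

```python
import math
from collections import Counter

# Tiny labelled corpus (hypothetical tokens).
docs = [("sports", ["goal", "team", "win"]),
        ("sports", ["team", "match", "goal"]),
        ("politics", ["vote", "team", "leader"])]

def gss(term, cat):
    # GSS coefficient: P(t,c)*P(~t,~c) - P(t,~c)*P(~t,c), estimated over docs.
    n = len(docs)
    tc    = sum(1 for c, d in docs if c == cat and term in d) / n
    t_nc  = sum(1 for c, d in docs if c != cat and term in d) / n
    nt_c  = sum(1 for c, d in docs if c == cat and term not in d) / n
    nt_nc = sum(1 for c, d in docs if c != cat and term not in d) / n
    return tc * nt_nc - t_nc * nt_c

def idf(term):
    df = sum(1 for _, d in docs if term in d)
    return math.log(len(docs) / df)

def keyword_score(term, cat, doc):
    # Hypothetical combination: TF weighted by (GSS + IDF).
    tf_val = Counter(doc)[term] / len(doc)
    return tf_val * (gss(term, cat) + idf(term))
```

"goal" occurs only in sports documents, so it scores higher for that category than "team", which occurs everywhere and carries no discriminative weight.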
MULTILINGUAL INFORMATION RETRIEVAL BASED ON KNOWLEDGE CREATION TECHNIQUESijcseit
As information access across languages increases, so does the importance of a system that supports query-based searching in a multilingual setting. Gathering information in different natural languages is a difficult task, which requires huge resources such as databases and digital libraries. Cross-language information retrieval (CLIR) enables searching in multilingual document collections using the user's native language, and can be supported by different data mining techniques. This paper deals with various data mining techniques that can be used to solve the problems encountered in CLIR.
"Analysis of Different Text Classification Algorithms: An Assessment "ijtsrd
Classification of information has become a significant research area. Text classification is the process of sorting documents into predefined categories based on their content: the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, for example by answering questions, producing summaries or extracting data. In this paper we study the different classification algorithms. Classification is the process of separating data into groups that can act either dependently or independently. Our main aim is to present an analysis of the various classification algorithms such as k-NN, Naïve Bayes, Decision Tree, Random Forest and Support Vector Machine (SVM) with RapidMiner, and to find out which algorithm will be most suitable for users. Adarsh Raushan | Prof. Ankur Taneja | Prof. Naveen Jain "Analysis of Different Text Classification Algorithms: An Assessment" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-1, December 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29869.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/29869/analysis-of-different-text-classification-algorithms-an-assessment/adarsh-raushan
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) the documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering of text documents using WordNet. The common way of representing a text document is as a bag of terms. The bag-of-terms representation is often unsatisfactory as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word; the bag-of-terms representation is thus enriched with synonyms from WordNet. K-means, Particle Swarm Optimization (PSO) and hybrid PSO+K-means algorithms are applied for clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
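The synset enrichment step can be sketched independently of any clustering algorithm: map each token to a canonical representative of its synonym set, so that synonymous words collapse into one feature before clustering. The synonym map below is a hand-made stand-in for WordNet (the paper works on Nepali text, for which this table is purely hypothetical):

```python
# Hypothetical synonym map standing in for WordNet synsets.
synonyms = {"film": {"movie", "cinema"}, "movie": {"film", "cinema"}}

def enrich(tokens):
    # Replace each token by a deterministic representative of its synonym
    # group, so synonymous words become the same bag-of-terms feature.
    out = []
    for t in tokens:
        group = synonyms.get(t, set()) | {t}
        out.append(min(group))  # alphabetical minimum as the representative
    return out
```

After enrichment, documents mentioning "film" and documents mentioning "movie" share a feature, so a distance-based clusterer can place them together.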
Document Classification Using KNN with Fuzzy Bags of Word Representationsuthi
Abstract — Text classification is used to classify documents depending on words, phrases and word combinations according to declared syntaxes. Many applications use text classification, such as artificial intelligence systems that maintain data by category, among others. Some keywords, called topics, are selected to classify a given document; using these topics the main idea of the document can be identified, so selecting them is an important task in classifying a document by category. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet. The TF-IDF algorithm is mainly used to select the important words by which a document can be classified, while WordNet is used to find the similarity between these candidate words; the words with the maximum similarity are taken as topics (keywords). In this experiment we used the TF-IDF model to find similar words in order to classify the document. The decision tree algorithm gives better accuracy for text classification compared to other algorithms. A fuzzy system is used to classify text written in natural language according to topic. A fuzzy classifier is necessary for this task because a given text can cover several topics to different degrees; in this context, traditional classifiers are inappropriate, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We have applied it to classify news articles, and the results we obtained are promising. The dimensionality of a vector is very important in text classification; we can decrease this dimensionality by using clustering based on fuzzy logic. Depending on similarity we can classify the documents and form them into clusters according to their topics.
After the formation of clusters one can easily access and save the documents. In this way we can find the similarity and summarize the words, called topics, which can be used to classify the documents.
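The core contrast with winner-takes-all classifiers can be shown in a few lines: a fuzzy assignment gives a document a membership degree in every topic rather than a single label. The topic keyword sets below are invented for illustration, and the normalized-overlap membership is a deliberately simple stand-in for learned fuzzy rules:

```python
# Hypothetical topic keyword sets.
topics = {"sports": {"goal", "team", "match"},
          "politics": {"vote", "leader", "party"}}

def memberships(tokens):
    # Fuzzy assignment: overlap with each topic's keywords, normalized so
    # the degrees sum to 1 (instead of picking one winning class).
    toks = set(tokens)
    raw = {t: len(toks & kw) / len(kw) for t, kw in topics.items()}
    total = sum(raw.values()) or 1.0
    return {t: v / total for t, v in raw.items()}
```

A document mentioning both a match result and an election then belongs mostly, but not exclusively, to "sports".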
Abstract — Generally, computer systems are operated through the English language only, so a person who does not know English or the structure of a query language cannot use them. This paper proposes a new approach for accessing a database easily without knowing English: the database is accessed with the help of natural languages such as Hindi, Marathi etc. Natural language processing (NLP) is the study of mathematical and computational modeling of various aspects of language and the development of a wide range of systems. Natural language processing holds great promise for making computer interfaces easier for people to use, since people will be able to talk to the computer in their own language rather than learn a specialized language of computer commands. Keywords: NLP, Mathematical modeling, Computational modeling
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The clustering task must be studied before it is adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numeric; efforts have also been made toward efficient clustering of categorical information, where the selected features can take nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering, along with the pros and cons of each model. In addition, it focuses on various recent developments in the clustering task in social networks and associated environments.
Candidate Link Generation Using Semantic Phermone Swarmkevig
Requirements tracing of natural language artifacts consists of document parsing, candidate link generation, evaluation and analysis. Candidate link generation deals with checking whether a high-level artifact has been fulfilled by a low-level artifact. Requirements traceability is an important activity undertaken as part of ensuring the quality of software in the early stages of the Software Development Life Cycle (SDLC). The semantic relatedness between terms is not considered in the existing system; hence candidate link generation is not effective. In the proposed system, a hybrid technique combining both Semantic Ranking and Pheromone Swarm is implemented. Simple swarm agents are given freedom to operate on their own, determining the search path randomly based on the environment, whereas a pheromone swarm agent's decision on which term to select or which path to take is influenced by the presence of pheromone markings on the inspected object. A semantic graph is constructed using the semantic relatedness between pairs of terms, computed from the highest-value path connecting any pair of terms. The performance is evaluated with the Simple, Pheromone and Semantic Pheromone Swarm techniques; the Semantic Pheromone Swarm provides better results than the Simple and Pheromone Swarm techniques.
Applying Clustering Techniques for Efficient Text Mining in Twitter Dataijbuiiir1
Knowledge is the ultimate output of decisions on a dataset. The revolution of the Internet has shrunk global distances to a touch on hand-held electronic devices, and usage of social media sites has increased over the past decades. One of the most popular social media microblogs is Twitter, which has millions of users around the world. In this paper Twitter data is analysed through the text contained in hashtags. After preprocessing, clustering algorithms are applied to the text data, and the resulting clusters are compared through various parameters. Visualization techniques are used to portray the results, from which inferences such as time series and topic flow can easily be made. The observed results show that the hierarchical clustering algorithm performs better than the other algorithms.
A Study on the Cyber-Crime and Cyber Criminals: A Global Problemijbuiiir1
Today, cybercrime has caused a lot of damage to individuals, organizations and even governments. Cybercrime detection and classification methods have come up with varying levels of success in preventing attacks and protecting data. Several laws and methods have been introduced to prevent cybercrime, and penalties are laid down for the criminals. However, the study shows that many countries still face this problem today, with the United States of America leading in damage done by cybercrimes over the years. According to a recent survey, the year 2013 saw monetary damage of nearly 781.84 million U.S. dollars. This paper describes the common areas where cybercrime usually occurs and the different types of cybercrimes committed today. The paper also presents studies of e-mail related crimes, as e-mail is the most common medium through which cybercrimes occur. In addition, some case studies related to cybercrimes are laid out.
Vehicle to Vehicle Communication of Content Downloader in Mobileijbuiiir1
Content downloading is an internet-based service that is highly popular in wireless communication and can support roadside communication. We focus on a content downloading system for both infrastructure-to-vehicle and vehicle-to-vehicle communication. The goal is to improve system throughput by formulating a max-flow problem that includes channel contention and the data transfer paradigm. While transferring files or downloading applications in a roadside environment, there is the possibility of getting disconnected. This study aims to avoid such intermittent connections at the roadside while using system- or mobile-based internet connections for content or file downloading, using MILP (Mixed Integer Linear Programming) for the max-flow problem. A bounding-box technique is used to obtain a proper signal from the base station, and to avoid traffic and get quick responses from the server. The main goal of the mobility management service is to trace the location of subscribers, allowing calls, SMS and other mobile phone services to be delivered to them. First we analyse the data and select the correct location. This is challenging in vehicular networks, where the transmission speed of the nodes must remain efficient even in areas surrounded by buildings and other architectural infrastructure that affects the radio signal.
SPOC: A Secure and Privacy-Preserving Opportunistic Computing Framework for Mob...ijbuiiir1
Today we see an abundant increase in the development of information technology, which has led humans to carry a mini-computer with a touch screen in their palms (e.g. smartphones and tablets). In parallel with rich advances in wireless body sensor units, medical treatment can be made convenient and effective via smartphones over 2G and 3G networks, bringing treatment within easy, low-cost reach even of the common person. With these, healthcare authorities can treat patients (medical users) remotely, whether the patients are at home, at work, at school or elsewhere. This type of treatment is called m-Healthcare (mobile healthcare). However, the m-Healthcare service raises many security and data privacy problems to be overcome. Here we present SPOC, a secure and privacy-preserving opportunistic computing framework for mobile healthcare emergencies. Using smartphones and SPOC, resources such as computing power and energy can be gathered opportunistically to process the intensive Personal Health Information (PHI) of a medical user in a critical situation with minimal privacy disclosure. We also introduce an efficient user-centric privacy access control into the SPOC framework, based on attribute-based access control and a new privacy-preserving scalar product computation (PPSPC) technique, which lets a medical user (patient) control participation in opportunistic computing when transmitting his PHI data. Detailed security analysis shows that the proposed SPOC framework can efficiently achieve user-centric privacy access control in m-Healthcare emergencies.
In this paper we introduce privacy-preserving support for mobile healthcare using a message digest, for which we use the MD5 algorithm. This is efficient and minimizes the memory consumed: the large amount of PHI data of the medical user (patient) is reduced to a fixed size, which, compared to AES, increases the speed at which the data can be sent to the TA without delay, so that professionals at the healthcare center receive the most recent PHI data of the user and can save lives in time. The algorithm secures the transmission of the patient's PHI to the TA. Performance evaluations with extensive simulations show the effectiveness of the message digest (MD) in providing highly reliable Personal Health Information (PHI) processing and transmission while reducing privacy disclosure during a mobile healthcare emergency.
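The fixed-size property the abstract relies on is exactly what a message digest provides: MD5 maps input of any length to a 128-bit (32 hex character) digest. A minimal sketch with Python's standard hashlib (the function name `phi_digest` is ours; note also that MD5 is no longer considered collision-resistant, so modern designs would prefer SHA-256):

```python
import hashlib

def phi_digest(phi_bytes):
    # MD5 maps arbitrarily large input to a fixed 128-bit digest,
    # rendered here as 32 hexadecimal characters.
    return hashlib.md5(phi_bytes).hexdigest()
```

Whether the PHI payload is one byte or a megabyte, the digest sent onward has the same fixed size.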
A Survey on Implementation of Discrete Wavelet Transform for Image Denoisingijbuiiir1
Image denoising is a well-studied problem in the field of image processing. Images are often received in defective condition due to poor scanning and transmitting devices, which creates problems for subsequent processes that must read and understand such images. Removing noise from the original signal is still a challenging problem for researchers, because noise removal introduces artifacts and causes blurring of the images. Several algorithms have been published, and each approach has its assumptions, advantages and limitations. This paper deals with using features derived from the discrete wavelet transform, as used for digital image texture analysis, to denoise an image even in the presence of a very high ratio of noise. Image denoising is formulated as a regression problem between the noise and the signal; wavelets appear to be a suitable tool for this task because they allow analysis of images at various levels of resolution.
A Study on migrated Students and their Well - being using Amartya Sens Functi...ijbuiiir1
This paper deals with the multidimensional analysis of well-being from the theoretical point of view suggested by Dr. Amartya Sen. Sen's functioning multidimensional approach is broadly recognized as one of the most satisfying approaches to well-being. Sen's capability approach and functioning approach have found relatively many pragmatic applications, mainly for their strong informational and methodological requirements. An attempt has been made to realize a multidimensional assessment of Sen's concept of well-being with the use of fuzzy set theory. The methodology is applied to the evaluative space of functionings, with an experimental application to migrated students studying in Chennai, Tamil Nadu.
Methodologies on user Behavior Analysis and Future Request Prediction in Web ...ijbuiiir1
Web usage mining is a kind of web mining which provides knowledge about user navigation behavior and extracts interesting patterns from the web. Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected as a consequence of user interactions with web resources on one or more web sites. It identifies the needs and interests of users, which is useful for upgrading web resources: web site developers can update their sites according to the attention users give them. This paper discusses the different methodologies used in previous research for discovering user behavior and predicting future requests.
Innovative Analytic and Holistic Combined Face Recognition and Verification M...ijbuiiir1
Automatic recognition and verification of human faces is a significant problem in the development and application of Human Computer Interaction (HCI). In addition, the demand for reliable personal identification in computerized access control has resulted in an increased interest in biometrics to replace passwords and identification (ID) cards. Over the last couple of years, face recognition researchers have been developing new techniques fuelled by advances in computer vision, computer design, sensors and fast-emerging face recognition systems. In this paper, a face recognition and verification system has been designed which is robust to variations of illumination, pose and facial expression, but very sensitive to variations in the features of the face. This design takes into account the holistic or global as well as the analytic or geometric features of the human face. The global structure of the human face is analysed by Principal Component Analysis, while the local structure is computed from geometric features of the face such as the eyes, nose and mouth. The extracted local features of the face are trained and later tested using an Artificial Neural Network (ANN). This combined approach over the global and local structure of the face image has proved very effective in the system we have designed, as it has a correct recognition rate of over 90%.
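The PCA step for the global face structure can be sketched with NumPy: flatten each image to a row vector, centre the data, and take the top right singular vectors as "eigenfaces". The tiny random matrix below stands in for real face images, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four flattened "face images" of 6 pixels each (random stand-in data).
X = rng.normal(size=(4, 6))
mean = X.mean(axis=0)
Xc = X - mean

# Principal components via SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
components = Vt[:k]            # top-k "eigenfaces"
weights = Xc @ components.T    # low-dimensional feature vector per image
recon = weights @ components + mean  # reconstruction from k components
```

The per-image `weights` are the compact global descriptors one would pass to a matcher; keeping more components lowers the reconstruction error.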
Enhancing Effective Interoperability Between Mobile Apps Using LCIM Modelijbuiiir1
Levels of conceptual interoperability model is used to develop the method and model towards enhancing interoperability among mobile apps. The LCIM is used as descriptive and prescriptive form and it also make available of both metric of the degree of conceptual representation that exists between interoperating systems. In descriptive form LCIM is used to decrease the discrepancies in rating mobile apps based on content by suggesting a rating system that is completely based on interoperability. In the prescriptive form it receives information for app development, which allows producing apps with prominent level of interoperability. The Levels of Conceptual Interoperability has the abstract backbone for developing and implementing an interoperability framework that supports to exchange of XML based languages used by M&S systems across the web
Deployment of Intelligent Transport Systems Based on User Mobility to be Endo...ijbuiiir1
The emerging increase in vehicles and very high traffic, demands the need for improved Intelligent Transport Systems (ITS). The available ITSs do not meet all the requirements of the present day situation in providing safetravels and avoidance of congestionin spite of its limitations on road. Intelligent Transport Systemsrequiremore research and implementation of better solutions on the traffic network with increased mobility and more rapid acquisition of data by sense network technology. In this paper a review is made on the present ITS where research is required so that improvement in the course of implementing reality mining can enhance the behavior of ITS. This will breed a forward leap in the improvement of safety and convenience of personal and commercial travel and in turn guarantee an ultimate drop in fatality in the society
Stock Prediction Using Artificial Neural Networksijbuiiir1
Accurate prediction of stock price movements is highly challenging and significant topic for investors. Investors need to understand that stock price data is the most essential information which is highly volatile, non-linear, and non-parametric and are affected by many uncertainties and interrelated economic and political factors across the globe. Artificial Neural Networks (ANN) have been found to be an efficient tool in modeling stock prices and quite a large number of studies have been done on it. In this paper ANN modeling of stock prices of selected stocks under BSE is attempted to predict closing prices. The network developed consists of an input layer, one hidden layer and an output layer and the inputs being opening price, high, low, closing price and volume. Mean Absolute Percentage Error, Mean Absolute Deviation and Root Mean Square Error are used as indicators of performance of the networks. This paper is organized as follows. In the first section, the adaptability of ANN in stock prediction is discussed. In section two, we justify the using of ANNs in forecasting stock prices. Section three gives the literature review on the applications of ANNs in predicting the stock prices. Section four gives an overview of artificial neural networks. Section five presents the methodology adopted. Section six gives the simulation and performance analysis. Last section concludes with future direction of the study
Highly Secured Online Voting System (OVS) Over Networkijbuiiir1
Internet voting systems have gained popularity and have been used for government elections and referendums in the United Kingdom, Estonia and Switzerland as well as municipal elections in Canada and party primary elections in the United States. Voting system can involve transmission of ballots and votes via private computer networks or the Internet. Electronic voting technology can speed the counting of ballots and can provide improved accessibility for disabled voters. The aim of this paper is to people who have citizenship of India and whose age is above 18 years and of any sex can give their vote through online without going to any physical polling station. Election Commission Officer (Election Commission Officer who will verify whether registered user and candidates are authentic or not) to participate in online voting. This online voting system is highly secured, and its design is very simple, ease of use and also reliable. The proposed software is developed and tested to work on Ethernet and allows online voting. It also creates and manages voting and an election detail as all the users must login by user name and password and click on his favorable candidates to register vote. This will increase the voting percentage in India. By applying high security it will reduce false votes.
Software Developers Performance relationship with Cognitive Load Using Statis...ijbuiiir1
The success of the software development highly appreciated with the intellectual capital instead of physical assets of the concern. Human resource is the challenging resource in the software development industry to meet the customer requirement and deliver the project on time to the client. The software industry requires multi-skill and dynamic performers to meet the challenges. The skills domain knowledge and the developer�s performance are considered as the potential key factors for the success of the delivery of projects. The developer�s performance is influenced with cognitive factors and its measures. This study aimed to relate developer�s performance in the software industry with his/her cognitive workload. The various statistical measures like correlation, regression, variance and standard deviations are to be calculated for the developer�s performance with the cognitive load. The real-time development sector observations made of around 250 employees, 15 projects work and its corresponding cognitive load such as physical ability, mental ability, temporal ability, effort, frustration and performance in Web application, Database application and Multimedia. This paper provides the measurable analysis of the development process of the developers with their assigned task using statistical approach
Wireless Health Monitoring System Using ZigBeeijbuiiir1
Recent developments in off-the-shelf wireless embedded computing boards and the increasing need for efficient health monitoring systems, fueled by the increasing number of patients, has prompted R&D professionals to explore better health monitoring systems that are both mobile and cheap. This work investigates the feasibility of using the ZigBee embedded technology in health-related monitoring applications. Selected vital signs of patients are acquired using sensor nodes and readings are transmitted wirelessly using devices that utilize the ZigBee communications protocols. A prototype system has been developed and tested with encouraging results
Image Compression Using Discrete Cosine Transform & Discrete Wavelet Transformijbuiiir1
This research paper presents a proposed method for the compression of medical images using hybrid compression technique (DWT, DCT and Huffman coding). The objective of this hybrid scheme is to achieve higher compression rates by first applying DWT and DCT on individual components RGB. After applying this image is quantized to calculate probability index for each unique quantity so as to find out the unique binary code for each unique symbol for their encoding. Finally the Huffman compression is applied. Results show that the coding performance can be significantly improved by the hybrid DWT, DCT and Huffman coding algorithm
Agile development methodologies are very promising in the software industry. Agile development techniques are very realistic n understanding the fact that requirement in a business environment changes constantly. Agile development processes optimize the opportunity provided by cloud computing by doing software release iteratively and getting user feedback more frequently. The research work, a study on Agile Methods and cloud computing. This paper analyzes the Agile Management and development methods and its benefits with cloud computing. Combining agile development methodology with cloud computing brings the best of both worlds. A business strategy, the outcomes of which optimize profitability revenue and customer satisfaction by organizing around customer segments, fostering customer-satisfying behaviors, and implementing customer-centric processes
SOA is becoming important for Business Process Management and Enterprises. Now SOA is widely used by Enterprises as it provides seamless environment, flexibility, interoperability, but at the same time security should also consider because the basic SOA framework doesn�t possess any security. It depends upon the respective proprietor for security [1]. In recent times many research work had done for SOA security. Researchers have also proposed various frameworks and models such as FIX [2], SAVT [3] which tries a lot, but cannot achieve any landmark as they are based on XML schema.This proposed novel work contains an inbuilt security module which was based on PKI. At the same time this model will intact the flexibility and interoperability as the security module is embedded by analyzing the nature of WSDL, UDDI, SOAP and XML. These protocols are also compatible with PKI. Proposed Model was implemented in the asp.net environment then experimental results are compared with other security methods such as data mining based web security and automata based web security
Internet Scheduling of Bluetooth Scatter Nets By Inter Piconet Schedulingijbuiiir1
Bluetooth includes the concept of devices participating in multiple �piconets� interconnected via bridge devices and thereby forming a �scatternet�. This paper presents a scheme for Bluetooth scatternet operation that adapts to varying traffic patterns. According to the traffic information of all masters that the bridge is connected, the bridge switches to the masters with high traffic loads and increase the usage of the bridge. In addition, load adaptive interpiconet scheduling can reduce the number of failed unsniffs and the overhead of the bridge switch wastes to further increase the overall system performance.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Integrated Intelligent Research (IIR) International Journal of Web Technology
Volume: 02 Issue: 02, December 2013, Page No. 37-41
ISSN: 2278-2419
Indian Language Text Representation and Categorization Using Supervised Learning Algorithm

M. Narayana Swamy [1], M. Hanumanthappa [2]
[1] Department of Computer Applications, Presidency College, Bangalore, India
[2] Department of Computer Science & Applications, Bangalore University, Bangalore, India
Email: [1] narayan1973.mns@gmail.com, [2] hanu6572@hotmail.com
Abstract - India is home to many languages, owing to its cultural and geographical diversity. The official and regional languages of India play an important role in communication among the people living in the country. The Constitution of India makes a provision for each Indian state to choose its own official language for communication at the state level for official purposes. As of May 2008, the Eighth Schedule lists 22 official languages of India. The amount of textual data available in electronic form in the various Indian regional languages is constantly increasing, so classifying text documents by language has become essential. The objective of this work is the representation and categorization of Indian-language text documents using text mining techniques. A corpus of the South Indian languages Kannada, Tamil and Telugu has been created, and several text mining techniques, namely the naive Bayes classifier, the k-Nearest-Neighbor classifier and the decision tree, have been used for text categorization. Little work has been done on text categorization in Indian languages, and the task is challenging because Indian languages are morphologically very rich. In this paper an attempt has been made to categorize Indian-language text using text mining algorithms.
Keywords - Tokens, Lemmatization or Stemming, Stop words, Zipf's law, Vector Space Model, Bayes classifier, k-Nearest-Neighbor classifier, Decision tree, precision (p), recall (r), F-measure
I. INTRODUCTION
Data mining is the main area when dealing with structured data in databases. Text mining refers to the process of analyzing unstructured data in the form of text and detecting knowledge in it. The main problem in text mining is that text is written using grammatical rules to make it readable by humans, so before it can be analyzed it first needs to be preprocessed. There are two fundamental approaches to analyzing text. First, text mining can employ Natural Language Processing (NLP) to extract meaning from text using algorithms; this approach can be very successful, but it has limitations. Second, a different approach using statistical methods is becoming increasingly popular, and its techniques are improving steadily [1].
II. OBJECTIVE
India is the home of many languages, and each state in India has its own official language. The objective of this work is to classify documents by language using supervised learning algorithms. In future, these categorized documents can be used for summarization.
III. TEXT REPRESENTATION
The key objective of data preparation is to transform text into a numerical format such as the vector space model. To mine text, we first need to process it into a form that data-mining algorithms can use. The pre-processing steps are shown in Figure 1.
A. Collecting Documents
The source used for creating this corpus is the World Wide Web itself. The main problem with this approach to document collection is that the data may be of uncertain quality and require extensive cleansing before use.
B. Document Standardization
Once the documents are collected, it is common to find them in a variety of formats, depending on how the documents were generated. The documents should be processed, with minor modifications, to convert them to a standard format.
C. Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. The tokenization process is language-dependent.
Table-1
INPUT: namma dEsha BArata. nAvu BAratiyaru.
OUTPUT (Tokens): namma | dEsha | BArata | nAvu | BAratiyaru

Table-2
INPUT: ನಮ್ಮ ದೇಶ ಭಾರತ. ನಾವು ಭಾರತಿಯರು.
OUTPUT (Tokens): ನಮ್ಮ | ದೇಶ | ಭಾರತ | ನಾವು | ಭಾರತಿಯರು
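The chopping shown in Tables 1 and 2 can be sketched as a simple whitespace-and-punctuation tokenizer. This is an illustrative minimal version, not the tokenizer used in the paper; real Indic-script text may need a more careful, language-dependent rule set.

```python
import re

def tokenize(text):
    # Split on whitespace and common punctuation, dropping the punctuation
    # itself, as in the INPUT/OUTPUT rows of Tables 1 and 2.
    return [tok for tok in re.split(r"[\s.,!?]+", text) if tok]

print(tokenize("namma dEsha BArata. nAvu BAratiyaru."))
# ['namma', 'dEsha', 'BArata', 'nAvu', 'BAratiyaru']
```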
D. Dropping Common Terms: Stop Words
Some extremely common words are not informative; these are called stop words. The strategy used for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection) and then take the most frequent terms as stop words. These words are discarded during indexing.
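A minimal sketch of this frequency-based stop-list strategy; the cutoff k and the toy documents are illustrative assumptions, not values from the paper:

```python
from collections import Counter

def stop_list(documents, k):
    # Collection frequency: total number of times each term appears
    # across the whole document collection.
    freq = Counter(tok for doc in documents for tok in doc)
    # The k most frequent terms become the stop list.
    return [term for term, _ in freq.most_common(k)]

docs = [["namma", "dEsha", "namma"], ["namma", "nAvu"]]
print(stop_list(docs, 1))  # ['namma']
```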
E. Lemmatization
Once tokens are created, the next possible step is to convert
each of the tokens to a standard form, a process usually
referred to as stemming or lemmatization. The advantage of
stemming is to reduce the number of distinct types in a text
corpus and to increase the frequency of occurrence of some
individual types.
Figure-2
IV. PROPERTIES OF CORPUS
The properties of a large volume of text are generally referred to as corpus statistics. This data collection comprises 300 documents. The basic statistics of the corpus are shown in Table 3.
Table-3
Language    Kannada  Tamil  Telugu
Documents     100     100     100
Tokens      26315   20360   18427
Vocabulary  20417   15941   14652
The most fundamental property of languages is the one known as Zipf's law: for any language, if we plot the frequency of words against their rank for a sufficiently large collection of textual data, we see a clear trend resembling a power-law distribution. Our experiment on the Kannada, Tamil and Telugu corpus statistics illustrates Zipf's law. As seen in Table 5, the words at the low-rank extreme of the curve are clearly separated from the rest; these are the most frequently used words in our data collection.
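The rank-versus-frequency trend can be inspected with a short sketch; the toy token list is illustrative, whereas the paper's curve was computed over the full 300-document corpus:

```python
from collections import Counter

def rank_frequency(tokens):
    # Sort terms by descending frequency; ranks start at 1.
    counts = Counter(tokens).most_common()
    return [(rank, term, freq) for rank, (term, freq) in enumerate(counts, start=1)]

# Under Zipf's law, frequency falls roughly as 1/rank, so the product
# rank * frequency stays roughly constant for a sufficiently large collection.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for rank, term, freq in rank_frequency(tokens):
    print(rank, term, freq)
```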
V. VECTOR SPACE MODEL
Vector space model or term vector model is an algebraic model
for representing text documents as vectors [2]. The principle
behind the VSM is that a vector, with elements representing
individual terms, may encode a document’s meaning according
to the relative weights of these term elements. Then one may
encode a corpus of documents as a term-by-document matrix X
of column vectors such that the rows represent terms and the
columns represent documents. Each element xij tabulates the
number of times term i occurs in document j. This matrix is
sparse due to the Zipfian distribution of terms in a language
[3]. VSM representation scheme performs well in many text
classification tasks [4].
Table-4: Document-Term Matrix
     T1  T2  T3  T4  T5  T6
D1    4   3   1   3   0   2
D2    0   1   0   2   1   0
D3    2   0   2   0   1   3
D4    0   1   0   0   2   0
Figure-3
Weights are assigned to each term in a document depending on the number of occurrences of the term in that document. By assigning a weight to each term, a document may be viewed as a vector of weights.
A. Term Frequency
The simplest approach is to assign the weight to be equal to the
number of occurrences of term t in document d. This weighting
scheme is referred to as term frequency and is denoted tft,d,
with the subscripts denoting the term and the document in
order.
B. Document Frequency
Instead, it is more commonplace to use for this purpose the
document frequency dft, defined to be the number of
documents in the collection that contain a term t.
C. Inverse Document Frequency
We define the inverse document frequency (idf) of a term t as follows: idf_t = log(N / df_t), where N is the total number of documents in the collection. Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
D. TF-IDF Weighting
We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by tf-idf_{t,d} = tf_{t,d} × idf_t.
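Putting the three definitions together, a minimal tf-idf computation might look like the following; the two toy documents reuse Romanized words from Table 5 purely for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, where weight = tf(t, d) * log(N / df(t)).
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["sinimA", "citrada"], ["sinimA", "nAvu"]]
w = tfidf(docs)
# 'sinimA' occurs in every document, so its idf = log(2/2) = 0 and its
# tf-idf weight vanishes, while the rarer 'citrada' keeps a positive weight.
print(w[0]["sinimA"], w[0]["citrada"] > 0)  # 0.0 True
```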
VI. METHODOLOGY
Document/text classification plays an important role in several applications, especially for organizing, classifying, searching and concisely representing large volumes of information. The goal of text categorization is to label documents according to a predefined set of classes. Classification models describe data relationships and predict values for future observations.
Figure-4
A wide variety of techniques have been designed for text classification, namely decision tree methods, rule-based classifiers, Bayes classifiers, the nearest neighbor classifier, SVM classifiers, regression modeling, neural network classifiers and so on. In this paper the decision tree, Naïve Bayes and nearest neighbor classifiers are used for categorization of the Indian-language documents.
A. Decision Tree Algorithm (C4.5)
C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set of already classified samples. Each sample consists of a p-dimensional vector (x1, x2, ..., xp), where the xi represent attributes or features of the sample, together with the class in which the sample falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision, and the algorithm then recurses on the smaller sublists. This algorithm has a few base cases:
- All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree telling it to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
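The entropy-based splitting criterion can be illustrated with a short sketch. Note that this computes the plain information gain; C4.5 actually normalizes it (the gain ratio), which is omitted here for brevity, and the toy labels are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split):
    # split: the label sublists produced by partitioning on one attribute.
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in split)
    return entropy(labels) - remainder

labels = ["kannada", "kannada", "tamil", "tamil"]
# A perfect split separates the classes, so the gain equals the full entropy.
print(information_gain(labels, [["kannada", "kannada"], ["tamil", "tamil"]]))
```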
B. Naive Bayes Algorithm
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. One of the main reasons the NB model works well for the text domain is that the evidences are "vocabularies" or "words" appearing in texts, and the size of the vocabulary is typically in the range of thousands. This large number of evidences (or vocabulary terms) makes the NB model work well for the text classification problem [5][6]. NB classifiers can be applied to text categorization in two different ways [7]: one is called the multi-variate Bernoulli model, and the other the multinomial model. The difference between these two models stems from the interpretation of the probability p(x|c).
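A minimal sketch of the multinomial model with Laplace (add-one) smoothing; the smoothing choice and the toy two-document training set are assumptions made for this illustration, not details from the paper:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Multinomial model: p(x|c) is estimated from term counts per class,
    # with add-one smoothing over the shared vocabulary.
    vocab = {t for d in docs for t in d}
    class_tokens = defaultdict(list)
    for doc, label in zip(docs, labels):
        class_tokens[label].extend(doc)
    model = {}
    for label, tokens in class_tokens.items():
        counts = Counter(tokens)
        total = len(tokens) + len(vocab)
        prior = math.log(labels.count(label) / len(labels))
        model[label] = (prior, {t: math.log((counts[t] + 1) / total) for t in vocab})
    return model

def classify_nb(model, doc):
    # Pick the class maximizing log-prior plus summed log-likelihoods.
    def score(label):
        prior, logp = model[label]
        return prior + sum(logp.get(t, 0.0) for t in doc)
    return max(model, key=score)

model = train_nb([["sinimA", "citrada"], ["oru", "gkaL"]], ["kannada", "tamil"])
print(classify_nb(model, ["sinimA"]))  # kannada
```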
C. Nearest Neighbor Algorithm
Finding the nearest neighbors of a document means taking a new, unlabeled document and predicting its label. Our documents have been transformed to vectors; each document is now a vector of numbers. Figure 2 is a graphic of the overall process. The new document is embodied in a vector. That vector is compared to all the other vectors, and a score for similarity is computed.

Table-5: Zipf's law - Frequent words
Kannada: 'sinimA' (ಸಿನಿಮಾ), 'I' (ಈ), 'eMdu' (ಎಂದು), 'citrada' (ಚಿತ್ರದ), 'manaraMjane' (ಮನರಂಜನೆ)
Tamil: 'oru' (ஒரு), 'nta' (ன்த), 'gkaL' (க்கள்), 'paRavaikaLin' (பறவைகளின்), 'kirIccoli' (கிரீச்சசொலி)
Telugu: 'oka' (ఒక), 'mariyu' (మరియు), 'citraM' (చిత్రం), 'yokka' (యొక్క), 'sinimA' (సినిమా)
Figure-5
K-Nearest Neighbor is one of the most popular algorithms for text categorization. Many researchers have found that the kNN algorithm achieves very good performance in their experiments on different data sets [8][9][10].

Table-6

The k-NN algorithm is a similarity-based learning algorithm that has been shown to be very effective for a variety of problem domains including text categorization [11][12]. Given a test document, the k-NN algorithm finds the k nearest neighbors among the training documents, and uses the categories of the k neighbors to weight the category candidates. The similarity score of each neighbor document to the test document is used as the weight of the categories of that neighbor document. If several of the k nearest neighbors share a category, then the per-neighbor weights of that category are added together, and the resulting weighted sum is used as the likelihood score of the candidate category. A ranked list is obtained for the test document, and binary category assignments are obtained by thresholding on these scores [13].
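The weighted-voting scheme above can be sketched as follows. Cosine similarity over term-count vectors is an assumption made here, since the paper does not name the similarity measure, and the three one-document "training sets" are toy data:

```python
import math
from collections import Counter

def cosine(a, b):
    # a, b: {term: weight} vectors (Counters work directly).
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(train, labels, test, k):
    # Each of the k nearest neighbors votes for its category with a weight
    # equal to its similarity to the test document; votes are summed per category.
    neighbors = sorted(zip((cosine(test, d) for d in train), labels), reverse=True)[:k]
    scores = Counter()
    for sim, label in neighbors:
        scores[label] += sim
    return scores.most_common(1)[0][0]

train = [Counter(["sinimA", "citrada"]), Counter(["oru", "gkaL"]), Counter(["oka", "citraM"])]
labels = ["kannada", "tamil", "telugu"]
print(knn_classify(train, labels, Counter(["sinimA", "manaraMjane"]), k=1))  # kannada
```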
VII. RELATED WORK
Text categorization is an active and upcoming research area of text mining. Many machine learning algorithms have been applied to text categorization for many years, including decision tree learning, Bayesian learning, nearest neighbor learning and artificial neural networks; early such works may be found in [14][15]. The author of [16] has reviewed the developments in automatic text categorization over the last decade and described some of the techniques used for automatic categorization. They also mention that, as far as Indian languages are concerned, few results are available, and that large-scale corpora, good morphological analyzers and stemmers are essential to cope with the richness of morphology of the Dravidian languages. That paper was the motivation for our work on the South Indian languages. The author of [17] applied a classification algorithm with a domain (Sports) ontology to the classification of Punjabi text documents related to sports. Classification techniques such as the kNN technique, the Naive Bayes algorithm and association-based classification need a training set of labeled documents to train the classifier before it can classify unlabelled documents. The authors of [18] discussed the problem of automatically classifying Arabic text documents; they used the NB algorithm, which is based on a probabilistic framework, and note that feature selection often increases classification accuracy. The author of [19] used the classification algorithm C5.0 to extract knowledge from Oriya-language text documents.
VIII. DATASET
The corpus was created using three South Indian languages: Kannada, Tamil and Telugu. We used 100 documents related to cinema for each language, so the corpus was created from 300 documents. All the documents are cinema-related and were taken from the WWW. From these documents a corpus was created; its properties are discussed above.
IX. EXPERIMENTS AND RESULT
A. Algorithm
1. Identify specific language files.
2. Associate a language label with each of the files.
3. Build a corpus C.
4. Preprocess the corpus C:
   a) Apply a stemming algorithm to reduce all the words to their root form.
5. Generate the VSM or term-document matrix using Binary Term Occurrence D(i, j), where i is document i and j is the jth term of document i. (TF and TF-IDF are not used in the matrix because only the occurrence of a term in the file is relevant for classification; the distinctiveness or rarity of the term is irrelevant in this approach.)
6. Train the classifiers (kNN, J48 and NB) using C as training examples.
Figure-6
B. Evaluation of Text Classifier
The effectiveness of a text classifier can be evaluated in terms of its precision (p), recall (r) and F-measure. Recall for a category is defined as the percentage of correctly classified documents among all documents belonging to that category, i.e. a measure of completeness; precision is the percentage of correctly classified documents among all documents that were assigned to the category by the classifier, i.e. a measure of exactness. The F-measure combines the two measures in an ad hoc way.

Confusion Matrix (rows: actual class; columns: predicted class)

          kNN Classifier        j48 Classifier        NB Classifier
          Kan  Tam  Tel         Kan  Tam  Tel         Kan  Tam  Tel
Kannada    87    2   11          99    1    0         100    0    0
Tamil       2   96    2           1   97    2           2   98    0
Telugu      4    0   96           4    0   96           5    0   95

Table-7
Figure-7
Figure-8
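The per-class measures can be read off a confusion matrix directly. The sketch below uses the paper's own kNN confusion matrix from Section IX; the resulting overall accuracy of 93% matches the figure reported for kNN in the conclusion:

```python
def per_class_metrics(cm):
    # cm[i][j]: number of documents of true class i assigned to class j.
    metrics = []
    for i in range(len(cm)):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                             # missed documents of class i
        fp = sum(cm[r][i] for r in range(len(cm))) - tp  # documents wrongly assigned to class i
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f = 2 * p * r / (p + r)
        metrics.append((p, r, f))
    return metrics

# kNN confusion matrix (classes: Kannada, Tamil, Telugu).
cm = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
accuracy = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
print(round(accuracy * 100, 2))  # 93.0
```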
X. CONCLUSION
The application of the mining algorithms k-Nearest Neighbor (kNN), Naïve Bayes and decision tree C4.5 (J48) to South Indian language text in Kannada, Tamil and Telugu has been evaluated. We used a corpus of our own, consisting of 300 documents belonging to 3 categories. All the documents were preprocessed by removing stop words and light-stemming all the tokens, and the documents were represented using the vector space model. To measure the effectiveness of the classification algorithms, we used the traditional recall and precision measures. The results show that kNN gives 93% accuracy, decision tree C4.5 gives 97.33% and Naïve Bayes gives 97.66% accuracy. Very satisfactory results have been achieved. It has been shown that text mining algorithms are also applicable to Indian languages for categorization, and that Naïve Bayes is an efficient algorithm for Indian-language text categorization.
REFERENCES
[1] Jonathan Clark, "Text Mining and Scholarly Publishing", Publishing Research Consortium, 2013.
[2] Salton G. and McGill M., Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[3] Zipf G.K., The Psychobiology of Language, Houghton-Mifflin, Boston, MA, 1935.
[4] B. S. Harish, D. S. Guru and S. Manjunath, "Representation and Classification of Text Documents: A Brief Review", IJCA Special Issue on "Recent Trends in Image Processing and Pattern Recognition" (RTIPPR), 2010.
[5] T. Mitchell, Machine Learning, McGraw Hill, 1996; Norbert Fuhr, "Probabilistic models in information retrieval", The Computer Journal, 35(3):243-255, 1992.
[6] Makoto Iwayama and Takenobu Tokunaga, "A probabilistic model for text categorization: Based on a single random variable with multiple values", in Proceedings of ANLP-94, 4th Conference on Applied Natural Language Processing, pages 162-167, Stuttgart, DE, 1994. Association for Computational Linguistics, Morristown, US.
[7] Yang Y. and Liu X., "A re-examination of text categorization methods", in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42-49, 1999.
[8] Joachims T., "Text categorization with support vector machines: Learning with many relevant features", in Proceedings of the European Conference on Machine Learning, 1998.
[9] Li Baoli, Chen Yuzhong and Yu Shiwen, "A comparative study on automatic categorization methods for Chinese search engine", in Proceedings of the Eighth Joint International Computer Conference, Hangzhou: Zhejiang University Press, pages 117-120, 2002.
[10] Yang Y. and Liu X., "A re-examination of text categorization methods", in Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42-49, 1999.
[11] Mitchell T.M., Machine Learning, McGraw Hill, New York, NY, 1996.
[12] Yang Y. and Liu X., "A re-examination of text categorization methods", in Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42-49, 1999.
[13] D. Lewis and M. Ringnette, "Comparison of two learning algorithms for text categorization", Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 1994.
[14] E. Wiener, J. O. Pedersen and A. S. Zeigend, "A neural network approach to topic spotting", Proceedings of the Fourth Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.
[15] Kavi Narayana Murthy, "Advances in Automatic Text Categorization", Department of Computer and Information Science, University of Hyderabad.
[16] "Punjabi Text Classification using Naïve Bayes, Centroid and Hybrid
Approach” Nidhi and Vishal Gupta, Department of Computer Science and
Engineering, Panjab University, Chandigarh, India
[17] “Naïve Bayesian Based on Chi Square to Categorize Arabic Data”
Fadi Thabtah, Philadelphia University, Mohammad Ali H. Eljinini, AL-Isra
Private University, Mannam Zamzeer, University of Jordan, Wa’el Musa Hadi,
AL-Isra Private University, Jordan, Communications of the IBIMA, Volume
10, 2009 ISSN: 1943-7765
[18] “Oriya Language Text Mining Using C5.0 Algorithm” Sohag Sundar
Nanda, Soumya Mishra, Sanghamitra Mohanty, P.G. Department of Computer
Science and Application,Utkal University, India ISSN:0975-9646
[19] Andrew McCallum and Kamal Nigam. A comparison of event models for
naïve bayes text classification. In Proceedings of AAAI-98 Workshop on
Learning for Text Categorization, pages 41–48, 1998.