The document describes an algorithmic approach to keyword extraction and text document classification. It discusses using naive Bayes and support vector machine (SVM) classifiers with keyword and key phrases extracted via porter stemming as training data. The algorithm performs preprocessing like stop word removal and stemming. Features are selected based on term frequency-inverse document frequency (TF-IDF). Documents are represented as term-document matrices. Naive Bayes and SVM are then applied for classification and compared, with the goal of improving supervised and unsupervised classification accuracy.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections such as information extraction, term extraction, text categorization, and storage of intermediate representations. The techniques that are used to analyse these intermediate representations such as clustering, distribution analysis, association rules and visualisation of the results.
Review of Various Text Categorization Methodsiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...csandit
Text mining and Text classification are the two pro
minent and challenging tasks in the field of
Machine learning. Text mining refers to the process
of deriving high quality and relevant
information from text, while Text classification de
als with the categorization of text documents
into different classes. The real challenge in these
areas is to address the problems like handling
large text corpora, similarity of words in text doc
uments, and association of text documents with
a subset of class categories. The feature extractio
n and classification of such text documents
require an efficient machine learning algorithm whi
ch performs automatic text classification.
This paper describes the classification of product
review documents as a multi-label
classification scenario and addresses the problem u
sing Structured Support Vector Machine.
The work also explains the flexibility and performan
ce of the proposed approach for e
fficient text classification.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections such as information extraction, term extraction, text categorization, and storage of intermediate representations. The techniques that are used to analyse these intermediate representations such as clustering, distribution analysis, association rules and visualisation of the results.
Review of Various Text Categorization Methodsiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
T EXT M INING AND C LASSIFICATION OF P RODUCT R EVIEWS U SING S TRUCTURED S U...csandit
Text mining and Text classification are the two pro
minent and challenging tasks in the field of
Machine learning. Text mining refers to the process
of deriving high quality and relevant
information from text, while Text classification de
als with the categorization of text documents
into different classes. The real challenge in these
areas is to address the problems like handling
large text corpora, similarity of words in text doc
uments, and association of text documents with
a subset of class categories. The feature extractio
n and classification of such text documents
require an efficient machine learning algorithm whi
ch performs automatic text classification.
This paper describes the classification of product
review documents as a multi-label
classification scenario and addresses the problem u
sing Structured Support Vector Machine.
The work also explains the flexibility and performan
ce of the proposed approach for e
fficient text classification.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
Text preprocessing is a vital stage in text classification (TC) particularly and text mining generally. Text preprocessing tools is to reduce multiple forms of the word to one form. In addition, text preprocessing techniques are provided a lot of significance and widely studied in machine learning. The basic phase in text classification involves preprocessing features, extracting relevant features against the features in a database. However, they have a great impact on reducing the time requirement and speed resources needed. The effect of the preprocessing tools on English text classification is an area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study includes using the raw text, the tokenization, the stop words, and the stemmed. Two different methods chi-square and TF-IDF with cosine similarity score for feature extraction are used based on BBC English dataset. The Experimental results show that the text preprocessing effect on the feature extraction methods that enhances the performance of English text classification especially for small threshold values.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The research of clustering task is to be performed prior to its adaptation in the text environment. Conventional approaches typically emphasized on the quantitative information where the selected features are numbers. Efforts also have been put forward for achieving efficient clustering in the context of categorical information where the selected features can assume nominal values. This manuscript presents an in-depth analysis of challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering along with the pros and cons of each model. In addition, it also focuses on various latest developments in the clustering task in the social network and associated environments.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
Assigning documents to related categories is critical task which is used for effective document retrieval. Automatic text classification is the process of assigning new text document to the predefined categories based on its content. In this paper, we implemented and performed comparison of Naïve Bayes and Centroid-based algorithms for effective document categorization of English language text. In Centroid Based algorithm, we used Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate centroid of each class. Experiment is performed on R-52 dataset of Reuters-21578 corpus. Micro Average F1 measure is used to evaluate the performance of classifiers. Experimental results show that Micro Average F1 value for NB is greatest among all followed by Micro Average F1 value of CGC which is greater than Micro Average F1 of AAC. All these results are valuable for future research
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTIONijistjournal
The user generated content on the web grows rapidly in this emergent information age. The evolutionary changes in technology make use of such information to capture only the user’s essence and finally the useful information are exposed to information seekers. Most of the existing research on text information processing, focuses in the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis for text data available in each forum. This approach analyses the forum text data and computes value for each word of text. The proposed approach combines K-means clustering and Support Vector Machine with PSO (SVM-PSO) classification algorithm that can be used to group the forums into two clusters forming hotspot forums and non-hotspot forums within the current time span. The proposed system accuracy is compared with the other classification algorithms such as Naïve Bayes, Decision tree and SVM. The experiment helps to identify that K-means and SVM-PSO together achieve highly consistent results.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
Speaker identification is one of the most important technologies nowadays. Many fields such as
bioinformatics and security are using speaker identification. Also, almost all electronic devices are using
this technology too. Based on number of text, speaker identification divided into text dependent and text
independent. On many fields, text independent is mostly used because number of text is unlimited. So, text
independent is generally more challenging than text dependent. In this research, speaker identification text
independent with Indonesian speaker data was modelled with Vector Quantization (VQ). In this research
VQ with K-Means initialization was used. K-Means clustering also was used to initialize mean and
Hierarchical Agglomerative Clustering was used to identify K value for VQ. The best VQ accuracy was
59.67% when k was 5. According to the result, Indonesian language could be modelled by VQ. This
research can be developed using optimization method for VQ parameters such as Genetic Algorithm or
Particle Swarm Optimization.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text
classification. In this paper, Fast Fuzzy Feature clustering for text classification is proposed. It
is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011.
The word in the feature vector of the document is grouped into the cluster in less iteration. The
numbers of iterations required to obtain cluster centers are reduced by transforming clusters
center dimension from n-dimension to 2-dimension. Principle Component Analysis with slit
change is used for dimension reduction. Experimental results show that, this method improve
the performance by significantly reducing the number of iterations required to obtain the cluster
center. The same is being verified with three benchmark datasets
Farthest Neighbor Approach for Finding Initial Centroids in K- MeansWaqas Tariq
Text document clustering is gaining popularity in the knowledge discovery field for effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data. A widely studied data mining problem in the text domain is clustering. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. In this work we propose a variant method for finding initial centroids. The initial centroids are chosen by using farthest neighbors. For the partitioning based clustering algorithms traditionally the initial centroids are chosen randomly but in the proposed method the initial centroids are chosen by using farthest neighbors. The accuracy of the clusters and efficiency of the partition based clustering algorithms depend on the initial centroids chosen. In the experiment, kmeans algorithm is applied and the initial centroids for kmeans are chosen by using farthest neighbors. Our experimental results shows the accuracy of the clusters and efficiency of the kmeans algorithm is improved compared to the traditional way of choosing initial centroids.
A Survey on Sentiment Categorization of Movie ReviewsEditor IJMTER
Sentiment categorization is a process of mining user generated text content and determine
the sentiment of the users towards that particular thing. It is the approach of detecting the sentiment of
the author in regard to some topics. It also known as sentiment detection, sentiment analysis and opinion
mining. It is very useful for movie production companies that interested in knowing how users feel
about their movies. For example word “excellent” indicates that the review gives positive emotion about
particular movie. The same applies to movies, songs, cars, holiday destinations, Political parties, social
network sites, web blogs, discussion forum and so on. Sentiment categorization can be carried out by
using three approaches. First, Supervised machine learning based text classifier on Naïve Bayes,
Maximum Entropy, SVM, kNN classifier, hidden marcov model. Second, Unsupervised Semantic
Orientation scheme of extracting relevant N-grams of the text and then labelling. Third, SentiWordNet
based publicly available library.
03 fauzi indonesian 9456 11nov17 edit septianIAESIJEECS
Since the rise of WWW, information available online is growing rapidly. One of the example is Indonesian online news. Therefore, automatic text classification became very important task for information filtering. One of the major issue in text classification is its high dimensionality of feature space. Most of the features are irrelevant, noisy, and redundant, which may decline the accuracy of the system. Hence, feature selection is needed. Maximal Marginal Relevance for Feature Selection (MMR-FS) has been proven to be a good feature selection for text with many redundant features, but it has high computational complexity. In this paper, we propose a two-phased feature selection method. In the first phase, to lower the complexity of MMR-FS we utilize Information Gain first to reduce features. This reduced feature will be selected using MMR-FS in the second phase. The experiment result showed that our new method can reach the best accuracy by 86%. This new method could lower the complexity of MMR-FS but still retain its accuracy.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
Sentiment analysis focuses on analyzing web documents, especially user-generated content such as product
reviews, to identify opinionated documents, sentences and opinion holders. Most of the time classifiers trained in one
domain do not perform well in another domain. The existing approaches do not detect sentiment and topics
simultaneously. Sentiments may differ with topics. Our proposed model called Joint Sentiment Topic (JST) model to
detect sentiments and topics simultaneously from text. This model is based on Gibbs sampling algorithm. Besides,
unlike supervised approaches to opinion mining which often fail to produce good performance when shifting to other
domains, the semi-supervised nature of JST makes it highly portable to other domains. JST model performs better
when compared to existing supervised approaches.
Text preprocessing is a vital stage in text classification (TC) particularly and text mining generally. Text preprocessing tools is to reduce multiple forms of the word to one form. In addition, text preprocessing techniques are provided a lot of significance and widely studied in machine learning. The basic phase in text classification involves preprocessing features, extracting relevant features against the features in a database. However, they have a great impact on reducing the time requirement and speed resources needed. The effect of the preprocessing tools on English text classification is an area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study includes using the raw text, the tokenization, the stop words, and the stemmed. Two different methods chi-square and TF-IDF with cosine similarity score for feature extraction are used based on BBC English dataset. The Experimental results show that the text preprocessing effect on the feature extraction methods that enhances the performance of English text classification especially for small threshold values.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The research of clustering task is to be performed prior to its adaptation in the text environment. Conventional approaches typically emphasized on the quantitative information where the selected features are numbers. Efforts also have been put forward for achieving efficient clustering in the context of categorical information where the selected features can assume nominal values. This manuscript presents an in-depth analysis of challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering along with the pros and cons of each model. In addition, it also focuses on various latest developments in the clustering task in the social network and associated environments.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
Assigning documents to related categories is critical task which is used for effective document retrieval. Automatic text classification is the process of assigning new text document to the predefined categories based on its content. In this paper, we implemented and performed comparison of Naïve Bayes and Centroid-based algorithms for effective document categorization of English language text. In Centroid Based algorithm, we used Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate centroid of each class. Experiment is performed on R-52 dataset of Reuters-21578 corpus. Micro Average F1 measure is used to evaluate the performance of classifiers. Experimental results show that Micro Average F1 value for NB is greatest among all followed by Micro Average F1 value of CGC which is greater than Micro Average F1 of AAC. All these results are valuable for future research
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTIONijistjournal
The user generated content on the web grows rapidly in this emergent information age. The evolutionary changes in technology make use of such information to capture only the user’s essence and finally the useful information are exposed to information seekers. Most of the existing research on text information processing, focuses in the factual domain rather than the opinion domain. In this paper we detect online hotspot forums by computing sentiment analysis for text data available in each forum. This approach analyses the forum text data and computes value for each word of text. The proposed approach combines K-means clustering and Support Vector Machine with PSO (SVM-PSO) classification algorithm that can be used to group the forums into two clusters forming hotspot forums and non-hotspot forums within the current time span. The proposed system accuracy is compared with the other classification algorithms such as Naïve Bayes, Decision tree and SVM. The experiment helps to identify that K-means and SVM-PSO together achieve highly consistent results.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Modeling Text Independent Speaker Identification with Vector QuantizationTELKOMNIKA JOURNAL
Speaker identification is one of the most important technologies nowadays. Many fields such as
bioinformatics and security are using speaker identification. Also, almost all electronic devices are using
this technology too. Based on number of text, speaker identification divided into text dependent and text
independent. On many fields, text independent is mostly used because number of text is unlimited. So, text
independent is generally more challenging than text dependent. In this research, speaker identification text
independent with Indonesian speaker data was modelled with Vector Quantization (VQ). In this research
VQ with K-Means initialization was used. K-Means clustering also was used to initialize mean and
Hierarchical Agglomerative Clustering was used to identify K value for VQ. The best VQ accuracy was
59.67% when k was 5. According to the result, Indonesian language could be modelled by VQ. This
research can be developed using optimization method for VQ parameters such as Genetic Algorithm or
Particle Swarm Optimization.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text
classification. In this paper, Fast Fuzzy Feature clustering for text classification is proposed. It
is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011.
The word in the feature vector of the document is grouped into the cluster in less iteration. The
numbers of iterations required to obtain cluster centers are reduced by transforming clusters
center dimension from n-dimension to 2-dimension. Principle Component Analysis with slit
change is used for dimension reduction. Experimental results show that, this method improve
the performance by significantly reducing the number of iterations required to obtain the cluster
center. The same is being verified with three benchmark datasets
Farthest Neighbor Approach for Finding Initial Centroids in K- MeansWaqas Tariq
Text document clustering is gaining popularity in the knowledge discovery field for effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data. A widely studied data mining problem in the text domain is clustering. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. In this work we propose a variant method for finding initial centroids. The initial centroids are chosen by using farthest neighbors. For the partitioning based clustering algorithms traditionally the initial centroids are chosen randomly but in the proposed method the initial centroids are chosen by using farthest neighbors. The accuracy of the clusters and efficiency of the partition based clustering algorithms depend on the initial centroids chosen. In the experiment, kmeans algorithm is applied and the initial centroids for kmeans are chosen by using farthest neighbors. Our experimental results shows the accuracy of the clusters and efficiency of the kmeans algorithm is improved compared to the traditional way of choosing initial centroids.
A Survey on Sentiment Categorization of Movie ReviewsEditor IJMTER
Sentiment categorization is a process of mining user generated text content and determine
the sentiment of the users towards that particular thing. It is the approach of detecting the sentiment of
the author in regard to some topics. It also known as sentiment detection, sentiment analysis and opinion
mining. It is very useful for movie production companies that interested in knowing how users feel
about their movies. For example word “excellent” indicates that the review gives positive emotion about
particular movie. The same applies to movies, songs, cars, holiday destinations, Political parties, social
network sites, web blogs, discussion forum and so on. Sentiment categorization can be carried out by
using three approaches. First, Supervised machine learning based text classifier on Naïve Bayes,
Maximum Entropy, SVM, kNN classifier, hidden marcov model. Second, Unsupervised Semantic
Orientation scheme of extracting relevant N-grams of the text and then labelling. Third, SentiWordNet
based publicly available library.
03 fauzi indonesian 9456 11nov17 edit septianIAESIJEECS
Since the rise of WWW, information available online is growing rapidly. One of the example is Indonesian online news. Therefore, automatic text classification became very important task for information filtering. One of the major issue in text classification is its high dimensionality of feature space. Most of the features are irrelevant, noisy, and redundant, which may decline the accuracy of the system. Hence, feature selection is needed. Maximal Marginal Relevance for Feature Selection (MMR-FS) has been proven to be a good feature selection for text with many redundant features, but it has high computational complexity. In this paper, we propose a two-phased feature selection method. In the first phase, to lower the complexity of MMR-FS we utilize Information Gain first to reduce features. This reduced feature will be selected using MMR-FS in the second phase. The experiment result showed that our new method can reach the best accuracy by 86%. This new method could lower the complexity of MMR-FS but still retain its accuracy.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
Sentiment analysis focuses on analyzing web documents, especially user-generated content such as product
reviews, to identify opinionated documents, sentences and opinion holders. Most of the time classifiers trained in one
domain do not perform well in another domain. The existing approaches do not detect sentiment and topics
simultaneously. Sentiments may differ with topics. Our proposed model called Joint Sentiment Topic (JST) model to
detect sentiments and topics simultaneously from text. This model is based on Gibbs sampling algorithm. Besides,
unlike supervised approaches to opinion mining which often fail to produce good performance when shifting to other
domains, the semi-supervised nature of JST makes it highly portable to other domains. JST model performs better
when compared to existing supervised approaches.
Text classification supervised algorithms with term frequency inverse documen...IJECEIAES
Over the course of the previous two decades, there has been a rise in the quantity of text documents stored digitally. The ability to organize and categorize those documents in an automated mechanism, is known as text categorization which is used to classify them into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers. This is due to the significant impact this concept has on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to comprehend complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms employing various evaluation techniques. Thereafter, an evaluation is conducted on the constraints of every technique and how they can be applied to real-life situations.
Survey of Machine Learning Techniques in Textual Document ClassificationIOSR Journals
Classification of Text Document points towards associating one or more predefined categories based
on the likelihood expressed by the training set of labeled documents. Many machine learning algorithms plays
an important role in training the system with predefined categories. The importance of Machine learning
approach has felt because of which the study has been taken up for text document classification based on the
statistical event models available. The aim of this paper is to present the important techniques and
methodologies that are employed for text documents classification, at the same time making awareness of some
of the interesting challenges that remain to be solved, focused mainly on text representation and machine
learning techniques.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance. It is combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization is also obtained.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Context Driven Technique for Document ClassificationIDES Editor
In this paper we present an innovative hybrid Text
Classification (TC) system that bridges the gap between
statistical and context based techniques. Our algorithm
harnesses contextual information at two stages. First it extracts
a cohesive set of keywords for each category by using lexical
references, implicit context as derived from LSA and wordvicinity
driven semantics. And secondly, each document is
represented by a set of context rich features whose values are
derived by considering both lexical cohesion as well as the extent
of coverage of salient concepts via lexical chaining. After
keywords are extracted, a subset of the input documents is
apportioned as training set. Its members are assigned categories
based on their keyword representation. These labeled
documents are used to train binary SVM classifiers, one for
each category. The remaining documents are supplied to the
trained classifiers in the form of their context-enhanced feature
vectors. Each document is finally ascribed its appropriate
category by an SVM classifier.
NLP Techniques for Text Classification.docxKevinSims18
Natural Language Processing (NLP) is an area of computer science and artificial intelligence that aims to enable machines to understand and interpret human language. Text classification is one of the most common tasks in NLP, and it involves categorizing text into predefined categories or classes. In this blog post, we will explore some of the most effective NLP techniques for text classification.
A simplified classification computational model of opinion mining using deep ...IJECEIAES
Opinion and attempts to develop an automated system to determine people's viewpoints towards various units such as events, topics, products, services, organizations, individuals, and issues. Opinion analysis from the natural text can be regarded as a text and sequence classification problem which poses high feature space due to the involvement of dynamic information that needs to be addressed precisely. This paper introduces effective modelling of human opinion analysis from social media data subjected to complex and dynamic content. Firstly, a customized preprocessing operation based on natural language processing mechanisms as an effective data treatment process towards building quality-aware input data. On the other hand, a suitable deep learning technique, bidirectional long short term-memory (Bi-LSTM), is implemented for the opinion classification, followed by a data modelling process where truncating and padding is performed manually to achieve better data generalization in the training phase. The design and development of the model are carried on the MATLAB tool. The performance analysis has shown that the proposed system offers a significant advantage in terms of classification accuracy and less training time due to a reduction in the feature space by the data treatment operation.
Experimental Result Analysis of Text Categorization using Clustering and Clas...ijtsrd
In a world that routinely produces more textual data. It is very critical task to managing that textual data. There are many text analysis methods are available to managing and visualizing that data, but many techniques may give less accuracy because of the ambiguity of natural language. To provide the ne grained analysis, in this paper introduce e cient machine learning algorithms for categorize text data. To improve the accuracy, in proposed system I introduced Natural language toolkit NLTK python library to perform natural language processing. The main aim of proposed system is to generalize the model for real time text categorization applications by using e cient text classi cation as well as clustering machine learning algorithms and nd the efficient and accurate model for input dataset using performance measure concept. Patil Kiran Sanajy | Prof. Kurhade N. V. ""Experimental Result Analysis of Text Categorization using Clustering and Classification Algorithms"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25077.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25077/experimental-result-analysis-of-text-categorization-using-clustering-and-classification-algorithms/patil-kiran-sanajy
An effective pre processing algorithm for information retrieval systemsijdms
The Internet is probably the most successful distributed computing system ever. However, our capabilities
for data querying and manipulation on the internet are primordial at best. The user expectations are
enhancing over the period of time along with increased amount of operational data past few decades. The
data-user expects more deep, exact, and detailed results. Result retrieval for the user query is always
relative o the pattern of data storage and index. In Information retrieval systems, tokenization is an
integrals part whose prime objective is to identifying the token and their count. In this paper, we have
proposed an effective tokenization approach which is based on training vector and result shows that
efficiency/ effectiveness of proposed algorithm. Tokenization on documents helps to satisfy user’s
information need more precisely and reduced search sharply, is believed to be a part of information
retrieval. Pre-processing of input document is an integral part of Tokenization, which involves preprocessing
of documents and generates its respective tokens which is the basis of these tokens probabilistic
IR generate its scoring and gives reduced search space. The comparative analysis is based on the two
parameters; Number of Token generated, Pre-processing time.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is made use of for document classification with pattern recognition and machine learning. Advantages of a number
of classification algorithms have been studied in this paper to classify documents. An example of these algorithms is: Naive Bayes'
algorithm, K-Nearest Neighbor, Decision Tree etc. This paper presents a comparative study of advantages and disadvantages of
the above mentioned classification algorithm.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
Semi Automated Text Categorization Using Demonstration Based Term SetIJCSEA Journal
Manual Analysis of huge amount of textual data requires a tremendous amount of processing time and effort in reading the text and organizing them in required format. In the current scenario, the major problem is with text categorization because of the high dimensionality of feature space. Now-a-days there are many methods available to deal with text feature selection. This paper aims at such semi automated text categorization feature selection methodology to deal with a massive data using one of the phases of David Merrill’s First principles of instruction (FPI). It uses a pre-defined category group by providing them with the proper training set based on the demonstration phase of FPI. The methodology involves the text tokenization, text categorization and text analysis.
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A...idescitation
Genetic Algorithm (GA) has been a successful method
that is been used for extracting keywords. This paper presents
a full method by which keywords can be derived from the
various corpuses. We have built equations that exploit the
structure of the documents from which the keywords need to
be extracted. The procedures are been broken into two
distinguished profiles: one is to weigh the words in the whole
document content and the other is to explore the possibilities
of the occurrence of key terms by using genetic algorithm.
The basic equations of the heuristic mechanism is been varied
to allow the complete exploitation of document. The Genetic
Algorithm and the enhanced standard deviation method is
used in full potential to enable the generation of the key
terms that describe the given text document. The new
technique has an enhanced performance and better time
complexities.
1. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
An Algorithmic Approach of Keyword Extraction
96
based Text Document Classification
Yoganand.C.S1, Vadivel.R2
PG Student of CSE1, Assistant Professor of CSE2
Adithya Institute of Technology, Coimbatore1, 2
info.yoganand@gmail.com1,vadivelcse@gmail.com2
Abstract-The various institutions and industries are converting their documents into electronic text files. The
documents may contains applications, personal documents, properties documents etc. The categorization of the
text documents are really makes a very big issue. In this paper we propose the various techniques for the
document classification process. These documents may be in the form of supervised, unsupervised or semi-supervised
documents. The supervised documents are the standard documents which are contains the proper
format of data. They can be classified by using the Naïve Bayes model with the help Hidden Markov Model
(HMM). The keyword and key Phrases are extracted and used as a training set for the further document
classification along with the training dataset. The keyword extraction can be done based on the Word count
method and Porter stemming algorithms. Further documents can be classified using Naïve Bayes and Support
Vector Machine (SVM), Subspace, Decision Tree and k-Nearest Neighbor (k-NN) methods.
Index Terms-Support Vector Machines (SVM), Hidden Markov Model (HMM), k-Nearest Neighbor (k-NN),
Text categorization, mapping models.
1. INTRODUCTION
Nowadays all institutions and private companies
keep their files in electronic format in order to reduce
the paperwork and, at the same time provide instant
access to the information contained. Document
clustering and classification in one of the most
important text mining methods that are developed to
help users effectively navigate, summarize and
organize text documents. Document classification can
be defined as the task of automatically categorizing
collections of electronic documents into their
annotated classes based on their contents [3]. Recent
years, this has become important due to the advent of
large amounts of data in digital form. Document
classification in the form of text classification systems
have been widely implemented in numerous
applications such as spam filtering, emails
categorizing, directory maintenance and plagiarism
checking processes [6].
Data mining is useful in discovering implicit,
potentially valuable information or knowledge and
previously unknown from large datasets. Text
Document classification denotes the test of assigning
raw text documents to one or more pre-defined
categories [2]. This is a direct concept from machine
learning, which denotes the declaration of a set of
labelled categories as a way to represent the
documents, and a text classifier trained with a labelled
training set. Among these approaches, Bayesian
classification has been widely implemented in many
real world applications due to its relatively simple
training and clustering algorithms [7].
The concept of text categorization is the
classification of documents into a fixed number of
predefined categories or classes [4]. Each document
can be classified into exactly one or more category
automatically. Some of the documents are not
classified into any category [5]. This is known as
supervised learning problem. Since categories may
overlap, each category is treated as a separate binary
classification problem.Each of the document
classification schemes previously mentioned has its
own unique properties and associated problems. The
decision tree induction algorithms and the rule
induction algorithm are simple to understand and
interpret the classification [1]. However, these
algorithms do not work well when the number of
distinguishing features between documents is large.
The k-NN algorithm is easy to implement and shows
its effectiveness in a variety of problem domains [8].
As a trade-off to its simplicity, Bayesian classification
has been reported as one of the poorest-performing
classification approaches by many research groups
through extensive experiments and evaluations [7].
The paper is organized as follows: section 1 deals
with Introduction of my paper. Section 2 includes the
literature survey of my concept. Section 3 explains
related work, Section 4 gives brief explanation about
proposed system, system architecture and workflow
of the project. Section 5 contains Experimental setup
details and Section 6 contains the Performance
analysis, results graph of the document classification
and so on.
2. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
97
2. LITERATURE SURVEY
2.1. Naïve Bayes Model for Textclassification
Naïve Bayes classifiers which are widely used for
text classification in machine learning are based on
the conditional probability of features measures
belonging to a class, feature selection methods are
used for feature selection. An auxiliary feature
method is used for text classification [3]. The
auxiliary feature is chosen from collection of features
which is determined by using existing feature
selection method. To improve classification accuracy
the corresponding conditional probability is adjusted
[5]. The feature with auxiliary feature was found and
the probability of the feature with auxiliary feature
was adjusted after feature selection.
2.2. Support Vector Machine
The application of Support vector machine
(SVM) method is the Text Classification. The SVM
need both positive and negative training set which are
uncommon for other classification methods [11].
These negative and positive training set are needed for
the SVM to seek for the decision surface that best
separates the positive from the negative data in the n
dimensional space, is called as hyper plane [16]. SVM
classifier method is outstanding from other with its
effectiveness to improve performance of text
classification combining the HMM and SVM where
HMMs are used to as a feature extractor and then a
new feature vector is normalized as the input of
SVMs, so unknown texts are successfully classified
based on the trained SVMs, also by combing with
Bayes use to reduce number of feature which as
reducing number of dimension [13].
2.3. Decision Tree
When decision tree is used for text classification
it consist tree internal node are label by term,
branches departing from them are labelled by test on
the weight, and leaf node are represent corresponding
class labels .Tree can classify the document by
running through the query structure from root to until
it reaches a particular leaf, which represents the goal
for the classification of the text document [9]. The
decision tree classification method is outstanding
from other decision support tools with several
advantages like its simplicity in interpreting and
understanding, even for non-expert users [14]. So it is
only used in some applications processes.
2.4. Decision Rule
Decision rules classification method uses the
rule-based inference to classify documents to their
annotated categories. A popular format for
interpretable solutions is the Disjunctive Normal
Form (DNF) model [19]. A classifier for category ci
built by an inductive rule learning method consists of
a DNF rule. In the case of handling a dataset with
large number of features for each category, strict
implementation is recommended to reduce the size of
rules set without affecting the performance of the
classification [16]. The presents a hybrid method of
rule based processing and back-propagation neural
networks for spam filtering.
2.5. Term Frequency/Inverse Document
Frequency (TF-IDF)
A new improved term frequency/inverse
document frequency (TF-IDF) approach which uses
confidence, support and characteristic words to
enhance the recall and precision of text classification.
Synonyms defined by a lexicon are processed in the
improved TF-IDF approach [17]. It need to find the
best matching category for the text document. The
term (word) frequency/inverse document frequency
(TF-IDF) approach is commonly used to weigh each
word in the text document according to how unique it
is. In other words, the TF-IDF approach captures the
relevancy among words, text documents and
particular categories [1]. It put forward the novel
improved TF-IDF approach for text classification, and
will focus on this approach in the remainder of this
paper, and will describe in detail the motivation,
methodology, and implementation of the improved
TF-IDF approach.
3. RELATED WORK
The Naïve Bayes text document classification
will be depends only on the Bayesian rule. The
Bayesian rule is given below:
P(D1|D2) = ( P(D2|D1) * P(D1) ) / P(D2) (1)
where D1and D1are the two documents or two
constraints of Bayesian rule.
The probability of D1 happening given D2is
determined from the probability of D2given D1, the
probability of D1 occurring and the probability of D2.
The Bayes Rule enables the calculation of the
likelihood of event D1 given that D2 has happened.
This is used in text classification to determine the
probability that a document D2is of type D1 just by
looking at the frequencies of words in the document.
You can think of the Bayes Rule as showing how to
update the probability of event D1 happening given
that you've observed D2.
A category is represented by a collection of
words and their frequencies; the frequency is the
number of times that each word has been seen in the
documents used to train the classifier.Suppose there
are n categories C0 to Cn-1. Determining which
category a document D is most associated with means
calculating the probability that document D is in
category Ci, written P(Ci|D), for each category
Ci.Using the Bayes Rule, you can calculate P(Ci|D) by
computing:
3. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
98
P(Ci|D) = ( P(D|Ci ) * P(Ci) ) / P(D) (2)
P(Ci|D) is the probability that document D is in
category Ci; that is, the probability that given the set
of words in D, they appear in category Ci. P(D|Ci) is
the probability that for a given category Ci, the words
in D appear in that category.P(Ci) is the probability of
a given category; that is, the probability of a
document being in category Ci without considering its
contents. P(D) is the probability of that specific
document occurring.
To calculate which category D should go in, you
need to calculate P(Ci|D) for each of the categories
and find the largest probability. Because each of those
calculations involves the unknown but fixed value
P(D), you just ignore it and calculate:
P(Ci |D) = P(D|Ci) * P(Ci) (3)
P(D) can also be safely ignored because you are
interested in the relative not absolute values of
P(Ci|D), and P(D) simply acts as a scaling factor on
P(Ci|D).D is split into the set of words in the
document, called W0 through Wm-1. To calculate
P(D|Ci), calculate the product of the probabilities for
each word; that is, the likelihood that each word
appears in Ci. Here's the "naïve" step: Assume that
words appear independently from other words (which
is clearly not true for most languages) and P(D|Ci) is
the simple product of the probabilities for each word:
P(D|Ci) = P(W0|Ci) * P(W1|Ci) * ... * P(Wm-1|Ci) (4)
For any category, P(Wj|Ci) is calculated as the
number of times Wj appears in Ci divided by the total
number of words in Ci. P(Ci) is calculated as the total
number of words in Ci divided by the total number of
words in all the categories put together. Hence,
P(Ci|D) is:
P(W0|Ci) * P(W1|Ci) * ... * P(W m-1|Ci) * P(Ci) (5)
for each category, and picking the largest determines
the category for document D.
The documents are also classified by using the
SVM. They use hyperplane method for classification.
The hyperplane can be estimated using the same word
count and occurrences of words in the document.
Hyperplane can classify the text documents based on
the weight or term frequency values of the documents.
The Multi-Layer Perceptron (MLP) model is used for
SVM classification.
4. PROPOSED WORK
The Naïve Bayes and Support Vector Machine
algorithms are used for document classification. The
keywords can be extracted from documents using
porter-stemming algorithm. Fig.1 shows the workflow
diagram on the proposed system. The extracted
keywords and key-phrases are used as the training set
for further document classification. The TF-IDF
method is used to calculate the probability of word
occurrence in each document. TF is the Term
Frequency, it helps to calculate the occurrence of
word in each page of the document, IDF is the Inverse
Term Frequency, and it helps to calculate the
complete word occurrence count in whole documents.
The preprocessing steps involved in document
classifications are stop words removal and stemming
methods.
4.1. Stop Word Removal
This is the first step in preprocessing which will
generate a list of terms that describes the document
satisfactorily. The document is parsed through to find
out the list of all the words. The next process in this
step is to reduce the size of the list created by the
parsing process, generally using methods of stop
words removal and stemming. The stop words
removal accounts to 20% to 30% of total words
counts while the process of stemming reduce the
number of terms in the document. Both the process
helps in improving the effectiveness and efficiency of
text processing as they reduce the indexing file
size.Stop words are removed from each of the
document by comparing the with the stop word list.
This process reduces the number of words in the
document significantly since these stop words are
insignificant for search keywords. Stop words can be
pre-specified list of words or they can depend on the
context of the corpus.
4.2. Stemming
The next process in phase one after stop word
removal is stemming. Stemming is process of
linguistic normalization in which the variant forms of
a word is reduced to a common form. For example:
the word, connect has various forms such as connect,
connection, connective, connected, etc., Stemming
process reduces all these forms of words to a
normalized word connect. Porter’s English stemmer
algorithm is used to stem the words for each of the
document in our stemming process.
4.3. Feature Selection
Feature selection is the process of selecting a
subset of the terms occurring in the training set and
using only this subset as features in text classification.
Feature selection serves two main purposes. First, it
makes training and applying a classifier more
efficient by decreasing the size of the effective
vocabulary. This is of particular importance for
classifiers that, unlike NB, are expensive to train.
Second, feature selection often increases classification
accuracy by eliminating noise features. A noise
feature is one that, when added to the document
representation, increases the classification error on
new data. Here the Information Gain (IG). These
features are selected based on the frequency
measurement of the document.
4. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
99
4.4. Document Representation
A term-document matrix can be encoded as a
collection of n documents and m terms. An entry in
the matrix corresponds to the “weight” of a term in
the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document. The whole document collection can
therefore be seen as a m x n-feature matrix A (with m
as the number of documents) where the element aij
represents the frequency of occurrence of feature j in
document i. This was of representing the document
is called term-frequency method. The most popular
term weighting is the Inverse document frequency,
where the term frequency is weighed with respect to
the total number of times the term appears in the
corpus. There is an extension of this designated the
term frequency inverse document frequency (tf-idf).
The formulation of tf-idf is given as follows:-
Wij = tfi,j * log (N / dfi)
where Wij is the weight of the term i in document j,
tfi,j = number of occurrences of term i in document j,
N is the total number of documents in the corpus, dfi
= is the number of documents containing the term i.
Then the Naïve Bayes and Support Vector
Machine classification algorithms are applied one by
one for the document classification. The classification
results will be calculated and compared for both the
methods. Finally it produces the better performance
on classification accuracy. The k-NN technique helps
to cluster or group the classified documents into the
proper category. These are all done by using the
various steps for classification.
Fig. 1. System Architecture
Finally the supervised and unsupervised
document classification accuracy will increased by
using these various classification algorithms based on
keyword extraction processes.
5. EXPERIMENTAL SETUP
To set a benchmark for Document Classification
and allow for comparisons of other methods with the
proposed approach in this paper. The experiments are
performed on Intel Core i -3 3210, 2.3 GHz processor
and 4 GB RAM with Windows7 as an operating
system and the experiments are implemented in
Microsoft visual studio 2008 using C#.Netwith MS
SQL server and Java for document analysis and
classification process.
To measure the effectiveness of the classification
this approach can be applied and verified based on
various datasets. The various collections of datasets
are tabulated below with their topics. It contains the
total of 814 documents belonging to seven different
classes (business (B), entertainment (E), health (H),
international (I), politics (P), sports (S) and
technology (T)) used for training and two test data
sets (news items at different time intervals, see Table
1).The Reuters 21578 dataset is used as training
dataset for document classification process.
5. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
100
Table 1. Training and Test Datasets
Categories B E H I P S T
Training Data
No. of
documents
130 133 91 110 130 130 90
Total no. of
terms
1848 2048 1213 1974 2070 1659 1364
Test data
Set1
No. of
documents
110 111 79 80 110 111 79
Total no. of
terms
2155 2583 1535 1999 2439 1952 1618
Test Data
Set2
No. of
documents
100 101 78 70 101 101 70
Total no. of
terms
2046 2834 1803 2604 2070 1974 1689
The Table 2 contains the comparison of four
algorithms. The four algorithms are Naïve Bayes
(NB), Nearest Neighbor (NN), Decision Tree (DT)
and Sub Space (SS) model. The various algorithm
classification measures are tabulated.The NB
algorithm performs better on test data set 1. The
recognition rate is 83.1%. The SS algorithm
performance good on the Test Data Set 2.
Table 2. Comparison of Four algorithms
Test Data’s NB NN DT SS
Tes
t
dat
a
set1
No. of
misclassification
s
115 165 178 139
Recognition rate
(%)
83.1 75.7 73.
8
79.6
Tes
t
dat
a
set2
No. of
misclassification
s
125 179 144 111
Recognition rate
(%)
79.8
7
71.1
8
76.
8
82.1
3
The Table 3 contains of IDF calculation of
documents based on the total number of
documents(N) and Document frequency (n). The IDF
value is also based on the Number of Categories (C)
of the documents. The IDF is the Inverse Document
Frequency of the document. The IDF is calculated
based on the probability of the Document Frequency.
Table 3 IDF Calculation
Total No of
Documents
(N)
12 No. of
Categories (C)
2
Document
Frequency
(n)
7 5 3 3 6
IDF log
0.2
341
0.4
771
0.6
021
0.
60
21
0.30
10
6. PERFORMANCE ANALYSIS
100
80
60
40
20
Fig.2. Best case and Average Case of Classification
The performance analyses of different algorithms
are discussed in detail and explain in graph as follows.
The Fig. 2.Can be plotted based on the Table 2 values.
In x- Axis the number of datasets and various
algorithms are taken. In y-Axis the no of documents
are taken. The Fig.3 expose the best case and average
case performance accuracy of the document
classification. The Fig. 4 exposes the worst case of
document classification measure.
Fig.3. Worst Case of Classification
0
Test data set1 Test data set2
NB NN DT SS
200
150
100
50
0
Test data set1 Test data set2
NB NN DT SS
6. International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
101
200
150
100
50
Fig. 4. Chart for Algorithm Comparison
7. CONCLUSION
In this paper a system has been proposed for the
organization of a document in terms of content. It
comprises two stages where first one is the extraction
of keywords and key phrases for the document
classification, second one is the process of classifying
the documents based on the keywords and training
dataset. This content based classification is applicable
in the plagiarism processing and in the machine
learning processes. In future by combining these
different algorithms to improve the classification
accuracy. The future enhancement idea includes the
WordMap creation based on the relationships between
the keywords. It also improves the classification
performance.
REFERENCES
[1] Bhamidipati. N. L and Pal. S. K, “Stemming via
distribution-based word segregation for
classification and retrieval,” IEEE Trans. Syst.,
Man, Cybern. B, Cybern, vol. 37, no. 2, pp. 350–
360, Apr. 2007.
[2] Chakrabarti.S, Roy.S, and Soundalgekar.M.V,
“Fast and Accurate Text Classification via
Multiple Linear Discriminant Projection,” VLDB
J., Int’l J. Very Large Data Bases, pp. 170-185,
2003.
[3] Cunningham.P, Nowlan.N, Delany.S.J, and
Haahr.M, “A Case-Based Approach in Spam
Filtering that Can Track Concept Drift,” Proc.
ICCBR Workshop Long-Lived CBR Systems,
2003.
[4] Dino Isa, Lam Hong Lee, V.P. Kallimani, and R.
RajKumar, “Text Document Preprocessing with
the Bayes Formula for Classification Using the
Support Vector Machine” in IEEE Transactions
on Knowledge and Data Engineering, Vol. 20,
No. 9, September 2008
[5] Han.E.H, Karypis.G, and Kumar.V, “Text
Categorization Using Weight Adjusted k-Nearest
Neighbor Classification,” Dept. of Computer
Science and Eng., Army HPC Research Center,
Univ. of Minnesota, 1999.
[6] Hartley.M, Isa.D, Kallimani.V.P, and Lee.L.H,
“A Domain Knowledge Preserving in Process
Engineering Using Self-Organizing Concept,”
technical report, Intelligent System Group,
Faculty of Eng. and Computer Science, Univ. of
Nottingham, Malaysia Campus, 2006.
[7] J.G.Liang, X.F.Zhou, P.Liu, L.Guo, S.Bai “An
EMM-based Approach for Text Classification” in
Procedia Computer Science 17, 506 – 513, 2013
[8] Kerner. Y.H, Gross.Z, and Masa.A, Automatic
extraction and learning of key phrases from
scientific articles. In Computational Linguistics
and Intelligent Text Processing, pages 657–669,
2005.
[9] Lam Hong Lee, Dino Isa, Wou Onn Choo, Wen
Yeen Chue, “High Relevance Keyword
Extraction facility for Bayesian text classification
on different domains of varying characteristic“ in
Expert Systems with Applications 39, 1147–
1155, 2012
[10]Matsuo. Y and Ishizuka. M, Keyword extraction
from a single document using word co-occurrence
statistical information. International
Journal on Artificial Intelligence, 13(1):157–169,
2004.
[11]McCallum.A and Nigam.K, “A Comparison of
Event Models for Naïve Bayes Text
Classification,” J. Machine Learning Research 3,
pp. 1265-1287, 2003.
[12]Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang
Rim, and Sung Hyon Myaeng, “Some Effective
Techniques for Naive Bayes Text Classification”
in IEEE Transactions on Knowledge and Data
Engineering, Vol. 18, No. 11, November 2006.
[13]Wang.J, Yao.Y, and Liu. Z. J, “A new text
classification method based on HMM-SVM,” in
Proc. Int. Symp. Commun. Inf. Technol., Sydney,
N.S.W., Australia, Oct. 17–19, 2007, pp. 1516–
1519.
[14]Wei Zhang, Feng Gao, “An Improvement to
Naive Bayes for Text Classification” in Procedia
Engineering 15, 2160 – 2164, 2011
[15]Wen Zhang, Taketoshi Yoshida, Xijin Tang,
“Text classification based on multi-word with
support vector machine” in Knowledge-Based
Systems 21, 879–886, 2008.
[16]Yang.Y, “An evaluation of statistical approaches
to text categorization,” J. Inf. Retrieval, vol. 1,
no. 1/2, pp. 69–90, 1999.
0
Test data
set1
Test data
set1
Test data
set2
Test data
set2
NB NN DT SS