The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial Naïve Bayes classification with a term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset, and the results show improved precision, recall, and F-score compared to other models.
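The TF-IDF plus multinomial Naïve Bayes combination described above can be sketched from scratch as follows. The toy documents and the `space`/`sport` labels are invented for illustration (the paper uses 20-Newsgroups), and a production setup would normally use a library such as scikit-learn instead:

```python
import math
from collections import Counter, defaultdict

class TfidfMultinomialNB:
    """Minimal multinomial Naive Bayes over TF-IDF-weighted term counts."""

    def fit(self, docs, labels):
        n = len(docs)
        df = Counter()                              # document frequency per term
        for doc in docs:
            df.update(set(doc))
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        mass = {c: defaultdict(float) for c in self.classes}
        for doc, y in zip(docs, labels):
            for t, f in Counter(doc).items():       # accumulate TF-IDF mass per class
                mass[y][t] += f * self.idf[t]
        vocab_size = len(df)
        self.loglik = {}
        for c in self.classes:                      # Laplace-smoothed log-likelihoods
            total = sum(mass[c].values())
            self.loglik[c] = {t: math.log((mass[c][t] + 1.0) / (total + vocab_size))
                              for t in df}
        return self

    def predict(self, doc):
        def score(c):
            return self.prior[c] + sum(self.loglik[c][t] for t in doc
                                       if t in self.loglik[c])
        return max(self.classes, key=score)

docs = [["nasa", "launch", "orbit"], ["goal", "match", "team"],
        ["rocket", "orbit", "mars"], ["team", "score", "goal"]]
labels = ["space", "sport", "space", "sport"]
clf = TfidfMultinomialNB().fit(docs, labels)
print(clf.predict(["orbit", "rocket"]))     # -> space
```

The log-space scoring avoids the numeric underflow that a literal product of probabilities would cause on realistic vocabulary sizes.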
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces several approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, categories very similar in content and related to the telecommunications, Internet, and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
Review of Various Text Categorization Methods (IOSR-JCE)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
Text preprocessing is a vital stage in text classification (TC) in particular and text mining in general. Its aim is to reduce multiple forms of a word to one form. Preprocessing techniques have received considerable attention and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and matching relevant features against the features in a database, and it has a great impact on reducing the time and computing resources needed. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification, covering the raw text, tokenization, stop-word removal, and stemming. Two feature extraction methods, chi-square and TF-IDF with a cosine similarity score, are applied to the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
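The preprocessing stages the study evaluates can be sketched as a small pipeline. The stop-word list is a tiny illustrative subset, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter's."""
    for suffix in ("ation", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Raw text -> tokenization -> stop-word removal -> stemming."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The tokenization of the words is preprocessing"))
# -> ['tokeniz', 'word', 'preprocess']
```

Each stage can be toggled independently, which is exactly how an ablation study like the one described (raw text vs. tokenized vs. stop-word-filtered vs. stemmed) would be run.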
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION (IJAIA)
This paper explores the use of machine learning approaches, specifically four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for the categorization of Bangla web documents. This is the task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla text categorization. Hence, we analyze the efficiency of these four methods for the categorization of Bangla documents. For validation, a Bangla corpus collected from various websites has been developed and used in the experiments. For Bangla, the empirical results show that all four methods produce satisfactory performance, with SVM attaining good results on high-dimensional and relatively noisy document feature vectors.
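Of the four methods compared, k-nearest neighbour is the simplest to sketch without a library. The training pairs below are invented English stand-ins (the paper's Bangla corpus is not reproduced here), but the cosine-similarity vote is the same idea:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, doc, k=3):
    """train: list of (token_list, label) pairs; returns the majority label
    among the k nearest neighbours under cosine similarity."""
    vec = Counter(doc)
    nearest = sorted(train, key=lambda ex: cosine(Counter(ex[0]), vec),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(["vote", "election", "party"], "politics"),
         (["goal", "match", "team"], "sports"),
         (["election", "minister"], "politics"),
         (["team", "player", "goal"], "sports")]
print(knn_predict(train, ["election", "vote"]))   # -> politics
```

KNN keeps every training vector around at prediction time, which is part of why SVM tends to cope better with the high-dimensional, noisy feature vectors the abstract mentions.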
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task used for effective document retrieval. Automatic text classification is the process of assigning a new text document to one of the predefined categories based on its content. In this paper, we implemented and compared Naïve Bayes and Centroid-based algorithms for effective document categorization of English-language text. In the Centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. Experiments were performed on the R-52 dataset of the Reuters-21578 corpus, and the Micro Average F1 measure was used to evaluate classifier performance. Experimental results show that the Micro Average F1 value for NB is the greatest of all, followed by that of CGC, which in turn is greater than that of AAC. All these results are valuable for future research.
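A centroid-based classifier in the AAC style can be sketched as follows: average the term-frequency vectors per class, then assign new documents to the most cosine-similar centroid. The toy documents and labels are invented for illustration, and the CGC variant (geometric rather than arithmetic averaging) is not shown:

```python
import math
from collections import Counter, defaultdict

def aac_centroids(docs, labels):
    """Arithmetical Average Centroid (AAC): the mean term-frequency vector
    of the documents in each class."""
    sums, counts = defaultdict(Counter), Counter(labels)
    for doc, y in zip(docs, labels):
        sums[y].update(doc)
    return {c: {t: v / counts[c] for t, v in sums[c].items()} for c in sums}

def classify(centroids, doc):
    """Assign the class whose centroid is most cosine-similar to the document."""
    vec = Counter(doc)
    def cos(c):
        cen = centroids[c]
        dot = sum(v * cen.get(t, 0.0) for t, v in vec.items())
        nv = math.sqrt(sum(x * x for x in vec.values()))
        nc = math.sqrt(sum(x * x for x in cen.values()))
        return dot / (nv * nc) if nv and nc else 0.0
    return max(centroids, key=cos)

docs = [["vote", "election", "party"], ["goal", "match", "team"],
        ["election", "minister"], ["team", "player", "goal"]]
labels = ["politics", "sports", "politics", "sports"]
cents = aac_centroids(docs, labels)
print(classify(cents, ["election", "vote"]))    # -> politics
```

Unlike KNN, only one vector per class survives training, which is what makes centroid methods fast at prediction time.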
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two approaches in data mining which may also be used to perform text clustering and text classification; the former is unsupervised while the latter is supervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show that the proposed distance metric outperforms the existing measures.
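The paper's exact metric is not reproduced in the abstract; the sketch below shows one plausible reading of the idea, in which frequent-item mining prunes the dimensions before a set-based distance is computed. The support threshold and the Jaccard form are assumptions for illustration:

```python
from collections import Counter

def frequent_terms(docs, min_support=2):
    """Dimensionality reduction: keep only terms occurring in at least
    `min_support` documents (a stand-in for the paper's incremental
    frequent pattern mining step)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {t for t, c in df.items() if c >= min_support}

def distance(a, b, frequent):
    """Jaccard distance restricted to the frequent-term dimensions."""
    sa, sb = set(a) & frequent, set(b) & frequent
    if not sa and not sb:
        return 1.0
    return 1.0 - len(sa & sb) / len(sa | sb)

docs = [["price", "market", "stock"], ["price", "market", "trade"],
        ["price", "weather"]]
freq = frequent_terms(docs)                  # {"price", "market"}
print(distance(docs[0], docs[1], freq))      # -> 0.0
print(distance(docs[0], docs[2], freq))      # -> 0.5
```

Restricting the distance to frequent terms means rare noise words cannot inflate the distance between otherwise similar documents, which is the dimensionality-reduction benefit the abstract claims.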
A hybrid naïve Bayes based on similarity measure to optimize the mixed-data c... (TELKOMNIKA JOURNAL)
In this paper, a hybrid method is introduced to improve the classification performance of naïve Bayes (NB) on mixed datasets and multi-class problems. The proposed method relies on a similarity measure applied to the portions not correctly classified by NB. Since the data contain multi-valued short text with rare words that limit NB performance, we employed an adapted selective classifier based on similarities (CSBS) to overcome the NB limitations and include the rare words in the computation. This is achieved by transforming the formula from the product of the probabilities of the categorical variable to its sum, weighted by the numerical variable. The proposed algorithm was evaluated on card payment transaction data containing the transaction label (a multi-valued short text) and the transaction amount. Based on K-fold cross-validation, the evaluation results confirm that the proposed method achieves better precision, recall, and F-score than the NB and CSBS classifiers separately. Moreover, converting the product form to a sum gives rare words more chance to contribute to the text classification, which is another advantage of the proposed method.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing computer search engines struggle to understand the meaning of natural language, semantically enriched metadata may improve interest-based search engine capabilities and user satisfaction. This paper presents an enhanced version of an ecosystem focusing on semantic topic metadata detection and enrichment. It builds on a previous paper on a semantic metadata enrichment software ecosystem (SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper proposes an algorithm to enhance search engine capabilities and consequently help users find content according to their interests. It presents the design, implementation, and evaluation of the SATD (Scalable Annotation-based Topic Detection) model and algorithm, using metadata from the web, linked open data, concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine using keyword extraction, classification, and concept extraction that generates semantic topics through text and multimedia document analysis with the proposed SATD model and algorithm.
The performance of the proposed ecosystem is evaluated through a number of prototype simulations, comparing them to existing metadata enrichment techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext, AIDA, TextRazor). The SATD algorithm was found to support more attributes than the other algorithms. The results show that the enhanced platform and its algorithm enable a greater understanding of documents related to user interests.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation, and the impact of new technologies, and it continues to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Novelty detection via topic modeling in research articles (csandit)
In today’s world, redundancy is one of the most pressing problems in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles: a method for identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in the paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful and simple means to process and analyze them. This work compares the most widely used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favor the hierarchical Pachinko Allocation Model when used for document retrieval.
A Novel Text Classification Method Using Comprehensive Feature Weight (TELKOMNIKA JOURNAL)
Currently, since the categorical distribution of short text corpora is not balanced, it is difficult to obtain accurate results for short text classification. To solve this problem, this paper proposes a novel method of short text classification using comprehensive feature weights. The method takes into account the distribution of samples across the positive and negative categories, as well as the category correlation of words, so as to improve the existing feature weight calculation and obtain a new comprehensive feature weight. The experimental results show that the proposed method scores significantly higher than other feature-weighting methods on the micro- and macro-averaged values, which indicates that it can greatly improve the accuracy and recall of short text classification.
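The abstract does not give the weight formula; as one common shape for weights that use both positive- and negative-category document counts, a tf.rf-style weight can be sketched. The function name and the constants are assumptions for illustration, not the paper's definition:

```python
import math

def comprehensive_weight(tf, df_pos, df_neg):
    """A tf.rf-style weight: term frequency scaled by how much more often the
    term occurs in positive-category documents than in negative ones.
    Hypothetical stand-in for the paper's comprehensive feature weight."""
    return tf * math.log2(2 + df_pos / max(1, df_neg))

# a term seen twice in a document, in 8 positive docs and 2 negative docs
print(comprehensive_weight(2, 8, 2))
# a term with no positive-category evidence keeps its plain term frequency
print(comprehensive_weight(1, 0, 5))
```

Unlike plain IDF, which is blind to labels, a weight of this shape boosts terms concentrated in one category, which is the kind of category-aware signal the abstract describes.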
International Journal of Engineering Research and Development (IJERD)
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent... (Parang Saraf)
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Text classification supervised algorithms with term frequency inverse documen... (IJECEIAES)
Over the course of the previous two decades, there has been a rise in the quantity of text documents stored digitally. The ability to organize and categorize those documents automatically is known as text categorization, which classifies them into a set of predefined categories so that they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers, because of the significant impact this task has on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to model complex, nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms, evaluated with various techniques. Thereafter, the constraints of each technique and their applicability to real-life situations are assessed.
Comparison analysis of Bangla news articles classification using support vect... (TELKOMNIKA JOURNAL)
In the information age, the number of Bangla news articles on the internet is growing fast. For organization, every news site has a particular structure and categorization. News article classification is a method to determine a document’s category from a set of predefined categories. This research discusses the classification of Bangla news articles on online platforms and makes a constructive comparison using several classification algorithms. For Bangla news article classification, term frequency-inverse document frequency (TF-IDF) weighting and a count vectorizer were used for feature extraction, and two common classifiers, support vector machine (SVM) and logistic regression (LR), were employed to classify the documents. The experimental results show an accuracy of 84.0% for SVM and 81.0% for LR across twelve categories of news articles. In this comparison of the two classification algorithms on Bangla news articles, SVM outperformed LR.
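The two feature extraction processes named above differ only in how the document vectors are weighted, which can be shown side by side. The toy documents are invented (the Bangla corpus is not reproduced), and the SVM/LR classifiers that would consume these vectors are omitted; in practice both would come from a library such as scikit-learn:

```python
import math
from collections import Counter

def count_vectorize(docs):
    """Plain per-document term counts (a count vectorizer)."""
    vocab = sorted({t for d in docs for t in d})
    return vocab, [[Counter(d)[t] for t in vocab] for d in docs]

def tfidf_vectorize(docs):
    """TF-IDF weights over the same vocabulary (smoothed idf)."""
    vocab, counts = count_vectorize(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1.0 for d in df]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]

docs = [["dhaka", "news"], ["dhaka", "sports"]]
vocab, counts = count_vectorize(docs)
print(vocab, counts)   # ['dhaka', 'news', 'sports'] [[1, 1, 0], [1, 0, 1]]
_, weights = tfidf_vectorize(docs)
print(weights[0])      # 'news' outweighs the ubiquitous 'dhaka'
```

The count vectorizer treats a term appearing in every document the same as a distinctive one; TF-IDF down-weights the ubiquitous term, which is usually why it helps linear classifiers like SVM and LR.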
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ijaia
Regression models and their statistical analyses is one of the most important tool used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. It is known that the true regression is unknown and specific methods are created and used strictly pertaining to the roblem. For the pioneering work to develop procedures for fitting functions, we refer to the work on the methods of least
absolute deviations, least squares deviations and minimax absolute deviations. Today’s widely celebrated
procedure of the method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, the least squares based models in practice may fail to provide optimal results in nonGaussian situations especially when the errors follow distributions with the fat tails. In this paper an unorthodox method of estimating linear regression coefficients by minimising GMSE(geometric mean of squared errors) is explored. Though GMSE(geometric mean of squared errors) is used to compare models it is rarely used to obtain the coefficients. Such a method is tedious to handle due to the large number of roots obtained by minimisation of the loss function. This paper offers a way to tackle that problem.
Application is illustrated with the ‘Advertising’ dataset from ISLR and the obtained results are compared
with the results of the method of least squares for single index linear regression model.
A simplified classification computational model of opinion mining using deep ...IJECEIAES
Opinion and attempts to develop an automated system to determine people's viewpoints towards various units such as events, topics, products, services, organizations, individuals, and issues. Opinion analysis from the natural text can be regarded as a text and sequence classification problem which poses high feature space due to the involvement of dynamic information that needs to be addressed precisely. This paper introduces effective modelling of human opinion analysis from social media data subjected to complex and dynamic content. Firstly, a customized preprocessing operation based on natural language processing mechanisms as an effective data treatment process towards building quality-aware input data. On the other hand, a suitable deep learning technique, bidirectional long short term-memory (Bi-LSTM), is implemented for the opinion classification, followed by a data modelling process where truncating and padding is performed manually to achieve better data generalization in the training phase. The design and development of the model are carried on the MATLAB tool. The performance analysis has shown that the proposed system offers a significant advantage in terms of classification accuracy and less training time due to a reduction in the feature space by the data treatment operation.
Experimental Result Analysis of Text Categorization using Clustering and Clas...ijtsrd
In a world that routinely produces more textual data. It is very critical task to managing that textual data. There are many text analysis methods are available to managing and visualizing that data, but many techniques may give less accuracy because of the ambiguity of natural language. To provide the ne grained analysis, in this paper introduce e cient machine learning algorithms for categorize text data. To improve the accuracy, in proposed system I introduced Natural language toolkit NLTK python library to perform natural language processing. The main aim of proposed system is to generalize the model for real time text categorization applications by using e cient text classi cation as well as clustering machine learning algorithms and nd the efficient and accurate model for input dataset using performance measure concept. Patil Kiran Sanajy | Prof. Kurhade N. V. ""Experimental Result Analysis of Text Categorization using Clustering and Classification Algorithms"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25077.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25077/experimental-result-analysis-of-text-categorization-using-clustering-and-classification-algorithms/patil-kiran-sanajy
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance. It is combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization is also obtained.
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
Current lexica and machine learning based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training is time consuming and error-prone. Second, the
prediction’s accuracy entails sentences and their corresponding training text
should fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that the utilizing different NLP
pipelines and semantic relationships impacts the quality of the sentiment
analyzers. In particular, results indicate that coupling lemmatization and
knowledge-based n-gram features proved to produce higher accuracy results.
With this coupling, the accuracy of the SVM classifier has improved to
90.43%, while it was 86.83%, 90.11%, 86.20%, respectively using the three
other classifiers.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
Text Categorization Using Improved K Nearest Neighbor AlgorithmIJTET Journal
Abstract— Text categorization is the process of identifying and assigning predefined class to which a document belongs. A wide variety of algorithms are currently available to perform the text categorization. Among them, K-Nearest Neighbor text classifier is the most commonly used one. It is used to test the degree of similarity between documents and k training data, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, the text is categorized into different classes based on K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing the text. This improves the efficiency of K-Nearest Neighbor algorithm by generating the classification model. The text classification using K-Nearest Neighbor algorithm has a wide variety of text mining applications.
In this paper we present gender and authorship categorisationusing the Prediction by Partial Matching (PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinational Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order 11 PPMD working best, achieving 90 % and 96% accuracy for gender and authorship respectively.
In this paper we present gender and authorship categorisationusing the Prediction by Partial Matching
(PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression
scheme with different orders was used to perform the categorisation. We also applied different machine
learning algorithms such as Multinational Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an
implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the
algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning
algorithms, with order 11 PPMD working best, achieving 90 % and 96% accuracy for gender and
authorship respectively.
GENDER AND AUTHORSHIP CATEGORISATION OF ARABIC TEXT FROM TWITTER USING PPMijcsit
In this paper we present gender and authorship categorisation using the Prediction by Partial Matching(PPM) compression scheme for text from Twitter written in Arabic. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinational Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an
implementation of Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy in comparison to all the other machine learning algorithms, with order 11 PPMD working best, achieving 90 % and 96% accuracy for gender and
authorship respectively.
An improved Arabic text classification method using word embeddingIJECEIAES
Feature selection (FS) is a widely used method for removing redundant or irrelevant features to improve classification accuracy and decrease the model’s computational cost. In this paper, we present an improved method (referred to hereafter as RARF) for Arabic text classification (ATC) that employs the term frequency-inverse document frequency (TF-IDF) and Word2Vec embedding technique to identify words that have a particular semantic relationship. In addition, we have compared our method with four benchmark FS methods namely principal component analysis (PCA), linear discriminant analysis (LDA), chi-square, and mutual information (MI). Support vector machine (SVM), k-nearest neighbors (K-NN), and naive Bayes (NB) are three machine learning based algorithms used in this work. Two different Arabic datasets are utilized to perform a comparative analysis of these algorithms. This paper also evaluates the efficiency of our method for ATC on the basis of performance metrics viz accuracy, precision, recall, and F-measure. Results revealed that the highest accuracy achieved for the SVM classifier applied to the Khaleej-2004 Arabic dataset with 94.75%, while the same classifier recorded an accuracy of 94.01% for the Watan-2004 Arabic dataset.
Acetabularia Information For Class 9 .docxvaibhavrinwa19
Acetabularia acetabulum is a single-celled green alga that in its vegetative state is morphologically differentiated into a basal rhizoid and an axially elongated stalk, which bears whorls of branching hairs. The single diploid nucleus resides in the rhizoid.
2024.06.01 Introducing a competency framework for languag learning materials ...Sandy Millin
http://sandymillin.wordpress.com/iateflwebinar2024
Published classroom materials form the basis of syllabuses, drive teacher professional development, and have a potentially huge influence on learners, teachers and education systems. All teachers also create their own materials, whether a few sentences on a blackboard, a highly-structured fully-realised online course, or anything in between. Despite this, the knowledge and skills needed to create effective language learning materials are rarely part of teacher training, and are mostly learnt by trial and error.
Knowledge and skills frameworks, generally called competency frameworks, for ELT teachers, trainers and managers have existed for a few years now. However, until I created one for my MA dissertation, there wasn’t one drawing together what we need to know and do to be able to effectively produce language learning materials.
This webinar will introduce you to my framework, highlighting the key competencies I identified from my research. It will also show how anybody involved in language teaching (any language, not just English!), teacher training, managing schools or developing language learning materials can benefit from using the framework.
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Embracing GenAI - A Strategic ImperativePeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Honest Reviews of Tim Han LMA Course Program.pptxtimhan337
Personal development courses are widely available today, with each one promising life-changing outcomes. Tim Han’s Life Mastery Achievers (LMA) Course has drawn a lot of interest. In addition to offering my frank assessment of Success Insider’s LMA Course, this piece examines the course’s effects via a variety of Tim Han LMA course reviews and Success Insider comments.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
2. Related Work
Text document classification has become an emerging field in the text mining research area. Consequently, an abundance of approaches has been developed for this purpose, including support vector machines (SVM) [1], K-nearest neighbor (KNN) classification [2], Naive Bayes classification [3], decision trees (DT) [4], neural networks (NN) [5], and maximum entropy [6]. Among these approaches, the Multinomial Naive Bayes text classifier has been widely used due to its simplicity in both the training and classifying phases [3]. Many researchers have demonstrated its effectiveness in classifying unstructured text documents across various domains.
Dalal and Zaveri [7] presented a generic strategy for automatic text classification, comprising phases such as preprocessing, feature selection using semantic or statistical techniques, and selection of an appropriate machine learning technique (Naive Bayes, decision trees, hybrid techniques, support vector machines). They also discussed some of the key issues involved in text classification, such as handling a large number of features, working with unstructured text, dealing with missing metadata, and selecting a suitable machine learning technique for training a text classifier.
Bolaj and Govilkar [8] surveyed text categorization techniques for Indian regional languages and showed that Naive Bayes, K-nearest neighbor, and support vector machines are the most suitable techniques for achieving better document classification results in those languages. Jain and Saini [9] used a statistical approach to classify Punjabi text; their Naive Bayes classifier was successfully implemented and tested, achieving satisfactory classification results.
Tilve and Jain [10] applied three text classification algorithms (Naive Bayes, the vector space model (VSM), and a newly implemented use of the Stanford Tagger) on two different datasets (20 Newsgroups and a new news dataset of five categories). Compared with the other classification strategies, Naive Bayes is potentially well suited as a text classification model due to its simplicity. Gogoi and Sarma [11] highlighted the performance of the Naive Bayes technique in document classification.
In that work [11], a classification model was built and evaluated on a small dataset of four categories and 200 documents for training and testing. The results were validated using the statistical measures of precision, recall, and their combination, F-measure, and showed that Naive Bayes is a good classifier; the study can be extended by applying Naive Bayes classification to larger datasets. Rajeswari et al. [12] focused on text classification using Naive Bayes and K-nearest neighbor classifiers and compared the performance and accuracy of the two approaches. The results showed that the Naive Bayes classifier performs well, with an accuracy of 66.67%, as opposed to 38.89% for the KNN classifier.
The research most closely related to ours is that of Trstenjak et al. [13], who proposed a text categorization framework based on the KNN technique combined with the TF-IDF method. KNN and TF-IDF embedded together gave good results and confirmed the initial expectations. The framework was tested on several categories of text documents, and during testing the KNN algorithm produced accurate classifications. Nevertheless, the framework needs to be upgraded and improved to reach higher accuracy.
The Proposed Model
The proposed model is expected to achieve greater efficiency in text classification performance. The model presented in [13] depends on using KNN with TF-IDF; our work instead uses the multinomial naive bayes technique, chosen for its popularity and efficiency in text classification as discussed earlier, and especially for its superior performance over K-nearest neighbor, as illustrated in [12]. The proposed model therefore combines multinomial naive bayes as the selected machine learning technique for classification, TF-IDF as a vector space model for text extraction, and the chi-square (chi2) technique for feature selection, aiming at more accurate text classification with better performance compared to the testing results of the framework in [13].
The general steps for building the classification model, as presented in Figure 1, are: preprocessing of all labeled and unlabeled documents; training, in which the classifier is constructed from the prepared labeled training instances; testing, which evaluates the model on prepared samples whose class labels are known but were not used for training; and finally the usage phase, in which the prepared model classifies new prepared data whose class labels are unknown.
Figure 1: Model Architecture
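The four phases can be sketched end to end as a self-contained toy. This is an illustrative sketch with a hypothetical mini-corpus and a plain-Python multinomial naive bayes, not the paper's implementation:

```python
import math
from collections import Counter, defaultdict

def preprocess(doc):
    """Phase 1: reduce a raw document to a clear word format."""
    return [w.strip(".,!?;:").lower() for w in doc.split() if w.strip(".,!?;:")]

def train(labeled_docs, alpha=1.0):
    """Phase 2: build a multinomial naive bayes model from labeled documents."""
    term_counts = defaultdict(Counter)  # class -> term frequencies
    class_counts = Counter()            # class -> number of documents
    vocab = set()
    for doc, label in labeled_docs:
        tokens = preprocess(doc)
        term_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return term_counts, class_counts, vocab, alpha

def predict(model, doc):
    """Phase 4: classify a new document whose class label is unknown."""
    term_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c in class_counts:
        total = sum(term_counts[c].values())
        lp = math.log(class_counts[c] / n_docs)  # class prior
        for t in preprocess(doc):                # Laplace-smoothed likelihoods
            lp += math.log((term_counts[c][t] + alpha) / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def evaluate(model, labeled_docs):
    """Phase 3: accuracy on held-out samples with known labels."""
    return sum(predict(model, d) == y for d, y in labeled_docs) / len(labeled_docs)

train_set = [("the team won the game", "sport"),
             ("the election vote count", "politics"),
             ("a great goal in the match", "sport"),
             ("parliament passed the vote", "politics")]
model = train(train_set)
print(predict(model, "who won the match"))  # sport
```

The toy uses raw term counts; the proposed model replaces them with tf-idf weights and chi2-selected features, as detailed in the following sections.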
Pre-processing
This phase is applied to the input documents, whether the labeled data for the training and testing phases or the unlabeled data for the usage phase. It presents the text documents in a clear word format, preparing the output documents for the next phases of text classification. Commonly, the steps taken are:
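The step list itself does not appear in this excerpt; what follows is a minimal sketch assuming the usual preparation steps of tokenization, stop-word removal, and stemming (the stop-word list and suffix stemmer here are illustrative stand-ins, not the paper's):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [stem(t) for t in tokens]                      # stemming

print(preprocess("The classifiers are training on labeled documents."))
# ['classifier', 'train', 'label', 'document']
```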
American Journal of Computer Science and Information Technology
ISSN 2349-3917 Vol.6 No.1:16
2018
This article is available from: https://www.imedpub.com/computer-science-and-information-technology/
4. Constructing vector space model: Once the feature selection phase has been done and the best M features have been selected, the vector space model is constructed. It represents each document as a vector with M dimensions, M being the number of selected features produced in the previous phase. Each vector can be written as:
Vd = [W1d, W2d, ..., WMd]
where Wid is a weight measuring the importance of term i in document d. Various methods can be used for weighting the terms, as mentioned in the text presentation phase. Our model uses TF-IDF, which stands for Term Frequency-Inverse Document Frequency, calculated as [17]:
Wij = tfij × idfi = tfij × log(N / dfi)   (4)
where Wij is the weight of term i in document j, N is the number of documents in the training set, tfij is the term frequency of term i in document j, and dfi is the document frequency of term i in the documents of the training set. The following pseudo code presents the TF-IDF vector space model:
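The pseudo code is not reproduced in this excerpt; a runnable sketch of equation (4), Wij = tfij × log(N/dfi), building one M-dimensional vector per document (toy tokens, illustrative only):

```python
import math

def tfidf_vectors(docs, features):
    """docs: list of token lists; features: the M selected terms."""
    n = len(docs)
    # dfi: number of documents containing term i
    df = {t: sum(1 for d in docs if t in d) for t in features}
    vectors = []
    for d in docs:
        # Wij = tfij * log(N / dfi); zero weight for terms absent everywhere
        vectors.append([d.count(t) * math.log(n / df[t]) if df[t] else 0.0
                        for t in features])
    return vectors

docs = [["ball", "goal", "ball"], ["vote", "goal"], ["vote", "law"]]
features = ["ball", "goal", "vote", "law"]
for v in tfidf_vectors(docs, features):
    print([round(w, 3) for w in v])
```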
Training classifier: Training the classifier is the key component of the text classification process. The role of this phase is to build a classifier, or generate a model, by training it on predefined documents; the model is then used to classify unlabeled documents. The multinomial naive bayes classifier is selected to build the classifier in our model.
The following probability calculations are done during this phase:
A. For each term (feature) in the selected feature set, calculate the probability of that feature given each class. Assume the set of class labels is denoted by C = {c1, c2, ..., ck}, and N is the length of the selected feature set {t1, t2, ..., tN}. Using term frequencies to estimate the class-conditional probabilities in the multinomial model [18]:
P(ti | ck) = ( Σ tf(ti, d ∈ ck) + α ) / ( Σ N(d ∈ ck) + α · V )   (5)
This can be adapted to use tf-idf in our model, as in the following equation:

P(ti | ck) = ( Σ tfidf(ti, d ∈ ck) + α ) / ( Σ tfidf(d ∈ ck) + α · V )   (6)
Where:
• ti: a word from the feature vector t of a particular sample.
• Σ tfidf(ti, d ∈ ck): the sum of the raw tf-idf values of word ti over all documents in the training sample that belong to class ck.
• Σ tfidf(d ∈ ck): the sum of all tf-idf values in the training dataset for class ck.
• α: an additive smoothing parameter (α = 1 for Laplace smoothing).
• V: the size of the vocabulary (the number of different words in the training set).
The pseudo code for this calculation step is:
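The pseudo code is not reproduced in this excerpt; a runnable stand-in for the calculation of equation (6), with a hypothetical two-class tf-idf summary (not the paper's data):

```python
def class_conditional(tfidf_by_class, vocab_size, alpha=1.0):
    """Equation (6): P(ti|ck) = (sum of tf-idf of ti over class ck's docs + alpha)
    / (sum of all tf-idf for class ck + alpha * V)."""
    probs = {}
    for c, term_sums in tfidf_by_class.items():
        denom = sum(term_sums.values()) + alpha * vocab_size
        probs[c] = {t: (w + alpha) / denom for t, w in term_sums.items()}
    return probs

# Hypothetical summed tf-idf weights per class.
tfidf_by_class = {"sport": {"goal": 4.0, "ball": 2.0},
                  "politics": {"vote": 3.0, "law": 3.0}}
probs = class_conditional(tfidf_by_class, vocab_size=4, alpha=1.0)
print(round(probs["sport"]["goal"], 3))  # (4 + 1) / (6 + 4) = 0.5
```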
Figure 4: Usage Phase
Experiment Results
Dataset and setup considerations
The experiment is carried out using the 20-Newsgroups corpus collected by Ken Lang, which has become one of the standard corpora for text classification. It is a collection of 19,997 newsgroup documents taken from Usenet news and partitioned across 20 different categories. Of the dataset, 60% (11,998 samples) is used for training and the remaining 40% for testing the classifier. The experimental environment characteristics are shown in Table 1.
Table 1: Experimental environment
OS Windows 7
CPU Core(TM) I7-3630
RAM 8.00 GB
Language Python
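The 60/40 split can be sketched as follows; the deterministic shuffle and the stand-in document list are assumptions, since the paper does not state how the split was drawn:

```python
import random

def split_60_40(samples, seed=0):
    """Shuffle deterministically, then take 60% for training, 40% for testing."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * 0.6)
    return samples[:cut], samples[cut:]

corpus = [f"doc-{i}" for i in range(19997)]  # stand-in for the 19,997 documents
train_set, test_set = split_60_40(corpus)
print(len(train_set), len(test_set))  # 11998 7999
```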
Results
The experiment measures the following items:
Superiority of multinomial naive bayes (MNB) with TF-IDF over KNN: In this section, we investigate the effectiveness of our choice of combining the TF-IDF weighting with multinomial naive bayes (MNB), and compare it with combining the same weighting with KNN [13] for classifying unstructured text documents. Performance is evaluated in terms of accuracy, precision, recall, and F-score over all documents, as shown in Table 2 and Figure 5. Run times for both MNB and KNN with TF-IDF are presented in Figure 6. Accuracy, precision, recall, and F-score for each category under both techniques are given in Table 3.
Table 2: Accuracy, precision, recall and F-measure of both approaches

Results     MNB-TFIDF   KNN-TFIDF
Accuracy    0.87        0.71
Precision   0.88        0.72
Recall      0.87        0.71
F1-Score    0.87        0.71
Time        0.44 ms     18.99 ms
Figure 5: Comparison of performance results for using both KNN and multinomial naive Bayes (MNB) with TF-IDF
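The precision, recall, and F1 figures reported here follow the standard definitions; a small helper (illustrative, not the paper's code) makes the relationships explicit:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive,
    and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)

def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, the MNB-TFIDF row of Table 2 (precision 0.88, recall 0.87) gives an F1 of 2 · 0.88 · 0.87 / (0.88 + 0.87) ≈ 0.87, matching the table.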
Figure 8: Runtime comparison for MNB-TFIDF with/without chi-square
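The chi-square variant compared in Figure 8 refers to chi-square feature selection, which ranks each term by the dependence between its occurrence and the class label. One common 2×2 formulation can be sketched as follows (an assumption about the exact variant used; the paper does not spell it out here):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0
```

Terms with the highest statistic are kept as features; a term distributed independently of the class scores zero and is discarded first.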
The evaluation of the proposed model: The whole model is evaluated through the precision, recall, and F1-score performance measures for each category in the testing data set, as shown in Table 5.
Table 5: Performance measures for each category in the testing data set

Category   No. of docs   Precision   Recall   F1-score
Cat. 0 278 0.69 0.79 0.73
Cat. 1 287 0.92 0.91 0.91
Cat. 2 305 0.94 0.96 0.95
Cat. 3 320 0.86 0.95 0.90
Cat. 4 288 0.96 0.95 0.96
Cat. 5 315 0.94 0.90 0.92
Cat. 6 297 0.98 0.97 0.97
Cat. 7 322 0.96 0.97 0.96
Cat. 8 298 1 0.96 0.98
Cat. 9 278 1 0.97 0.99
Cat. 10 287 0.98 0.99 0.99
Cat. 11 308 0.98 0.96 0.97
Cat. 12 288 0.94 0.90 0.92
Cat. 13 318 0.93 0.96 0.95
Cat. 14 286 0.98 0.99 0.99
Cat. 15 282 0.97 1 0.98
Cat. 16 308 0.81 0.97 0.88
Cat. 17 312 0.92 0.92 0.92
Cat. 18 304 0.84 0.70 0.77
Cat. 19 319 0.72 0.59 0.65
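The per-category scores in Table 5 can be aggregated into overall figures by weighting each category by its document count, as in this small illustration (not the paper's code):

```python
def weighted_average(counts, scores):
    """Support-weighted average of per-category scores,
    e.g. the F1 column of Table 5 weighted by 'No. of docs'."""
    total = sum(counts)
    return sum(c * s for c, s in zip(counts, scores)) / total
```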
Capability of the model to work with other techniques: This part of the evaluation demonstrates the capability of the model to improve the quality of other classification techniques, as shown in Tables 6 and 7 and Figures 9 and 10.
Table 6: Accuracy, precision, recall and F-measure of KNN-TFIDF with and without the proposed model

Results     KNN-TFIDF   KNN-TFIDF using the model
Accuracy    0.71        0.89
Precision   0.72        0.89
Recall      0.71        0.89
F1-Score    0.71        0.89
Time        0:00:33.681047   0:00:10.93001
Figure 9: Comparison of performance results for using
another technique with and without the proposed model
Table 7: Performance measures for each category in the testing data set for both KNN-TFIDF with and without the proposed model

Cat      No. of docs   Precision        Recall           F1-score
                       with   without   with   without   with   without
Cat. 0   278           0.64   0.59      0.80   0.81      0.71   0.68
Cat. 1   287           0.89   0.53      0.89   0.57      0.89   0.55
Cat. 2   305           0.92   0.58      0.95   0.71      0.93   0.64
6. Nigam K, Lafferty J, McCallum A (1999) Using maximum entropy for text classification. In: Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67.
7. Dalal MK, Zaveri MA (2011) Automatic Text Classification: A
Technical Review. Int J Comp App 28: 0975-8887.
8. Bolaj P, Govilkar S (2016) A Survey on Text Categorization
Techniques for Indian Regional Languages. Int J Comp Sci Inform
Technol 7: 480-483.
9. Jain U, Saini K (2015) Punjabi Text Classification using Naive Bayes
Algorithm. Int J Curr Engineering Technol 5.
10. Tilve AKS, Jain SN (2017) Text Classification using Naive Bayes,
VSM and POS Tagger. Int J Ethics in Engineering & Management
Education 4: 1.
11. Gogoi M, Sharma SK (2015) Document Classification of Assamese
Text Using Naïve Bayes Approach. Int J Comp Trends Technol 30: 4.
12. Rajeswari RP, Juliet K, Aradhana (2017) Text Classification for
Student Data Set using Naive Bayes Classifier and KNN Classifier.
Int J Comp Trends Technol 43: 1.
13. Trstenjak B, Mikac S, Donko D (2014) KNN with TF-IDF based
Framework for Text Categorization. Procedia Engineering 69:
1356-1364.
14. Grobelnik M, Mladenic D (2004) Text-Mining Tutorial in the
Proceeding of Learning methods for Text Understanding and
Mining. pp: 26-29.
15. Vector space method (2017) Available at: en.wikipedia.org/wiki/Vector_space.
16. Thaoroijam K (2014) A Study on Document Classification using
Machine Learning Techniques. Int J Comp Sci Issues 11: 1.
17. Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature
selection approach based on term frequency for text
categorization. Pattern Recognition Letters 45: 1-10.
18. Raschka S (2014) Naive Bayes and Text Classification I-Introduction
and Theory.
19. Pop L (2006) An approach of the Naïve Bayes classifier for the
document classification. General Mathematics 14: 135-138.