The document reviews various text categorization methods and proposes a new supervised term weighting method based on normalized term frequency and relevance frequency (ntf.rf). It begins by discussing existing text categorization methods and their limitations: existing methods often require labeled training data and cleaned datasets, and work best on linearly separable data. The document then proposes the ntf.rf method to address these limitations by incorporating preprocessing and leveraging both normalized term frequency and relevance frequency to assign term weights. Finally, it outlines how ntf.rf could improve text categorization by providing a more effective term weighting approach.
Text preprocessing is a vital stage in text classification (TC) in particular and text mining in general. Text preprocessing tools reduce the multiple forms of a word to a single form. Preprocessing techniques have therefore received a lot of attention and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features by matching against the features in a database, and it has a great impact on reducing the time and computing resources required. The effect of preprocessing tools on English text classification remains an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification. The study covers the raw text, tokenization, stop-word removal, and stemming. Two feature extraction methods, chi-square and TF-IDF with a cosine similarity score, are applied to the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
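As a rough illustration of the pipeline this abstract evaluates (tokenization, stop-word removal, stemming, then TF-IDF weighting), here is a minimal self-contained sketch. The tiny stop-word list and crude suffix stemmer are toy stand-ins for illustration only, not the tools used in the study:

```python
import math
import re
from collections import Counter

# Toy stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "in", "to", "for", "on"}

def preprocess(text):
    """Tokenize, drop stop words, and apply a crude suffix stemmer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        # Very rough stemming: strip one common English suffix.
        for suffix in ("ing", "ed", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        length = max(len(doc), 1)
        weights.append({t: (c / length) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["The match was played in the stadium",
        "The market is trading stocks and shares"]
for w in tf_idf(docs):
    print(sorted(w, key=w.get, reverse=True)[:3])
```

Terms shared by every document get a zero IDF factor, so only discriminative terms receive positive weight, which is what makes the representation useful for the chi-square and cosine-similarity feature selection the paper compares.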
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task for effective document retrieval. Automatic text classification is the process of assigning a new text document to one of a set of predefined categories based on its content. In this paper, we implemented and compared the Naïve Bayes and Centroid-based algorithms for effective categorization of English-language documents. In the Centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. Experiments were performed on the R-52 dataset of the Reuters-21578 corpus, and the Micro-Average F1 measure was used to evaluate classifier performance. The experimental results show that the Micro-Average F1 value for NB is the greatest overall, followed by that of CGC, which in turn exceeds that of AAC. All these results are valuable for future research.
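The AAC variant mentioned above (each class represented by the component-wise mean of its document vectors, with new documents assigned to the most similar centroid) can be sketched as follows; the toy training vectors and class names are invented for illustration:

```python
import math
from collections import defaultdict

def arithmetic_average_centroid(vectors):
    """AAC: component-wise mean of the class's document vectors."""
    centroid = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            centroid[term] += w / len(vectors)
    return dict(centroid)

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, centroids):
    """Assign the document to the class with the most similar centroid."""
    return max(centroids, key=lambda c: cosine(doc_vec, centroids[c]))

# Invented toy training data: two classes, two documents each.
train = {
    "sport": [{"goal": 1.0, "team": 1.0}, {"match": 1.0, "team": 0.5}],
    "finance": [{"stock": 1.0, "market": 1.0}, {"trade": 1.0, "market": 0.5}],
}
centroids = {c: arithmetic_average_centroid(vs) for c, vs in train.items()}
print(classify({"team": 1.0, "goal": 0.5}, centroids))  # → sport
```

CGC differs only in how the centroid is aggregated (a geometric rather than arithmetic combination); the classification step by maximum cosine similarity is the same.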
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, including new teaching methods, assessment, validation, and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
A Novel Text Classification Method Using Comprehensive Feature Weight (TELKOMNIKA JOURNAL)
Because the categorical distribution of short text corpora is often unbalanced, it is difficult to obtain accurate classification results for short text. To solve this problem, this paper proposes a novel method of short text classification using comprehensive feature weights. The method takes into account the distribution of samples across the positive and negative categories, as well as the category correlation of words, so as to improve the existing feature weight calculation and obtain a new comprehensive feature weight. The experimental results show that the proposed method scores significantly higher than other feature weighting methods on the micro- and macro-average values, indicating that it can greatly improve the precision and recall of short text classification.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online sources such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method that exploits the semantic structure of text documents. The developed method performs the following steps: tagging the documents for parsing, replacing idioms with their literal meaning, calculating semantic weights for document words, and applying a semantic grammar. A similarity measure is then computed between the documents, and the documents are clustered using a hierarchical clustering algorithm. The method is evaluated on different datasets with standard performance measures, and its effectiveness in producing meaningful clusters is demonstrated.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experimental results show that: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, give the highest score; ii) in every proposed scenario, extracting keywords from the abstract always gives a better result; iii) using shorter-phrase patterns in keyword extraction gives better scores than using all phrase patterns; iv) sorting scenarios based on the product of candidate frequencies and the weights of the phrase patterns offer better results.
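The core chunking step described above (matching token sequences against part-of-speech patterns and ranking candidates by frequency) can be sketched minimally as follows; the tiny hand-written lexicon stands in for a real POS tagger, and the patterns are illustrative assumptions rather than those used in the paper:

```python
import re
from collections import Counter

# Toy POS lexicon standing in for a real tagger (an assumption of this sketch).
LEXICON = {"neural": "ADJ", "network": "NOUN", "deep": "ADJ",
           "learning": "NOUN", "model": "NOUN", "the": "DET",
           "trains": "VERB", "a": "DET", "language": "NOUN"}

# Illustrative phrase patterns: adjective-noun, noun-noun, bare noun.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN"), ("NOUN",)]

def chunk_keywords(text):
    """Extract phrases whose POS sequence matches a pattern, ranked by frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tags = [LEXICON.get(t, "OTHER") for t in tokens]
    candidates = Counter()
    for pattern in PATTERNS:
        k = len(pattern)
        for i in range(len(tokens) - k + 1):
            if tuple(tags[i:i + k]) == pattern:
                candidates[" ".join(tokens[i:i + k])] += 1
    return [p for p, _ in candidates.most_common()]

print(chunk_keywords("The neural network model trains a deep learning model"))
```

In this toy run "model" ranks first because it matches the bare-noun pattern twice; the paper's sorting scenarios additionally weight each phrase pattern before ranking.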
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has driven thorough research in the domain of automatic text summarization within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel, language-independent automatic summarization approach that combines three main components: Rhetorical Structure Theory (RST), a query processing approach, and a Network Representation Approach (NRA). RST, as a theory of the structure of natural text, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer that not only responds to the question but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach and, as a case study, applied it to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query, and the results were evaluated using the Recall, Precision, and F-score measures.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, categories very similar in content and related to the telecommunications, Internet, and computing areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION (ijaia)
This paper explores the use of machine learning approaches, more specifically four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for the categorization of Bangla web documents. This is the task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla text categorization. Hence, we analyze the efficiency of these four methods for categorizing Bangla documents. For validation, a Bangla corpus was built from various websites and used as the experimental data. The empirical results indicate that all four methods produce satisfactory performance for Bangla, with SVM attaining good results on the high-dimensional and relatively noisy document feature vectors.
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE... (kevig)
Automatic multiple-choice question generation (MCQG) is a useful yet challenging task in Natural Language Processing (NLP): the automatic generation of correct and relevant questions from textual data. Despite its usefulness, manually creating sizeable, meaningful, and relevant questions is a time-consuming and challenging task for teachers. In this paper, we present an NLP-based system for automatic MCQG for Computer-Based Testing Examination (CBTE). We used NLP techniques to extract keywords, the important words in a given lesson material. To validate that the system does not behave perversely, five lesson materials were used to check its effectiveness and efficiency. The keywords extracted manually by the teacher were compared to the auto-generated keywords, and the results show that the system was capable of extracting keywords from lesson materials for setting examinable questions. The outcome is presented in a user-friendly interface for easy accessibility.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two approaches in data mining that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show that the proposed distance metric outperforms the existing measures.
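The idea of computing document distance over a frequency-reduced vocabulary can be sketched as follows. The paper's actual metric and its incremental mining algorithm are not specified in the abstract, so a simple document-frequency filter (a stand-in for frequent pattern mining) combined with Jaccard distance is assumed here for illustration:

```python
from collections import Counter

def frequent_terms(docs, min_support=2):
    """Keep only terms appearing in at least `min_support` documents.
    A stand-in for the paper's incremental frequent pattern mining:
    both serve to shrink the vocabulary before distances are computed."""
    df = Counter(t for d in docs for t in set(d.split()))
    return {t for t, c in df.items() if c >= min_support}

def jaccard_distance(a, b, vocab):
    """Distance over the reduced vocabulary: 1 - |A∩B| / |A∪B|."""
    sa = set(a.split()) & vocab
    sb = set(b.split()) & vocab
    union = sa | sb
    return 1.0 if not union else 1.0 - len(sa & sb) / len(union)

docs = ["stock market rises", "stock market falls", "team wins match"]
vocab = frequent_terms(docs)
print(jaccard_distance(docs[0], docs[1], vocab))  # → 0.0
print(jaccard_distance(docs[0], docs[2], vocab))  # → 1.0
```

Filtering to frequent terms first is what gives the dimensionality reduction the abstract mentions: rare terms never enter the distance computation, so the metric operates on a much smaller, denser vocabulary.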
Text mining is a technique that helps users find useful information in a large collection of text documents on the web or in a database. Most popular text mining and classification methods have adopted term-based approaches, alongside pattern-based methods for describing user preferences. This review paper analyses how text mining works at three levels: the sentence level, the document level, and the feature level. We review the related work done previously and discuss the problems that arise when text mining is performed at the feature level. The paper also presents a text mining technique for compound sentences.
Context based Document Indexing and Retrieval using Big Data Analytics - A Re... (rahulmonikasharma)
In the past few years, internet usage has grown worldwide, and the amount of data generated and used has increased rapidly; this data takes many forms and may or may not be structured. As internet usage by individuals and organizations has grown, an increasing quantity and diversity of digital data, in the form of documents, has become available to end users. The storage, maintenance, and organization of such huge volumes of data in databases is a challenging task, so there is a great need for an efficient and effective retrieval technique that improves the accuracy of document retrieval. This paper discusses document retrieval using a context-based indexing approach. Lexical association between terms is used to separate content-carrying terms from other terms; content-carrying terms are retained because they indicate the theme of the document. An indexing weight is then calculated for each content-carrying term using a lexical association measure. A term with a higher indexing weight is considered important, and a sentence containing such terms is also considered important. When a user enters a search query, the important query terms are matched against the terms with higher weights in order to retrieve documents. Both explicit semantic relations and frequent co-occurrence of terms are considered in this context-based indexing.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif... (TELKOMNIKA JOURNAL)
In this paper, we propose work on rhetorical corpus construction and a sentence classification model experiment that could specifically be incorporated into the automatic paper title generation task for scientific articles. Rhetorical classification is treated as sequence labeling. A rhetorical sentence classification model is useful in tasks that consider a document's discourse structure. We performed experiments using datasets from two domains: computer science (CS dataset) and chemistry (GaN dataset). We evaluated the models using 10-fold cross-validation (0.70-0.79 weighted average F-measure) as well as on-the-run (0.30-0.36 error rate at best). We argue that our models performed best when the imbalanced data was handled with the SMOTE filter.
IOSR Journal of Mathematics (IOSR-JM) is an open access international journal that provides rapid publication (within a month) of articles in all areas of mathematics and its applications. The journal welcomes publication of high quality papers on theoretical developments and practical applications in mathematics. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
A Novel Text Classification Method Using Comprehensive Feature WeightTELKOMNIKA JOURNAL
Currently, since the categorical distribution of short text corpus is not balanced, it is difficult to
obtain accurate classification results for long text classification. To solve this problem, this paper proposes
a novel method of short text classification using comprehensive feature weights. This method takes into
account the situation of the samples in the positive and negative categories, as well as the category
correlation of words, so as to improve the existing feature weight calculation method and obtain a new
method of calculating the comprehensive feature weight. The experimental result shows that the proposed
method is significantly higher than other feature-weight methods in the micro and macro average value,
which shows that this method can greatly improve the accuracy and recall rate of short text classification.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique to organize the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of the text documents. A method has been developed that performs the following: tag the documents for parsing, replacement of idioms with their original meaning, semantic weights calculation for document words and apply semantic grammar. The similarity measure is obtained between the documents and then the documents are clustered using Hierarchical clustering algorithm. The method adopted in this work is evaluated on different data sets with standard performance measures and the effectiveness of the method to develop in meaningful clusters has been proved.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Single document keywords extraction in Bahasa Indonesia using phrase chunkingTELKOMNIKA JOURNAL
Keywords help readers to understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part of speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and some scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experiment results show: i) shorter-phrase keywords with string reduction extracted from the abstract and sorted by frequency provides the highest score, ii) in every proposed scenario, extracting keywords using the abstract always presents a better result, iii) using shorter-phrase patterns in keywords extraction gives better score in comparison to using all phrase patterns, iv) sorting scenarios based on the multiplication of candidate frequencies and the weight of the phrase patterns offer better results.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Query Answering Approach Based on Document SummarizationIJMER
The growing of online information obliged the availability of a thorough research in the
domain of automatic text summarization within the Natural Language Processing (NLP)
community.The aim of this paper is to propose a novel approach for a language independent automatic
summarization approach that combines three main approaches. The Rhetorical Structure Theory
(RST), the query processing approach, and the Network Representationapproach (NRA). RST, as a
theory of major aspect for the structure of natural text, is used to extract the semantic relation behind
the text.Query processing approachclassifies the question type and finds the answer in a way that suits
the user’s needs. The NRA is used to create a graph representing the extracted semantic relation. The
output is an answer, which not only responses to the question, but also gives the user an opportunity to
find additional information that is related to the question.We implemented the proposed approach. As a
case study, the implemented approachis applied on Arabic text in the agriculture field. The
implemented approach succeeded in summarizing extension documents according to user's query. The
approach results have been evaluated using Recall, Precision and F-score measures.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
This article will introduce some approaches for improving text categorization models by integrating
previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very
similar in content and related to telecommunications, Internet and computer areas were selected for models
experiments. Several domain ontologies, covering these areas were built and integrated to categorization
models for their improvements.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
AN AUTOMATED MULTIPLE-CHOICE QUESTION GENERATION USING NATURAL LANGUAGE PROCE...kevig
Automatic multiple-choice question generation (MCQG) is a useful yet challenging task in Natural Language
Processing (NLP). It is the task of automatic generation of correct and relevant questions from textual data.
Despite its usefulness, manually creating sizeable, meaningful and relevant questions is a time-consuming
and challenging task for teachers. In this paper, we present an NLP-based system for automatic MCQG for
Computer-Based Testing Examination (CBTE).We used NLP technique to extract keywords that are
important words in a given lesson material. To validate that the system is not perverse, five lesson materials
were used to check the effectiveness and efficiency of the system. The manually extracted keywords by the
teacher were compared to the auto-generated keywords and the result shows that the system was capable of
extracting keywords from lesson materials in setting examinable questions. This outcome is presented in a
user-friendly interface for easy accessibility.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Text mining is an emerging research field evolving from information retrieval area. Clustering and
classification are the two approaches in data mining which may also be used to perform text classification
and text clustering. The former is supervised while the later is un-supervised. In this paper, our objective is
to perform text clustering by defining an improved distance metric to compute the similarity between two
text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality.
The improved distance metric may also be used to perform text classification. The distance metric is
validated for the worst, average and best case situations [15]. The results show the proposed distance
metric outperforms the existing measures.
Text mining is a technique that helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches, while pattern-based methods instead describe user preferences. This review paper analyses how text mining works at three levels, i.e., the sentence level, the document level and the feature level. We review related work that has previously been done, demonstrate the problems that arise when text mining is performed at the feature level, and present a technique for text mining over compound sentences.
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...rahulmonikasharma
In the past few years, internet usage has grown worldwide, and hence the amount of data generated and used has increased rapidly; this data takes different forms and may or may not be structured. As internet usage by individuals and organizations has grown, an increasing quantity and diversity of digital data, in the form of documents, has become available to end users. The storage, maintenance and organization of such huge amounts of data in databases is a challenging task, so there is a great need for efficient and effective retrieval techniques that improve the accuracy of document retrieval. In this paper we discuss document retrieval using a context-based indexing approach. Lexical association between terms is used to separate content-carrying terms from other terms; content-carrying terms are used because they give an idea of the theme of the document. An indexing weight is calculated for each content-carrying term using a lexical association measure. A term with a higher indexing weight is considered important, and a sentence containing such terms is also important. When a user enters a search query, the important terms are matched against the terms with higher weights in order to retrieve documents. Explicit semantic relations or frequent co-occurrence of terms are considered in this context-based indexing.
Rhetorical Sentence Classification for Automatic Title Generation in Scientif...TELKOMNIKA JOURNAL
In this paper, we proposed a work on rhetorical corpus construction and sentence classification
model experiment that specifically could be incorporated in automatic paper title generation task for
scientific article. Rhetorical classification is treated as sequence labeling. Rhetorical sentence classification
model is useful in task which considers document’s discourse structure. We performed experiments using
two domains of datasets: computer science (CS dataset), and chemistry (GaN dataset). We evaluated the
models using 10-fold-cross validation (0.70-0.79 weighted average F-measure) as well as on-the-run
(0.30-0.36 error rate at best). We argued that our models performed best when handled using SMOTE
filter for imbalanced data.
Design Issues for Search Engines and Web Crawlers: A ReviewIOSR Journals
Abstract: The World Wide Web is a huge source of hyperlinked information contained in hypertext documents.
Search engines use web crawlers to collect these web documents from web for storage and indexing. The prompt
growth of the World Wide Web has posed incomparable challenges for the designers of search engines and web
crawlers; that help users to retrieve web pages in a reasonable amount of time. In this paper, a review on need
and working of a search engine, and role of a web crawler is being presented.
Key words: Internet, www, search engine, types, design issues, web crawlers.
A simplified classification computational model of opinion mining using deep ...IJECEIAES
Opinion mining attempts to develop an automated system to determine people's viewpoints towards various entities such as events, topics, products, services, organizations, individuals, and issues. Opinion analysis from natural text can be regarded as a text and sequence classification problem with a high-dimensional feature space, due to the involvement of dynamic information, which needs to be addressed precisely. This paper introduces effective modelling of human opinion analysis from social media data subject to complex and dynamic content. Firstly, a customized preprocessing operation based on natural language processing mechanisms is applied as an effective data treatment process towards building quality-aware input data. Secondly, a suitable deep learning technique, bidirectional long short-term memory (Bi-LSTM), is implemented for the opinion classification, followed by a data modelling process in which truncating and padding are performed manually to achieve better data generalization in the training phase. The design and development of the model are carried out in MATLAB. The performance analysis shows that the proposed system offers a significant advantage in terms of classification accuracy and training time, due to the reduction in feature space achieved by the data treatment operation.
Feature selection, optimization and clustering strategies of text documentsIJECEIAES
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. The research of clustering task is to be performed prior to its adaptation in the text environment. Conventional approaches typically emphasized on the quantitative information where the selected features are numbers. Efforts also have been put forward for achieving efficient clustering in the context of categorical information where the selected features can assume nominal values. This manuscript presents an in-depth analysis of challenges of clustering in the text environment. Further, this paper also details prominent models proposed for clustering along with the pros and cons of each model. In addition, it also focuses on various latest developments in the clustering task in the social network and associated environments.
Text classification supervised algorithms with term frequency inverse documen...IJECEIAES
Over the course of the previous two decades, there has been a rise in the quantity of text documents stored digitally. The ability to organize and categorize those documents in an automated mechanism, is known as text categorization which is used to classify them into a set of predefined categories so they may be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification presents a challenge for researchers. This is due to the significant impact this concept has on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to comprehend complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms employing various evaluation techniques. Thereafter, an evaluation is conducted on the constraints of every technique and how they can be applied to real-life situations.
Semi Automated Text Categorization Using Demonstration Based Term SetIJCSEA Journal
Manual Analysis of huge amount of textual data requires a tremendous amount of processing time and effort in reading the text and organizing them in required format. In the current scenario, the major problem is with text categorization because of the high dimensionality of feature space. Now-a-days there are many methods available to deal with text feature selection. This paper aims at such semi automated text categorization feature selection methodology to deal with a massive data using one of the phases of David Merrill’s First principles of instruction (FPI). It uses a pre-defined category group by providing them with the proper training set based on the demonstration phase of FPI. The methodology involves the text tokenization, text categorization and text analysis.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for that category enhances the performance of the classifier, providing a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its categorization are also obtained.
Text document clustering and similarity detection are a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on similarity, documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but they are either term-based or pattern-based, and they suffer from several problems. To address this challenging environment, the proposed system presents an innovative model for document similarity that applies a back-propagation-through-time (BPTT) algorithm. It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also detects the most appropriate patterns based on their weights, and BPTT performs the document similarity measures. Using this approach, documents can be categorized easily, and training-process problems are reduced. The BPTT framework has been implemented and evaluated on the .NET platform with different sets of datasets.
Survey of Machine Learning Techniques in Textual Document ClassificationIOSR Journals
Classification of Text Document points towards associating one or more predefined categories based
on the likelihood expressed by the training set of labeled documents. Many machine learning algorithms plays
an important role in training the system with predefined categories. The importance of Machine learning
approach has felt because of which the study has been taken up for text document classification based on the
statistical event models available. The aim of this paper is to present the important techniques and
methodologies that are employed for text documents classification, at the same time making awareness of some
of the interesting challenges that remain to be solved, focused mainly on text representation and machine
learning techniques.
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
Current lexica and machine learning based sentiment analysis approaches
still suffer from a two-fold limitation. First, manual lexicon construction and
machine training is time consuming and error-prone. Second, the
prediction’s accuracy entails sentences and their corresponding training text
should fall under the same domain. In this article, we experimentally
evaluate four sentiment classifiers, namely support vector machines (SVMs),
Naive Bayes (NB), logistic regression (LR) and random forest (RF). We
quantify the quality of each of these models using three real-world datasets
that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic
movie reviews. Specifically, we study the impact of a variety of natural
language processing (NLP) pipelines on the quality of the predicted
sentiment orientations. Additionally, we measure the impact of incorporating
lexical semantic knowledge captured by WordNet on expanding original
words in sentences. Findings demonstrate that the utilizing different NLP
pipelines and semantic relationships impacts the quality of the sentiment
analyzers. In particular, results indicate that coupling lemmatization and
knowledge-based n-gram features proved to produce higher accuracy results.
With this coupling, the accuracy of the SVM classifier has improved to
90.43%, while it was 86.83%, 90.11%, 86.20%, respectively using the three
other classifiers.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections, such as information extraction, term extraction, text categorization, and the storage of intermediate representations. Techniques such as clustering, distribution analysis, association rules and visualisation are then used to analyse these intermediate representations and present the results.
Supreme court dialogue classification using machine learning models IJECEIAES
Legal classification models help lawyers identify the relevant documents required for a study. In this study, the focus is on sentence level classification. To be more precise, the work undertaken focuses on a conversation in the supreme court between the justice and other correspondents. In the study, both the naïve Bayes classifier and logistic regression are used to classify conversations at the sentence level. The performance is measured with the help of the area under the curve score. The study found that the model that was trained on a specific case yielded better results than a model that was trained on a larger number of conversations. Case specificity is found to be more crucial in gaining better results from the classifier.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categories to determine the tested document's best category.
Context Driven Technique for Document ClassificationIDES Editor
In this paper we present an innovative hybrid Text
Classification (TC) system that bridges the gap between
statistical and context based techniques. Our algorithm
harnesses contextual information at two stages. First it extracts
a cohesive set of keywords for each category by using lexical
references, implicit context as derived from LSA, and word-vicinity-driven
semantics. Secondly, each document is
represented by a set of context rich features whose values are
derived by considering both lexical cohesion as well as the extent
of coverage of salient concepts via lexical chaining. After
keywords are extracted, a subset of the input documents is
apportioned as training set. Its members are assigned categories
based on their keyword representation. These labeled
documents are used to train binary SVM classifiers, one for
each category. The remaining documents are supplied to the
trained classifiers in the form of their context-enhanced feature
vectors. Each document is finally ascribed its appropriate
category by an SVM classifier.
NLP Techniques for Text Classification.docxKevinSims18
Natural Language Processing (NLP) is an area of computer science and artificial intelligence that aims to enable machines to understand and interpret human language. Text classification is one of the most common tasks in NLP, and it involves categorizing text into predefined categories or classes. In this blog post, we will explore some of the most effective NLP techniques for text classification.
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. II (May – Jun. 2015), PP 13-19
www.iosrjournals.org
DOI: 10.9790/0661-17321319 www.iosrjournals.org 13 | Page
Review of Various Text Categorization Methods
Chandrashekhar P. Bhamare, Dinesh D. Patil
(Department of Computer Science and Engineering, SSGB College of Engineering and Technology, Bhusawal [M.S.], India)
Abstract: Measuring the similarity between documents is an important operation in the text processing field.
Text categorization (also known as text classification, or topic spotting) is the task of automatically sorting a set
of documents into categories from a predefined set [1]; equivalently, text categorization (TC) is the task of
automatically classifying unlabeled natural language documents into a predefined set of semantic categories [2].
Term weighting methods assign appropriate weights to terms to improve the performance of text categorization
[1]. The traditional term weighting methods borrowed from information retrieval (IR), such as binary, term
frequency (tf), tf.idf, and its various variants, belong to the unsupervised term weighting methods, since their
calculation makes no use of information on the category membership of training documents. The supervised
term weighting methods, in contrast, exploit this known information in several ways. Two fundamental
questions therefore arise: "Does the difference between supervised and unsupervised term weighting methods
have any relationship with different learning algorithms?", and, if we use normalized term frequency instead of
raw term frequency together with relevant frequency, yielding the new method ntf.rf, is this new method
effective for text categorization? We answer these questions by implementing the new term weighting method
(ntf.rf) and comparing it with existing supervised and unsupervised methods. The proposed TC
method will be evaluated in a number of experiments on two benchmark text collections, 20NewsGroups and Reuters.
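As an illustrative sketch (not the paper's implementation), the ntf.rf weight can be read as the product of a normalized term frequency and the relevant frequency factor of Lan et al. [1]; the function name and the toy numbers below are assumptions for illustration only:

```python
import math

def ntf_rf(tf, max_tf, pos_df, neg_df):
    """ntf.rf weight: normalized term frequency times relevant frequency.

    tf      -- raw frequency of the term in the document
    max_tf  -- highest raw term frequency in the same document
    pos_df  -- positive-category training documents containing the term
    neg_df  -- negative-category training documents containing the term
    """
    ntf = tf / max_tf                             # normalized term frequency
    rf = math.log2(2 + pos_df / max(1, neg_df))   # relevant frequency (Lan et al.)
    return ntf * rf

# A term occurring 3 times in a document whose most frequent term occurs 5 times,
# and appearing in 8 positive but only 2 negative training documents:
w = ntf_rf(tf=3, max_tf=5, pos_df=8, neg_df=2)
print(round(w, 4))  # 1.551
```

Because rf grows with the ratio of positive to negative document frequency, a term concentrated in the positive category receives a higher weight than an equally frequent but evenly spread term.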
Keywords - Text Categorization, Information Retrieval, Term weighting.
I. Introduction
Text Categorization (TC) is the task of automatically classifying unlabelled natural language
documents into a predefined set of semantic categories. As the first and a vital step, text representation converts
the content of a textual document into a compact format so that the document can be recognized and classified
by a computer or a classifier. This problem has received special and increasing attention from researchers in the
past few decades for many reasons, such as the gigantic amount of digital and online documents that are easily
accessible and the increased demand to organize and retrieve these documents efficiently. The fast global
expansion of the Internet has also increased the need for more text categorization systems. Efficient text
categorization systems are beneficial for many applications, for example, information retrieval, classification of
news stories, text filtering, categorization of incoming e-mail messages and memos, and classification of Web
pages.
1.1 Existing System
A large number of machine learning, knowledge engineering, and probabilistic-based methods have
been proposed for TC. The most popular methods include Bayesian probabilistic methods, regression models,
example-based classification, decision trees, decision rules, Rocchio method, neural networks, support vector
machines (SVM), and association rules mining. In the vector space model (VSM), the content of a document is
represented as a vector in the term space. Terms can be at various levels, such as syllables, words, phrases, or
any other complicated semantic and/or syntactic indexing units used to identify the contents of a text. Different
terms have different importance in a text; thus an importance indicator w_i (usually between 0 and 1) represents
how much the term t_i contributes to the semantics of document d. Term weighting is thus an
important step to improve the effectiveness of TC by assigning appropriate weights to terms. Although TC has
been intensively studied for several decades, the term weighting methods for TC are usually borrowed from the
traditional information retrieval (IR) field, for example, the simplest binary representation, the most famous
tf:idf, and its various variants. Recently, the study of term weighting methods for TC has gained increasing
attention. In contrast to IR, TC is a supervised learning task as it makes use of prior information on the
membership of training documents in predefined categories. This known information is effective and has been
widely used for the feature selection [1] and the construction of text classifier to improve the performance of the
system. In this study, we group the term weighting methods into two categories according to whether the
method involves this prior information, i.e., supervised term weighting method (if it uses this known
membership information) and unsupervised term weighting method (if it does not use this information).
Generally, the supervised term weighting methods adopt this known information in several ways. One approach
is to weight terms using feature selection metrics, which are naturally thought to be of great help in assigning
appropriate weights to terms in TC. Another approach is based on statistical confidence intervals [4], which rely
on prior knowledge of the statistical information in the labeled training data. A third approach combines the
term weighting method with a text classifier [5]; similar to the idea of using feature selection scores, the scores
used by the text classifier aim to distinguish the documents. Since these supervised term weighting methods take
the document distribution into consideration, they are naturally expected to be superior to the unsupervised
(traditional) term weighting methods. However, not much work has been done on a comprehensive comparison
with unsupervised term weighting methods. Although there are partial comparisons in [3] and [4], these
supervised term weighting methods have shown mixed results.
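For contrast with the supervised schemes above, the unsupervised tf.idf baseline can be sketched as follows; the toy corpus is illustrative, and idf is taken as log(N/df) without smoothing:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf.idf weights for a list of tokenized documents.

    Uses raw term frequency and idf = log(N / df); no category labels
    are consulted, which is what makes the scheme unsupervised.
    """
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["sports", "match", "goal"],
        ["election", "vote", "goal"],
        ["election", "match"]]
w = tf_idf(docs)
# "vote" is rarer than "goal" across the corpus, so it gets a higher weight
print(w[1]["vote"] > w[1]["goal"])  # True
```

Note that a term appearing in every document gets idf = log(N/N) = 0 and therefore zero weight, regardless of how discriminative it might be for a particular category.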
Various work has been done on text categorization to date. Since text categorization can be done in both
supervised and unsupervised ways, there are various methods such as Rocchio, decision trees, Naïve Bayes, and
SVM; according to the reported results of all these methods, SVM using a bag-of-words representation has
proven to be one of the best methods for text categorization.
The decision tree method uses an information gain factor for text categorization, classifying documents on
the basis of combinations of word occurrences using tree-based algorithms such as ID3 and C4.5. However,
because this approach relies on combinations of words, and most English words are used in different ways
depending on the context of the statement, it may give proper results word-wise but fail in various other
situations, so it cannot be considered fully effective. These methods generate classifiers by inductive rule
learning [4]. Decision tree induction classifiers are also biased towards frequent classes [7].
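The information-gain criterion that ID3 and C4.5 use to pick a splitting word can be sketched as follows; the tiny labeled collection is illustrative:

```python
import math

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(docs, labels, word):
    """Gain of splitting the collection on presence/absence of one word."""
    with_w = [l for d, l in zip(docs, labels) if word in d]
    without = [l for d, l in zip(docs, labels) if word not in d]
    n = len(labels)
    remainder = (len(with_w) / n) * entropy(with_w) + \
                (len(without) / n) * entropy(without)
    return entropy(labels) - remainder

docs = [{"ball", "team"}, {"ball", "score"}, {"vote", "party"}, {"party", "team"}]
labels = ["sport", "sport", "politics", "politics"]
# "ball" separates the two classes perfectly, "team" does not
print(information_gain(docs, labels, "ball"))  # 1.0
print(information_gain(docs, labels, "team"))  # 0.0
```

The tree builder would choose "ball" as the split here; the ambiguity the paragraph above describes shows up as words like "team" whose presence carries no class information on its own.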
Naïve Bayes is another text categorization method with which the user can get very reasonable
performance in an attractive framework [5]. In this method a document is represented as a binary feature vector
indicating whether each term is present or not; this variant is known as multivariate Bernoulli naïve Bayes. The
method has two problems: first, its rough parameter estimation, where calculations are done by taking all
positive documents into consideration; and second, its handling of categories with rare terms or insufficient
data, where it cannot work well [6].
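A minimal multivariate Bernoulli naïve Bayes of the kind described above, with Laplace (add-one) smoothing to soften the rough parameter estimates; the vocabulary and documents are illustrative:

```python
import math

def train_bernoulli_nb(docs, labels, vocab):
    """Estimate P(term present | class) with Laplace smoothing."""
    classes = set(labels)
    prior, cond = {}, {}
    for c in classes:
        in_class = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = len(in_class) / len(docs)
        cond[c] = {t: (sum(t in d for d in in_class) + 1) / (len(in_class) + 2)
                   for t in vocab}
    return prior, cond

def classify(doc, prior, cond, vocab):
    """Score uses presence AND absence of every vocabulary term."""
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            lp += math.log(p) if t in doc else math.log(1 - p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

vocab = {"goal", "vote", "team", "party"}
docs = [{"goal", "team"}, {"goal"}, {"vote", "party"}, {"party"}]
labels = ["sport", "sport", "politics", "politics"]
prior, cond = train_bernoulli_nb(docs, labels, vocab)
print(classify({"team", "goal"}, prior, cond, vocab))  # sport
```

The second weakness mentioned above is visible here: with only two training documents per class, a single rare term can swing the smoothed estimates substantially.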
Olex is a novel method for text categorization, specifically for the automatic induction of rule-based text
classifiers. It requires documents to be classified using positive and negative literals. The induced rule allows
prediction about whether terms belong in a document, and rule induction is posed as an optimization problem.
Non-informative words are removed from the documents to increase time efficiency, using chi-square (χ2) and
information gain as the key methods for calculating term goodness.
On the one hand it has proved to be both effective and efficient; on the other hand, its local one-term-at-a-
time greedy search strategy prevents it from coping with term interaction, since no two or more terms are
evaluated together as a whole. Another problem is the inconvenience in rule generation that stems from the way
the greedy heuristic works [7].
kNN classifiers are instance-based: they do not rely on the statistical distribution of the training data, so they cannot make good use of positive examples. In this method two documents may be nearest neighbors even though they belong to different categories, and the reverse can also occur.
The Rocchio method also performs well after sorting the data into positive and negative categories, but it is not very efficient; this is shown in Table 1 with the help of a labeling heuristic called PNLH (Positive examples and Negative examples Labeling Heuristic), which is an extension of the preliminary work in [8]. One more method, introduced by Hisham Al-Mubaid and Syed A. Umair, is Lsquare, which uses distributional clustering and learning logic. It focuses mainly on word features and feature clustering, and generates different separating and nested sets. Its results are as good as those of SVM, differing by only a few points [4].
Table 1: Key Finding of Various Methods Used for Text Classification
Author and Year | Method Used | Key Findings
Hisham Al-Mubaid and Syed A. Umair (2003) | SVM | This method is the same as ours, but they tested only a few categories from the database [9].
Yiming Yang and Jan O. Pedersen (2005) | Term weighting method, CHI2 | This method is quite expensive and is better suited to classifiers such as neural networks [2].
Padraig Cunningham and Sarah Jane Delany (2007) | k-NN | k-NN is useful if analysis of neighbours is important for classification [7].
Man Lan, Chew Lim Tan, Jian Su, and Yue Lu (2009) | tf.rf | According to them, the relevant frequency approach is the best, but this method has been applied only for two categories and two keywords from each category [1].
C. Deisy, M. Gowri, S. Baskar, S.M.A. Kalaiarasi, and N. Ramraj (2010) | Support Vector Machine | They implemented a modified inverse document frequency and a classifier with a radial basis function [9].
Pascal Soucy and Guy W. Mineau (2014) | ConfWeight | This can replace the tf.idf method and performs well both with and without feature selection [4].
3. Review of Various Text Categorization Methods
DOI: 10.9790/0661-17321319 www.iosrjournals.org 15 | Page
1.2 Problems in Existing Systems
Although there are various methods for text categorization that work differently according to their parameters, those parameter values also change with the text, so it is necessary to examine the properties of the text. Some of these are listed below:
1. Requires a labeled database
2. Requires a cleaned dataset
3. Requires linear separability in the dataset
4. Document vectors are sparse
5. High-dimensional input space
6. Irrelevant features
1.3 Improvements over the Existing System
1. Requires a labeled database: Existing systems require labeled input for text categorization. This can be overcome by a machine learning approach that uses training and testing phases and does not require a labeled dataset.
2. Requires a cleaned dataset: Existing systems require a cleaned dataset. This can be overcome by incorporating preprocessing in the proposed system, consisting of filtering, tokenization, stemming, and pruning.
3. Requires linear separability in the dataset: Most categories of the Ohsumed database are linearly separable, so the classifier must first detect this; the same holds for many of the Reuters tasks, where the idea in existing work is to find such linear separators.
4. Document vectors are sparse: For each document only some of the entries are non-zero. Kivinen, Warmuth, and Auer [8] give both theoretical and empirical evidence, in the mistake-bound model, that additive algorithms, which have an inductive bias similar to the proposed system, are well suited to problems with dense concepts and sparse instances.
5. High-dimensional input space: When we categorize text we encounter many features. The proposed system provides overfitting protection that does not depend on the number of features, so it has the potential to handle very large feature spaces.
6. Irrelevant features: One way to avoid this problem is feature selection. Text categorization yields very few relevant features according to their information gain, yet a word occurring only a few times often carries the most relevant information. A good classifier must therefore combine many features, since aggressive selection may lead to loss of information; the proposed system provides many parameters for feature selection, which can avoid this to a great extent.
Two further advantages of the proposed system are, first, that it is based on simple ideas and provides a clear intuition of what is being learned from the examples, and second, that it performs very well in practical applications despite the complexity of feature extraction.
II. Proposed System
2.1 Term Weighting Methods: A Brief Review
In text representation, terms are words, phrases, or any indexing terms, and each is represented by a value whose magnitude gives the importance of that term. All documents are analyzed to obtain features, and those features act as keywords. The frequency of a keyword, i.e. the number of times a particular term occurs in a document, is denoted tf (term frequency); the various term frequency factors are given in Table 2 below.
Table 2: Term Frequency Factors [1]
Term frequency factor | Denoted by | Description
1.0 | binary | Binary weight = 1 for terms present in a vector
tf alone | tf | Raw term frequency (number of times the term occurs in a document)
log(1 + tf) | log tf | Logarithm of the term frequency
1 - (1/(1 + tf)) | ITF | Inverse term frequency
ntf | ntf | Normalized term frequency
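The first four factors in Table 2 can be sketched as simple functions of the raw count tf (an illustrative sketch; the helper names are ours, not from the paper):

```python
import math

def binary_weight(tf):
    # 1.0 if the term is present in the vector, 0.0 otherwise
    return 1.0 if tf > 0 else 0.0

def log_tf(tf):
    # log(1 + tf): dampens unfavourably high raw counts
    return math.log(1 + tf)

def inverse_tf(tf):
    # 1 - 1/(1 + tf): bounded in [0, 1), grows slowly with tf
    return 1 - 1 / (1 + tf)
```

For example, a term occurring once gets inverse_tf(1) = 0.5, while a term occurring a hundred times gets only about 0.99, illustrating how the factor restrains high counts.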
Table 2 lists the commonly used term frequency factors. The binary factor indicates only the presence or absence of a term by 1 and 0; it does not give the importance of the term, so it cannot be used in feature generation, although it is used in naïve Bayes and decision trees. The next and most popular representation adopts the raw term frequency; a variant of it, log(1 + tf), which is nearly the same as log(tf), is used to scale down unfavorably high term frequencies in documents [1]. This is reduced further by the formula 1 - (1/(1 + tf)), known as inverse term frequency, but this factor is not as effective as the term frequency: although it reduces the value when the term frequency is high, it does not help the classifier categorize a new input document when a word is not exactly a keyword but very close to one. In that case we need to use the term frequency, but then we again need to minimize the unfavorably high values, so the solution we propose is the normalized term frequency factor, denoted ntf. It is given by equation (1), where i is the keyword being searched for, j is the document in which it occurs, and the maximum is taken over all terms k occurring in that document, so the denominator is the frequency of the most frequent term in document j, which may or may not be the same as i.
ntf = freq(i, j) / max_k freq(k, j)    (1)
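Equation (1) can be sketched as follows, representing a document as a mapping from each term to its raw frequency (an illustrative helper; the names are ours):

```python
def ntf(term, doc_counts):
    # doc_counts maps each term k of document j to its raw frequency freq(k, j)
    # eq. (1): freq(i, j) divided by the largest term frequency in the document
    return doc_counts.get(term, 0) / max(doc_counts.values())
```

For example, ntf("cat", {"cat": 3, "dog": 6}) gives 0.5, since "dog" is the most frequent term; the most frequent term itself always receives weight 1.0, so all weights fall in the range 0 to 1.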
This gives the normalized term frequency for the document. Various term weighting methods have been used for text categorization before this; they are listed in Table 3. Following them, we need to calculate a few parameters such as information gain (ig), odds ratio (OR), chi-square (χ2), relevant frequency (rf), and inverse document frequency (idf). This calculation is done by dividing the documents into positive and negative categories, and all calculations are done per term. Suppose there are six terms t1, t2, t3, t4, t5, and t6, as shown in Fig. 1, given one chosen positive category in the data collection.
Fig. 1: Examples of different distributions of documents that contain six terms in the whole collection [1]
Each column represents the distribution in the corpus of the documents containing one term, and its height is the number of documents. The horizontal line divides these documents into two categories, positive (above) and negative (below); the heights of a column above and below the horizontal line denote the number of documents in the positive and negative categories, respectively. The height of the shaded part is the number of documents that contain the term. We use a, b, c, and d to denote the following document counts:
a: the number of documents in the positive category that contain the term
b: the number of documents in the positive category that do not contain the term
c: the number of documents in the negative category that contain the term
d: the number of documents in the negative category that do not contain the term [1]
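Counting a, b, c, and d can be sketched as follows, treating each document as the set of terms it contains (illustrative helper names, not from the paper):

```python
def count_abcd(term, positive_docs, negative_docs):
    # each document is represented as the set of its terms
    a = sum(1 for doc in positive_docs if term in doc)  # positive docs containing the term
    b = len(positive_docs) - a                          # positive docs without the term
    c = sum(1 for doc in negative_docs if term in doc)  # negative docs containing the term
    d = len(negative_docs) - c                          # negative docs without the term
    return a, b, c, d
```

Note that a + b is the size of the positive category and c + d the size of the negative one, matching the column heights in Fig. 1.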
Table 3: Summary of Nine Different Term Weighting Methods [1]
Category | Denoted by | Description
Unsupervised term weighting | binary | 1 for presence, 0 for absence
Unsupervised term weighting | tf | Term frequency alone
Unsupervised term weighting | tf.idf | Classic tf and idf
Supervised term weighting | tf.rf | Term frequency and relevant frequency
Supervised term weighting | rf | Relevant frequency alone
Supervised term weighting | tf.χ2 | tf and chi-square
Supervised term weighting | tf.ig | tf and information gain
Supervised term weighting | tf.logOR | tf and logarithm of odds ratio
Supervised term weighting | ntf.rf | Proposed method
After obtaining the normalized term frequency, we need to find the relevant frequency, which requires two main parameters: a, the number of documents in the positive category that contain the term, and c, the number of documents in the negative category that contain the term. Based on these two parameters the relevant frequency is calculated as given in equation (2):
rf = log2(a / c)    (2)
Looking at the above equation, in the worst case it may happen that no document in the negative category contains the given term at all; the denominator c then becomes zero and leads to a divide-by-zero error. In that case the option is to choose one instead of c, and equation (2) is modified into equation (3):
rf = log2(a / max(1, c))    (3)
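Equation (3) can be sketched directly (a sketch of the reconstructed formula; it assumes the term occurs in at least one positive document, since log2(0) is undefined):

```python
import math

def rf(a, c):
    # relevant frequency, eq. (3): max(1, c) guards against division by zero
    # when no negative-category document contains the term
    return math.log2(a / max(1, c))
```

For example, a term appearing in 4 positive documents and no negative ones gets rf(4, 0) = 2.0, while a term spread evenly over both categories gets a much lower weight, reflecting its weaker discriminating power.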
Following the term weighting methods described in Table 3, we can now calculate our proposed term weighting method: computing ntf and rf and multiplying the two, which yields the new term weight. Normalizing the term frequency keeps it in the range 0 to 1 and restrains unfavorably high term counts, which is why the normalized frequency is chosen rather than the raw term frequency; when it is combined with idf we get the weights of the terms, which generate the vector space model.
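Combining the normalized frequency with idf to build the vector space model can be sketched as follows (an illustrative helper, not the paper's implementation; df holds the document frequency of each term):

```python
import math

def ntf_idf_vectors(docs):
    # docs: list of dicts mapping term -> raw count in that document
    N = len(docs)
    df = {}  # document frequency: number of documents containing each term
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        max_tf = max(doc.values())
        # weight = ntf * idf, with idf = log(N / df)
        vectors.append({t: (tf / max_tf) * math.log(N / df[t])
                        for t, tf in doc.items()})
    return vectors
```

A term occurring in every document gets idf = log(1) = 0 and thus weight 0, so only terms that discriminate between documents contribute to the vector space model.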
2.2 Block Diagram of the Proposed System
The input data is first filtered, i.e. special characters and symbols such as @, <, >, $, ^, etc. are removed. During tokenization, all remaining text is parsed, lowercased, and stripped of punctuation. Stemming techniques find the root/stem of a word; converting words to their stems incorporates a great deal of language-dependent linguistic knowledge. Pruning then counts the number of times a particular term occurs in the document, which is its term frequency. In this way the documents are preprocessed, and the data is then ready for further processing, namely generating the vector space model, for which the term weighting method must be calculated first. Since we use the classic tf.idf and its normalized form ntf.idf, their weights must also be calculated to obtain the VSM and proper term weights.
Fig. 2: Block diagram of the proposed system
2.3 Flowchart of the Proposed System
The basic design of this work is to transform documents into a representation suitable for categorization and then assign documents to the predefined categories based on the training weights; the flow is as shown in Fig. 3.
Fig. 3: Flowchart of the proposed system
2.4 Algorithm of the Proposed System
Proposed algorithm for text categorization:
Input:
Set of documents D = {D1, D2, ..., Dm} // documents to be classified
Fixed set of categories C = {C1, C2, ..., Cn} // categories into which the documents should be classified
Output: Determine di ∈ C // where di is the ith document from the set D
Steps:
1. Wordlist{m, 1} ← each word in the document
// read each word of the document to be classified and place each word in a separate row, forming a list
2. Wordlist{m, 2} ← Filtering(Wordlist)
// the generated list is passed to the filtering phase, which removes special characters and writes the result in the second column
3. Wordlist{m, 3} ← Tokenization(Wordlist)
// the second column is taken as input and all stop words are removed; the modified list goes in the third column
4. Wordlist{m, 4} ← Stemming(Wordlist)
// the third column is taken as input and all words are converted to their root form; the modified list goes in the fourth column
5. Wordlist ← Pruning(Wordlist)
// the fourth column is taken as input and the occurrences of each word in the list are counted
6. ntf ← ntf(tf, k, t)
// the output of pruning is used to compute the normalized term frequency as given in equation (1)
7. idf ← log(N/n)
// the inverse document frequency, where N is the total number of documents and n the number of documents containing the term
8. VSM ← ntf × idf
// the vector space model and its normalized form are computed
9. Create classifier
// depending on the methodology used, a classifier is created
10. Use classifier C(Y|X)
// by passing appropriate parameters to the created classifier, the document gets categorized
11. Select the best classification result
// finally we obtain the classification results
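The preprocessing steps of the algorithm (filtering, tokenization, stemming, pruning) followed by the ntf computation can be sketched as a minimal pipeline. This is only an illustrative sketch: the stop-word list is a tiny subset, and stemming is stubbed out, since a real system would apply a language-dependent stemmer such as Porter's.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "an", "of", "and"}  # illustrative subset only

def preprocess(text):
    # filtering + tokenization: drop special characters, lowercase, split into words
    tokens = re.findall(r"[a-z]+", text.lower())
    # stop-word removal (a fuller list would be used in practice)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # stemming would be applied here; it is omitted in this sketch
    # pruning: count how many times each term occurs
    return Counter(tokens)

def ntf_weights(counts):
    # normalized term frequency, eq. (1): divide by the largest count
    max_tf = max(counts.values())
    return {t: tf / max_tf for t, tf in counts.items()}
```

For example, preprocess("The cat and the dog. Cat!") yields counts {cat: 2, dog: 1}, and ntf_weights then assigns cat the weight 1.0 and dog the weight 0.5.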
2.5 Method: The proposed system works in three major steps
Step 1: Preprocessing
Filtering: filters the input dataset.
Tokenization: lowercases the text and removes all punctuation.
Stemming: converts words to their root stems.
Pruning: counts the number of times a particular term occurs in the document.
Step 2: Training the Classifier
Calculating tf, idf, VSM: calculates the weights to obtain the VSM and proper term weights.
Keyword list: contains all terms occurring in the documents of the respective category.
Training the classifier: trains in a supervised or unsupervised manner.
Step 3: Testing the Classifier: tests the accuracy of the classifier on an unseen dataset.
III. Conclusion and Future Work
The performance of term weighting methods, especially unsupervised ones, is closely related to the learning algorithm and the data corpus. We can also state that ntf.rf and ntf.idf will give better performance than tf.rf, while tf alone has no relationship with the algorithm or the data corpus; the proposed system is expected to perform well in all cases. All the evidence shows that tf.rf performs well, but it is not well suited when there are a large number of categories and many keywords; in that case ntf.rf, and even more so ntf.idf, do well. Good text categorization can be performed with both supervised and unsupervised machine learning. The proposed system uses term weighting methods with preprocessing, so it does not require labeled data, and with its help the results will automatically improve in terms of precision, recall, and accuracy.
In future we plan to test the data with all the other term weighting methods as well, both supervised and unsupervised. We also want to test on different globally available natural language datasets.
References
[1] M. Lan, C. L. Tan, J. Su, and Y. Lu, "Supervised and Traditional Term Weighting Methods for Automatic Text Categorization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 4, Apr. 2009.
[2] Z.-H. Deng, S.-W. Tang, D.-Q. Yang, M.-M. Zhang, L.-Y. Li, and K.-Q. Xie, "A Comparative Study on Feature Weight in Text Categorization," Proc. Asia-Pacific Web Conf., vol. 3007, pp. 588-597, 2004.
[3] F. Debole and F. Sebastiani, "Supervised Term Weighting for Automated Text Categorization," Proc. ACM Symp. Applied Computing, pp. 784-788, 2003.
[4] P. Soucy and G. W. Mineau, "Beyond TFIDF Weighting for Text Categorization in the Vector Space Model," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 1130-1135, 2005.
[5] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Universität Dortmund, Informatik LS8, 44221 Dortmund, Germany.
[6] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng, "Some Effective Techniques for Naive Bayes Text Classification," IEEE Trans. Knowledge and Data Engineering, vol. 18, no. 11, Nov. 2006.
[7] P. Rullo, V. L. Policicchio, C. Cumbo, and S. Iiritano, "Olex: Effective Rule Learning for Text Categorization," IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 8, Aug. 2009.
[8] H. Al-Mubaid and S. A. Umair, "A New Text Categorization Technique Using Distributional Clustering and Learning Logic," IEEE Trans. Knowledge and Data Engineering, vol. 18, no. 9, Sept. 2006.
[9] M. A. Hearst, University of California, Berkeley, "Support Vector Machines."
[10] T. Joachims and M. M. Gure, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features."
[11] A. Sun, E.-P. Lim, W.-K. Ng, and J. Srivastava, "Blocking Reduction Strategies in Hierarchical Text Classification," IEEE Trans. Knowledge and Data Engineering, vol. 16, no. 10, Oct. 2004.
[12] A. Sun, E.-P. Lim, and Y. Liu, "What Makes Categories Difficult to Classify? A Study on Predicting Classification Performance for Categories."