Organizing the continual growth of dynamic, unstructured documents is a major challenge for domain experts. Handling such unorganized documents is expensive, and clustering these dynamic documents helps to reduce the cost. Clustering documents by analysing their keywords is one of the best methods for organizing unstructured, dynamic documents, and statistical analysis is a well-suited adaptive method for extracting keywords from them. In this paper, an algorithm is proposed to cluster documents. It has two parts: the first extracts keywords using a statistical method, and the second constructs clusters from those keywords using an agglomerative method. The proposed algorithm achieves more than 90% accuracy.
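As a hedged illustration of the two-part idea above (statistical keyword extraction, then agglomerative clustering), here is a minimal Python sketch; the TF-IDF weighting, toy corpus and parameters are assumptions, not the authors' exact method.

```python
# Sketch: part 1 extracts keywords statistically (TF-IDF weights),
# part 2 builds clusters bottom-up with agglomerative clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "stock market trading prices",
    "market prices fall on trading floor",
    "soccer match ends in draw",
    "team wins the soccer final",
]

# Part 1: keep only statistically weighted keywords per document.
vec = TfidfVectorizer(stop_words="english", max_features=50)
X = vec.fit_transform(docs).toarray()

# Part 2: merge the closest documents/clusters repeatedly (agglomerative).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # e.g. [1 1 0 0]
```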
Different Similarity Measures for Text Classification Using KNN (IOSR Journals)
This document summarizes research on classifying textual data using the k-nearest neighbors (KNN) algorithm and different similarity measures. It explores generating 9 different vector representations of text documents and using KNN with similarity measures like Euclidean, Manhattan, squared Euclidean, etc. to classify documents. The researchers tested KNN on a Reuters news corpus with 5,485 training documents across 8 classes and found that normalization and k=4 produced the best accuracy of 94.47%. They conclude KNN with different similarity measures and vector representations is effective for multi-class text classification.
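A small, hedged sketch of the setup this summary describes: sklearn's KNN with swappable distance metrics over TF-IDF vectors. The corpus and k below are toy stand-ins for the Reuters data and the reported best k=4.

```python
# KNN text classification with different similarity/distance measures.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train = ["oil prices rise", "crude oil exports grow",
         "team wins the match", "player scores a goal"]
labels = ["trade", "trade", "sport", "sport"]

vec = TfidfVectorizer()          # produces L2-normalised vectors by default
X = vec.fit_transform(train)
test = vec.transform(["oil exports fall"])

for metric in ("euclidean", "manhattan", "cosine"):
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric).fit(X, labels)
    print(metric, knn.predict(test))
```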
A Text Mining Research Based on LDA Topic Modelling (csandit)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full analysis of Twitter users' interests. The experiment process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
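A minimal, hedged sketch of the core modelling step described above, using gensim's LDA implementation; the tokenised toy corpus and hyperparameters are assumptions, and the paper's Wikipedia/Twitter pipelines add data collection and heavier pre-processing around this step.

```python
# Fit an LDA topic model on a tiny tokenised corpus.
from gensim import corpora, models

texts = [["topic", "model", "text", "mining"],
         ["tweet", "user", "topic", "interest"],
         ["text", "mining", "search", "explore"]]

dictionary = corpora.Dictionary(texts)             # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```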
Ginix: Generalized Inverted Index for Keyword Search (IRJET Journal)
This paper presents a new index structure called Ginix (Generalized Inverted Index) that more efficiently supports keyword searches on text datasets. Ginix compresses traditional inverted indexes by merging consecutive document IDs into intervals to reduce storage space. It also develops efficient algorithms for set operations like union and intersection directly on the interval lists without requiring decompression. Experiments show that Ginix not only reduces index size but also improves search performance compared to traditional inverted indexes on real datasets.
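The interval idea is easy to see in a few lines. This is an illustrative sketch of the core compression step and one set operation, not Ginix's actual data structures or algorithms:

```python
# Compress a sorted posting list of document IDs into intervals, then
# intersect two interval lists without expanding them back to IDs.
def to_intervals(doc_ids):
    """[1, 2, 3, 7, 8] -> [(1, 3), (7, 8)]"""
    out = []
    for d in doc_ids:
        if out and d == out[-1][1] + 1:
            out[-1] = (out[-1][0], d)   # extend the last interval
        else:
            out.append((d, d))          # start a new interval
    return out

def intersect(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))        # overlapping part, if any
        if a[i][1] < b[j][1]:           # advance the interval that ends first
            i += 1
        else:
            j += 1
    return out

print(to_intervals([1, 2, 3, 7, 8]))          # [(1, 3), (7, 8)]
print(intersect([(1, 3), (7, 8)], [(2, 7)]))  # [(2, 3), (7, 7)]
```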
Clustering the results of a search helps the user to get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results. By catalogue we mean a structured label list that helps the user to interpret the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for their query and wasting extra time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis placed on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric that is employed to assess the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, and we perform the experiments on the publicly available datasets Ambient and ODP-239.
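As a point of reference (not the paper's labelling method, nor STC or Lingo), a common baseline labels each cluster with the strongest terms of its centroid; a hedged toy sketch:

```python
# Label k-means clusters of search results with their top centroid terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = ["jaguar car dealer price", "new jaguar car model",
           "jaguar cat habitat wildlife", "wildlife jaguar big cat"]

vec = TfidfVectorizer()
X = vec.fit_transform(results)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for c, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]          # highest-weight terms
    print("cluster", c, "label:", [terms[i] for i in top])
```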
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH (IJDKP)
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are the two data mining approaches that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
This document provides an overview and summary of Pankaj Jajoo's 2008 master's thesis on improving document clustering algorithms. The thesis explores two approaches: 1) preprocessing the graph representation of documents to remove noise before applying standard graph partitioning algorithms, and 2) clustering words first before clustering documents to reduce noise. Experimental results on three datasets show these approaches improve clustering quality over standard K-Means clustering. The thesis provides background on clustering, reviews existing document clustering methods, and describes the two new algorithms and evaluation of their performance.
The document describes a project to semantically annotate research papers with ACM classification categories. It discusses using cosine similarity, latent Dirichlet allocation, and a proposed model combining labeled LDA and doc2vec. The proposed model trains a supervised topic model to learn document representations that capture semantic relationships between papers and categories. The model achieved 59.31% mean average precision and 45.03% NDCG on a test dataset, demonstrating an improvement over baselines.
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap... (IJECEIAES)
Latent Dirichlet Allocation (LDA) is a probability model for grouping hidden topics in documents given a predefined number of topics. If determined incorrectly, the number of topics K will result in limited correlation between words and topics: too large or too small a K causes inaccuracies when grouping topics in the formation of the training model. This study aims to determine the optimal number of corpus topics in the LDA method using the maximum likelihood and Minimum Description Length (MDL) approaches. The experimental process uses Indonesian news articles with document counts of 25, 50, 90, and 600; the corresponding word counts are 3898, 7760, 13005, and 4365. The results show that the maximum likelihood and MDL approaches yield the same optimal number of topics, which is influenced by the alpha and beta parameters. In addition, the number of documents does not affect the computation time, but the number of words does: computation times for those datasets are 2.9721, 6.49637, 13.2967, and 3.7152 seconds. The optimisation model yields the LDA topics used as a classification model, and the experiment shows the highest average accuracy is 61% with alpha 0.1 and beta 0.001.
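A hedged sketch of likelihood-based selection of K in the spirit of this study (the MDL term is omitted here, and the toy corpus stands in for the Indonesian news data):

```python
# Compare candidate topic numbers K by model perplexity (lower is better);
# in practice this should be computed on held-out documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["market stock price trade", "stock trade price rise",
        "soccer goal match team", "team match win goal"]
X = CountVectorizer().fit_transform(docs)

for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    print(k, lda.perplexity(X))
```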
Dynamic extraction of key paper from the cluster using variance values of cit... (IJDKP)
When looking into recent research trends in the academic landscape, citation network analysis is common, and automated clustering of many academic papers has been achieved by making good use of various techniques. However, specifying the features of each area identified by automated clustering, or dynamically extracting the key papers in each research area, has not yet been achieved. In this study, therefore, we propose a method for dynamically specifying the key papers in each area identified by clustering. We investigate the variance of the publication years of the cited literature and calculate each cited paper's importance by applying these variance values to the PageRank algorithm.
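One hedged reading of that idea as code (the citation graph, year lists and weighting scheme below are invented for illustration, not the paper's exact formulation):

```python
# Bias PageRank towards papers whose cited literature spans many years,
# using the variance of cited-publication years as a personalization prior.
import statistics
import networkx as nx

G = nx.DiGraph([("A", "B"), ("C", "B"), ("B", "D"), ("C", "D")])  # citations
cited_years = {"A": [2001, 2002], "B": [1995, 2005, 2015],
               "C": [2010, 2011], "D": [1990, 2020]}

var = {p: statistics.pvariance(ys) for p, ys in cited_years.items()}
total = sum(var.values())
personalization = {p: v / total for p, v in var.items()}

rank = nx.pagerank(G, personalization=personalization)
print(max(rank, key=rank.get))  # candidate "key paper" of this cluster
```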
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task used for effective document retrieval. Automatic text classification is the process of assigning a new text document to one of the predefined categories based on its content. In this paper, we implemented and compared Naïve Bayes and Centroid-based algorithms for effective categorization of English-language documents. In the Centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. The experiment is performed on the R-52 dataset of the Reuters-21578 corpus, and the Micro Average F1 measure is used to evaluate classifier performance. Experimental results show that the Micro Average F1 value for NB is the greatest of all, followed by that of CGC, which is greater than that of AAC. All these results are valuable for future research.
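A hedged sketch of the AAC variant (class centroid = arithmetic mean of the class's document vectors; assignment by cosine similarity). The toy data is illustrative, not R-52:

```python
# Arithmetical Average Centroid (AAC) classification.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train = ["oil trade price", "oil exports trade",
         "match goal team", "team win match"]
y = np.array(["trade", "trade", "sport", "sport"])

vec = TfidfVectorizer()
X = vec.fit_transform(train).toarray()

# One centroid per class: the mean of that class's document vectors.
centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

test = vec.transform(["oil price rise"]).toarray()
scores = {c: cosine_similarity(test, m.reshape(1, -1))[0, 0]
          for c, m in centroids.items()}
print(max(scores, key=scores.get))  # -> "trade"
```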
This document discusses integrating natural language processing and parse tree query language with text mining and topic summarization methods to more efficiently extract relevant content from documents. It presents an approach that uses natural language processing to automatically generate queries from sentences, and then applies a topic summarization method called TSCAN to identify themes, segment events, and construct an evolution graph to show relationships between events. The integrated system aims to make content extraction more effective and easier to use for real-time applications. Evaluation of the methods showed benefits for tasks like information extraction.
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
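In standard notation, the three-level generative process just described can be written compactly; this is the textbook formulation, with K topics and a document of N words:

```latex
% LDA generative process and joint distribution for one document
\beta_k \sim \mathrm{Dir}(\eta), \quad k = 1,\dots,K \\
\theta \sim \mathrm{Dir}(\alpha) \\
z_n \mid \theta \sim \mathrm{Multinomial}(\theta), \qquad
w_n \mid z_n \sim \mathrm{Multinomial}(\beta_{z_n}), \quad n = 1,\dots,N \\
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```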
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected for publication through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac... (IRJET Journal)
This paper proposes a method to mine rare sequential topic patterns (URSTPs) from tweet data. It involves preprocessing tweets to extract topics, identifying user sessions, generating sequential topic pattern (STP) candidates, and selecting URSTPs based on rarity analysis. Experiments show the approach can identify special users and interpretable URSTPs, indicating users' characteristics. The paper aims to capture personalized and abnormal user behaviors through sequential relationships between extracted topics from successive tweets.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training
data. The semi-supervised classification technique uses labeled and unlabeled data for training. This
technique has shown to be effective in some cases; however, the use of unlabeled data is not always
beneficial.
On the other hand, the emergence of web technologies has originated the collaborative development of
ontologies. In this paper, we propose the use of ontologies in order to improve the accuracy and efficiency
of the semi-supervised document classification.
We used support vector machines, one of the most effective algorithms that have been studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an improvement in accuracy of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
A Competent and Empirical Model of Distributed Clustering (IRJET Journal)
This document discusses distributed document clustering. It begins with an introduction to how documents are stored and indexed in computers. It then discusses different clustering algorithms like hierarchical and k-means clustering that are used to group similar documents. The document proposes a new framework for efficiently clustering text documents stored across different distributed resources. It argues that traditional clustering algorithms cannot perfectly cluster text data in decentralized systems. The framework uses properties of traditional algorithms with the ability to cluster in distributed systems.
The document discusses multidimensional databases. It defines multidimensional databases as systems designed to efficiently store and retrieve large volumes of related data that can be viewed from different perspectives or dimensions. It provides an example using automobile sales data that can be analyzed based on dimensions like model, color, dealership, and time. Multidimensional databases allow for interactive analysis of data from multiple angles, unlike relational databases that are slower for such analyses.
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING (ijcsit)
In today's world of the internet, with a whole lot of e-documents, such as HTML pages and digital libraries, occupying a considerable amount of cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups, which helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the most frequently used e-documents by individuals and organizations, and email categorization is one of the major tasks of email mining: categorizing emails into different groups aids easy retrieval and maintenance. Like other e-documents, emails can also be grouped using clustering algorithms. In this paper, a similarity measure called the Similarity Measure for Text Processing is suggested for email clustering. The suggested similarity measure takes into account three situations: a feature appears in both emails, a feature appears in only one email, and a feature appears in neither email. The potency of the suggested measure is analyzed on the Enron email dataset by categorizing emails. The outcome indicates that the efficiency achieved by the suggested similarity measure is better than that achieved by other measures.
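A heavily hedged sketch of a three-case measure in this spirit; this is not the published SMTP formula, which also gives the feature-absent-in-both case a small positive contribution and uses per-feature weighting:

```python
# Toy three-case similarity: shared features add, one-sided features
# subtract, features absent from both contribute nothing here.
import numpy as np

def three_case_sim(a, b, lam=0.5):
    both = (a > 0) & (b > 0)       # feature appears in both emails
    one = (a > 0) ^ (b > 0)        # feature appears in only one email
    denom = both.sum() + one.sum()
    return (both.sum() - lam * one.sum()) / denom if denom else 0.0

a = np.array([2, 0, 1, 3, 0])      # term-frequency vectors of two emails
b = np.array([1, 0, 0, 2, 0])
print(three_case_sim(a, b))        # 0.5
```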
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text information is generated every day, and effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the LDA topic model. We then explain in depth how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all the necessary steps of data retrieval, pre-processing, fitting the model, and an application in the form of a document exploring system. The results show the LDA topic model working effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
Semantic annotation is done by first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations; the vectors are then taken as features into a classifier, and the trained model can classify a document with ACM classification tree categories, with the help of the Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", ICML, 2014.
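A minimal, hedged sketch of the Doc2Vec step mentioned above (gensim's implementation; the toy corpus and hyperparameters are stand-ins for the Wikipedia training setup):

```python
# Train Doc2Vec on tagged documents, then infer a vector for a new one;
# such vectors would be fed to the downstream ACM-category classifier.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(["sorting", "algorithms", "complexity"], tags=[0]),
          TaggedDocument(["neural", "networks", "training"], tags=[1])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

vec = model.infer_vector(["graph", "algorithms"])
print(vec.shape)  # (50,)
```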
Comparison of Text Classifiers on News Articles (IRJET Journal)
This document compares the performance of five text classification algorithms (SVM, Naive Bayes, K-Nearest Neighbors, Decision Tree, Rocchio) on news article datasets. It finds that an SVM classifier achieves the highest accuracy on the datasets tested, with accuracies of 86.7%, 75.6%, and 97.6% on the Twenty Newsgroups, Reuters, and BBC News datasets respectively. The document also evaluates and compares the training times and testing times of the classifiers on the different datasets.
Simplicial closure & higher-order link prediction (Austin Benson)
The document discusses higher-order link prediction in networks. It summarizes previous work representing higher-order interactions as tensors, hypergraphs, etc. It then proposes evaluating models of higher-order data using "higher-order link prediction" to predict which groups of more than two nodes will interact based on past data. The authors analyze dynamics of triadic closure in several real-world networks and propose methods to predict closure based on structural properties like edge weights.
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
This document discusses research on modeling and predicting higher-order interactions in networks beyond pairwise connections. The researchers collected datasets containing time-stamped groups or "simplices" of nodes and analyzed properties like triangle closure. They propose "higher-order link prediction" to predict which new simplices will form based on structural features like edge weights between nodes. Scoring functions were tested and averages of edge weights often performed well, differing from classical link prediction methods.
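A hedged toy version of the scoring approach described above: enumerate closed pairwise triangles in a weighted projected graph and rank them as candidate 3-node simplices by mean edge weight (the graph and weights are invented):

```python
# Rank candidate 3-node simplices by the average weight of their edges.
from itertools import combinations
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 3), ("b", "c", 5), ("a", "c", 2),
                           ("c", "d", 1), ("b", "d", 1), ("a", "d", 4)])

def mean_weight(triple):
    return sum(G[u][v]["weight"] for u, v in combinations(triple, 2)) / 3

triangles = [t for t in combinations(G.nodes, 3)
             if all(G.has_edge(u, v) for u, v in combinations(t, 2))]
for t in sorted(triangles, key=mean_weight, reverse=True):
    print(t, round(mean_weight(t), 2))
```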
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents the information about the data stored in a Data Warehouse and is a mandatory element for building an efficient one. Metadata helps in data integration, lineage, data quality, and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including the spatial information, is fetched from the Data Warehouse.
A Document Similarity Measurement without Dictionaries (鍾誠 陳鍾誠)
The document proposes a measure of document similarity called Common Keyword Similarity (CKS) that does not rely on dictionaries. CKS is based on finding common substrings between documents using a PAT-tree data structure. The importance of each substring is determined by its discriminating effect (KDE), which reflects how well it fits a given classification system. CKS is computed as the sum of the weights of the common keywords between two documents. Experimental results on news articles show that CKS without a dictionary has better recall and precision than a method using cosine coefficient that relies on a dictionary, since many terms cannot be found in dictionaries. The classification system used to determine keyword weights also significantly impacts performance.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study, a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). This is gauged against unsupervised techniques such as fuzzy feature clustering using hard fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. The results show that clustering the features improves the accuracy of document classification.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
Experimental Result Analysis of Text Categorization using Clustering and Clas... (ijtsrd)
In a world that routinely produces ever more textual data, managing that textual data is a very critical task. Many text analysis methods are available for managing and visualizing the data, but many techniques may give low accuracy because of the ambiguity of natural language. To provide fine-grained analysis, this paper introduces efficient machine learning algorithms to categorize text data. To improve accuracy, the proposed system uses the Natural Language Toolkit (NLTK) Python library to perform natural language processing. The main aim of the proposed system is to generalize the model for real-time text categorization applications by using efficient text classification as well as clustering machine learning algorithms, and to find the most efficient and accurate model for an input dataset using performance measures. Patil Kiran Sanajy | Prof. Kurhade N. V., "Experimental Result Analysis of Text Categorization using Clustering and Classification Algorithms", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25077.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25077/experimental-result-analysis-of-text-categorization-using-clustering-and-classification-algorithms/patil-kiran-sanajy
An Advanced IR System of Relational Keyword Search Technique (paperpublications3)
Abstract: Nowadays, keyword search over relational datasets has become an area of research within databases and Information Retrieval. There is no standard process of information retrieval that clearly shows the accurate result along with keyword search ranking, and the execution time for retrieving data is high in existing systems. We propose a system for increasing the performance of relational keyword search systems: we combine schema-based and graph-based approaches into a Relational Keyword Search System that overcomes the mentioned disadvantages of existing systems, manages the information, and lets users access it efficiently. Keyword search with ranking requires very low execution time, and the execution time of retrieving information and the file length during retrieval can be displayed using charts. Keywords: Keyword Search, Datasets, Information Retrieval Query Workloads, Schema-based Systems, Graph-based Systems, ranking, relational databases.
Title: An Advanced IR System of Relational Keyword Search Technique
Author: Dhananjay A. Gholap, Gumaste S. V
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Research on ontology-based information retrieval techniques (Kausar Mukadam)
The document summarizes and compares three novel ontology-based information retrieval techniques. It discusses a technique for retrieving information in the domain of Traditional Chinese Medicine that uses an ontology to represent concepts and measures concept similarity to sort search results. It also describes a framework for semantic indexing and querying that uses an ontology and entity-attribute-value model to improve scalability, usability, and retrieval performance for transport systems. Additionally, it outlines a semantic extension retrieval model that uses ontology annotation and semantic extension of queries to address limitations of keyword-based search. The techniques are evaluated based on precision and recall measures to analyze their effectiveness compared to traditional methods.
This document proposes a new method to re-rank web documents retrieved by search engines based on their relevance to a user's query using ontology concepts. It involves building an ontology of concepts for a given domain (electronic commerce), extracting concepts from retrieved documents, and re-ranking documents based on the frequency of ontology concepts within them. An evaluation showed the approach reduced average ranking error compared to search engines alone. The method was tested on the first 30 documents retrieved for the query "e-commerce" from search engines.
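A hedged miniature of the concept-frequency re-ranking idea (the "ontology" here is just a hand-made set of e-commerce terms, and simple word counting stands in for proper concept extraction):

```python
# Re-rank retrieved documents by how often ontology concepts occur in them.
ontology = {"payment", "cart", "checkout", "shipping"}

retrieved = ["online cart and checkout flow with payment options",
             "history of commerce in the ancient world",
             "shipping and payment integration for web shops"]

def concept_score(doc):
    words = doc.lower().split()
    return sum(words.count(c) for c in ontology)

for doc in sorted(retrieved, key=concept_score, reverse=True):
    print(concept_score(doc), doc)
```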
Reviews on swarm intelligence algorithms for text document clustering (IRJET Journal)
This document reviews swarm intelligence algorithms that have been used for text document clustering. It discusses how text clustering is an unsupervised learning technique that groups similar documents into clusters while separating dissimilar documents. Various swarm intelligence algorithms like particle swarm optimization, artificial bee colony, grey wolf optimizer, and krill herd have been applied to text document clustering problems. The document surveys previous research that has used these swarm intelligence algorithms for text clustering and discusses their advantages and limitations. It aims to provide readers an overview of the different swarm intelligence algorithms available for text document clustering applications.
Topic detection by clustering and text mining (IRJET Journal)
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
Clustering Algorithm with a Novel Similarity Measure (IOSR Journals)
This document proposes a new multi-viewpoint based similarity measure for clustering text documents that aims to overcome limitations of existing measures. Existing measures use a single viewpoint to measure similarity between documents, but the proposed measure uses multiple viewpoints to ensure clusters exhibit all relationships between documents. The empirical study found that using a multi-viewpoint similarity measure forms more meaningful clusters by capturing more informative relationships between documents.
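A hedged sketch of the multi-viewpoint intuition, simplified from the published definition: instead of measuring two documents from a single origin (as cosine similarity does), relate them from every document outside their presumed cluster and average:

```python
# Multi-viewpoint similarity: average of (a - h) . (b - h) over viewpoints h.
import numpy as np

def mvs(a, b, viewpoints):
    return np.mean([(a - h) @ (b - h) for h in viewpoints])

a = np.array([0.9, 0.1, 0.0])                  # two similar documents
b = np.array([0.8, 0.2, 0.1])
others = [np.array([0.0, 0.1, 0.9]),           # documents in other clusters
          np.array([0.1, 0.0, 0.8])]
print(mvs(a, b, others))                       # large positive -> similar
```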
Data mining is knowledge discovery in databases, and the goal is to extract patterns and knowledge from large amounts of data. An important branch of data mining is text mining, which extracts high-quality information from text, typically by means of statistical pattern learning. High quality in text mining refers to some combination of relevance, novelty and interestingness. Tasks in text mining include text categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language processing and analytical methods are highly preferred to turn text into data for analysis.
IRJET - Concept Extraction from Ambiguous Text Document using K-Means (IRJET Journal)
This document discusses using a K-means clustering algorithm to extract concepts from ambiguous text documents. It involves preprocessing the text by tokenizing, removing stop words, and stemming words. The words are then represented as vectors and dimensionality reduction using PCA is applied. Finally, K-means clustering is used to group similar words into clusters to identify the overall concepts in the document without reading the entire text. The aim is to help users understand the key topics in a document in a time-efficient manner without having to read the full text.
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
The document summarizes text mining techniques in data mining. It discusses common text mining tasks like text categorization, clustering, and entity extraction. It also reviews several text mining algorithms and techniques, including information extraction, clustering, classification, and information visualization. Several literature papers applying these techniques to domains like movie reviews, research proposals, and e-commerce are also summarized. The document concludes that text mining can extract useful patterns from unstructured text through techniques like clustering, classification, and information extraction.
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System (IRJET Journal)
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
IRJET - Implementation of Automatic Question Paper Generator System (IRJET Journal)
This document describes a proposed system for automatically generating question papers from input documents. The system performs several steps: it converts PDF/document files to text, conducts preprocessing like removing stop words, uses natural language processing and TF-IDF for key phrase extraction, checks phrases against Wikipedia for domain knowledge, generates triplets for questions, and checks the quality of generated questions using linguistic rules. The system aims to make the question paper generation process faster, more randomized and secure compared to traditional manual methods.
This document provides lecture notes on information retrieval systems. It covers key concepts like precision and recall, different retrieval strategies including vector space model and probabilistic models, and retrieval utilities. The vector space model represents documents and queries as vectors in a shared space and calculates similarity using cosine similarity. Probabilistic models assign probabilities to terms and documents and estimate relevance probabilities. The notes discuss term weighting schemes, inverted indexes to improve efficiency, and integrating structured data with text retrieval. The overall objective is for students to learn fundamental models and techniques for information storage and retrieval.
Construction of Keyword Extraction using Statistical Approaches and Document Clustering by Agglomerative method
R. Nagarajan, Int. Journal of Engineering Research and Applications (IJERA), www.ijera.com, ISSN: 2248-9622, Vol. 6, Issue 1 (Part 2), January 2016, pp. 73-78
Construction of Keyword Extraction using Statistical Approaches
and Document Clustering by Agglomerative method
R. Nagarajan*, Dr. P. Aruna**
*(Department of Computer Science & Engineering, Annamalai University, Annamalainagar)
** (Department of Computer Science & Engineering, Annamalai University, Annamalainagar)
ABSTRACT
Organizing the continuing growth of dynamic unstructured documents is a major challenge for field experts.
Handling such unorganized documents is expensive, and clustering the documents helps to reduce that cost.
Document clustering by analysing the keywords of the documents is one of the best methods to organize
unstructured dynamic documents, and statistical analysis is a well-suited adaptive method for extracting
keywords from documents. In this paper an algorithm is proposed to cluster documents. It has two parts:
the first part extracts keywords using statistical methods, and the second part constructs clusters from
the keywords using an agglomerative method. The proposed algorithm gives more than 90% accuracy.
Keywords - Agglomerative Method, Co-occurrence Statistical Information (CSI), Document Clustering, Similarity Measures, TF-ISF
I. Introduction
In this digital epoch, the tremendous increase of dynamic unstructured documents is unavoidable, and such documents should be organized well so that they can be used cost-effectively. This growth of unstructured documents raises a challenge for field experts who want to use them effectively. Such documents are highly informative and are needed by fields that revolve around data handling, such as web search and machine learning. Document clustering is the most powerful method for solving the problem of organizing unstructured dynamic documents, and various approaches are available to cluster documents. Clustering based on concept extraction is the most direct and effective method. Keywords help to extract the concept of a document: they are the words used in the document that summarize its concept. Extracting fruitful keywords from the bag of words is another challenging job. Keywords are extracted from the documents under three basic assumptions: (i) authors of scientific articles choose their technical terms carefully; (ii) when different terms are used in the same article, it is because the author is either recognizing or postulating some non-trivial relationship between their referents; and (iii) if enough different authors appear to recognize the same relationship, that relationship may be assumed to have some significance within the area of science concerned. The first part of the proposed algorithm extracts the significant keywords by applying statistical analysis to the bag of words; the second part of the algorithm deals with document clustering.
II. Related work
In this section we review previous work on document clustering algorithms and discuss how these algorithms measure up to the requirements of the Web domain. In [1] statistical feature extraction methods are discussed and a framework for statistical keyword extraction is defined. In [2] a survey of keyword extraction techniques is presented, covering the merits and demerits of the simple statistical approach, the linguistic approach, the machine learning approach, and other approaches such as heuristics. In [3] a model for extracting keywords based on their relatedness weight among all the terms of the text is discussed, where the strength of the terms is evaluated by semantic similarity. In [4] different ways to structure a textual document for keyword extraction, different domain-independent keyword extraction methods, and the impact of the number of keywords on incremental clustering quality are analyzed, and a framework for domain-independent statistical keyword extraction is introduced. In [5] a hybrid keyword extraction method based on TF and semantic strategies is discussed, where semantic strategies are introduced to filter dependent words and remove synonyms. In [6] the suffix tree clustering algorithm is discussed, and the authors also create an application that uses this algorithm for clustering and for searching clustered documents. In [7] a novel bottom-up incremental conceptual hierarchical text clustering approach using a CFu-tree (ICHTC-CF) representation is discussed. In [8] a variety of distance functions and similarity measures are compared and analyzed, their effectiveness in partitional clustering of text document datasets is discussed, and the results are compared with the standard k-means algorithm. In [9] different agglomerative algorithms are evaluated through the quality of the clusters produced by different hierarchical agglomerative clustering algorithms with different criterion functions, for the problem of clustering medical documents. In [10] text document space dimension reduction in text document retrieval by agglomerative clustering and a Hebbian-type neural network is discussed.
III. Methodology
3.1 Feature Extraction
The first part of our proposed algorithm deals with keyword extraction using statistical analysis on the words of the documents. The steps involved in the proposed keyword extraction algorithm are:
1. Pre-processing of the documents.
2. Construction of the sentences vs. keywords matrix.
3. Calculation of the weight of each word of the documents.
4. Ranking of the words based on their weight.
5. Selection of the keywords with the highest weights.
Step 1: For each document D do
Begin
  Step 2: Removal of stop words, stemming of words, removal of unnecessary characters, and word simplification.
  Step 3: Construction of the Sentences vs. Words matrix:
    i.   extract sentences from the document and label them DSi;
    ii.  extract words from each sentence DSi and store them in a sentence array DSiWj;
    iii. construct the Sentences vs. Words matrix from the DSiWj arrays.
  Step 4: Calculate the word weights using the following statistical approaches:
    i.   Most Frequent words;
    ii.  Term Frequency - Inverse Sentence Frequency;
    iii. CSI measure (Co-occurrence Statistical Information) for removing noise from the co-word construction.
  Step 5: Extraction of keywords from the highest-weighted words.
End
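As an illustration of Steps 1-2, a minimal Python sketch of the preprocessing stage is given below. The NLTK stop-word list and Porter stemmer are assumed stand-ins, since the paper does not name its preprocessing tools, and sentence splitting on end punctuation is likewise an assumption.

import re
from nltk.corpus import stopwords      # assumed stand-in for the paper's stop-word list
from nltk.stem import PorterStemmer    # assumed stand-in for the paper's stemmer

def preprocess(document: str):
    """Step 2: stop-word removal, stemming, removal of unnecessary characters.
    Returns one word list per sentence (the DSi arrays of Step 3)."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    sentences = []
    for raw in re.split(r"[.!?]+", document):
        words = [w for w in re.findall(r"[a-z]+", raw.lower()) if w not in stop]
        if words:
            sentences.append([stemmer.stem(w) for w in words])
    return sentences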
In the preprocessing stage, the stop words and the unnecessary words are removed, then the words are stemmed, and finally the words are simplified. All the sentences are extracted from the preprocessed document and labeled DSi; the words in the sentences are extracted along with their frequency and their sentence label. To find the keywords of the documents, a sentences vs. words matrix is constructed. Table 1 shows the sentence-word matrix.

Table 1: Sentences-Words matrix

Sentences/Words | W1   | W2   | W3   | ... | Wj
S1              | S1W1 | S1W2 | S1W3 | ... | S1Wj
S2              | S2W1 | S2W2 | S2W3 | ... | S2Wj
S3              | S3W1 | S3W2 | S3W3 | ... | S3Wj
...             | ...  | ...  | ...  | ... | ...
Si              | SiW1 | SiW2 | SiW3 | ... | SiWj

In Table 1, each row corresponds to a sentence of a document and each column represents a word. The set of sentences is represented as S = {S1, S2, S3, ..., Si} and the set of words as W = {W1, W2, W3, ..., Wj}. The value 1 is assigned to a cell (SiWj) if the word occurs in that sentence and the value 0 otherwise. To compute the word weights, three statistical methods are used: i) Higher Frequency words, ii) Term Frequency - Inverse Sentence Frequency, and iii) Co-occurrence Statistical Information.
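Continuing the sketch, Step 3's matrix construction could look like the following; it assumes the preprocess function above and returns the vocabulary in column order along with the 0/1 matrix of Table 1.

def sentence_word_matrix(sentences):
    """Build the binary Sentences-vs-Words matrix of Table 1.

    sentences: list of word lists, one per preprocessed sentence.
    Returns the vocabulary (column order) and the 0/1 matrix."""
    vocab = sorted({w for s in sentences for w in s})
    col = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in sentences]
    for i, s in enumerate(sentences):
        for w in s:
            matrix[i][col[w]] = 1
    return vocab, matrix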
i) Higher Frequent words (HF):
Higher frequency is the basic statistical measure; it extracts keywords directly from the highest-frequency words. The word weight is calculated by counting the number of occurrences of the word in the sentence-word matrix, i.e.

$$\mathrm{HFW}(W_j) = \sum_{S_i \in S} S_iW_j$$

where $W_j$ is the $j$-th word in a document and $S_i$ is the $i$-th sentence in a document.
ii) Term Frequency - Inverse Sentence Frequency (TF-ISF):
TF-ISF is another statistical measure for finding the weight of the words in the documents. It computes the weight of a word according to both its frequency and its distribution across the sentences of the document. The weight of the word is given by

$$\mathrm{TF\text{-}ISF}(W_j) = \mathrm{Frequency}(W_j) \times \log\left(\frac{|S|}{\mathrm{Frequency}(W_j)}\right)$$

where $W_j$ is the $j$-th word in the document and $|S|$ is the number of sentences. In this method the weight of a word decreases when it occurs in a larger number of sentences: such a word is identified as a common word, and it does not help to identify the concept of the document.
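To make the two weighting schemes concrete, here is a minimal sketch computing HF and TF-ISF from the binary matrix built above. Taking Frequency(Wj) as the column sum (the number of sentences containing the word) is an assumption, since the paper does not spell out whether frequency is sentence-level or raw.

import math

def hf_weights(matrix):
    """HF: number of sentences containing each word (column sums of the binary matrix)."""
    n_words = len(matrix[0])
    return [sum(row[j] for row in matrix) for j in range(n_words)]

def tf_isf_weights(matrix):
    """TF-ISF(Wj) = Frequency(Wj) * log(|S| / Frequency(Wj)), with frequency taken
    as the number of sentences containing Wj (an assumption)."""
    n_sentences = len(matrix)
    weights = []
    for freq in hf_weights(matrix):
        weights.append(freq * math.log(n_sentences / freq) if freq else 0.0)
    return weights

Pairing the returned lists with the vocab from the matrix construction gives the per-word weights used for ranking.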
iii) Co-occurrence Statistical Information (CSI):
CSI is another statistical measure for finding the weight of the words in the documents, based on the χ² measure. It also identifies words that have high frequency but are not important for finding the concept of the document. The χ² measure calculates the deviation of the observed frequencies from the expected frequencies. The χ² measure of a word $w_j$ is given by

$$\mathrm{CSIW}(w_j) = \chi^2(w_j) = \sum_{w_k \in W} \frac{\big(\mathrm{co\text{-}occur}(w_j, w_k) - \mathrm{co\text{-}occur}(w_j)\,p(w_k)\big)^2}{\mathrm{co\text{-}occur}(w_j)\,p(w_k)}$$

in which $p(w_k)$ is the probability that the word $w_k$ occurs in the sentence-term matrix and $\mathrm{co\text{-}occur}(w_j)$ is the total number of co-occurrences of the term $w_j$ with terms $w_k \in W$.

In this case, $\mathrm{co\text{-}occur}(w_j, w_k)$ corresponds to the observed frequency and $\mathrm{co\text{-}occur}(w_j)\,p(w_k)$ to the expected frequency. Generally, documents are composed of sentences of variable length, and a word used in a long sentence tends to co-occur with more of the words used in that sentence, so the keyword extraction approach may identify keywords erroneously. To rectify such false identification, the CSI measure redefines $p(w_k)$ as the sum of the total number of words in sentences where $w_k$ appears divided by the total number of words in the document, and $\mathrm{co\text{-}occur}(w_j)$ as the total number of words in sentences where $w_j$ appears. Moreover, the value of the χ² measure can be influenced by unimportant but adjunct terms. To make the method more robust to this type of situation, the authors of the CSI measure subtract from $\chi^2(w_j)$ the maximal χ² term for any $w_k \in W$, i.e.:

$$\mathrm{CSIW}(w_j) = \chi^2(w_j) - \max_{w_k \in W} \frac{\big(\mathrm{freq}(w_j, w_k) - n_{w_j}\,p(w_k)\big)^2}{n_{w_j}\,p(w_k)}$$
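A minimal sketch of the CSI weight under the paper's redefinitions follows. Treating every distinct pair of words within a sentence as one co-occurrence is an assumption, as is iterating over the full vocabulary rather than a top-frequency subset.

from collections import Counter, defaultdict

def csi_weights(sentences):
    """CSI weight per word. Per the paper's redefinitions, both co-occur(wj) and the
    numerator of p(wk) equal the total number of words in sentences containing the word."""
    total_words = sum(len(s) for s in sentences)
    n_w = defaultdict(int)        # words in sentences containing w
    cooc = defaultdict(Counter)   # sentence-level co-occurrence counts
    for s in sentences:
        uniq = set(s)
        for w in uniq:
            n_w[w] += len(s)
            for v in uniq:
                if v != w:
                    cooc[w][v] += 1
    weights = {}
    for w in n_w:
        terms = []
        for v in n_w:
            if v == w:
                continue
            expected = n_w[w] * n_w[v] / total_words   # co-occur(wj) * p(wk)
            terms.append((cooc[w][v] - expected) ** 2 / expected)
        # subtracting the maximal term damps the effect of a single adjunct word
        weights[w] = sum(terms) - (max(terms) if terms else 0.0)
    return weights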
V. Document Clustering
5.1 Clustering process
Clustering is the process of grouping similar documents into subsets based on their aspects; each subset is a cluster, i.e., within a cluster the documents are similar to each other. Unlike classification, clustering does not need any training data, and because of this it is better suited to clustering unsupervised documents. The proposed method follows an agglomerative hierarchical clustering approach to construct the clusters. It is a bottom-up method: it starts by letting each document be a cluster and then iteratively either merges clusters into larger clusters or splits clusters. The merging process is applied when two clusters are close to each other according to the similarity measure; conversely, the splitting process is applied when two clusters are far apart. This iteration continues until the termination constraint is reached.
5.2 Similarity Measure
In this section, the distance between the documents is calculated from the derived document features:

$$\mathrm{Dist}(D_iF_i, D_jF_j) = \sqrt{\sum_{k=1}^{m} |d_{ik} - d_{jk}|^2}, \quad i \neq j$$

where $D_iF_i$ and $D_jF_j$ represent the features of the documents $D_i$ and $D_j$ respectively, taken as two individual clusters $C_i$ and $C_j$. If the distance between the two clusters is maximal, there are no common features between them; conversely, the distance between two clusters is minimal when the two clusters have common features.
5.3 Merging of clusters
The range of values allowed for the distance is 0 to √2. To normalize the similarity values, the similarity between the clusters $i$ and $j$ is defined as

$$\mathrm{Sim}(F_i, F_j) = 1 - \frac{\sqrt{\sum_{k=1}^{m} |d_{ik} - d_{jk}|^2}}{\sqrt{2}}, \quad i \neq j$$

By this similarity measure, the value 0 is assigned when the two documents are far apart and 1 when the two documents are close. The closest pair of clusters is merged to form one larger cluster:

$$\mathrm{Cluster}(C_i, C_j) \iff \max\{\mathrm{Sim}(D_iF_i, D_jF_j)\},\; i \neq j \;\text{and}\; \mathrm{Sim}(D_iF_i, D_jF_j) \geq \theta$$

where $C_i$ and $C_j$ are the clusters to be merged, $\mathrm{Sim}(D_iF_i, D_jF_j)$ is the similarity between $C_i$ and $C_j$, and $F_i$ and $F_j$ are the features of $C_i$ and $C_j$, respectively.
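To make the measures of Sections 5.2 and 5.3 concrete, here is a small sketch assuming unit-normalized feature vectors (which is what bounds the distance by √2) and a hypothetical threshold θ = 0.7, since the paper does not state its θ.

import math

def distance(fi, fj):
    """Euclidean distance between two unit-normalized feature vectors (range 0..sqrt(2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fi, fj)))

def similarity(fi, fj):
    """Normalized similarity: 1 when the documents coincide, 0 when maximally far apart."""
    return 1.0 - distance(fi, fj) / math.sqrt(2)

def should_merge(fi, fj, theta=0.7):   # theta is a hypothetical threshold
    return similarity(fi, fj) >= theta

# Example: two documents sharing 2 of 3 binary features
norm = lambda v: [x / math.sqrt(sum(y * y for y in v)) for x in v]
print(should_merge(norm([1, 1, 0]), norm([1, 1, 1])))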
VI. Experimental Analysis
To validate the proposed algorithm, we conducted experiments on sample data: 40 documents were considered for the experiment. First, the keyword extraction part of the algorithm was applied. In the first stage 3800 words were extracted from the sample documents; after preprocessing, 1428 unique words and 19420 sentences were obtained. The algorithm then constructed the Sentences vs. Words matrix, with the 19420 sentences as rows and the 1428 words as columns, to test the TF-ISF statistical approach. A cell is filled with 1 if the word occurs in the sentence and 0 otherwise.

To weight the extracted words, the three statistical approaches were applied. First, Most Frequent words: it simply identifies the highest-frequency words as the keywords, irrespective of the other measures; of the 1428 unique words, it identified 828 words as keywords (threshold value 5). Second, Term Frequency - Inverse Sentence Frequency (TF-ISF): it computes the weight of a word according to not only its frequency but also its distribution across the sentences of the document. TF-ISF eliminates words with high frequency that do not truly help to identify the concept of the document, because such words occur in a large number of sentences; 710 words were identified by this approach.

Co-occurrence Statistical Information (CSI) is the third statistical measure; it finds the weight of the words in the documents using the χ² measure, and it also identifies words that have high frequency and low presence in the sentences but are not important for finding the concept of the document. The χ² measure captures the deviation of the observed frequencies from the expected frequencies. By applying the CSI approach, 640 words were extracted as keywords of the sample data. By comparing the words produced by the three statistical approaches, 590 words were labeled as keywords of the sample documents. Table 2 shows the results obtained from the three statistical approaches.
Table 2: Keywords extracted by applying the statistical approaches

Rank | MF                    | TF-ISF                | CSI
1    | Data Mining           | Data Mining           | Data Mining
2    | Machine Learning      | Machine Learning      | Machine Learning
3    | Image Mining          | Image Mining          | Image Mining
4    | Recognition           | Database              | Data Encryption
5    | Segmentation          | Data Encryption       | Database
6    | Database              | Graphics              | Data Set
7    | Data Encryption       | Pre-processing        | Pre-processing
8    | Data Compression      | Security System       | Data Compression
9    | Data Set              | Information Retrieval | Segmentation
10   | Pre-processing        | Segmentation          | Data Encryption
11   | Graphics              | Data Set              | Graphics
12   | Information Retrieval | Data Compression      | Security System
13   | Security System       | Router                | Information Retrieval
14   | Hub                   | Neural Networks       | SVM
15   | Router                | SVM                   | Hub
16   | Neural Networks       | Hub                   | Router
17   | SVM                   | Clustering            | Clustering
18   | Clustering            | Recognition           | Recognition
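The 590 final keywords are those on which the three measures agree; a minimal sketch of that comparison step is below. The top_words helper and the cut-off k are assumptions, since the paper reports the resulting counts (828, 710, and 640 words agreeing on 590) but not the exact acceptance rule.

def top_words(weights, k):
    """weights: dict word -> weight; returns the set of the k top-weighted words."""
    return {w for w, _ in sorted(weights.items(), key=lambda x: -x[1])[:k]}

def agreed_keywords(hf, tf_isf, csi, k=800):   # k is a hypothetical cut-off
    """Keywords accepted by all three statistical measures."""
    return top_words(hf, k) & top_words(tf_isf, k) & top_words(csi, k)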
From the values in Table 2, the top-ranked keywords of 14 documents are derived and displayed in Table 3.
Table 3: Document keyword representation

Doc. Id | Features/Keywords Extracted
D1  | {data mining, dataset, preprocessing, information retrieval, machine learning}
D2  | {image mining, graphics, recognition, segmentation}
D3  | {database, data encryption, data compression, data mining}
D4  | {data mining, preprocessing, dataset, machine learning}
D5  | {database, data compression, data encryption}
D6  | {image mining, recognition, segmentation}
D7  | {security system, hub, router}
D8  | {database, data encryption, data compression}
D9  | {image mining, recognition, graphics, segmentation}
D10 | {neural network, SVM, clustering}
D11 | {data mining, machine learning, dataset}
D12 | {data mining, preprocessing, machine learning, information retrieval}
D13 | {image mining, recognition, segmentation}
D14 | {database, data encryption, data compression}

The clustering part of the proposed algorithm proceeds as follows:

Step 1: Initialize T (tree)
Step 2: For each Di in the document set
  i.   ci <- preprocessed(Di)
  ii.  add ci to T as a separate node
  iii. Repeat for each pair of clusters Cj and Ck in T:
         if Cj and Ck are the closest pair of clusters in T
         then
           merge(Cj, Ck)
           recompute cluster feature vectors and mark changed
         else
           split(Cj, Ck)
           recompute cluster feature vectors and mark changed
         endif
       until no cluster is changed
End for
Step 3: Return T
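To ground the procedure, the following sketch runs the merge loop over the Table 3 feature sets; the unit-normalized binary-vector encoding and the threshold θ = 0.7 are assumptions, since the paper does not state its θ or its exact feature update rule.

import math

# Feature sets from Table 3 (first three shown; the rest follow the same pattern)
docs = {
    "D1": {"data mining", "dataset", "preprocessing", "information retrieval", "machine learning"},
    "D2": {"image mining", "graphics", "recognition", "segmentation"},
    "D3": {"database", "data encryption", "data compression", "data mining"},
}

vocab = sorted(set().union(*docs.values()))

def vector(features):
    """Unit-normalized binary feature vector over the shared vocabulary."""
    v = [1.0 if w in features else 0.0 for w in vocab]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sim(a, b):
    """Normalized similarity of Section 5.3 (1 = identical features, 0 = none shared)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 - d / math.sqrt(2)

def agglomerate(docs, theta=0.7):  # theta is a hypothetical threshold
    """Repeatedly merge the closest pair of clusters whose similarity reaches theta."""
    clusters = [({name}, vector(feats)) for name, feats in docs.items()]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sim(clusters[i][1], clusters[j][1])
                if s >= theta and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            return [sorted(members) for members, _ in clusters]
        _, i, j = best
        members = clusters[i][0] | clusters[j][0]
        merged_feats = set().union(*(docs[d] for d in members))  # union of member keyword sets
        clusters[i] = (members, vector(merged_feats))
        del clusters[j]

print(agglomerate(docs))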
After extracting the features, the distance between the documents is measured by the similarity measure; the resulting matrix gives the distance between the sample documents as numerical values ranging from 0 to 1.

To create the clusters, the second part of the algorithm executes by taking document D1 as the first node; the values of D1 and D2, D1 and D3, D1 and D4, D1 and D5, ..., D1 and D14 are compared. Values closer to 1 indicate that the documents deal with the same concepts. The values between D1 and D4, D11, D12 are 0.9, 0.8, and 0.7 respectively, which are close to 1 and form the cluster C1 = {D1, D4, D11, D12}. Likewise, the values between D2 and D6, D9, D13 are 0.8, 0.9, and 0.9 respectively, so the second cluster merges the documents D6, D9, and D13 with D2, forming C2 = {D2, D6, D9, D13}. The values between D3 and D5, D8, D14 are 0.9, 0.9, and 0.8, forming the third cluster C3 = {D3, D5, D8, D14}. It is also seen from the table that the values of D7 and D10 against all the other documents are far from 1, which shows that the concepts of D7 and D10 are far from the concepts of the other documents taken for testing. Finally, three clusters are formed from the testing data set.

Figure 1 shows the distance values between the documents: C1 = {D1, D4, D11, D12}, C2 = {D2, D6, D9, D13}, C3 = {D3, D5, D8, D14}.
VII. Conclusion
In this paper, we proposed an algorithm for feature extraction and document clustering. Three statistical approaches are applied to the words of the documents; because of the different underlying natures of these approaches, the spurious features are eliminated. The values produced by the statistical approaches are compared, and the top-ranked words are labeled as the keywords, or features, of the documents. Then the distance between the documents is calculated using the extracted features, and by applying the similarity measure the documents with close concepts form the clusters. In summary, according to the experimental results, the document clustering method proposed in this paper handles unstructured, unlabelled documents effectively.
References
[1] Rafael Geraldeli Rossi, Ricardo Marcondes Marcacini, and Solange Oliveira Rezende, "Analysis of Statistical Keyword Extraction Methods for Incremental Clustering", 2013.
[2] Jasmeen Kaur and Vishal Gupta, "Effective Approaches for Extraction of Keywords", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, 2010.
[3] Mohamed H. Haggag, "Keyword Extraction using Semantic Analysis", International Journal of Computer Applications, Vol. 61, No. 1, 2013.
[4] Rafael Geraldeli Rossi, Ricardo Marcondes Marcacini, and Solange Oliveira Rezende, "Analysis of Domain Independent Statistical Keyword Extraction Methods for Incremental Clustering", Journal of the Brazilian Society on Computational Intelligence (SBIC), Vol. 12, Iss. 1, pp. 17-37, 2014.
[5] S. Wang, M. Y. Wang, J. Zheng, and K. Zheng, "A Hybrid Keyword Extraction Method Based on TF and Semantic Strategies for Chinese Document", Applied Mechanics and Materials, Vols. 635-637, pp. 1476-1479, 2014.
[6] Milos Ilic, Petar Spalevic, and Mladen Veinovic, "Suffix Tree Clustering - Data Mining Algorithm", ERK 2014, Portoroz, B:15-18, 2014.
[7] Tao Peng and Lu Liu, "A Novel Incremental Conceptual Hierarchical Text Clustering Method Using CFu-tree", Applied Soft Computing, Vol. 27, pp. 269-278, 2015.
[8] Anna Huang, "Similarity Measures for Text Document Clustering", New Zealand Computer Science Research Student Conference, pp. 49-56, 2008.
[9] Fathi H. Saad, Omer I. E. Mohamed, and Rafa E. Al-Qutaish, "Comparison of Hierarchical Agglomerative Algorithms for Clustering Medical Documents", International Journal of Software Engineering & Applications, Vol. 3, No. 3, 2012.
[10] Gopal Patidar, Anju Singh, and Divakar Singh, "An Approach for Document Clustering using Agglomerative Clustering and Hebbian-type Neural Network", International Journal of Computer Applications, Vol. 75, No. 9, 2013.