Nowadays, document clustering is considered a data-intensive task due to the dramatic, rapid increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model, aimed at clustering very large document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words and enhance document clustering results. Our experimental results show that using lexical categories for nouns only improves the internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted on Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
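The distributed MapReduce implementation itself is not reproduced in this listing; the following is a minimal single-machine sketch of the underlying bisecting k-means loop. The farthest-point seeding and the largest-cluster splitting heuristic are illustrative assumptions, not necessarily the paper's choices.

```python
import numpy as np

def kmeans2(X, iters=10):
    """Plain 2-means on the rows of X. The second seed starts at the
    point farthest from the first, so the two seeds never coincide."""
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        # assign each row to its nearest center, then recompute means
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    """Repeatedly bisect the largest cluster until k clusters exist.
    (Largest-cluster splitting is one common heuristic; others use SSE.)"""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        members = clusters.pop(i)
        labels = kmeans2(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters
```

In the distributed version, the assignment step of each 2-means pass is what maps naturally onto MapReduce: mappers assign documents to the nearer center, and reducers recompute the two centroids.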
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... — ijnlc
The tremendous increase in the number of available research documents impels researchers to propose topic models that extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from scalability problems as the size of the collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach uses a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm; in the evaluation, the proposed approach achieves topic coherence comparable to LDA implemented in the MapReduce framework.
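The paper's MapReduce CTM is not shown here, but the general pattern it relies on, mappers emitting per-document statistics and reducers summing them, can be illustrated with a toy word-count analogue (the function names are illustrative, not from the paper):

```python
from collections import defaultdict

def map_phase(doc_id, words):
    """Mapper: emit (key, value) pairs for one document's tokens.
    In variational EM the values would be per-document sufficient
    statistics rather than raw counts."""
    return [(w, 1) for w in words]

def reduce_phase(pairs):
    """Reducer: sum values per key, as the MapReduce shuffle/sum
    step would after grouping pairs by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

The E-step parallelizes across documents in the mappers; the M-step corresponds to the reducers aggregating the emitted statistics into updated model parameters.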
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi... — iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
With the ever-increasing number of documents on the web and in other repositories, organizing and categorizing these documents to meet the diverse needs of users by manual means is a complicated job; hence a machine learning technique named clustering is very useful. Text documents are clustered by pairwise similarity using measures such as Cosine, Jaccard or Pearson. The best clustering results are seen when the overlap of terms between documents is small, that is, when clusters are distinguishable. To address this problem, we measure document similarity using the link and neighbor concepts introduced in ROCK: significantly similar documents are called neighbors, and the link of a pair of documents is the number of neighbors they share. This work applies links and neighbors to Bisecting K-means clustering for identifying seed documents in the dataset, as a heuristic for choosing the cluster to be partitioned, and as a means of finding the number of partitions possible in the dataset. Our experiments on real-world datasets showed a significant improvement in accuracy with minimal time.
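Under ROCK's definitions, the neighbor and link computations sketched above can be written directly (the cosine measure and the 0.5 threshold are illustrative choices, not necessarily the paper's):

```python
import numpy as np

def neighbors(X, theta=0.5):
    """Documents i and j are neighbors if their cosine similarity
    is at least theta. Returns a boolean adjacency matrix."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    adj = unit @ unit.T >= theta
    np.fill_diagonal(adj, False)  # a document is not its own neighbor
    return adj

def links(adj):
    """ROCK's link(i, j): the number of neighbors shared by i and j,
    obtained by squaring the adjacency matrix."""
    a = adj.astype(int)
    return a @ a
```

A high link count indicates two documents sit in the same dense region even when their direct term overlap is modest, which is why it helps when clusters are hard to distinguish.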
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING — ijcsit
In today's world of the internet, with a whole lot of e-documents such as HTML pages and digital libraries occupying a considerable amount of cyberspace, organizing these documents has become a practical need. Clustering is an important technique that organizes a large number of objects into smaller coherent groups, which helps in the efficient and effective use of these documents for information retrieval and other NLP tasks. Email is one of the most frequently used e-documents by individuals and organizations, and email categorization is one of the major tasks of email mining: categorizing emails into different groups helps easy retrieval and maintenance. Like other e-documents, emails can be grouped using clustering algorithms. In this paper a similarity measure called Similarity Measure for Text Processing is suggested for email clustering.
The suggested similarity measure takes into account three situations: a feature appears in both emails, a feature appears in only one email, and a feature appears in neither email. The potency of the suggested similarity measure is analyzed on the Enron email dataset, and the outcome indicates that it achieves better efficiency than other measures.
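The exact formula of the suggested measure is not given in this summary. The toy function below only illustrates the three-case structure the abstract describes (both present, one present, neither present); its specific terms, Gaussian agreement, zero penalty, and smoothing constant, are assumptions for illustration, not the published measure.

```python
import math

def three_case_similarity(x, y, lam=1.0, sigma=1.0):
    """Toy three-case similarity over feature vectors x and y:
    - feature in both emails  -> Gaussian agreement term,
    - feature in one email    -> contributes zero (a penalty),
    - feature in neither      -> skipped entirely, which is what
      distinguishes this family from plain vector measures."""
    num, cnt = 0.0, 0
    for a, b in zip(x, y):
        if a > 0 and b > 0:
            num += 0.5 * (1 + math.exp(-((a - b) / sigma) ** 2))
            cnt += 1
        elif a > 0 or b > 0:
            cnt += 1  # present in only one email: counted, adds nothing
    f = num / cnt if cnt else 0.0
    return (f + lam) / (1 + lam)  # lam smooths toward a neutral value
```

Identical feature vectors score 1.0, while fully disjoint ones score only the smoothed floor, so shared absences never inflate the similarity.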
International Journal of Engineering Research and Development (IJERD) — IJERD Editor
International Journal of Engineering Research and Development is an international premier peer-reviewed open-access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and translational knowledge in engineering, technology and related disciplines.
Distribution Similarity based Data Partition and Nearest Neighbor Search on U... — Editor IJMTER
Databases are built with a fixed number of fields and records, whereas an uncertain database contains a varying number of fields and records. Clustering techniques are used to group relevant records based on similarity values. Conventional similarity measures are designed to estimate the relationship between transactions with fixed attributes; uncertain data similarity is estimated using these measures with some modifications.
Clustering uncertain data is one of the essential tasks in mining uncertain data. Existing methods extend traditional partitioning clustering methods such as k-means and density-based clustering methods such as DBSCAN to uncertain data, but such methods cannot handle uncertain objects well: probability distributions, which are essential characteristics of uncertain objects, have not been considered when measuring similarity between uncertain objects.
Customer purchase transaction data is analyzed using an uncertain data clustering scheme. A density-based clustering mechanism is used for the uncertain data clustering process, but this model produces results with minimal accuracy. The clustering technique is therefore improved with a distribution-based similarity model for uncertain data, and a nearest neighbor search technique is applied in the distribution-based data environment. The system is designed using Java as a front end and Oracle as a back end.
An Improved Similarity Matching based Clustering Framework for Short and Sent... — IJECEIAES
Text clustering plays a key role in navigation and browsing. For efficient text clustering, a large amount of information is grouped into meaningful clusters. Many text clustering techniques do not address issues such as high time and space complexity, inability to understand the relational and contextual attributes of words, lack of robustness, and risks related to privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is chosen as the input dataset. Once the input dataset is preprocessed, the similarity between words is computed using cosine similarity. The similarities between the components are compared and the vector data is created; from the vector data the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed framework is analyzed using metrics such as Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR) and processing time. The experimental results show that the proposed framework produces better MSE, PSNR and processing time than the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
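The cosine similarity step mentioned above can be sketched over simple term-frequency vectors (a minimal stdlib version; real pipelines would typically use TF-IDF weights):

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two token lists, computed on their
    term-frequency vectors: dot product over the product of norms."""
    ta, tb = Counter(doc_a), Counter(doc_b)
    dot = sum(ta[w] * tb[w] for w in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical documents score 1.0 and documents with no shared terms score 0.0, which is what makes the measure a natural input for the vector-data construction described in the abstract.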
International Journal of Engineering and Science Invention (IJESI) — inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
XCLS++: A new algorithm to improve XCLS+ for clustering XML documents — IJITCA Journal
The purpose of this paper is to offer a method for clustering XML documents. Many approaches to document clustering have been discussed; they can be divided into those based on structure, on content, and on a combination of structure and content. One structure-based method proposed for this problem is XCLS, which was later improved as XCLS+. XCLS+ is an efficient algorithm for clustering XML documents, but the survey conducted here shows that XCLS+ has a problem that keeps it far from its optimal value. This paper presents a method named XCLS++ that does not suffer from the problem of XCLS+ and whose efficiency is shown to exceed that of XCLS+.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS — ijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining its context. Typically, clustering is performed on numerical data; however, it is often useful to cluster categorical data as well, especially when dealing with real-world contexts. Several methods exist that can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements such as GloVe and t-SNE to develop a context-aware clustering approach using pre-trained word embeddings. We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points with common clustering algorithms such as K-means.
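A minimal sketch of that pipeline follows, with a toy two-dimensional embedding table standing in for pre-trained GloVe vectors (real GloVe vectors are 50 to 300 dimensions; the table values and the deterministic initialization are illustrative assumptions):

```python
import numpy as np

# Toy stand-in for a pre-trained GloVe lookup: animals near one axis,
# vehicles near the other, so context is encoded in the geometry.
EMB = {
    "cat": np.array([0.9, 0.1]), "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]), "bus": np.array([0.2, 0.8]),
}

def cluster_words(words, k=2, iters=10):
    """Embed categorical values, then run plain k-means on the vectors.
    Returns a word -> cluster-id mapping."""
    X = np.stack([EMB[w] for w in words])
    centers = X[:k].copy()  # deterministic seeding, for the sketch only
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return dict(zip(words, labels.tolist()))
```

Because the embeddings carry context, words that never co-occur as identical strings ("cat" and "dog") still land in the same cluster, which is the point of the approach.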
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Abstract: Traditional approaches for document classification need labelled data to construct reliable and accurate classifiers. Unfortunately, labelled data are rarely available, and often too costly to obtain. For a learning task for which training data is unavailable, abundant labelled data may exist in a different but related domain, and one would like to use that related labelled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when the auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has previously been proposed to tackle cross-domain text classification. In this work, we extend the idea underlying that approach by making the latent semantic relationship between the two domains explicit, a goal achieved with the use of Wikipedia. As a result, the pathway that allows propagating labels between the two domains captures not only common words but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.
Keywords: Classification, Clustering, Cross-domain Text Classification, Co-clustering, Labelled data, Traditional Approaches.
Title: Co-Clustering For Cross-Domain Text Classification
Author: Rayala Venkat, Mahanthi Kasaragadda
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Discovering Novel Information with sentence Level clustering From Multi-docu... — irjes
The specific objective is to discover novel information from a set of documents initially retrieved in response to some query. Sentence-level text clustering, and its effective use and update, is still an open research issue, especially in the domain of text mining. Most existing systems assign each pattern to a single cluster, but here patterns may belong to all clusters with different degrees of membership. Since the documents were retrieved for the query, we would expect at least one of the sentence clusters to be closely related to the concepts described by the query terms. This paper presents a novel fuzzy clustering algorithm that operates on relational input data, i.e. data in the form of a square matrix of pairwise similarities between data objects.
Seeds Affinity Propagation Based on Text Clustering — IJRES Journal
The objective is to find, among all partitions of the data set, the best partitioning according to some quality measure. Affinity propagation is a low-error, high-speed, flexible, and remarkably simple clustering algorithm that may be used in forming teams of participants for business simulations and experiential exercises, and in organizing participants' preferences for the parameters of simulations. This paper proposes an efficient affinity propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of the approach is (1) to prune unnecessary message exchanges during the iterations and (2) to compute the convergence values of the pruned messages after the iterations to determine the clusters.
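The pruning scheme itself is not detailed in this summary; what it accelerates are the textbook affinity-propagation message updates, which a baseline (unpruned) implementation computes for every pair at every iteration:

```python
import numpy as np

def affinity_propagation(S, iters=100, damping=0.5):
    """Baseline affinity propagation on similarity matrix S (diagonal
    holds the 'preference' of each point to be an exemplar).
    Returns each point's chosen exemplar index."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities
    rows = np.arange(n)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx].copy()
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min(0, r(k,k) + sum over other points of max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = np.diag(R)
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(Anew).copy()
        Anew = np.minimum(Anew, 0)
        Anew[rows, rows] = dA
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)
```

Every entry of R and A is recomputed each round here; the paper's contribution is detecting which of these messages can be skipped without changing the converged result.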
The Search of New Issues in the Detection of Near-duplicated Documents — ijceronline
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
A comparative analysis of particle swarm optimization and k means algorithm f... — ijnlc
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, but this representation is often unsatisfactory because it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word; the bag-of-terms representation is thus enriched with synonyms from WordNet. K-means, PSO and hybrid PSO+K-means are applied to clustering text in the Nepali language, and experimental evaluation is performed using intra-cluster and inter-cluster similarity.
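The synset enrichment step can be sketched with a toy synonym table standing in for real WordNet lookups (the paper works with Nepali text; the English entries below, and the table itself, are purely illustrative):

```python
# Toy stand-in for WordNet synset lookups: each word maps to the full
# set of terms in its synset. A real system would query a lexical
# database instead of a hard-coded table.
SYNSETS = {
    "fast": {"fast", "quick", "rapid"},
    "quick": {"fast", "quick", "rapid"},
}

def enrich(tokens):
    """Expand each token with its synset, so documents that use
    different synonyms end up with overlapping bags of terms."""
    bag = set()
    for t in tokens:
        bag |= SYNSETS.get(t, {t})  # words without a synset pass through
    return bag
```

After enrichment, "fast car" and "quick car" produce identical bags, so any similarity measure applied afterwards sees them as related even though they share only one surface term.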
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering — IJMER
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION — IJDKP
This article introduces approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, several categories very similar in content and related to the telecommunications, Internet and computing areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION — cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, fast fuzzy feature clustering for text classification is proposed, based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011. The words in the feature vector of a document are grouped into clusters in fewer iterations: the number of iterations required to obtain the cluster centers is reduced by transforming the cluster-center dimension from n dimensions to 2 dimensions, using Principal Component Analysis with a slight change for the dimension reduction. Experimental results show that this method improves performance by significantly reducing the number of iterations required to obtain the cluster centers, which is verified with three benchmark datasets.
does not exploit the semantics. In this paper, texts are represented in terms of synsets corresponding to a
word. Bag of terms data representation of text is thus enriched with synonyms from WordNet. K-means,
Particle Swarm Optimization (PSO) and hybrid PSO+K-means algorithms are applied for clustering of
text in Nepali language. Experimental evaluation is performed by using intra cluster similarity and inter
cluster similarity.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...kevig
The tremendous increase in the amount of available research documents impels researchers to propose
topic models to extract the latent semantic themes of a documents collection. However, how to extract the
hidden topics of the documents collection has become a crucial task for many topic model applications.
Moreover, conventional topic modeling approaches suffer from the scalability problem when the size of
documents collection increases. In this paper, the Correlated Topic Model with variational Expectation-
Maximization algorithm is implemented in MapReduce framework to solve the scalability problem. The
proposed approach utilizes the dataset crawled from the public digital library. In addition, the full-texts of
the crawled documents are analysed to enhance the accuracy of MapReduce CTM. The experiments are
conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed
approach has a comparable performance in terms of topic coherences with LDA implemented in
MapReduce framework.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
This article will introduce some approaches for improving text categorization models by integrating
previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very
similar in content and related to telecommunications, Internet and computer areas were selected for models
experiments. Several domain ontologies, covering these areas were built and integrated to categorization
models for their improvements.
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION cscpconf
Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text
classification. In this paper, Fast Fuzzy Feature clustering for text classification is proposed. It
is based on the framework proposed by Jung-Yi Jiang, Ren-Jia Liou and Shie-Jue Lee in 2011.
The word in the feature vector of the document is grouped into the cluster in less iteration. The
numbers of iterations required to obtain cluster centers are reduced by transforming clusters
center dimension from n-dimension to 2-dimension. Principle Component Analysis with slit
change is used for dimension reduction. Experimental results show that, this method improve
the performance by significantly reducing the number of iterations required to obtain the cluster
center. The same is being verified with three benchmark datasets
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING cscpconf
In the last decade, ontologies have played a key technology role for information sharing and agents interoperability in different application domains. In semantic web domain, ontologies are efficiently used toface the great challenge of representing the semantics of data, in order to bring the actual web to its full
power and hence, achieve its objective. However, using ontologies as common and shared vocabularies requires a certain degree of interoperability between them. To confront this requirement, mapping ontologies is a solution that is not to be avoided. In deed, ontology mapping build a meta layer that allows different applications and information systems to access and share their informations, of course, after resolving the different forms of syntactic, semantic and lexical mismatches. In the contribution presented in this paper, we have integrated the semantic aspect based on an external lexical resource, wordNet, to design a new algorithm for fully automatic ontology mapping. This fully automatic character features the
main difference of our contribution with regards to the most of the existing semi-automatic algorithms of ontology mapping, such as Chimaera, Prompt, Onion, Glue, etc. To better enhance the performances of our algorithm, the mapping discovery stage is based on the combination of two sub-modules. The former
analysis the concept’s names and the later analysis their properties. Each one of these two sub-modules is
it self based on the combination of lexical and semantic similarity measures.
Data-to-text technologies present an enormous and exciting opportunity to help
audiences understand some of the insights present in today’s vasts and growing amounts of electronic
data. In this article we analyze the potential value and benefits of these solutions as well as their risks
and limitations for a wider penetration. These technologies already bring substantial advantages of
cost, time, accuracy and clarity versus other traditional approaches or format. On the other hand,
there are still important limitations that restrict the broad applicability of these solutions, most
importantly in the limited quality of their output. However we find that the current state of
development is sufficient for the application of these solution across many domains and use cases and
recommend businesses of all sectors to consider how to deploy them to enhance the value they are
currently getting from their data. As the availability of data keeps growing exponentially and natural
language generation technology keeps improving, we expect data-to-text solutions to take a much
more bigger role in the production of automated content across many different domains.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGijnlc
In this paper, we propose a novel algorithm that rearrange the topic assignment results obtained from topic
modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how much
the results conform to expert opinion, which is a data structure called TDAG that we defined to represent the
probability that a pair of highly correlated words appear together. In order to make sure that the internal
structure does not get changed too much from the rearrangement, coherence, which is a well known metric
for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure.
We developed two ways to systematically obtain the expert opinion from data, depending on whether the
data has relevant expert writing or not. The final algorithm which takes into account both coherence and
expert opinion is presented. Finally we compare amount of adjustments needed to be done for each topic
modeling method, NMF and LDA.
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTSijsc
Classification is an important technique used in information retrieval. Supervised classification suffers
from certain limitations concerning the collection and labeling of the training dataset. When facing Multi-
Domain classification, multiple training datasets and classifiers are needed which is relatively difficult. In
this paper an unsupervised classification system is proposed that can manage the Multi-Domain
classification problem as well. It is a multi-domain system where each domain represented by an ontology.
A document is mapped on each ontology based on the weights of the mutual tokens between them with the
help of fuzzy sets, resulting in a mapping degree of the document with each domain. An experiment carried
out showing satisfying classification results with an improvement in the evaluation results of the proposed
system compared to Apache Lucene.
A Proposed Multi-Domain Approach for Automatic Classification of Text Documents ijsc
Classification is an important technique used in information retrieval. Supervised classification suffers from certain limitations concerning the collection and labeling of the training dataset. When facing MultiDomain classification, multiple training datasets and classifiers are needed which is relatively difficult. In this paper an unsupervised classification system is proposed that can manage the Multi-Domain classification problem as well. It is a multi-domain system where each domain represented by an ontology. A document is mapped on each ontology based on the weights of the mutual tokens between them with the help of fuzzy sets, resulting in a mapping degree of the document with each domain. An experiment carried out showing satisfying classification results with an improvement in the evaluation results of the proposed system compared to Apache Lucene.
Information residing in relational databases and delimited file systems are inadequate for reuse and sharing over the web. These file systems do not adhere to commonly set principles for maintaining data harmony. Due to these reasons, the resources have been suffering from lack of uniformity, heterogeneity as well as redundancy throughout the web. Ontologies have been widely used for solving such type of problems, as they help in extracting knowledge out of any information system. In this article, we focus on extracting concepts and their relations from a set of CSV files. These files are served as individual concepts and grouped into a particular domain, called the domain ontology. Furthermore, this domain ontology is used for capturing CSV data and represented in RDF format retaining links among files or concepts. Datatype and object properties are automatically detected from header fields. This reduces the task of user involvement in generating mapping files. The detail analysis has been performed on Baseball tabular data and the result shows a rich set of semantic information.
Towards optimize-ESA for text semantic similarity: A case study of biomedical...IJECEIAES
Explicit Semantic Analysis (ESA) is an approach to measure the semantic relatedness between terms or documents based on similarities to documents of a references corpus usually Wikipedia. ESA usage has received tremendous attention in the field of natural language processing NLP and information retrieval. However, ESA utilizes a huge Wikipedia index matrix in its interpretation by multiplying a large matrix by a term vector to produce a high-dimensional vector. Consequently, the ESA process is too expensive in interpretation and similarity steps. Therefore, the efficiency of ESA will slow down because we lose a lot of time in unnecessary operations. This paper propose enhancements to ESA called optimize-ESA that reduce the dimension at the interpretation stage by computing the semantic similarity in a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full version ESA approach.
Text document clustering and similarity detection is the major part of document management, where every document should be identified by its key terms and domain knowledge. Based on the similarity, the documents are grouped into clusters. For document similarity calculation there are several approaches were proposed in the existing system. But the existing system is either term based or pattern based. And those systems suffered from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity by applying back propagation time stamp algorithm. It discovers patterns in text documents as higher level features and creates a network for fast grouping. It also detects the most appropriate patterns based on its weight and BPTT performs the document similarity measures. Using this approach, the document can be categorized easily. In order to perform the above, a new approach is used. This helps to reduce the training process problems. The above framework is named as BPTT. The BPTT has implemented and evaluated using dot net platform with different set of datasets.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE

International Journal of Database Management Systems (IJDMS) Vol.7, No.2, April 2015
DOI: 10.5121/ijdms.2015.7201
Abdelrahman Elsayed 1, Hoda M. O. Mokhtar 2, and Osama Ismail 3

1 Central Lab for Agriculture Expert Systems, Agriculture Research Center, Egypt
2,3 Faculty of Computers and Information, Cairo University, Egypt
ABSTRACT
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in
the number of available documents. Moreover, the number of features that represent those documents is
also very large. The most common method for representing documents is the vector space model, which
represents document features as a bag of words and does not capture semantic relations between words.
In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce
programming model. The aim behind our proposed implementation is to solve the problem of clustering
data-intensive document collections. In addition, we propose integrating the WordNet ontology with
bisecting k-means in order to utilize the semantic relations between words to enhance document
clustering results. Our experimental results show that using lexical categories for nouns only enhances
the internal evaluation measures of document clustering, and reduces the number of document features
from thousands to tens. Our experiments were conducted using Amazon Elastic MapReduce to deploy
the bisecting k-means algorithm.
KEYWORDS
Document clustering, Ontology, Text Mining, Distributed Computing
1. INTRODUCTION
Document clustering is the process of grouping similar documents. It helps in organizing documents and
enhances the way search engine results are displayed. "Grouper" is an example of a document clustering
interface; it dynamically groups search results into clusters and can be integrated with any search
engine [1]. In most document clustering approaches, documents are represented using the vector space
model [2]. Given a set of n documents {d1, d2, d3, ..., dn}, the m unique terms are extracted from these
documents, and the documents are represented as an n × m matrix. Each row of the matrix represents a
document vector, and the columns represent document terms. In the vector space model, terms are
represented as a bag of words. The main drawback of this representation is that it ignores the semantic
relations between terms [3].
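As a toy illustration of the vector space model described above (the two short documents are hypothetical), the following sketch builds an n × m term-document matrix of raw term counts:

```python
# Toy sketch of the vector space model: n documents become an n x m
# term-document matrix over the m unique terms (bag of words).
docs = [
    "cheese bread cheese",
    "bread meat",
]

# Extract the m unique terms across all documents.
terms = sorted({t for d in docs for t in d.split()})

# Each row is a document vector of raw term counts.
matrix = [[d.split().count(t) for t in terms] for d in docs]

print(terms)   # ['bread', 'cheese', 'meat']
print(matrix)  # [[1, 2, 0], [1, 0, 1]]
```

Note that the two vectors share only the feature "bread"; nothing in this representation relates "cheese" to "meat", which is exactly the semantic gap discussed above.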
Ontology has been adopted to enhance document clustering; furthermore, it has been used to solve the
problem of the large number of document features [4]. Inspired by the importance of ontology and its
capability to enhance document clustering, in this research we investigate the use of the WordNet
ontology [5] in order to reduce the huge volume of document features to just 26 features representing
the WordNet lexical noun categories. Moreover, WordNet helps in representing semantic relations
between terms. For example, the words cheese and bread have the same
lexical noun category, noun.food, so both will be represented by the same feature, which consequently
results in dramatically reducing the number of dimensions.
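The category-based reduction can be sketched as follows. The small term-to-category table here is hand-made for illustration only; the actual system looks each noun's lexical category up in WordNet, so the dictionary below is an assumption, not the paper's implementation:

```python
# Sketch of lexical-category feature reduction: instead of one feature
# per term, each noun is mapped to its WordNet-style lexical category
# (e.g. noun.food), collapsing thousands of term features into at most
# 26 noun-category features. The mapping below is a hand-made stand-in
# for a real WordNet lookup.
LEXNAME = {
    "cheese": "noun.food",
    "bread":  "noun.food",
    "dog":    "noun.animal",
    "cat":    "noun.animal",
}

def category_vector(doc, categories):
    """Count how often each lexical category occurs in the document."""
    counts = {c: 0 for c in categories}
    for term in doc.split():
        cat = LEXNAME.get(term)
        if cat is not None:
            counts[cat] += 1
    return [counts[c] for c in categories]

categories = ["noun.food", "noun.animal"]
print(category_vector("cheese bread dog", categories))  # [2, 1]
```

Here "cheese" and "bread" both fall into noun.food, so a document vector has one noun.food feature instead of two separate term features.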
Although using WordNet to reduce document dimensionality enhances document clustering efficiency,
a traditional single-machine implementation of this approach is not very efficient due to the huge
volume of documents. Thus, using a parallel programming paradigm to implement this solution turns
out to be a more appealing approach. We therefore investigated running the document clustering
process in a distributed framework, by adopting the common MapReduce programming model [6].
MapReduce is a programming model initiated by Google's team for processing huge datasets in
distributed systems. It presents a simple and powerful interface that enables automatic parallelization
and distribution of large-scale computations. In addition, it enables programmers who have no
experience with parallel and distributed systems to utilize the resources of a large distributed system
in an easy and efficient way [6].
We also integrate the WordNet ontology with the bisecting k-means algorithm using MapReduce for
document clustering.
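The map/shuffle/reduce phases described above can be simulated in a few lines. The sketch below uses word count, the canonical MapReduce example from [6]; on a real cluster (e.g. Hadoop on Amazon Elastic MapReduce) the framework would run the map and reduce calls in parallel on distributed workers:

```python
from collections import defaultdict

# Minimal single-process simulation of the MapReduce programming model.

def map_phase(doc):
    # Emit an intermediate (term, 1) pair for every term in the document.
    return [(term, 1) for term in doc.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between
    # the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts emitted for one term.
    return key, sum(values)

docs = ["cheese bread", "bread meat bread"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'cheese': 1, 'bread': 3, 'meat': 1}
```

The programmer supplies only the map and reduce functions; partitioning, scheduling, and fault tolerance are handled by the framework, which is what makes the model accessible to non-experts in distributed systems.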
The remainder of this paper is structured as follows: Section 2 introduces related work and a brief
background on k-means and its variant, bisecting k-means. Section 3 introduces the problem and
discusses the proposed system. Section 4 is devoted to explaining the document clustering evaluation
methods that we used to evaluate our proposed clustering approach. Section 5 introduces the details of
the datasets used and discusses the experimental results. Finally, Section 6 concludes and presents
possible directions for future work.
2. BACKGROUND AND RELATED WORK
Ontology is usually referred to as a method for obtaining an agreement on similar activities in a
specific domain [7]. Many ontologies have been developed for different domains, such as the food
ontology [8], the gene ontology [9], and the agriculture ontology [10]. In the following discussion we
overview some research efforts that investigated the use of ontology for enhancing the process of
document clustering.
In [11], the authors integrated the WordNet ontology with document clustering. WordNet organizes
words into sets of synonyms called 'synsets'. They tried to solve the problem of the bag of words
representation, in order to capture relationships between terms which do not co-occur literally. The
authors considered the synsets as concepts and extended the bag of words model by including the
parent concepts (hypernyms) of synsets up to five levels. For example, they utilized the WordNet
ontology to find the similarity between related terms such as "beef" and "pork"; these two terms have
the same parent concept, "meat". Therefore, a document containing the term "beef" will be related to a
document in which the term "pork" appears. The proposed approach was shown to enhance clustering
performance.
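The hypernym-extension idea of [11] can be sketched as follows. The tiny hypernym table is hand-made for illustration; the actual approach walks WordNet synset hypernyms up to five levels:

```python
# Sketch of the hypernym extension from [11]: extend each document's
# bag of words with parent concepts, so documents about "beef" and
# "pork" share the feature "meat" even though the two terms never
# co-occur. The table below is a toy stand-in for WordNet's hypernym
# hierarchy.
HYPERNYM = {
    "beef": "meat",
    "pork": "meat",
    "meat": "food",
}

def extend_with_hypernyms(terms, max_levels=5):
    """Add each term's hypernym chain, up to max_levels parents."""
    extended = set(terms)
    for term in terms:
        current = term
        for _ in range(max_levels):
            parent = HYPERNYM.get(current)
            if parent is None:
                break
            extended.add(parent)
            current = parent
    return extended

print(sorted(extend_with_hypernyms({"beef"})))  # ['beef', 'food', 'meat']
print(sorted(extend_with_hypernyms({"pork"})))  # ['food', 'meat', 'pork']
```

After extension, the two bags overlap on "meat" and "food", so a similarity measure over the extended vectors relates the documents.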
In [12], the authors extended Hotho's experiment by adding all word senses instead of only the first
sense; they also included all hypernym levels instead of just five. They concluded that their approach
does not enhance document clustering, because more noise is added to the dataset by including all word
senses and all hypernym levels. In [13], the authors claim that the drawback of augmenting the original
terms with WordNet synsets is the increased dimensionality of the term space, so they used ontology to
reduce the number of features. In their work they showed that text document clustering is improved by
identifying and using noun features alone. Furthermore, they used word sense disambiguation
techniques in order to resolve polysemy, which occurs when a word has different meanings. In addition,
they used ontology to resolve the synonymy problem, which occurs when two words refer to the same
concept, by using the corresponding concepts and discarding the synonymous terms.
Recupero [4] coped with problems of vector space model, which include high dimensionality of
data and ignoring the semantic relationships between terms. Recupero used WordNet ontology to
International Journal of Database Management Systems (IJDMS), Vol. 7, No. 2, April 2015
find the lexical category of each term, and replaced each term with its corresponding lexical
category. Using the fact that WordNet has 41 lexical categories for representing nouns and verbs,
all document dimensions were reduced to just 41. Recupero used ANNIE, an information
extraction system [14], to determine whether two words or compound words refer to the same
entity. For example, entities like "Tony Blair, Mr. Blair, Prime Minister Tony Blair, and Prime
Minister" will all be represented by one entity.
In the remainder of this section we present previous work on document clustering techniques. K-
means clustering and its variant, bisecting k-means, are the focus of our discussion due to their
wide use and adoption in many applications.
2.1 K-means and Bisecting k-means
K-means is a widely known partitioning clustering technique. The input of k-means is N objects
with M dimensions, and k, the targeted number of clusters. The output of k-means is k
clusters represented by their cluster centers. The main idea of the traditional k-means algorithm is
to maximize intra-cluster similarity and minimize the similarity between different clusters [15]
[16].
K-means is an iterative algorithm. Its steps are repeated until convergence, which is typically
declared either when there are no changes in the cluster centers or after a specific number of
iterations.
Bisecting k-means is a variant of k-means developed by Steinbach et al. [17]; it has been applied
in document clustering. In [17] the authors showed that bisecting k-means gives better document
clustering results than the basic k-means algorithm. In brief, given a desired number of clusters k,
bisecting k-means proceeds as follows:
1- Use the basic k-means algorithm to split the input dataset into two sub-clusters.
2- Select one of the resulting clusters to be the next input dataset.
3- Repeat steps 1 and 2 until reaching the desired number of clusters.
Choosing which cluster to split may be done by either selecting the largest cluster or selecting
the cluster that has the least overall similarity. In [17] the authors concluded that the difference
between the two methods is small.
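The bisection loop above can be sketched in a few lines. The following is a minimal, self-contained Python illustration (not the paper's MapReduce implementation): a basic k-means on NumPy arrays, plus a driver that repeatedly splits the largest cluster, which is one of the two selection options mentioned above.

```python
import numpy as np

def kmeans(X, k=2, n_iter=20, seed=0):
    """Basic k-means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def bisecting_kmeans(X, k):
    """Repeatedly split the largest cluster until k clusters exist."""
    clusters = [np.arange(len(X))]   # start with one cluster holding all points
    while len(clusters) < k:
        # split the largest cluster (one of the options discussed in [17])
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels, _ = kmeans(X[idx], k=2)
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters
```

On well-separated data this recovers the expected grouping; a production version would also handle degenerate splits and use a real convergence test rather than a fixed iteration count.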
3. PROBLEM STATEMENT AND PROPOSED SOLUTION
Document clustering faces two main problems: the huge volume of documents and the large
number of document features. In this paper we propose a novel approach to enhance the
efficiency of document clustering. The proposed system is divided into two phases. The objective
of the first phase is to decrease the very large number of document terms (also known as
document features) by adopting the WordNet ontology. The second phase aims to cope with the
huge volume of documents by employing the MapReduce parallel programming paradigm to
enhance the system performance. In the following we discuss each phase in more detail.
3.1 Phase 1: Using WordNet Ontology for Document Features Reduction
Ontology plays a vital role in the document clustering process by decreasing the number of
document features from thousands to only tens of features. The feature reduction process utilizes
the semantic relations between words that the ontology encodes, such as synonymy and
hierarchical (hypernym) relations. From the hierarchical relations we can obtain a word's parent
and use it to represent document features. For example, the words corn, wheat, and rice can be
represented by the single word plant. Similarly, words such as beef and pork can be represented
by meat or food, depending on the degree of the hierarchy used in the clustering process.
Exploiting semantic relations between words helps place documents that contain words such as
rice, wheat, and corn in the same cluster. This paper utilizes the semantic relations included in
the WordNet ontology as follows.
WordNet organizes word synsets into 45 lexicographer files: 26 for nouns, 15 for verbs, and 4 for
adverbs and adjectives. We extend the approach of [4], which replaces each term with its
corresponding WordNet lexical category, by considering the lexical categories of noun terms
only. We also use lexical frequency-inverse document frequency ("lf-idf") instead of raw lexical
frequency, to give low weight to lexical categories that occur in most of the documents. Figure
(1) displays the detailed steps of replacing the traditional representation of document terms as a
bag of words with a bag of lexical categories.
Figure 1 Steps for representing documents as bag of lexical categories
Given a set of documents, the first step in this phase is the Extract-Words process. It removes
stop words and extracts the remaining words, generating two files: the vocabulary file, which
contains the list of all words, and the document-words file, which stores the associations between
words and documents (the bag of words). Figure (2) displays the format of the resulting
document-words file [18].
Figure 2 Format of bag of words file
The next process is "Get Lexical Categories"; it converts the bag of words into a bag of WordNet
lexical categories. This process involves the following steps:
1- Mapping each word to its WordNet lexical category. This step generates a WordID-
CategoryID file. If a word does not have a corresponding WordNet lexical category, it is
mapped to the Uncategorized category.
2- Generating a bag of lexical categories by replacing each "docID wordID count" entry
with "docID CategoryID count".
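The two steps above can be illustrated with a small stand-alone sketch. The LEXNAME table below is a hypothetical stand-in for the real WordNet lookup (with NLTK, for instance, a word's synsets expose the lexicographer file via lexname(), e.g. "noun.food"); the point here is the mapping and aggregation logic.

```python
from collections import defaultdict

# Hypothetical stand-in for the WordNet lexicographer-file lookup.
LEXNAME = {
    "beef": "noun.food", "pork": "noun.food",
    "corn": "noun.plant", "wheat": "noun.plant", "rice": "noun.plant",
}

def to_lexical_categories(bag_of_words):
    """Convert (docID, word, count) triples into (docID, category, count),
    summing the counts of words that map to the same lexical category."""
    counts = defaultdict(int)
    for doc_id, word, n in bag_of_words:
        category = LEXNAME.get(word, "Uncategorized")  # step 1: word -> category
        counts[(doc_id, category)] += n                # step 2: aggregate counts
    return [(d, c, n) for (d, c), n in sorted(counts.items())]
```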
The third process is "Optimize". The input to this process is the bag of words file or the bag of
lexical categories file, and the output is an optimized representation. In the case of a bag of
words, each document is represented by one line as follows:
"docID word1ID:count word2ID:count ...... wordnID:count"
In the case of a bag of lexical categories, each document is likewise represented by one line:
"docID Lexical1ID:count Lexical2ID:count ...... LexicalnID:count"
The optimized representation of the bag of words reduces the file size dramatically. For example,
for the "PubMed" bag of words dataset [18], it reduced the bag of words file from 6.3 gigabytes to
928.475 megabytes.
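A minimal sketch of the "Optimize" step, collapsing (docID, termID, count) triples into the one-line-per-document format described above (the same code serves for word IDs and lexical category IDs):

```python
from collections import defaultdict

def optimize(triples):
    """Collapse 'docID termID count' triples into one line per document:
    'docID id1:count id2:count ...'."""
    docs = defaultdict(list)
    for doc_id, term_id, count in triples:
        docs[doc_id].append(f"{term_id}:{count}")
    return [f"{doc_id} " + " ".join(parts) for doc_id, parts in sorted(docs.items())]
```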
3.2 Phase 2: Bisecting k-means Implementation over MapReduce Framework
To cope with the continuously increasing volume of documents, we use MapReduce to run the
bisecting k-means algorithm. To implement bisecting k-means over the MapReduce framework,
the traditional k-means algorithm is first implemented to generate two clusters; we adapt the
method presented in [19] as follows:
1. Initialize centers.
2. In the Map function, each document vector is assigned to the nearest center. The key of
the map function is the document ID and the value is the document vector; the map
function emits the cluster index and the associated document vector.
3. In the Reduce function, new centers are calculated. The key is the cluster index and the
value is a document vector. Using the cluster index instead of the cluster centroid as the
reduce key reduces the amount of data that is aggregated by the reduce worker nodes.
4. In the reducer's cleanup function, the new centers are saved in a global file, to be used in
the next iteration of the k-means algorithm.
5. Finally, if convergence is achieved then stop; else go to step 2.
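The map and reduce steps above can be sketched as plain functions, with a small local driver standing in for Hadoop's shuffle phase. This is an illustration of the logic only, not the paper's Hadoop code.

```python
import numpy as np

def kmeans_map(doc_id, vector, centers):
    """Map: emit (nearest-cluster index, document vector)."""
    dists = [np.linalg.norm(vector - c) for c in centers]
    return int(np.argmin(dists)), vector

def kmeans_reduce(cluster_index, vectors):
    """Reduce: average the vectors of one cluster to get its new center."""
    return cluster_index, np.mean(vectors, axis=0)

def kmeans_iteration(docs, centers):
    """One MapReduce round simulated locally: the dict plays the role of
    the shuffle phase, grouping map output by cluster index."""
    groups = {}
    for doc_id, vec in docs.items():
        idx, v = kmeans_map(doc_id, vec, centers)
        groups.setdefault(idx, []).append(v)
    return {idx: kmeans_reduce(idx, vs)[1] for idx, vs in groups.items()}
```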
The main idea behind implementing the bisecting k-means algorithm on MapReduce is
controlling the Hadoop Distributed File System (HDFS) paths of the dataset, the cluster centers,
and the output clusters. Controlling the HDFS paths helps in implementing algorithms that use
multiple MapReduce iterations, such as the bisecting k-means algorithm.
Figure (3) displays the algorithm for running bisecting k-means on MapReduce. The inputs of the
algorithm are the path of the input dataset, the path of the cluster centroids, the output path, the
number of dimensions, and the required number of clusters. The output of the algorithm is the
document clusters and the cluster centroids.
The proposed algorithm proceeds as follows. First, it calls the basic k-means with the following
parameters: the path of the dataset, the path of the cluster centroids, the path of the output, and
the number of features (dimensions). In this step, the basic k-means algorithm is adjusted to
generate two clusters at each call. Next, the path of the largest cluster is selected as the input
dataset path. Also, the strings representing the output path and the cluster centroids path are each
concatenated with "1" to produce fresh paths for the next call. Every call to the basic k-means
increases the number of obtained clusters by one, except the first call, which yields two. The
process of calling the basic k-means algorithm is repeated until we reach the desired number of
clusters.
The following figure shows the pseudo-code of the proposed algorithm. Let PI represent the path
of the input dataset, k denote the desired number of clusters, Nd denote the number of
dimensions, PCC represent the path of the clusters' centroids, and PO represent the output path.
Let Basic-K-means be the procedure that computes the traditional k-means algorithm; here it is
adjusted to return two clusters per call.
Figure 3 Bisecting k-means over MapReduce Algorithm
4. DOCUMENT CLUSTERING EVALUATION METHODS
To evaluate the proposed document clustering approach, we perform two types of cluster
evaluation: external evaluation and internal evaluation. External evaluation is applied when the
documents are labeled (i.e., their true clusters are known a priori). Internal evaluation is applied
when the document labels are unknown.
4.1 External Evaluation
Three measures for external evaluation of clustering results are used in this paper: purity,
entropy, and the F-measure [20], [21], [17]. Higher purity and F-measure values indicate better
clustering, whereas lower entropy values indicate better results [22]. In the rest of this section we
discuss each measure in more detail.
Purity
Purity measures the degree to which each cluster contains a single class label [20]. To compute
purity, for each cluster j we count the occurrences of each class i and select the maximum count;
the purity is the sum of all maximum counts divided by the total number of objects n:

Purity = \frac{1}{n} \sum_{j=1}^{k} \max_i n_{ij}

where n_{ij} is the number of objects of class i occurring in cluster j.
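A short sketch of the purity computation, taking each cluster as a list of class labels:

```python
from collections import Counter

def purity(clusters):
    """Purity: clusters is a list of lists of class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    # sum over clusters of the count of each cluster's dominant class, divided by n
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n
```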
Bisecting-k-means(PI, k, Nd, PCC, PO)
Output: Clusters' centers, set of clusters.
BEGIN:
    R = 0                 // R is the number of clusters obtained so far
    FirstTime = true      // first call of basic k-means
    while (R < k)
    BEGIN
        Basic-K-means(PI, Nd, PCC, PO)    // adjusted to return two clusters
        if (FirstTime)
            R = R + 2
            FirstTime = false
        else
            R = R + 1
        PI  = path of the largest cluster returned by Basic-K-means
        PCC = concatenate(PCC, "1")
        PO  = concatenate(PO, "1")
    END
END
Entropy
Entropy is a measure of uncertainty used for evaluating clustering results [21]. For each cluster j
the entropy is calculated as follows:

E_j = -\sum_{i=1}^{c} p_{ij} \log p_{ij}

where c is the number of classes and p_{ij} = n_{ij} / n_j is the probability that a member of
cluster j belongs to class i; n_{ij} is the number of objects of class i belonging to cluster j, and
n_j is the total number of objects in cluster j.

The total entropy over all clusters is calculated as follows:

E = \sum_{j=1}^{k} \frac{n_j}{n} E_j

where k is the number of clusters, n_j is the total number of objects in cluster j, and n is the total
number of all objects.
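The per-cluster and total entropy described above can be computed as follows (clusters again given as lists of class labels; log base 2 is assumed here):

```python
import math
from collections import Counter

def total_entropy(clusters):
    """Size-weighted average of per-cluster entropies."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(c)
        # per-cluster entropy: -sum over classes of p * log2(p)
        e = -sum((m / len(c)) * math.log2(m / len(c)) for m in counts.values())
        total += (len(c) / n) * e   # weight by cluster size
    return total
```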
F-measure
The F-measure evaluates the quality of hierarchical clustering [4]; it is a combination of recall
and precision. In [17], each cluster is considered as the result of a query, and each class is
considered as the desired set of documents. First, the precision and recall are computed for each
class i in each cluster j:

P(i, j) = \frac{n_{ij}}{n_j}, \qquad R(i, j) = \frac{n_{ij}}{n_i}

where n_{ij} is the number of objects of class i in cluster j, n_i is the total number of objects in
class i, and n_j is the total number of objects in cluster j.

The F-measure of class i and cluster j is then computed as follows:

F(i, j) = \frac{2 P(i, j) R(i, j)}{P(i, j) + R(i, j)}

The maximum F-measure over clusters is selected for each class; the total F-measure is then
calculated as follows, where n is the total number of documents and c is the total number of
classes:

F = \sum_{i=1}^{c} \frac{n_i}{n} \max_j F(i, j)
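A compact sketch of the total F-measure computation over labeled clusters:

```python
from collections import Counter

def f_measure(clusters):
    """Total F-measure: clusters is a list of lists of class labels."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = sum(1 for label in c if label == i)
            if n_ij == 0:
                continue
            p, r = n_ij / len(c), n_ij / n_i      # precision P(i,j), recall R(i,j)
            best = max(best, 2 * p * r / (p + r))  # F(i,j); keep the best cluster
        total += (n_i / n) * best                  # weight by class size
    return total
```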
4.2 Internal evaluation
For internal evaluation, the goal is to maximize the cosine similarity between each document and
its associated center [23], averaged over the total number of documents:

IE = \frac{1}{n} \sum_{j=1}^{k} \sum_{d \in S_j} \cos(d, c_j)

where k denotes the number of clusters, S_j is the set of documents assigned to cluster j, c_j is
the center of cluster j, d is a document vector, and n is the total number of documents. The value
of this measure ranges from 0 to 1; higher values indicate better clustering results.
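A sketch of the internal evaluation measure, averaging the cosine similarity of each document vector to its cluster center:

```python
import numpy as np

def internal_eval(clusters, centers):
    """Mean cosine similarity between documents and their cluster centers.
    clusters: list of lists of document vectors; centers: matching list of centers."""
    total, n = 0.0, 0
    for docs, c in zip(clusters, centers):
        for d in docs:
            total += np.dot(d, c) / (np.linalg.norm(d) * np.linalg.norm(c))
            n += 1
    return total / n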
5. EXPERIMENT & RESULTS
In this section we discuss the experimental results for evaluating the performance of our proposed
clustering approach. The experiments are divided into two parts. The first part uses labeled
datasets in order to evaluate the applicability of the proposed system. The second part uses
unlabeled datasets in order to measure the scalability of the proposed system and the efficiency of
applying it to large datasets.
For the labeled datasets, we compared our approach against applying bisecting k-means
clustering with the Hotho method [11], the Recupero lexical categories method [4], the lexical
nouns method, and the stemmed "without ontology" method. For the Hotho method we applied
the "add" concept strategy: for each term in the vector representation of the dataset, we check
whether it appears in WordNet [5]; if it does, we extend it with WordNet hypernym synset
concepts up to 5 levels. We also applied the first-concept heuristic of WordNet for word sense
disambiguation. To prepare the datasets for lexical categories, we used the WordNet ontology to
identify the lexical category of each term. In the following sub-sections we provide more details
about the datasets and the experimental results.
5.1 Labeled Dataset
We used the Reuters-21578 dataset [24]; it contains 21578 news documents which have been
classified into topics. However, not all documents have a topic, so we used only the documents
that have at least one topic. Figure (4) displays the class sizes of the Reuters-21578 dataset.
Figure 4 Classes sizes of Reuter dataset
As shown in Figure (4), the dataset has classes of different sizes: some classes are large, such as
the earn class with 3734 documents, while other classes, such as rice, have fewer than 5
documents. We therefore extracted 3 unbiased subsets from the dataset. The first dataset, "Reu-
15-20", contains the documents of classes whose size is between 15 and 20 documents. The
second dataset, "Reu-100-SS", contains classes whose size is greater than or equal to 100; we
considered only 100 news documents per class. The third dataset, "Reu-200-SS", contains classes
whose size is greater than or equal to 200; we considered only 200 news documents per class.
5.1.1 Labeled Dataset Results
Figure (5) displays the F-measure results. The lexical categories representation outperforms the
other dataset representations for the datasets Reu-15-20 and Reu-200-SS. For the dataset Reu-
100-SS, the lexical nouns representation outperforms the other representations. The F-measure
results confirm that utilizing the WordNet ontology for representing document terms, either as
lexical categories or as lexical nouns only, outperforms the stemmed "without ontology" method
and the Hotho method. The reason is the representation of semantic relations between document
terms; for example, documents that contain words such as rice or wheat will be related to a
document in which the term "corn" appears. Hence the utilization of ontology enhances
document clustering results.
Figure 5 F-measure results of labeled dataset
Figure 6 displays the entropy results. For the datasets Reu-15-20 and Reu-100-SS, the
experimental results indicate that the lexical categories representation outperforms the other
dataset representations. For the dataset Reu-200-SS, the Hotho representation outperforms the
other representations.
Figure 6 Entropy results of labeled dataset
The experimental results confirm that exploiting ontology for dataset representation enhances
document clustering, whether through the lexical categories representation or the Hotho
representation.
Figure 7 displays the purity results. For the datasets Reu-15-20 and Reu-200-SS, the lexical
categories representation outperforms the other dataset representations. For the dataset Reu-100-
SS, the stemmed "without ontology" representation outperforms the other representations, but
the purity results of the lexical categories and lexical nouns representations are close to it.
Figure 7 Purity results of labeled dataset
5.2 Unlabeled Dataset
We used three bag of words datasets which are available at the UCI Machine Learning Repository
[18]. The first dataset is NIPS (Neural Information Processing Systems); it consists of 1500
documents, 12419 words, and 9072 stems. The second dataset is NYTimes news articles; it
consists of 300,000 news articles, 102660 words, and 84435 stems. The third dataset is the
PubMed abstracts dataset, which consists of 8,200,000 records, 141043 words, and 119725
stems.
5.2.1 Unlabeled Dataset Results
We ran three experiments on the NIPS dataset: one for the "without ontology" method, and the
others for the WordNet lexical categories and WordNet lexical nouns methods. As displayed in
Table (1), representing the dataset features as WordNet lexical nouns only results in better
internal evaluation than using the stemmed "without ontology" representation or the lexical
categories.
Table 1: Internal Evaluation for Unlabeled Datasets

Dataset   Type                        Features   Internal Evaluation
NIPS      Without Ontology            9072       0.78092
NIPS      WordNet Lexical Categories  45         0.73119
NIPS      WordNet Lexical Nouns       27         0.85828
NYTimes   Without Ontology            84435      NA
NYTimes   WordNet Lexical Categories  45         0.70215
NYTimes   WordNet Lexical Nouns       27         0.74354
PubMed    Without Ontology            119725     NA
PubMed    WordNet Lexical Categories  45         0.62839
PubMed    WordNet Lexical Nouns       27         0.69913
The reason for this result is that nouns are more representative of document contents than other
parts of speech. Moreover, using nouns alone has been shown to improve document clustering
[13]. The results show that using lexical nouns only yields the highest internal evaluation value.
6. CONCLUSION
This paper investigates approaches that enhance and facilitate the clustering of huge numbers of
documents. Enhancement has been achieved by exploiting the WordNet ontology to represent the
semantics between document terms, and by decreasing the number of document features through
mapping document terms to their WordNet lexical categories. In addition, this paper introduces
an implementation of the bisecting k-means algorithm using the MapReduce programming
model, which facilitates running document clustering in a distributed environment. We conclude
that using WordNet lexical categories to represent document terms has many benefits: first, it
represents semantic relations between words; second, it reduces the feature dimensionality; and
finally, it makes clustering big data feasible by dealing with a reduced number of dimensions. We
also investigated using WordNet lexical categories for nouns only, and concluded that it enhances
the internal evaluation measure. For future work we plan to extend our approach to other lexical
categories and to further investigate possible enhancements for the bisecting k-means
implementation over the MapReduce paradigm.
REFERENCES
[1] Zamir, O., & Etzioni, O. (1999). Grouper: a dynamic clustering interface to Web search results.
Computer Networks, 31(11), 1361-1374.
[2] Andrews, N. O., & Fox, E. A. (2007). Recent developments in document clustering. Tech. rept. TR-
07-35. Department of Computer Science, Virginia Tech.
[3] Cheng, Y. (2008). Ontology-based fuzzy semantic clustering. In Convergence and Hybrid
Information Technology, 2008. ICCIT'08. Third International Conference on (Vol. 2, pp. 128-133).
IEEE.
[4] Recupero, D. R. (2007). A new unsupervised method for document clustering by using WordNet
lexical and conceptual relations. Information Retrieval, 10(6), 563-579.
[5] Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11),
39-41.
[6] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters.
Communications of the ACM, 51(1), 107-113.
[7] Gruber, T. R. (1991). The role of common ontology in achieving sharable, reusable knowledge bases.
KR, 91, 601-602.
[8] Cantais, J., Dominguez, D., Gigante, V., Laera, L., & Tamma, V. (2005). An example of food
ontology for diabetes control. In Proceedings of the International Semantic Web Conference 2005
workshop on Ontology Patterns for the Semantic Web.
[9] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Sherlock, G.
(2000). Gene Ontology: tool for the unification of biology. Nature genetics, 25(1), 25-29.
[10] Lauser, B., Sini, M., Liang, A., Keizer, J., & Katz, S. (2006). From AGROVOC to the Agricultural
Ontology Service/Concept Server. An OWL model for creating ontologies in the agricultural domain.
In Dublin Core Conference Proceedings. Dublin Core DCMI.
[11] Hotho, A., Staab, S., & Stumme, G. (2003). Ontologies improve text document clustering. In Data
Mining, 2003. ICDM 2003. Third IEEE International Conference on (pp. 541-544). IEEE.
[12] Sedding, J., & Kazakov, D. (2004, August). WordNet-based text document clustering. In Proceedings
of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data (pp. 104-113).
Association for Computational Linguistics.
[13] Fodeh, S., Punch, B., & Tan, P. N. (2011). On ontology-driven document clustering using core
semantic features. Knowledge and information systems, 28(2), 395-421.
[14] Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002, July). A framework and graphical
development environment for robust NLP tools and applications. In ACL (pp. 168-175).
[15] MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate
observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and
probability (Vol. 1, No. 14, pp. 281-297).
[16] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied
statistics, 100-108.
[17] Steinbach, M., Karypis, G., & Kumar, V. (2000, August). A comparison of document clustering
techniques. In KDD workshop on text mining (Vol. 400, No. 1, pp. 525-526).
[18] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
Irvine, CA: University of California, School of Information and Computer Science.
[19] Zhao, W., Ma, H., & He, Q. (2009). Parallel k-means clustering based on mapreduce. In Cloud
Computing (pp. 674-679). Springer Berlin Heidelberg.
[20] Huang, A. (2008, April). Similarity measures for text document clustering. In Proceedings of the sixth
new zealand computer science research student conference (NZCSRSC2008), Christchurch, New
Zealand (pp. 49-56).
[21] Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on
Mathematical Statistics and Probability (pp. 547-561).
[22] Deepa, M., Revathy, P. (2012). Validation of Document Clustering based on Purity and Entropy
measures. International Journal of Advanced Research in Computer and Communication Engineering,
1(3), 147-152.
[23] Zhao, Y., & Karypis, G. (2002). Comparison of agglomerative and partitional document clustering
algorithms (No. TR-02-014). Minnesota Univ Minneapolis Dept Of Computer Science.
[24] Lewis, D. D. (1997). Reuters-21578 text categorization test collection, distribution 1.0.
http://www.research.att.com/~lewis/reuters21578.html
AUTHORS
Abdelrahman Elsayed is a research assistant at the Central Laboratory for Agricultural Expert
Systems. He received his MSc and BSc in 2008 and 2000, respectively, from the Information Systems
Dept., Faculty of Computers and Information, Cairo University, Egypt. His research interests are data
mining and data-intensive algorithms.
Dr. Hoda Mokhtar is an Associate Professor in the Information Systems Dept. at the Faculty of
Computers and Information, Cairo University. Dr. Hoda received her BSc with "Distinction with Honors"
from the Dept. of Computer Engineering, Faculty of Engineering, Cairo University in 1997. In 2000, Dr.
Hoda received her MSc degree in Computer Engineering from the Faculty of Engineering, Cairo
University. She received her PhD degree in Computer Science from the University of California Santa
Barbara (UCSB) in 2005.
Dr. Osama Ismail received an MSc degree in Computer Science and Information from Cairo University in
1997 and a PhD degree in Computer Science from Cairo University in 2008. He is a Lecturer at the Faculty
of Computers and Information, Cairo University. His fields of interest include Cloud Computing, Multi-
agent Systems, Service Oriented Architectures, and Data Mining.