International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 5, No. 6, November 2014
AN EFFICIENT APPROACH FOR SEMANTICALLY-ENHANCED
DOCUMENT CLUSTERING BY USING
WIKIPEDIA LINK STRUCTURE
Iyad AlAgha1 and Rami Nafee2
1Faculty of Information Technology, The Islamic University of Gaza, Palestine
2Faculty of Information Technology, The Islamic University of Gaza, Palestine
ABSTRACT
Traditional techniques of document clustering do not consider the semantic relationships between words
when assigning documents to clusters. For instance, if two documents talking about the same topic do that
using different words (which may be synonyms or semantically associated), these techniques may assign
documents to different clusters. Previous research has approached this problem by enriching the document
representation with the background knowledge in an ontology. This paper presents a new approach to
enhance document clustering by exploiting the semantic knowledge contained in Wikipedia. We first map
terms within documents to their corresponding Wikipedia concepts. Then, similarity between each pair of
terms is calculated by using the Wikipedia's link structure. The document’s vector representation is then
adjusted so that terms that are semantically related gain more weight. Our approach differs from related
efforts in two aspects: first, unlink others who built their own methods of measuring similarity through the
Wikipedia categories; our approach uses a similarity measure that is modelled after the Normalized
Google Distance which is a well-known and low-cost method of measuring term similarity. Second, it is
more time efficient as it applies an algorithm for phrase extraction from documents prior to matching terms
with Wikipedia. Our approach was evaluated by being compared with different methods from the state of
the art on two different datasets. Empirical results showed that our approach improved the clustering
results as compared to other approaches.
KEYWORDS
Document Clustering, Semantic Similarity, Ontology, Wikipedia.
1. INTRODUCTION
Traditional techniques of document clustering are often based on word co-occurrence and ignore the semantic relationships between words [1,2]. Thus, if two documents use different collections of words to represent the same topic, the clustering approach will consider these documents different and assign them to different clusters despite the semantic similarity between the core words.
Existing research on document clustering has approached the lack of semantics by integrating ontologies as background knowledge [3,4]. An ontology is a hierarchy of domain terms related via specific relations. Ontologies can improve document clustering by identifying words that are probably synonyms or semantically related even though they are syntactically different [5]. Comparing document terms by means of an ontology relies on the fact that their corresponding concepts within the ontology have properties in the form of attributes, levels of generality or specificity, and relationships to other concepts.
DOI: 10.5121/ijaia.2014.5605
WordNet is an example of an ontology that is widely used as background knowledge for document clustering. Many efforts have explored the use of WordNet to incorporate semantics into the bag of words (BOW) [6]. This can be done by matching and replacing words in the document with the most appropriate terms from WordNet. Then, the attributes of, and relationships between, these terms can be exploited to enhance clustering. However, WordNet-based similarity has not proven to enhance clustering as first expected. While some works showed that WordNet has the potential to improve clustering results [5,7,8], other works have reported that the ontological concepts add no value and sometimes impair the performance of document clustering [9]. Besides, some researchers indicate that WordNet does not provide good word similarity data and that its structure does not fit well for this task [10].
Some clustering techniques have used domain ontologies, such as the MeSH ontology, to cluster documents related to specific domains of knowledge [3]. However, these ontologies have limited coverage and are unlikely to cover all concepts mentioned in the document collections, especially when documents are from a general domain.
Rather than relying on domain-specific and limited-coverage ontologies, some approaches have used Wikipedia concepts and category information to enrich document representation and handle the semantic relationships between document terms [11-13]. Wikipedia is much more comprehensive than other ontologies since it captures a wide range of domains, is frequently updated and is well structured. Wikipedia can be seen as an ontology where each article represents a single ontology concept, and all concepts are linked together by hyperlinks. In addition, Wikipedia has a hierarchical categorization that resembles the structure of an ontology, where each article belongs to one or more categories.
In this paper, we propose an approach to improve document clustering by explicitly incorporating the semantic similarity between Wikipedia concepts into the document's vector space model. Our approach is distinguished from similar approaches by the way we efficiently map the document content to Wikipedia concepts and by the low-cost measure we adapted to determine semantic similarity between terms. In the following section, we discuss similar efforts that also exploited knowledge from Wikipedia to enhance document clustering, and compare their approaches with ours.
2. RELATED WORK
Techniques that employ ontological features for clustering try to integrate the ontological background knowledge into the clustering algorithm. Ontology-based similarity measures are often used in these techniques to calculate the semantic similarity between document terms. There are plenty of ontology-based similarity measures that exploit different ontological features, such as distance, information content and shared features, in order to quantify the mutual information between terms (the reader is referred to [3] for a review and comparison of ontology-based similarity measures). The distance between two document vectors is then computed based on the semantic similarity of their terms.
A number of research efforts explored the use of Wikipedia to enhance text mining tasks, including document clustering [11,14,15], text classification [16] and information retrieval [17]. Few approaches have explored utilizing Wikipedia as a knowledge base for document clustering. Gabrilovich et al. [18] proposed and evaluated a method that is based on matching documents with the most relevant articles of Wikipedia, and then augmenting the document's BOW with the semantic features. Spanakis et al. [19] proposed a method for conceptual hierarchical clustering that exploits Wikipedia textual content and link structure to create a compact document representation.
However, these efforts do not make use of the structural relations in Wikipedia. As a result, the semantic relatedness between words that are not synonyms is not considered when computing the similarity between documents.
Huang et al. [13] proposed an approach that maps terms within documents to Wikipedia's anchor vocabulary. They then incorporated the semantic relatedness between concepts by using the Milne and Witten measure [20], which takes into account all of Wikipedia's hyperlinks. Our work is similar in that it also uses a similarity measure based on Wikipedia's hyperlinks. However, their approach did not tackle the issue of frequent itemsets, and instead used a less efficient approach that examines all possible n-grams. Another difference is the way the document similarity is measured: while they augmented the measure of document similarity, our approach augments the document's vector by reweighting the tf-idf score of each word according to its relatedness to the other words in the document. This makes our approach independent of, and hence usable with, any measure of document similarity, since the reweighting process is carried out before computing similarity between document pairs.
Another work that can be compared to ours is presented by Hu et al. [11]. They developed two approaches, exact match and relatedness match, to map documents to Wikipedia concepts, and further to Wikipedia categories. Documents are then clustered based on a similarity metric which combines document content information, concept information as well as category information. However, their approach requires pre-processing of the whole of Wikipedia's textual content, which leads to a substantial increase in both runtime and memory requirements. In contrast, our approach does not require any access to Wikipedia's textual content, and relies only on Wikipedia's link structure to compute similarity between terms.
Hu et al. [12] proposed a method that mines synonym, hypernym and associative relations from Wikipedia, and appends them to a traditional text similarity measure to facilitate document clustering. However, their method was developed specifically for that task and has not been investigated independently. They also built their own method of measuring similarity through Wikipedia's category links and redirects. We instead used a similarity measure that is modelled after the Normalized Google Distance [21], which is a well-known and low-cost method of measuring similarity between terms based on the link structure of the Web.
Wikipedia has also been employed in some efforts for short text classification. For example, Hu et al. [22] proposed an approach that generates queries from short text and uses them to retrieve relevant Wikipedia pages with the help of a search engine. Titles and links from the retrieved pages are then extracted to serve as additional features for clustering. Phan et al. [23] presented a framework for building classifiers that deal with short text. They sought to expand the coverage of classifiers with topics coming from an external knowledge base (e.g., Wikipedia) that do not exist in small training datasets. These approaches, however, use Wikipedia concepts without considering the hierarchical relationships and categories embedded in Wikipedia.
3. AN APPROACH FOR WIKIPEDIA-BASED DOCUMENT CLUSTERING
The pseudo code of our approach for Wikipedia-based document clustering is shown in Figure 1; it consists of three phases. The first phase consists of a set of text processing steps for the purpose of determining terms that best represent the document content. In the second phase, each document is represented by using the tf-idf weighted vector. Document terms are then mapped to Wikipedia concepts. In the third phase, the similarity between each pair of Wikipedia concepts is measured by using the Wikipedia link structure. The tf-idf weights of the original terms are then reweighted to incorporate the similarity scores obtained from Wikipedia. By the end of the algorithm, the tf-idf representation of each document is enriched so that terms that are
semantically related gain more weight. Documents can then be clustered using any traditional
clustering method such as k-means. These phases are explained in detail in the subsequent
sections.
Prior to applying our approach, Wikipedia’s vocabulary of anchor text is retrieved from the
Wikipedia dump, which is a copy of all Wikipedia content, and stemmed in order to be
comparable with the stemmed document content. For measuring similarity between Wikipedia
concepts, all outgoing hyperlinks, incoming hyperlinks and categories of articles are also
retrieved. Note that this task incurs a one-time cost, thus allowing the clustering algorithm to be
invoked multiple times without the additional overhead of reprocessing the Wikipedia content.
Input: A set of documents D
Begin
{Phase 1: Pre-processing and extraction of frequent phrases}
for each document d ∈ D do
  Apply stemming and stopword removal
end for
Concatenate all documents
Apply the Apriori algorithm to extract frequent itemsets
{Phase 2: Construct tf-idf weighted vectors and map terms to Wikipedia concepts}
for each document d ∈ D do
  Tokenize d
  Discard tokens that overlap with frequent phrases
  Discard rare terms
  Build the BOW of d where BOW = retained tokens ⋃ frequent phrases
  for each term t ∈ BOW do
    Calculate tf-idf for t
    if t matches Wikipedia concept(s) then
      Replace t with the matching Wikipedia concept(s)
    end if
  end for
end for
{Phase 3: Reweighting tf-idf weights}
for each document d ∈ D do
  for each term ti ∈ BOW of d do
    for each term tj ∈ BOW of d with tj ≠ ti do
      Compute the similarity between ti and tj using Equation 1
      if ti or tj is ambiguous then
        Perform word-sense disambiguation (see Section 6)
      end if
    end for
    Reweight ti using Equation 2
  end for
end for
{Document clustering}
Apply any conventional clustering algorithm (e.g. k-means)
End
Figure 1. Pseudo code of our algorithm of document clustering
4. CONSTRUCTION OF DOCUMENT’S VECTOR SPACE MODEL
The first step of our clustering approach is to represent each document as a bag of words (BOW). Note that traditional clustering algorithms treat a document as a set of single words, thus losing valuable information about the meaning of terms. When incorporating semantics in document clustering, it is necessary to preserve phrases, the consecutive words that stand together as a conceptual unit. Without preserving phrases, the actual meanings of document terms may be lost, making it difficult to measure the semantic similarity between them. For example, the phrase "big bang theory" refers to a concept that is entirely different from what its individual tokens refer to. Thus, we aim to create a document's BOW representation whose attributes include not only single words but also phrases that have standalone meanings. This phase starts with some standard text processing operations including stopword removal and word stemming. Stopwords are words that occur frequently in documents and carry little informational meaning. Stemming finds the root form of a word by removing its suffix. In the context of mapping with Wikipedia concepts, stemming allows us to recognize and deal with variations of the same word as if they were the same, hence detecting mappings between words with the same stem.
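The paper does not name a specific toolkit for these pre-processing steps; the following minimal sketch assumes NLTK's English stopword list and Porter stemmer as one possible implementation (the function name preprocess is ours).

# Minimal pre-processing sketch (assumed tooling: NLTK; the paper does not
# prescribe a particular stemmer or stopword list).
import re
from nltk.corpus import stopwords     # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, remove stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

# Example
print(preprocess("The Big Bang theory describes the expansion of the universe."))
# -> ['big', 'bang', 'theori', 'describ', 'expans', 'univers']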
To construct the document's bag of words and phrases, we used a simple method based on the Apriori algorithm [24] to find frequently occurring phrases in a document collection. A phrase is defined as frequent if it appears in at least n documents (for our task we set n = 3). The Apriori algorithm consists of two steps: in the first step, it extracts frequent itemsets, or phrases, from a set of transactions that satisfy a user-specified minimum support. In the second step, it generates rules from the discovered frequent itemsets. For this task, we only need the first step, i.e., finding frequent itemsets. In addition, the algorithm was restricted to finding itemsets of four words or fewer, as we believe that most Wikipedia concepts contain no more than four words (this restriction can be easily relaxed).
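As an illustration only (the paper specifies n = 3 and the four-word limit but not the implementation details), the sketch below mines frequent contiguous phrases level by level, extending only phrases whose shorter sub-phrases were themselves frequent, in the spirit of Apriori's candidate pruning:

from collections import defaultdict

def frequent_phrases(tokenized_docs, min_doc_freq=3, max_len=4):
    """Return contiguous phrases (tuples of stemmed tokens) of length 2..max_len
    that occur in at least min_doc_freq documents. Apriori-style pruning: a
    phrase of length k is only counted if both of its length-(k-1) sub-phrases
    were frequent (no pruning is applied at k = 2)."""
    frequent, prev_level = set(), None
    for k in range(2, max_len + 1):
        doc_ids = defaultdict(set)
        for doc_id, tokens in enumerate(tokenized_docs):
            for i in range(len(tokens) - k + 1):
                phrase = tuple(tokens[i:i + k])
                if prev_level is not None and (
                        phrase[:-1] not in prev_level or phrase[1:] not in prev_level):
                    continue
                doc_ids[phrase].add(doc_id)
        level = {p for p, ids in doc_ids.items() if len(ids) >= min_doc_freq}
        frequent |= level
        prev_level = level
    return frequent

Under these assumptions, tokenized_docs would hold the stemmed documents produced by the earlier preprocess sketch, and calling frequent_phrases(tokenized_docs, min_doc_freq=3, max_len=4) mirrors the settings stated in the text.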
After extracting frequent itemsets, we perform word tokenization to break the document text into single words. Many of the resulting tokens may already be part of the extracted itemsets. Therefore, we remove tokens that overlap with any of the retrieved frequent itemsets. The stemmed tokens as well as the frequent itemsets that occur in the document are then combined to form the BOW representing the document. Rare terms that occur infrequently in the document collection can introduce noise and degrade performance; thus, terms that occur in the document collection less than or equal to a predefined threshold number of times are discarded from the document's BOW.
It is worth noting here that similar works that exploited Wikipedia for document clustering often did not consider mining frequent itemsets occurring in the documents [11,13]. Instead, they extract all possible n-grams from the document by using a sliding window approach and match them against the Wikipedia content. In contrast, our approach of extracting frequent itemsets prior to the concept-mapping process is more time-efficient, as it avoids the bottleneck of matching all possible n-grams to Wikipedia concepts.
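A rough sketch of this BOW construction step under the same assumptions; the rare-term threshold min_doc_freq and the helper name build_bow are ours, since the paper only states that a predefined threshold is used:

def build_bow(doc_tokens, phrases, collection_doc_freq, min_doc_freq=2):
    """Combine retained single tokens with the frequent phrases found in the
    document. Tokens covered by a phrase are discarded, as are terms whose
    document frequency in the collection (collection_doc_freq) is at or below
    the assumed threshold min_doc_freq."""
    covered, doc_phrases = set(), []
    n = len(doc_tokens)
    for k in (4, 3, 2):                                   # longest phrases first
        for i in range(n - k + 1):
            cand = tuple(doc_tokens[i:i + k])
            if cand in phrases and not covered & set(range(i, i + k)):
                doc_phrases.append(" ".join(cand))
                covered.update(range(i, i + k))
    singles = [t for i, t in enumerate(doc_tokens)
               if i not in covered and collection_doc_freq.get(t, 0) > min_doc_freq]
    return singles + doc_phrases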
After identifying the document's BOW, the next step is to map terms within the BOW to Wikipedia concepts: each term is compared with Wikipedia anchors, and matching terms are replaced by the corresponding Wikipedia concepts. Terms that do not match any Wikipedia concept are not discarded from the BOW, in order to avoid any noise or information loss.
Formally, let D = {d1, d2, …, dn} be a set of documents and T = {t1, t2, …, tm} be the set of different terms occurring in a document d ∈ D. Note that T includes both: 1) Wikipedia concepts which replace the original terms after the mapping process, and 2) terms that do not match any Wikipedia concept.
The weight of each document term is then calculated using tf-idf (term frequency-inverse document frequency), which weighs the frequency of a term in a document with a factor that discounts its importance when the term appears in almost all documents. The document's vector representation is then constructed from the tf-idf weights of its terms.
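For concreteness, a small sketch of tf-idf vector construction using the common tf × log(N/df) formulation; the paper's exact weighting variant is not reproduced in this text, so treat this as an assumed form:

import math
from collections import Counter

def tfidf_vectors(docs_bows):
    """docs_bows: list of BOWs (lists of terms/concepts, one list per document).
    Returns one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs_bows)
    df = Counter()
    for bow in docs_bows:
        df.update(set(bow))
    vectors = []
    for bow in docs_bows:
        tf = Counter(bow)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors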
5. MEASURING SEMANTIC SIMILARITY BETWEEN WIKIPEDIA TERMS
After representing the document as a vector of term tf-idf weights, the next step is to augment these weights so that terms gain more importance according to their semantic similarity to the other document terms.
To measure similarity between document terms, we used a measure that is based on the Normalized Google Distance (NGD) [21]. The NGD measure first uses the Google search engine to obtain all Web pages mentioning the two terms. Pages that mention both terms indicate relatedness, while pages that mention only one term indicate the opposite. The NGD denotes the distance or dissimilarity between two terms: the smaller the value of NGD, the more semantically related the terms are. For this work, the measure is adapted to exploit Wikipedia articles instead of Google's search results. Formally, the Wikipedia-based similarity measure is:
(1)
where s and t are a pair of Wikipedia concepts, S and T are the sets of all Wikipedia articles that link to s and t respectively, and R is the set of all Wikipedia concepts. The output of this measure ranges between 0 and 1, where values close to 1 denote related terms while values close to 0 denote the opposite. Note that the advantage of this measure is its low computational cost, since it only considers the links between Wikipedia articles to define similarity.
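Equation 1 itself is not reproduced above. The sketch below uses the standard NGD-style adaptation to Wikipedia in-links (in the spirit of the Milne and Witten measure [20]), which is consistent with the description of S, T and R and the stated 0-1 range, but it should be read as an assumed form rather than the authors' exact formula:

import math

def link_similarity(s, t, inlinks, n_articles):
    """NGD-style similarity between two Wikipedia concepts based on the sets of
    articles linking to them (an assumed instantiation of Equation 1).
    inlinks: dict mapping a concept to the set of article ids that link to it.
    n_articles: total number of Wikipedia articles considered (|R|)."""
    S, T = inlinks.get(s, set()), inlinks.get(t, set())
    common = S & T
    if not S or not T or not common:
        return 0.0
    distance = (math.log(max(len(S), len(T))) - math.log(len(common))) / \
               (math.log(n_articles) - math.log(min(len(S), len(T))))
    return max(0.0, 1.0 - distance)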
After computing the similarity between each pair of terms, the tf-idf weight of each term is adjusted to account for its relatedness to the other terms within the document's vector representation. The adjusted weight of a term ti is calculated using the following equation:
(2)
where sim(ti, tj) is the semantic similarity between the terms ti and tj, calculated using Equation 1, and N is the number of co-occurring terms in document d. The threshold denotes the minimum similarity score required between two terms. Since we are interested in placing more weight on terms that are more semantically related, it is necessary to set up such a threshold value.
Note that this measure assigns an additional weight to the original term's weight based on its similarity to the other terms in the document. The term's weight remains unchanged if it is not related to any other term in the document or if it is not mapped to any Wikipedia concept. The final document vector is then composed of the adjusted weights of its terms.
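Because Equation 2 is likewise not reproduced above, the following sketch implements one plausible reading of the description: each term's tf-idf weight is increased by the similarity-weighted contributions of the other document terms whose similarity to it exceeds a threshold, averaged over the co-occurring terms. The threshold value and the exact functional form are assumptions:

def reweight_vector(tfidf_vec, inlinks, n_articles, threshold=0.3):
    """Return a semantically augmented copy of a {term: tfidf} vector.
    threshold is the minimum similarity for a term pair to contribute
    (its value is an assumption; the paper does not state it here).
    Terms not mapped to Wikipedia concepts get similarity 0 from
    link_similarity and therefore keep their original weight."""
    terms = list(tfidf_vec)
    n = len(terms)
    adjusted = dict(tfidf_vec)
    for ti in terms:
        extra = 0.0
        for tj in terms:
            if tj == ti:
                continue
            sim = link_similarity(ti, tj, inlinks, n_articles)
            if sim >= threshold:
                extra += sim * tfidf_vec[tj]
        if n > 1:
            adjusted[ti] = tfidf_vec[ti] + extra / (n - 1)
    return adjusted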
After constructing the semantically-augmented vectors for all documents, any conventional
measure of document similarity, such as the cosine measure, can be used to measure similarity
between document pairs. Note that in our approach we incorporate the similarity scores in the
document representation before applying the document similarity measure. Thus, our approach is
independent of, and hence can be used with, any similarity measure or clustering algorithm.
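To make this final step concrete, a brief sketch that turns the augmented vectors into a term-document matrix and clusters them with k-means from scikit-learn; the choice of library and of cosine-style normalisation is ours:

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

def cluster_documents(augmented_vectors, n_clusters):
    """augmented_vectors: list of {term: weight} dicts, one per document."""
    X = DictVectorizer().fit_transform(augmented_vectors)
    # L2-normalising the rows makes Euclidean k-means behave like
    # clustering by cosine similarity.
    X = normalize(X)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)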
6. WORD SENSE DISAMBIGUATION
Concepts mentioned in Wikipedia are explicitly linked to their corresponding articles through anchors. These anchors can be considered as sense annotations for Wikipedia concepts. Ambiguous words such as "eclipse" are linked to different Wikipedia articles based on their meanings in the context where they occur (e.g., the astronomical event, the software suite, or the foundation). When mapping document terms to Wikipedia concepts, it is necessary to perform word sense disambiguation to identify the correct word sense. Failing to do so may produce false results when measuring similarity between terms.
One way to disambiguate words is to simply use the most common sense. The commonness of a sense is determined by the number of anchors that link to it in Wikipedia. For example, over 95% of anchors labelled "Paris" link to the capital of France, while the rest link to other places, people or even music. However, choosing the most common sense is not always the best decision. Instead, we used the same approach as in [21], which uses the two terms involved in the similarity measure to disambiguate each other. This is done by selecting the two candidate senses that are most closely related to each other. We start by choosing the top common senses for each term (for simplicity, we ignore senses that account for less than 1% of the anchor's links). We then measure the similarity between every pair of candidate senses, and the two senses with the highest similarity score are selected.
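A compact sketch of this pairwise disambiguation, reusing the link_similarity sketch above; sense_distribution (mapping an anchor to its candidate senses and the fraction of the anchor's links each receives) is an assumed input built from the anchor statistics:

def disambiguate_pair(anchor_a, anchor_b, sense_distribution, inlinks, n_articles):
    """Pick the pair of candidate senses (one per anchor) that are most
    related to each other, ignoring senses below 1% of an anchor's links."""
    cands_a = [s for s, frac in sense_distribution[anchor_a].items() if frac >= 0.01]
    cands_b = [s for s, frac in sense_distribution[anchor_b].items() if frac >= 0.01]
    best, best_sim = (None, None), -1.0
    for sa in cands_a:
        for sb in cands_b:
            sim = link_similarity(sa, sb, inlinks, n_articles)
            if sim > best_sim:
                best, best_sim = (sa, sb), sim
    return best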
7. EVALUATION
Wikipedia releases its database dumps periodically; they can be downloaded from http://download.wikipedia.org. The Wikipedia dump used in this evaluation was released on the 13th of August 2014 and contains 12,100,939 articles. The data is presented in XML format. We used the WikipediaMiner [25] toolkit to process the data and extract the categories and outlinks from the Wikipedia dump.
7.1. Methodology
Our objective was to compare our approach with other approaches from the state of the art.
Therefore, we used the same evaluation settings used by Hu et al. [12] in order to make our
results comparable with theirs. The following two test sets were created:
• Reuters-21578 contains short news articles. The subset created consists of categories in the
original Reuters dataset that have at least 20 and at most 200 documents. This results in 1658
documents and 30 categories in total.
• OHSUMed contains 23 categories and 18302 documents. Each document is the concatenation
of title and abstract of a medical science paper.
Besides our method, we implemented and tested three different text representation methods, as
defined below:
• Bag of Words: The traditional BOW method with no semantics. This is the baseline case.
• Hotho et al.'s method: this is a reimplementation of Hotho et al.'s WordNet-based algorithm [8]. The intention of considering this method is to compare how the use of Wikipedia as background knowledge influences the clustering results as compared to WordNet.
• Hu et al.: this is a reimplementation of Hu et al.'s algorithm [12], which leverages Wikipedia as background knowledge for document clustering. While Hu et al. built their own measure of document similarity based on Wikipedia's links, our approach uses a similarity measure that is based on the Normalized Google Distance.
To focus our investigation on the representation rather than the clustering method, the standard k-means clustering algorithm was used. We used two evaluation metrics: purity and F-score. Purity assumes that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. The F-score combines the information of precision and recall, and is extensively applied in information retrieval.
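For reference, a small sketch of the two metrics as commonly defined for clustering evaluation; purity follows the description above, while the F-score uses the class-weighted best-match variant often paired with purity, since the exact variant is not specified here:

from collections import Counter

def purity(clusters, labels):
    """clusters, labels: parallel lists of cluster ids and true class labels."""
    total, correct = len(labels), 0
    for c in set(clusters):
        members = [labels[i] for i, k in enumerate(clusters) if k == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / total

def f_score(clusters, labels):
    """Class-weighted maximum F-measure over clusters (an assumed variant)."""
    total, score = len(labels), 0.0
    for cls in set(labels):
        in_class = sum(1 for l in labels if l == cls)
        best_f = 0.0
        for c in set(clusters):
            in_cluster = sum(1 for k in clusters if k == c)
            overlap = sum(1 for k, l in zip(clusters, labels) if k == c and l == cls)
            if overlap:
                p, r = overlap / in_cluster, overlap / in_class
                best_f = max(best_f, 2 * p * r / (p + r))
        score += (in_class / total) * best_f
    return score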
7.2. Results
Table 1 shows how the different methods perform in clustering on the two datasets. In general, the performance of BOW on both datasets is improved by incorporating background knowledge either from WordNet (Hotho et al.'s method) or Wikipedia (Hu et al.'s method and ours). For instance, on the Reuters dataset, the F-score improvements over BOW are 27.23% for our method, 14.71% for Hu et al.'s method and 7.20% for Hotho et al.'s method.
Comparing the use of Wikipedia to WordNet, our approach and Hu et al.'s, which both use Wikipedia to enrich document similarity, outperformed Hotho et al.'s approach on both datasets. This demonstrates the potential of integrating Wikipedia as a knowledge source as compared to the WordNet-based method.
Comparing our approach with the other three methods, our approach achieves the best F-score and purity on both datasets. We applied a t-test to compare the performance of our approach with the others. Results show that our approach significantly outperformed all other methods on the Reuters dataset with p-value < 0.05. On the OHSUMed dataset, the difference between our approach and both BOW and Hotho et al.'s method was also significant. However, there was no significant difference between the Hu et al. approach and ours (p < 0.32). In general, our approach provides a slight improvement over Hu et al.'s method. As both approaches exploit the Wikipedia link structure to enrich the document representation, we believe that the slight improvement of our approach stems from the adopted similarity measure, which is based on the well-known Normalized Google Distance.
Table 1. Comparison with related work in terms of purity and F-score

                  Reuters-21578                       OHSUMed
                  Purity (Impr.)    F-score (Impr.)   Purity (Impr.)    F-score (Impr.)
Bag of Words      0.571             0.639             0.36              0.470
Hotho et al.      0.594 (4.02%)     0.685 (7.20%)     0.405 (12.5%)     0.521 (10.85%)
Hu et al.         0.655 (8.40%)     0.733 (14.71%)    0.475 (31.94%)    0.566 (20.43%)
Our Approach      0.703 (23.11%)    0.813 (27.23%)    0.483 (34.16%)    0.592 (25.95%)
8. CONCLUSION AND FUTURE WORK
In this work, we proposed an approach for leveraging the Wikipedia link structure to improve text clustering performance. Our approach uses a phrase extraction technique to efficiently map document terms to Wikipedia concepts. Afterwards, the semantic similarity between Wikipedia terms is measured by using a measure that is based on the Normalized Google Distance and Wikipedia's link structure. The document representation is then adjusted so that each term is assigned an additional weight based on its similarity to the other terms in the document. Our approach differs from similar efforts in the state of the art in two aspects: first, unlike other works that built their own methods of measuring similarity through Wikipedia's category links and redirects, we used a similarity measure that is modelled after the Normalized Google Distance, which is a well-known and low-cost method of measuring similarity between terms. Second, while other approaches match all possible n-grams to Wikipedia concepts, our approach is more time efficient as it applies an algorithm for phrase extraction from documents prior to matching terms with Wikipedia. In addition, our approach does not require any access to Wikipedia's textual content, and relies only on Wikipedia's link structure to compute similarity between terms. The proposed approach was evaluated by comparing it with different methods from related work (e.g., Bag of Words with no semantics, clustering with WordNet, and clustering with Wikipedia) on two datasets: Reuters-21578 and OHSUMed.
We think that our approach can be extended to other applications that are based on the measurement of document similarity, such as information retrieval and short text classification. In future work, we aim to further improve our concept-mapping technique: currently, only the document terms that exactly match Wikipedia concepts are extracted and used for the similarity measure. Instead of exact matching, we aim to utilize the graph of Wikipedia links to build the connection between Wikipedia concepts and the document content even when they do not match exactly. This can be more useful when Wikipedia concepts cannot fully cover the document content.
REFERENCES
[1] Hotho, Andreas, Maedche, Alexander & Staab, Steffen, (2002) "Ontology-Based Text Document Clustering", KI, Vol. 16, No. 4: pp. 48-54.
[2] Charola, Apeksha & Machchhar, Sahista, (2013) "Comparative Study on Ontology Based Text Documents Clustering Techniques", Data Mining and Knowledge Engineering, Vol. 5, No. 12: pp. 426.
[3] Zhang, Xiaodan, Jing, Liping, Hu, Xiaohua, Ng, Michael & Zhou, Xiaohua, (2007) "A Comparative Study of Ontology Based Term Similarity Measures on Pubmed Document Clustering", Advances in Databases: Concepts, Systems and Applications, Springer. pp. 115-126.
[4] Logeswari, S & Premalatha, K, (2013) "Biomedical Document Clustering Using Ontology Based Concept Weight", Computer Communication and Informatics (ICCCI), 2013 International Conference on: IEEE. pp. 1-4.
[5] Hotho, Andreas, Staab, Steffen & Stumme, Gerd, (2003) "Ontologies Improve Text Document Clustering", Data Mining, 2003. ICDM 2003. Third IEEE International Conference on: IEEE. pp. 541-544.
[6] Wei, Tingting, Zhou, Qiang, Chang, Huiyou, Lu, Yonghe & Bao, Xianyu, (2014) "A Semantic Approach for Text Clustering Using WordNet and Lexical Chains", Expert Systems with Applications, Vol. 42, No. 4.
[7] Wang, Pu, Hu, Jian, Zeng, Hua-Jun, Chen, Lijun & Chen, Zheng, (2007) "Improving Text Classification by Using Encyclopedia Knowledge", Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on: IEEE. pp. 332-341.
[8] Hotho, Andreas, Staab, Steffen & Stumme, Gerd, (2003) "WordNet Improves Text Document Clustering", Proceedings of the SIGIR Semantic Web Workshop: ACM. pp. 541-544.
[9] Moravec, Pavel, Kolovrat, Michal & Snasel, Vaclav, (2004) "LSI vs. WordNet Ontology in Dimension Reduction for Information Retrieval", Dateso. pp. 18-26.
[10] Passos, Alexandre & Wainer, Jacques, (2009) "WordNet-Based Metrics Do Not Seem to Help Document Clustering", Proc. of the II Workshop on Web and Text Intelligence, São Carlos, Brazil.
[11] Hu, Xiaohua, Zhang, Xiaodan, Lu, Caimei, Park, Eun K & Zhou, Xiaohua, (2009) "Exploiting Wikipedia as External Knowledge for Document Clustering", Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: ACM. pp. 389-396.
[12] Hu, Jian, Fang, Lujun, Cao, Yang, Zeng, Hua-Jun, Li, Hua, Yang, Qiang & Chen, Zheng, (2008) "Enhancing Text Clustering by Leveraging Wikipedia Semantics", Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: ACM. pp. 179-186.
[13] Huang, Anna, Milne, David, Frank, Eibe & Witten, Ian H, (2009) "Clustering Documents Using a Wikipedia-Based Concept Representation", Advances in Knowledge Discovery and Data Mining, Springer. pp. 628-636.
[14] Kumar, Kiran, Santosh, GSK & Varma, Vasudeva, (2011) "Multilingual Document Clustering Using Wikipedia as External Knowledge", Springer.
[15] Roul, Rajendra Kumar, Devanand, Omanwar Rohit & Sahay, SK, (2014) "Web Document Clustering and Ranking Using Tf-Idf Based Apriori Approach", arXiv preprint arXiv:1406.5617.
[16] Zhang, Zhilin, Lin, Huaizhong, Li, Pengfei, Wang, Huazhong & Lu, Dongming, (2013) "Improving Semi-Supervised Text Classification by Using Wikipedia Knowledge", Web-Age Information Management, Springer. pp. 25-36.
[17] Han, May Sabai, (2013) "Semantic Information Retrieval Based on Wikipedia Taxonomy", International Journal of Computer Applications Technology and Research, Vol. 2, No. 1: pp. 77-80.
[18] Gabrilovich, Evgeniy & Markovitch, Shaul, (2006) "Overcoming the Brittleness Bottleneck Using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge", AAAI. pp. 1301-1306.
[19] Spanakis, Gerasimos, Siolas, Georgios & Stafylopatis, Andreas, (2012) "Exploiting Wikipedia Knowledge for Conceptual Hierarchical Clustering of Documents", The Computer Journal, Vol. 55, No. 3: pp. 299-312.
[20] Milne, David & Witten, Ian H, (2008) "Learning to Link with Wikipedia", Proceedings of the 17th ACM Conference on Information and Knowledge Management: ACM. pp. 509-518.
[21] Cilibrasi, Rudi L & Vitanyi, Paul MB, (2007) "The Google Similarity Distance", Knowledge and Data Engineering, IEEE Transactions on, Vol. 19, No. 3: pp. 370-383.
[22] Hu, Xia, Sun, Nan, Zhang, Chao & Chua, Tat-Seng, (2009) "Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge", Proceedings of the 18th ACM Conference on Information and Knowledge Management: ACM. pp. 919-928.
[23] Phan, Xuan-Hieu, Nguyen, Le-Minh & Horiguchi, Susumu, (2008) "Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-Scale Data Collections", Proceedings of the 17th International Conference on World Wide Web: ACM. pp. 91-100.
[24] Agrawal, Rakesh & Srikant, Ramakrishnan, (1994) "Fast Algorithms for Mining Association Rules", Proc. 20th Int. Conf. Very Large Data Bases, VLDB. pp. 487-499.
[25] Wikipedia Miner. [Accessed 20th September 2014]; Available from: http://wikipedia-miner.cms.waikato.ac.nz/.
Authors
Iyad M. AlAgha received his MSc and PhD in Computer Science from the University of Durham, UK. He worked as a research associate in the centre for technology-enhanced learning at the University of Durham, investigating the use of multi-touch devices for learning and teaching. He is currently working as an assistant professor at the Faculty of Information Technology at the Islamic University of Gaza, Palestine. His research interests are Semantic Web technology, Adaptive Hypermedia, Human-Computer Interaction and Technology Enhanced Learning.
Rami H. Nafee received his BSc from Al-Azhar University-Gaza and his MSc in Information Technology from the Islamic University of Gaza. He works as a Web programmer at the Information Technology Unit at Al-Azhar University-Gaza. He is also a lecturer at the Faculty of Intermediate Studies at Al-Azhar University. His research interests include Data Mining and the Semantic Web.