This document surveys challenging issues and similarity measures for web document clustering. It opens with an introduction to text mining and document clustering, then reviews related work on similarity approaches and measures. Key challenges in web document clustering are discussed, such as measuring semantic similarity between words and evaluating cluster validity. Various similarity measures are described, including string-based measures such as the Jaro-Winkler distance and corpus-based measures such as latent semantic analysis. The conclusion notes that accurate clustering requires a precise definition of similarity between document pairs and reviews the different measures that can serve this purpose.
An efficient approach for semantically enhanced document clustering by using ... (ijaia)
Traditional document clustering techniques do not consider the semantic relationships between words when assigning documents to clusters. For instance, if two documents discuss the same topic using different words (which may be synonyms or semantically associated terms), these techniques may assign the documents to different clusters. Previous research has approached this problem by enriching the document representation with background knowledge from an ontology. This paper presents a new approach that enhances document clustering by exploiting the semantic knowledge contained in Wikipedia. We first map terms within documents to their corresponding Wikipedia concepts. The similarity between each pair of terms is then calculated using Wikipedia's link structure, and the document's vector representation is adjusted so that semantically related terms gain more weight. Our approach differs from related efforts in two respects. First, unlike others, who built their own methods of measuring similarity through the Wikipedia categories, our approach uses a similarity measure modelled after the Normalized Google Distance, a well-known and low-cost method of measuring term similarity. Second, it is more time-efficient, because it applies a phrase-extraction algorithm to documents before matching terms against Wikipedia. Our approach was evaluated against different state-of-the-art methods on two datasets, and the empirical results showed that it improved clustering quality compared to the other approaches.
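The Normalized Google Distance mentioned above has a standard closed form that transfers directly to Wikipedia link counts. Below is a minimal sketch, not the paper's implementation; the counts shown and the exp(-d) mapping from distance to similarity are illustrative assumptions.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance computed from raw co-occurrence counts.

    fx, fy -- number of pages (here: Wikipedia pages) linking to terms x and y
    fxy    -- number of pages linking to both terms
    n      -- total number of pages considered
    Returns 0.0 for terms with identical usage; larger means less related.
    """
    if fxy == 0:
        return float("inf")  # terms never co-occur
    log_fx, log_fy = math.log(fx), math.log(fy)
    return (max(log_fx, log_fy) - math.log(fxy)) / (math.log(n) - min(log_fx, log_fy))

def similarity(fx, fy, fxy, n):
    """Map the distance into [0, 1] so it can re-weight term vectors."""
    d = ngd(fx, fy, fxy, n)
    return 0.0 if math.isinf(d) else math.exp(-d)
```

Terms that always co-occur get distance 0 (similarity 1), while terms that never co-occur get similarity 0, which is the behaviour a term re-weighting scheme needs.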
Based on social capital theory, this article takes the co-authorship network of Chinese inland scholars in the field of information science as a case study and investigates the influence of peripheral authors' network embeddedness on research impact by applying a Poisson regression model. The empirical results indicate that (1) degree centrality positively affects net citations, (2) an inverted U-shaped relationship exists between degree centrality and net citations, which confirms the existence of an optimal collaboration size, and (3) structural holes positively affect net citations.
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION (Nexgen Technology)
PRE-RANKING DOCUMENTS VALORIZATION IN THE INFORMATION RETRIEVAL PROCESS (csandit)
In this short paper we present three methods for boosting the relevance scores of certain documents, based on their characteristics, in order to improve their ranking. Our framework is an information retrieval system dedicated to children. The valorization methods increase a document's relevance score by an additional value proportional to the number of multimedia objects it includes, the number of objects linked to the user's particulars, and the topics it covers. All three valorization methods use fuzzy rules to determine the valorization value.
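As a rough illustration of the idea, a valorization rule proportional to the multimedia count might look like the following sketch; the linear membership function, the `cap`, and the `alpha` weight are assumptions for illustration, not the paper's actual fuzzy rules.

```python
def valorization(base_score, n_multimedia, cap=10, alpha=0.2):
    """Boost a relevance score in proportion to the document's multimedia count.

    `membership` is a crude stand-in for a fuzzy "media-rich" degree in [0, 1];
    `alpha` caps the boost at 20% of the base score.
    """
    membership = min(n_multimedia / cap, 1.0)
    return base_score * (1.0 + alpha * membership)
```

A document with no multimedia keeps its original score, while a media-rich document gets a bounded additive boost, so the re-ranking cannot be dominated by a single feature.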
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ... (IJERA Editor)
Publicly accessible databases containing speech documents exist, but the time and effort required to keep them up to date is often burdensome. To help identify the speaker of a speech when its text is available, text-mining tools from the machine learning discipline can be applied to assist in this process. Here we describe and evaluate document classification algorithms that combine text mining and classification. The task asked participants to design classifiers for identifying documents containing speech-related information in the literature, and the classifiers were evaluated against one another. The proposed system uses a k-nearest neighbour classification approach and compares its performance for different values of k.
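A plain cosine-similarity k-NN over bag-of-words vectors, of the kind the abstract describes, can be sketched as follows; the toy corpus and the unweighted majority vote are illustrative simplifications of the dynamic, attribute-weighted variant the paper studies.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, train, k=3):
    """Majority vote among the k training documents nearest to `text`.

    train: list of (document_text, label) pairs.
    """
    query = Counter(text.lower().split())
    ranked = sorted(train,
                    key=lambda pair: cosine(query, Counter(pair[0].lower().split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Varying `k`, as the paper does, trades sensitivity to noise (small k) against blurring of class boundaries (large k).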
Topic Modeling: Clustering of Deep Webpages (csandit)
The internet comprises a massive amount of information in the form of zillions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively make use of surface web information, but the deep web remains unexploited. Machine learning techniques have commonly been employed to access deep web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key solutions to organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we derive the distributions of "topics per document" and "words per topic" using Gibbs sampling. Experimental results show that the proposed method clearly outperforms the existing clustering methods.
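The "topics per document" and "words per topic" distributions obtained by Gibbs sampling can be pictured with a minimal collapsed Gibbs sampler for LDA; the hyperparameters and iteration count below are illustrative, not the paper's settings.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (ndk, nkw): topic counts per document and word counts per topic,
    from which the "topics per document" and "words per topic" distributions
    follow by normalizing with the alpha/beta smoothing.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # total tokens per topic
    z = []                                             # topic of each token
    for di, doc in enumerate(docs):                    # random initialisation
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                          # remove current assignment
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [                            # full conditional p(z = k | rest)
                    (ndk[di][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw
```

Each sweep resamples every token's topic from its full conditional; after burn-in, documents sharing vocabulary (e.g. similar deep-web form fields) concentrate their counts on the same topics.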
Annotation Approach for Document with Recommendation (ijmpict)
A large number of organizations generate and share textual descriptions of their products, facilities, and activities. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction systems simplify the extraction of structured associations, they are frequently expensive and inaccurate, particularly when working over text that does not contain any examples of the targeted structured data. We propose an alternative methodology that simplifies structured metadata generation by recognizing documents that are likely to contain information of interest; this data is then useful for querying the database. Moreover, we design algorithms to extract attribute-value pairs, and we devise new mechanisms to map such pairs to manually created schemas. We apply a clustering technique to the item content information to complement the user rating information, which improves the accuracy of collaborative similarity and addresses the cold-start problem.
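A very simple attribute-value pair extractor, of the general kind the abstract describes, might start from a pattern over "Attribute: value" text; this regex-based sketch is an assumption for illustration, not the paper's algorithm.

```python
import re

def extract_pairs(text):
    """Pull 'Attribute: value' pairs such as 'Color: red' out of free text.

    Values end at sentence or list punctuation; later duplicates of an
    attribute overwrite earlier ones (dict semantics).
    """
    return dict(re.findall(r"(\w[\w ]*?)\s*:\s*([^;,.\n]+)", text))
```

Real systems layer schema mapping and confidence scoring on top of such candidates, as the abstract's attribute-to-schema mechanisms do.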
Classification-based Retrieval Methods to Enhance Information Discovery on th... (IJMIT JOURNAL)
The widespread adoption of the World-Wide Web (the Web) has created challenges both for society as a whole and for the technology used to build and maintain the Web. The ongoing struggle of information retrieval systems is to wade through this vast pile of data and satisfy users by presenting the information that most adequately fits their needs. On a societal level, the Web is expanding faster than we can comprehend its implications or develop rules for its use, and its ubiquity has raised important social concerns in the areas of privacy, censorship, and access to information. On a technical level, the novelty of the Web and the pace of its growth have created challenges not only in the development of new applications that realize the power of the Web, but also in the technology needed to scale applications to the resulting large data sets and heavy loads. This thesis presents searching algorithms and hierarchical classification techniques for increasing a search service's understanding of web queries. Existing search services rely solely on a query's occurrence in the document collection to locate relevant documents; they typically do not perform task- or topic-based analysis of queries using other available resources, and they do not leverage changes in user query patterns over time. Provided within are a set of techniques and metrics for performing temporal analysis on query logs. Our log analyses are shown to be reasonable and informative, and can be used to detect changing trends and patterns in the query stream, thus providing valuable data to a search service.
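One elementary form of temporal query-log analysis is comparing a query's frequency in a recent time window against the preceding one; the sketch below is an illustrative baseline of that idea, not the thesis's actual metrics.

```python
from collections import Counter

def trending(log, window=2):
    """Queries whose count in the most recent `window` days exceeds their
    count in the preceding window.

    log: list of (day, query) pairs; days are comparable (ints, dates, ...).
    """
    days = sorted({d for d, _ in log})
    recent = set(days[-window:])
    prior = set(days[-2 * window:-window])
    recent_counts = Counter(q for d, q in log if d in recent)
    prior_counts = Counter(q for d, q in log if d in prior)
    return sorted(q for q in recent_counts if recent_counts[q] > prior_counts.get(q, 0))
```

Queries absent from the prior window automatically qualify, which is how newly emerging topics surface in such an analysis.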
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F... (ijaia)
Regression models and their statistical analyses are among the most important tools used by scientists and practitioners. The aim of a regression model is to fit parametric functions to data. The true regression function is unknown, and specific methods are created and used strictly pertaining to the problem. For pioneering work on procedures for fitting functions, we refer to the methods of least absolute deviations, least squared deviations, and minimax absolute deviations. Today's widely celebrated method of least squares for function fitting is credited to the published works of Legendre and Gauss. However, least-squares-based models may in practice fail to provide optimal results in non-Gaussian situations, especially when the errors follow fat-tailed distributions. In this paper an unorthodox method of estimating linear regression coefficients by minimising the GMSE (geometric mean of squared errors) is explored. Though the GMSE is commonly used to compare models, it is rarely used to obtain the coefficients themselves. Such a method is tedious to handle because of the large number of roots obtained when minimising the loss function; this paper offers a way to tackle that problem. The application is illustrated with the 'Advertising' dataset from ISLR, and the results are compared with those of the method of least squares for a single-index linear regression model.
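Minimising the GMSE is equivalent to minimising the mean log squared residual. The sketch below evaluates that loss and picks a slope from a candidate grid; the grid search and the small epsilon guarding log(0) are illustrative workarounds, not the paper's root-finding method.

```python
import math

def gmse(y, y_hat, eps=1e-12):
    """Geometric mean of squared errors; eps guards log(0) for exact fits."""
    logs = [math.log((a - b) ** 2 + eps) for a, b in zip(y, y_hat)]
    return math.exp(sum(logs) / len(logs))

def fit_slope_gmse(x, y, grid):
    """Choose the slope from `grid` minimising GMSE for the model y = b * x."""
    return min(grid, key=lambda b: gmse(y, [b * xi for xi in x]))
```

The log transform is also why the loss has many stationary points: each residual contributes a pole as it approaches zero, which is the multiplicity-of-roots problem the paper addresses.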
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR... (IJDKP)
As existing search engines struggle to understand the meaning of natural language, semantically enriched metadata may improve interest-based search engine capabilities and user satisfaction. This paper presents an enhanced version of an ecosystem focusing on semantic topic metadata detection and enrichment. It builds on a previous paper describing a semantic metadata enrichment software ecosystem (SMESE). Through text analysis approaches for topic detection and metadata enrichment, this paper proposes an algorithm to enhance search engine capabilities and consequently help users find content according to their interests. It presents the design, implementation, and evaluation of the SATD (Scalable Annotation-based Topic Detection) model and algorithm, which use metadata from the web, linked open data, concordance rules, and bibliographic record authorities. It includes a prototype of a semantic engine that uses keyword extraction, classification, and concept extraction to generate semantic topics from text and multimedia document analysis with the proposed SATD model and algorithm. The performance of the proposed ecosystem is evaluated in a number of prototype simulations by comparing it to existing metadata enrichment techniques (e.g., AlchemyAPI, DBpedia, Wikimeta, Bitext, AIDA, TextRazor). The SATD algorithm supports more attributes than the other algorithms, and the results show that the enhanced platform and its algorithm enable greater understanding of documents related to user interests.
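A keyword-extraction component of the kind mentioned above is often built on TF-IDF scoring; the following sketch is a generic illustration of that idea, not SMESE's actual extractor.

```python
import math
from collections import Counter

def keywords(doc, corpus, k=3):
    """Rank the terms of `doc` by TF-IDF against `corpus` (lists of tokens).

    Terms frequent in the document but rare across the corpus score highest.
    """
    tf = Counter(doc)
    n = len(corpus)
    def idf(term):
        df = sum(1 for d in corpus if term in d)
        return math.log((n + 1) / (df + 1)) + 1   # smoothed inverse document frequency
    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:k]
```

In an enrichment pipeline, such candidate keywords would then be linked to concepts and topics, which is where systems like those compared in the abstract differ.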
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel... (University of Bari, Italy)
The current abundance of electronic documents requires automatic techniques that support users in understanding document content and extracting useful information. To this end, improving retrieval performance must go beyond simple lexical interpretation of user queries and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would benefit enormously from effective Information Retrieval techniques to offer its users. This paper proposes an approach to Information Retrieval based on matching the domain of discourse of the query with that of the documents in the repository. The association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.
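The domain-of-discourse matching can be pictured with a toy lexicon standing in for WordNet Domains; the lexicon and the weighted-Jaccard overlap below are illustrative assumptions, not the paper's similarity assessment technique.

```python
def domain_profile(tokens, lexicon):
    """Aggregate domain weights for a token list; `lexicon` maps each term to
    the set of domains it belongs to (a toy stand-in for WordNet Domains)."""
    profile = {}
    for token in tokens:
        for domain in lexicon.get(token, ()):
            profile[domain] = profile.get(domain, 0) + 1
    return profile

def domain_similarity(query, doc, lexicon):
    """Weighted-Jaccard overlap between the two domain profiles."""
    p, r = domain_profile(query, lexicon), domain_profile(doc, lexicon)
    inter = sum(min(p[d], r[d]) for d in p if d in r)
    union = sum(p.values()) + sum(r.values()) - inter
    return inter / union if union else 0.0
```

Because ambiguous words like "bank" contribute to several domains, co-occurring query terms effectively disambiguate each other: a query pairing "bank" with "loan" matches banking documents more strongly than geography ones.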
Semantic Query Optimisation with Ontology Simulation (dannyijwest)
The Semantic Web is, without a doubt, gaining momentum in both industry and academia. The word "semantic" refers to "meaning": a semantic web is a web of meaning. In this fast-changing, results-oriented world, gone are the days when an individual had to struggle to find information on the Internet and knowledge management was the major issue. The Semantic Web envisions linking, integrating, and analysing data from various sources to form new information streams: a web of databases connected with each other, with machines interacting with other machines to yield results that are user-oriented and accurate. With the emergence of the Semantic Web framework, the naive approach of searching for information on the syntactic web has become dated. This paper proposes optimised semantic keyword search, exemplified by simulating an ontology of Indian universities with a proposed algorithm that enables effective semantic retrieval of information that is easy to access and time-saving.
Sentimental classification analysis of polarity multi-view textual data using... (IJECEIAES)
The data and information available in most community environments are complex in nature. Sentiment data resources may consist of textual data collected from multiple information sources with different representations, usually handled by different analytical models. Data resources with these characteristics form multi-view polarity textual data, and knowledge creation from this type of sentiment textual data requires considerable analytical effort and capability. Data mining practices can provide exceptional results in handling textual data; in particular, when the textual data exists in multi-view or unstructured formats, hybrid and integrated text data mining algorithms are vital for obtaining helpful results. The objective of this research is to enhance knowledge discovery from sentimental multi-view textual data, treated as an unstructured format, by classifying polarity information documents into two categories of useful information. The paper discusses a proposed framework with integrated data mining algorithms, achieved through the application of the X-means algorithm for clustering and the HotSpot algorithm for association rules. The analysis shows improved accuracy in classifying sentimental multi-view textual data into two categories when the proposed framework is applied to an online polarity user-review dataset on given topics.
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
The increasing nature of World Wide Web has imposed great challenges for researchers in improving the search efficiency over the internet. Now days web document clustering has become an important research topic to provide most relevant documents in huge volumes of results returned in response to a simple query. In this paper, first we proposed a novel approach, to precisely define clusters based on maximal frequent item set (MFI) by Apriori algorithm. Afterwards utilizing the same maximal frequent item set (MFI) based similarity measure for Hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of document set is decreased. Secondly, providing privacy preserving of open web documents is to avoiding duplicate documents. There by we can protect the privacy of individual copy rights of documents. This can be achieved using equivalence relation.
Document Retrieval System, a Case StudyIJERA Editor
In this work we have proposed a method for automatic indexing and retrieval. This method will provide as a
result the most likelihood document which is related to the input query. The technique used in this project is
known as singular-value decomposition, in this method a large term by document matrix is analyzed and
decomposed into 100 factors. Documents are represented by 100 item vector of factor weights. On the other
hand queries are represented as pseudo-document vectors, which are formed from weighed combinations of
terms.
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGIJDKP
Web Ming faces huge problems due to Duplicate and Near Duplicate Web pages. Detecting Near
Duplicates is very difficult in large collection of data like ”internet”. The presence of these web pages
plays an important role in the performance degradation while integrating data from heterogeneous
sources. These pages either increase the index storage space or increase the serving costs. Detecting these
pages has many potential applications for example may indicate plagiarism or copyright infringement.
This paper concerns detecting, and optionally removing duplicate and near duplicate documents which are
used to perform clustering of documents .We demonstrated our approach in web news articles domain. The
experimental results show that our algorithm outperforms in terms of similarity measures. The near
duplicate and duplicate document identification has resulted reduced memory in repositories.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
Assigning documents to related categories is critical task which is used for effective document retrieval. Automatic text classification is the process of assigning new text document to the predefined categories based on its content. In this paper, we implemented and performed comparison of Naïve Bayes and Centroid-based algorithms for effective document categorization of English language text. In Centroid Based algorithm, we used Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate centroid of each class. Experiment is performed on R-52 dataset of Reuters-21578 corpus. Micro Average F1 measure is used to evaluate the performance of classifiers. Experimental results show that Micro Average F1 value for NB is greatest among all followed by Micro Average F1 value of CGC which is greater than Micro Average F1 of AAC. All these results are valuable for future research
Cluster Based Web Search Using Support Vector MachineCSCJournals
Now days, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the Web a popular place for collecting information. Today, Internet user access billions of web pages online using search engines. Information in the Web comes from many sources, including websites of companies, organizations, communications and personal homepages, etc. Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, the relevant Web pages are often located close to each other in the Web graph of hyperlinks. It presents a graphical approach for entity resolution & complements the traditional methodology with the analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between various pairs of nodes in the graph. It can significantly improve the quality of entity resolution. Using Support vector machines (SVMs) which are a set of related Supervised learning methods used for classification of load of user queries to the sever machine to different client machines so that system will be stable. clusters web pages based on their capacities stores whole database on server machine. Keywords: SVM, cluster; ER.
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...ijaia
Traditional techniques of document clustering do not consider the semantic relationships between words
when assigning documents to clusters. For instance, if two documents talking about the same topic do that
using different words (which may be synonyms or semantically associated), these techniques may assign
documents to different clusters. Previous research has approached this problem by enriching the document
representation with the background knowledge in an ontology. This paper presents a new approach to
enhance document clustering by exploiting the semantic knowledge contained in Wikipedia. We first map
terms within documents to their corresponding Wikipedia concepts. Then, similarity between each pair of
terms is calculated by using the Wikipedia's link structure. The document’s vector representation is then
adjusted so that terms that are semantically related gain more weight. Our approach differs from related
efforts in two aspects: first, unlink others who built their own methods of measuring similarity through the
Wikipedia categories; our approach uses a similarity measure that is modelled after the Normalized
Google Distance which is a well-known and low-cost method of measuring term similarity. Second, it is
more time efficient as it applies an algorithm for phrase extraction from documents prior to matching terms
with Wikipedia. Our approach was evaluated by being compared with different methods from the state of
the art on two different datasets. Empirical results showed that our approach improved the clustering
results as compared to other approaches.
Data mining , knowledge discovery is the process
of analyzing data from different perspectives and summarizing it
into useful information - information that can be used to increase
revenue, cuts costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize
it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among
dozens of fields in large relational databases. The goal of
clustering is to determine the intrinsic grouping in a set of
unlabeled data. But how to decide what constitutes a good
clustering? It can be shown that there is no absolute “best”
criterion which would be independent of the final aim of the
clustering. Consequently, it is the user which must supply this
criterion, in such a way that the result of the clustering will suit
their needs.
For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in
finding “natural clusters” and describe their unknown properties
(“natural” data types), in finding useful and suitable groupings
(“useful” data classes) or in finding unusual data objects (outlier
detection).Of late, clustering techniques have been applied in the
areas which involve browsing the gathered data or in categorizing
the outcome provided by the search engines for the reply to the
query raised by the users. In this paper, we are providing a
comprehensive survey over the document clustering.
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...ijdms
This paper aims to show how mutual information can help provide a semantic interpretation of anomalies in data, characterize the anomalies, and how mutual information can help measure the information that object item X shares with another object item Y. Whilst most link mining approaches focus on predicting link type, link based object classification or object identification, this research focused on using link mining to detect anomalies and discovering links/objects among anomalies. This paper attempts to demonstrate the contribution of mutual information to interpret anomalies using a case study.
In this paper we tried to correlate text sequences those provides common topics for semantic clues. We propose a two step method for asynchronous text mining. Step one check for the common topics in the sequences and isolates these with their timestamps. Step two takes the topic and tries to give the timestamp of the text document. After multiple repetitions of step two, we could give optimum result.
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Topic Modeling : Clustering of Deep Webpagescsandit
The internet is comprised of massive amount of information in the form of zillions of web pages.This information can be categorized into the surface web and the deep web. The existing search engines can effectively make use of surface web information.But the deep web remains unexploited yet. Machine learning techniques have been commonly employed to access deep web content.
Under Machine Learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using
contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key solutions to organize the deep web databases.In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model called Latent Dirichlet
Allocation(LDA) for modeling content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form
contents.Further, we contrive the distribution of “topics per document” and “words per topic”
using the technique of Gibbs sampling. Experimental results show that the proposed method clearly outperforms the existing clustering methods.
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan – Feb. 2015), PP 55-59
www.iosrjournals.org
DOI: 10.9790/0661-17145559
Challenging Issues and Similarity Measures for Web Document Clustering
S. Mahalakshmi
Research Scholar, Bharathiar University, Coimbatore, India.
Abstract: The Web contains a large number of documents in electronic form. These documents appear in various formats, and the information within them is not organized. The lack of organization of materials on the WWW motivates the need to automatically manage this huge amount of information. Text mining refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. A text-mining framework comprises Information Retrieval, Information Extraction, Information Mining, and Interpretation. During information retrieval, many web documents are retrieved; how can we identify similar documents among them? This paper deals with the challenging issues and similarity measures for web document clustering.
Key words: Text Mining, Information Retrieval, Framework, Information Extraction, Similarity, Clustering
I. Introduction
The growth of information on the web is enormous, so search engines play an increasingly critical role in finding relations between input keywords. Similarity measures are widely used in Information Retrieval (IR) and are an important component of various tasks on the web, such as relation extraction, community mining, document clustering, and automatic metadata extraction. The main goal of clustering is to partition documents into homogeneous groups according to their characteristics. Text clustering aims to discover the group structure within a collection of text documents and to assign the documents to the most relevant groups. Text clustering groups documents in an unsupervised way: there is no label or class information, so clustering methods must discover the connections between documents and then cluster the documents based on those connections.

Given huge volumes of documents, a good document clustering method organizes them into meaningful groups, which makes further browsing and navigation of the corpus much easier. A basic idea of text clustering is to find which documents have many words in common and to place the documents with the most words in common into the same group. Each cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to objects in other clusters. This paper discusses the role of similarity in clustering, how to find similarity between retrieved web documents, and the problems faced by similarity measures.
Section 2 presents a review of related work.
Section 3 discusses challenging issues in web clustering based on similarity.
Section 4 describes similarity measures for text document clustering.
Section 5 presents the conclusion.
II. Review Of Related Literature
Information on the Web is present in the form of text documents (formatted in HTML), which is why many Web document processing systems are rooted in text data mining techniques [14]. The growth of information on the web has led to a drastic expansion of the field of information retrieval. Document clustering provides efficient information retrieval and navigation: it is the process of automatically grouping related documents into clusters. Instead of searching entire documents for relevant information, these clusters improve efficiency and avoid overlapping content. Relevant documents can be efficiently retrieved and accessed by means of document clustering.

String-based similarity comprises measures such as character-based and term-based similarity [1]. Wael H. Gomaa and Aly A. Fahmy surveyed different types of similarity approaches and concluded that hybrid similarity gives better results [2]. Finding similarity between documents is useful in clustering, and clustering is used to find intrinsic structures in data and organize them into subgroups
[3]. Mark Dixon [4] describes a framework for text mining that is closely related to IE and IR. Different similarity measures were explained by Anna Huang [5]. As claimed by [6], ambiguity is still the major "world problem" in text mining applications. As a result, most approaches for clustering non-segmented documents consist of two phases: a text mining process to extract keywords, and a document clustering process to compute the similarity between the input documents [8]. The vector space representation of text is an incredibly powerful tool: any text can be treated as a vector in a V-dimensional vector space (Jaime Arguello) [11]. Documents are matched with a query based on their similarity; if a document is similar to the query, it is likely to be relevant. Non-binary weights for index terms in queries and documents are used to calculate the degree of similarity, and sorting the retrieved documents in decreasing order of this degree yields a ranking with partial matching (Manwar et al.) [12]. In [13], Wei Ning showed that on some document corpora K-Means can achieve even better performance with the help of SVD, although including SVD in the clustering process may consume more time; K-Means is therefore a good candidate for text mining and for organizing large document corpora. Further research might examine the feasibility of combining frequent itemsets with K-Means.
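To make the vector-space matching described above concrete, the following minimal sketch ranks documents against a query by cosine similarity over raw term-frequency vectors. The toy corpus, whitespace tokenizer, and frequency weights are our own illustrative assumptions, not taken from the paper:

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector: each distinct word is one dimension.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine of the angle between two sparse term vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

docs = [
    "web document clustering groups similar documents",
    "search engines retrieve documents for a query",
    "heat exchangers condense water vapor",
]
query = tf_vector("clustering web documents")
# Rank documents by decreasing cosine similarity to the query (partial matching).
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(d)), reverse=True)
print(ranked[0])  # the clustering document ranks first
```

A production system would replace the raw counts with tf-idf weights, but the ranking mechanism is the same.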
In essence, document clustering groups documents based on how related they are. To cluster documents correctly, it is very important to measure how related one document is to another, and this degree of relatedness should be a real number so that it can be compared [13]. The authors also explain quality measures for clustering: two basic measures for evaluating how good a clustering algorithm is are the precision rate and the recall rate.
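The two rates can be computed directly from set overlap. A minimal sketch for a single cluster measured against one reference class; the document IDs and assignments below are made up for illustration:

```python
# Precision and recall for one cluster, relative to a reference class.
cluster = {"d1", "d2", "d3", "d4"}   # documents the algorithm grouped together
relevant = {"d1", "d2", "d5"}        # documents that truly belong to the class

true_positives = len(cluster & relevant)
precision = true_positives / len(cluster)   # fraction of the cluster that is correct
recall = true_positives / len(relevant)     # fraction of the class that was found

print(precision, recall)  # precision = 0.5, recall ≈ 0.667
```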
III. Challenging Issues In Web Clustering Based On Similarity
The major challenging issues in text mining arise from the complexity of natural language itself. Natural language is not free from the ambiguity problem. Ambiguity means the capability of being understood in two or more possible senses or ways: one word may have different meanings, and one phrase or sentence may be interpreted in various ways, yielding multiple meanings. Although a number of studies have addressed the ambiguity problem, the work is still immature.
Fig 1: Challenging issues in Web document clustering
Major problems and the impact of similarity measures in web clustering are discussed below:
i) The World Wide Web is a huge, widely distributed, global information service centre, and retrieving accurate information for users through a search engine faces many problems. Accurately measuring the semantic similarity between words is an important problem, and efficient estimation of semantic similarity between words is critical for various natural language processing tasks such as word sense disambiguation, textual entailment, and automatic text summarization. In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query; retrieving accurate information for such similar words is challenging. An existing system proposed an architecture and method to measure semantic similarity between words using snippets, page counts, and a two-class support vector machine. Context-Aware Semantic Association Ranking can be applied to enhance the results with ranking [7].
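The page-count approach mentioned above is in the same spirit as the Normalized Google Distance (NGD), which estimates the relatedness of two words from search-engine hit counts. A minimal sketch of the NGD formula, using invented page counts rather than real search statistics:

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts:
    fx, fy -- hits for each word alone
    fxy    -- hits for both words together
    n      -- total number of indexed pages
    Smaller values mean the words co-occur more, i.e. are more related."""
    lfx, lfy, lfxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lfx, lfy) - lfxy) / (math.log(n) - min(lfx, lfy))

# Hypothetical counts: a frequently co-occurring pair vs. a rarely co-occurring one.
print(ngd(fx=1_000_000, fy=800_000, fxy=400_000, n=10_000_000_000))  # small distance
print(ngd(fx=1_000_000, fy=900_000, fxy=1_200, n=10_000_000_000))    # large distance
```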
(Fig. 1 summarizes the challenges: semi-structured data, unstructured data that is not readily accessible by computers, ambiguity at the lexical, syntactic, semantic, and pragmatic levels, and the need for training examples.)
ii) Cluster validity: Recent developments in data stream clustering have heightened the need for suitable criteria to validate results. The outcomes of most methods depend on the specific application; nevertheless, choosing suitable criteria for evaluating results remains one of the most important challenges in this area.
iii) LIN, Zhenjiang [9] explained the challenges in link-based similarity measures. They first proposed two novel neighbor-based similarity measures, MatchSim and PageSim, in which unimportant neighbors are pruned. MatchSim takes the similarity between neighbors into account by recursively defining the similarity between objects as the average similarity of their maximally matched similar neighbors. PageSim measures the influence of indirect neighbors by adopting a feature-propagation strategy. They also proposed the Extended Neighborhood Structure (ENS) model, a bi-directional, multi-hop model that helps neighbor-based methods achieve higher accuracy.

Several challenges remain for researchers. First, the efficiency of the MatchSim algorithm must be improved in order to make it practical. Second, MatchSim and PageSim prune unimportant neighbors according to PageRank scores, but other ranking methods, such as an IDF-like weighting scheme, may produce better results. Third, many kinds of object properties can be exploited to measure similarity, so integrating link-based methods or similarity results with others is always a practical issue; in particular, integrating an IDF-like weighting scheme with link-based methods to produce better results remains an open challenge.
iv) Clustering is a widely used technique to partition a set of heterogeneous data into homogeneous, well-separated groups. Its main characteristic is that it does not require a priori knowledge about the nature and hidden structure of the data domain. The thesis [10] investigates clustering techniques and their applications to Web text and video information retrieval, focusing on web snippet clustering, video summarization, and similarity searching. For web snippets, clustering is used to organize the results returned by one or more search engines in response to a user query on the fly. The main difficulties concern the poor informative strength of snippets, the strict time constraints, and cluster labeling.
v) In the same thesis [10], Filippo Geraci focused on the problem of clustering in the web scenario. To reduce the processing time of adding points to clusters, a Medoid Furthest-Point-First (FPF) heuristic algorithm is used. In similarity searching over semi-structured text documents, one may want to allow the user to assign different weights to each field at query time. This requirement raises the problem of vector score aggregation, which complicates preprocessing: the weights are unknown at that time, so a data structure able to handle all possible weight assignments must be built. The thesis concludes with preliminary ideas for ongoing research. Clustering algorithms spend most of their processing time adding points to clusters while searching for the closest center; this time can be reduced with a careful similarity-searching scheme. The author plans to develop a general scheme for approximate FPF that exploits approximate similarity searching to reduce running time without any data dependence.
IV. Similarity Measures for Text Document Clustering
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. A wide variety of document distance functions and similarity measures have been used for clustering, such as Euclidean distance and relative entropy.

Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either their pairwise similarity or their distance. A wide variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity and the Jaccard correlation coefficient. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well: measures such as Euclidean distance and relative entropy have been applied in clustering to calculate pairwise distances.
4.1 String Based Similarity
4.1.1 Character Based Similarity
Longest Common Substring (LCS). This algorithm measures the similarity between two strings based
on the length of the longest contiguous chain of characters that occurs in both strings.
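For illustration, the LCS length can be computed with a standard dynamic-programming table; the sketch below (the function name and the choice to return the substring itself are ours, not taken from any of the surveyed works) is a minimal Python implementation.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous chain of characters shared by a and b."""
    # dp[i][j] = length of the common substring ending at a[i-1] and b[j-1]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
    return a[best_end - best_len:best_end]

print(longest_common_substring("clustering", "filtering"))  # "tering"
```

The table costs O(|a|·|b|) time and space; a single-row table suffices when only the length is required.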
Levenshtein and Damerau-Levenshtein
The Levenshtein distance between two strings is defined as the minimum number of edits needed to
transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution
of a single character. The Damerau-Levenshtein variant additionally allows the transposition of two adjacent characters as a single edit.
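The definition above maps directly onto the classic dynamic-programming recurrence; the following minimal sketch (ours, using a two-row table to save space) computes the plain Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to each b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to b[:0]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A Damerau-Levenshtein version would add one more case to the `min`, covering the swap of two adjacent characters.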
4. Challenging Issues and Similarity Measures for Web Document Clustering
DOI: 10.9790/0661-17145559 www.iosrjournals.org 58 | Page
Jaro–Winkler
In computer science and statistics, the Jaro–Winkler distance is a measure of similarity between two
strings. It is a variant of the Jaro distance metric and is mainly used in the area of record linkage. Despite its
name, it behaves as a similarity: the higher the Jaro–Winkler value for two strings, the more similar the strings
are. The Jaro–Winkler metric is designed for, and best suited to, short strings such as person names. The score
is normalized such that 0 equates to no similarity and 1 is an exact match.
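As a sketch of how the score is obtained (a minimal implementation of ours, using the standard prefix scaling factor p = 0.1 and a shared prefix capped at four characters), the Jaro similarity is computed first and then boosted for a common prefix:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters within a sliding
    window and the number of transpositions among them."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among matched characters
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t /= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```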
Smith-Waterman
The Smith–Waterman algorithm performs local sequence alignment; that is, it determines similar
regions between two strings, or between nucleotide or protein sequences. Instead of looking at the total sequence,
the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.
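A minimal, score-only sketch of the local-alignment recurrence follows (the scoring parameters are illustrative choices of ours, not fixed by the algorithm): each cell is clamped at zero, which is what allows an alignment to start and end anywhere inside the strings.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best-scoring local alignment between a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # clamping at 0 is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("clustering", "filtering"))  # 12: the shared region "tering"
```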
N-gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n
items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base
pairs, depending on the application. N-grams are typically collected from a text or speech corpus.
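Extracting n-grams is a one-line slicing exercise; the helpers below (names are ours) produce character and word n-grams respectively.

```python
def char_ngrams(text: str, n: int = 3) -> list:
    # contiguous length-n slices of a string
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens: list, n: int = 2) -> list:
    # contiguous length-n runs of tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("text", 2))                             # ['te', 'ex', 'xt']
print(word_ngrams("web document clustering".split(), 2))
```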
4.1.2 Term based Similarity
Block distance, also known as Manhattan distance, L1 distance, city-block distance, boxcar distance
or absolute-value distance, computes the distance that would be traveled to get from one data point to the
other if a grid-like path is followed. The block distance between two items is the sum of the absolute
differences of their corresponding components.
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures
the cosine of the angle between them.
Dice's coefficient is defined as twice the number of common terms in the compared strings divided by the
total number of terms in both strings.
Euclidean distance, or L2 distance, is the square root of the sum of squared differences between
corresponding elements of the two vectors.
Jaccard similarity is computed as the number of shared terms over the number of all unique terms in both
strings.
Fig. 2: Term-based similarity
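Each of the term-based measures listed above fits in a few lines of Python; the sketch below (helper names are ours) treats the vector measures as operating on equal-length weight vectors and the set measures on collections of terms.

```python
import math

def block_distance(u, v):
    # L1 / Manhattan / city-block distance
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    # L2 distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # cosine of the angle between the two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def jaccard(s, t):
    # shared terms over all unique terms
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

def dice(s, t):
    # twice the shared terms over the total term count
    s, t = set(s), set(t)
    return 2 * len(s & t) / (len(s) + len(t))

a = "web document clustering".split()
b = "document similarity".split()
print(jaccard(a, b))  # 0.25: one shared term out of four unique terms
print(dice(a, b))     # 0.4
```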
4.2 Corpus Based Similarity
Corpus-based similarity is a semantic similarity measure that determines the similarity between words
according to information gained from large corpora. A corpus is a large collection of written or spoken texts
that is used for language research.
Latent Semantic Analysis
LSA is a technique for analyzing relationships between a set of documents and the terms they contain
by producing a set of concepts related to the documents and terms. In the context of its application to
information retrieval, it is called latent semantic indexing (LSI).
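A toy sketch of the idea follows (the term-document counts are invented purely for illustration): a truncated SVD projects documents into a low-dimensional concept space, where documents that share vocabulary end up close together even though the raw vectors are sparse.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2, 1, 0, 0],   # "web"
    [1, 2, 0, 0],   # "cluster"
    [0, 0, 3, 1],   # "protein"
    [0, 0, 1, 3],   # "genome"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # number of latent concepts kept
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T    # one row per document

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(docs_latent[0], docs_latent[1]))   # documents 0 and 1 share terms: high
print(cos(docs_latent[0], docs_latent[2]))   # no shared vocabulary: near zero
```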
Explicit Semantic Analysis (ESA)
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a
vectorial representation of text (individual words or entire documents) that uses a document corpus as a
knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text
corpus and a document (string of words) is represented as the centroid of the vectors representing its words.
Pointwise Mutual Information
Pointwise mutual information (PMI), or point mutual information, is a measure of association used in
information theory and statistics.
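For two words x and y, PMI compares their observed co-occurrence probability with the probability expected if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A minimal sketch (the toy corpus and function name are ours) estimates these probabilities by document co-occurrence:

```python
import math

def pmi(corpus, x, y):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with probabilities
    estimated as the fraction of documents containing the word(s)."""
    n = len(corpus)
    px = sum(x in doc for doc in corpus) / n
    py = sum(y in doc for doc in corpus) / n
    pxy = sum(x in doc and y in doc for doc in corpus) / n
    return math.log2(pxy / (px * py))

corpus = [
    ["web", "document", "clustering"],
    ["web", "search", "engine"],
    ["document", "clustering", "similarity"],
    ["protein", "genome"],
]
print(pmi(corpus, "document", "clustering"))  # 1.0: they always co-occur here
```

A positive score indicates the words co-occur more often than chance would predict; zero indicates independence.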
V. Conclusion
This paper has given a detailed description of text mining and its framework, the challenging issues in
similarity-based web clustering, and the different similarity measures: string-based, corpus-based, knowledge-based
and hybrid. Finally, we conclude that accurate clustering requires a precise definition of the closeness
between a pair of objects, in terms of either their pairwise similarity or distance.
References
[1]. Wael H. Gomaa and Aly A. Fahmy, "Short Answer Grading Using String Similarity and Corpus-Based Similarity", International Journal of Advanced Computer Science and Applications, Vol. 3, No. 11, 2012.
[2]. Wael H. Gomaa and Aly A. Fahmy, "A Survey of Text Similarity Approaches", International Journal of Computer Applications (0975-8887), Vol. 68, No. 13, April 2013.
[3]. Kalaivendhan K. and Sumathi P., "An Efficient Clustering Method to Find Similarity Between the Documents", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 2, Special Issue 1, March 2014.
[4]. Mark Dixon (1997), "An Overview of Document Mining Technology", http://www.geocities.com/ResearchTriangle/Thinktank1997/mark/writing/dix97-dm.ps
[5]. Anna Huang, "Similarity Measures for Text Document Clustering", Proceedings of the New Zealand Computer Science Research Student Conference, April 2008, New Zealand.
[6]. Shaidah Jusoh and Hejab M. Alfawareh, "Techniques, Applications and Challenging Issues in Text Mining", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 6, No. 2, November 2012.
[7]. P. Ilakiya, "Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Vol. 2, Issue 6, June 2013.
[8]. Todsanai Chumwatana, Kok Wai Wong and Hong Xie, "A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts", Journal of Intelligent Learning Systems & Applications, Vol. 2, pp. 117-125, 2010.
[9]. Zhenjiang Lin, PhD thesis: "Link-based Similarity Measurement Techniques and Applications", Computer Science and Engineering, The Chinese University of Hong Kong, September 2011.
[10]. Filippo Geraci, PhD thesis: "Fast Clustering for Web Information Retrieval", Anno Accademico 2007-2008.
[11]. A. B. Manwar, Hemant S. Mahalle, K. D. Chinchkhede and Vinay Chavan, "A Vector Space Model for Information Retrieval: MATLAB Approach", Indian Journal of Computer Science and Engineering, Vol. 3, No. 2, pp. 222-229, 2012.
[12]. S. K. Jayanthi and S. Prema, "Facilitating Efficient Integrated Semantic Web Search with Visualization and Data Mining Techniques", Proceedings of the International Conference on Information and Communication Technologies, pp. 437-442, 2010.
[13]. Wei Ning, PhD thesis: "Textmining and Organization in Large Corpus", Kongens Lyngby, 2005.
[14]. G. Thilagavathi, J. Anitha and K. Nethra, "Sentence-Similarity Based Document Clustering Using Fuzzy Algorithm", International Journal of Advance Foundation and Research in Computer (IJAFRC), Vol. 1, Issue 3, March 2014, ISSN 2348-4853.