SlideShare a Scribd company logo
1 of 5
Download to read offline
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan – Feb. 2015), PP 55-59
www.iosrjournals.org
DOI: 10.9790/0661-17145559 www.iosrjournals.org 55 | Page
Challenging Issues and Similarity Measures for Web Document
Clustering
S. Mahalakshmi
Research Scholar, Bharathiar University, Coimbatore, India.
Abstract: Web itself contains a large amount of documents available in electronic form. The available
documents are in various forms and the information in them is not in organized form. The lack of organization
of materials in the WWW motivates people to automatically manage the huge amount of information. Text-
mining refers generally to the process of extracting interesting and non-trivial information and knowledge
from unstructured text. Text mining framework contains Information Retrieval, Information Extraction,
Information Mining and Interpretation. During Information Retrieval, so many web documents are retrieved.
In that how we can find out similar documents among retrieved? This paper deals with the challenging
issues and similarity measures for web document clustering .
Key words: Text Mining, Information Retrieval,Framework,Information Extraction,Similarity,Clustering
I. Introduction
The growth of information in the web is too large, so search engine come to play a more critical role
to find relation between input keywords. Similarity Measure is widely used in Information Retrieval (IR) and
also it is important component in various tasks on the web such as relation extraction, community mining,
document clustering, and automatic metadata extraction. The main goal of clustering is to partition documents
into homogeneous groups according to their characteristics and abilities. Text Clustering is to find out the
groups information from the text documents and cluster these documents into the most relevant groups. Text
clustering groups the document in an unsupervised way and there is not label or class information. Clustering
methods have to discover the connections between the document and then based on these connections the
documents are clustered.
Given huge volumes of documents, a good document clustering method may organize those huge
numbers of documents into meaningful groups, which enable further browsing and navigation of this corpus be
much easier. A basic idea of text clustering is to find out
which documents have many words in common and place these documents with the most words in
common into same group Each cluster is a collection of data objects that are similar to one another are placed
within the same cluster but are dissimilar to objects in other clusters. This paper discuss about the role of
similarity in clustering and how to find similarity between the retrieved web documents and the problems
faced during similarity measures.
 Section 2 presents Review of Related Work.
 Section 3 introduces Challenging Issues in Text Mining.
 Section 4 Describes Challenging Issues In Web Clustering Based On Similarity
 Section 5 describes Conclusion.
II. Review Of Related Literature
Information on the Web is present in the form of text documents (formatted in HTML), and that is the
reason many Web document processing systems are rooted in text data mining techniques. [14]Due to the
growth of information in web leads to drastic increase in field of information retrieval. Efficient information
retrieval and navigation is provided by document clustering. Document clustering is the process of
automatically grouping the related documents into clusters. Instead of searching entire documents for relevant
information, these clusters will improve the efficiency and avoid overlapping of contents. Relevant document
can be efficiently retrieved and accessed by means of document clustering.
String based similarity contains similarity measures like character based and term based
similarity [1]. Wael H. Gomaa , Aly A. Fahmy explained different types of similarity approaches and finally
they conclude that hybrid similarity will give better results[2].Finding similarity between documents will be
useful in clustering and clustering is used to find intrinsic structures in data and organize them into subgroups
Challenging Issues and Similarity Measures for Web Document Clustering
DOI: 10.9790/0661-17145559 www.iosrjournals.org 56 | Page
[3]. Mark.Dixon[4] describes about the framework for text mining and it is closely related with IE and
IR.Different similarity measures were explained by Anna Huang [5]. As claimed by [6], the ambiguity is still
the major “world problem” in text mining applications. As a result, most approaches for clustering non-
segmented documents consist of two phases: a text mining process to extract the keywords, and a document
clustering process to compute the similarity between the input documents[8]. The vector space representation
of text is an incredibly powerful tool. Any text can be treated as a vector in a V-dimensional vector-space
(Jaime Arguello) [11]. Documents are matched with a query based on their similarity. If a document is similar
to the query, it is likely to be relevant. Non-binary weights for index terms in queries and documents are used
in the calculation of degree of similarity. Decreasing order of this degree of similarity for the retrieved
documents gives the ranked documents with partial match (Manwar et al.) [12]. In research paper[13], Wei
Ning proved that in some document corpus K-Means might even achieve a better performance with the help of
SVD although including SVD into the clustering processing might result in more time consumption. Thereby
we suggest K-Means a good candidate on text mining and organization of large document corpus. Further
research might be made on the feasibility of combining Frequent Itemset and K-Means.
In essence, document clustering is to group documents based on how relative they are. To cluster
documents correctly, it is very important to measure how much a document is relative to another. And the
extent of relativeness should be some real numbers and can be compared[13].Also they explained about
quality measures for clustering . Basically there are two measures to evaluate how good a clustering algorithm
is. One is Precision rate and the other is Recall rate.
III. Challenging Issues In Web Clustering Based On Similarity
The major challenging issue in text mining arise from the complexity of the natural language
itself.The natural language is not free from the ambiguity problem.Ambiguity means the capability of
being understood in two or more possible senses or ways.One word may have different meanings.One
phrase or sentence may have multiple meanings.One phrase or sentence can be interpreted in various
ways,thus various meanings can be obtained.Although a number of Researches have been conducted in
Resolving the ambiguity problem, the work is still immature.
Fig 1: Challenging issues in Web document clustering
Major problems and the impact of similarity measures in web clustering are discussed below:
i) The World Wide Web is huge, widely distributed; global information service centre in this retrieving
accurate information for users in Search Engine faces a lot of problems. This is due to accurately measuring
the semantic similarity between words is an important problem, and also efficient estimation of semantic
similarity between words is critical for various natural language processing tasks such as word sense
disambiguation, textual entailment, and automatic text summarization. In information retrieval, one of the
main problems is to retrieve a set of documents that is semantically related to a given user query. Retrieving
accurate information to users to such kind of similar words is challenging. Some existing system proposed an
architecture and method to measure semantic similarity between words, which consists of snippets, page-
counts and two class support vector machine. Context-Aware Semantic Association Ranking can be applied to
enhance the results with ranking [7].
Semi
structured
data
Not readily
accecible by
computers
Unstructured
data
Ambiguity
Need training
examples
SematicSyntacticLexical Pragmatic
Challenge
s
Challenging Issues and Similarity Measures for Web Document Clustering
DOI: 10.9790/0661-17145559 www.iosrjournals.org 57 | Page
i) Cluster Validity: Recent developments in data stream clustering have heightened the need for
determining suitable criteria to validate results. Most outcomes of methods are depended to specific
application. However, employing suitable criteria in results evaluation is one of the most important challenges
in this area.
ii) Authors LIN, Zhenjiang [9] explained about challenges in similarity measures using link based
methods. They explained that unimportant neighbors are pruned using MatchSim and PageSim Algorithms.
They first proposed two novel neighbor-based similarity measures called MatchSim and PageSim, respectively.
MatchSim takes the similarity between neighbors into account by defining recursively the similarity between
objects by the average similarity of their maximum matched similar neighbors. PageSim measures the
influences of indirect neighbors by adopting feature propagation strategy.They also proposed the Extended
Neighborhood Structure (ENS) model which defines a bi-directional and multi-hop model, to help neighbor-
based methods achieve higher accuracy.
So first there is a challenge for the researches to improve the efficiency of MatchSim algorithm in
order to make it practical. Second, in MatchSim and PageSim, we prune unimportant neighbors according to
the PageRank scores. There are other possible ranking methods, such as IDF-like weighting scheme, which
may help to produce better results. Third, many kinds of properties of objects can be exploited to measure
similarity, so how to integrate the link-based methods or similarity results with others is always a practical
issue for us. So there is a challenge for researchers to integrate IDF-like weighting scheme with link
based methods to produce better results.
iii)Clustering is a widely used technique to partition a set of heterogeneous data to homogeneous and well
separated groups. Its main characteristic is that it does not require a-priori knowledge about the nature and the
hidden structure of the data domain. In this thesis [10], they investigate clustering techniques and their
applications to Web text and video information retrieval. In particular they focus on: web snippets clustering,
video summarization and similarity searching. For web snippets, clustering is used to organize the results
returned by one or more search engines in response to a user query on the fly. The main difficulties concern:
the poor informative strength of snippets, the strict time constraints and the cluster labeling.
iv)In thesis [10], Filippo Geraci focused on the problem of clustering in the web scenario. To reduce the
processing time in adding points to clusters they used Medoid Furthest Point First heuristic algorithm. In
similarity searching of semi structured text documents one should want to allow the user to assign different
weights to each field at query time. This requirement raises the problem of vector score aggregation which
complicates preprocessing because at that time weights are unknown and thus one should build a data structure
able to handle all the possible weight assignments. We conclude this thesis discussing some preliminary ideas
about on-going research directions. We observed that clustering algorithms spend most of the processing time
in adding points to the clusters, searching for the closest center. This time can be reduced using a keen
similarity searching scheme. We plan to develop a general scheme for approximate FPF that takes advantage
from the approximate similarity searching to reduce the algorithm running time without any data dependence.
IV. Similarity Measures for Text Document Clustering
Clustering is a useful technique that organize a large quantity of unordered text documents
into a small number of meaningful and coherent clusters. A wide variety of document distance functions
and similarity measures have been used for clustering such as Euclidean Distance and relative entropy.
Accurate clustering requires a precise definition of the closeness between a pair of objects ,in
terms of either the pair wise similarity or distance. A wide variety of similarity or distance measure
have been proposed and widely applied such as cosine similarity And the Jaccard Correlation coefficient.
Meanwhile similarity is often conceived in terms of dissimilarity or distance as well. Measures
such as Euclidean Distance and Relative entropy have been applied in clustering to calculate the pair wise
distances.
4.1 String Based Similarity
4.1.1 Character Based Similarity
 Longest Common Substring (LCS). This algorithm considers the similarity between two strings is based
on the length of contiguous chain of characters that exist in both strings.
 Damerau-Levenshtein Jaro
The Levenshtein distance between two strings is defined as the minimum number of edits needed to
transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution
of a single character.
Challenging Issues and Similarity Measures for Web Document Clustering
DOI: 10.9790/0661-17145559 www.iosrjournals.org 58 | Page
 Jaro–Winkler
In computer science and statistics, the Jaro–Winkler distance is a measure of similarity between two
strings. It is a variant of the Jaro distance metric and mainly used in the area of record linkage. The higher the
Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is
designed and best suited for short strings such as person names. The score is normalized such that 0 equates to
no similarity and 1 is an exact match.
 Smith-Waterman
The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar
regions between two strings or nucleotide or protein sequences. Instead of looking at the total sequence, the
Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure
 N-gram
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n
items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base
pairs according to the application. The n-grams typically are collected from a text or speech corpus.
4.1.2 Term based Similarity
 Block Distance is also known as Manhattan distance, boxcar distance, absolute value distance, L1
distance, city block distance and Manhattan distance. It computes the distance that would be traveled to
get from one data point to the other if a grid-like path is followed. The Block distance between two items
is the sum of the differences of their corresponding components
 Cosine similarity is a measure of similarity between two vectors of an inner product space that measures
the cosine of the angle between them.
 Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the
total number of terms in both strings .
 Euclidean distance or L2 distance is the square root of the sum of squared differences between
corresponding elements of the two vectors.
 Jaccard similarity is computed as the number of shared terms over the number of all unique terms in both
strings .
Fig 2: Term based Similarity
4.2 Corpus Based Similarity
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words
according to information gained from large corpora. A Corpus is a large collection of written or spoken texts
that is used for language research.
 Latent Semantic Analysis
Challenging Issues and Similarity Measures for Web Document Clustering
DOI: 10.9790/0661-17145559 www.iosrjournals.org 59 | Page
LSA is a technique of analyzing relationships between a set of documents and the terms they contain
by producing a set of concepts related to the documents and terms. In the context of its application to
information retrieval, it is called LSI.
 Explicit Semantic Analysis (ESA)
In natural language processing and information retrieval, explicit semantic analysis (ESA) is a
vectorial representation of text (individual words or entire documents) that uses a document corpus as a
knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text
corpus and a document (string of words) is represented as the centroid of the vectors representing its words.
 Pointwise Mutual Information
Pointwise mutual information (PMI), or point mutual information, is a measure of association used in
information theory and statistics.
V. Conclusion
This paper explains detailed description about text mining and its framework, Challenging issues in
web clustering based on similarity, Different similarity measures such as string based, corpusbased, knowledge
based and hybrid based similarity. Finally we come to know that, Accurate clustering requires a precise
definition of the closeness between a pair of objects ,in terms of either the pair wise similarity or distance.
References
[1]. Wael H. Gomaa , Aly A. Fahmy “ Short Answer Grading Using String Similarity And Corpus-Based Similarity”,International Journal
of Advanced Computer Science and Applications, Vol. 3, No. 11, 2012
[2]. Wael H. Gomaa , Aly A. Fahmy , “ A Survey of Text Similarity Approaches ”, International Journal of Computer Applications (0975 –
8887) Volume 68– No.13, April 2013
[3]. Kalaivendhan.K, Sumathi.P , “An Efficient Clustering Method To Find Similarity Between The Documents ” ,International Journal of
Innovative Research in Computer and Communication Engineering, Vol.2, Special Issue 1, March 2014
[4]. Mark.Dixon(1997), “An overview of Document Mining Technology”,http://www.geocities.com/Research Triangle
/Thinktank1997/mark/writing/dix 97-dm.ps
[5]. Anna Huang, “Similarity measures for Text document” ,Proceedings of the New Zealand CS Research Student Conference , April
2008, New Zealand.
[6]. Shaidah Jusoh and Hejab M. Alfawareh,“ Techniques, Applications and Challenging Issues in Text Mining”,IJCSI International
Journal of Computer Science Issues, Vol. 9, Issue 6, No 2, November 2012
[7]. P.Ilakiya, “Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking”
,International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) ,Volume 2, Issue 6, June 2013,
[8]. Todsanai Chumwatana, Kok Wai Wong, Hong Xie, “A SOM-Based Document Clustering Using Frequent Max Substrings for Non-
Segmented Texts “”,J. Intelligent Learning Systems & Applications, 2010, 2, 117-125
[9]. LIN, Zhenjiang ,Phd Thesis on“Link-based Similarity Measurement Techniques and Applications” , Computer Science and Engineering
,The Chinese University of Hong Kong .September 2011
[10]. Filippo Geraci,Phd Thesis: “Fast Clustering For Web Information Retrieval”, Anno Accademico 2007-2008
[11]. A. B. Manwar, Hemant S. Mahalle , K. D. Chinchkhede and Vinay Chavan , “A vector space model for information retrieval: matlab
approach”, Indian Journal of Computer Science and Engineering, Vol. 3, No. 2, pp. 222-229, 2012.
[12]. S. K. Jayanthi and S. Prema, “ Facilitating Efficient Integrated Semantic Web Search with Visualization and Data Mining Techniques”,
Proceedings of International Conference on Information and Communication Technologies, pp. 437 – 442, 2010.
[13]. Wei Ning, Phd Thesis :“Textmining and Organization in Large Corpus”, Kongens Lyngby 2005
[14]. G.Thilagavathi, J.Anitha, K.Nethra,“Sentence-Similarity Based Document Clustering Using Fuzzy Algorithm”, International Journal of
Advance Foundation and Research in Computer (IJAFRC) Volume 1, Issue 3, March 2014. ISSN 2348 - 4853

More Related Content

What's hot

Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...IJERA Editor
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536IJRAT
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
A Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media DataA Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media DataIOSR Journals
 
Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation ijmpict
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...IJMIT JOURNAL
 
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ijaia
 
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...IJDKP
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
 
On nonmetric similarity search problems in complex domains
On nonmetric similarity search problems in complex domainsOn nonmetric similarity search problems in complex domains
On nonmetric similarity search problems in complex domainsunyil96
 
Semantic Query Optimisation with Ontology Simulation
Semantic Query Optimisation with Ontology SimulationSemantic Query Optimisation with Ontology Simulation
Semantic Query Optimisation with Ontology Simulationdannyijwest
 
Sentimental classification analysis of polarity multi-view textual data using...
Sentimental classification analysis of polarity multi-view textual data using...Sentimental classification analysis of polarity multi-view textual data using...
Sentimental classification analysis of polarity multi-view textual data using...IJECEIAES
 

What's hot (15)

Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
 
Paper id 37201536
Paper id 37201536Paper id 37201536
Paper id 37201536
 
Ngdm09 han gao
Ngdm09 han gaoNgdm09 han gao
Ngdm09 han gao
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
A Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media DataA Review: Text Classification on Social Media Data
A Review: Text Classification on Social Media Data
 
Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation Annotation Approach for Document with Recommendation
Annotation Approach for Document with Recommendation
 
Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...Classification-based Retrieval Methods to Enhance Information Discovery on th...
Classification-based Retrieval Methods to Enhance Information Discovery on th...
 
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
ESTIMATION OF REGRESSION COEFFICIENTS USING GEOMETRIC MEAN OF SQUARED ERROR F...
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
 
On nonmetric similarity search problems in complex domains
On nonmetric similarity search problems in complex domainsOn nonmetric similarity search problems in complex domains
On nonmetric similarity search problems in complex domains
 
Introduction abstract
Introduction abstractIntroduction abstract
Introduction abstract
 
Semantic Query Optimisation with Ontology Simulation
Semantic Query Optimisation with Ontology SimulationSemantic Query Optimisation with Ontology Simulation
Semantic Query Optimisation with Ontology Simulation
 
Sentimental classification analysis of polarity multi-view textual data using...
Sentimental classification analysis of polarity multi-view textual data using...Sentimental classification analysis of polarity multi-view textual data using...
Sentimental classification analysis of polarity multi-view textual data using...
 

Viewers also liked (20)

D017332126
D017332126D017332126
D017332126
 
I012445764
I012445764I012445764
I012445764
 
U130304127141
U130304127141U130304127141
U130304127141
 
G017264447
G017264447G017264447
G017264447
 
Q01314109117
Q01314109117Q01314109117
Q01314109117
 
A012240105
A012240105A012240105
A012240105
 
K017247882
K017247882K017247882
K017247882
 
Core Components of the Metabolic Syndrome in Nonalcohlic Fatty Liver Disease
Core Components of the Metabolic Syndrome in Nonalcohlic Fatty Liver DiseaseCore Components of the Metabolic Syndrome in Nonalcohlic Fatty Liver Disease
Core Components of the Metabolic Syndrome in Nonalcohlic Fatty Liver Disease
 
P180305106115
P180305106115P180305106115
P180305106115
 
H012214651
H012214651H012214651
H012214651
 
D1803052831
D1803052831D1803052831
D1803052831
 
J012145256
J012145256J012145256
J012145256
 
I010235966
I010235966I010235966
I010235966
 
B010220709
B010220709B010220709
B010220709
 
H010215561
H010215561H010215561
H010215561
 
K010426371
K010426371K010426371
K010426371
 
L012137992
L012137992L012137992
L012137992
 
G017224349
G017224349G017224349
G017224349
 
I017616468
I017616468I017616468
I017616468
 
O01212115124
O01212115124O01212115124
O01212115124
 

Similar to Challenging Issues and Similarity Measures for Clustering Web Documents

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...IJORCS
 
A Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationA Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationWhitney Anderson
 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case StudyIJERA Editor
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...IOSR Journals
 
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGA NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGIJDKP
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...IJERA Editor
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCSCJournals
 
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...ijaia
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringIRJET Journal
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Editor IJARCET
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Editor IJARCET
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...ijdms
 
Text Mining: (Asynchronous Sequences)
Text Mining: (Asynchronous Sequences)Text Mining: (Asynchronous Sequences)
Text Mining: (Asynchronous Sequences)IJERA Editor
 
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...iosrjce
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpagescsandit
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET Journal
 

Similar to Challenging Issues and Similarity Measures for Clustering Web Documents (20)

Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
A Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document ClassificationA Comparative Study Of Citations And Links In Document Classification
A Comparative Study Of Citations And Links In Document Classification
 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case Study
 
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...Correlation Coefficient Based Average Textual Similarity Model for Informatio...
Correlation Coefficient Based Average Textual Similarity Model for Informatio...
 
C017161925
C017161925C017161925
C017161925
 
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGA NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...
AN EFFICIENT APPROACH FOR SEMANTICALLYENHANCED DOCUMENT CLUSTERING BY USING W...
 
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERINGAN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020
 
Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020Volume 2-issue-6-2016-2020
Volume 2-issue-6-2016-2020
 
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
Interpreting the Semantics of Anomalies Based on Mutual Information in Link M...
 
Text Mining: (Asynchronous Sequences)
Text Mining: (Asynchronous Sequences)Text Mining: (Asynchronous Sequences)
Text Mining: (Asynchronous Sequences)
 
F017243241
F017243241F017243241
F017243241
 
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
 
Topic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep WebpagesTopic Modeling : Clustering of Deep Webpages
Topic Modeling : Clustering of Deep Webpages
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
 

More from IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Recently uploaded

Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 

Recently uploaded (20)

Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 

Challenging Issues and Similarity Measures for Clustering Web Documents

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 1, Ver. IV (Jan – Feb. 2015), PP 55-59 www.iosrjournals.org DOI: 10.9790/0661-17145559 www.iosrjournals.org 55 | Page Challenging Issues and Similarity Measures for Web Document Clustering S. Mahalakshmi Research Scholar, Bharathiar University, Coimbatore, India. Abstract: Web itself contains a large amount of documents available in electronic form. The available documents are in various forms and the information in them is not in organized form. The lack of organization of materials in the WWW motivates people to automatically manage the huge amount of information. Text- mining refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining framework contains Information Retrieval, Information Extraction, Information Mining and Interpretation. During Information Retrieval, so many web documents are retrieved. In that how we can find out similar documents among retrieved? This paper deals with the challenging issues and similarity measures for web document clustering . Key words: Text Mining, Information Retrieval,Framework,Information Extraction,Similarity,Clustering I. Introduction The growth of information in the web is too large, so search engine come to play a more critical role to find relation between input keywords. Similarity Measure is widely used in Information Retrieval (IR) and also it is important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. The main goal of clustering is to partition documents into homogeneous groups according to their characteristics and abilities. Text Clustering is to find out the groups information from the text documents and cluster these documents into the most relevant groups. Text clustering groups the document in an unsupervised way and there is not label or class information. Clustering methods have to discover the connections between the document and then based on these connections the documents are clustered. Given huge volumes of documents, a good document clustering method may organize those huge numbers of documents into meaningful groups, which enable further browsing and navigation of this corpus be much easier. A basic idea of text clustering is to find out which documents have many words in common and place these documents with the most words in common into same group Each cluster is a collection of data objects that are similar to one another are placed within the same cluster but are dissimilar to objects in other clusters. This paper discuss about the role of similarity in clustering and how to find similarity between the retrieved web documents and the problems faced during similarity measures.  Section 2 presents Review of Related Work.  Section 3 introduces Challenging Issues in Text Mining.  Section 4 Describes Challenging Issues In Web Clustering Based On Similarity  Section 5 describes Conclusion. II. Review Of Related Literature Information on the Web is present in the form of text documents (formatted in HTML), and that is the reason many Web document processing systems are rooted in text data mining techniques. [14]Due to the growth of information in web leads to drastic increase in field of information retrieval. Efficient information retrieval and navigation is provided by document clustering. Document clustering is the process of automatically grouping the related documents into clusters. Instead of searching entire documents for relevant information, these clusters will improve the efficiency and avoid overlapping of contents. Relevant document can be efficiently retrieved and accessed by means of document clustering. String based similarity contains similarity measures like character based and term based similarity [1]. Wael H. Gomaa , Aly A. Fahmy explained different types of similarity approaches and finally they conclude that hybrid similarity will give better results[2].Finding similarity between documents will be useful in clustering and clustering is used to find intrinsic structures in data and organize them into subgroups
  • 2. Challenging Issues and Similarity Measures for Web Document Clustering DOI: 10.9790/0661-17145559 www.iosrjournals.org 56 | Page [3]. Mark.Dixon[4] describes about the framework for text mining and it is closely related with IE and IR.Different similarity measures were explained by Anna Huang [5]. As claimed by [6], the ambiguity is still the major “world problem” in text mining applications. As a result, most approaches for clustering non- segmented documents consist of two phases: a text mining process to extract the keywords, and a document clustering process to compute the similarity between the input documents[8]. The vector space representation of text is an incredibly powerful tool. Any text can be treated as a vector in a V-dimensional vector-space (Jaime Arguello) [11]. Documents are matched with a query based on their similarity. If a document is similar to the query, it is likely to be relevant. Non-binary weights for index terms in queries and documents are used in the calculation of degree of similarity. Decreasing order of this degree of similarity for the retrieved documents gives the ranked documents with partial match (Manwar et al.) [12]. In research paper[13], Wei Ning proved that in some document corpus K-Means might even achieve a better performance with the help of SVD although including SVD into the clustering processing might result in more time consumption. Thereby we suggest K-Means a good candidate on text mining and organization of large document corpus. Further research might be made on the feasibility of combining Frequent Itemset and K-Means. In essence, document clustering is to group documents based on how relative they are. To cluster documents correctly, it is very important to measure how much a document is relative to another. And the extent of relativeness should be some real numbers and can be compared[13].Also they explained about quality measures for clustering . Basically there are two measures to evaluate how good a clustering algorithm is. One is Precision rate and the other is Recall rate. III. Challenging Issues In Web Clustering Based On Similarity The major challenging issue in text mining arise from the complexity of the natural language itself.The natural language is not free from the ambiguity problem.Ambiguity means the capability of being understood in two or more possible senses or ways.One word may have different meanings.One phrase or sentence may have multiple meanings.One phrase or sentence can be interpreted in various ways,thus various meanings can be obtained.Although a number of Researches have been conducted in Resolving the ambiguity problem, the work is still immature. Fig 1: Challenging issues in Web document clustering Major problems and the impact of similarity measures in web clustering are discussed below: i) The World Wide Web is huge, widely distributed; global information service centre in this retrieving accurate information for users in Search Engine faces a lot of problems. This is due to accurately measuring the semantic similarity between words is an important problem, and also efficient estimation of semantic similarity between words is critical for various natural language processing tasks such as word sense disambiguation, textual entailment, and automatic text summarization. In information retrieval, one of the main problems is to retrieve a set of documents that is semantically related to a given user query. Retrieving accurate information to users to such kind of similar words is challenging. Some existing system proposed an architecture and method to measure semantic similarity between words, which consists of snippets, page- counts and two class support vector machine. Context-Aware Semantic Association Ranking can be applied to enhance the results with ranking [7]. Semi structured data Not readily accecible by computers Unstructured data Ambiguity Need training examples SematicSyntacticLexical Pragmatic Challenge s
  • 3. Challenging Issues and Similarity Measures for Web Document Clustering DOI: 10.9790/0661-17145559 www.iosrjournals.org 57 | Page i) Cluster Validity: Recent developments in data stream clustering have heightened the need for determining suitable criteria to validate results. Most outcomes of methods are depended to specific application. However, employing suitable criteria in results evaluation is one of the most important challenges in this area. ii) Authors LIN, Zhenjiang [9] explained about challenges in similarity measures using link based methods. They explained that unimportant neighbors are pruned using MatchSim and PageSim Algorithms. They first proposed two novel neighbor-based similarity measures called MatchSim and PageSim, respectively. MatchSim takes the similarity between neighbors into account by defining recursively the similarity between objects by the average similarity of their maximum matched similar neighbors. PageSim measures the influences of indirect neighbors by adopting feature propagation strategy.They also proposed the Extended Neighborhood Structure (ENS) model which defines a bi-directional and multi-hop model, to help neighbor- based methods achieve higher accuracy. So first there is a challenge for the researches to improve the efficiency of MatchSim algorithm in order to make it practical. Second, in MatchSim and PageSim, we prune unimportant neighbors according to the PageRank scores. There are other possible ranking methods, such as IDF-like weighting scheme, which may help to produce better results. Third, many kinds of properties of objects can be exploited to measure similarity, so how to integrate the link-based methods or similarity results with others is always a practical issue for us. So there is a challenge for researchers to integrate IDF-like weighting scheme with link based methods to produce better results. iii)Clustering is a widely used technique to partition a set of heterogeneous data to homogeneous and well separated groups. Its main characteristic is that it does not require a-priori knowledge about the nature and the hidden structure of the data domain. In this thesis [10], they investigate clustering techniques and their applications to Web text and video information retrieval. In particular they focus on: web snippets clustering, video summarization and similarity searching. For web snippets, clustering is used to organize the results returned by one or more search engines in response to a user query on the fly. The main difficulties concern: the poor informative strength of snippets, the strict time constraints and the cluster labeling. iv)In thesis [10], Filippo Geraci focused on the problem of clustering in the web scenario. To reduce the processing time in adding points to clusters they used Medoid Furthest Point First heuristic algorithm. In similarity searching of semi structured text documents one should want to allow the user to assign different weights to each field at query time. This requirement raises the problem of vector score aggregation which complicates preprocessing because at that time weights are unknown and thus one should build a data structure able to handle all the possible weight assignments. We conclude this thesis discussing some preliminary ideas about on-going research directions. We observed that clustering algorithms spend most of the processing time in adding points to the clusters, searching for the closest center. This time can be reduced using a keen similarity searching scheme. We plan to develop a general scheme for approximate FPF that takes advantage from the approximate similarity searching to reduce the algorithm running time without any data dependence. IV. Similarity Measures for Text Document Clustering Clustering is a useful technique that organize a large quantity of unordered text documents into a small number of meaningful and coherent clusters. A wide variety of document distance functions and similarity measures have been used for clustering such as Euclidean Distance and relative entropy. Accurate clustering requires a precise definition of the closeness between a pair of objects ,in terms of either the pair wise similarity or distance. A wide variety of similarity or distance measure have been proposed and widely applied such as cosine similarity And the Jaccard Correlation coefficient. Meanwhile similarity is often conceived in terms of dissimilarity or distance as well. Measures such as Euclidean Distance and Relative entropy have been applied in clustering to calculate the pair wise distances. 4.1 String Based Similarity 4.1.1 Character Based Similarity  Longest Common Substring (LCS). This algorithm considers the similarity between two strings is based on the length of contiguous chain of characters that exist in both strings.  Damerau-Levenshtein Jaro The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.
  • 4. Challenging Issues and Similarity Measures for Web Document Clustering DOI: 10.9790/0661-17145559 www.iosrjournals.org 58 | Page  Jaro–Winkler In computer science and statistics, the Jaro–Winkler distance is a measure of similarity between two strings. It is a variant of the Jaro distance metric and mainly used in the area of record linkage. The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.  Smith-Waterman The Smith–Waterman algorithm performs local sequence alignment; that is, for determining similar regions between two strings or nucleotide or protein sequences. Instead of looking at the total sequence, the Smith–Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure  N-gram In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. 4.1.2 Term based Similarity  Block Distance is also known as Manhattan distance, boxcar distance, absolute value distance, L1 distance, city block distance and Manhattan distance. It computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Block distance between two items is the sum of the differences of their corresponding components  Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.  Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings .  Euclidean distance or L2 distance is the square root of the sum of squared differences between corresponding elements of the two vectors.  Jaccard similarity is computed as the number of shared terms over the number of all unique terms in both strings . Fig 2: Term based Similarity 4.2 Corpus Based Similarity Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. A Corpus is a large collection of written or spoken texts that is used for language research.  Latent Semantic Analysis
  • 5. Challenging Issues and Similarity Measures for Web Document Clustering DOI: 10.9790/0661-17145559 www.iosrjournals.org 59 | Page LSA is a technique of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. In the context of its application to information retrieval, it is called LSI.  Explicit Semantic Analysis (ESA) In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words.  Pointwise Mutual Information Pointwise mutual information (PMI), or point mutual information, is a measure of association used in information theory and statistics. V. Conclusion This paper explains detailed description about text mining and its framework, Challenging issues in web clustering based on similarity, Different similarity measures such as string based, corpusbased, knowledge based and hybrid based similarity. Finally we come to know that, Accurate clustering requires a precise definition of the closeness between a pair of objects ,in terms of either the pair wise similarity or distance. References [1]. Wael H. Gomaa , Aly A. Fahmy “ Short Answer Grading Using String Similarity And Corpus-Based Similarity”,International Journal of Advanced Computer Science and Applications, Vol. 3, No. 11, 2012 [2]. Wael H. Gomaa , Aly A. Fahmy , “ A Survey of Text Similarity Approaches ”, International Journal of Computer Applications (0975 – 8887) Volume 68– No.13, April 2013 [3]. Kalaivendhan.K, Sumathi.P , “An Efficient Clustering Method To Find Similarity Between The Documents ” ,International Journal of Innovative Research in Computer and Communication Engineering, Vol.2, Special Issue 1, March 2014 [4]. Mark.Dixon(1997), “An overview of Document Mining Technology”,http://www.geocities.com/Research Triangle /Thinktank1997/mark/writing/dix 97-dm.ps [5]. Anna Huang, “Similarity measures for Text document” ,Proceedings of the New Zealand CS Research Student Conference , April 2008, New Zealand. [6]. Shaidah Jusoh and Hejab M. Alfawareh,“ Techniques, Applications and Challenging Issues in Text Mining”,IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 6, No 2, November 2012 [7]. P.Ilakiya, “Discovering Semantic Similarity between Words Using Web Document and Context Aware Semantic Association Ranking” ,International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) ,Volume 2, Issue 6, June 2013, [8]. Todsanai Chumwatana, Kok Wai Wong, Hong Xie, “A SOM-Based Document Clustering Using Frequent Max Substrings for Non- Segmented Texts “”,J. Intelligent Learning Systems & Applications, 2010, 2, 117-125 [9]. LIN, Zhenjiang ,Phd Thesis on“Link-based Similarity Measurement Techniques and Applications” , Computer Science and Engineering ,The Chinese University of Hong Kong .September 2011 [10]. Filippo Geraci,Phd Thesis: “Fast Clustering For Web Information Retrieval”, Anno Accademico 2007-2008 [11]. A. B. Manwar, Hemant S. Mahalle , K. D. Chinchkhede and Vinay Chavan , “A vector space model for information retrieval: matlab approach”, Indian Journal of Computer Science and Engineering, Vol. 3, No. 2, pp. 222-229, 2012. [12]. S. K. Jayanthi and S. Prema, “ Facilitating Efficient Integrated Semantic Web Search with Visualization and Data Mining Techniques”, Proceedings of International Conference on Information and Communication Technologies, pp. 437 – 442, 2010. [13]. Wei Ning, Phd Thesis :“Textmining and Organization in Large Corpus”, Kongens Lyngby 2005 [14]. G.Thilagavathi, J.Anitha, K.Nethra,“Sentence-Similarity Based Document Clustering Using Fuzzy Algorithm”, International Journal of Advance Foundation and Research in Computer (IJAFRC) Volume 1, Issue 3, March 2014. ISSN 2348 - 4853