A Similarity Measure for Text Classification
and Clustering
Abstract:
Measuring the similarity between documents is an important operation in the text
processing field. In this paper, a new similarity measure is proposed. To compute
the similarity between two documents with respect to a feature, the proposed
measure takes the following three cases into account: a) The feature appears in
both documents, b) the feature appears in only one document, and c) the feature
appears in none of the documents. For the first case, the similarity increases as the
difference between the two involved feature values decreases. Furthermore, the
contribution of the difference is normally scaled. For the second case, a fixed value
is contributed to the similarity. For the last case, the feature has no contribution to
the similarity. The proposed measure is extended to gauge the similarity between
two sets of documents. The effectiveness of our measure is evaluated on several
real-world data sets for text classification and clustering problems. The results
show that the performance obtained by the proposed measure is better than that
achieved by other measures.
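
The three cases above can be sketched in Java as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formula, so the Gaussian form of the "normally scaled" term, the per-feature spread sigma, the fixed penalty lambda for the one-sided case, and the final rescaling to [0, 1] are all assumptions. The class and method names are hypothetical, and feature values are assumed to be non-negative weights such as term frequencies.

```java
/** Illustrative sketch only; formula details are assumptions, not taken from the abstract. */
final class FeatureSimilaritySketch {

    /** Contribution of one feature, given its values x and y in the two documents. */
    static double contribution(double x, double y, double sigma, double lambda) {
        if (x > 0 && y > 0) {
            // Case a: feature in both documents. The value grows as |x - y| shrinks.
            // The Gaussian scaling by sigma is an assumed reading of "normally scaled".
            double z = (x - y) / sigma;
            return 0.5 * (1.0 + Math.exp(-z * z));
        }
        if (x == 0 && y == 0) {
            // Case c: feature in neither document, so it contributes nothing.
            return 0.0;
        }
        // Case b: feature in exactly one document. A fixed value is contributed;
        // using the penalty -lambda is an assumption.
        return -lambda;
    }

    /** Similarity of two documents given as equal-length feature vectors. */
    static double similarity(double[] d1, double[] d2, double[] sigma, double lambda) {
        double sum = 0.0;
        int counted = 0; // number of features present in at least one document
        for (int j = 0; j < d1.length; j++) {
            if (d1[j] == 0 && d2[j] == 0) {
                continue; // case c is skipped entirely
            }
            sum += contribution(d1[j], d2[j], sigma[j], lambda);
            counted++;
        }
        if (counted == 0) {
            return 0.0; // assumption: two empty documents score 0
        }
        double f = sum / counted;             // lies in [-lambda, 1]
        return (f + lambda) / (1.0 + lambda); // rescaled to [0, 1]
    }
}
```

With lambda = 1 and sigma = 1 for every feature, two identical documents score 1.0 and two documents that share no features score 0.0 under this sketch.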
Existing System:
• Clustering is one of the most interesting and important topics in data mining.
The aim of clustering is to find intrinsic structures in data, and organize
them into meaningful subgroups for further study and analysis.
• Existing systems greedily pick the next frequent item set, which represents
the next cluster, to minimize the overlap between the documents that
contain both that item set and some of the remaining item sets.
• In other words, the clustering result depends on the order in which the
item sets are picked, which in turn depends on the greedy heuristic. This method does
not follow a sequential order of selecting clusters.
DISADVANTAGES:
• Its disadvantage is that it does not yield the same result with each run, since
the resulting clusters depend on the initial random assignments.
• It minimizes intra-cluster variance, but does not ensure that the result has a
global minimum of variance.
• It also has the same problems as k-means: the minimum is a local minimum,
and the results depend on the initial choice of weights.
• The expectation-maximization algorithm is a more statistically formalized
method which includes some of these ideas: partial membership in classes.
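
The dependence on the starting point noted above can be reproduced with a small Java sketch. It is a toy illustration under stated assumptions: the six one-dimensional points are made up, the initial centroids are fixed by hand rather than drawn at random so that the run is repeatable, and the class name is hypothetical. It runs the standard k-means (Lloyd) iteration and returns the final within-cluster sum of squared errors.

```java
import java.util.Arrays;

/** Toy 1-D k-means, used only to show sensitivity to the initial centroids. */
final class KMeansInitSketch {

    /** Runs k-means from the given initial centroids; returns the final sum of squared errors. */
    static double run(double[] points, double[] initialCentroids) {
        double[] centroids = initialCentroids.clone();
        int[] label = new int[points.length];
        Arrays.fill(label, -1);

        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int k = 1; k < centroids.length; k++) {
                    if (Math.abs(points[i] - centroids[k]) < Math.abs(points[i] - centroids[best])) {
                        best = k;
                    }
                }
                if (label[i] != best) {
                    label[i] = best;
                    changed = true;
                }
            }
            // Update step: move each centroid to the mean of its points.
            for (int k = 0; k < centroids.length; k++) {
                double sum = 0.0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (label[i] == k) {
                        sum += points[i];
                        n++;
                    }
                }
                if (n > 0) {
                    centroids[k] = sum / n;
                }
            }
        }

        double sse = 0.0;
        for (int i = 0; i < points.length; i++) {
            double d = points[i] - centroids[label[i]];
            sse += d * d;
        }
        return sse;
    }
}
```

For the points {0, 2, 10, 12, 25, 27} and two clusters, starting from centroids {0, 27} converges to a sum of squared errors of 106, while starting from {0, 10} converges to 231. Both are stable solutions, so the second run stops at a local minimum that is not the global one.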
Proposed System:
• The main work is to develop a novel hierarchical algorithm for document
clustering which provides maximum efficiency and performance. It also proposes a
novel way to evaluate similarity between documents, and consequently
formulates new criterion functions for document clustering.
• Assume that the majority. The purpose of this test is to check how much a
similarity measure coincides with the true class labels.
• It is particularly focused on studying and making use of the cluster overlapping
phenomenon to design cluster merging criteria.
• Experiments on both public data and document clustering data show that this
approach can improve the efficiency of clustering and save computing time.
System Requirements:
Software Requirements:
• Windows XP/Windows 2000
• Java Runtime Environment, version 1.5 or higher
• NetBeans
• MySQL Server
Hardware requirements:
• Pentium IV processor, 2.80 GHz or higher
• 512 MB RAM
• 2 GB HDD
• 15” Monitor
