2014 IEEE JAVA DATA MINING PROJECT
GLOBALSOFT TECHNOLOGIES
IEEE PROJECTS & SOFTWARE DEVELOPMENTS
IEEE FINAL YEAR PROJECTS|IEEE ENGINEERING PROJECTS|IEEE STUDENTS PROJECTS|IEEE
BULK PROJECTS|BE/BTECH/ME/MTECH/MS/MCA PROJECTS|CSE/IT/ECE/EEE PROJECTS
CELL: +91 98495 39085, +91 99662 35788, +91 98495 57908, +91 97014 40401
Visit: www.finalyearprojects.org Mail to:ieeefinalsemprojects@gmail.com
A Similarity Measure for Text Classification and
Clustering
Abstract:
Measuring the similarity between documents is an important operation in the text
processing field. In this paper, a new similarity measure is proposed. To compute
the similarity between two documents with respect to a feature, the proposed
measure takes the following three cases into account: a) the feature appears in
both documents, b) the feature appears in only one document, and c) the feature
appears in neither document. For the first case, the similarity increases as the
difference between the two involved feature values decreases. Furthermore, the
contribution of the difference is normally scaled. For the second case, a fixed value
is contributed to the similarity. For the last case, the feature has no contribution to
the similarity. The proposed measure is extended to gauge the similarity between
two sets of documents. The effectiveness of our measure is evaluated on several
real-world data sets for text classification and clustering problems. The results
show that the performance obtained by the proposed measure is better than that
achieved by other measures.
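The three cases described in the abstract can be sketched in Java as follows. This is a minimal illustration, not the paper's exact formula: the abstract only says the difference is "normally scaled" and that a "fixed value" is contributed, so the Gaussian-style term, the single shared `sigma`, the `fixedValue` parameter, the averaging step, and all class and method names are assumptions made for this example.

```java
/** Illustrative sketch only; the scaling formula and the averaging step are assumptions. */
class FeatureSimilaritySketch {

    /**
     * Contribution of one feature, given its value in each document.
     * Returns Double.NaN when the feature is absent from both documents,
     * which the caller treats as "no contribution" (case c).
     */
    static double featureContribution(double a, double b, double sigma, double fixedValue) {
        boolean inA = a > 0.0;
        boolean inB = b > 0.0;
        if (inA && inB) {
            // Case a: present in both; the contribution grows as |a - b| shrinks.
            // Assumed Gaussian-style scaling of the difference.
            double z = (a - b) / sigma;
            return 0.5 * (1.0 + Math.exp(-z * z));
        }
        if (inA || inB) {
            // Case b: present in only one document; a fixed value is contributed.
            return fixedValue;
        }
        // Case c: absent from both documents.
        return Double.NaN;
    }

    /** Averages the contributions of all features present in at least one document. */
    static double similarity(double[] d1, double[] d2, double sigma, double fixedValue) {
        if (d1.length != d2.length) {
            throw new IllegalArgumentException("Documents must share one feature space");
        }
        double total = 0.0;
        int counted = 0;
        for (int j = 0; j < d1.length; j++) {
            double c = featureContribution(d1[j], d2[j], sigma, fixedValue);
            if (Double.isNaN(c)) {
                continue; // feature in neither document: skipped entirely
            }
            total += c;
            counted++;
        }
        return counted == 0 ? 0.0 : total / counted;
    }

    public static void main(String[] args) {
        double[] docA = {1.0, 2.0, 0.0};
        double[] docB = {1.0, 0.0, 0.0};
        // sigma = 1.0 and fixedValue = -1.0 are arbitrary choices for the demo.
        System.out.println(similarity(docA, docB, 1.0, -1.0));
    }
}
```

In this sketch a negative `fixedValue` penalizes features that occur in only one document, which is one plausible reading of the second case; the abstract itself does not state the sign or magnitude of that value.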
Existing System:
• Clustering is one of the most interesting and important topics in data mining.
The aim of clustering is to find intrinsic structures in data, and organize
them into meaningful subgroups for further study and analysis.
• The existing system greedily picks the next frequent item set, which represents
the next cluster, so as to minimize the overlap between the documents that
contain both that item set and some of the remaining item sets.
• In other words, the clustering result depends on the order in which the
item sets are picked, which in turn depends on the greedy heuristic. This method does
not follow a sequential order of selecting clusters.
DISADVANTAGES:
• It does not yield the same result on each run, since
the resulting clusters depend on the initial random assignments.
• It minimizes intra-cluster variance, but does not ensure that the result is a
global minimum of variance.
• It has the same problems as k-means: the minimum found is a local minimum,
and the results depend on the initial choice of weights.
• The expectation-maximization algorithm is a more statistically formalized
method that includes some of these ideas, such as partial membership in classes.
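The dependence on initialization noted in the list above can be illustrated with a small Java sketch. It assumes plain Lloyd-style k-means on one-dimensional data, which may not be the exact method the existing system uses; the class name, method names, and the toy data set are hypothetical and chosen only so that two starting points settle into different local minima.

```java
/** Illustrative sketch: 1-D k-means whose result depends on the starting centroids. */
class KMeansInitSketch {

    /** Index of the centroid closest to x (the first one wins on ties). */
    static int nearest(double x, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means from the given centroids and returns the final sum of squared errors. */
    static double run(double[] data, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            // Assignment step: attach each point to its nearest centroid.
            for (double x : data) {
                int c = nearest(x, centroids);
                sum[c] += x;
                count[c]++;
            }
            // Update step: move each non-empty cluster's centroid to its mean.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged, possibly only to a local minimum
            }
        }
        double sse = 0.0;
        for (double x : data) {
            double d = x - centroids[nearest(x, centroids)];
            sse += d * d;
        }
        return sse;
    }

    public static void main(String[] args) {
        double[] data = {0, 1, 10, 11, 20, 21};
        // Same data and same k, but different starting centroids.
        System.out.println(run(data, new double[]{0, 1, 10}, 100));
        System.out.println(run(data, new double[]{0, 10, 20}, 100));
    }
}
```

Both runs converge, but the first start leaves two centroids stuck on the leftmost pair of points and ends with a much larger within-cluster error than the second, which is the local-minimum behaviour the bullets describe.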
Proposed System:
• The main work is to develop a novel hierarchical algorithm for document
clustering that provides maximum efficiency and performance. It proposes a
novel way to evaluate the similarity between documents and, consequently,
formulates new criterion functions for document clustering.
• Assume that the majority. The purpose of this test is to check how much a
similarity measure coincides with the true class labels.
• It focuses on studying and making use of the cluster-overlapping
phenomenon to design cluster-merging criteria.
• Experiments on both public data and document clustering data show that this
approach can improve the efficiency of clustering and save computing time.
System Requirements:
Software Requirements:
• Windows XP/Windows 2000
• Java Runtime Environment 1.5 or higher
• NetBeans
• MySQL Server
Hardware requirements:
• Pentium IV processor, 2.80 GHz or higher
• 512 MB RAM
• 2 GB HDD
• 15” Monitor