3.a similarity measure for text classification and

1. A Similarity Measure for Text Classification and Clustering

Abstract: Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. For the first case, the similarity increases as the difference between the two involved feature values decreases. Furthermore, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.

Existing System:
• Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis.
• Existing systems greedily pick the next frequent item set, which represents the next cluster, so as to minimize the overlap between the documents that contain both that item set and some of the remaining item sets.
• In other words, the clustering result depends on the order in which the item sets are picked, which in turn depends on the greedy heuristic. This method does not follow a sequential order of selecting clusters.
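The three cases listed in the abstract can be sketched in Java. The slides do not give the formula, so everything concrete below is an assumption made for illustration: the Gaussian-style scaling `0.5 * (1 + exp(-((a - b) / sigma)^2))` for case (a), the fixed penalty `-lambda` for case (b), the final normalization to the range [0, 1], and the names `SimilaritySketch`, `similarity`, `sigma`, and `lambda`. A feature is treated as present when its value is greater than zero.

```java
class SimilaritySketch {

    /**
     * Hypothetical per-feature similarity between two documents.
     * d1, d2 : feature values (e.g. term weights); a value > 0 means "present".
     * sigma  : assumed per-feature scale, must be > 0.
     * lambda : assumed fixed penalty for a feature present in only one document.
     */
    static double similarity(double[] d1, double[] d2, double[] sigma, double lambda) {
        if (d1.length != d2.length || d1.length != sigma.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double numerator = 0.0;
        double denominator = 0.0;
        for (int j = 0; j < d1.length; j++) {
            boolean in1 = d1[j] > 0.0;
            boolean in2 = d2[j] > 0.0;
            if (in1 && in2) {
                // Case (a): contribution grows as the difference shrinks.
                double z = (d1[j] - d2[j]) / sigma[j];
                numerator += 0.5 * (1.0 + Math.exp(-z * z));
                denominator += 1.0;
            } else if (in1 || in2) {
                // Case (b): a fixed value is contributed.
                numerator -= lambda;
                denominator += 1.0;
            }
            // Case (c): feature absent from both documents, no contribution.
        }
        if (denominator == 0.0) {
            return 0.0; // assumed convention when both documents are empty
        }
        double f = numerator / denominator;
        // Shift and rescale so the result lies in [0, 1].
        return (f + lambda) / (1.0 + lambda);
    }

    public static void main(String[] args) {
        double[] sigma = {1.0, 1.0, 1.0};
        double[] a = {1.0, 0.0, 2.0};
        double[] b = {0.0, 3.0, 0.0};
        System.out.println(similarity(a, a, sigma, 1.0)); // identical documents
        System.out.println(similarity(a, b, sigma, 1.0)); // no shared features
    }
}
```

Under these assumptions, two identical documents score 1.0 and two documents with no shared features score 0.0, which matches the behavior the abstract describes for cases (a) and (b).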

2. DISADVANTAGES:
• The method does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
• It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.
• It has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights.
• The expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes.

Proposed System:
• The main work is to develop a novel hierarchical algorithm for document clustering that provides maximum efficiency and performance. We propose a novel way to evaluate similarity between documents and, consequently, formulate new criterion functions for document clustering.
• Assume that the majority. The purpose of this test is to check how closely a similarity measure coincides with the true class labels.
• It is particularly focused on studying and making use of the cluster-overlapping phenomenon to design cluster-merging criteria.
• Experiments on both public data and document clustering data show that this approach can improve the efficiency of clustering and save computing time.
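The sensitivity to initialization listed under Disadvantages can be illustrated with a small sketch of standard k-means (Lloyd's algorithm) on one-dimensional data. This is not the deck's implementation: the class name `KMeansInitSketch`, the method `run`, and the toy data are all hypothetical, chosen so that two different starting centroids settle on different solutions.

```java
class KMeansInitSketch {

    /**
     * Runs Lloyd's algorithm on 1-D points from the given initial centroids
     * and returns the final sum of squared errors (intra-cluster variance).
     */
    static double run(double[] points, double[] initial, int maxIter) {
        double[] centroids = initial.clone();
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int k = 1; k < centroids.length; k++) {
                    if (Math.abs(points[i] - centroids[k]) < Math.abs(points[i] - centroids[best])) {
                        best = k;
                    }
                }
                assign[i] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < points.length; i++) {
                sum[assign[i]] += points[i];
                count[assign[i]]++;
            }
            boolean moved = false;
            for (int k = 0; k < centroids.length; k++) {
                if (count[k] > 0) {
                    double mean = sum[k] / count[k];
                    if (mean != centroids[k]) {
                        moved = true;
                    }
                    centroids[k] = mean;
                }
            }
            if (!moved) {
                break; // converged: centroids no longer change
            }
        }
        double sse = 0.0;
        for (int i = 0; i < points.length; i++) {
            double d = points[i] - centroids[assign[i]];
            sse += d * d;
        }
        return sse;
    }

    public static void main(String[] args) {
        double[] points = {0, 1, 10, 11, 20, 21};
        // Well-spread start: one centroid per natural group.
        System.out.println(run(points, new double[] {0, 10, 20}, 100)); // 1.5
        // Poor start: two centroids in the same group.
        System.out.println(run(points, new double[] {0, 1, 10}, 100));  // 101.0
    }
}
```

Both runs converge, but the second stops at a local minimum with a much larger intra-cluster variance, which is why results can differ from run to run when the initial assignment is random.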

3. System Requirements:

Software Requirements:
• Windows XP/Windows 2000
• Java Runtime Environment, version 1.5 or higher
• NetBeans
• MySQL Server

Hardware Requirements:
• Pentium IV processor, 2.80 GHz or higher
• 512 MB RAM
• 2 GB HDD
• 15" monitor
