An Empirical Comparison of Fast and Efficient Tools for Mining Textual Data

To manage and retrieve effectively the information contained in vast amounts of text documents, powerful text mining tools and techniques are essential. In this paper we evaluate and compare two state-of-the-art data mining tools for clustering high-dimensional text data, Cluto and Gmeans. Several experiments were conducted on three benchmark datasets, and the results are analysed in terms of clustering quality, memory consumption, and CPU time. We empirically show that Gmeans offers high scalability by sacrificing clustering quality, while Cluto delivers better clustering quality at the expense of memory and CPU time.

Published in: Technology

  1. An Empirical Comparison of Fast and Efficient Tools for Mining Textual Data
     Volkan TUNALI (Marmara University), A. Yılmaz ÇAMURCU (Marmara University), T. Tugay BİLGİN (Maltepe University)
  2. Introduction
     • Text Mining
     • Document Clustering
     • K-Means Algorithm and Its Variants
     • Powerful Clustering Tools
     • Experimental Results
     • Conclusions
  3. Text Mining
     • Significant increase in the amount of information produced
     • Particularly in the form of text documents (news articles, research papers, books, digital libraries, e-mail messages, Web pages, etc.)
     • Effective and efficient text mining techniques are needed to handle the “Information Explosion”
  4. Document Clustering
     • An important Text Mining method
       • provides effective navigation, summarization, and organization of text documents
     • Unsupervised and automatic grouping of a document collection into clusters
     • Documents belonging to the same cluster are as similar to each other as possible, whereas documents from two different clusters are dissimilar
  5. K-means
     • A commonly used partitioning clustering algorithm
     • Partitions n documents into k clusters
     • Iterates until a global criterion function is optimized
     • Converges to a local optimum!
     • Sensitive to initial cluster assignment!
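The loop described on this slide can be sketched as follows. This is a minimal illustration of textbook k-means on small dense vectors, not the implementation used by Cluto or Gmeans. The function names, the squared Euclidean distance, and the random choice of initial centroids are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, k, seed=0, max_iter=100):
    """Partition `points` into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct input points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed: a (possibly local) optimum
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids
```

Changing `seed` changes the initial centroids, which is where the sensitivity noted on the slide comes from: different starting points can end in different local optima.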
  6. K-means Variants
     • Many variants of K-means
       • Usually address weaknesses of K-means like
         • Convergence to local optimum
         • Sensitivity to initial cluster assignment
       • Some other modifications
     • We are interested in
       • Spherical K-means
       • Bisecting K-means
  7. Spherical K-means
     • Weighting schemes like TFIDF
       • long documents are favored over short ones because they contain more terms
       • a normalization is needed!
     • Normalization implies that
       • each document vector has length 1
       • document vectors lie on the surface of the unit sphere in R^m
       • cosine similarity is simply the dot product of two vectors
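A small sketch may help show why normalization removes the document-length bias and why cosine similarity reduces to a dot product afterwards. The helper names below are illustrative assumptions, and the vectors stand in for term-weight vectors of documents.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# A short document and a document ten times longer with the same term mix.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]
other_doc = [0.0, 1.0, 3.0]

# After normalization, the dot product alone gives the cosine similarity.
sim_same = dot(l2_normalize(short_doc), l2_normalize(long_doc))
sim_other = dot(l2_normalize(short_doc), l2_normalize(other_doc))
```

Here `sim_same` is 1 up to rounding error, because the two documents differ only in length, and `sim_other` agrees with `cosine(short_doc, other_doc)`.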
  8. Bisecting K-means
     • Starts with the whole dataset as a single cluster
     • Splits one cluster into two subclusters at each iteration
       • using classical k-means
       • k = 2, of course
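The splitting procedure on this slide can be sketched as below. The slide does not say which cluster is chosen for splitting or when to stop, so this example assumes the largest cluster is split each time and that splitting stops once a requested number of clusters `k` is reached. Cluto's actual selection criteria may differ, and all names here are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(points, rng, max_iter=50):
    """Split `points` into two groups with plain k-means (k = 2)."""
    centers = [list(p) for p in rng.sample(points, 2)]
    assign = None
    for _ in range(max_iter):
        new_assign = [
            0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            for p in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = centroid(members)
    left = [p for p, a in zip(points, assign) if a == 0]
    right = [p for p, a in zip(points, assign) if a == 1]
    return left, right


def bisecting_kmeans(points, k, seed=0):
    """Return a list of clusters (each a list of points)."""
    rng = random.Random(seed)
    clusters = [list(points)]  # start with everything in one cluster
    while len(clusters) < k:
        # Assumed selection rule: always split the largest cluster.
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break  # nothing left to split
        left, right = two_means(biggest, rng)
        if not left or not right:
            clusters.append(biggest)
            break  # degenerate split (e.g. identical points)
        clusters.extend([left, right])
    return clusters
```

Each bisection is itself a k-means run, so it can still settle in a local optimum; the final grouping therefore depends on `seed`.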
  9. Powerful Text Clustering Tools
     • Cluto
       • Bisecting K-means
     • Gmeans
       • Spherical K-means
  10. Cluto
     • Written in ANSI C by George Karypis
     • Contains partitional, agglomerative, and graph-partitioning based clustering algorithms
     • Offers several distance (similarity) functions like cosine, Euclidean, correlation coefficient, extended Jaccard
     • Bisecting k-means with cosine is the default option
  11. Gmeans
     • Written in C++ by Yuqiang Guan
     • Offers several distance (similarity) functions like cosine, Euclidean, diametric distance, and Kullback-Leibler divergence
     • Cosine similarity is the default option with the Spherical K-means algorithm
  12. Datasets Used in Experiments
     • Classic3
       • 3891 documents
       • 4467 dimensions
     • 20 News Groups (20NG)
       • 18821 documents
       • 70241 dimensions
     • Synthetic Waveform
       • 5000 data points
       • 21 dimensions
  13. Performance Evaluation Metrics
     • Memory consumption
     • CPU Time consumption
     • External Clustering Validity Indices
       • Purity
       • Entropy
       • F-Measure
       • Normalized Mutual Information (NMI)
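The slides name the external validity indices without giving formulas. As a hedged illustration, purity and entropy are often computed from the cluster-by-class contingency counts as shown below. Normalizing entropy by the logarithm of the number of classes is one common convention and an assumption here; the exact definitions used in the experiments are not stated in this transcript.

```python
import math
from collections import Counter, defaultdict


def contingency(labels_true, labels_pred):
    """Map each cluster id to a Counter of the true classes it contains."""
    table = defaultdict(Counter)
    for cls, clu in zip(labels_true, labels_pred):
        table[clu][cls] += 1
    return table


def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    table = contingency(labels_true, labels_pred)
    return sum(max(c.values()) for c in table.values()) / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted class entropy of the clusters, scaled to [0, 1]."""
    table = contingency(labels_true, labels_pred)
    n = len(labels_true)
    num_classes = len(set(labels_true))
    if num_classes < 2:
        return 0.0
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((c / size) * math.log(c / size) for c in counts.values())
        # Assumed normalization: divide by log(number of classes).
        total += (size / n) * h / math.log(num_classes)
    return total
```

Higher purity and lower entropy both indicate clusters that line up more closely with the known classes.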
  14. Memory Consumption
  15. CPU Time Consumption
  16. Purity
  17. Entropy
  18. F-Measure
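The transcript keeps only the title of this slide. For readers unfamiliar with the index, one widely used clustering F-measure matches each class with its best cluster and weights the result by class size. The sketch below follows that definition as an assumption; it is not confirmed to be the exact variant used in the experiments.

```python
from collections import Counter


def f_measure(labels_true, labels_pred):
    """Class-size-weighted best F1 score of each class over all clusters."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j  # share of the cluster that is this class
            recall = n_ij / n_i     # share of the class found in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total
```

A value of 1 means every class is recovered exactly by some cluster; lumping all documents into one cluster is penalized through low precision.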
  19. Normalized Mutual Information
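Only the slide title survives here as well. As a rough guide, NMI is commonly computed as the mutual information between class labels and cluster labels divided by the geometric mean of their entropies. That normalization is an assumption in the sketch below, since other normalizations exist and the transcript does not say which one was used.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information normalized by sqrt(H(classes) * H(clusters))."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = 0.0
    for (cls, clu), n_ij in joint.items():
        mi += (n_ij / n) * math.log(
            n * n_ij / (class_sizes[cls] * cluster_sizes[clu])
        )
    h_true = label_entropy(labels_true)
    h_pred = label_entropy(labels_pred)
    if h_true == 0.0 or h_pred == 0.0:
        return 0.0  # a single class or a single cluster carries no information
    return mi / math.sqrt(h_true * h_pred)
```

With this normalization the score is 1 when clusters and classes coincide and 0 when they are statistically independent.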
  20. Comparison: Cluto vs. Gmeans
     • Cluto with Bisecting K-means
       • presents better clustering quality
       • poor in terms of memory and CPU time consumption
     • Gmeans with Spherical K-means
       • highly efficient in terms of memory and CPU time consumption
       • presents good clustering quality
  21. Conclusions
     • When to use Cluto?
       • have textual data
       • need high clustering quality
       • have powerful hardware with large main memory
     • When to use Gmeans?
       • have very large textual data
       • have modest hardware configuration
       • need good clustering in reasonable time
  22. Thank you!
     • Any questions?
