Content-based Clustering for Tag Cloud Visualization

My presentation at ASONAM 2009 on July 21st, 2009

Published in: Technology, Education

  1. Content-based Clustering for Tag Cloud Visualization. ASONAM 2009. Arkaitz Zubiaga, Alberto P. García-Plaza, Víctor Fresno, Raquel Martínez. NLP & IR Group @ UNED. July 21st, 2009
  2. Introduction: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work. Slide footer: NLP Group (UNED), Content-based Tag Clustering, July 21st, 2009.
  3. Introduction: Simple Tagging
  4. Introduction: Collaborative Tagging
  5. Introduction: Tag Cloud. No organization. No relations between tags.
  6. Introduction: Our Work. Find relations between tags to organize them: to ease visualization and search, and to ease subscribing to a group of related tags. Previous works rely on tag co-occurrence to find relations. What about considering web documents’ content?
  7. Dataset Generation: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work
  8. Dataset Generation. Starting point: the 140 most popular tags on Delicious (T140, tag cloud). Tag monitoring: ~6,000 documents/tag (~840,000 docs, HTML and PDF). Data retrieval: tag data for each document; document content. Filtering: English-written documents with tag data available. Result: 144,574 documents (unbalanced).
  9. Our Method: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work
  10. Our Method: Representation. Most relevant tags for each document: those reaching at least 40.7% of the top tag. Merge the documents pertaining to each T140 tag. Stopword removal. Stemming. TF-IDF representation (term reduction by document frequency, DF). 1 vector/tag.
  11. Our Method: Clustering (SOM)
  12. Our Method: Clustering Settings. 12x12 map: 144 neurons. Vectors with 17,518 dimensions. Learning rate: 0.1. Neighborhood: 12. Iterations: 50,000.
  13. Our Method: Terminology Extraction. Merge all the documents in each neuron. Terminology extraction for each neuron: terms that are representative of that neuron but not of the rest. Language models (KLD, Kullback-Leibler Divergence). Result: representative terms for each neuron.
  14. Results: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work
  15. Results. Full map available at: http://nlp.uned.es/social-tagging/
  16. Results: Computer Science
  17. Results: Design
  18. Results: Cooking
  19. Results: Coherence
  20. Results: Terminology
  21. Conclusions: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work
  22. Conclusions. We analyzed tag clustering and terminology extraction relying on documents’ content. We collected the DeliciousT140 dataset. Unlike previous works, we considered documents’ content. The resulting map shows encouraging results, exhibiting the potential of collaborative tagging systems. It could allow community discovery. It eases tag cloud visualization, as well as improving navigation and subscribing.
  23. Future Work: Index. 1 Introduction, 2 Dataset Generation, 3 Our Method, 4 Results, 5 Conclusions, 6 Future Work
  24. Future Work. To compare our content-based approach to those based on tag co-occurrence. To make a quantitative evaluation. To semantically analyze tags (polysemy, synonymy, ...). To extend the work to multilingual tag sets.
  25. Future Work: Thank You for Your Attention. Achiu, Arigato, Danke, Dhannvaad, Dua Netjer en ek, Efcharisto, Gracias, Gràcies, Gratia, Grazie, Guishepeli, Hvala, Kiitos, Köszönöm, Mercé, Merci, Mila esker, Obrigado, Shukran, Shukriya, Tack, Tak, Takk, Tänan, Tapadh leat, Tesekkür ederim, Thank you, Toda
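The Representation slide (slide 10) lists the steps that turn tagged documents into one vector per tag. The sketch below is only an illustration of that idea, not the authors' code. It assumes that "40.7% of the top tag" means a tag's annotation count relative to the document's most frequent tag, uses a toy stopword list, skips stemming, and models the DF reduction as a hypothetical `min_df` cut-off. The documents, tag counts and function names are all made up.

```python
import math
from collections import Counter, defaultdict

# Toy stopword list (assumption; the slides do not say which list was used).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "with"}

def relevant_tags(tag_counts, ratio=0.407):
    """Keep tags whose count is at least `ratio` times the top tag's count."""
    top = max(tag_counts.values())
    return {tag for tag, count in tag_counts.items() if count >= ratio * top}

def tag_vectors(docs, min_df=1):
    """docs: list of (text, {tag: count}). Returns {tag: {term: tf-idf weight}}."""
    merged = defaultdict(Counter)  # one merged bag of words per tag
    for text, tag_counts in docs:
        terms = [w for w in text.lower().split() if w not in STOPWORDS]
        for tag in relevant_tags(tag_counts):
            merged[tag].update(terms)
    n_tags = len(merged)
    # Document frequency here counts in how many merged tag-documents a term occurs.
    df = Counter(term for bag in merged.values() for term in bag)
    return {
        tag: {
            term: tf * math.log(n_tags / df[term])
            for term, tf in bag.items()
            if df[term] >= min_df  # hypothetical DF-based term reduction
        }
        for tag, bag in merged.items()
    }

# Made-up example data.
docs = [
    ("the python code for a web parser", {"python": 100, "programming": 50, "web": 10}),
    ("baking bread with flour and yeast", {"recipes": 80, "cooking": 40}),
    ("python code and recipes for cooking data", {"python": 30, "cooking": 5}),
]
vectors = tag_vectors(docs)
print(sorted(vectors))
```

In this toy run the tag "web" is dropped from the first document because 10 is below 40.7% of 100, so no vector is built for it.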

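The Clustering slides (slides 11 and 12) name a self-organizing map (SOM) and its settings: a 12x12 grid, learning rate 0.1, neighborhood 12 and 50,000 iterations. The following is a minimal pure-Python SOM sketch under stated assumptions: the slides do not say how the learning rate and neighborhood decay or which neighborhood function was used, so the linear decay and Gaussian neighborhood here are illustrative choices, and the function names are hypothetical. The defaults mirror the slide settings, while the demo uses a much smaller map and toy 2-D data so it runs quickly.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def train_som(data, rows=12, cols=12, lr0=0.1, radius0=12.0,
              iterations=50_000, seed=0):
    """Train a rectangular SOM and return weights[row][col] as lists of floats."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for t in range(iterations):
        frac = t / iterations
        lr = lr0 * (1.0 - frac)                    # linear decay (assumption)
        radius = max(radius0 * (1.0 - frac), 0.5)  # shrinking neighborhood (assumption)
        x = rng.choice(data)
        br, bc = best_matching_unit(weights, x)
        for r in range(rows):
            for c in range(cols):
                d2 = (r - br) ** 2 + (c - bc) ** 2
                if d2 > radius ** 2:
                    continue  # neuron lies outside the current neighborhood
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                w = weights[r][c]
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

# Toy data: two well-separated groups standing in for tag vectors.
data = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [0.95, 1.0]]
weights = train_som(data, rows=4, cols=4, radius0=4.0, iterations=2000)
for x in data:
    print(x, "->", best_matching_unit(weights, x))
```

With real tag vectors of 17,518 dimensions, a vectorized implementation (for example with NumPy) would be far more practical than these nested loops.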
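The Terminology Extraction slide (slide 13) describes picking terms that are representative of one neuron but not of the rest, using language models and Kullback-Leibler divergence. One common way to do this, sketched here as an assumption and not necessarily the authors' exact formulation, is to rank each term by its pointwise contribution p(t) * log(p(t) / q(t)), where p is the neuron's unigram model and q the model of all other neurons. The add-alpha smoothing, the function name `kld_terms` and the token lists are all illustrative.

```python
import math
from collections import Counter

def kld_terms(neuron_tokens, rest_tokens, top_k=5, alpha=1.0):
    """Rank the neuron's terms by their contribution to KL(neuron || rest)."""
    fg, bg = Counter(neuron_tokens), Counter(rest_tokens)
    vocab_size = len(set(fg) | set(bg))
    # Add-alpha smoothing so terms unseen in one model keep a non-zero probability.
    fg_total = sum(fg.values()) + alpha * vocab_size
    bg_total = sum(bg.values()) + alpha * vocab_size
    scores = {}
    for term in fg:
        p = (fg[term] + alpha) / fg_total
        q = (bg[term] + alpha) / bg_total
        scores[term] = p * math.log(p / q)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up token lists: one neuron's merged documents versus all other neurons.
neuron_tokens = "recipe bake oven recipe flour web bake recipe".split()
rest_tokens = "web code python web design code css web python".split()
print(kld_terms(neuron_tokens, rest_tokens, top_k=2))
```

In this toy example "recipe" and "bake" score highest, while "web" scores lowest because it is more frequent in the rest of the map than in the neuron.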