Using Tags and Clustering to Identify Topic-specific Blogs


Presentation at the International Conference on Weblogs and Social Media (ICWSM 07), Boulder, Colorado

Published in: Technology

    1. Using Tags and Clustering to Identify Topic-specific Blogs
       Conor Hayes, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
       Paolo Avesani, Bruno Kessler Institute (ITC-IRST), Trento, Italy
    2. Outline
       • Organising blogs: tags vs. content
         • Clustering
         • Tags as a support to clustering
       • A-tags and A-blogs, C-tags and C-blogs
       • A-blogs as topic-relevant sources
       • Experimental analysis
         • Intra-blog similarity and similarity to cluster centroid
         • Similarity to Google pages
         • Reclustering consistency: blogger entropy
    3. Tag clouds
    4. The Long Tail
       • A few tags are used frequently; very many are used infrequently or just once
       • 7209 blogs, 3934 tags
       • Only 563 tags (14%) were used 2 or more times
       • Less than 50% of blogs could be retrieved using tags
       • So the recall of unprocessed tags is quite poor
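
The long-tail figures above can be reproduced on any tagged blog collection with a few lines of counting. The sketch below is illustrative only: the input format (a dict from blog id to its set of tags), the function name `tag_reuse_stats`, and the toy data are assumptions, and "used 2 or more times" is read here as "used by two or more blogs", which the slides do not spell out.

```python
from collections import Counter

def tag_reuse_stats(blog_tags):
    """Summarise tag reuse for a mapping of blog id -> set of tags (hypothetical format)."""
    # Count, for every tag, how many distinct blogs use it.
    counts = Counter(tag for tags in blog_tags.values() for tag in set(tags))
    # Tags used by at least two blogs are the only ones that can link blogs together.
    shared = {tag for tag, n in counts.items() if n >= 2}
    # A blog is "retrievable" here if it carries at least one shared tag.
    retrievable = [b for b, tags in blog_tags.items() if shared & set(tags)]
    return {
        "n_tags": len(counts),
        "n_shared_tags": len(shared),
        "shared_tag_fraction": len(shared) / len(counts) if counts else 0.0,
        "retrievable_blog_fraction": len(retrievable) / len(blog_tags) if blog_tags else 0.0,
    }

# Toy data, invented for illustration; not taken from the study.
toy = {
    "blog_a": {"politics", "iraq"},
    "blog_b": {"politics"},
    "blog_c": {"my-cat-fluffy"},
    "blog_d": {"knitting-diary"},
}
stats = tag_reuse_stats(toy)
print(stats)
```

In the toy collection only one of four tags is shared, so half of the blogs cannot be reached through any tag that another blog also uses.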
    5. Data
       • We collected blog data from Jan 16 to Feb 27, 2006
         • Set of blog URLs provided by BlogPulse
         • Selected blogs that used tags and that were in English
       • Selected blogs at either side of the median posting frequency
         • 7200 blogs in total
       • We created 6 data sets, one for each week
         • Mean of 4250 blogs per week
         • 70% overlap between consecutive weeks
       • Each instance in each data set contains the posts from a single tag, from a single blogger
       • An instance is included in the data set for a week only if the user has posted in that week
    6. Clustering
       • Goals: uncover topics in a document corpus and provide a means of summarisation
       • Spherical k-means:
         • Partitions the corpus into k disjoint groups of documents
         • Produces an interpretable concept summary for each cluster
       • Clustering quality:
         • Blogs in the same cluster should be similar
         • Blogs in different clusters should be dissimilar
       • H_r: ratio of intra- to inter-cluster similarity
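
A minimal sketch of spherical k-means and an H_r-style score, using only the Python standard library. Everything here is an assumption made for illustration: the function names, the random initialisation, the fixed iteration count, and in particular the exact form of the ratio. The slide only says H_r is the ratio of intra- to inter-cluster similarity; this sketch takes "intra" as the mean cosine similarity of a cluster's members to its centroid and "inter" as the mean cosine similarity of all documents to that centroid, which may differ from the authors' definition.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign(docs, centroids):
    """Assign each unit-length document to the centroid with the highest cosine similarity."""
    return [max(range(len(centroids)), key=lambda j: dot(d, centroids[j])) for d in docs]

def spherical_kmeans(docs, k, iters=20, seed=0):
    """Partition term vectors into k disjoint clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    centroids = [list(d) for d in rng.sample(docs, k)]
    for _ in range(iters):
        labels = assign(docs, centroids)
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return assign(docs, centroids), centroids

def h_score(docs, labels, centroids, r):
    """Assumed H_r: mean member-to-centroid similarity over mean all-docs-to-centroid similarity.
    Assumes cluster r is non-empty."""
    docs = [normalize(d) for d in docs]
    members = [d for d, lab in zip(docs, labels) if lab == r]
    intra = sum(dot(d, centroids[r]) for d in members) / len(members)
    inter = sum(dot(d, centroids[r]) for d in docs) / len(docs)
    return intra / inter

# Toy term-frequency vectors, invented for illustration: two obvious topics.
docs = [[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 2]]
labels, centroids = spherical_kmeans(docs, k=2)
scores = [h_score(docs, labels, centroids, r) for r in range(2)]
print(labels, scores)
```

Because every vector is normalised to unit length, the dot product is the cosine similarity, and each centroid's largest components can be read off as the cluster's concept summary.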
    7. Clustering: Tags vs. Content
       • Tags were poor at partitioning the data into similar clusters of documents
    8. Partitioning the tag space
    9. Tag frequency distribution per cluster
       • We find a power-law frequency distribution
       • Frequency distribution varies with cluster strength
       • Example
         • Cluster 94: low H
         • Cluster 41: high H
       • Allowed us to identify weak clusters that were not identified by the standard H_r score (Hayes, Avesani & Veeramachaneni, IJCAI 07)
    10. A-bloggers
       • A-bloggers: only a portion (<0.4) of blogs in each cluster contributes tags to the local tag cloud description
    11. Intra-blog similarity: A- vs. C-blogs
       • A-tag blogs are 'tighter': more similar to each other than C-tag blogs
    12. Similarity to centroid: A- vs. C-blogs
       • A-tag blogs also tend to be more similar to the cluster centroid
    13. A-bloggers are
       • more similar to each other than C-bloggers
       • more similar to the cluster centroid (topic definition) than C-bloggers
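
The two comparisons above (tightness within a group, and closeness to the cluster centroid) can be sketched as two averages of cosine similarity. The helper names, the choice of the plain column mean as the centroid, and the toy vectors are all assumptions; the toy data was constructed so that the A-group is deliberately tight and says nothing about the study's actual results.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term vectors (assumes neither is all zeros)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs in a group (needs >= 2 vectors)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def mean_similarity_to_centroid(vectors, centroid):
    """Average cosine similarity of each vector in a group to the cluster centroid."""
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)

# Hypothetical cluster: three A-blogs and three C-blogs as term-frequency vectors.
a_blogs = [[4, 1, 0], [5, 1, 0], [4, 2, 0]]
c_blogs = [[1, 0, 3], [0, 3, 1], [2, 2, 2]]

# Centroid of the whole cluster, taken here as the column mean of all its blogs.
all_blogs = a_blogs + c_blogs
centroid = [sum(col) / len(all_blogs) for col in zip(*all_blogs)]

a_intra = mean_pairwise_similarity(a_blogs)
c_intra = mean_pairwise_similarity(c_blogs)
a_cent = mean_similarity_to_centroid(a_blogs, centroid)
c_cent = mean_similarity_to_centroid(c_blogs, centroid)
print(a_intra, c_intra, a_cent, c_cent)
```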
    14. Relevance?
       • Van Rijsbergen's cluster hypothesis:
         • Similar documents are likely to be more relevant to an information requirement than less similar documents
       • The information requirement = the cluster summary
       • In application terms, the goal is to present the user with the blogs most relevant to the cluster summary
       • How do we test relevance? The Google Oracle
    15. Verification 2: by Google
    16. Similarity to pages from Google
    17. Consistency over time?
    18. Blogger Entropy
       • We define Blogger Entropy: a measure of the degree of blogger dispersion between windows
       • q: number of clusters at win t+n containing users from cluster r
       • n_r^i: number of users from cluster r contained in cluster i at win t+n
       • n_r: number of users from cluster r available at win t+n
       [Figure labels: win t, win t+n]
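
The formula itself did not survive in this transcript, only its symbol definitions. The sketch below assumes the measure is the ordinary Shannon entropy (base 2, unnormalised) of the proportions n_r^i / n_r over the q clusters that receive users from cluster r; the authors' actual definition may differ, for example by normalising. The function name and the input format are hypothetical.

```python
import math
from collections import Counter

def blogger_entropy(cluster_r_users, later_assignment):
    """Assumed form: -sum_i (n_r^i / n_r) * log2(n_r^i / n_r).

    cluster_r_users: users that were in cluster r at win t.
    later_assignment: dict mapping user -> cluster id at win t+n;
    users missing from it are treated as unavailable at win t+n.
    """
    available = [u for u in cluster_r_users if u in later_assignment]
    n_r = len(available)
    if n_r == 0:
        return 0.0
    # n_r^i for each of the q clusters that received at least one user from r.
    counts = Counter(later_assignment[u] for u in available)
    return -sum((n / n_r) * math.log2(n / n_r) for n in counts.values())

users = ["u1", "u2", "u3", "u4"]
together = {"u1": 7, "u2": 7, "u3": 7, "u4": 7}   # all stay in one cluster
split = {"u1": 7, "u2": 7, "u3": 2, "u4": 2}      # dispersed evenly over two clusters
partial = {"u1": 7, "u2": 7, "u3": 7, "u4": 2}    # one user drifts away

e_together = blogger_entropy(users, together)
e_split = blogger_entropy(users, split)
e_partial = blogger_entropy(users, partial)
print(e_together, e_split, e_partial)
```

Under this reading, an entropy of zero means the users of cluster r are all found in a single cluster at the later window, and larger values mean they have scattered across more clusters.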
    19. Entropy: A-blogs vs. C-blogs
       • A-blog entropy is lower
       • As the interval increases, A-blogs experience smaller increases in entropy
       • Suggests that A-bloggers tend to write consistently about the same things
    20. Example of A-blogs and C-blogs
       • Cluster 28 in Win5; k = 50
       • Cluster description: mobile, internet, weblog, web, patent
       [Table columns: A-blogs, C-blogs]
    21. Conclusions
       • Our experiments suggest that A-bloggers tend to be the most relevant blogs within a cluster
         • Tend to form tight subgroups close to the cluster topic definition
         • Consistently more similar to pages ranked by Google using the cluster topic definition
         • Tend to stay together across different clusterings over time; in other words, they tend to write regularly about the same topic
       • What characteristics does the A-blogger have?
         • A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
         • Writes regularly and in depth about fairly narrowly defined subjects
         • New professional bloggers?
    22. References
       • Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence
       • Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles
    23. Appendix
       • 13,518 bloggers: January 16 to February 27, 2006
       • Constraints: written in English and uses tags
       • Posting frequency follows a power law:
         • 88% of bloggers posted between 1 and 50 times
         • High-frequency 'blogs' are generally spam/splog
         • Data from blogs with posts in the range 6-48: 7549 bloggers (56%)
         • On average between 1 and 8 posts per week