Using Tags and Clustering to Identify Topic-specific Blogs. Conor Hayes, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland. Paolo Avesani, Bruno Kessler Institute (ITC-IRST), Trento, Italy.
Outline: Organising blogs: tags vs content; clustering; tags as a support to clustering; A-tags and A-blogs, C-tags and C-blogs; A-blogs as topic-relevant sources. Experimental analysis: intra-blog similarity and similarity to cluster centroid; similarity to Google pages; reclustering consistency (blogger entropy).
Tag clouds
The Long Tail: A few tags are used frequently; very many tags are used infrequently or just once. Across 7209 blogs and 3934 tags, only 563 tags (14%) were used 2 or more times; less than 50% of blogs could be retrieved using tags. The recall of unprocessed tags is therefore quite poor.
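The long-tail figures above can be reproduced in spirit with a few lines of Python. This is a minimal sketch on toy data, not the authors' procedure: the input mapping `blog_tags` and the function name `tag_usage_stats` are hypothetical, "used 2 or more times" is assumed to mean "used by two or more blogs", and a blog is assumed to be retrievable when it carries at least one such shared tag.

```python
from collections import Counter


def tag_usage_stats(blog_tags):
    """Summarise tag reuse. blog_tags maps a blog id to its list of tags (hypothetical schema)."""
    # Count the number of distinct blogs that use each tag.
    counts = Counter(tag for tags in blog_tags.values() for tag in set(tags))
    # Assumption: a tag is "shared" when two or more blogs use it.
    shared = {tag for tag, n in counts.items() if n >= 2}
    # Assumption: a blog is retrievable by tag if it carries at least one shared tag.
    retrievable = [blog for blog, tags in blog_tags.items() if shared & set(tags)]
    return {
        "num_tags": len(counts),
        "shared_fraction": len(shared) / len(counts) if counts else 0.0,
        "blog_recall": len(retrievable) / len(blog_tags) if blog_tags else 0.0,
    }


toy = {
    "blog_a": ["python", "web"],
    "blog_b": ["python"],
    "blog_c": ["my-cat"],
    "blog_d": ["rants"],
}
stats = tag_usage_stats(toy)
```

On the toy data only one of four tags is shared, so half of the blogs cannot be reached through tags at all, which is the effect the slide describes at a larger scale.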
Data: We collected blog data from Jan 16 to Feb 27, 2006, from a set of blog URLs provided by BlogPulse. We selected blogs that used tags and were written in English, and blogs on either side of the median posting frequency: 7200 blogs in total. We created 6 data sets, one for each week, with a mean of 4250 blogs per week and 70% overlap between consecutive weeks. Each instance in each data set contains the posts from a single tag, from a single blogger. An instance is included in the data set for a week only if the user has posted in that week.
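The weekly windowing described above might be sketched as follows. The post tuple layout, the function name `build_weekly_datasets`, and the choice to treat Feb 27 as an exclusive end date are assumptions for illustration; the sketch also assumes that each weekly instance holds only the posts written in that week, which the slide does not state explicitly.

```python
from collections import defaultdict
from datetime import date

START = date(2006, 1, 16)  # first day of the collection period
NUM_WEEKS = 6              # one data set per week


def build_weekly_datasets(posts):
    """Group posts into weekly data sets keyed by (blogger, tag).

    posts: iterable of (blogger, tag, post_date, text) tuples (hypothetical schema).
    """
    weeks = [defaultdict(list) for _ in range(NUM_WEEKS)]
    for blogger, tag, day, text in posts:
        index = (day - START).days // 7
        # Posts outside the six-week period are ignored.
        if 0 <= index < NUM_WEEKS:
            # An instance exists for a week only if the blogger posted in that week.
            weeks[index][(blogger, tag)].append(text)
    return [dict(week) for week in weeks]


posts = [
    ("ann", "python", date(2006, 1, 17), "p1"),
    ("ann", "python", date(2006, 1, 20), "p2"),
    ("bob", "web", date(2006, 1, 24), "p3"),
    ("bob", "web", date(2006, 3, 1), "outside the period"),
]
weekly = build_weekly_datasets(posts)
```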
Clustering. Goals: uncover topics in a document corpus and provide a means of summarisation. Spherical k-means partitions the corpus into k disjoint groups of documents and produces an interpretable concept summary for each cluster. Clustering quality: blogs in the same cluster should be similar; blogs in different clusters should be dissimilar. H_r: ratio of intra- to inter-cluster similarity.
Clustering: Tags vs. Content. Tags were poor at partitioning the data into clusters of similar documents.
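One way to compute a ratio of intra- to inter-cluster similarity for a single cluster is sketched below. The slide does not give the exact definition of H_r, so this is an assumption: intra-cluster similarity is taken as the mean cosine similarity of the cluster's documents to their own centroid, and inter-cluster similarity as their mean cosine similarity to the centroid of the whole corpus. The function names and the toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length, as spherical k-means assumes."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))


def centroid(vectors):
    """Mean of the unit-length versions of the vectors."""
    units = [normalize(v) for v in vectors]
    return [sum(column) / len(units) for column in zip(*units)]


def h_ratio(cluster, corpus):
    """Assumed form of H_r: similarity to own centroid over similarity to corpus centroid."""
    own = centroid(cluster)
    overall = centroid(corpus)
    intra = sum(cosine(v, own) for v in cluster) / len(cluster)
    inter = sum(cosine(v, overall) for v in cluster) / len(cluster)
    return intra / inter


cluster_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
cluster_b = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
corpus = cluster_a + cluster_b
score_a = h_ratio(cluster_a, corpus)
score_all = h_ratio(corpus, corpus)
```

Under this definition a well-separated cluster scores above 1, while a "cluster" that is simply the whole corpus scores exactly 1.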
Partitioning the tag space
Tag frequency distribution per cluster: We find a power law frequency distribution. The frequency distribution varies with cluster strength. Example: Cluster 94 (low H), Cluster 41 (high H). This allowed us to identify weak clusters that were not identified by the standard H_r score (Hayes, Avesani & Veeramachaneni, IJCAI-07).
A-bloggers: only a portion (<0.4) of the blogs in each cluster contribute tags to the local tag cloud description.
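The split of a cluster into A-blogs and C-blogs can be illustrated as follows. The slides do not spell out the rule, so this sketch assumes that a tag enters the local tag cloud (an A-tag) when at least two blogs in the cluster use it, and that every other tag is a C-tag. The function name `split_cluster`, the `min_blogs` parameter, and the toy cluster are hypothetical.

```python
from collections import Counter


def split_cluster(cluster_tags, min_blogs=2):
    """Partition a cluster into A-blogs and C-blogs.

    cluster_tags maps each blog instance to its single tag (hypothetical schema).
    Assumption: a tag is an A-tag when at least min_blogs blogs in the cluster use it.
    """
    counts = Counter(cluster_tags.values())
    a_tags = {tag for tag, n in counts.items() if n >= min_blogs}
    a_blogs = sorted(blog for blog, tag in cluster_tags.items() if tag in a_tags)
    c_blogs = sorted(blog for blog, tag in cluster_tags.items() if tag not in a_tags)
    return a_tags, a_blogs, c_blogs


cluster = {
    "b1": "mobile",
    "b2": "mobile",
    "b3": "web",
    "b4": "web",
    "b5": "misc",
    "b6": "stuff",
    "b7": "personal",
}
a_tags, a_blogs, c_blogs = split_cluster(cluster)
```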
Intra-blog similarity, A- vs. C-blogs: A-tag blogs are ‘tighter’, that is, more similar to each other than C-tag blogs.
Similarity to centroid, A- vs. C-blogs: A-tag blogs also tend to be more similar to the cluster centroid.
A-bloggers are more similar to each other than C-bloggers, and more similar to the cluster centroid (topic definition) than C-bloggers.
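The two comparisons summarised above could be computed along these lines. This is a sketch on toy vectors, assuming cosine similarity over term vectors; the function names and the example data are hypothetical and are not taken from the experiments.

```python
import math
from itertools import combinations


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs within a group."""
    units = [unit(v) for v in vectors]
    pairs = list(combinations(units, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs)


def mean_similarity_to(vectors, target):
    """Average cosine similarity of a group to a target vector such as the centroid."""
    t = unit(target)
    return sum(dot(unit(v), t) for v in vectors) / len(vectors)


def centroid(vectors):
    """Mean of the unit-length versions of the vectors."""
    units = [unit(v) for v in vectors]
    return [sum(column) / len(units) for column in zip(*units)]


# Toy cluster: three tightly grouped A-blogs and three scattered C-blogs.
a_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
c_vectors = [[0.5, 0.5, 0.5], [0.2, 1.0, 0.0], [0.3, 0.0, 1.0]]
cluster_centroid = centroid(a_vectors + c_vectors)

a_tightness = mean_pairwise_similarity(a_vectors)
c_tightness = mean_pairwise_similarity(c_vectors)
a_to_centroid = mean_similarity_to(a_vectors, cluster_centroid)
c_to_centroid = mean_similarity_to(c_vectors, cluster_centroid)
```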
Relevance? Van Rijsbergen’s cluster hypothesis: similar documents are likely to be more relevant to an information requirement than less similar documents. Here the information requirement is the cluster summary. In application terms, the goal is to present to the user the blogs most relevant to the cluster summary. How do we test relevance? The Google Oracle.
Verification 2: by Google
Similarity to pages from Google
Consistency over time?
Blogger Entropy: We define blogger entropy as a measure of the degree of blogger dispersion between windows win_t and win_{t+n}, where q is the number of clusters at win_{t+n} containing users from cluster r; n_r^i is the number of users from cluster r contained in cluster i at win_{t+n}; and n_r is the number of users from cluster r available at win_{t+n}.
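The formula itself did not survive in the slide text, so the following is only a sketch under an assumption: that blogger entropy takes the standard Shannon form over the proportions n_r^i / n_r, summed over the q clusters that receive users from cluster r. Any normalisation the authors may apply is not reproduced. The function name and the toy assignments are hypothetical.

```python
import math
from collections import Counter


def blogger_entropy(members_r, later_assignment):
    """Dispersion of the bloggers of cluster r across the clusters of a later window.

    members_r: bloggers in cluster r at win_t.
    later_assignment: maps blogger -> cluster id at win_{t+n} (hypothetical schema).
    Assumed form: sum over i of -(n_r^i / n_r) * log(n_r^i / n_r).
    """
    # Only bloggers still available at the later window count towards n_r.
    available = [b for b in members_r if b in later_assignment]
    n_r = len(available)
    if n_r == 0:
        return 0.0
    # counts[i] is n_r^i; the number of keys is q.
    counts = Counter(later_assignment[b] for b in available)
    return sum(-(n / n_r) * math.log(n / n_r) for n in counts.values())


members = ["ann", "bob", "cat", "dan"]
stay_together = {"ann": 7, "bob": 7, "cat": 7, "dan": 7}
split_evenly = {"ann": 1, "bob": 1, "cat": 2, "dan": 2}

low = blogger_entropy(members, stay_together)
high = blogger_entropy(members, split_evenly)
```

Under this assumed form, bloggers who all land in the same later cluster give an entropy of zero, and an even split across two clusters gives log 2; lower values therefore indicate a group that stays together.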
Entropy, a-blogs vs c-blogs: A-blog entropy is lower. As the interval increases, a-blogs experience smaller increases in entropy. This suggests that a-bloggers tend to write consistently about the same things.
Example of A-blogs and C-blogs: Cluster 28 in Win5; k=50. Cluster description: mobile, internet, weblog, web, patent. Columns: A-blogs, C-blogs.
Conclusions: Our experiments suggest that a-bloggers tend to be the most relevant blogs within a cluster. They tend to form tight subgroups close to the cluster topic definition, are consistently more similar to pages ranked by Google using the cluster topic definition, and tend to stay together across different clusterings over time. In other words, they tend to write regularly about the same topic. What characteristics does the a-blogger have? A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others, and who writes regularly and in depth about fairly narrowly defined subjects. New professional bloggers?
References: Hayes, C., Avesani, P., Veeramachaneni, S. (2007). An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence. Hayes, C., Avesani, P., Veeramachaneni, S. (2006). An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles
Appendix: 13,518 bloggers, January 16 to February 27, 2006. Constraints: written in English and tag usage. Posting frequency follows a power law: 88% of bloggers posted between 1 and 50 times. High-frequency ‘blogs’ are generally spam/splog. Data from blogs with posts in the range 6-48: 7549 bloggers (56%), on average between 1 and 8 posts per week.
