Using Tags and Clustering to Identify Topic-specific Blogs
Presentation at the International Conference on Weblogs and Social Media (ICWSM 07) in Boulder, Colorado

  • Transcript

    • 1. Using Tags and Clustering to Identify Topic-specific Blogs. Conor Hayes, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland; Paolo Avesani, Bruno Kessler Institute (ITC-IRST), Trento, Italy
    • 2. Outline <ul><li>Organising Blogs: tags vs content </li></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Tags as a support to clustering </li></ul></ul><ul><li>A-tags and A-blogs, C-tags and C-blogs </li></ul><ul><li>A-blogs as topic-relevant sources </li></ul><ul><li>Experimental Analysis </li></ul><ul><ul><li>Intra-blog similarity/similarity to cluster centroid </li></ul></ul><ul><ul><li>Similarity to Google pages </li></ul></ul><ul><ul><li>Reclustering consistency – blogger entropy </li></ul></ul>
    • 3. Tag clouds
    • 4. The Long Tail <ul><li>A few tags are used frequently; very many are used infrequently or just once </li></ul><ul><li>7209 blogs, 3934 tags </li></ul><ul><li>Only 563 tags (14%) were used 2 or more times </li></ul><ul><li>Less than 50% of blogs could be retrieved using tags </li></ul><ul><li>So the recall of unprocessed tags is quite poor </li></ul>
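The long-tail figures on this slide can be reproduced on any tagged corpus with a few lines of counting. The following is a minimal sketch, not the authors' code: the input format (`blog_tags`, a mapping from blog id to its tags) and the function name `tag_usage_stats` are hypothetical, and it assumes that a tag "used 2 or more times" means a tag shared by at least two blogs.

```python
from collections import Counter

def tag_usage_stats(blog_tags):
    """Summarise tag reuse. blog_tags maps a blog id to its tags (assumed input format)."""
    # Count in how many blogs each tag appears.
    counts = Counter(tag for tags in blog_tags.values() for tag in set(tags))
    # Tags that appear in two or more blogs (assumption: this is what "used 2+ times" means).
    shared = {tag for tag, n in counts.items() if n >= 2}
    # A blog can be retrieved via tags only if it shares a tag with some other blog.
    retrievable = [b for b, tags in blog_tags.items() if shared & set(tags)]
    return {
        "n_tags": len(counts),
        "shared_tag_fraction": len(shared) / len(counts) if counts else 0.0,
        "blog_recall": len(retrievable) / len(blog_tags) if blog_tags else 0.0,
    }

# Toy data, invented for illustration only.
toy = {"b1": {"politics"}, "b2": {"politics", "iraq"}, "b3": {"knitting"}, "b4": {"my-day"}}
stats = tag_usage_stats(toy)
```

On the slide's own numbers, 563 of 3934 tags gives a shared-tag fraction of roughly 0.14.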
    • 5. Data <ul><li>We collected blog data from January 16 to February 27, 2006 </li></ul><ul><ul><li>Set of blog URLs provided by BlogPulse </li></ul></ul><ul><ul><li>Selected blogs that used tags and that were in English </li></ul></ul><ul><li>Selected blogs at either side of the median posting frequency </li></ul><ul><ul><li>7200 blogs in total </li></ul></ul><ul><li>We created 6 data sets, one for each week </li></ul><ul><ul><li>mean of 4250 blogs per week </li></ul></ul><ul><ul><li>70% overlap between consecutive weeks </li></ul></ul><ul><li>Each instance in each data set contains the posts from a single tag, from a single blogger </li></ul><ul><li>An instance is included in the data set for a week only if the user has posted in that week </li></ul>
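The instance definition above (one instance per blogger and tag, present in a week only if the blogger posted that week) can be sketched as a simple grouping step. This is an illustrative sketch under stated assumptions: the tuple layout of `posts`, the function name `weekly_instances`, and the choice to store only that week's posts in each weekly instance are not taken from the slides.

```python
from collections import defaultdict
from datetime import date

START = date(2006, 1, 16)  # first day of the collection period stated on the slide

def weekly_instances(posts, start=START, n_weeks=6):
    """Group posts into weekly data sets keyed by (blogger, tag).

    posts: iterable of (blogger, tag, day, text) tuples (hypothetical format).
    Assumption: the period is treated as six 7-day windows starting at `start`.
    """
    weeks = [defaultdict(list) for _ in range(n_weeks)]
    for blogger, tag, day, text in posts:
        w = (day - start).days // 7
        if 0 <= w < n_weeks:
            # The instance exists in week w only because a post fell in week w.
            weeks[w][(blogger, tag)].append(text)
    return weeks

# Toy posts, invented for illustration.
posts = [
    ("ann", "python", date(2006, 1, 16), "p1"),
    ("ann", "python", date(2006, 1, 24), "p2"),
    ("bob", "travel", date(2006, 1, 17), "p3"),
]
weeks = weekly_instances(posts)
```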
    • 6. Clustering <ul><li>Goals: Uncover topics in a document corpus and provide a means of summarisation </li></ul><ul><li>Spherical k-means: </li></ul><ul><ul><li>partitions the corpus into k disjoint groups of documents </li></ul></ul><ul><ul><li>produces an interpretable concept summary for each cluster </li></ul></ul><ul><li>Clustering quality: </li></ul><ul><ul><li>Blogs in the same cluster should be similar; </li></ul></ul><ul><ul><li>Blogs in different clusters should be dissimilar. </li></ul></ul><ul><li>H_r: ratio of intra- to inter-cluster similarity </li></ul>
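To make the clustering step concrete, here is a minimal sketch of spherical k-means (k-means on unit-length vectors with cosine similarity) using NumPy. It assumes each blog is a non-zero, non-negative term-weight row vector. The `h_ratio` function is only one plausible reading of "ratio of intra- to inter-cluster similarity" (mean similarity of members to their centroid divided by mean similarity of non-members to that centroid); the exact definition used in the talk is not given on the slide.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster rows of X by cosine similarity. Assumes non-zero, non-negative rows."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project documents onto the unit sphere
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # copy of k distinct rows
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)  # assign to most similar centroid
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave the centroid of an empty cluster unchanged
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-normalise to unit length
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

def h_ratio(X, labels, centroids, r):
    """Assumed form of H_r: mean member similarity over mean non-member similarity."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ centroids[r]
    return sims[labels == r].mean() / sims[labels != r].mean()

# Toy corpus: two obvious topics in a two-term vocabulary (invented data).
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels, centroids = spherical_kmeans(X, k=2)
ratios = [h_ratio(X, labels, centroids, r) for r in range(2)]
```

The heaviest-weighted terms of each unit-length centroid can serve as the cluster's concept summary.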
    • 7. Clustering: Tags vs. Content <ul><li>Tags were poor at partitioning the data into similar clusters of documents </li></ul>
    • 8. Partitioning the tag space
    • 9. Tag frequency distribution per cluster <ul><li>We find a power law frequency distribution </li></ul><ul><li>Frequency distribution varies with cluster strength </li></ul><ul><li>Example </li></ul><ul><li>Cluster 94: low H </li></ul><ul><li>Cluster 41: high H </li></ul><ul><li>Allowed us to identify weak clusters which were not identified by the standard H_r score (Hayes, Avesani & Veeramachaneni, IJCAI 07) </li></ul>
    • 10. A-bloggers <ul><li>A-bloggers: only a fraction (less than 0.4) of the blogs in each cluster contribute tags to the local tag cloud description </li></ul>
    • 11. Intrablog similarity: A- vs. C-blogs <ul><li>A-tag blogs are ‘tighter’ – more similar to each other than C-tag blogs </li></ul>
    • 12. Similarity to centroid: A- vs. C-blogs <ul><li>A-tag blogs also tend to be more similar to the cluster centroid </li></ul>
    • 13. A-bloggers are <ul><li>more similar to each other than c-bloggers </li></ul><ul><li>more similar to the cluster centroid (topic definition) than c-bloggers </li></ul>
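The two comparisons summarised above (similarity within a group, and similarity to the cluster centroid) can be sketched as follows. This is an illustrative sketch, assuming cosine similarity over non-zero document vectors; the function names and the toy vectors are invented and do not come from the slides.

```python
import numpy as np

def _unit_rows(X):
    """Scale each row to unit length (rows are assumed non-zero)."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def mean_pairwise_similarity(X):
    """Mean cosine similarity over all distinct pairs of rows (needs at least 2 rows)."""
    U = _unit_rows(X)
    n = len(U)
    sims = U @ U.T
    # Drop the diagonal (self-similarity) before averaging.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

def mean_similarity_to_centroid(X, centroid):
    """Mean cosine similarity between each row and a given centroid vector."""
    c = np.asarray(centroid, dtype=float)
    return float((_unit_rows(X) @ (c / np.linalg.norm(c))).mean())

# Toy cluster: a tight group and a looser group (invented data).
a_blogs = np.array([[1.0, 0.1, 0.0], [1.0, 0.2, 0.0], [0.9, 0.1, 0.0]])
c_blogs = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.2], [0.3, 1.0, 1.0]])
centroid = _unit_rows(np.vstack([a_blogs, c_blogs])).sum(axis=0)

a_tightness = mean_pairwise_similarity(a_blogs)
c_tightness = mean_pairwise_similarity(c_blogs)
a_to_centroid = mean_similarity_to_centroid(a_blogs, centroid)
c_to_centroid = mean_similarity_to_centroid(c_blogs, centroid)
```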
    • 14. Relevance? <ul><li>Van Rijsbergen’s cluster hypothesis: </li></ul><ul><ul><li>similar documents are likely to be more relevant to an information requirement than less similar documents </li></ul></ul><ul><li>The information requirement = the cluster summary </li></ul><ul><li>In application terms, the goal is to present to the user the blogs most relevant to the cluster summary </li></ul><ul><li>How do we test relevance? The Google Oracle </li></ul>
    • 15. Verification 2: by Google
    • 16. Similarity to pages from Google
    • 17. Consistency over time?
    • 18. Blogger Entropy <ul><li>We define Blogger Entropy: a measure of the degree of blogger dispersion between windows </li></ul><ul><li>q: number of clusters at win t+n containing users from cluster r </li></ul><ul><li>n_r^i: number of users from cluster r contained in cluster i at win t+n </li></ul><ul><li>n_r: number of users from cluster r available at win t+n </li></ul>[Figure labels: win t, win t+n]
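The slide defines the symbols but the formula itself did not survive in the transcript. A natural reading, offered here as an assumption rather than the authors' exact definition, is the Shannon entropy of the proportions n_r^i / n_r over the q clusters that receive users from cluster r. The function name `blogger_entropy` and its input format are hypothetical.

```python
import math
from collections import Counter

def blogger_entropy(later_clusters):
    """Dispersion of the users of one cluster r across clusters at win t+n.

    later_clusters: the cluster id at win t+n for each user of cluster r who is
    still available at win t+n (hypothetical input format).
    Assumed form: -sum_i (n_r^i / n_r) * log(n_r^i / n_r), natural log, unnormalised.
    """
    n_r = len(later_clusters)
    if n_r == 0:
        raise ValueError("no users from cluster r are available at win t+n")
    counts = Counter(later_clusters)  # n_r^i for each of the q receiving clusters
    return -sum((c / n_r) * math.log(c / n_r) for c in counts.values())

together = blogger_entropy([7, 7, 7, 7])   # all users stay in one cluster
split = blogger_entropy([7, 7, 3, 3])      # users divide evenly between two clusters
```

Under this reading, a group of bloggers that stays together has entropy 0, and the value grows as the group scatters across more clusters.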
    • 19. Entropy: a-blogs vs c-blogs <ul><li>A-blog entropy is lower </li></ul><ul><li>As the interval increases, a-blogs experience smaller increases in entropy </li></ul><ul><li>Suggests that a-bloggers tend to write consistently about the same things </li></ul>
    • 20. Example of A-blogs and C-blogs <ul><li>Cluster 28 in Win5; k = 50 </li></ul><ul><li>Cluster description: mobile, internet, weblog, web, patent </li></ul>[Table columns: A-blogs, C-blogs]
    • 21. Conclusions <ul><li>Our experiments suggest that a-bloggers tend to be the most relevant blogs within a cluster </li></ul><ul><ul><li>Tend to form tight subgroups close to the cluster topic definition </li></ul></ul><ul><ul><li>Consistently more similar to pages ranked by Google using the cluster topic definition </li></ul></ul><ul><ul><li>Tend to stay together across different clusterings over time. In other words, they tend to write regularly about the same topic </li></ul></ul><ul><li>What characteristics does the a-blogger have? </li></ul><ul><ul><li>A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others </li></ul></ul><ul><ul><li>Writes regularly in depth about fairly narrowly defined subjects </li></ul></ul><ul><ul><li>New professional bloggers? </li></ul></ul>
    • 22. References <ul><li>Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence </li></ul><ul><li>Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles </li></ul>
    • 23. Appendix <ul><li>13,518 bloggers: January 16 to February 27, 2006 </li></ul><ul><li>Constraints: written in English and uses tags </li></ul><ul><li>Posting frequency follows a power law: </li></ul><ul><ul><li>88% of bloggers posted between 1 and 50 times </li></ul></ul><ul><ul><li>High frequency ‘blogs’ are generally spam/splog </li></ul></ul><ul><ul><li>Data from blogs with posts in the range 6-48: 7549 bloggers (56%) </li></ul></ul><ul><ul><li>On average between 1 and 8 posts per week </li></ul></ul>
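The selection rule in the appendix (keep bloggers with 6 to 48 posts over the period, dropping both very low and very high frequency sources) amounts to a range filter. A minimal sketch, assuming a hypothetical `post_counts` mapping from blogger id to number of posts and treating both bounds as inclusive:

```python
def select_by_post_count(post_counts, low=6, high=48):
    """Keep bloggers whose post count lies in [low, high] (bounds assumed inclusive)."""
    return {blogger: n for blogger, n in post_counts.items() if low <= n <= high}

# Toy counts, invented for illustration: a sparse blogger, two regular ones, a likely splog.
post_counts = {"sparse": 2, "regular": 12, "busy": 48, "splog": 900}
selected = select_by_post_count(post_counts)

# The slide's own figures: 7549 of 13,518 bloggers fall in the selected range.
selected_share = 7549 / 13518
```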
