Conor Hayes - Topics, tags and trends in the blogosphere


WebCamp talk about topic clustering in blog networks.

Published in: Technology, Education

  1. WebCamp 07 Social Networks: Topics, Tags and Trends in the Blogosphere. Conor Hayes, DERI, Galway
  2. Outline
     • Blogs and the Blogosphere
     • Linking Blogs - Content vs. Tags
     • Topics and Bloggers
     • User entropy
     • Topic drift
     • Blog reactivity
     • Identifying consistent, topic-relevant blogs
  3. Blogs
     • Web site with journal-style entries
     • Dated in reverse chronological order
     • Generally written by a single user
     • Regularly updated
     • Distributed publishing
     • Easy to maintain and update
     • Exponential growth: Technorati tracked 50 million blogs (July 2006), doubling in size every 6 months
     • Increasingly important indicator of public opinion on politics, technology and current affairs
  4. Blogs vs. Usenet
     • Blogosphere is user-centred
         • Distributed architecture
         • Topic organisation is locally defined by tags
         • Not easy to find relevant posts related to the same topic
     • Usenet is topic-centred
         • Logically centralised architecture
         • Topic organisation is defined a priori by newsgroup heading and subject headers
         • Users know where to go to find information on a particular topic
  5. A topic-centred blogosphere?
     • Semantic Web
         • Link blogs using machine-readable metadata: SIOC
     • Tagging
         • Tags: simple propositional entities, locally defined
     • Link analysis
         • Majority of blogs have few or no inward connections
         • Blog Roll, Comment List
         • Relatively static group
     • 'Conventional' knowledge discovery techniques
         • Clustering + online recommender systems
  6. Nearest Neighbour Recommender
     • Method:
         1. Periodically, identify a set of nearest neighbour blogs, one set for each topic the user is interested in.
         2. Select matching posts from these neighbours.
     • What are the implications of user drift?
         • How quickly does the neighbourhood set change?
     • What is the relationship between users and topics?
         • How consistently are bloggers attached to topics?
         • Which neighbours consistently provide the most topic-relevant information?
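The two-step method on this slide can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the bag-of-words profiles, the cosine measure, the function names and the `threshold` value are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bag_of_words(text):
    """Hypothetical profile builder: lower-cased whitespace tokens with counts."""
    return Counter(text.lower().split())

def nearest_neighbours(topic_profile, blogs, n=2):
    """Step 1: rank blogs by how similar their combined posts are to the topic."""
    scored = []
    for name, posts in blogs.items():
        profile = bag_of_words(" ".join(posts))
        scored.append((cosine(topic_profile, profile), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:n]]

def recommend(topic_profile, blogs, neighbours, threshold=0.3):
    """Step 2: keep only those neighbour posts that match the topic (assumed cut-off)."""
    picks = []
    for name in neighbours:
        for post in blogs[name]:
            if cosine(topic_profile, bag_of_words(post)) >= threshold:
                picks.append((name, post))
    return picks
```

Because step 1 is rerun periodically, the neighbour list for a topic can change between runs, which is where the questions about user drift on this slide come in.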
  7. Experiments
     • We cluster blog data over different time periods
         • user entropy: measures whether bloggers remain together over time
         • topic drift: measures blogger behaviour in relation to topic growth and drift
     • We identify the most relevant blogs in each cluster using tag analysis
  8. Data
     • We collected blog data from Jan 16 to Feb 27, 2006
     • 7200 blogs in total
     • We created 6 data sets, one for each week
         • mean of 4250 blogs per week
         • 70% overlap between consecutive weeks
     • Each instance in each data set contains the posts filed under a single tag by a single blogger
     • An instance is included in the data set for a week only if the user has posted in that week
  9. Clustering
     • Goals: uncover latent structures reflecting topics in the collection and provide a means of summarisation
     • Spherical k-means: partitions the document corpus into k disjoint groups of documents
         • Produces an interpretable concept summary for each group
     • Clustering quality:
         • Blogs in the same cluster should be similar
         • Blogs in different clusters should be dissimilar
         • H_r: ratio of intra- to inter-cluster similarity
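One way to compute a ratio of intra- to inter-cluster similarity is sketched below. The slide does not give the exact definition, so this assumes mean pairwise cosine similarity inside cluster r divided by mean cosine similarity between members of r and all other documents; the function names are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_similarity(group_a, group_b, skip_same_index=False):
    """Mean pairwise similarity between two groups of unit vectors."""
    sims = [dot(a, b)
            for i, a in enumerate(group_a)
            for j, b in enumerate(group_b)
            if not (skip_same_index and i == j)]
    return sum(sims) / len(sims) if sims else 0.0

def cluster_quality(docs, labels, r):
    """Assumed form of the ratio: intra-cluster over inter-cluster similarity for cluster r."""
    docs = [normalise(d) for d in docs]
    inside = [d for d, lab in zip(docs, labels) if lab == r]
    outside = [d for d, lab in zip(docs, labels) if lab != r]
    intra = mean_similarity(inside, inside, skip_same_index=True)
    inter = mean_similarity(inside, outside)
    return intra / inter if inter else float("inf")
```

Under this reading, a value well above 1 indicates a cluster whose blogs resemble each other more than they resemble the rest of the collection.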
  10. Clustering…
  11. Partitions the document space
  12. At window win_{t+n}?
  13. Methodology
      • We cluster each data set in date order at different k
      • We reuse the cluster centroids in window t to seed the clusters in window t+1
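The seeding step can be illustrated with a small spherical k-means routine that accepts its starting centroids as an argument. This is a minimal sketch on dense toy vectors, assuming a fixed iteration count; it is not the implementation used for the experiments.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, seeds, iterations=10):
    """Cluster documents by cosine similarity, starting from the given centroids."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(c) for c in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda i: dot(d, centroids[i]))
                  for d in docs]
        # Update step: the new centroid is the normalised sum of its members.
        for i in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = normalise([sum(col) for col in zip(*members)])
    return labels, centroids

# Window t is seeded arbitrarily; window t+1 starts from window t's centroids,
# so cluster index r refers to the "same" topic in both windows.
week1 = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
labels1, centroids1 = spherical_kmeans(week1, seeds=[week1[0], week1[2]])
week2 = [[0.8, 0.2], [0.2, 0.8], [0.7, 0.3]]
labels2, centroids2 = spherical_kmeans(week2, seeds=centroids1)
```

Seeding from the previous window is what makes it meaningful to compare cluster r across windows.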
  14. User Drift
      • We define User Entropy (U_r): a measure of the degree of user dispersion between windows win_t and win_{t+n}
          • q: number of clusters at win_{t+n} containing users from cluster r
          • n_r^i: number of users from cluster r contained in cluster i at win_{t+n}
          • n_r: number of users from cluster r available at win_{t+n}
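The formula itself did not survive in this transcript. A plausible reading of the definitions is a Shannon entropy over the q destination clusters, with proportions n_r^i / n_r. The sketch below assumes exactly that, with a base-2 logarithm and no normalisation. The talk's own definition may differ, for example by a normalising factor.

```python
import math
from collections import Counter

def user_entropy(cluster_r_users, assignment_later):
    """Assumed form: -sum_i (n_r^i / n_r) * log2(n_r^i / n_r).

    cluster_r_users: users in cluster r at win_t.
    assignment_later: dict mapping user -> cluster index at win_{t+n}.
    """
    # n_r counts only users from cluster r who are still present later.
    present = [u for u in cluster_r_users if u in assignment_later]
    n_r = len(present)
    if n_r == 0:
        return 0.0
    # counts[i] is n_r^i; len(counts) is q.
    counts = Counter(assignment_later[u] for u in present)
    return -sum((c / n_r) * math.log2(c / n_r) for c in counts.values())
```

Under this assumption the value is 0 when all users of cluster r stay together, and it grows as they disperse across more clusters.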
  15. U_r as the Interval Increases
  16. Proportion of Users (figure legend: mean fraction of dataset contained in the top 20% of clusters; mean fraction of dataset contained in the bottom 20% of clusters)
  17. User Drift vs. Cluster Strength (figure: mean Pearson correlation of H_r vs. U_r, plotted against k from 0 to 100; correlation axis spans -0.9 to 0)
  18. User Drift Conclusions
      • Even where n is low, user dispersion occurs
      • As n increases, user entropy also increases, suggesting that the 'relationship' between users based on shared topics is short-lived
      • User dispersion is related to cluster strength
          • Strong clusters experience less user drift than weak clusters
          • However, the fraction of data from strong clusters is smaller than the fraction from weak clusters, by at least a factor of 2
      • We will return to user entropy later in the talk
  19. Topic drift
      • Inter-window similarity W_r^{t+n}
      • W_r^{t+n} for a cluster r at win_t is the similarity between the centroid of cluster r and the centroid of the corresponding cluster r at win_{t+n}:
        W_r^{t+n} = cos(C_{r,t}, C_{r,t+n})
      • Intuitively, W_r^{t+n} is a measure of the drift of the concept centroid C_r from win_t to win_{t+n}
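The inter-window similarity is a plain cosine between two centroid vectors, which can be sketched as below. The dense list representation and the function names are assumptions for illustration; the lists are taken to be ordered so that position r holds the centroid of cluster r in both windows.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def inter_window_similarity(centroids_t, centroids_t_n):
    """One similarity value per cluster r: cos(C_{r,t}, C_{r,t+n})."""
    return [cosine(c_t, c_tn) for c_t, c_tn in zip(centroids_t, centroids_t_n)]
```

Values near 1 mean the concept centroid has barely moved between the two windows, while lower values indicate stronger topic drift.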
  20. Topic Drift vs. User Drift (figure: mean Pearson correlation of W_r vs. U_r, plotted against k from 0 to 100; correlation axis spans -0.9 to 0)
  21. A Model of Topic Drift
      • Topic drift is related to user drift, but topics may be more stable than users
          • By observation: the rate of topic change is less than the rate of user drift
      • Our analysis so far suggests that users are not firmly fixed to topics; rather, they drift between topics over time
      • Diagram labels: Type a = {X, Y, Z}; Type b = {P, Q, R}
  22. Muhammad Cartoon Controversy (figure: timeline across the windows Jan 16, Jan 23, Jan 30, Feb 06, Feb 13, Feb 20)
  23. Observations
      • The blogosphere responds quickly to breaking news stories
      • The relationship between topic and user drift is pronounced where topic drift is extreme
      • Otherwise, there is steady turnover of users around relatively stable concepts
          • Users 'float' between topics
  24. Identifying the most relevant blogs
      • Technorati uses a tag cloud to link posts from different blogs together
  25. Tag clouds
      • In previous work we showed that tags perform poorly at grouping similar blog posts together
  26. Tag clouds
      • Clustering followed by tag analysis allowed us to determine which clusters contain strong concepts
      • It also allowed us to fragment the global tag space to produce local tag clouds
  27. A-bloggers
      • A-bloggers: only a portion (<0.4) of the blogs in each cluster contribute tags to the local tag cloud description
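How a blog comes to "contribute" to the local tag cloud is not spelled out on this slide. The sketch below assumes one plausible rule: a tag enters a cluster's tag cloud when at least `min_users` distinct blogs in that cluster use it, and a blog counts as an a-blog when it uses at least one such tag. The rule, the `min_users` threshold and the function name are hypothetical.

```python
from collections import defaultdict

def split_bloggers(cluster_tags, min_users=2):
    """Split one cluster's blogs into a-blogs and c-blogs under the assumed rule.

    cluster_tags: dict mapping blog name -> iterable of tags it used in this cluster.
    """
    users_per_tag = defaultdict(set)
    for blog, tags in cluster_tags.items():
        for tag in tags:
            users_per_tag[tag].add(blog)
    # Local tag cloud: tags shared by enough distinct blogs (assumed criterion).
    tag_cloud = {t for t, users in users_per_tag.items() if len(users) >= min_users}
    a_blogs = sorted(b for b, tags in cluster_tags.items() if set(tags) & tag_cloud)
    c_blogs = sorted(b for b in cluster_tags if b not in a_blogs)
    return tag_cloud, a_blogs, c_blogs
```

With a rule of this kind, blogs that only use idiosyncratic tags fall into the c-blog group even when their content sits in the same cluster.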
  28. A-bloggers are
      1. more similar to each other than c-bloggers
      2. more similar to the cluster centroid (topic definition) than c-bloggers
      3. more similar to pages retrieved from Google using the topic description
  29. Entropy: a-blogs vs. c-blogs
      • A-blog entropy is lower
      • As the interval increases, a-blogs experience smaller increases in entropy
      • Suggests that a-bloggers tend to write consistently about the same things over time
  30. A-blogs Example
      • Cluster 28 in Win5; k = 50
      • Cluster description: mobile, internet, weblog, web, patent
      • A-blogs
          1) “Comunications: technology, economic and social issues at the intersection of telecom, mobility and the Internet”
          2) “IP Blawg”: technology and intellectual property blog
          3) “Small business IP management blog: Patent, Trademark, Copyright, Internet, and Technology Law”
          4) “Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0”
          5) “Mobile Enterprise Weblog: the voice of enterprise mobility management”
      • C-blogs
          1) “Digital Music Den: Digital Music, online music marketing”
          2) " – blog about nothing”: general computing and technology
          3) “Dunkie's Saga” - personal blog: personal, cars, games, quizzes, some technology
          4) “Complex Christ – a vision for church that is organic, networked, decentralized, bottom-up, emergent, communal, flexible, always evolving”
          5) “Philips Brooks patent infringement updates”: legal blog on general patent issues (pharmaceutical as well as technological)
  31. Conclusions
      • We have accumulated empirical evidence to suggest that a-bloggers are topic authorities
          • They tend to form tight subgroups close to the cluster topic definition
          • They are consistently more similar to pages ranked by Google using the cluster topic definition
          • They tend to stay together across different clusterings over time; in other words, they tend to write regularly about the same topic
      • What characteristics does the a-blogger have?
          • A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
          • Writes regularly and in depth about fairly narrowly defined subjects
          • New professional bloggers
  32. Future Work
      • Produce tag hierarchies using hierarchical clustering
      • Combine this with the work of Hak Lae Kim and Dr. Suk-Hyung Hwang:
          • Formal concept model of a blog social network using tags and content analysis
      • Tag recommender: enrich SIOC topic descriptions with tag cloud metadata
      • Develop a set of style features to classify blogs
  33. References
      • Hayes, C., Avesani, P. (2007). Using Tags and Clustering to Identify Topic Relevant Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 07).
      • Hayes, C., Avesani, P., Veeramachaneni, S. (2007). An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence.
      • Hayes, C., Avesani, P., Veeramachaneni, S. (2006). An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles