Slideshow transcript
Slide 1: Topics, Tags and Trends in the Blogosphere
Conor Hayes, DERI, Galway
WebCamp 07 Social Networks
Slide 2: Outline
- Blogs and the Blogosphere
- Linking Blogs - Content vs. Tags
- Topics and Bloggers
- User entropy
- Topic drift
- Blog reactivity
- Identifying consistent, topic-relevant blogs
Slide 3: Blogs
- Web site with journal-style entries
- Dated in reverse chronological order
- Generally written by a single user
- Regularly updated
- Distributed publishing
- Easy to maintain and update
- Exponential growth: Technorati: 50 million blogs (July 2006), doubling in size every 6 months
- Increasingly important indicator of public opinion on politics, technology, current affairs
Slide 4: Blogs vs. Usenet
- Blogosphere is user-centred
  - Distributed architecture
  - Topic organisation is locally defined by tags
  - Not easy to find relevant posts related to the same topic
- Usenet is topic-centred
  - Logically centralised architecture
  - Topic organisation is defined a priori by newsgroup heading and subject headers
  - Users know where to go to find information on a particular topic
Slide 5: A topic-centred blogosphere?
- Semantic Web
  - Link blogs using machine-readable metadata: SIOC
- Tagging
  - Tags: simple propositional entities, locally defined
- Link analysis
  - Majority of blogs have few or no inward connections
  - Blog Roll, Comment List
  - Relatively static group
- ‘Conventional’ knowledge discovery techniques
  - Clustering + online recommender systems
Slide 6: Nearest Neighbour Recommender
- Method:
  1. Periodically, identify a set of nearest neighbour blogs, one set for each topic the user is interested in.
  2. Select matching posts from these neighbours.
- What are the implications of user drift?
  - How quickly does the neighbourhood set change?
- What is the relationship between users and topics?
  - How consistently are bloggers attached to topics?
  - Which neighbours consistently provide the most topic-relevant information?
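The two-step method on this slide can be sketched as follows. This is a minimal illustration only, assuming that blogs, posts and topics are represented as sparse term-weight dictionaries and compared by cosine similarity. The function names, the neighbourhood size `n`, the `threshold` and the toy data are all hypothetical and do not come from the talk.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbours(topic, blogs, n=2):
    """Step 1: rank blogs by similarity to a topic vector and keep the top n."""
    ranked = sorted(blogs, key=lambda name: cosine(topic, blogs[name]), reverse=True)
    return ranked[:n]

def select_posts(topic, posts, neighbours, threshold=0.3):
    """Step 2: keep posts written by neighbour blogs that match the topic."""
    return [title for blog, title, vec in posts
            if blog in neighbours and cosine(topic, vec) >= threshold]

# Toy data (hypothetical): one topic of interest, three blogs, three posts.
topic = {"mobile": 1.0, "patent": 1.0}
blogs = {
    "ip_blog": {"patent": 3.0, "law": 1.0},
    "mobile_blog": {"mobile": 2.0, "web": 1.0},
    "music_blog": {"music": 4.0},
}
posts = [
    ("ip_blog", "Patent ruling", {"patent": 2.0}),
    ("ip_blog", "Holiday photos", {"holiday": 1.0}),
    ("music_blog", "New album", {"music": 1.0}),
]

neighbours = nearest_neighbours(topic, blogs)
matches = select_posts(topic, posts, neighbours)
```

Because the neighbour set is recomputed only periodically, the questions on the slide matter: if bloggers drift between topics, a neighbourhood computed in one window may be stale in the next.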
Slide 7: Experiments
- We cluster blog data over different time periods
  - User entropy: measures whether bloggers remain together over time
  - Topic drift: measures blogger behaviour in relation to topic growth and drift
- We identify the most relevant blogs in each cluster using tag analysis
Slide 8: Data
- We collected blog data from Jan 16 to Feb 27, 2006
  - 7200 blogs in total
- We created 6 data sets, one for each week
  - Mean of 4250 blogs per week
  - 70% overlap between consecutive weeks
- Each instance in each data set contains the posts from a single tag, from a single blogger
- An instance is included in the data set for a week only if the user has posted in that week
Slide 9: Clustering
- Goals: uncover latent structures reflecting topics in the collection and provide a means of summarisation
- Spherical k-means: partitions the document corpus into k disjoint groups of documents
  - Produces an interpretable concept summary for each group
- Clustering quality: blogs in the same cluster should be similar; blogs in different clusters should be dissimilar
  - H_r: ratio of intra- to inter-cluster similarity
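The slide does not spell out how the two similarities in H_r are computed. The sketch below shows one plausible reading, stated as an assumption: intra-cluster similarity is the mean cosine similarity of a cluster's members to their own centroid, and inter-cluster similarity is the cosine similarity of that centroid to the centroid of the whole collection. The function names and the toy vectors are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return sum(x * y for x, y in zip(normalise(a), normalise(b)))

def centroid(vectors):
    """Unit-length mean direction of a set of vectors."""
    unit = [normalise(v) for v in vectors]
    return normalise([sum(col) for col in zip(*unit)])

def cluster_quality(cluster, collection):
    """Assumed H_r: intra-cluster similarity divided by inter-cluster similarity."""
    c = centroid(cluster)
    intra = sum(cosine(v, c) for v in cluster) / len(cluster)
    inter = cosine(c, centroid(collection))
    return intra / inter

# Toy term vectors (hypothetical): one tight cluster and one loose cluster.
tight = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
loose = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
collection = tight + loose

h_tight = cluster_quality(tight, collection)
h_loose = cluster_quality(loose, collection)
```

Under this reading, a cohesive cluster that sits far from the rest of the collection scores higher than a diffuse one, which matches the stated goal that blogs in the same cluster be similar and blogs in different clusters be dissimilar.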
Slide 10: Clustering…
Slide 11: Partitions the document space
Slide 12: At window win_{t+n}?
Slide 13: Methodology
- We cluster each data set in date order at different values of k
- We reuse the cluster centroids from window t to seed the clusters in window t+1
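The seeding step can be illustrated with a small spherical k-means written from scratch. This is a sketch under stated assumptions rather than the authors' implementation: documents are dense term vectors, the first window is seeded with two arbitrarily chosen documents, and the function name `spherical_kmeans` and the toy windows are hypothetical.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, seeds, iterations=10):
    """Cluster docs by cosine similarity, starting from the given seed centroids."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(s) for s in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment: each document joins the centroid it is most similar to.
        labels = [max(range(len(centroids)), key=lambda j: dot(d, centroids[j]))
                  for d in docs]
        # Update: each centroid becomes the normalised sum of its members.
        for j in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalise([sum(col) for col in zip(*members)])
    return labels, centroids

# Hypothetical weekly windows over a three-term vocabulary.
window_t = [[5, 1, 0], [4, 0, 1], [0, 1, 5], [1, 0, 4]]
window_t1 = [[4, 2, 0], [0, 2, 4], [1, 1, 5]]

# Window t is seeded with its first and third documents (an arbitrary choice).
labels_t, centroids_t = spherical_kmeans(window_t, seeds=[window_t[0], window_t[2]])
# Window t+1 is seeded with the centroids found in window t.
labels_t1, centroids_t1 = spherical_kmeans(window_t1, seeds=centroids_t)
```

Seeding window t+1 with the centroids of window t means cluster index r refers to a corresponding cluster in both windows, which is what makes it possible to compare users and centroids of "the same" cluster over time.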
Slide 14: User Drift
- We define user entropy U_r: a measure of the degree of user dispersion between windows win_t and win_{t+n}
  - q: number of clusters at win_{t+n} containing users from cluster r
  - n_{ri}: number of users from cluster r contained in cluster i at win_{t+n}
  - n_r: number of users from cluster r available at win_{t+n}
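The formula itself did not survive in the transcript. Given the three quantities defined on the slide, a natural candidate is the Shannon entropy of how the users of cluster r are distributed over the q clusters at the later window, U_r = -Σ_i (n_ri / n_r) log(n_ri / n_r). The sketch below assumes that form; any normalisation used in the original (for example dividing by log q) is not recoverable from the transcript, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def user_entropy(users_in_r, assignment_later):
    """Entropy of where the users of cluster r at win_t end up at win_{t+n}.

    users_in_r: users that were in cluster r at win_t
    assignment_later: dict mapping user -> cluster id at win_{t+n}
    """
    # Only users still present at win_{t+n} count towards n_r.
    destinations = Counter(assignment_later[u] for u in users_in_r
                           if u in assignment_later)
    n_r = sum(destinations.values())
    if n_r == 0:
        return 0.0
    # destinations has q entries; each count is n_ri for one cluster i.
    return sum((n_ri / n_r) * math.log(n_r / n_ri) for n_ri in destinations.values())

# Toy example (hypothetical): four users who shared a cluster at win_t.
cluster_r = ["ann", "bob", "cat", "dan"]
stay_together = {"ann": 3, "bob": 3, "cat": 3, "dan": 3}
split_evenly = {"ann": 1, "bob": 1, "cat": 2, "dan": 2}

u_stay = user_entropy(cluster_r, stay_together)
u_split = user_entropy(cluster_r, split_evenly)
```

Entropy is zero when all users of cluster r remain together and grows as they disperse across more clusters, which is the behaviour the slide describes as user dispersion.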
Slide 15: U_r as the Interval Increases
Slide 16: Proportion of Users
[Figure legend: one series is the mean fraction of the dataset contained in the top 20% of clusters; the other is the mean fraction of the dataset contained in the bottom 20% of clusters]
Slide 17: User Drift vs. Cluster Strength
[Figure: mean Pearson correlation R between H_r and U_r, plotted against k (0 to 100); R axis runs from 0 to -0.9]
Slide 18: User Drift Conclusions
- Even where n is low, user dispersion occurs
- As n increases, user entropy also increases, suggesting that the ‘relationship’ between users based on shared topics is short-lived
- User dispersion is related to cluster strength
  - Strong clusters experience less user drift than weak clusters
  - However, the fraction of data from strong clusters is smaller than the fraction from weak clusters, by at least a factor of 2
- We will return to user entropy later in the talk
Slide 19: Topic drift
- Inter-window similarity W_{r,t+n} for a cluster r at win_t is the similarity between the centroid of cluster r at win_t and the centroid of the corresponding cluster r at win_{t+n}
- W_{r,t+n} = cos(C_{r,t}, C_{r,t+n})
- Intuitively, W_{r,t+n} is a measure of the drift of the concept centroid C_r from win_t to win_{t+n}
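The inter-window similarity is a plain cosine between two centroid vectors, so it can be computed directly. A minimal sketch, assuming centroids are dense term vectors over a shared vocabulary; the function name and the toy centroids are hypothetical.

```python
import math

def inter_window_similarity(centroid_t, centroid_tn):
    """W_{r,t+n}: cosine similarity between a cluster's centroid in two windows."""
    dot = sum(x * y for x, y in zip(centroid_t, centroid_tn))
    norm_t = math.sqrt(sum(x * x for x in centroid_t))
    norm_tn = math.sqrt(sum(x * x for x in centroid_tn))
    return dot / (norm_t * norm_tn) if norm_t and norm_tn else 0.0

# Toy centroids (hypothetical) for the same cluster r in different windows.
c_t = [0.6, 0.8, 0.0]
c_stable = [0.6, 0.8, 0.0]   # the topic has not moved
c_drifted = [0.0, 0.8, 0.6]  # weight has shifted from the first term to the third

w_stable = inter_window_similarity(c_t, c_stable)
w_drifted = inter_window_similarity(c_t, c_drifted)
```

A value near 1 means the concept centroid has barely moved between windows; lower values indicate stronger topic drift.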
Slide 20: Topic Drift vs. User Drift
[Figure: mean Pearson correlation R between W_r and U_r, plotted against k (0 to 100); R axis runs from 0 to -0.9]
Slide 21: A Model of Topic Drift
- Topic drift is related to user drift, but topics may be more stable than users
- By observation: rate of topic change is less than rate of user drift
- Our analysis so far would suggest that users are not firmly fixed to topics; rather, they drift between topics over time.
[Diagram labels: Type a = {X,Y,Z}; Type b = {P,Q,R}]
Slide 22: Muhammad Cartoon Controversy
[Figure: time axis marked Jan 16, Jan 23, Jan 30, Feb 06, Feb 13, Feb 20]
Slide 23: Observations
- The blogosphere responds quickly to breaking news stories
- The relationship between topic and user drift is pronounced where topic drift is extreme
- Otherwise, there is steady turnover of users around relatively stable concepts
- Users ‘float’ between topics
Slide 24: Identifying the most relevant blogs
- Technorati uses a tag cloud to link posts by different blogs together
Slide 25: Tag clouds
- In previous work we showed that tags perform badly at grouping similar blog posts together
Slide 26: Tag clouds
- Clustering followed by tag analysis allowed us to determine which clusters contain strong concepts
- It also allowed us to fragment the global tag space to produce local tag clouds
Slide 27: A-bloggers
- A-bloggers: only a portion (<0.4) of blogs in each cluster contributes tags to the local tag cloud description
Slide 28: A-bloggers are
1. more similar to each other than c-bloggers
2. more similar to the cluster centroid (topic definition) than c-bloggers
3. more similar to pages retrieved from Google using the topic description
Slide 29: Entropy: a-blogs vs c-blogs
- A-blog entropy is lower
- As the interval increases, a-blogs experience smaller increases in entropy
- Suggests that a-bloggers tend to write consistently about the same things over time
Slide 30: A-blogs Example
Cluster 28 in Win5; k = 50
Cluster description: mobile, internet, weblog, web, patent
A-blogs
4) “Comunications: technology, economic and social issues at the intersection of telecom, mobility and the Internet”
5) “IP Blawg”: technology and intellectual property blog
6) “Small business IP management blog: Patent, Trademark, Copyright, Internet, and Technology Law”
7) “Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0”
8) “Mobile Enterprise Weblog: the voice of enterprise mobility management”
C-blogs
11) “Digital Music Den: Digital Music, online music marketing”
12) “icarusindie.com – blog about nothing”: general computing and technology
13) “Dunkie's Saga” - personal blog: personal, cars, games, quizzes, some technology
14) “Complex Christ – a vision for church that is organic, networked, decentralized, bottom-up, emergent, communal, flexible, always evolving”
15) “Philips Brooks patent infringement updates”: legal blog on general patent issues (pharmaceutical as well as technological)
Slide 31: Conclusions
- We have accumulated empirical evidence to suggest that a-bloggers are topic authorities
  - Tend to form tight subgroups close to the cluster topic definition
  - Consistently more similar to pages ranked by Google using the cluster topic definition
  - Tend to stay together at different clusterings over time. In other words, they tend to write regularly about the same topic
- What characteristics does the a-blogger have?
  - A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
  - Writes regularly and in depth about fairly narrowly defined subjects
  - New professional bloggers
Slide 32: Future Work
- Produce tag hierarchies using hierarchical clustering
- Combine this with the work of Hak Lae Kim and Dr. Suk-Hyung Hwang: formal concept model of a blog social network using tags and content analysis
- Tag recommender: enrich SIOC topic descriptions with tag cloud metadata
- Develop a set of style features to classify blogs
Slide 33: References
- Hayes, C., Avesani, P. (2007) Using Tags and Clustering to Identify Topic Relevant Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 07)
- Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence
- Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 17th European Conference on Machine Learning and the 10th European Conference on Principles


