Conor Hayes - Topics, tags and trends in the blogosphere

From DERIGalway, 2 years ago

WebCamp talk about topic clustering in blog networks.

Tags

webcamp social network analysis sna blog blogoshere social research sociology blogosfera treball blogs

Slideshow transcript

Slide 1: Topics, Tags and Trends in the Blogosphere. Conor Hayes, DERI, Galway. WebCamp 07 Social Networks

Slide 2: Outline
- Blogs and the Blogosphere
- Linking Blogs - Content vs. Tags
- Topics and Bloggers
- User entropy
- Topic drift
- Blog reactivity
- Identifying consistent, topic-relevant blogs

Slide 3: Blogs
- Web site with journal-style entries
- Dated in reverse chronological order
- Generally written by a single user
- Regularly updated
- Distributed publishing
- Easy to maintain and update
- Exponential growth: Technorati tracked 50 million blogs (July 2006), doubling in size every 6 months
- Increasingly important indicator of public opinion on politics, technology, current affairs

Slide 4: Blogs vs. Usenet
- Blogosphere is user-centred
  - Distributed architecture
  - Topic organisation is locally defined by tags
  - Not easy to find relevant posts related to the same topic
- Usenet is topic-centred
  - Logically centralised architecture
  - Topic organisation is defined a priori by newsgroup heading and subject headers
  - Users know where to go to find information on a particular topic

Slide 5: A topic-centred blogosphere?
- Semantic Web
  - Link blogs using machine-readable metadata: SIOC
- Tagging
  - Tags: simple propositional entities, locally defined
- Link analysis
  - Majority of blogs have few or no inward connections
  - Blog Roll, Comment List
  - Relatively static group
- 'Conventional' knowledge discovery techniques
  - Clustering + online recommender systems

Slide 6: Nearest Neighbour Recommender
Method:
1. Periodically, identify a set of nearest neighbour blogs, one set for each topic the user is interested in.
2. Select matching posts from these neighbours.
- What are the implications of user drift?
  - How quickly does the neighbourhood set change?
- What is the relationship between users and topics?
  - How consistently are bloggers attached to topics?
  - Which neighbours consistently provide the most topic-relevant information?
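
The two-step method on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes blogs, posts and topics are all represented as sparse term-weight dictionaries compared by cosine similarity, and the names `cosine`, `nearest_neighbours`, `recommend` and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(topic_vector, blog_vectors, k):
    """Step 1: rank blogs by similarity to one topic and keep the top k."""
    ranked = sorted(
        blog_vectors,
        key=lambda blog: cosine(topic_vector, blog_vectors[blog]),
        reverse=True,
    )
    return ranked[:k]


def recommend(topic_vector, neighbours, posts_by_blog, threshold):
    """Step 2: select posts from the neighbour blogs that match the topic.

    posts_by_blog maps a blog id to a list of (post_id, term_vector) pairs.
    The similarity threshold is an assumption made for this sketch.
    """
    picks = []
    for blog in neighbours:
        for post_id, vector in posts_by_blog.get(blog, []):
            if cosine(topic_vector, vector) >= threshold:
                picks.append(post_id)
    return picks
```

Because step 1 is repeated periodically, the questions on the slide amount to asking how much the output of `nearest_neighbours` changes between runs.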

Slide 7: Experiments
- We cluster blog data over different time periods
  - User entropy: measures whether bloggers remain together over time
  - Topic drift: measures blogger behaviour in relation to topic growth and drift
- We identify the most relevant blogs in each cluster using tag analysis

Slide 8: Data
- We collected blog data from Jan 16 to Feb 27, 2006
  - 7200 blogs in total
- We created 6 data sets, one for each week
  - Mean of 4250 blogs per week
  - 70% overlap between consecutive weeks
- Each instance in each data set contains the posts from a single tag, from a single blogger
- An instance is included in the data set for a week only if the user has posted in that week
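
One way to picture the weekly data sets is the sketch below. It assumes each post arrives as a `(blogger, tag, day, text)` tuple and that an instance holds only the posts from its own week; both the tuple layout and the function name `weekly_instances` are assumptions for illustration, not details given in the talk.

```python
from collections import defaultdict
from datetime import date


def weekly_instances(posts, start, n_weeks):
    """Build one data set per week, with instances keyed by (blogger, tag).

    posts: iterable of (blogger, tag, day, text) tuples, day being a date.
    A (blogger, tag) instance exists in a week only if there is at least
    one post for it in that week, so inactive bloggers drop out naturally.
    """
    windows = [defaultdict(list) for _ in range(n_weeks)]
    for blogger, tag, day, text in posts:
        week = (day - start).days // 7
        if 0 <= week < n_weeks:
            windows[week][(blogger, tag)].append(text)
    return windows
```

With `start=date(2006, 1, 16)` and `n_weeks=6`, the windows begin on Jan 16, Jan 23, Jan 30, Feb 06, Feb 13 and Feb 20.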

Slide 9: Clustering
- Goals: uncover latent structures reflecting topics in the collection and provide a means of summarisation
- Spherical k-means: partitions document corpus into k disjoint groups of documents
  - Produces interpretable concept summary for each group
- Clustering quality:
  - Blogs in the same cluster should be similar; blogs in different clusters should be dissimilar
  - Hr: ratio of intra- to inter-cluster similarity
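
A compact sketch of spherical k-means and of a quality ratio in the spirit of Hr follows. The slide does not spell out how intra- and inter-cluster similarity are computed, so the reading used here is an assumption: intra-cluster similarity is the mean cosine of a cluster's members to its centroid, and inter-cluster similarity is the cosine of that centroid to the centroid of the whole collection. The function names are hypothetical, and vectors are dense lists over a shared vocabulary.

```python
import math


def normalise(v):
    """Scale a dense vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(docs, seeds, iterations=10):
    """Partition docs into len(seeds) disjoint groups by cosine similarity."""
    docs = [normalise(d) for d in docs]
    centroids = [normalise(s) for s in seeds]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assign each document to the centroid with the highest cosine.
        labels = [
            max(range(len(centroids)), key=lambda j: dot(d, centroids[j]))
            for d in docs
        ]
        # Recompute each centroid as the normalised sum of its members.
        for j in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centroids[j] = normalise([sum(col) for col in zip(*members)])
    return labels, centroids


def cluster_quality(docs, labels, centroids, r):
    """Assumed reading of Hr: intra-cluster over inter-cluster similarity."""
    docs = [normalise(d) for d in docs]
    members = [d for d, lab in zip(docs, labels) if lab == r]
    if not members:
        return 0.0
    intra = sum(dot(d, centroids[r]) for d in members) / len(members)
    global_centroid = normalise([sum(col) for col in zip(*docs)])
    inter = dot(centroids[r], global_centroid)
    return intra / inter if inter != 0 else math.inf
```

Under this reading, a tight cluster that sits far from the centre of the collection scores high, which matches the slide's requirement that blogs be similar within a cluster and dissimilar across clusters.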

Slide 10: Clustering…

Slide 11: Partitions the document space

Slide 12: At window $win_{t+n}$?

Slide 13: Methodology
- We cluster each data set in date order at different values of k
- We reuse the cluster centroids in window t to seed the clusters in window t+1
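
The seeding step can be sketched as a simple loop over the windows. The sketch assumes a clustering routine `cluster_fn(docs, seeds)` that returns `(labels, centroids)` and that all windows share one vocabulary so centroids are comparable; both the function name `cluster_in_date_order` and that interface are hypothetical.

```python
def cluster_in_date_order(windows, initial_seeds, cluster_fn):
    """Cluster each window in turn, seeding it with the previous centroids.

    windows: list of document collections, ordered by date.
    cluster_fn: callable (docs, seeds) -> (labels, centroids); an assumed
    interface standing in for whatever clustering routine is used.
    """
    results = []
    seeds = initial_seeds
    for docs in windows:
        labels, centroids = cluster_fn(docs, seeds)
        results.append((labels, centroids))
        # Cluster r in window t+1 starts from the centroid of cluster r in
        # window t, so cluster indices stay aligned across windows.
        seeds = centroids
    return results
```

Keeping the cluster indices aligned in this way is what makes it meaningful to ask where the users of cluster r have gone in a later window.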

Slide 14: User Drift
We define User Entropy, Ur: a measure of the degree of user dispersion between windows $win_t$ and $win_{t+n}$
- q: number of clusters at $win_{t+n}$ containing users from cluster r
- $n_{ri}$: number of users from cluster r contained in cluster i at $win_{t+n}$
- $n_r$: number of users from cluster r available at $win_{t+n}$
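
The formula itself did not survive in the transcript. The sketch below shows one entropy that is consistent with the three definitions on the slide: the Shannon entropy of the fractions $n_{ri}/n_r$ over the q clusters that received users from cluster r. It is an assumption, not the slide's formula; in particular, the original may include a normalising factor that is omitted here. The function name and argument layout are hypothetical.

```python
import math
from collections import Counter


def user_entropy(users_in_r, assignment_next):
    """Dispersion of cluster r's users across the clusters of a later window.

    users_in_r: users that belonged to cluster r at win_t.
    assignment_next: dict mapping user -> cluster id at win_{t+n}; users who
    did not post in that window are simply absent from the dict.
    """
    # n_r counts only the users from r that are available at win_{t+n}.
    present = [u for u in users_in_r if u in assignment_next]
    n_r = len(present)
    if n_r == 0:
        return 0.0
    # counts has q entries, one per cluster that received users from r.
    counts = Counter(assignment_next[u] for u in present)
    return -sum((n_ri / n_r) * math.log(n_ri / n_r) for n_ri in counts.values())
```

The value is 0 when all of cluster r's users stay together and grows as they scatter across more clusters.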

Slide 15: Ur as the Interval Increases

Slide 16: Proportion of Users
[Figure. Legend: mean fraction of dataset contained in top 20% of clusters; mean fraction of dataset contained in bottom 20% of clusters]

Slide 17: User Drift vs. Cluster Strength
[Figure: (mean) correlation of Hr vs. Ur at k. y-axis: Pearson R, from 0 to -0.9; x-axis: k, from 0 to 100]

Slide 18: User Drift Conclusions
- Even where n is low, user dispersion occurs
- As n increases, user entropy also increases, suggesting that the 'relationship' between users based on shared topics is short-lived
- User dispersion is related to cluster strength
  - Strong clusters experience less user drift than weak clusters
  - However, the fraction of data from strong clusters is smaller than the fraction from weak clusters, by at least a factor of 2
- We will return to user entropy later in the talk

Slide 19: Topic drift
Inter-window similarity $W_{r,t+n}$
- $W_{r,t+n}$ for a cluster r at $win_t$ is the similarity between the centroid of cluster r and the centroid of the corresponding cluster r at $win_{t+n}$:
  $W_{r,t+n} = \cos(C_{r,t}, C_{r,t+n})$
- Intuitively, $W_{r,t+n}$ is a measure of the drift of the concept centroid $C_r$ from $win_t$ to $win_{t+n}$
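
As a small worked illustration of the inter-window similarity, the sketch below computes the cosine between the centroid of a cluster at one window and the centroid of the corresponding cluster at a later window. It assumes centroids are dense lists over the same vocabulary; the function name is hypothetical.

```python
import math


def inter_window_similarity(centroid_t, centroid_tn):
    """Cosine between the centroids of cluster r at win_t and win_{t+n}."""
    dot = sum(a * b for a, b in zip(centroid_t, centroid_tn))
    norm_t = math.sqrt(sum(a * a for a in centroid_t))
    norm_tn = math.sqrt(sum(b * b for b in centroid_tn))
    if norm_t == 0.0 or norm_tn == 0.0:
        return 0.0
    return dot / (norm_t * norm_tn)
```

A value near 1 means the concept has barely moved between the two windows; lower values indicate stronger topic drift.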

Slide 20: Topic Drift vs. User Drift
[Figure: (mean) correlation of Wr vs. Ur at k. y-axis: Pearson R, from 0 to -0.9; x-axis: k, from 0 to 100]

Slide 21: A Model of Topic Drift
- Topic drift is related to user drift, but topics may be more stable than users
- By observation: rate of topic change is less than rate of user drift
- Our analysis so far would suggest that users are not firmly fixed to topics; rather, they drift between topics over time
[Figure labels: Type a = {X,Y,Z}; Type b = {P,Q,R}]

Slide 22: Muhammad Cartoon Controversy
[Figure: timeline across the windows Jan 16, Jan 23, Jan 30, Feb 06, Feb 13, Feb 20]

Slide 23: Observations
- The blogosphere responds quickly to breaking news stories
- The relationship between topic and user drift is pronounced where topic drift is extreme
- Otherwise, there is steady turnover of users around relatively stable concepts
  - Users 'float' between topics

Slide 24: Identifying the most relevant blogs
- Technorati uses a tag cloud to link posts by different blogs together

Slide 25: Tag clouds
- In previous work we showed that tags perform badly at grouping similar blog posts together

Slide 26: Tag clouds
- Clustering followed by tag analysis allowed us to determine which clusters contain strong concepts
- It also allowed us to fragment the global tag space to produce local tag clouds

Slide 27: A-bloggers
- A-bloggers: only a portion (<0.4) of blogs in each cluster contributes tags to the local tag cloud description

Slide 28: A-bloggers are
1. more similar to each other than c-bloggers
2. more similar to the cluster centroid (topic definition) than c-bloggers
3. more similar to pages retrieved from Google using the topic description

Slide 29: Entropy: a-blogs vs c-blogs
- A-blog entropy is lower
- As the interval increases, a-blogs experience smaller increases in entropy
- Suggests that a-bloggers tend to write consistently about the same things over time

Slide 30: A-blogs Example
- Cluster 28 in Win5; k = 50
- Cluster description: mobile, internet, weblog, web, patent
A-blogs
4) “Comunications: technology, economic and social issues at the intersection of telecom, mobility and the Internet”
5) “IP Blawg”: technology and intellectual property blog
6) “Small business IP management blog: Patent, Trademark, Copyright, Internet, and Technology Law”
7) “Open Gardens: Wireless mobility, Digital convergence - Mobile web 2.0”
8) “Mobile Enterprise Weblog: the voice of enterprise mobility management”
C-blogs
11) “Digital Music Den: Digital Music, online music marketing”
12) “icarusindie.com – blog about nothing”: general computing and technology
13) “Dunkie's Saga” - personal blog: personal, cars, games, quizzes, some technology
14) “Complex Christ – a vision for church that is organic, networked, decentralized, bottom-up, emergent, communal, flexible, always evolving”
15) “Philips Brooks patent infringement updates”: legal blog on general patent issues (pharmaceutical as well as technological)

Slide 31: Conclusions
- We have accumulated empirical evidence to suggest that a-bloggers are topic authorities
  - Tend to form tight subgroups close to the cluster topic definition
  - Consistently more similar to pages ranked by Google using the cluster topic definition
  - Tend to stay together at different clusterings over time. In other words, they tend to write regularly about the same topic
- What characteristics does the a-blogger have?
  - A blogger who is aware of a wider potential readership and chooses his/her tags so that they can be understood easily by others
  - Writes regularly and in depth about fairly narrowly defined subjects
  - New professional bloggers

Slide 32: Future Work
- Produce tag hierarchies using hierarchical clustering
- Combine this with the work of Hak Lae Kim and Dr. Suk-Hyung Hwang:
  - Formal concept model of a blog social network using tags and content analysis
- Tag recommender: enrich SIOC topic descriptions with tag cloud metadata
- Develop a set of style features to classify blogs

Slide 33: References
- Hayes, C., Avesani, P. (2007) Using Tags and Clustering to Identify Topic Relevant Blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 07)
- Hayes, C., Avesani, P., Veeramachaneni, S. (2007) An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI-07, the International Joint Conference on Artificial Intelligence
- Hayes, C., Avesani, P., Veeramachaneni, S. (2006) An Analysis of Bloggers and Topics for a Blog Recommender System. In Proceedings of the Workshop on Web Mining (Webmine 06), 7th European Conference on Machine Learning and the 10th European Conference on Principles