



blogosphere, Information retrieval,

Published in: Technology


  1. 1. Blogosphere: Research Issues, Tools, and Applications (Nitin Agarwal and Huan Liu). Sunil Bandla, INF384H – Fall 2011
  2. 2. Agenda <ul><li>Introduction </li></ul><ul><li>Research issues </li></ul><ul><li>Tools and Methods </li></ul><ul><li>Case Study </li></ul><ul><li>Blogosphere and Social Networks </li></ul>
  3. 3. Web 2.0 <ul><li>It is the reason behind the surge of interest in online communities </li></ul><ul><li>Former consumers are now producers </li></ul><ul><li>Collaborative environment </li></ul><ul><li>User-generated content </li></ul><ul><li>Collective wisdom </li></ul><ul><li>Web 2.0 services: </li></ul><ul><ul><li>Blogs, wikis, social networking sites, social tagging </li></ul></ul><ul><ul><li>WordPress, Wikipedia, Facebook, YouTube, Twitter, Yelp </li></ul></ul>
  4. 4. Social Networks <ul><li>“A social network is a social structure made up of individuals connected by one or more types of interdependency, such as friendship, common interest…” – Wikipedia </li></ul><ul><li>Web 2.0 is enabling virtual social networks </li></ul><ul><li>Size and connectedness vary across networks </li></ul><ul><li>Examples: </li></ul><ul><ul><li>Friendship networks (Facebook, Myspace) </li></ul></ul><ul><ul><li>Media sharing (Flickr, YouTube) </li></ul></ul>
  5. 5. Source: The New York Times. “The site, chock full of advertising, is a moneymaking machine – so much so that Ms. Armstrong and her husband have both quit their regular jobs.” The reason? The advertisers are eager to influence her 850,000 readers. Arnold Kim, founder and senior editor of “The site places MacRumors No. 2 on a list of the ‘25 most valuable blogs,’ …” What is the potential value? “Two of the other tech-oriented blogs on its list, …, were sold earlier this year, reportedly for sums in excess of $25 million.” Slide Credit: Liu & Nitin
  6. 6. Blogosphere <ul><li>Blog sites </li></ul><ul><li>Bloggers </li></ul><ul><li>Blog posts </li></ul><ul><li>Blogroll </li></ul><ul><li>Permalinks </li></ul><ul><li>Low barrier to publication </li></ul><ul><li>Readers can comment instantly, which gives the blogger a feeling of satisfaction </li></ul><ul><li>Individual vs. community blogs </li></ul>
  7. 7. Blogosphere <ul><li>Complex social networks </li></ul><ul><li>Bloggers/blog posts/blog sites become nodes </li></ul><ul><li>Relationships are represented by edges between nodes </li></ul><ul><li>Inlinks & Outlinks </li></ul>
  8. 8. Agenda <ul><li>Introduction </li></ul><ul><li>Research issues </li></ul><ul><li>Tools and Methods </li></ul><ul><li>Case Study </li></ul><ul><li>Blogosphere and Social Networks </li></ul>
  9. 9. Modeling the Blogosphere <ul><li>Helps in generating an artificial dataset to compare algorithms </li></ul><ul><li>Study patterns that could explain community discovery, spam blogs, influence, etc. </li></ul><ul><li>Key differences between Web and Blogosphere </li></ul><table><tr><th>Web</th><th>Blogosphere</th></tr><tr><td>Web models assume dense graph structure</td><td>Blogosphere has a very sparse hyperlink structure</td></tr><tr><td>Not much interaction</td><td>Interaction in the form of comments and replies</td></tr><tr><td>Static web pages</td><td>Dynamic blog posts</td></tr><tr><td>Conventional web pages do not have tags</td><td>Blog posts have tags and categories</td></tr></table>
  10. 10. Modeling the Blogosphere <ul><li>Web models: </li></ul><ul><ul><li>Random graph </li></ul></ul><ul><ul><li>Preferential attachment graph models </li></ul></ul><ul><ul><li>Hybrid graph models </li></ul></ul><ul><li>Blogosphere models: </li></ul><ul><ul><li>To study temporal patterns of blogosphere like how often people create blog posts, how they are linked </li></ul></ul><ul><ul><li>Blogrolls to create a network of connected posts </li></ul></ul>
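The two basic web graph models named on this slide can be contrasted with a small sketch. It is an illustration only, not the models evaluated in the surveyed work: the function names, the seed clique, and the parameters `n`, `p`, and `m` are assumptions chosen for the example.

```python
import random

def random_graph(n, p, rng):
    """Random graph: each possible undirected edge appears independently with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

def preferential_attachment_graph(n, m, rng):
    """Each new node links to m distinct existing nodes, picked with
    probability proportional to their current degree."""
    # Assumption: start from a clique of m + 1 nodes so every node has nonzero degree.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Every node appears in 'endpoints' once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((t, new))
            endpoints.extend((t, new))
    return edges

def degrees(n, edges):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

rng = random.Random(0)
n = 200
pa_edges = preferential_attachment_graph(n, 2, rng)
# Match the expected edge count so the two degree distributions are comparable.
er_edges = random_graph(n, len(pa_edges) / (n * (n - 1) / 2), rng)
print("max degree, preferential attachment:", max(degrees(n, pa_edges)))
print("max degree, random graph:", max(degrees(n, er_edges)))
```

Comparing the largest degrees of the two graphs is a quick way to see the hub-forming tendency of preferential attachment; a blogosphere-specific model would additionally need the temporal and comment structure listed above.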
  11. 11. Blog Clustering <ul><li>Automatic organization of the content </li></ul><ul><li>Helps readers focus on interesting categories </li></ul><ul><li>Keyword based: </li></ul><ul><ul><li>Brooks and Montanez (2006) pick the top 3 keywords to cluster blog posts </li></ul></ul><ul><ul><li>Li et al. (2007) assign different weights to the title, body and comments of blog posts </li></ul></ul><ul><li>Collective wisdom based: </li></ul><ul><ul><li>Agarwal et al. (2008) use a category relation graph to merge categories and cluster blogs </li></ul></ul>
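The keyword-based idea can be sketched by combining both bullets: weight terms by the field they occur in, keep the top 3 keywords per post, and group posts that share a keyword. The weights, stopword list, and plain term counts below are assumptions for illustration; the slide does not give the values or the term-scoring scheme used in the cited papers.

```python
from collections import Counter, defaultdict

# Hypothetical field weights and stopwords, chosen only for this example.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "comments": 0.5}
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "is", "in", "it"})

def weighted_terms(post):
    """Sum per-field term counts, scaled by the weight of the field."""
    scores = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for word in post.get(field, "").lower().split():
            scores[word] += weight
    return scores

def top_keywords(post, k=3):
    """Return the k highest-scoring non-stopword terms (ties broken alphabetically)."""
    scores = weighted_terms(post)
    candidates = [w for w in scores if w not in STOPWORDS]
    return sorted(candidates, key=lambda w: (-scores[w], w))[:k]

def cluster_by_keyword(posts, k=3):
    """Group post ids under each of their top-k keywords."""
    clusters = defaultdict(list)
    for post_id, post in posts.items():
        for keyword in top_keywords(post, k):
            clusters[keyword].append(post_id)
    return dict(clusters)

posts = {
    "p1": {"title": "iphone battery", "body": "the battery drains fast", "comments": "fast fast"},
    "p2": {"title": "android battery tips", "body": "save battery in android", "comments": ""},
}
print(cluster_by_keyword(posts))
```

Posts may land in more than one group here, which is a simplification; a real system would likely use TF-IDF scores and a proper clustering step.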
  12. 12. Blog Mining <ul><li>Valuable resources to track: </li></ul><ul><ul><li>Consumers’ beliefs and opinions </li></ul></ul><ul><ul><li>Initial reaction to a launch </li></ul></ul><ul><ul><li>Trends and buzzwords </li></ul></ul><ul><li>Blog conversations provide insights into how information flows and how opinions are shaped and influenced </li></ul><ul><li>Pulse uses a Naïve Bayes classifier trained on annotated sentences to classify unlabeled data </li></ul><ul><li>Attardi and Simi (2006) use opinionated words acquired from WordNet to improve blog retrieval </li></ul>
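A Naïve Bayes sentence classifier of the kind mentioned for Pulse can be sketched in a few lines. This is a generic multinomial Naïve Bayes with add-one smoothing, not the Pulse implementation; the class name, the whitespace tokenizer, and the tiny training set are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            logp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in sentence.lower().split():
                if word in self.vocab:  # words never seen in training are ignored
                    logp += math.log((self.word_counts[label][word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical annotated sentences.
train = [
    ("great phone love it", "pos"),
    ("battery is terrible", "neg"),
    ("love the screen", "pos"),
    ("terrible support hate it", "neg"),
]
clf = NaiveBayesSentenceClassifier().fit([s for s, _ in train], [y for _, y in train])
print(clf.predict("love this phone"))
print(clf.predict("terrible battery"))
```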
  13. 13. Community Discovery <ul><li>Content analysis and text analysis of the blog posts to identify communities </li></ul><ul><li>Kleinberg et al. cluster all the expert communities together as authorities using an authority-based approach </li></ul><ul><li>Kumar et al. extend it to include co-citations to extract all communities on the web </li></ul><ul><li>Some researchers studied community extraction using newsgroups and discussion boards </li></ul>
  14. 14. Influence in Blogs <ul><li>Influential bloggers: </li></ul><ul><ul><li>Are potential market-movers </li></ul></ul><ul><ul><li>Sway opinions in political campaigns </li></ul></ul><ul><ul><li>Troubleshoot the problems of peer consumers </li></ul></ul><ul><ul><li>Useful for “word-of-mouth” advertising of products </li></ul></ul><ul><li>Finding influential blog sites is different from identifying influential bloggers </li></ul><ul><li>Agarwal et al. studied the influence of a blogger by modeling the blog site as a graph </li></ul>
  15. 15. Trust and Reputation <ul><li>Overwhelming amount of collective wisdom </li></ul><ul><li>Difficult for the reader to decide whom to trust </li></ul><ul><li>Assess the reputation of influential members in the community </li></ul><ul><li>Little existing work deals with trust in the blogosphere </li></ul><ul><li>Kale et al. (2007) mined sentiments about the cited blog post using a window of words around the links </li></ul><ul><li>They compute trust in a network of blog sites </li></ul><ul><li>Use comments on the blog post to judge a blogger’s trust </li></ul>
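The "window of words around the links" idea can be sketched as follows. This is a toy illustration, not the method of Kale et al.: the word lists, the window size, the regular expression, and the example URLs are all assumptions.

```python
import re

# Toy sentiment lexicon, hypothetical and far too small for real use.
POSITIVE = {"great", "insightful", "agree", "excellent"}
NEGATIVE = {"wrong", "misleading", "disagree", "terrible"}

LINK = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")

def link_sentiments(html, window=5):
    """Score each outgoing link by the sentiment words found within
    `window` words before and after the link."""
    scores = {}
    for match in LINK.finditer(html):
        before = TAG.sub(" ", html[:match.start()]).lower().split()[-window:]
        after = TAG.sub(" ", html[match.end():]).lower().split()[:window]
        words = [w.strip(".,!?;:") for w in before + after]
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        url = match.group(1)
        scores[url] = scores.get(url, 0) + score
    return scores

post = ('I think this <a href="http://a.example/post">post</a> is wrong and misleading. '
        'But an excellent rebuttal is <a href="http://b.example/r">here</a>, I agree.')
print(link_sentiments(post))
```

Signed scores like these could serve as edge weights in a network of blog sites, over which trust is then propagated; that propagation step is not shown here.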
  16. 16. Filtering Spam blogs <ul><li>Splogs = spam blogs </li></ul><ul><li>Degrade search quality and waste network resources </li></ul><ul><li>Early research used web spam detection techniques </li></ul><ul><li>Kolari et al. (2006) use content and hyperlinks to train an SVM-based classifier to classify a blog post as spam </li></ul><ul><li>Content on blog sites is dynamic, so content-based spam filters are ineffective </li></ul><ul><li>Lin et al. propose a self-similarity-based splog detection algorithm based on patterns in posting times of splogs, content similarity and similar links in splogs </li></ul>
  17. 17. Agenda <ul><li>Introduction </li></ul><ul><li>Research issues </li></ul><ul><li>Tools and Methods </li></ul><ul><li>Case Study </li></ul><ul><li>Blogosphere and Social Networks </li></ul>
  18. 18. Tools and APIs <ul><li>Tools to simulate social networks to study their properties </li></ul><ul><li>Multi-agent simulation tools </li></ul><ul><li>Analysis of social networks </li></ul><ul><li>Visualization of social networks </li></ul><ul><li>APIs: </li></ul><ul><ul><li>Facebook </li></ul></ul><ul><ul><li>StumbleUpon </li></ul></ul>
  19. 19. Methodologies <ul><li>Centrality measures </li></ul><ul><li>Content analysis </li></ul><ul><li>Link analysis </li></ul><ul><li>Decision theoretic approaches </li></ul><ul><li>Agent-based modeling </li></ul>
  20. 20. Centrality measures <ul><li>Degree centrality: The number of ties a node has </li></ul><ul><li>Closeness centrality: How short a node’s geodesic distances to all other nodes are; nodes close to everyone else score high </li></ul><ul><li>Betweenness centrality: The extent to which a node lies on the shortest paths between pairs of other nodes, especially pairs that are not directly connected </li></ul><ul><li>Eigenvector centrality: A measure of the importance of a node in which connections to important nodes count for more </li></ul>
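The four measures can be computed on a small undirected graph with the standard library alone. The sketch below uses breadth-first search for closeness, Brandes' algorithm for betweenness, and power iteration on A + I for eigenvector centrality; the example graph and function names are assumptions, and the graph is assumed to be connected and unweighted.

```python
from collections import deque

def degree_centrality(adj):
    return {u: len(adj[u]) for u in adj}

def closeness_centrality(adj):
    """(reachable nodes - 1) / sum of geodesic distances from the node."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness_centrality(adj):
    """Brandes' algorithm, unnormalized, for an undirected unweighted graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies farthest-first
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {u: b / 2 for u, b in bc.items()}  # each pair was counted from both ends

def eigenvector_centrality(adj, iterations=100):
    """Power iteration on A + I (the shift avoids oscillation on bipartite graphs)."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = max(nxt.values())
        x = {u: value / norm for u, value in nxt.items()}
    return x

# Hypothetical blog network: "hub" links to three blogs, one of which links onward.
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub", "d"], "d": ["c"]}
for measure in (degree_centrality, closeness_centrality,
                betweenness_centrality, eigenvector_centrality):
    print(measure.__name__, measure(adj))
```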
  21. 21. Content Analysis <ul><li>People create new content and enrich content with labels and tags </li></ul><ul><li>Collections of human-generated tags are also called folksonomies </li></ul><ul><li>Supervised machine learning using class labels to predict the tags of an unlabeled corpus </li></ul><ul><li>Text analysis approaches could be used for indexing blog entries </li></ul>
  22. 22. Link Analysis <ul><li>Text around links gives information about the linked blog posts </li></ul><ul><li>Leads to the identification of expert communities </li></ul><ul><li>Sparse link structure in social networks </li></ul><ul><li>Assumes implicit link information among bloggers </li></ul><ul><li>Topical analysis could be used to construct links </li></ul>
  23. 23. Methodologies (contd.) <ul><li>Decision theoretic approaches: </li></ul><ul><ul><li>To study the effect of a decision on an individual and/or a community as a whole </li></ul></ul><ul><ul><li>Given a fully informed decision maker, what is the best possible decision to make </li></ul></ul><ul><ul><li>Find the node that can make decisions with the fewest side effects and the maximum gain for other nodes </li></ul></ul><ul><li>Agent-based modeling </li></ul><ul><ul><li>An agent could be a blogger in the blogosphere </li></ul></ul><ul><ul><li>Study factors that affect a user’s blogging behavior and how (s)he makes decisions </li></ul></ul>
  24. 24. Data Collection
  25. 25. Datasets <ul><li>Nielsen Buzzmetrics dataset </li></ul><ul><ul><li>About 14M blog posts from 3M blog sites </li></ul></ul><ul><ul><li>Annotated with 1.7M blog-blog links </li></ul></ul><ul><ul><li>Up to half of the blog outlinks are missing </li></ul></ul><ul><ul><li>Only 51% of the total blog posts are in English </li></ul></ul><ul><li>Enron Email dataset </li></ul><ul><ul><li>Emails from about 150 users at Enron </li></ul></ul><ul><ul><li>0.5M messages </li></ul></ul><ul><ul><li>Social networks between users were studied based on link construction </li></ul></ul><ul><ul><li>Email senders and recipients are used to construct links </li></ul></ul>
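Link construction from email senders and recipients, as described for the Enron dataset, can be sketched as a weighted directed edge count. The message format and the addresses below are hypothetical and are not the dataset's actual schema.

```python
from collections import Counter

def build_email_graph(messages):
    """Return a Counter mapping (sender, recipient) to the number of messages sent."""
    edges = Counter()
    for msg in messages:
        for recipient in msg["recipients"]:
            if recipient != msg["sender"]:  # ignore self-addressed copies
                edges[(msg["sender"], recipient)] += 1
    return edges

# Hypothetical messages using placeholder addresses.
messages = [
    {"sender": "ann@example.com", "recipients": ["bob@example.com", "cy@example.com"]},
    {"sender": "bob@example.com", "recipients": ["ann@example.com"]},
    {"sender": "ann@example.com", "recipients": ["bob@example.com", "ann@example.com"]},
]
graph = build_email_graph(messages)
for (sender, recipient), count in sorted(graph.items()):
    print(f"{sender} -> {recipient}: {count}")
```

The edge weights give a simple measure of tie strength, and the resulting graph can be fed to standard network analysis such as centrality computation.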
  26. 26. Information needed <ul><li>Blogger identification </li></ul><ul><li>Date and time of posting </li></ul><ul><li>Number of comments </li></ul><ul><li>Outlinks </li></ul><ul><li>Inlinks </li></ul><ul><li>Blogroll links </li></ul><ul><li>Blog post length </li></ul><ul><li>Tags </li></ul>
  27. 27. Experiments and Performance Metrics <ul><li>Concepts like influence, trust, etc. in the blogosphere are socio-psychological and subjective </li></ul><ul><li>Evaluating them is non-trivial </li></ul><ul><li>Hard to compare different approaches since there is no ground truth! </li></ul><ul><li>Most existing work uses search engine rankings as the baseline </li></ul><ul><li>A Web 2.0 application, Digg, was used to evaluate influence in the blogosphere </li></ul>
  28. 28. Agenda <ul><li>Introduction </li></ul><ul><li>Research issues </li></ul><ul><li>Tools and Methods </li></ul><ul><li>Case Study </li></ul><ul><li>Blogosphere and Social Networks </li></ul>
  29. 29. Finding influential bloggers <ul><li>“A blogger can be influential if s/he has more than one influential blog post” </li></ul><ul><li>Properties that represent influential blog posts: </li></ul><ul><ul><li>Recognition – An influential blog post is recognized by many (number of inlinks) </li></ul></ul><ul><ul><li>Activity Generation – Number of comments received and amount of discussion initiated </li></ul></ul><ul><ul><li>Novelty – Number of outlinks (many outlinks suggest a less novel post) </li></ul></ul><ul><ul><li>Eloquence – Length of a post </li></ul></ul><ul><li>Data Collection </li></ul><ul><ul><li>The Unofficial Apple Weblog </li></ul></ul><ul><ul><li>Crawled 10,000 posts </li></ul></ul>
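The four properties can be combined into a per-post score in many ways. The sketch below is one simplified possibility and should not be read as the model from the case study: the formula, the default weights, the logarithmic length factor, and the choice to rank each blogger by their best post are all assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    inlinks: int   # recognition
    comments: int  # activity generation
    outlinks: int  # novelty (more outlinks, less novel)
    length: int    # eloquence, in words

def post_influence(post, w_in=1.0, w_com=1.0, w_out=1.0):
    """Hypothetical score: inlinks and comments add, outlinks subtract,
    and post length scales the result."""
    link_flow = w_in * post.inlinks - w_out * post.outlinks
    length_factor = math.log1p(post.length)
    return length_factor * (w_com * post.comments + link_flow)

def rank_bloggers(posts, **weights):
    """Rank bloggers by their single best-scoring post (an assumption)."""
    best = {}
    for post in posts:
        score = post_influence(post, **weights)
        if post.author not in best or score > best[post.author]:
            best[post.author] = score
    return sorted(best.items(), key=lambda item: -item[1])

# Hypothetical posts.
posts = [
    Post("alice", inlinks=12, comments=30, outlinks=2, length=400),
    Post("alice", inlinks=1, comments=0, outlinks=9, length=150),
    Post("bob", inlinks=3, comments=4, outlinks=1, length=900),
]
for author, score in rank_bloggers(posts):
    print(f"{author}: {score:.1f}")
```

Keeping the weights as parameters makes it easy to drop one property at a time and observe how the ranking changes.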
  30. 30. Results <ul><li>Top 5 bloggers according to TUAW and the proposed model </li></ul><ul><li>Some bloggers are both active and influential </li></ul><ul><li>Some of them are active but not influential </li></ul><ul><li>Some influential bloggers are not active </li></ul><ul><li>The remaining bloggers are inactive and non-influential </li></ul>
  31. 31. Verification <ul><li>Challenges: </li></ul><ul><ul><li>No testing and training data </li></ul></ul><ul><ul><li>Absence of ground truth </li></ul></ul><ul><li>Use another Web 2.0 site, Digg, to provide a reference point </li></ul><ul><li>A more liked post will have a higher score on Digg </li></ul><ul><li>Digg returns the top 100 voted posts </li></ul><ul><li>Take the intersection of the Digg top 100 and the top 20 posts from their model </li></ul>
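The intersection check described on this slide amounts to an overlap count between two ranked lists. A minimal sketch, with hypothetical post identifiers and a hypothetical function name:

```python
def overlap_at_k(model_ranking, reference_posts, k=20):
    """Return the model's top-k posts that also appear in the reference list,
    together with the fraction of the top-k they represent."""
    top_k = model_ranking[:k]
    reference = set(reference_posts)
    hits = [post for post in top_k if post in reference]
    return hits, (len(hits) / len(top_k) if top_k else 0.0)

# Hypothetical identifiers: the model ranks 30 posts; the reference list holds 100.
model_ranking = [f"post-{i}" for i in range(30)]
reference_top_100 = [f"post-{i}" for i in range(0, 200, 2)]  # even-numbered posts only
hits, fraction = overlap_at_k(model_ranking, reference_top_100, k=20)
print(len(hits), fraction)
```

A higher overlap suggests closer agreement with the reference site, although the reference is itself only a proxy for influence.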
  32. 32. Verification <ul><li>Importance of each parameter </li></ul><ul><li>Inlinks > comments > outlinks > blog post length in decreasing order of importance to influence estimation </li></ul>
  33. 33. Agenda <ul><li>Introduction </li></ul><ul><li>Research issues </li></ul><ul><li>Tools and Methods </li></ul><ul><li>Case Study </li></ul><ul><li>Blogosphere and Social Networks </li></ul>
  34. 34. Blogosphere and Social Networks <table><tr><th>Blogosphere</th><th>Social Networks</th></tr><tr><td>Influential nodes have “been influencing”</td><td>Influential nodes “could influence”</td></tr><tr><td>To share ideas or opinions</td><td>To stay in touch or make friends</td></tr><tr><td>Reputation is based on previous responses</td><td>Reputation is based on the number of connections</td></tr><tr><td>Person-to-group interaction</td><td>Person-to-person interaction</td></tr><tr><td>Community experience</td><td>Friendship experience</td></tr><tr><td>Loosely defined graph</td><td>Strictly defined graph</td></tr><tr><td>Nodes could be bloggers, blog posts, blog sites</td><td>Nodes are members</td></tr><tr><td>Implicit links</td><td>Predefined links</td></tr><tr><td>Directed graph</td><td>Undirected graph</td></tr></table>
  35. 35. Conclusion <ul><li>Virtual communities and a low barrier to publication are helping the growth of the blogosphere </li></ul><ul><li>A lot is yet to be done in terms of research specific to the blogosphere </li></ul><ul><li>Need accurate ground truth data </li></ul><ul><li>Experiments and an evaluation plan should be devised to allow objective analysis of different algorithms </li></ul>
  36. 36. <ul><li>Thank you! </li></ul>