blogosphere, Information retrieval,

Published in: Technology
  • Two examples show the new trend of advertising and the value of good blogs
  • Reported improved clustering as compared to that using tags
  • Mining sentiments from free text forms poses several challenges
  • Moreover, spammers can copy the content from regular blog posts to evade content-based spam filters. Link-based spam filters can easily be beaten by creating legitimate links.
  • Various social networking sites provide APIs nowadays. This gives developers limited access to data. APIs are also used to write numerous applications that extend the functionalities of these sites and create mashups.
  • In experiments, the number of outlinks is negatively correlated with the number of comments received on a blog post, which suggests that more outlinks reduce people's interest or attention. Blog post length is positively correlated with the number of comments received, which suggests that longer blog posts attract more interest or attention.
  • Blogosphere

    1. Blogosphere: Research Issues, Tools, and Applications
        • Nitin Agarwal and Huan Liu
        • Sunil Bandla, INF384H, Fall 2011
    2. Agenda: Introduction, Research issues, Tools and Methods, Case Study, Blogosphere and Social Networks
    3. Web 2.0
        • It is the reason behind the surge of interest in online communities
        • Former consumers are now producers
        • Collaborative environment
        • User-generated content
        • Collective wisdom
        • Web 2.0 services:
            • Blogs, wikis, social networking sites, social tagging
            • WordPress, Wikipedia, Facebook, YouTube, Twitter, Yelp
    4. Social Networks
        • “A social network is a social structure made up of individuals connected by one or more types of interdependency, such as friendship, common interest…” – Wikipedia
        • Web 2.0 is enabling virtual social networks
        • Size and connectedness vary across networks
        • Examples:
            • Friendship networks (Facebook, Myspace)
            • Media sharing (Flickr, YouTube)
    5. Two examples
        • “The site, chock full of […], is a moneymaking machine – so much so that Ms. Armstrong and her husband have both quit their regular jobs.” The reason? The advertisers are eager to influence her 850,000 readers.
        • Arnold Kim, founder and senior editor of […]: “The site places MacRumors No. 2 on a list of the ‘25 most valuable blogs,’ …” What is the potential value? “Two of the other tech-oriented blogs on its list, …, were sold earlier this year, reportedly for sums in excess of $25 million.”
        • Source: The New York Times
        • Slide credit: Liu & Nitin
    6. 6. Blogosphere Blog sites Bloggers Blog posts Blogroll Permalinks Low barrier to publication Readers can comment instantly which gives blogger a feeling of satisfaction Individual vs community blogs
    7. 7. Blogosphere Complex social networks Bloggers/blog posts/blog sites become nodes Relationships are represented by edges between nodes Inlinks & Outlinks
    8. Agenda: Introduction, Research issues, Tools and Methods, Case Study, Blogosphere and Social Networks
    9. Modeling the Blogosphere
        • Helps in generating an artificial dataset to compare algorithms
        • Study patterns that could explain community discovery, spam blogs, influence, etc.
        • Key differences between the Web and the Blogosphere:

          | Web | Blogosphere |
          | --- | --- |
          | Web models assume a dense graph structure | Blogosphere has a very sparse hyperlink structure |
          | Not much interaction | Interaction in the form of comments and replies |
          | Static web pages | Dynamic blog posts |
          | Conventional web pages do not have tags | Blog posts have tags and categories |
    10. Modeling the Blogosphere
        • Web models:
            • Random graph
            • Preferential attachment graph models
            • Hybrid graph models
        • Blogosphere models:
            • To study temporal patterns of the blogosphere, such as how often people create blog posts and how they are linked
            • Blogrolls to create a network of connected posts
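As a rough illustration of the preferential attachment idea named above, the sketch below grows an undirected graph in which each new node links to `m` existing nodes chosen with probability proportional to their current degree. The function name, the seed clique, and the parameters are assumptions for this example and are not taken from the slides.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow an undirected graph to n nodes.

    Starts from a clique of m + 1 nodes; every later node attaches to m
    distinct existing nodes, picked with probability proportional to degree.
    """
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    endpoints = []  # each node appears once per incident edge (degree-weighted pool)
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            endpoints.extend((u, v))
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # high-degree nodes are drawn more often
        for t in sorted(targets):
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

graph_edges = preferential_attachment(n=50, m=2)
```

Graphs grown this way tend to have a few heavily linked hubs, which is one reason such models are used to generate artificial link data for comparing algorithms.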
    11. Blog Clustering
        • Automatic organization of the content
        • Helps readers focus on interesting categories
        • Keyword based:
            • Brooks and Montanez 2006 pick the top 3 keywords to cluster blog posts
            • Li et al. 2007 assign different weights to the title, body, and comments of blog posts
        • Collective wisdom based:
            • Agarwal et al. 2008 use a category relation graph to merge categories and cluster blogs
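A minimal sketch of the keyword-based idea, under stated assumptions: keywords are ranked by TF-IDF (the slide does not say how they are ranked), each post is reduced to its top 3 keywords, and two posts are compared by the Jaccard overlap of those keyword sets. The function names and sample posts are hypothetical, and this is not the cited authors' implementation.

```python
import math
from collections import Counter

def top_keywords(docs, k=3):
    """Return, for each document, the set of its k highest TF-IDF terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    keyword_sets = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Sort by score descending; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_sets.append(set(ranked[:k]))
    return keyword_sets

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

posts = [
    "apple iphone launch apple event",
    "iphone launch event review apple",
    "election vote campaign poll",
]
keywords = top_keywords(posts)
```

A clustering step could then group posts whose keyword sets overlap strongly; the choice of threshold and clustering algorithm is left open here.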
    12. Blog Mining
        • Valuable resources to track:
            • Consumers’ beliefs and opinions
            • Initial reaction to a launch
            • Trends and buzzwords
        • Blog conversations provide insights into how information flows and how opinions are shaped and influenced
        • Pulse uses a Naïve Bayes classifier trained on annotated sentences to classify unlabeled data
        • Attardi and Simi 2006 use opinionated words acquired from WordNet to improve blog retrieval
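To make the Naïve Bayes step concrete, here is a small multinomial Naïve Bayes sketch with add-one smoothing, trained on a handful of made-up labeled sentences. It is not Pulse's implementation; the training data, labels, and function names are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def train(labeled_sentences):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_sentences:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(label_counts):
        logp = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[label][tok] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical annotated sentences.
model = train([
    ("great phone love it", "pos"),
    ("love the battery", "pos"),
    ("terrible screen hate it", "neg"),
    ("hate the slow battery", "neg"),
])
```

With this toy model, `classify("love this phone", model)` favors the positive label and `classify("hate the screen", model)` the negative one.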
    13. Community Discovery
        • Content analysis and text analysis of blog posts to identify communities
        • Kleinberg et al. cluster all the expert communities together as authorities using an authority-based approach
        • Kumar et al. extend it to include co-citations to extract all communities on the web
        • Some researchers studied community extraction using newsgroups and discussion boards
    14. Influence in Blogs
        • Influential bloggers:
            • Are potential market-movers
            • Sway opinions in political campaigns
            • Troubleshoot the problems of peer consumers
            • Are useful for “word-of-mouth” advertising of products
        • Finding influential blog sites is different from identifying influential bloggers
        • Agarwal et al. studied the influence of a blogger by modeling the blog site as a graph
    15. Trust and Reputation
        • Overwhelming amount of collective wisdom
        • Difficult for a reader to decide whom to trust
        • Assess the reputation of influential members in the community
        • Not much work deals with trust in the Blogosphere
        • Kale et al. 2007 mined sentiments about the cited blog post using a window of words around the links
        • They compute trust in a network of blog sites
        • Use comments on the blog post to judge a blogger’s trust
    16. Filtering Spam Blogs
        • Splogs == spam blogs
        • Degrade search quality and waste network resources
        • Initial researchers used web spam detection techniques
        • Kolari et al. 2006 use content and hyperlinks to train an SVM-based classifier to classify a blog post as spam
        • Content on blog sites is dynamic, so content-based spam filters are ineffective
        • Lin et al. propose a self-similarity-based splog detection algorithm based on patterns in posting times of splogs, content similarity and similar links in
    17. Agenda: Introduction, Research issues, Tools and Methods, Case Study, Blogosphere and Social Networks
    18. Tools and APIs
        • Tools to simulate social networks to study their properties
        • Multi-agent simulation tools
        • Analysis of social networks
        • Visualization of social networks
        • APIs:
            • Facebook
            • StumbleUpon
    19. Methodologies
        • Centrality measures
        • Content analysis
        • Link analysis
        • Decision theoretic approaches
        • Agent-based modeling
    20. Datasets
        • Nielsen Buzzmetrics dataset
            • About 14M blog posts from 3M blog sites
            • Annotated with 1.7M blog-blog links
            • Up to half of the blog outlinks are missing
            • Only 51% of the total blog posts are in English
        • Enron Email dataset
            • Emails from about 150 users at Enron
            • 0.5M messages
            • Social networks between users were studied based on link construction
            • Email senders and recipients are used to construct links
    21. Experiments and Performance Metrics
        • Concepts like influence and trust in the Blogosphere are socio-psychological and subjective
        • Evaluating them is non-trivial
        • Hard to compare different approaches since there is no ground truth
        • Search engines’ ranking serves as the baseline for most of the existing works
        • A Web 2.0 application, Digg, was used to evaluate influence in the blogosphere
    22. Agenda: Introduction, Research issues, Tools and Methods, Case Study, Blogosphere and Social Networks
    23. Finding Influential Bloggers
        • “A blogger can be influential if s/he has more than one influential blog post”
        • Properties that represent influential blog posts:
            • Recognition: an influential blog post is recognized by many
            • Activity generation: number of comments received and amount of discussion initiated
            • Novelty: number of outlinks
            • Eloquence: length of a post
        • Data collection:
            • The Unofficial Apple Weblog
            • Crawled 10,000 posts
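The four properties above could be combined into a single per-post score in many ways. The sketch below uses a plain weighted sum and flags a blogger as influential when more than one of their posts clears a threshold, following the quoted statement. The weights, the threshold, the negative sign on outlinks, and all names are assumptions for illustration; the slides do not give the model's actual formula.

```python
from dataclasses import dataclass

@dataclass
class Post:
    author: str
    inlinks: int    # recognition
    comments: int   # activity generation
    outlinks: int   # novelty (assumed: more outlinks means less novel)
    length: int     # eloquence, in words

# Hypothetical weights, not taken from the source.
WEIGHTS = {"inlinks": 1.0, "comments": 0.5, "outlinks": 0.25, "length": 0.001}

def post_score(post, w=WEIGHTS):
    """Weighted sum of the four properties; outlinks count against the post."""
    return (w["inlinks"] * post.inlinks
            + w["comments"] * post.comments
            - w["outlinks"] * post.outlinks
            + w["length"] * post.length)

def influential_bloggers(posts, threshold):
    """Authors with more than one post scoring at or above the threshold."""
    counts = {}
    for post in posts:
        if post_score(post) >= threshold:
            counts[post.author] = counts.get(post.author, 0) + 1
    return sorted(author for author, c in counts.items() if c > 1)

sample_posts = [
    Post("ann", inlinks=10, comments=4, outlinks=2, length=1000),
    Post("ann", inlinks=8, comments=6, outlinks=0, length=500),
    Post("bob", inlinks=12, comments=0, outlinks=0, length=0),
    Post("bob", inlinks=1, comments=1, outlinks=1, length=100),
]
```

With a threshold of 10, only `ann` has two qualifying posts, so `influential_bloggers(sample_posts, 10)` returns `["ann"]` even though `bob` wrote the single highest-inlink post.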
    24. Results
        • Top 5 bloggers according to TUAW and the proposed model
        • Some bloggers are both active and influential
        • Some of them are active but not influential
        • Some influential bloggers are not active
        • Inactive and non-influential bloggers
    25. Verification
        • Challenges:
            • No testing and training data
            • Absence of ground truth
        • Use another Web 2.0 site, Digg, to provide a reference point
        • A more liked post will have a higher score on Digg
        • Digg returns the top 100 voted posts
        • Intersection of the Digg top 100 and the top 20 from their model
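The intersection check described on this slide amounts to a set overlap between a reference list and the head of a model's ranking. A minimal sketch, with hypothetical post IDs and function name:

```python
def overlap(reference_ids, model_ranking, k=20):
    """Return the posts shared by the reference list and the model's top k,
    plus the fraction of the top k that is shared."""
    reference = set(reference_ids)
    top_k = model_ranking[:k]
    shared = [post_id for post_id in top_k if post_id in reference]
    fraction = len(shared) / len(top_k) if top_k else 0.0
    return shared, fraction

# Hypothetical IDs: a reference list (e.g. Digg's top voted posts) and a model ranking.
reference_list = ["p1", "p2", "p3"]
model_ranking = ["p3", "p9", "p1", "p7"]
shared, fraction = overlap(reference_list, model_ranking, k=2)
```

A larger shared fraction suggests closer agreement with the reference point, though the reference itself is a proxy rather than ground truth.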
    26. Verification
        • Importance of each parameter
        • Inlinks > comments > outlinks > blog post length, in decreasing order of importance to influence estimation
    27. Agenda: Introduction, Research issues, Tools and Methods, Case Study, Blogosphere and Social Networks
    28. Blogosphere and Social Networks

          | Blogosphere | Social Networks |
          | --- | --- |
          | Influential nodes have “been influencing” | Influential nodes “could influence” |
          | To share ideas or opinions | To stay in touch or make friends |
          | Reputation is based on previous responses | Reputation is based on the number of connections |
          | Person-to-group interaction | Person-to-person interaction |
          | Community experience | Friendship experience |
          | Loosely defined graph | Strictly defined graph |
          | Nodes could be bloggers, blog posts, blog sites | Nodes are members |
          | Implicit links | Predefined links |
          | Directed graph | Undirected graph |
    29. Conclusion
        • Virtual communities and a low barrier to publication are helping the growth of the blogosphere
        • A lot is yet to be done in terms of research specific to the blogosphere
        • Need accurate ground truth data
        • Experiments and an evaluation plan should be devised for objective analysis of different algorithms
    30. Thank you!
    31. References
        • 07/V10N1-Blogosphere.pdf