2. Agenda
Introduction
Research issues
Tools and Methods
Case Study
Blogosphere and Social Networks
3. Web 2.0
It is the reason behind the surge of interest in online
communities
Former consumers are now producers
Collaborative environment
User-generated content
Collective wisdom
Web 2.0 services:
Blogs, wikis, social networking sites, social tagging
Wordpress, Wikipedia, Facebook, Youtube, Twitter, Yelp
4. Social Networks
“A social network is a social structure made up of
individuals connected by one or more types of
interdependency, such as friendship, common
interest…” – Wikipedia
Web 2.0 is enabling virtual social networks
Size and connectedness vary across networks
Examples:
Friendship networks ( Facebook, Myspace )
Media sharing ( Flickr, Youtube )
5. “The site, chock full of advertising, is a moneymaking
machine – so much so that Ms. Armstrong and her husband have
both quit their regular jobs.”
The reason? The advertisers are eager to influence her 850,000
readers.
Arnold Kim, founder and senior editor of MacRumors.com.
“The site places MacRumors No. 2 on a list of the ‘25 most
valuable blogs,’ …”
What is the potential value? “Two of the other tech-oriented
blogs on its list, …, were sold earlier this year, reportedly
for sums in excess of $25 million.”
Source: The New York Times
Slide Credit: Liu & Nitin
6. Blogosphere
Blog sites
Bloggers
Blog posts
Blogroll
Permalinks
Low barrier to publication
Readers can comment instantly, which gives the blogger
a feeling of satisfaction
Individual vs community blogs
7. Blogosphere
Complex social networks
Bloggers/blog posts/blog sites become nodes
Relationships are represented by edges between
nodes
Inlinks & Outlinks
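A minimal sketch of this graph view, assuming blog posts are the nodes and hyperlinks are directed edges. The post identifiers and the `count_links` helper are hypothetical and only illustrate how inlinks and outlinks fall out of an edge list.

```python
from collections import defaultdict

def count_links(edges):
    """Return (inlinks, outlinks) counts for a list of (source, target) edges."""
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    for source, target in edges:
        outlinks[source] += 1   # the source post links out
        inlinks[target] += 1    # the target post is linked to
    return dict(inlinks), dict(outlinks)

# Hypothetical example: post_a and post_b both link to post_c.
edges = [("post_a", "post_c"), ("post_b", "post_c"), ("post_c", "post_a")]
inlinks, outlinks = count_links(edges)
print(inlinks)
print(outlinks)
```

The same helper works if the nodes are bloggers or blog sites instead of posts; only the meaning of an edge changes.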
9. Modeling the Blogosphere
Helps in generating an artificial dataset to compare
algorithms
Study patterns that could explain community
discovery, spam blogs, influence, etc.
Key differences between Web and Blogosphere
Web | Blogosphere
Web models assume dense graph structure | Blogosphere has a very sparse hyperlink structure
Not much interaction | Interaction in the form of comments and replies
Static web pages | Dynamic blog posts
Conventional web pages do not have tags | Blog posts have tags and categories
10. Modeling the Blogosphere
Web models:
Random graph
Preferential attachment graph models
Hybrid graph models
Blogosphere models:
Study temporal patterns of the blogosphere, such as how often
people create blog posts and how the posts are linked
Use blogrolls to create a network of connected posts
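A minimal sketch of one web model named above, a preferential attachment graph, assuming each new node attaches to exactly one existing node. The function name, the single-edge seed graph, and the `seed` parameter are illustrative assumptions, not taken from the cited models.

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph where each new node links to one existing node,
    chosen with probability proportional to that node's current degree.
    Assumes n_nodes >= 2."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single edge
    endpoints = [0, 1]    # every node appears once per incident edge
    for new_node in range(2, n_nodes):
        # Picking uniformly from the endpoint list is a degree-proportional choice.
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges

edges = preferential_attachment(50)
print(len(edges))
```

Graphs generated this way can serve as artificial datasets for comparing algorithms, although a blogosphere-specific model would also need posting times and comments.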
11. Blog Clustering
Automatic organization of the content
Helps readers focus on interesting categories
Keyword based:
Brooks and Montanez (2006) pick the top 3 keywords to
cluster blog posts
Li et al. (2007) assign different weights to the title, body and
comments of blog posts
Collective wisdom based:
Agarwal et al. (2008) use a category relation graph to merge
categories and cluster blogs
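A minimal sketch of the keyword-based idea, assuming raw term frequency stands in for whatever keyword weighting the cited work uses. The stopword list, the helper names, and the sample posts are all hypothetical.

```python
from collections import Counter, defaultdict
import re

# Tiny illustrative stopword list (an assumption, not from the cited work).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with"}

def top_keywords(text, k=3):
    """Return the k most frequent non-stopword tokens in text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(k)]

def cluster_by_keyword(posts, k=3):
    """Map each keyword to the ids of posts that have it among their top k."""
    clusters = defaultdict(list)
    for post_id, text in posts.items():
        for keyword in top_keywords(text, k):
            clusters[keyword].append(post_id)
    return dict(clusters)

posts = {
    "p1": "apple iphone apple iphone launch",
    "p2": "iphone review iphone camera",
    "p3": "election debate election polls",
}
clusters = cluster_by_keyword(posts)
print(clusters["iphone"])
```

Posts that share a top keyword land in the same cluster, so a post can belong to several clusters at once.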
12. Blog Mining
Valuable resources to track:
Consumers’ beliefs and opinions
Initial reaction to a launch
Trends and buzzwords
Blog conversations provide insights into how
information flows and how opinions are shaped and
influenced
Pulse uses a Naïve Bayes classifier trained on
annotated sentences to classify unlabeled data
Attardi and Simi (2006) use opinionated words
acquired from WordNet to improve blog retrieval
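A minimal sketch of a Naïve Bayes sentence classifier of the kind mentioned above, written from scratch with add-one smoothing. It does not reproduce Pulse; the training sentences, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_sentences):
    """Count words per label from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for sentence, label in labeled_sentences:
        label_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def classify(sentence, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in sentence.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical annotated sentences.
training = [
    ("great battery life", "positive"),
    ("love the screen", "positive"),
    ("terrible battery", "negative"),
    ("hate the slow screen", "negative"),
]
model = train_naive_bayes(training)
print(classify("great screen", model))
print(classify("terrible slow", model))
```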
13. Community Discovery
Content analysis and text analysis of the blog posts
to identify communities
Kleinberg et al. cluster all the expert communities
together as authorities using an authority-based
approach
Kumar et al. extend it to include co-citations to
extract all communities on the web
Some researchers studied community extraction
using newsgroups and discussion boards
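A minimal sketch of hub and authority scoring, assuming the authority-based approach above refers to HITS-style iteration. The edge list and node names are hypothetical, and the cited papers add steps that are not shown here.

```python
def hits(edges, iterations=20):
    """Return (authority, hub) scores for a directed (source, target) edge list."""
    nodes = {n for edge in edges for n in edge}
    authority = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of the nodes linking to it.
        authority = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in authority.values()) ** 0.5 or 1.0
        authority = {n: v / norm for n, v in authority.items()}
        # Hub score of a node: sum of authority scores of the nodes it links to.
        hub = {n: sum(authority[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return authority, hub

# Hypothetical link structure: h1 and h2 point at a1, h1 also points at a2.
edges = [("h1", "a1"), ("h2", "a1"), ("h1", "a2")]
authority, hub = hits(edges)
print(max(authority, key=authority.get))
print(max(hub, key=hub.get))
```

Nodes with high authority scores, together with the hubs that point to them, are candidates for one community.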
14. Influence in Blogs
Influential bloggers:
Are potential market-movers
Sway opinions in political campaigns
Troubleshoot the problems of peer consumers
Useful for “word-of-mouth” advertising of products
Finding influential blog sites is different from
identifying influential bloggers
Agarwal et al. studied the influence of a blogger by
modeling the blog site as a graph
15. Trust and Reputation
Overwhelming amount of collective wisdom
Difficult for reader to decide whom to trust
Assess the reputation of influential members in the
community
Little work deals with trust in the blogosphere
Kale et al. 2007 mined sentiments about the cited
blog post using a window of words around the links
They compute trust in a network of blog sites
Use comments on the blog post to judge a blogger’s
trust
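A minimal sketch of scoring sentiment in a window of words around a link, assuming a tiny hand-made word list. The lexicon, the `[LINK]` placeholder token, and the scoring rule are hypothetical and much simpler than the cited work.

```python
# Hypothetical sentiment lexicon.
POSITIVE = {"great", "excellent", "agree", "insightful"}
NEGATIVE = {"wrong", "terrible", "disagree", "misleading"}

def link_sentiment(tokens, link_index, window=3):
    """Score the words within `window` tokens on either side of the link."""
    start = max(0, link_index - window)
    before = tokens[start:link_index]
    after = tokens[link_index + 1:link_index + 1 + window]
    context = before + after
    positives = sum(1 for w in context if w.lower() in POSITIVE)
    negatives = sum(1 for w in context if w.lower() in NEGATIVE)
    return positives - negatives

praise = "I read this insightful post [LINK] by a colleague".split()
criticism = "this misleading post [LINK] is wrong".split()
print(link_sentiment(praise, praise.index("[LINK]")))
print(link_sentiment(criticism, criticism.index("[LINK]")))
```

A positive score could be read as the citing blog trusting the cited blog, and a negative score as distrust; spreading those scores through the network is a separate step not shown here.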
16. Filtering Spam blogs
Splogs == Spam blogs
Degrade search quality and waste network
resources
Early research used web spam detection
techniques
Kolari et al. (2006) use content and hyperlinks to train
an SVM-based classifier to classify a blog post as
spam
Content on blog sites is dynamic, so content-based
spam filters are ineffective
Lin et al. propose a self similarity based splog
detection algorithm based on patterns in posting
times of splogs, content similarity and similar links in
18. Tools and APIs
Tools to simulate social networks to study their
properties
Multi-agent simulation tools
Analysis of social networks
Visualization of social networks
APIs:
Facebook
StumbleUpon
Del.icio.us
20. Datasets
Nielsen Buzzmetrics dataset
About 14M blog posts from 3M blog sites
Annotated with 1.7M blog-blog links
Up to half of the blog outlinks are missing
Only 51% of the total blog posts are in English
Enron Email dataset
Emails from about 150 users at Enron
0.5M messages
Social networks between users were studied based on link
construction
Email senders and recipients are used to construct links
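A minimal sketch of link construction from email headers, assuming each message is a dictionary with a sender and a list of recipients. The field names and addresses are hypothetical and do not reflect the actual layout of the Enron dataset.

```python
from collections import Counter

def build_email_links(messages):
    """Count directed sender -> recipient links from message headers."""
    links = Counter()
    for message in messages:
        for recipient in message["to"]:
            if recipient != message["from"]:   # ignore self-addressed mail
                links[(message["from"], recipient)] += 1
    return links

# Hypothetical messages.
messages = [
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com", "bob@example.com"]},
]
links = build_email_links(messages)
print(links[("alice@example.com", "bob@example.com")])
```

The link counts can be used as edge weights, so frequent correspondents end up with stronger ties in the resulting social network.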
21. Experiments and Performance Metrics
Concepts like influence, trust, etc. in Blogosphere
are socio-psychological and subjective
Evaluating them is non-trivial
Hard to compare different approaches since there is
no ground truth!
Search engine rankings serve as the baseline in most
existing work
A Web 2.0 application, Digg, was used to evaluate
influence in the blogosphere
23. Finding influential bloggers
“A blogger can be influential if s/he has more than
one influential blog post”
Properties that represent influential blog posts:
Recognition – An influential blog post is recognized by
many
Activity Generation – Number of comments received and
amount of discussion initiated
Novelty – Number of outlinks (fewer outlinks suggest a more novel post)
Eloquence – Length of a post
Data Collection
The Unofficial Apple Weblog
Crawled 10,000 posts
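A minimal sketch of combining the four properties above into one score per post. The formula and every weight are illustrative assumptions, not the authors' model; outlinks are subtracted on the assumption that many outlinks indicate lower novelty.

```python
def post_influence(post, w_in=1.0, w_com=1.0, w_out=1.0, w_len=0.001):
    """Illustrative influence score for one post (all weights are assumptions)."""
    # Recognition adds to the score; outlinks are treated here as a sign
    # of lower novelty and subtract from it.
    flow = w_in * post["inlinks"] - w_out * post["outlinks"]
    # Activity generation: comments add to the score.
    base = flow + w_com * post["comments"]
    # Eloquence: longer posts scale the score.
    return (1.0 + w_len * post["length"]) * base

# Hypothetical posts.
posts = [
    {"id": "p1", "inlinks": 10, "comments": 5, "outlinks": 2, "length": 1000},
    {"id": "p2", "inlinks": 1, "comments": 0, "outlinks": 4, "length": 500},
]
ranked = sorted(posts, key=post_influence, reverse=True)
print([p["id"] for p in ranked])
```

A blogger-level ranking could then be derived from each blogger's posts, for example by counting how many of them score above a threshold.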
24. Results
Top 5 bloggers according to TUAW and proposed
model
Some bloggers are both active and influential
Some of them are active but not influential
Some influential bloggers are not active
Some bloggers are neither active nor influential
25. Verification
Challenges:
No testing and training data
Absence of ground truth
Use another Web 2.0 site, Digg, to provide a reference
point
A more liked post will have a higher score on Digg
Digg returns top 100 voted posts
Intersection of Digg 100 and top 20 from their model
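A minimal sketch of the overlap check described above, assuming both rankings are lists of post identifiers ordered from best to worst. The identifiers and the `overlap` helper are hypothetical.

```python
def overlap(model_ranking, digg_ranking, model_k=20, digg_k=100):
    """Return posts that are in both the model's top model_k and Digg's top digg_k."""
    digg_top = set(digg_ranking[:digg_k])
    return [post for post in model_ranking[:model_k] if post in digg_top]

# Hypothetical rankings.
model_ranking = [f"post{i}" for i in range(30)]
digg_ranking = [f"post{i}" for i in range(10, 150)]
shared = overlap(model_ranking, digg_ranking)
print(len(shared))
```

A larger overlap suggests the model agrees more closely with the reference point that Digg provides.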
26. Verification
Importance of each parameter
Inlinks > comments > outlinks > blog post length in
decreasing order of importance to influence
estimation
28. Blogosphere and Social Networks
Blogosphere | Social Networks
Influential nodes have “been influencing” | Influential nodes “could influence”
To share ideas or opinions | To stay in touch or make friends
Reputation is based on previous responses | Reputation is based on the number of connections
Person-to-group interaction | Person-to-person interaction
Community experience | Friendship experience
Loosely defined graph | Strictly defined graph
Nodes could be bloggers, blog posts, blog sites | Nodes are members
Implicit links | Predefined links
Directed graph | Undirected graph
29. Conclusion
Virtual communities and low barrier to publication are
helping the growth of blogosphere
Much research specific to the blogosphere remains to be
done
Need accurate ground truth data
Experiments and evaluation plan should be devised
to have objective analysis of different algorithms
http://www.nytimes.com/2008/08/14/technology/14women.html?pagewanted=all
http://www.nytimes.com/2008/07/21/technology/21blogger.html?_r=1&oref=slogin
The two examples show the new trend of advertising and the value of good blogs.
Reported improved clustering as compared to that using tags
Mining sentiments from free text forms poses several challenges
Moreover, spammers can copy the content from some regular blog posts to evade content-based spam filters.
Link-based spam filters can easily be beaten by creating legitimate links.
Various social networking sites now provide APIs. These give developers limited access to data. APIs are also used to write numerous applications that extend the functionalities of these sites and create mashups.
In experiments we observe that the number of outlinks is negatively correlated with the number of comments received on a blog post, which suggests that more outlinks reduce people's interest/attention.
In experiments we also observe that blog post length is positively correlated with the number of comments received on a blog post, which suggests that longer blog posts attract people's interest/attention.
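A minimal sketch of how such correlations can be measured, assuming Pearson's correlation coefficient. The numbers below are made-up toy values chosen to be perfectly linear; they are not the experimental data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values only, chosen so the two relationships are easy to see.
outlinks = [0, 1, 2, 3, 4]
comments = [10, 8, 6, 4, 2]
lengths = [100, 200, 300, 400, 500]

r_outlinks = pearson(outlinks, comments)   # negative: more outlinks, fewer comments
r_length = pearson(lengths, comments[::-1])  # positive: longer posts, more comments
print(round(r_outlinks, 3), round(r_length, 3))
```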