Blog clustering

School of something
Computing
FACULTY OF ENGINEERING
OTHER

Blog Clustering and Community
Discovery in the Blogosphere
An Overview

Ahmad Ammari
Research Fellow (User / Community Modelling)

OUTLINE

• Significance
• Research Challenges
• Network – Based Blog Clustering Approach
• Content – Based Blog Clustering Approach
• Hybrid – Based Blog Clustering Approach
• Evaluation
• Conclusion

The Blogosphere is Huge
• 100% growth rate every 5 months, consistently for the last 4 years
• Over 120,000 new blogs created every day
• 1.4 new blogs every second
(Technorati, 2009)

Why Clustering Blogs?

• For Bloggers / Readers:
o Can focus on the clusters
they “belong to”
• Improve Recommender
Engines:
o Suggest related content to
other cluster members
o Suggest similar bloggers
to network / follow

• For Search Engines:
o Improve indexing
mechanisms
o Improve the delivery
of the search results
by organizing similar
results together
o Enhance the navigability of search results
• Meta Search Engine: Yippy / Clusty
• Retrieve results from many engines
• Cluster them into 'clouds' based on their contextual contents

• For Sociocultural / Political
Studies:
o Uncovering trending
social, cultural, & political
correlations within
blogging communities
• e.g. Harvard Arab
Blogosphere Study, 2009
o Baseline assessment of
networked public sphere in
Middle East Blogs
o Relationships to politics,
media, religion, culture,
international affairs

Research Challenges
• Existing approaches in webpage clustering & web community
discovery are explored in the blogosphere
• Applicability Challenges due to Key Differences between the
Blogosphere & the Web
Blog Posts                              | Web Pages
Short-lived References                  | Long-lived References
Monitoring Community Temporal Dynamics  | Relative Temporal Stability
Multi-Theme Contents                    | Focused Contents
Emergent Text Analysis                  | Traditional Text Analysis
Missing Citations                       | Available Citations

Blog Clusters Vs. Community Discovery
• Research Trend: Content information is more commonly used to
identify clusters of blog topics, and network information to
discover blog communities
• Proposal: Content and network information can be used separately
or combined to identify blog topic clusters and/or blog
communities

Graph – Based Clustering Approach

k-Means Clustering

• Initialise k centroids
randomly
• Assign points to
closest centroids
• Recalculate and
move centroids
• Repeat until
centroids are stable
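
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's implementation: the function name `kmeans`, the toy `points` array, the choice of data points as initial centroids, and the `max_iter` cap are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array, following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(points, k=2)
```

The result depends on the random initialisation in general, so in practice the procedure is often restarted from several seeds and the best run kept.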

Content – Based Estimation of W
• Blog graph could be extremely sparse due to the casual nature
of bloggers
• Sparsity Solution:
o Edges between blogs are derived using content similarity
• Given:

1) ε-neighbourhood
2) k Nearest Neighbor (kNN)
3) Fully Connected Graph
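
As a rough sketch of how the three graph constructions could derive the weight matrix W from content similarity, the NumPy code below builds each variant from a cosine similarity matrix. The function names, the toy term-count matrix `X`, and the use of a similarity threshold `eps` (rather than a distance threshold) are assumptions for illustration; the slide does not specify these details.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def epsilon_graph(S, eps):
    """1) Keep an edge only where similarity is at least eps (assumed threshold)."""
    W = np.where(S >= eps, S, 0.0)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(S, k):
    """2) Link each blog to its k most similar blogs, then symmetrise."""
    S_no_self = S.copy()
    np.fill_diagonal(S_no_self, -np.inf)  # a blog is never its own neighbour
    W = np.zeros_like(S)
    for i in range(S.shape[0]):
        nearest = np.argsort(S_no_self[i])[-k:]
        W[i, nearest] = S[i, nearest]
    return np.maximum(W, W.T)

def full_graph(S):
    """3) Connect every pair of blogs, weighted by similarity."""
    W = S.copy()
    np.fill_diagonal(W, 0.0)
    return W

# Toy term-count vectors for four blogs (hypothetical data).
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 1, 2, 1]], dtype=float)
S = cosine_similarity_matrix(X)
W_eps = epsilon_graph(S, eps=0.5)
W_knn = knn_graph(S, k=1)
W_full = full_graph(S)
```

The first two constructions give a sparse W that keeps only strong content links, while the fully connected graph keeps every pairwise similarity and leaves the pruning to the clustering step.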

Content – Based Clustering Approaches
• Blog Contents are used to compute Similarity
• Text - Similarity Measure
o Cosine Measure

• Spherical k-Means
o Version of k-means clustering that uses cosine similarity
instead of Euclidean distance
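
The cosine measure and spherical k-means can be illustrated with the sketch below. It is a simplified version written for this overview, assuming term-count vectors as input; the names `cosine`, `spherical_kmeans`, and the toy `docs` matrix are hypothetical. Vectors are scaled to unit length, so a dot product equals the cosine similarity and post length does not affect the result.

```python
import numpy as np

def normalise_rows(X):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spherical_kmeans(X, k, max_iter=100, seed=0):
    """k-means on the unit sphere: assign by highest cosine similarity."""
    rng = np.random.default_rng(seed)
    U = normalise_rows(np.asarray(X, dtype=float))
    centroids = U[rng.choice(len(U), size=k, replace=False)]
    labels = np.full(len(U), -1)
    for _ in range(max_iter):
        # For unit vectors, the dot product is the cosine similarity.
        new_labels = (U @ centroids.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = U[labels == j]
            if len(members) > 0:
                centroids[j] = members.sum(axis=0)
        # Project the updated centroids back onto the unit sphere.
        centroids = normalise_rows(centroids)
    return labels, centroids

# Toy term counts (hypothetical): rows 0-2 share one vocabulary, rows 3-5 another.
docs = np.array([[3, 1, 0, 0],
                 [6, 2, 0, 0],   # same direction as row 0, twice as long
                 [2, 1, 0, 0],
                 [0, 0, 2, 4],
                 [0, 0, 1, 2],
                 [0, 1, 1, 3]], dtype=float)
same_direction = cosine(docs[0], docs[1])
labels, centroids = spherical_kmeans(docs, k=2)
```

Rows 0 and 1 differ only in length, so their cosine similarity is 1 even though their Euclidean distance is not zero, which is the motivation for preferring the cosine measure on text.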

Content Pre-Processing
• Acronyms
o Urban Dictionary: http://www.urbandictionary.com/
o Edited by People
o 5,677,798 definitions since 1999

• Stop Words Removal
o Articles (a, an, the ..)
o Demonstratives (this, that, these ..)
o Quantifiers (all, few, many …)
o Conjunctions (for, and, both …)
o Prepositions (on, beneath, over …)

• Stemming
o Affix Stemmers e.g. indefinitely → definite
o Porter’s stemmer (Suffix Stripping)
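
The three pre-processing steps (acronym expansion, stop word removal, stemming) might be chained as in the sketch below. Everything here is a toy stand-in: `ACRONYMS` is a hypothetical two-entry lookup rather than an Urban Dictionary query, `STOP_WORDS` contains only the example words listed on the slide, and `strip_suffix` is a crude suffix stripper that is much simpler than Porter's stemmer.

```python
import re

# Hypothetical acronym table; a real system would look these up in a dictionary.
ACRONYMS = {"lol": "laughing out loud", "imo": "in my opinion"}

# Only the example stop words named on the slide.
STOP_WORDS = {"a", "an", "the", "this", "that", "these",
              "all", "few", "many", "for", "and", "both",
              "on", "beneath", "over"}

# Toy suffix list, checked in order; NOT Porter's algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")

def strip_suffix(token):
    """Remove the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, expand acronyms, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    expanded = []
    for tok in tokens:
        expanded.extend(ACRONYMS.get(tok, tok).split())
    kept = [t for t in expanded if t not in STOP_WORDS]
    return [strip_suffix(t) for t in kept]

result = preprocess("These bloggers posted many links on the topic lol")
```

For the sample sentence this yields the tokens blogger, post, link, topic, laugh, out, loud, which would then be passed on to the weighting step.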

Weighting

Singular Values as Blog Post Features
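
One plausible reading of this slide title is that each blog post is represented by its coordinates in a low-rank SVD space, as in Latent Semantic Analysis. The sketch below shows that reading only; the function name `svd_features`, the toy `doc_term` matrix, and the choice of rank 2 are assumptions, and the original slide may have used a different formulation.

```python
import numpy as np

def svd_features(doc_term, rank):
    """Project each post (row) onto the top `rank` singular directions."""
    # Thin SVD: doc_term = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    # Row i of U[:, :rank] * s[:rank] is the reduced feature vector of post i.
    return U[:, :rank] * s[:rank], s

# Toy post-by-term count matrix (hypothetical): posts 0-1 and posts 2-3
# use disjoint vocabularies.
doc_term = np.array([[2, 1, 0, 0, 0],
                     [1, 2, 0, 0, 0],
                     [0, 0, 1, 2, 1],
                     [0, 0, 2, 1, 1]], dtype=float)
features, singular_values = svd_features(doc_term, rank=2)

# Distances in the reduced space: same-topic posts end up closer together.
same_topic = np.linalg.norm(features[0] - features[1])
cross_topic = np.linalg.norm(features[0] - features[2])
```

The reduced vectors are dense and low-dimensional, so they can be fed to k-means or spherical k-means in place of the sparse term vectors.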

Hybrid – Based Clustering Approach
• A Blog Community can be defined as a set of nodes
in a graph that link to each other more frequently
than to nodes outside the set, and that share similar tags
(Java et al., 2008)

Evaluation
• Data Set Description

• First Data Set: citation network of academic publications
o Six categories: Agents, Artificial Intelligence (AI), Databases
(DB), Human Computer Interaction (HCI), Information
Retrieval (IR) and Machine Learning (ML)
o Binary document-term matrix (Presence / Absence of Terms)
• Second Data Set: Subgraph of the Weblogging Ecosystems (WWE)
workshop data set
o Tags fetched from del.icio.us, a well-known social
bookmarking site
o Corresponding Homepages downloaded
• Performed Clustering Performance Comparisons between
Hybrid & NCut (Network – based) Approaches

Tag Distribution in Discovered Communities

Top five tags associated with
10 communities found using
the Ncut Approach

Top five tags associated with
10 communities found using
Hybrid Clustering

Confusion Matrix Comparison
[Figure panels: NCut, Hybrid]

Average Cluster Similarity
[Figure panels: NCut, Hybrid]

Cluster Similarity Vs AVG Doc Similarity
[Figure panels: NCut, Hybrid]

Conclusion
• Both content and network information can be used to
identify blog clusters or blog communities
• Combining content information (user – defined tags,
unstructured contents, agglomerative terms / features) with
network information leads to more coherent blog clusters
and more distinct blog communities than network – based
information alone
• Matrix Factorization Techniques (LSA, SVD) reduce
Sparsity and High Dimensionality of Content – based
Clustering Information whereas Threshold – based filtration
techniques are used
• More work is needed to account for temporal dynamics in
blog clustering, in order to monitor blogging interaction
patterns and community evolution


Thank You
Ahmad Ammari
Research Fellow (User / Community Modelling)
