Blog clustering


A Presentation on Data Mining (Clustering) of the Blogosphere

Published in: News & Politics, Technology

  1. School of Computing, Faculty of Engineering
     Blog Clustering and Community Discovery in the Blogosphere: An Overview
     Ahmad Ammari, Research Fellow (User / Community Modelling)
  2. OUTLINE
     • Significance
     • Research Challenges
     • Network-Based Blog Clustering Approach
     • Content-Based Blog Clustering Approach
     • Hybrid Blog Clustering Approach
     • Evaluation
     • Conclusion
  3. The Blogosphere is Huge
     • 100% growth rate every 5 months, consistently for the last 4 years
     • Over 120,000 new blogs created every day
     • 1.4 new blogs every second
     (Technorati, 2009)
  4. Why Cluster Blogs?
     • For Bloggers / Readers:
       o Can focus on the clusters they "belong to"
     • Improve Recommender Engines:
       o Suggest related content to other cluster members
       o Suggest similar bloggers to network with / follow
  5. Why Cluster Blogs?
     • For Search Engines:
       o Improve indexing mechanisms
       o Improve the delivery of search results by organizing similar results together
       o Enhance the navigability of search results
     • Meta Search Engine: Yippy / Clusty
       o Retrieve results from many engines
       o Cluster them into clouds based on their contextual contents
  6. Why Cluster Blogs?
     • For Sociocultural / Political Studies:
       o Uncovering trending social, cultural, and political correlations within blogging communities
     • e.g. Harvard Arab Blogosphere Study, 2009
       o Baseline assessment of the networked public sphere in Middle East blogs
       o Relationships to politics, media, religion, culture, and international affairs
  7. Research Challenges
     • Existing approaches in webpage clustering and web community discovery are explored in the blogosphere
     • Applicability challenges due to key differences between the Blogosphere and the Web:
       Blog Posts | Web Pages
       Short-lived references | Long-lived references
       Monitoring community temporal dynamics | Relative temporal stability
       Multi-theme contents | Focused contents
       Emergent text analysis | Traditional text analysis
       Missing citations | Available citations
  8. Blog Clusters vs. Community Discovery
     • Research Trend: Researchers more commonly use content information to identify clusters of blog topics, and network information to discover blog communities
     • Proposal: Content and network information can both be used, separately or combined, to identify blog topic clusters and/or blog communities
  9. Graph-Based Clustering Approach
  10. Spectral Clustering - Example
  11. Spectral Clustering - Example
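
The figures for the two spectral clustering example slides are not part of this transcript. As a stand-in, here is a minimal sketch of one common form of spectral bipartitioning (the relaxation of the normalized cut), assuming a symmetric weight matrix `W` in which every node has at least one edge. The toy graph and the function name `spectral_bipartition` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a graph in two using the second eigenvector of the normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    # D^{-1/2}; assumes every node has positive degree.
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Map the second-smallest eigenvector back to the generalized problem.
    indicator = d_inv_sqrt * eigenvectors[:, 1]
    # The sign of the indicator decides the side of the cut.
    return (indicator > 0).astype(int)

# Toy graph (assumed data): two triangles joined by one weak edge.
W = np.array([
    [0, 1, 1, 0.0, 0, 0],
    [1, 0, 1, 0.0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0.0, 1, 0, 1],
    [0, 0, 0.0, 1, 1, 0],
])
labels = spectral_bipartition(W)
```

The two triangles end up on opposite sides of the cut. Which side is labelled 0 or 1 is arbitrary, because the sign of an eigenvector is not fixed. For more than two clusters, a common choice is to run k-means on the first k eigenvectors.
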
  12. k-Means Clustering
      • Assign k centroids randomly
      • Assign points to closest centroids
      • Recalculate and move centroids
      • Repeat until centroids are stable
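
The four k-means steps can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and initial centroids drawn from the data points; the names `k_means`, `max_iter`, and `seed`, and the toy data, are assumptions for the example.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct data points at random as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids are stable.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed): two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = k_means(points, k=2)
```

Because the starting centroids are random, the result depends on the seed, and an unlucky start can converge to a poor local optimum. Running several restarts and keeping the best result is a common mitigation.
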
  13. Content-Based Estimation of W
      • The blog graph can be extremely sparse due to the casual nature of bloggers
      • Sparsity Solution:
        o Edges between blogs are derived using content similarity
      • Given: (formula not preserved in the transcript)
      1) ε-neighbourhood
      2) k Nearest Neighbour (kNN)
      3) Fully Connected Graph
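
The three graph constructions can be sketched as below, assuming a precomputed symmetric similarity matrix `S` between blogs. The ε-neighbourhood is often defined on distances; here it is applied as a similarity threshold, which is an assumption of this sketch, as are the function names and the toy matrix.

```python
import numpy as np

def epsilon_graph(S, eps):
    """1) Keep an unweighted edge wherever similarity is at least eps."""
    W = (np.asarray(S, dtype=float) >= eps).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(S, k):
    """2) Link each node to its k most similar nodes, then symmetrize."""
    S = np.asarray(S, dtype=float)
    ranked = S.copy()
    np.fill_diagonal(ranked, -np.inf)  # a node is never its own neighbour
    W = np.zeros_like(S)
    for i in range(len(S)):
        neighbours = np.argsort(ranked[i])[-k:]
        W[i, neighbours] = S[i, neighbours]
    # Keep an edge if either endpoint selected the other.
    return np.maximum(W, W.T)

def full_graph(S):
    """3) Connect every pair, weighted by similarity."""
    W = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(W, 0.0)
    return W

# Toy similarity matrix (assumed data) for four blogs.
S = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])
W_eps = epsilon_graph(S, eps=0.5)
W_knn = knn_graph(S, k=1)
W_full = full_graph(S)
```

The threshold and kNN variants give sparse matrices that keep only strong relationships, while the fully connected graph keeps every pairwise weight and leaves it to the clustering step to ignore weak ones.
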
  14. Content-Based Clustering Approaches
      • Blog contents are used to compute similarity
      • Text Similarity Measure
        o Cosine measure
      • Spherical k-Means
        o Version of k-means clustering that uses cosine similarity instead of Euclidean distance
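
A minimal sketch of the cosine measure and of spherical k-means follows. It assumes non-negative, non-zero term vectors (one row per blog); the function names, the `seed` parameter, and the toy matrix are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spherical_k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every document onto the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        new_labels = (X @ centroids.T).argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Re-normalize the summed members so centroids stay on the sphere.
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    return labels, centroids

# Toy term-count matrix (assumed data): rows are blogs, columns are terms.
X = np.array([[3, 1, 0], [2, 1, 0], [0, 1, 4], [0, 2, 5]])
labels, centroids = spherical_k_means(X, k=2)
```

Normalizing to unit length means long and short posts with the same term proportions are treated alike, which is the usual motivation for preferring cosine similarity on text.
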
  15. Content Pre-Processing
      • Acronyms
        o Urban Dictionary: edited by people; 5,677,798 definitions since 1999
      • Stop Words Removal
        o Articles (a, an, the ...)
        o Demonstratives (this, that, these ...)
        o Conjunctions (for, and, both ...)
        o Quantifiers (all, few, many ...)
        o Prepositions (on, beneath, over ...)
      • Stemming
        o Affix stemmers, e.g. indefinitely → definite
        o Porter's stemmer (suffix stripping)
      • Weighting
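
The pre-processing stages can be sketched as a small pipeline. Everything here is a simplified stand-in: the acronym table is a hypothetical two-entry sample (not Urban Dictionary data), the stop-word set covers only the examples listed on the slide, and the suffix stripper is a toy rule list, not Porter's algorithm.

```python
import re

# Hypothetical sample entries, for illustration only.
ACRONYMS = {"lol": "laughing out loud", "imo": "in my opinion"}
# Only the stop words named on the slide; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "this", "that", "these", "for", "and",
              "both", "all", "few", "many", "on", "beneath", "over"}
# Toy suffix list, checked in order; NOT Porter's rule set.
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def expand_acronyms(tokens):
    expanded = []
    for token in tokens:
        expanded.extend(ACRONYMS.get(token, token).split())
    return expanded

def strip_suffix(token):
    for suffix in SUFFIXES:
        # Keep at least three characters of stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = expand_acronyms(tokens)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]

tokens = preprocess("LOL the bloggers keep clustering these posts")
# tokens == ['laugh', 'out', 'loud', 'blogger', 'keep', 'cluster', 'post']
```

Acronyms are expanded before stop-word removal so that any stop words inside an expansion are filtered as well.
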
  16. Vector Space Model
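
In a vector space model, each blog post becomes a vector with one component per vocabulary term. The transcript does not say which weighting scheme the slide used; the sketch below assumes plain TF-IDF (raw term count times the log of inverse document frequency), and the function name and toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, vectors)."""
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            counts[term] * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Toy corpus (assumed data), already tokenized.
docs = [["blog", "cluster", "blog"], ["blog", "graph"], ["graph", "cut"]]
vocabulary, vectors = tf_idf_vectors(docs)
# vocabulary == ['blog', 'cluster', 'cut', 'graph']
```

Terms that occur in every document get weight zero under this scheme, so they do not influence similarity between posts.
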
  17. Singular Values as Blog Post Features
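
One way to derive low-dimensional post features from a document-term matrix is a truncated singular value decomposition, as in latent semantic analysis. This is a sketch under that assumption; the function name `lsa_features`, the `rank` parameter, and the toy matrix are not from the slides.

```python
import numpy as np

def lsa_features(doc_term, rank):
    """Project documents onto the top `rank` singular directions."""
    doc_term = np.asarray(doc_term, dtype=float)
    # Singular values come back sorted in descending order.
    U, singular_values, _ = np.linalg.svd(doc_term, full_matrices=False)
    # Document coordinates in the latent space: U_k scaled by the singular values.
    features = U[:, :rank] * singular_values[:rank]
    return features, singular_values

# Toy document-term matrix (assumed data): posts 0-1 share terms, posts 2-3 share terms.
doc_term = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
])
features, singular_values = lsa_features(doc_term, rank=2)
```

With `rank=2`, each post is described by two numbers instead of four term counts, and posts that use the same terms land close together in the reduced space.
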
  18. Hybrid Clustering Approach
      • A blog community can be defined as a set of nodes in a graph that link more frequently within this set than outside it and the set shares similar tags (Java et al., 2008)
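
The definition combines a link criterion and a tag criterion. The sketch below checks both for a candidate set of blogs. It is only an illustration of the definition, not the clustering method of Java et al.; the Jaccard measure, the `min_tag_similarity` threshold, and all names and data are assumptions.

```python
from itertools import combinations

def link_counts(edges, members):
    """Count links inside the set and links crossing its boundary."""
    members = set(members)
    internal = sum(1 for u, v in edges if u in members and v in members)
    external = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, external

def mean_tag_similarity(tags, members):
    """Average pairwise Jaccard similarity of the members' tag sets."""
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 1.0

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return sum(jaccard(tags[u], tags[v]) for u, v in pairs) / len(pairs)

def looks_like_community(edges, tags, members, min_tag_similarity=0.3):
    # min_tag_similarity is a hypothetical threshold chosen for this example.
    internal, external = link_counts(edges, members)
    return internal > external and mean_tag_similarity(tags, members) >= min_tag_similarity

# Toy blog graph and tags (assumed data).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
tags = {
    "a": {"python", "ml"},
    "b": {"python", "ml", "data"},
    "c": {"ml", "data"},
    "d": {"cooking"},
    "e": {"cooking", "travel"},
}
result = looks_like_community(edges, tags, {"a", "b", "c"})
```

Blogs a, b, and c have three internal links against one external link and overlapping tags, so they satisfy both criteria; the pair c and d fails the link criterion.
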
  19. Evaluation
      • Data Set Description
      • First Data Set: citation network of academic publications
        o Six categories: Agents, Artificial Intelligence (AI), Databases (DB), Human Computer Interaction (HCI), Information Retrieval (IR), and Machine Learning (ML)
        o Binary document-term matrix (presence / absence of terms)
      • Second Data Set: subgraph of Weblogging Ecosystems (WWE) workshop
        o Tags fetched from a well-known social bookmarking site
        o Corresponding homepages downloaded
      • Clustering performance was compared between the Hybrid and NCut (network-based) approaches
  20. Tag Distribution in Discovered Communities
      • Top five tags associated with 10 communities found using the NCut approach
      • Top five tags associated with 10 communities found using Hybrid clustering
  21. Confusion Matrix Comparison (NCut vs. Hybrid)
      Average Cluster Similarity (NCut vs. Hybrid)
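
The confusion matrices themselves are figures that did not survive in the transcript. As an illustration of how such a matrix is built, the sketch below counts how many documents of each known category fall into each discovered cluster. The category names come from the evaluation slide; the label assignments and the function name are made-up toy data.

```python
from collections import Counter

def confusion_matrix(true_labels, cluster_labels):
    """Rows are known categories, columns are discovered clusters."""
    categories = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    counts = Counter(zip(true_labels, cluster_labels))
    matrix = [[counts[(cat, clu)] for clu in clusters] for cat in categories]
    return categories, clusters, matrix

# Toy labels (assumed data): six documents, three categories, three clusters.
true_labels = ["AI", "AI", "DB", "DB", "IR", "IR"]
cluster_labels = [0, 0, 1, 1, 1, 2]
categories, clusters, matrix = confusion_matrix(true_labels, cluster_labels)
# matrix == [[2, 0, 0], [0, 2, 0], [0, 1, 1]]
```

A clustering that matches the categories well concentrates each row's count in a single column; here one IR document has been placed in the DB cluster.
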
  22. Cluster Similarity vs. Average Document Similarity (NCut vs. Hybrid)
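
The transcript does not define how cluster similarity was computed. One plausible reading, assumed here, is the mean pairwise cosine similarity of documents within a cluster, compared against the same average over all documents. The function names and the toy binary document-term matrix are illustrative.

```python
import numpy as np

def average_pairwise_cosine(X):
    """Mean cosine similarity over all distinct pairs of rows."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 2:
        return 1.0
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = unit @ unit.T
    # Exclude the diagonal (each row's similarity with itself).
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

def cluster_similarities(X, labels):
    """Average within-cluster similarity for each cluster label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    return {int(c): average_pairwise_cosine(X[labels == c]) for c in np.unique(labels)}

# Toy binary document-term matrix (assumed data) and cluster labels.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
labels = [0, 0, 1, 1]
baseline = average_pairwise_cosine(X)
per_cluster = cluster_similarities(X, labels)
```

A coherent cluster should score well above the corpus-wide baseline; in this toy case both clusters score 1.0 against a baseline of one third.
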
  23. Conclusion
      • Both content and network information can be used to identify blog clusters or blog communities
      • Combining content information (user-defined tags, unstructured contents, agglomerative terms / features) with network information leads to more coherent blog clusters and more distinct blog communities than network-based information alone
      • Matrix factorization techniques (LSA, SVD) reduce the sparsity and high dimensionality of content-based clustering information; threshold-based filtration techniques are also used
      • More work is needed to account for temporal dynamics in blog clustering, in order to monitor blogging interaction patterns and community evolution
  24. Thank You
      Ahmad Ammari, Research Fellow (User / Community Modelling)