Blog clustering

A Presentation on Data Mining (Clustering) of the Blogosphere

Published in: News & Politics, Technology

Transcript

  • 1. Blog Clustering and Community Discovery in the Blogosphere: An Overview
      Ahmad Ammari, Research Fellow (User / Community Modelling)
      School of Computing, Faculty of Engineering
  • 2. Outline
      • Significance
      • Research Challenges
      • Network-Based Blog Clustering Approach
      • Content-Based Blog Clustering Approach
      • Hybrid Blog Clustering Approach
      • Evaluation
      • Conclusion
  • 3. The Blogosphere is Huge
      • 100% growth rate every 5 months, consistently for the last 4 years
      • Over 120,000 new blogs created every day
      • 1.4 new blogs every second
      (Technorati, 2009)
  • 4. Why Cluster Blogs?
      • For Bloggers / Readers:
          o Can focus on the clusters they “belong to”
      • Improve Recommender Engines:
          o Suggest related content to other cluster members
          o Suggest similar bloggers to network with / follow
  • 5. Why Cluster Blogs?
      • For Search Engines:
          o Improve indexing mechanisms
          o Improve the delivery of search results by organizing similar results together
          o Enhance the navigability of search results
      • Meta Search Engine: Yippy / Clusty
          o Retrieve results from many engines
          o Cluster them into clouds based on their contextual contents
  • 6. Why Cluster Blogs?
      • For Sociocultural / Political Studies:
          o Uncovering trending social, cultural, and political correlations within blogging communities
      • e.g. Harvard Arab Blogosphere Study, 2009
          o Baseline assessment of the networked public sphere in Middle East blogs
          o Relationships to politics, media, religion, culture, international affairs
  • 7. Research Challenges
      • Existing approaches in webpage clustering and web community discovery are explored in the blogosphere
      • Applicability challenges arise from key differences between the blogosphere and the Web:

        Blog Posts                              | Web Pages
        Short-lived references                  | Long-lived references
        Monitoring community temporal dynamics  | Relative temporal stability
        Multi-theme contents                    | Focused contents
        Emergent text analysis                  | Traditional text analysis
        Missing citations                       | Available citations
  • 8. Blog Clusters vs. Community Discovery
      • Research Trend: Researchers more commonly use content information to identify clusters of blog topics, and network information to discover blog communities
      • Proposal: Content and network information can both be used, separately or combined, to identify blog topic clusters and/or blog communities
  • 9. Graph – Based Clustering Approach
  • 10. Spectral Clustering - Example
  • 11. Spectral Clustering - Example
  • 12. k-Means Clustering
      • Assign k centroids randomly
      • Assign points to the closest centroids
      • Recalculate and move the centroids
      • Repeat until the centroids are stable
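
The four steps on this slide can be sketched in plain Python. This is a minimal illustration, not code from the presentation: the function names, the choice to pick the initial centroids from the data points, and the optional `init_centroids` parameter (added so a run can be made reproducible) are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of numeric points."""
    rng = random.Random(seed)
    # Step 1: assign k centroids randomly (assumption: sampled from the data)
    if init_centroids is None:
        centroids = [list(p) for p in rng.sample(points, k)]
    else:
        centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, init_centroids=[(0, 0), (10, 10)])
```

With random initialisation the result can depend on the starting centroids, which is why implementations are often run several times with different seeds.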
  • 13. Content-Based Estimation of W
      • The blog graph could be extremely sparse due to the casual nature of bloggers
      • Sparsity Solution:
          o Edges between blogs are derived using content similarity
      • Given: (content not preserved in the transcript)
      • Graph construction:
          1) ε-neighbourhood
          2) k Nearest Neighbour (kNN)
          3) Fully Connected Graph
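
The three graph constructions listed on this slide can be sketched as follows, assuming a precomputed, symmetric content-similarity matrix `S` between blogs. The function names are hypothetical, and expressing the ε-neighbourhood as a threshold on similarity (rather than on distance) is an assumption made for this illustration.

```python
def epsilon_graph(S, eps):
    """1) ε-neighbourhood: unweighted edge where similarity >= eps."""
    n = len(S)
    return [
        [1.0 if i != j and S[i][j] >= eps else 0.0 for j in range(n)]
        for i in range(n)
    ]


def knn_graph(S, k):
    """2) kNN: link each blog to its k most similar blogs (symmetrised)."""
    n = len(S)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbours = sorted(others, key=lambda j: S[i][j], reverse=True)[:k]
        for j in neighbours:
            # Keep the edge if either endpoint selects the other
            W[i][j] = W[j][i] = S[i][j]
    return W


def full_graph(S):
    """3) Fully connected: every pair linked, weighted by similarity."""
    n = len(S)
    return [[S[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]


# Toy similarity matrix for three blogs (made-up values)
S = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
W_eps = epsilon_graph(S, eps=0.5)
W_knn = knn_graph(S, k=1)
W_full = full_graph(S)
```

The ε and kNN variants yield sparse weight matrices W, while the fully connected variant keeps every pairwise similarity.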
  • 14. Content-Based Clustering Approaches
      • Blog contents are used to compute similarity
      • Text Similarity Measure
          o Cosine measure
      • Spherical k-Means
          o A version of k-means clustering that uses cosine similarity instead of Euclidean distance
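
A minimal sketch of the cosine measure and of spherical k-means, assuming blog posts are already represented as equal-length term vectors. The function names and the `init_indices` parameter (used to pick the starting centroids reproducibly) are assumptions for this illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


def spherical_kmeans(vectors, k, init_indices=None, max_iter=100, seed=0):
    """Cluster unit-normalised vectors by cosine similarity; return labels."""
    rng = random.Random(seed)
    X = [normalize(v) for v in vectors]
    if init_indices is None:
        init_indices = rng.sample(range(len(X)), k)
    centroids = [X[i] for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # On unit vectors the dot product equals the cosine similarity
        new_labels = [
            max(range(k), key=lambda c: dot(x, centroids[c])) for x in X
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, label in zip(X, labels) if label == c]
            if members:
                # Centroid: mean direction, projected back onto the unit sphere
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels


vectors = [
    [1, 0, 0], [2, 0.1, 0], [5, 0, 0.2],   # posts dominated by term 1
    [0, 1, 1], [0, 3, 2.9], [0.1, 2, 2],   # posts dominated by terms 2 and 3
]
labels = spherical_kmeans(vectors, k=2, init_indices=[0, 3])
```

Because every vector is normalised first, post length does not influence the assignment; only the direction of the term vector matters.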
  • 15. Content Pre-Processing
      • Acronyms
          o Urban Dictionary: http://www.urbandictionary.com/
          o Edited by people
          o 5,677,798 definitions since 1999
      • Stop Words Removal
          o Articles (a, an, the ...)
          o Demonstratives (this, that, these ...)
          o Conjunctions (for, and, both ...)
          o Quantifiers (all, few, many ...)
          o Prepositions (on, beneath, over ...)
      • Stemming
          o Affix stemmers, e.g. indefinitely → definite
          o Porter’s stemmer (suffix stripping)
      • Weighting
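
The stop-word removal, stemming, and weighting stages can be illustrated with a small sketch. The stop-word set below contains only the examples shown on the slide, the suffix stripper is a deliberately naive stand-in and not Porter's stemmer, and raw term counts are assumed as the weighting because the slide does not name a scheme.

```python
import re
from collections import Counter

# Only the example words from the slide; a real list would be much longer
STOP_WORDS = {
    "a", "an", "the",            # articles
    "this", "that", "these",     # demonstratives
    "for", "and", "both",        # conjunctions
    "all", "few", "many",        # quantifiers
    "on", "beneath", "over",     # prepositions
}

# Naive suffix list, checked in order (assumption: NOT Porter's rules)
SUFFIXES = ("ing", "ly", "ed", "s")


def strip_suffix(word):
    """Remove the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(strip_suffix(t) for t in tokens if t not in STOP_WORDS)


counts = preprocess("Blogs and blog links: linking all the blogs")
# counts["blog"] is 3 and counts["link"] is 2
```

The resulting term counts form one row of a document-term matrix, which is the input that a vector space model works from.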
  • 16. Vector Space Model
  • 17. Singular Values as Blog Post Features
  • 18. Hybrid Clustering Approach
      • A blog community can be defined as a set of nodes in a graph that link more frequently within this set than outside it, and whose members share similar tags (Java et al., 2008)
  • 19. Evaluation
      • Data Set Description
      • First Data Set: citation network of academic publications
          o Six categories: Agents, Artificial Intelligence (AI), Databases (DB), Human Computer Interaction (HCI), Information Retrieval (IR) and Machine Learning (ML)
          o Binary document-term matrix (presence / absence of terms)
      • Second Data Set: Subgraph of Weblogging Ecosystems (WWE) workshop
          o Tags fetched from del.icio.us, a well-known social bookmarking site
          o Corresponding homepages downloaded
      • Clustering performance was compared between the Hybrid and NCut (network-based) approaches
  • 20. Tag Distribution in Discovered Communities
      • [Figure: top five tags associated with 10 communities found using the NCut approach]
      • [Figure: top five tags associated with 10 communities found using Hybrid clustering]
  • 21. Confusion Matrix Comparison
      • [Figure: confusion matrices, NCut vs. Hybrid]
      • [Figure: average cluster similarity, NCut vs. Hybrid]
  • 22. Cluster Similarity vs. Average Document Similarity
      • [Figure: panels for NCut and Hybrid]
  • 23. Conclusion
      • Both content and network information can be used to identify blog clusters or blog communities
      • Combining content information (user-defined tags, unstructured contents, agglomerative terms / features) with network information leads to more coherent blog clusters and more distinct blog communities than network-based information alone
      • Matrix factorization techniques (LSA, SVD) reduce the sparsity and high dimensionality of content-based clustering information, whereas threshold-based filtration techniques are used
      • More work is needed to account for temporal dynamics in blog clustering, in order to monitor blogging interaction patterns and community evolution
  • 24. Thank You
      Ahmad Ammari, Research Fellow (User / Community Modelling)