School of something
          Computing
FACULTY OF ENGINEERING
           OTHER




        Blog Clustering and Community
         Discovery in the Blogosphere
                           An Overview


                            Ahmad Ammari
              Research Fellow (User / Community Modelling)
OUTLINE

• Significance
• Research Challenges
• Network – Based Blog Clustering Approach
• Content – Based Blog Clustering Approach
• Hybrid – Based Blog Clustering Approach
• Evaluation
• Conclusion
The Blogosphere is Huge
• 100% growth rate
  every 5 months, consistently
  for the last 4 years
• Over 120,000 new blogs
  created every day
• 1.4 new blogs every second
(Technorati, 2009)
Why Cluster Blogs?

• For Bloggers / Readers:
 o Can focus on the clusters
   they “belong to”
• Improve Recommender
  Engines:
 o Suggest related content to
   other cluster members
 o Suggest similar bloggers
   to network / follow
Why Cluster Blogs?
• For Search Engines:
 o Improve indexing
   mechanisms
 o Improve the delivery
   of the search results
   by organizing similar
   results together
 o Enhance the
   navigability of search
   results
• Example: Meta Search Engine Yippy / Clusty
 o Retrieves results from many engines
 o Clusters them into 'clouds' based on
   their contextual contents
Why Cluster Blogs?
• For Sociocultural / Political
  Studies:
    o Uncovering trending
      social, cultural, & political
      correlations within
      blogging communities
•    e.g. Harvard Arab
     Blogosphere Study, 2009
    o Baseline assessment of
      networked public sphere in
      Middle East Blogs
    o Relationships to politics,
      media, religion, culture,
      international affairs
Research Challenges
• Existing approaches in webpage clustering & web community
  discovery are explored in the blogosphere
• Applicability Challenges due to Key Differences between the
  Blogosphere & the Web

    Blog Posts                                Web Pages
    --------------------------------------    ---------------------------
    Short-lived References                    Long-lived References
    Monitoring Community Temporal Dynamics    Relative Temporal Stability
    Multi-Theme Contents                      Focused Contents
    Emergent Text Analysis                    Traditional Text Analysis
    Missing Citations                         Available Citations
Blog Clusters Vs. Community Discovery
• Research Trend: Most existing work leverages content
  information to identify clusters of blog topics, and
  network information to discover blog communities
• Proposal: Content and network information can both be used,
  alone or combined, to identify blog topic clusters and/or blog
  communities
Graph – Based Clustering Approach
Spectral Clustering - Example
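
The idea behind a spectral (normalized cut) bipartition can be sketched in a few lines. This is only an illustration under stated assumptions, not necessarily the procedure behind the example on this slide: `W` is a hypothetical symmetric weight matrix with no isolated nodes, the name `ncut_bipartition` is made up, and the second eigenvector is approximated by power iteration rather than a full eigen-solver.

```python
import math

def ncut_bipartition(W, iters=500):
    """Split a graph in two using the sign of the (approximate) Fiedler vector.

    W is a symmetric list-of-lists weight matrix; every node needs degree > 0.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    s = [1.0 / math.sqrt(d) for d in deg]
    # M = D^-1/2 W D^-1/2; the normalized Laplacian is L_sym = I - M.
    M = [[s[i] * W[i][j] * s[j] for j in range(n)] for i in range(n)]
    # Trivial eigenvector of M (eigenvalue 1) is proportional to D^1/2 * 1.
    u = [math.sqrt(d) for d in deg]
    norm_u = math.sqrt(sum(a * a for a in u))
    u = [a / norm_u for a in u]
    # Deterministic, non-constant starting vector (an arbitrary choice).
    x = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        # Remove the trivial component, then apply (M + I) / 2, whose
        # eigenvalues lie in [0, 1], so the iteration converges to the
        # eigenvector of the second-smallest eigenvalue of L_sym.
        proj = sum(a * b for a, b in zip(x, u))
        x = [a - proj * b for a, b in zip(x, u)]
        x = [0.5 * (sum(M[i][j] * x[j] for j in range(n)) + x[i])
             for i in range(n)]
        norm = math.sqrt(sum(a * a for a in x))
        x = [a / norm for a in x]
    # Map back to the generalized eigenvector y = D^-1/2 x, threshold at 0.
    return [1 if s[i] * x[i] > 0 else 0 for i in range(n)]

# Two triangles joined by one weak edge (made-up example graph).
W = [
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0,   1, 0, 1],
    [0, 0, 0,   1, 1, 0],
]
print(ncut_bipartition(W))
```

For more than two clusters, a common variant keeps the first k eigenvectors and runs k-means on their rows.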
k-Means Clustering
• Initialize k centroids
  randomly
• Assign points to
  closest centroids
• Recalculate and
  move centroids
• Repeat until
  centroids are stable
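
The four steps above can be sketched as follows. This is a minimal pure-Python illustration; the function names, the fixed `seed`, and the choice to initialize centroids by sampling existing points are assumptions made for the example.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial centroids at random (here: k of the points).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Step 2: assign every point to its closest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

The result depends on the random initialization, so in practice the algorithm is usually run several times and the best partition is kept.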
Content – Based Estimation of W
• Blog graph could be extremely
  sparse due to the casual nature
  of bloggers
• Sparsity Solution:
  o Edges between blogs are
    derived using content similarity
• Given:
• Ways to build the similarity graph:
  1) ε-neighbourhood
  2) k Nearest Neighbor (kNN)
  3) Fully Connected Graph
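
As a rough sketch, the three graph constructions can be written against a precomputed similarity matrix. The matrix `S`, the function names, and the use of a similarity threshold for the ε-neighbourhood (rather than a distance radius) are assumptions made for this illustration.

```python
def epsilon_graph(S, eps):
    """Keep an edge i-j only if the similarity is at least eps."""
    n = len(S)
    return [[S[i][j] if i != j and S[i][j] >= eps else 0.0
             for j in range(n)] for i in range(n)]

def knn_graph(S, k):
    """Connect each node to its k most similar nodes, then symmetrise."""
    n = len(S)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: S[i][j], reverse=True)[:k]
        for j in neighbours:
            W[i][j] = W[j][i] = S[i][j]
    return W

def full_graph(S):
    """Connect every pair of nodes, weighted by similarity (no self-loops)."""
    n = len(S)
    return [[S[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]

# Made-up symmetric similarity matrix for three blogs.
S = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
print(epsilon_graph(S, 0.5))
print(knn_graph(S, 1))
print(full_graph(S))
```

The first two options give a sparse W, while the fully connected graph keeps every pairwise similarity and relies on the weights to encode locality.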
Content – Based Clustering Approaches
• Blog Contents are used to compute Similarity
• Text - Similarity Measure
 o Cosine Measure: cos(x, y) = (x · y) / (‖x‖ ‖y‖)
• Spherical k-Means
 o Version of k-means clustering that uses cosine similarity
   instead of Euclidean distance
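
A minimal sketch of the cosine measure and of spherical k-means follows. The helper names, the fixed `seed`, and the toy vectors are assumptions for illustration; real blog vectors would be high-dimensional term vectors.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def unit(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v] if length else list(v)

def spherical_kmeans(vectors, k, max_iters=100, seed=0):
    """k-means on unit vectors, assigning by highest cosine similarity."""
    rng = random.Random(seed)
    X = [unit(v) for v in vectors]          # project documents onto the unit sphere
    centroids = rng.sample(X, k)
    labels = None
    for _ in range(max_iters):
        # On unit vectors the dot product equals the cosine similarity.
        new_labels = [max(range(k), key=lambda c: dot(x, centroids[c])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                # New centroid: the re-normalized sum of the member vectors.
                centroids[c] = unit([sum(col) for col in zip(*members)])
    return labels

docs = [(1, 0), (3, 0.1), (10, 0.5), (0, 1), (0.1, 4), (0.4, 8)]
print(spherical_kmeans(docs, k=2))
```

Because every vector is normalized first, document length has no effect: a short post and a long post on the same topic point in nearly the same direction.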
Content Pre-Processing
• Acronyms
  o Urban Dictionary: http://www.urbandictionary.com/
  o Edited by People
  o 5,677,798 definitions since 1999
• Stop Words Removal
  o Articles (a, an, the ...)
  o Demonstratives (this, that, these ...)
  o Conjunctions (for, and, both ...)
  o Quantifiers (all, few, many ...)
  o Prepositions (on, beneath, over ...)
• Stemming
  o Affix Stemmers, e.g. indefinitely → definite
  o Porter’s stemmer (Suffix Stripping)
• Weighting
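
The acronym, stop-word and stemming steps can be sketched as a small pipeline. Everything here is a simplified assumption for illustration: the `ACRONYMS` table is hypothetical (it is not Urban Dictionary data), the stop-word list only contains the examples named on the slide, and `strip_suffix` is a toy suffix stripper, not Porter's stemmer.

```python
import re

# Hypothetical acronym table; a real one would come from an external dictionary.
ACRONYMS = {"imo": "in my opinion", "btw": "by the way"}

# Only the example words listed on the slide.
STOP_WORDS = {"a", "an", "the", "this", "that", "these", "for", "and", "both",
              "all", "few", "many", "on", "beneath", "over"}

# Toy suffix list, checked in order; far cruder than Porter's rules.
SUFFIXES = ("ing", "ly", "ed", "s")

def strip_suffix(word):
    """Remove the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    expanded = []
    for token in tokens:
        # Replace a known acronym by the words of its expansion.
        expanded.extend(ACRONYMS.get(token, token).split())
    return [strip_suffix(t) for t in expanded if t not in STOP_WORDS]

print(preprocess("The bloggers are linking these posts"))
```

The resulting token lists are what a weighting scheme would then turn into numeric term vectors.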
Vector Space Model
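
A vector space model maps each pre-processed post to a vector with one dimension per term. The sketch below assumes TF-IDF weighting, which is a common choice but is not confirmed as the scheme used here; the function name `tfidf` and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Turn tokenized documents into TF-IDF vectors over a sorted vocabulary."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = raw term frequency * log(N / document frequency).
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

docs = [["blog", "cluster"], ["blog", "graph"]]
vocab, vectors = tfidf(docs)
print(vocab)
print(vectors)
```

A term that occurs in every document (here "blog") gets weight 0, so only the terms that distinguish documents contribute to their similarity.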
Singular Values as Blog Post Features
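
One common way to read this, assuming the standard LSA setup, is through the truncated SVD of the term-by-post matrix. The symbols below (A with m terms as rows and n posts as columns, rank k, unit vector e_j) are notation introduced for this sketch, not taken from the slide.

```latex
A = U \Sigma V^{\top}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
\sigma_1 \ge \dots \ge \sigma_k > 0

\hat{d}_j = U_k^{\top} a_j = \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}
```

Here a_j is the j-th column of A (one blog post) and \hat{d}_j is its k-dimensional feature vector, so clustering can operate on k dense features instead of m sparse term weights.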
Hybrid - based Clustering approach
• A blog community can be defined as a set of nodes
  in a graph that link to each other more frequently
  than to nodes outside the set, and that share similar tags
  (Java et al., 2008)
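
One simple way to let both links and tags influence the clustering is to blend them into a single edge weight before running a graph clustering method. The sketch below is only an assumption for illustration and is not the formulation of Java et al.: the linear blend, the parameter `alpha`, the Jaccard tag similarity and all names are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hybrid_weights(links, tags, alpha=0.5):
    """Blend link weights and tag similarity: alpha * link + (1 - alpha) * tag."""
    n = len(links)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] = (alpha * links[i][j]
                           + (1 - alpha) * jaccard(tags[i], tags[j]))
    return W

# Made-up data: blog 0 and 1 link to each other, blog 2 is isolated
# in the link graph but shares all of its tags with blog 0.
links = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
tags = [{"java", "code"}, {"java"}, {"java", "code"}]
print(hybrid_weights(links, tags))
```

With tags included, blog 2 is no longer disconnected, which is the kind of sparsity relief that motivates combining the two sources.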
Evaluation
• Data Set Description




• First Data Set: citation network of academic publications
   o Six categories: Agents, Artificial Intelligence (AI), Databases
      (DB), Human Computer Interaction (HCI), Information
      Retrieval (IR) and Machine Learning (ML)
   o Binary document-term matrix (Presence / Absence of Terms)
• Second Data Set: Subgraph of Weblogging Ecosystems (WWE)
  workshop
   o Tags fetched from del.icio.us, a well-known social
      bookmarking site
   o Corresponding Homepages downloaded
• Performed Clustering Performance Comparisons between
  Hybrid & NCut (Network – based) Approaches
Tag Distribution in Discovered Communities
[Figure: top five tags associated with 10 communities found using the NCut approach]
[Figure: top five tags associated with 10 communities found using Hybrid Clustering]
Confusion Matrix Comparison
[Figure: confusion matrices, panels: NCut, Hybrid]
Average Cluster Similarity
[Figure: panels: NCut, Hybrid]
Cluster Similarity Vs AVG Doc Similarity
[Figure: panels: NCut, Hybrid]
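
For reference, the two kinds of measure named above can be computed roughly as follows. This is a sketch under assumptions: the function names are made up, "average cluster similarity" is taken to mean the mean pairwise cosine similarity inside each cluster, and the labels and vectors are toy data rather than the evaluation data sets.

```python
import math
from collections import Counter
from itertools import combinations

def confusion_matrix(true_labels, cluster_labels):
    """Rows: true categories, columns: clusters, cells: document counts."""
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_labels))
    counts = Counter(zip(true_labels, cluster_labels))
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def average_cluster_similarity(vectors, cluster_labels):
    """Mean pairwise cosine similarity per cluster (singletons are skipped)."""
    result = {}
    for c in sorted(set(cluster_labels)):
        members = [v for v, l in zip(vectors, cluster_labels) if l == c]
        pairs = list(combinations(members, 2))
        if pairs:
            result[c] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return result

rows, cols, matrix = confusion_matrix(["AI", "AI", "DB", "DB", "DB"],
                                      [0, 0, 1, 1, 0])
print(rows, cols, matrix)
print(average_cluster_similarity([(1, 0), (0, 1), (1, 1), (2, 2)],
                                 [0, 0, 1, 1]))
```

A confusion matrix with most of its mass in one cell per row indicates clusters that match the known categories, and a higher within-cluster similarity indicates more coherent clusters.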
Conclusion
• Both content and network information can be used to
  identify blog clusters or blog communities
• Combining content information (user – defined tags,
  unstructured contents, agglomerative terms / features) with
  network information leads to more coherent blog clusters
  and more distinct blog communities than network – based
  information alone
• Matrix Factorization Techniques (LSA, SVD) reduce the
  Sparsity and High Dimensionality of Content – based
  Clustering Information whereas Threshold – based filtration
  techniques are used
• More work is needed to account for temporal dynamics
  in blog clustering, so that blogging interaction patterns
  and community evolution can be monitored
                         Thank You