Speaker notes
  • Most political blog studies depend on hand-labeled information. This often limits the size of the dataset and makes generalization more difficult, since it is extremely hard to label large numbers of posts or blogs. Many (though not all) studies are limited to mainstream blogs – the most popular blogs in the blogosphere. Studies exploring a larger section of the blogosphere often focus on keyword counts and term co-occurrences. This data may be combined with richer information: geography, time zones, etc.
  • Sentiment analysis aims to determine the attitude or perspective of a speaker or writer with respect to some topic. The challenge: there is little labeled data on blog sentiment. Can we apply a model trained on movie reviews (or other product reviews) to predict sentiment in Lotus blogs? We examined several labeled sentiment datasets from other domains to predict the sentiment of our Lotus-universe blogs. Movielens: 1000 positive and 1000 negative movie reviews from www.movielens.org. Epinions: 97 enterprise software reviews from epinions.com. Amazon: 13040 software reviews from amazon.com. Results are reported as accuracy = # of correctly predicted examples / total # of examples.
  • Graph transduction and assortativity of networks: relevant posts and blogs tend to link to each other; sentiment, however, appears to be disassortative. Authority scores based on both network structure and text: who is driving the discussions, based on linguistic patterns and term frequencies? PageRank, betweenness, etc. are good network measures of who drives discussions, but they need to be validated. Early detection of “buzz” and discussions: through the use of assortativity and (potentially) a Bayesian approach to modeling buzz, building on models of “post cascades”. Defining new measures of assortativity / homophily, both in terms of new probability distributions and multiple labels amongst nodes.
Slides

    1. Mining Political Blog Networks
       Wojciech Gryc, Yan Liu, Prem Melville, Claudia Perlich, Richard D. Lawrence
       Predictive Modeling Group, Mathematical Sciences Department, IBM Research
       June 13, 2008
    2. Overview
       • Blogs and other forms of social media provide us with a snapshot of people’s daily lives, opinions, and ideas – can we use this to learn more about trends within society?
       • Presentation Outline
         • Overview of the political blogosphere
         • Our work: a long-term plan
         • Finding communities
         • Text mining and information retrieval
         • Combining text mining with graph mining
         • Future work
    3. Web 2.0
       • The web as: participatory, customizable, and community-oriented
       • Numerous opportunities for corporations – marketing, customer loyalty, and research
    4. Web 2.0 and Politics
       • Web 2.0 is also revolutionizing politics
    5. Web 2.0, Blogs, and Social Networks
       • Web 2.0 is ultimately a social environment
         • eBay as an auction system with its own social ecosystem
         • Wikipedia as a collaborative environment
         • Blogs, forums, and e-mail as social networks
       • There are over 77 million blogs, with about 100,000 added every day
         • Blog: an online journal in which an individual shares a running log of events and personal insights with online audiences, in reverse chronological order
       • Blogs provide two key pieces of information:
         • Textual data, e.g. “No, I am NOT Voting for McCain ... There has been some murmuring of Hillary Clinton supporters voting for McCain in pure protest of Barack Obama’s candidacy.”
         • Information on relationships between bloggers, e.g. a blog cross-reference graph linking sites such as Huffington Post, Daily Kos, Boing Boing, and Political Wire
    6. Literature Review: Political blog networks
       • Most political blog studies depend on hand-labeled information
       • Many (though not all) studies are limited to mainstream blogs – the most popular blogs in the blogosphere
       • Prior Work
         • Adamic & Glance, 2005: analysis of the political blogosphere between liberal and conservative bloggers
         • Ackland, 2005: follow-up to Adamic & Glance, 2005
         • Wallsten, 2005: the political blogosphere as an echo chamber, and the prominence of conservative bloggers
       • Hand-labeled popular liberal and conservative blogs
       • Analysis of linking patterns
       • Analysis of less popular blogs (background discussions)
    7. Literature Review: Machine learning and political blogs
       • Studies focusing on machine learning and political blogs often focus on text classification
       • Prior Work
         • Tremayne, 2006: preferential attachment and link prediction in the war-focused blogosphere
         • Turney, 2002: cultural discussions are much more difficult to label than technical ones
         • Mullen & Malouf, 2006: sentiment labeling of political discussion boards, accuracies around 60%
         • Durant & Smith, 2006: achieve accuracies around 90% in labeling political blogs as left, right, or moderate
       • Analysis of linking patterns (topic-specific)
       • Post classification based on sentiment labels
       • Notice differences in accuracies
    8. BANTER (Blog Analysis of Network Topology and Evolving Responses)
       77M Blogs → Political Blogs → Presidential Primary Blogs
       • 1. How do we identify the relevant sub-universe of blogs?
         We submit a set of relevant keywords to Technorati, include out-linked blogs, and then refine this sub-universe via active learning.
       • 2. How do we determine “authorities” in this sub-universe?
         We use PageRank-like algorithms on the cross-reference structure, combined with SNA concepts (e.g. information flow).
       • 3. How do we detect emerging topics and themes in this sub-universe?
         One approach is to predict link (cross-reference) formation using network evolution and content (keywords) at the nodes (blogs).
       • 4. How do we detect sentiment and topics associated with specific posts?
         One approach is to learn a model using background knowledge and a small set of labeled examples.
       [Figure: “OpenID” buzz in January]
    9. Task 1
       • First, how do we actually find the relevant blogs and communities?
    10. Task 1: How do we find a relevant sub-community of blogs (e.g. Lotus-related blogs)?
       • Develop a text-based classification approach to rank blogs in terms of their relevance to a specific domain (e.g. Lotus software)
       • [Figure: five-step loop – (1) submit keywords (politics, Democrats, elections, Republicans, policy, voting) to Technorati blog search; (2) take the resulting subset of blogs; (3) extend it by including out-linked blogs; (4) generate positive and negative labels and build a classifier; (5) classify blogs as relevant or irrelevant and use the top-ranked blogs as the new subset]
       • Repeat this process as many times as necessary to collect a larger universe of blogs (a rough sketch of this loop follows this slide)
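Below is a rough sketch of the five-step discovery loop on slide 10. The helpers `search_blogs_by_tag`, `outlinked_blogs`, and `label_examples` are hypothetical stand-ins for the Technorati keyword search, the link crawler, and the manual labeling step, and the TF-IDF plus multinomial Naive Bayes classifier is an assumption of ours; the slide only specifies that a text-based relevance classifier is built and that the top-ranked blogs feed the next iteration.

```python
# Sketch of the iterative blog-discovery loop (Task 1, slide 10).
# All helper callables here are hypothetical stand-ins, not the original system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def discover_relevant_blogs(seed_keywords, search_blogs_by_tag, outlinked_blogs,
                            label_examples, n_rounds=3, top_k=500):
    """Iteratively grow a sub-universe of relevant blogs."""
    # Steps 1-2: seed the subset via keyword search
    subset = {b for kw in seed_keywords for b in search_blogs_by_tag(kw)}
    for _ in range(n_rounds):
        # Step 3: extend the subset with out-linked blogs
        extended = subset | {b for blog in subset for b in outlinked_blogs(blog)}
        # Step 4: obtain a small set of positive / negative labels
        pos, neg = label_examples(extended)
        vec = TfidfVectorizer(max_features=20000)
        X = vec.fit_transform([b.text for b in pos + neg])
        y = [1] * len(pos) + [0] * len(neg)
        clf = MultinomialNB().fit(X, y)          # Step 5: build the relevance model
        # Rank all candidate blogs by predicted relevance; keep the top ones
        scored = sorted(extended,
                        key=lambda b: clf.predict_proba(vec.transform([b.text]))[0, 1],
                        reverse=True)
        subset = set(scored[:top_k])             # top-ranked blogs become the new subset
    return subset
```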
    11. Task 1: Technorati results
       • 31 political tags were submitted to Technorati, based on terms surrounding the Presidential Primaries and policy areas currently making headlines (e.g. “Iraq”, “economy”)
       • The following table shows the number of blogs tagged with a specific term
    12. Task 1: Current data sets
       • Initial set of 100 “influential” political blogs
         • Crawled from January 10, 2008 until the present
         • Includes influential sites taken from previous political blog papers and listings such as Technorati’s Top 100 blogs
         • Includes blogs like Huffington Post, Wonkette, Daily Kos, etc.
       • Larger set of 11,788 blogs (317,566 posts) being crawled since April 22, 2008
         • This includes the smaller data set above
         • Built through the Technorati tag system
    13. Task 2: How do we determine “authorities” in this sub-universe?
       • Influence
         • Standard site-ranking algorithms (e.g. PageRank or flow betweenness) look at the status of each blogger within the social network (a minimal sketch of both measures follows this slide)
         • PageRank looks at linking patterns, giving more weight (i.e. importance) to links originating from important websites
         • Flow betweenness looks at whether specific nodes in a network act as key distribution points for information
       [Figure: network diagrams comparing PageRank and flow betweenness, with nodes shaded by level of importance (most, very, somewhat, none)]
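As an illustration of the two authority measures named on slide 13, the sketch below computes PageRank and a flow-based betweenness score over a toy cross-reference graph with networkx. Current-flow betweenness centrality is used here as a stand-in for flow betweenness, since the deck does not specify the exact variant; the four example blogs and their links are illustrative only.

```python
# Minimal sketch: authority scores on a toy blog cross-reference graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("huffingtonpost", "dailykos"),
    ("dailykos", "politicalwire"),
    ("boingboing", "dailykos"),
    ("politicalwire", "huffingtonpost"),
])

# PageRank: weight links from important blogs more heavily
pagerank = nx.pagerank(G, alpha=0.85)
# Current-flow betweenness as a proxy for "key distribution points" of information
betweenness = nx.current_flow_betweenness_centrality(G.to_undirected())

for blog in G:
    print(f"{blog:16s} pagerank={pagerank[blog]:.3f} flow-betweenness={betweenness[blog]:.3f}")
```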
    14. Task 3: How do we detect emerging topics and themes in this sub-universe?
       • One way to find emerging topics is to compare background discussions to the most recent posts (a rough sketch follows this slide)
       • We can also use text-based information to see if our authorities are actually leading discussions and breaking news
       [Figure: post-volume timeline showing general background discussions, a spike when the Wikileaks DNS entry was removed by a US judge, and a new background level following press coverage]
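One simple way to operationalize the background-versus-recent comparison on slide 14 is to flag terms whose rate in the most recent window far exceeds their background rate. The sketch below is our own illustration, not the detector used in the study; the tokenization, thresholds, and smoothing are all assumptions.

```python
# Sketch: flag terms that spike in a recent window relative to background posts.
from collections import Counter

def emerging_terms(background_posts, recent_posts, min_ratio=5.0, min_count=10):
    bg = Counter(w for p in background_posts for w in p.lower().split())
    recent = Counter(w for p in recent_posts for w in p.lower().split())
    bg_total = sum(bg.values()) or 1
    recent_total = sum(recent.values()) or 1
    spikes = {}
    for term, count in recent.items():
        if count < min_count:
            continue
        recent_rate = count / recent_total
        bg_rate = (bg[term] + 1) / bg_total      # +1 smoothing for unseen terms
        if recent_rate / bg_rate >= min_ratio:
            spikes[term] = recent_rate / bg_rate
    # Terms with the largest spike over the background level come first
    return sorted(spikes.items(), key=lambda kv: -kv[1])
```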
    15. Task 3: How do we detect emerging topics and themes in this sub-universe?
       • Assortativity and homophily play a key role in our analysis
       • Our approach to homophily is based on measuring within- and between-group edges
       • Homophily is only observed in certain contexts
         • Blogs focusing on similar topics are more likely to link to each other
         • Blogs are not homophilous when it comes to political sentiment
         • Node level versus network level
       [Figure: example homophilous and heterophilous networks]
    16. Task 3: Analyzing discussions and network structure
       • To use assortativity in analyzing “buzz”, label bloggers by whether they mention a specific term or set of terms (see the sketch after this slide)
       • In this case, the correlation between the number of bloggers and assortativity is -0.799
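The sketch below illustrates the measurement described on slide 16: label each blogger by whether they mention a given term in the time window, then compute assortativity of that label over the cross-reference graph. networkx's attribute assortativity coefficient is used as a convenient stand-in for the within-/between-group edge measure from slide 15; the toy graph and posts are illustrative only.

```python
# Sketch: term-mention labels plus attribute assortativity as a "buzz" signal.
import networkx as nx

def buzz_assortativity(G, posts_by_blog, term):
    """posts_by_blog: dict blog -> list of post texts in the current time window."""
    for blog in G:
        texts = posts_by_blog.get(blog, [])
        # Node label: does this blogger mention the term in the window?
        G.nodes[blog]["mentions"] = any(term in t.lower() for t in texts)
    return nx.attribute_assortativity_coefficient(G, "mentions")

# Tiny example where linked bloggers tend NOT to share the label
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
posts = {"a": ["openid rocks"], "b": ["unrelated"], "c": ["openid again"], "d": ["other"]}
print(buzz_assortativity(G, posts, "openid"))   # negative value -> disassortative mixing
```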
    17. Task 4: Labeling political posts and blogs
       • Depending on the information we want to extract, there are numerous labels we may want to apply to posts or blogs
       • An extension of our data sets:
         • 260 posts labeled as “positive” or “negative” in relation to Hillary Clinton and Barack Obama
         • 360 posts labeled as “relevant” or “not relevant” in relation to the Democratic Primaries
       • Potential labels
         • Relevant or not relevant
         • Subjective or objective
         • Positive or negative
       [Figure: labeling pipeline – is the post relevant? is the post subjective? then label posts positive or negative]
    18. Task 4: Labeling political posts and blogs
       • Key question: how can we improve our classifiers with such a limited set of labeled examples?
       • Transfer learning: using other data sets
       • Using background knowledge
    19. Task 4: Precocious Naïve Bayes
       • Using Naïve Bayes classification, we can use a bag-of-words approach to build text-based classifiers
       • Problem: when training a classifier like this, we start with a “blank slate” – equal probabilities for all features
       • It may also be useful to include background knowledge in classification systems
         • Lexicons containing sentiment-focused information
         • Related data sets and labeled information
    20. Task 4: Precocious Naïve Bayes
       • Using a lexicon can improve the classification process (see the sketch after this slide)
         • It increases accuracy and minimizes the number of training examples needed
       • Example application: classifying posts based on sentiment towards Obama or Clinton
         • Similar studies achieve accuracies of about 60%
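To make the idea of not starting from a "blank slate" concrete, the sketch below seeds a multinomial Naive Bayes classifier with pseudo-counts from a small sentiment lexicon before any training documents are seen. This only illustrates incorporating lexicon background knowledge; it is not the authors' exact Precocious Naive Bayes formulation, and the tiny lexicon and weights are made up for the example.

```python
# Sketch: a Naive Bayes text classifier whose class-conditional word counts
# are seeded from a sentiment lexicon (illustrative, not the original method).
import math
from collections import Counter

POS_LEXICON = {"great", "win", "hope", "support"}       # illustrative lexicon
NEG_LEXICON = {"scandal", "lie", "attack", "failure"}

class LexiconNB:
    def __init__(self, lexicon_weight=5.0, smoothing=1.0):
        self.counts = {"pos": Counter(), "neg": Counter()}
        self.doc_counts = {"pos": 1, "neg": 1}
        self.smoothing = smoothing
        # Seed class-conditional counts with lexicon pseudo-counts (no blank slate).
        for w in POS_LEXICON:
            self.counts["pos"][w] += lexicon_weight
        for w in NEG_LEXICON:
            self.counts["neg"][w] += lexicon_weight

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.counts[label].update(doc.lower().split())
        return self

    def predict(self, doc):
        vocab = set(self.counts["pos"]) | set(self.counts["neg"])
        scores = {}
        for label in ("pos", "neg"):
            total = sum(self.counts[label].values()) + self.smoothing * len(vocab)
            score = math.log(self.doc_counts[label])
            for w in doc.lower().split():
                score += math.log((self.counts[label][w] + self.smoothing) / total)
            scores[label] = score
        return max(scores, key=scores.get)

clf = LexiconNB().fit(["a great night for obama"], ["pos"])
print(clf.predict("another scandal and a lie"))   # expected: "neg"
```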
    21. Task 4: Precocious Naïve Bayes
       • Using machine learning and information retrieval models can also help clarify linguistic patterns within political posts and blogs
       • For example, consider the term “truth”
         • In general, “truth” has positive connotations
         • Yet our models see it as negative
       • Evidence as to why “truth” is negative
         • Often used in sarcastic or accusatory messages
         • Associated with negative events
         • Example posts: “Spinning the truth.” “Transform a lie into a truth.” “There is a lot of truth to Wright's sermons.”
       • Another down-weighted term: “liberal”
         • More evidence of a conservative blogosphere?
    22. Task 4: Transfer learning for sentiment prediction
       • Can we apply a model trained for movie reviews (or other product reviews) to predict sentiment in Lotus blogs?
       • We examined several labeled sentiment datasets from other domains to predict the sentiment of our Lotus-universe blogs (a sketch of the evaluation setup follows this slide)
         • Movielens: 1000 positive and 1000 negative movie reviews from www.movielens.org
         • Epinions: 97 enterprise software reviews from epinions.com
         • Amazon: 13040 software reviews from amazon.com
       • Accuracy by training set (rows) and testing set (columns):

         Training \ Testing (Accuracy)   Lotus Data 2-class (3-class)   Amazon (Software)   Epinions (Enterprise software)   Movie
         Movie                           79.3                           60.1                65.9                             81.5
         Epinions                        50.3                           34.9                67.5                             33.7
         Amazon                          76.6                           73.8                60.9                             67.2
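The cross-domain numbers above come from training on one labeled domain and testing on another. A minimal sketch of that evaluation pattern is shown below; the source and target texts are hypothetical inputs, and the TF-IDF plus logistic regression pipeline is our assumption rather than the classifier reported in the deck.

```python
# Sketch: train a bag-of-words sentiment classifier on one domain,
# report accuracy on another (e.g. movie reviews -> Lotus-universe blog posts).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_domain_accuracy(source_texts, source_labels, target_texts, target_labels):
    vec = TfidfVectorizer(min_df=2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(source_texts), source_labels)
    preds = clf.predict(vec.transform(target_texts))
    # accuracy = # of correctly predicted examples / total # of examples
    return accuracy_score(target_labels, preds)

# e.g. cross_domain_accuracy(movie_texts, movie_labels, lotus_texts, lotus_labels)
```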
    23. Task 4: Labeling political posts and blogs
       • Final question: can we learn more by combining graph-based information with our text-based models?
    24. Task 4: Topic-Link Latent Dirichlet Allocation (LDA)
       • We wish to incorporate both content and network structure in modeling the political blogosphere
         • Members of a “community” are more likely to link to each other
       • An application: predicting links between new posts in our data set
         • Predicting linking patterns can help predict (or observe) major patterns in the blogosphere
       • Simple text-based models can even be somewhat effective
         • The chart below shows the probability of a link existing between two posts based on those two posts’ content similarities
    25. Task 4: Topic-Link Latent Dirichlet Allocation (LDA)
       • This method helps show clusters of terms (i.e. topical discussions or posts)
       • The table below shows the top five clusters using LDA and Topic-Link LDA
       • While one can’t formally say which set of clusters is better, both provide different ways of looking at patterns in the data
         • Clusters from the Topic-Link LDA approach are built using both textual similarity scores and the linking patterns of the bloggers
    26. Task 4: Link prediction using Topic-Link LDA
       • Using the models generated by Topic-Link LDA, it is possible to build a predictive model of which posts will link to which posts
         • Using posts from February 1-14, 2008, we want to predict linking patterns between posts written during February 15-22, 2008
       • Baseline models (a sketch of the cosine-similarity baseline follows this slide):
         • Preferential attachment: blogs with high out-links always cite blogs with high in-links
         • Cosine similarity: blog posts with high similarity scores (> 0.5) link to each other
       • Below are precision and recall scores for the models
         • Accuracies were not used due to the sparsity of the networks
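A minimal sketch of the cosine-similarity baseline described on slide 26, under the stated threshold of 0.5: predict a link between two posts when their TF-IDF cosine similarity exceeds the threshold, then score the predictions with precision and recall against the observed links. The data structures (a dict of post texts and a set of linked post pairs) are assumptions for illustration.

```python
# Sketch: cosine-similarity link-prediction baseline scored by precision/recall.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_link_baseline(post_texts, true_links, threshold=0.5):
    """post_texts: dict post_id -> text; true_links: set of frozenset post-id pairs."""
    ids = list(post_texts)
    X = TfidfVectorizer().fit_transform([post_texts[i] for i in ids])
    sims = cosine_similarity(X)
    # Predict a link for every pair of posts whose similarity exceeds the threshold
    predicted = {frozenset((a, b))
                 for i, a in enumerate(ids) for j, b in enumerate(ids)
                 if i < j and sims[i, j] > threshold}
    tp = len(predicted & true_links)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true_links) if true_links else 0.0
    return precision, recall
```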
    27. Future Work and Potential Extensions
       • Graph transduction and assortativity of networks
         • Multidimensional assortativity
         • Graph transduction dependent on assortativity
       • Authority scores based on both network structure and text
         • Who is driving the discussions, based on linguistic patterns and term frequencies?
         • PageRank, betweenness, etc. are good network measures of who drives discussions, but need to be validated
       • Early detection of “buzz” and discussions
         • Building on models of “post cascades”
         • Incorporating a new definition of authority using both textual and network-based data
    28. Thank You
       [email_address]
