Trust influence and social media



  1. 1. Trust, Influence and Bias in Social Media. Anupam Joshi. Joint work with Tim Finin and several students. Ebiquity Group, UMBC. [email_address]
  2. 2. Knowing & Influencing your Audience <ul><li>Your goal is to campaign for a presidential candidate </li></ul><ul><li>How can you track the buzz about him/her? </li></ul><ul><li>What are the relevant communities and blogs? </li></ul><ul><li>Which communities are supporters, which are skeptical, which are put off by the hype? </li></ul><ul><li>Is your campaign having an effect? The desired effect? </li></ul><ul><li>Which bloggers are influential with the political audience? Of these, which are already onboard and which are lost causes? </li></ul><ul><li>To whom should you send details or talk to? </li></ul>
  3. 3. Knowing & Influencing your Market <ul><li>Your goal is to market Zune </li></ul><ul><li>How can you track the buzz about it? </li></ul><ul><li>What are the relevant communities and blogs? </li></ul><ul><li>Which communities are fans, which are suspicious, which are put off by the hype? </li></ul><ul><li>Is your advertising having an effect? The desired effect? </li></ul><ul><li>Which bloggers are influential in this market? Of these, which are already onboard and which are lost causes? </li></ul><ul><li>To whom should you send details or evaluation samples? </li></ul>
  4. 4. What is Influence? <ul><li>“the act or power of producing an effect without apparent exertion of force or direct exercise of command” </li></ul><ul><li>Measurable Influence </li></ul><ul><li>The ability of a blogger to persuade another blogger to: </li></ul><ul><li>Take action by creating a new post about the topic and commenting on the original (text and graph mining). </li></ul><ul><li>Quote the blogger’s views in her post (text mining). </li></ul><ul><li>Link to the original post via trackbacks, comments (graph mining). </li></ul><ul><li>Link to the blogger through other means like digg, citeULike, Connotea, etc. (graph mining). </li></ul><ul><li>Subscribe to the blog feed (graph mining). </li></ul>
  5. 5. What is a Community? <ul><li>A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. </li></ul><ul><li>Graph </li></ul><ul><li>Citation Network </li></ul><ul><li>Affiliation Network </li></ul><ul><li>Sentiment Information </li></ul><ul><li>Shared Resource (tags, videos..) </li></ul>[Figure panels: Political Blogs, Twitter Network, Facebook Network]
  6. 6. Finding Communities (and Feeds) That Matter <ul><li>Top Advertising Feeds </li></ul><ul><li>1. Adrants » Marketing and Advertising News With Attitude </li></ul><ul><li>2. Adverblog: advertising and new media marketing </li></ul><ul><li>3. </li></ul><ul><li>4. adfreak </li></ul><ul><li>5. AdJab </li></ul><ul><li>6. MIT Advertising Lab: future of advertising and advertising technology </li></ul><ul><li>7. AdPulp: Daily Juice from the Ad Biz </li></ul><ul><li>8. Advertising/Design Goodness </li></ul><ul><li>Related Tags: advertising, marketing, media, news, design </li></ul>[Figure panels: Before Merge, After Merge] Analysis of Bloglines Feeds: 83K publicly listed subscribers; 2.8M feeds, of which 500K are unique; 26K users (35%) use folders to organize subscriptions; data collected in May 2006.
  7. 7. Feeds That Matter <ul><li>Top Feeds for “Politics” </li></ul><ul><li>Merged folders: “political”, “political blogs” </li></ul><ul><li>Talking Points Memo: by Joshua Micah Marshall </li></ul><ul><li>Daily Kos: State of the Nation </li></ul><ul><li>Eschaton </li></ul><ul><li>The Washington Monthly </li></ul><ul><li>Wonkette, Politics for People with Dirty Minds </li></ul><ul><li> </li></ul><ul><li>Informed Comment </li></ul><ul><li>Power Line </li></ul><ul><li>AMERICAblog: Because a great nation deserves the truth </li></ul><ul><li>Crooks and Liars </li></ul><ul><li>Top Feeds for “Knitting” </li></ul><ul><li>Merged folders “knitting blogs” </li></ul><ul><li>Yarn Harlot knitting </li></ul><ul><li>Wendy Knits! </li></ul><ul><li>See Eunny Knit! </li></ul><ul><li>the blue blog </li></ul><ul><li>Grumperina goes to local yarn shops and Home Depot </li></ul><ul><li>You Knit What?? </li></ul><ul><li>Mason-Dixon Knitting </li></ul><ul><li>knit and tonic </li></ul><ul><li>Crazy Aunt Purl </li></ul><ul><li> </li></ul>
  8. 8. Special Properties of Social Datasets <ul><li>Long Tail </li></ul><ul><ul><li>80/20 Rule or Pareto distribution </li></ul></ul><ul><ul><li>Few blogs get most attention/links </li></ul></ul><ul><ul><li>Most are sparsely connected </li></ul></ul><ul><li>Motivation </li></ul><ul><ul><li>Web graphs are large, but sparse </li></ul></ul><ul><ul><li>Expensive to compute community structure over the entire graph </li></ul></ul><ul><li>Goal </li></ul><ul><ul><li>Approximate the membership of the nodes using only a small portion of the entire graph. </li></ul></ul>
  9. 9. Special Properties of Social Datasets <ul><li>Intuition </li></ul><ul><ul><li>Communities defined by the core (A) </li></ul></ul><ul><ul><li>Membership of the rest (B) approximated by how they link to the core </li></ul></ul><ul><li>Direct Method </li></ul><ul><ul><li>NCut (Baseline) </li></ul></ul><ul><li>Approximation </li></ul><ul><ul><li>Singular value decomposition (SVD) </li></ul></ul><ul><ul><li>Sampling </li></ul></ul><ul><ul><li>Heuristic </li></ul></ul>
  10. 10. Approximating Communities <ul><li>SVD (low rank) </li></ul><ul><li>Sampling-based Approach </li></ul><ul><ul><li>Communities can be extracted by sampling only columns from the head (Drineas et al.) </li></ul></ul><ul><li>Heuristic: cluster the head to find initial communities, then assign each tail node to the cluster it most frequently links to. </li></ul>[Figure: nodes ordered by degree, with the first r columns forming the head] ICWSM ‘08
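The heuristic on this slide can be sketched roughly as follows. This is a minimal illustration, not the ICWSM ‘08 implementation: the dense adjacency matrix, picking the `r` highest-degree nodes as the head, and the stand-in `split_head` step (a two-way split by the sign of the Fiedler vector) are all assumptions made for the example.

```python
import numpy as np

def split_head(adj_head):
    """Hypothetical stand-in for 'cluster the head': a two-way split of the
    head subgraph by the sign of the Fiedler vector of its Laplacian."""
    degrees = adj_head.sum(axis=1)
    laplacian = np.diag(degrees) - adj_head
    _, vecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
    fiedler = vecs[:, 1]                     # second-smallest eigenvector
    return (fiedler > 0).astype(int)

def approximate_communities(adj, r):
    """Cluster the r highest-degree nodes (the head), then assign every
    remaining (tail) node to the head cluster it links to most often."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    order = np.argsort(-adj.sum(axis=1), kind="stable")
    head, tail = order[:r], order[r:]
    head_labels = split_head(adj[np.ix_(head, head)])
    labels = np.empty(n, dtype=int)
    labels[head] = head_labels
    k = head_labels.max() + 1
    for node in tail:
        votes = np.zeros(k)
        for h, lab in zip(head, head_labels):
            votes[lab] += adj[node, h]       # count links into each cluster
        labels[node] = int(np.argmax(votes))
    return labels

# Toy graph (made up): head path 0-1-2-3, tail node 4 links to 0 and 1,
# tail node 5 links to 2 and 3.
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (1, 4), (2, 5), (3, 5)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
labels = approximate_communities(adj, r=4)
```

Only the `r × r` head block is clustered directly; each tail node costs one pass over its links into the head, which is where the savings over clustering the full graph come from.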
  11. 11. Approximating Communities <ul><li>Dataset: A blog dataset of 6000 blogs. </li></ul>ICWSM ‘08 Original Adjacency Heuristic Approximation Modularity = 0.51
  12. 12. Approximating Communities <ul><li>Advantages: faster detection using a small portion of the graph, less memory </li></ul><ul><li>Complexity: SVD $O(n^3)$, NCut $O(nk)$, Sampling $O(r^3)$, Heuristic $O(rk)$, where $n$ = # blogs, $k$ = # clusters, $r$ = # columns </li></ul>[Figure annotations: Low Modularity, More Time; Similar Modularity, Lower Time] ICWSM ‘08
  13. 13. Approximating Communities ICWSM ‘08 Additional evaluations using Variation of Information score
  14. 14. <ul><li>Tags are free meta-data! </li></ul><ul><li>Other semantic features: </li></ul><ul><li>Sentiments </li></ul><ul><li>Named Entities </li></ul><ul><li>Readership information </li></ul><ul><li>Geolocation information </li></ul><ul><li>etc. </li></ul><ul><li>How to combine this for detecting communities? </li></ul>
  15. 15. Social Media Graphs: Links Between Nodes; Links Between Nodes and Tags; Simultaneous Cuts
  16. 16. A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags. Communities in Social Media
  17. 17. SimCUT: Clustering Tags and Graphs [Figure: block matrix with Nodes × Nodes, Nodes × Tags, Tags × Nodes and Tags × Tags blocks; Fiedler vector polarity] β = 0: entirely ignore link information; β = 1: equal importance to blog-blog and blog-tag links; β >> 1: NCut. WebKDD ‘08
  18. 18. SimCUT: Clustering Tags and Graphs β = 0: entirely ignore link information; β = 1: equal importance to blog-blog and blog-tag links; β >> 1: NCut. [Figure panels: Clustering Only Links; Clustering Links + Tags] WebKDD ‘08
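The role of β can be illustrated with a small sketch. It assumes, as one plausible reading of the slides, that node-node links (scaled by β) and node-tag links are stacked into a single symmetric block matrix and that the sign of the Fiedler vector of its normalized Laplacian gives a two-way cut. The function name, the block layout and the toy data are assumptions for illustration, not the WebKDD ‘08 code.

```python
import numpy as np

def simcut_bipartition(links, tags, beta=1.0):
    """Two-way cut over nodes and tags together.
    links: n x n symmetric node-node matrix; tags: n x m node-tag matrix."""
    links = np.asarray(links, dtype=float)
    tags = np.asarray(tags, dtype=float)
    n, m = tags.shape
    # Assumed joint matrix: beta scales the link block; tag-tag block is empty.
    joint = np.block([[beta * links, tags],
                      [tags.T, np.zeros((m, m))]])
    degree = joint.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(degree > 0, degree, 1.0))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(n + m) - d_inv_sqrt[:, None] * joint * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)
    fiedler = vecs[:, 1] * d_inv_sqrt        # map back to the NCut relaxation
    side = fiedler > 0
    return side[:n], side[n:]

# Toy data (made up): blogs 0-1 and 2-3 are linked pairs joined by one weak
# link; blogs 0 and 1 use tag 0, blogs 2 and 3 use tag 1.
links = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
tags = np.array([[1, 0],
                 [1, 0],
                 [0, 1],
                 [0, 1]])
blog_side, tag_side = simcut_bipartition(links, tags, beta=1.0)
```

With `beta=0` the link block vanishes and only the blog-tag structure drives the cut; as `beta` grows, the link block dominates and the result approaches a plain NCut on the links.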
  19. 19. Datasets <ul><li>Citeseer (Getoor et al.) </li></ul><ul><ul><li>Agents, AI, DB, HCI, IR, ML </li></ul></ul><ul><ul><li>Words used in place of tags </li></ul></ul><ul><li>Blog data </li></ul><ul><ul><li>derived from the WWE/Buzzmetrics dataset </li></ul></ul><ul><ul><li>Tags associated with Blogs derived from </li></ul></ul><ul><ul><li>For dimensionality reduction, 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation) </li></ul></ul><ul><li>Pairwise similarity computed </li></ul><ul><ul><li>RBF Kernel for Citeseer </li></ul></ul><ul><ul><li>Cosine for blogs </li></ul></ul>
  20. 20. Clustering Tags and Graphs Clustering Only Links Clustering Links + Tags
  21. 21. Varying Scaling Parameter β Accuracy = 36% Accuracy = 62% Higher accuracy by adding ‘tag’ information Simple Kmeans ~23% Content only, binary Content only ~52% (Getoor et al. 2004) β >> 1 β=1 β=0 Accuracy = 39% Only Graph Only Tags Graphs & Tags
  22. 22. Effect of Number of Tags, Clusters <ul><li>Mutual Information </li></ul><ul><li>Measures the dependence between two random variables. </li></ul><ul><li>Compares results with ground truth </li></ul>[Figure: Citeseer] Link-only clustering has lower MI; more semantics helps. Similar results for real blog datasets.
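As a reminder of how mutual information compares a clustering with ground truth, here is a minimal sketch. It uses the standard definition over the joint label distribution, in bits; the function name and the toy labels are made up, and the slides do not say whether a normalized variant was used.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi

truth = [0, 0, 1, 1]
same_up_to_renaming = mutual_information(truth, [1, 1, 0, 0])
unrelated = mutual_information(truth, [0, 1, 0, 1])
```

Because the score depends only on how the two labelings co-occur, a clustering that matches the ground truth up to a renaming of cluster ids still gets the maximum score.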
  23. 23. Influence in Communities Communities detected using “Fast algorithm for detecting community structure in networks”, M.E. J. Newman
  24. 24. Authority and Popularity <ul><li>Authority </li></ul><ul><li>contributes to influence </li></ul><ul><li>Influence may be subjective. </li></ul><ul><li>A source that is authoritative in one community could influence another community negatively. </li></ul><ul><li>Within a community, an authoritative source is influential. </li></ul><ul><li>Popularity </li></ul><ul><li>Authority and popularity are often treated as the same thing </li></ul><ul><li>On blog search engines, authority is measured using inlinks, which is at best popularity </li></ul><ul><li>Popularity doesn’t mean influence </li></ul><ul><ul><li>Dilbert is extremely popular but not influential </li></ul></ul>
  25. 25. Link Polarity & Sentiment
  26. 26. Link Polarity and Bias <ul><li>Linking alone is not an indicator of influence </li></ul><ul><li>Polarity (+/- sentiment) indicates type of influence </li></ul><ul><li>Consistent negative/positive opinion indicates bias </li></ul><ul><li>Link polarity/citation signal helps determine trust </li></ul>[Figure: Democrat Blog and Republican Blog, with links labeled Strongly Negative Opinion, Mildly Negative Opinion, Strongly Positive Opinion]
  27. 27. Propagating Influence <ul><li>Based on the work of Guha et al. [1] on modeling the propagation of trust and distrust. Framework: </li></ul><ul><ul><li>$M_{ij}$ represents the influence/bias from user $i$ to user $j$ ($0 \le M_{ij} \le 1$) </li></ul></ul><ul><ul><li>$M_{ij}$ is initialized to the polarity from $i$ to $j$. </li></ul></ul><ul><ul><li>The belief matrix $M$ (sparse) represents the initial set of known beliefs </li></ul></ul><ul><ul><ul><li>The goal is to compute all unknown values in $M$ </li></ul></ul></ul><ul><ul><li>Belief matrix after the $i$-th atomic propagation </li></ul></ul><ul><ul><ul><ul><li>$M_{i+1} = M_i \, C_i$ </li></ul></ul></ul></ul><ul><ul><li>Combined operator </li></ul></ul><ul><ul><ul><ul><li>$C_i = a_1 M + a_2 M^T M + a_3 M^T + a_4 M M^T$ </li></ul></ul></ul></ul><ul><ul><ul><ul><li>$a = \{0.4, 0.4, 0.1, 0.1\}$ are the weighting factors </li></ul></ul></ul></ul><ul><ul><li>[1] Guha R, Kumar R, Raghavan P, Tomkins A. Propagation of trust and distrust. In: Proceedings of the Thirteenth International World Wide Web Conference, New York, NY, USA, May 2004. ACM Press, 2004. </li></ul></ul>
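The propagation step above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the combined operator is built once from the initial belief matrix, the number of steps is a free parameter (`steps` is a hypothetical name), and the three-blog matrix is made up.

```python
import numpy as np

def combined_operator(M, a=(0.4, 0.4, 0.1, 0.1)):
    """Weighted sum of the four terms in the combined operator:
    M, M^T M, M^T and M M^T, with weights a_1..a_4."""
    return a[0] * M + a[1] * (M.T @ M) + a[2] * M.T + a[3] * (M @ M.T)

def propagate(M, steps=1, a=(0.4, 0.4, 0.1, 0.1)):
    """Apply the belief update (belief <- belief @ C) `steps` times.
    Assumption: C is computed once from the initial beliefs M."""
    M = np.asarray(M, dtype=float)
    C = combined_operator(M, a)
    belief = M.copy()
    for _ in range(steps):
        belief = belief @ C
    return belief

# Made-up example: blog 0 links positively to blog 1, blog 1 to blog 2.
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
belief = propagate(M, steps=1)
```

After one step the entry for the pair (0, 2), which has no direct link, becomes nonzero: this is how a polarity gets assigned where no direct link exists.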
  28. 28. Recognizing subjectivity & sentiment <ul><li>We’ve developed ΔTFIDF as a simple feature-engineering technique to increase the accuracy of subjectivity detection and sentiment analysis </li></ul><ul><li>Our preliminary analysis shows that ΔTFIDF </li></ul><ul><ul><ul><li>Works well in different subject domains </li></ul></ul></ul><ul><ul><ul><li>Improves accuracy for documents of varying sizes: sentence fragments, sentences, paragraphs and multi-paragraph documents </li></ul></ul></ul><ul><ul><ul><li>Helps on text classification tasks other than sentiment analysis </li></ul></ul></ul>
  29. 29. Feature Engineering for Text Classification <ul><li>Typical features: words and/or phrases along with term frequency or (better) TF-IDF scores </li></ul><ul><li>ΔTFIDF amplifies the training set signals by using the ratio of the IDF for the negative and positive collections </li></ul><ul><li>Results in a significant boost in accuracy </li></ul>Text: The quick brown fox jumped over the lazy white dog. Features: the 2, quick 1, brown 1, fox 1, jumped 1, over 1, lazy 1, white 1, dog 1, the quick 1, quick brown 1, brown fox 1, fox jumped 1, jumped over 1, over the 1, the lazy 1, lazy white 1, white dog 1
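The unigram and bigram counts for the example sentence can be reproduced with a short sketch. The lowercasing and the simple regular-expression tokenizer are assumptions; the slides do not say how the text was tokenized.

```python
import re
from collections import Counter

def unigram_bigram_counts(text):
    """Count single words and adjacent word pairs in a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

features = unigram_bigram_counts(
    "The quick brown fox jumped over the lazy white dog.")
```

Here `features["the"]` is 2 and every other unigram and bigram appears once.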
  30. 30. ΔTFIDF BoW Feature Set <ul><li>Value of feature t in document d is </li></ul><ul><li>Where </li></ul><ul><ul><li>C t,d = count of term t in document d </li></ul></ul><ul><ul><li>N t = number of negative labeled training docs with term t </li></ul></ul><ul><ul><li>P t = number of positive labeled training docs with term t </li></ul></ul><ul><li>Normalize to avoid bias towards longer documents </li></ul><ul><li>Gives greater weight to rare (significant) words </li></ul><ul><li>Downplays very common words </li></ul><ul><li>Similar to Unigram + Bigram BoW in other aspects </li></ul>
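The formula itself did not survive in this transcript. A form consistent with the definitions above is $V_{t,d} = C_{t,d} \cdot \log_2(N_t / P_t)$, which is what the difference of the two IDF scores reduces to when the positive and negative training sets are the same size; treat this as an assumption rather than the slide's exact expression. The sketch below also adds one to both document counts to avoid division by zero, which is a choice made here and not stated in the source, and it leaves out the length normalization mentioned in the list.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def delta_tfidf(doc_tokens, pos_docs, neg_docs):
    """Assumed form: C_{t,d} * log2(N_t / P_t), with add-one smoothing."""
    pos_df = document_frequencies(pos_docs)   # P_t
    neg_df = document_frequencies(neg_docs)   # N_t
    counts = Counter(doc_tokens)              # C_{t,d}
    return {t: c * math.log2((neg_df[t] + 1) / (pos_df[t] + 1))
            for t, c in counts.items()}

# Made-up training data for illustration only.
pos_docs = [["great", "movie"], ["great", "fun"]]
neg_docs = [["dull", "movie"], ["dull", "plot"]]
scores = delta_tfidf(["great", "movie", "great"], pos_docs, neg_docs)
```

Under this sign convention, a term spread evenly across both classes ("movie") scores zero, while a term seen mostly in positive documents ("great") gets a large negative value; terms typical of negative documents would come out positive.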
  31. 31. Example: ΔTFIDF vs TFIDF vs TF (15 features with highest values for a review of City of Angels) <ul><li>Δtfidf | tfidf | tf </li></ul><ul><li>, city | angels | , </li></ul><ul><li>cage is | angels is | the </li></ul><ul><li>mediocrity | , city | . </li></ul><ul><li>criticized | of angels | to </li></ul><ul><li>exhilarating | maggie , | of </li></ul><ul><li>well worth | city of | a </li></ul><ul><li>out well | maggie | and </li></ul><ul><li>should know | angel who | is </li></ul><ul><li>really enjoyed | movie goers | that </li></ul><ul><li>maggie , | cage is | it </li></ul><ul><li>it's nice | seth , | who </li></ul><ul><li>is beautifully | goers | in </li></ul><ul><li>wonderfully | angels , | more </li></ul><ul><li>of angels | us with | you </li></ul><ul><li>Underneath the | city | but </li></ul>
  32. 32. Improvement over TFIDF (Uni- + Bi-grams) <ul><li>Movie Reviews: 88.1% Accuracy vs. 84.65% at 95% Confidence Interval </li></ul><ul><li>Subjectivity Detection (Opinionated or not): 91.26% vs. 89.4% at 99.9% Confidence Interval </li></ul><ul><li>Congressional Support for Bill (Voted for/against): 72.47% vs. 66.84% at 99.9% Confidence Interval </li></ul><ul><li>Enron Email Spam Detection (Spam or not): 98.917% vs. 96.6168% at 99.995% Confidence Interval </li></ul><ul><li>All tests used 10-fold cross-validation </li></ul><ul><li>At least as good as mincuts + subjectivity detectors on movie reviews (87.2%) </li></ul>
  33. 33. Link Polarity Experiments <ul><li>Domain </li></ul><ul><ul><li>Political Blogosphere </li></ul></ul><ul><ul><li>Dataset from Buzzmetrics [2] provides post-post link structure over 14 million posts </li></ul></ul><ul><ul><li>Few off-the-topic posts help aggregation </li></ul></ul><ul><ul><li>Potential business value </li></ul></ul><ul><li>Reference Dataset </li></ul><ul><ul><li>Hand-labeled dataset from Lada Adamic et al. [3] classifying political blogs into right- and left-leaning bloggers </li></ul></ul><ul><ul><li>Timeframe: 2004 presidential elections, over 1500 blogs analyzed </li></ul></ul><ul><ul><li>Overlap of 300 blogs between Buzzmetrics and reference dataset </li></ul></ul><ul><li>Goal </li></ul><ul><ul><li>Classify the blogs in the Buzzmetrics dataset as Democrat or Republican and compare with the reference dataset </li></ul></ul><ul><ul><li>[2] Buzzmetrics – </li></ul></ul><ul><ul><li>[3] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop </li></ul></ul>
  34. 34. Evaluation of Link Polarity Confusion Matrix <ul><li>Accuracy = 73% </li></ul><ul><li>True positive (Recall) = 78% </li></ul><ul><li>False positive (FP) = 31% </li></ul><ul><li>True negative (Recall) = 69% </li></ul><ul><li>False negative (FN) = 21% </li></ul><ul><li>Precision (R) = 75% </li></ul><ul><li>Precision (D) = 72% </li></ul>Polarity Improves Classification by almost 26%
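For reference, the metrics listed above follow from a 2×2 confusion matrix in the usual way. The sketch below uses made-up counts, since the confusion matrix itself is not in this transcript; which party is treated as the "positive" class is likewise an assumption of the example.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall_pos": tp / (tp + fn),      # true positive rate
        "recall_neg": tn / (tn + fp),      # true negative rate
        "precision_pos": tp / (tp + fp),
        "precision_neg": tn / (tn + fn),
    }

# Hypothetical counts for illustration only (not the experiment's data).
metrics = binary_metrics(tp=8, fp=4, tn=6, fn=2)
```

The false positive and false negative rates are the complements of the two recall values (`1 - recall_neg` and `1 - recall_pos`).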
  35. 35. Trust Propagation Sample Data <ul><li>Compensates for initial incorrect polarity ( DK–AT ) </li></ul><ul><li>Doesn’t change correct polarity ( AT-DK ) </li></ul><ul><li>Assigns correct polarity for non-existent direct links ( AT-IP ) </li></ul><ul><li>Numbers in italics are problematic ( MM-AT ) </li></ul><ul><ul><ul><li>Improve sentiment detection ? </li></ul></ul></ul>
  36. 36. MSM Classification Results
  37. 37. Interesting Observations <ul><li>24 of 27 sources correctly classified </li></ul><ul><ul><li>guardian, foxnews, humaneventsonline, mediamatters </li></ul></ul><ul><li>Outliers: “The Nation” & “Boston Globe” </li></ul><ul><li>Left- and right-leaning blogs talk negatively about “ny times” & “abc news” and positively about “raw story” and “examiner” </li></ul>
  38. 38. Identifying Bias using KL Divergence
  39. 39. Conclusion
  40. 40. Conclusion <ul><li>Using topic, social structure and opinions, we can develop a model for influence, bias and trust in social media </li></ul><ul><li>We apply this framework to real-world data and describe techniques for identifying influence </li></ul><ul><li>Splogs are a big issue – we have developed efficient techniques to detect them in near real time </li></ul><ul><li>Does the game-theoretic nature of this system raise fundamental new challenges for data mining? </li></ul>
  41. 41. Assets: Good, Bad and Wanted <ul><li>How were the assets (data, APIs) helpful? </li></ul><ul><li>Where did these assets fail to be helpful, and why? </li></ul><ul><ul><li>Since we go “beyond search”, search data is not that useful </li></ul></ul><ul><li>Which research questions would you like to address if you had unlimited access to assets? </li></ul><ul><ul><li>Unlimited livespaces link and content data to validate some of our approaches. </li></ul></ul><ul><ul><li>Use to place ads on social media sites </li></ul></ul>