
Frontiers of Computational Journalism week 3 - Information Filter Design


Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218

Published in: Education


  1. 1. Frontiers of Computational Journalism Columbia Journalism School Week 3: Information Filter Design September 26, 2016
  2. 2. This class • The need for information filtering • Filtering algorithms • Human-machine filters • Filter bubbles and other problems • The filter design problem
  3. 3. The Need for Filtering
  4. 4. More video on YouTube than produced by TV networks during entire 20th century.
  5. 5. 10,000 legally-required reports filed by U.S. public companies every day
  6. 6. Each day, the Associated Press publishes: ~10,000 text stories ~3,000 photographs ~500 videos + radio, interactive…
  7. 7. Comment Ranking
  8. 8. Comment voting Problem: putting comments with most votes at top doesn’t work. Why?
  9. 9. Old reddit comment ranking “Hot” algorithm. Score is up votes minus down votes, plus time decay
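As a rough illustration of "up minus down votes plus time decay", the sketch below follows the form widely reported for Reddit's "hot" ranking. The 45000-second scale, the reference epoch, and the function name `hot_score` are assumptions taken from that description, not from the slide.

```python
import math

# Assumed constants (not stated on the slide).
REFERENCE_EPOCH = 1134028003   # seconds since the Unix epoch
SECONDS_PER_POINT = 45000      # ~12.5 hours of age is worth one order of magnitude of votes


def hot_score(ups: int, downs: int, posted_at: float) -> float:
    """Combine net votes (log-scaled) with a bonus for newer items."""
    net = ups - downs
    # Log scale: the first 10 net votes count as much as the next 90.
    order = math.log10(max(abs(net), 1))
    sign = 1 if net > 0 else -1 if net < 0 else 0
    # Newer items get a larger time term, so older items sink over time.
    age_term = (posted_at - REFERENCE_EPOCH) / SECONDS_PER_POINT
    return round(sign * order + age_term, 7)
```

Under these assumptions, an item posted 45000 seconds later needs ten times fewer net votes to reach the same score, which is the "time decay" in practice.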
  10. 10. Reddit Comment Ranking (new) Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes. Example: N = 16, v = 11, p = 11/16 = 0.6875
  11. 11. Reddit Comment Ranking Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n. Example: n = 3, v’ = 1, p’ = 1/3 ≈ 0.333
  12. 12. Reddit Comment Ranking Limited sampling can rank comments wrongly when we don’t have enough data. One comment: p’ = 0.333, p = 0.6875. Another comment: p’ = 0.75, p = 0.1875.
  13. 13. Confidence interval 1-𝛼 probability that the true value p will lie within the central region (when sampled assuming p=p’)
  14. 14. Rank comments by lower bound of confidence interval. p’ = observed proportion of upvotes; n = how many people voted; zα = how certain we want to be before we assume that p’ is “close” to true p. The analytic solution for the confidence interval is known as the “Wilson score”. Source: How Not To Sort By Average Rating, Evan Miller
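The lower bound of the Wilson score interval can be computed directly from p’, n and z. This is a minimal sketch; the function name `wilson_lower_bound` and the default z = 1.96 (roughly 95% confidence) are choices made here for illustration, not taken from the slides.

```python
import math


def wilson_lower_bound(upvotes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed proportion.

    upvotes: number of up-votes observed (v')
    n:       number of people who voted
    z:       normal quantile for the desired confidence (assumed default: 1.96)
    """
    if n == 0:
        return 0.0  # no votes yet: rank at the bottom
    p_hat = upvotes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


# Sorting comments given as (upvotes, total votes) pairs, best first.
comments = [(1, 1), (11, 16), (100, 100), (1, 3)]
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*c), reverse=True)
```

With few votes the interval is wide, so a comment with 1 of 1 up-votes ranks below one with 11 of 16, even though its observed proportion is higher.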
  15. 15. User-item Recommendation
  16. 16. User-item matrix Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...
  17. 17. User-item matrix • No content analysis. We know nothing about what is “in” each item. • Typically very sparse – a user hasn’t watched even 1% of all movies. • Filtering problem is guessing “unknown” entry in matrix. High guessed values are things user would want to see.
  18. 18. Filtering process Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
  19. 19. How to guess unknown rating? Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase. o “Users who bought A also bought B...” o “Users who clicked A also clicked B...” o “Users who shared A also shared B...”
  20. 20. Similar items Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
  21. 21. Item similarity Cosine similarity!
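Cosine similarity between two items can be sketched by treating each item as the vector of ratings it received, one entry per user. The helper name `cosine_similarity` is hypothetical, and storing unknown ratings as 0 is a simplifying assumption made here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody rated is similar to nothing
    return dot / (norm_a * norm_b)


# Columns of a small user-item matrix: one list per item, one entry per user.
item_a = [5, 3, 0, 1]
item_b = [4, 0, 0, 1]
similarity = cosine_similarity(item_a, item_b)
```

The result depends only on the direction of the vectors, so an item rated by many users is not automatically "more similar" than one rated by few.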
  22. 22. Other distance measures “adjusted cosine similarity” Subtracts average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”)
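A minimal sketch of adjusted cosine similarity, assuming ratings are stored as a dict mapping each user to a dict of {item: rating} and that only users who rated both items contribute. The function name `adjusted_cosine` and the sample data are hypothetical.

```python
import math
from typing import Dict

Ratings = Dict[str, Dict[str, float]]  # user -> {item: rating}


def adjusted_cosine(ratings: Ratings, item_i: str, item_j: str) -> float:
    """Cosine similarity after subtracting each user's own average rating."""
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - mean
            dj = user_ratings[item_j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / math.sqrt(den_i * den_j)


# One harsh rater and one enthusiastic rater who agree on the ordering of items.
sample = {
    "harsh": {"A": 1, "B": 2, "C": 3},
    "keen": {"A": 3, "B": 4, "C": 5},
}
```

In this sample both users put A below their own average and C above it, so A and C come out as opposites even though the raw numbers from the two users look very different.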
  23. 23. Generating a recommendation Weighted average of item ratings by their similarity.
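The weighted average can be sketched as follows: the guess for an unrated item is the user's existing ratings, each weighted by how similar that rated item is to the target. The function name `predict_rating` and the layout of the `similarities` lookup are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def predict_rating(
    user_ratings: Dict[str, float],
    similarities: Dict[Tuple[str, str], float],
    target_item: str,
) -> Optional[float]:
    """Similarity-weighted average of the user's ratings of other items.

    similarities maps (target_item, other_item) to a similarity score.
    Returns None when no rated item has a known similarity to the target.
    """
    num = den = 0.0
    for item, rating in user_ratings.items():
        if item == target_item:
            continue
        s = similarities.get((target_item, item), 0.0)
        num += s * rating
        den += abs(s)
    return num / den if den else None


# The user loved A and disliked B; C is much more similar to A than to B.
guess = predict_rating({"A": 5, "B": 1}, {("C", "A"): 0.9, ("C", "B"): 0.1}, "C")
```

Because C resembles A far more than B, the guess lands close to the user's rating of A.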
  24. 24. Matrix factorization recommender
  25. 25. Matrix factorization recommender Note: only sum over observed ratings rij.
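The objective itself is not in this transcript, so the sketch below assumes the usual regularized squared-error form: each user i and item j get a short vector, the predicted rating is their dot product, and the error is summed only over observed ratings rij. The function names, learning rate, regularization weight and toy data are all illustrative assumptions.

```python
import random
from typing import List, Tuple

Observation = Tuple[int, int, float]  # (user index i, item index j, rating r_ij)


def predict(U: List[List[float]], V: List[List[float]], i: int, j: int) -> float:
    """Predicted rating: dot product of user vector and item vector."""
    return sum(a * b for a, b in zip(U[i], V[j]))


def factorize(observed: List[Observation], n_users: int, n_items: int,
              k: int = 2, lr: float = 0.02, reg: float = 0.02,
              epochs: int = 500, seed: int = 0):
    """Stochastic gradient descent over observed ratings only."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in observed:          # unobserved entries are never visited
            err = r - predict(U, V, i, j)
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V


# Toy data: 3 users, 3 items, 6 of the 9 ratings observed.
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U, V = factorize(observed, n_users=3, n_items=3)
guess_for_missing = predict(U, V, 0, 2)  # user 0 never rated item 2
```

The learned vectors give a prediction for every user-item pair, including the three pairs that were never observed, which is what makes this usable as a recommender.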
  26. 26. Matrix factorization plate model. Diagram labels: r = user rating of item; u = topics for user; v = topics for item; λu = variation in user topics; λv = variation in item topics; plates over i users and j items.
  27. 27. New York Times recommender
  28. 28. Different Filtering Systems Content: Newsblaster analyzes the topics in the documents. No concept of users. Social: What I see on Twitter is determined by who I follow. Reddit comments are filtered with votes as input. Amazon “people who bought X also bought Y” uses no content analysis. Hybrid: Recommend based on both content and user behavior.
  29. 29. Combining collaborative filtering and topic modeling Collaborative Topic Modeling for Recommending Scientific Articles, Wang and Blei
  30. 30. Content modeling - LDA. Diagram labels: topic concentration parameter; topics in doc; topic for word; word in doc; words in topics; word concentration parameter; plates over K topics, D docs, and N words in doc.
  31. 31. Collaborative Topic Modeling. Diagram labels: K topics; topic for word; word in doc; topics in doc (content); topic concentration; weight of user selections; variation in per-user topics; topics for user; user rating of doc; topics in doc (collaborative).
  32. 32. content only content + social
  33. 33. Filtering News on Twitter
  34. 34. Reuters News Tracer Filter Cluster into events Searches and Alerts Score veracity & newsworthy
  35. 35. Liu et al., Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  36. 36. Liu et al., Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  37. 37. Liu et al., Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  38. 38. Human-Machine Filters
  39. 39. TechMeme / MediaGazer
  40. 40. Facebook trending (with editors)
  41. 41. Facebook trending (without editors)
  42. 42. Facebook “trending review tool” screenshot from leaked documents
  43. 43. Approve or Reject: Can You Moderate Five New York Times Comments?
  44. 44. Revealed: Facebook's internal rulebook on sex, terrorism and violence, The Guardian
  45. 45. Facebook’s “Community Standards” document
  46. 46. Filter bubbles and other problems
  47. 47. Graph of political book sales during 2008 U.S. election, by orgnet.org. From Amazon "users who bought X also bought Y" data.
  48. 48. Retweet network of political tweets. Political Polarization on Twitter, Conover et al.
  49. 49. Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli (Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple) Gilad Lotan, Betaworks
  50. 50. The Filter Bubble What people care about politically, and what they’re motivated to do something about, is a function of what they know about and what they see in their media. ... People see something about the deficit on the news, and they say, ‘Oh, the deficit is the big problem.’ If they see something about the environment, they say the environment is a big problem. This creates this kind of a feedback loop in which your media influences your preferences and your choices; your choices influence your media; and you really can go down a long and narrow path, rather than actually seeing the whole set of issues in front of us. - Eli Pariser, How do we recreate a front-page ethos for a digital world?
  51. 51. Are filters causing our bubbles? Increasing U.S. polarization predates Internet by decades.
  52. 52. Is the Internet Causing Political Polarization? Evidence from Demographics Boxell, Gentzkow, Shapiro Polarization increasing fastest among those who are online the least
  53. 53. Exposure to Diverse Information on Facebook, Eytan Bakshy, Lada Adamic, Solomon Messing Will you see diverse content vs. will you click it?
  54. 54. Filter Design
  55. 55. Item Content My Data Other Users’ Data Text analysis, topic modeling, clustering... who I follow what I’ve read/liked social network structure, other users’ likes
  56. 56. Filter design problem Formally, given U = user preferences, history, characteristics; S = current story; {P} = results of function on previous stories; {B} = background world knowledge (other users?). Define r(S,U,{P},{B}) in [0...1], the relevance of story S to user U
  57. 57. Filter design problem, restated When should a user see a story? Aspects to this question: normative (personal: what I want; societal: emergent group effects); UI (how do I tell the computer what I want?); technical (constrained by algorithmic possibility); economic (cheap enough to deploy widely)
  58. 58. “Conversational health” Measuring the health of our public conversations, Cortico.ai
  59. 59. Exposure diversity as a design principle for recommender systems, Natali Helberger
  60. 60. How to evaluate/optimize?
  61. 61. How to evaluate/optimize? • Netflix: try to predict the rating that the user gives a movie after watching it. • Amazon: sell more stuff. • Google, Facebook: human raters A/B test every change (but what do they optimize for?)
  62. 62. How to evaluate/optimize? • Does the user understand how the filter works? • Can they configure it as desired? • Controls for abuse and harassment • Can it be gamed? Spam, "user-generated censorship," etc.
  63. 63. Information diet The holy grail in this model, as far as I’m concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn’t encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn’t necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes. - Ethan Zuckerman, Playing the Internet with PMOG
