Measuring News Similarity Across Ten U.S. News Sites

Based on long paper accepted at iPRES 2018. Preprint available at: https://arxiv.org/abs/1806.09082

Published in: News & Politics

  1. Measuring News Similarity Across Ten U.S. News Sites
     Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson (@grantcatkins, @WebSciDL)
     Old Dominion University, Web Science & Digital Libraries Research Group
     iPRES 2018, Boston, Massachusetts, September 25, 2018
  2. The editorial decision
     Figure: ABC News homepage, December 24, 2016
  3. The editorial decision
     Figure: ABC News homepage and USA Today homepage, December 24, 2016
  4. Purpose of our experiment
     • Investigate how synchronized news sites are
     • Demonstrate a method of mining archived news sites
     • Detail the difficulties of retrieving top news in news sites and web archives
  5. Homepage formatting tells a better tale
     • Intuitive for which story is the top story
     • Subsequent stories are labeled by the news site
     Figure: USA Today homepage, December 24, 2016
  6. Internet Archive to the rescue
     • Oldest and largest web archive, more likely to have multiple copies
     • Memento compliant
     • Links rewritten to receive stories closest to the page’s Memento-Datetime
     • Not limited to only one news site
     https://mementoweb.org/guide/rfc/#rfc.section.2.2.1
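The datetime-based access described on slide 6 can be sketched as follows. This is a minimal illustration, not the authors' collection script: it assumes the Wayback Machine's public URL pattern (`https://web.archive.org/web/<14-digit timestamp>/<URI>`) and the `Accept-Datetime` header format from the Memento RFC. The function names are hypothetical, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed public Wayback Machine prefix; not taken from the slides.
WAYBACK_PREFIX = "https://web.archive.org/web"


def to_utc(when):
    """Convert a timezone-aware datetime to UTC; reject naive datetimes."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return when.astimezone(timezone.utc)


def accept_datetime(when):
    """Format a datetime as an Accept-Datetime header value (RFC 1123, GMT)."""
    return format_datetime(to_utc(when), usegmt=True)


def wayback_uri(original_uri, when):
    """Build a Wayback-style URI that asks for the capture closest to `when`."""
    return f"{WAYBACK_PREFIX}/{to_utc(when):%Y%m%d%H%M%S}/{original_uri}"


if __name__ == "__main__":
    target = datetime(2016, 12, 24, 12, 0, 0, tzinfo=timezone.utc)
    print(accept_datetime(target))
    print(wayback_uri("http://abcnews.go.com/", target))
```

The archive answers such a request with the capture nearest the requested time, which is why the capture actually returned can differ from the datetime asked for.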
  7. News sites host their web archives
     • Only two copies of articles
       • Live version
       • Archived version (time of publishing)
     • Homepages archived only once per day
     • All links point to the live web
     • Most news sites do not retain their own web archive
     • Does not conform to the Memento Protocol
     https://archive.nytimes.com/
  8. CNN: JS prohibits playback
     http://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
  9. WP: broken stylesheet
  10. FT: paywall in place
     http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
  11. Selecting ten U.S. news sites
     Figure: Memento counts for news site homepages from November 2016 to January 2017
  12. Other news sites considered
     • MSNBC
       • A majority of top news stories linked to videos, not textual content
     • Wall Street Journal
       • Partial stories followed by a subscription message
     • CNN
       • Became unreplayable in the Internet Archive on November 1, 2016
     • Financial Times
       • Almost all stories locked behind a paywall
     http://ws-dl.blogspot.com/2018/03/2018-03-15-paywalls-in-internet-archive.html
  13. Measuring synchronicity requires snapshots from the same time
     Figure: Memento creation times from November 2016 to January 2017
  14. Temporal distance for mementos retrieved
     • We can only get homepage Mementos for the times the Internet Archive has collected them
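Temporal distance here is the gap between the datetime requested and the datetime at which the archive actually captured the page. A minimal sketch, assuming the capture time is read from a `Memento-Datetime` response header in RFC 1123 format; the function name and the sample times are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def temporal_distance(requested, memento_datetime_header):
    """Absolute gap between the requested time and the capture time.

    `requested` must be timezone-aware; the header value is expected to end
    in "GMT", as Memento-Datetime headers do.
    """
    captured = parsedate_to_datetime(memento_datetime_header)
    return abs(captured - requested)


if __name__ == "__main__":
    wanted = datetime(2016, 12, 24, 1, 0, 0, tzinfo=timezone.utc)
    print(temporal_distance(wanted, "Sat, 24 Dec 2016 01:07:30 GMT"))
```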
  15. Parsing the homepages
     • Developed custom parsers for the 10 news sites
     • Collected top stories limited to k = 10 stories per site
     • Ignored opinion stories not in line with main content
     https://github.com/oduwsdl/top-news-selectors
     Figure: New York Times homepage, November 1, 2016
  16. Hero Stories (k = 1)
     • Prominent top stories emphasized by:
       • Large font
       • Central placement
     • Identified by:
       • Position
       • Font size
       • Image size (if one exists)
     Figures: CBS News homepage, January 1, 2017; NPR homepage, January 1, 2017
  17. CSS naming conventions can self-identify top stories in HTML
  18. Creating CSS rules
     NBC News homepage container: div.row.js-top-stories-content
     • Hero Story CSS rule: .js-top-stories-content .panel-txt a
     • Top Stories CSS rule: .js-top-stories-content div .story-link .media-body > a
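To illustrate the idea behind these rules, here is a standard-library sketch that collects links nested inside an element carrying a given class. It only approximates a descendant selector such as `.js-top-stories-content a`; it is not the parser in the linked repository, does not implement full CSS selector matching, and assumes well-formed nesting. The class and function names other than `js-top-stories-content` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinksInsideClass(HTMLParser):
    """Collect href values of <a> tags nested inside an element with a given class."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.stack = []  # one flag per open element: does it carry the class?
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and any(self.stack) and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            classes = (attrs.get("class") or "").split()
            self.stack.append(self.container_class in classes)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def top_story_links(html, container_class="js-top-stories-content", k=10):
    """Return up to k story links, in document order, from inside the container."""
    parser = LinksInsideClass(container_class)
    parser.feed(html)
    parser.close()
    return parser.links[:k]


if __name__ == "__main__":
    sample = (
        '<div class="row js-top-stories-content">'
        '<div class="panel-txt"><a href="/hero">Hero</a></div>'
        '<div><a href="/second">Second</a><img src="x.png"></div>'
        '</div><a href="/footer">Footer</a>'
    )
    print(top_story_links(sample))
```

Because links are returned in document order, the first link plays the role of the Hero Story only when the page markup places it first, which is an assumption about the page rather than something the sketch checks.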
  19. Can’t always get 10 stories
  20. Ordering is often clear
  21. Order is ambiguous
     Figure: New York Times homepage, November 1, 2016
  22. Special events can break parsers
     Figure: USA Today, New York Times, and LA Times homepages, November 8, 2016 (Election Day)
  23. Extracting story text
     • Request story given an archived story URI
     • Render textual content and save output
     • Clean saved text by removing navigational HTML, JavaScript, and text outside story content via boilerplate removal
     http://ws-dl.blogspot.com/2017/03/2017-03-20-survey-of-5-boilerplate.html
  24. Quantifying news similarity
     • Similarity score: a value between 0 and 1 indicating the degree of similarity of the text content of the news stories (cosine similarity)
     • 0: no similarity; documents without any common vocabulary
     • 1: maximum similarity; duplicate documents
  25. Quantifying news similarity example (colors = topics, numbers = stories)
     Topic: Roy Moore Wins
       1 “Donald Trump Congratulates Roy Moore for Primary Win”
       2 “Trump offers congratulations to Roy Moore”
       3 “Roy Moore wins Alabama Senate GOP primary runoff”
     Topic: Hurricane Harvey
       4 “Harvey Puts Houston Underwater”
       5 “Hurricane Harvey intensifies to Category 2 storm”
       6 “Harvey Puts Houston Underwater”
     Topic: Vegas Shooting
       7 “Mass Shooting in Las Vegas”
       8 “Mass Shooting Outside Las Vegas’ Mandalay Bay”
       9 “Las Vegas shooting: What we know”
  26. Quantifying news similarity example (same nine stories and topics as slide 25)
     Figure: the nine stories grouped into three collections, with collection similarity scores 0.42, 0.61, and 0.70
  27. Quantifying news similarity example (same nine stories and topics as slide 25)
     Figure: all nine stories as a single collection, with collection similarity score 0.29
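A minimal sketch of how a collection score between 0 and 1 could be computed with cosine similarity. The slides do not state the term weighting or how pairwise scores are combined, so two choices here are assumptions: raw term counts as the vectors, and the mean over all story pairs as the collection score. The function names are hypothetical, and the sketch is not expected to reproduce the scores shown on the slides.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def collection_similarity(texts):
    """Mean pairwise cosine similarity of a collection (assumed aggregation)."""
    vectors = [term_counts(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    harvey = [
        "Harvey Puts Houston Underwater",
        "Hurricane Harvey intensifies to Category 2 storm",
        "Harvey Puts Houston Underwater",
    ]
    print(round(collection_similarity(harvey), 4))
```

With non-negative term counts the score stays within the 0 to 1 range described on slide 24: documents with no shared vocabulary score 0, and duplicates score 1 up to floating-point rounding.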
  28. Maximum of k stories per news site
     • Limit stories to a maximum of k stories from each news site
     • When k = 1, there is a maximum of 10 stories: the Hero Story from each news site
     • When k = 3, there is a maximum of 30 stories
     • When k = 10, there is a maximum of 100 stories
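The limit can be sketched in a few lines. The data layout (a mapping from site name to an ordered list of story links) and the function names are assumptions made for illustration.

```python
def limit_stories(stories_by_site, k):
    """Keep at most k top stories per site, preserving homepage order."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {site: stories[:k] for site, stories in stories_by_site.items()}


def collection_size(limited):
    """Total number of stories across all sites after limiting."""
    return sum(len(stories) for stories in limited.values())


if __name__ == "__main__":
    # Hypothetical input: one site yields 12 links, another only 4.
    parsed = {
        "site-a": [f"/a/{i}" for i in range(12)],
        "site-b": [f"/b/{i}" for i in range(4)],
    }
    print(collection_size(limit_stories(parsed, 10)))
```

The figures on the slide are upper bounds: a site that yields fewer than k stories contributes only what it has, so a collection can be smaller than 10 × k.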
  29. Hero Stories (k = 1)
     • High variability
     • 10 stories’ worth of vocabulary
     • Somewhat difficult to identify significant events
     Max similarity: 0.5037; mean similarity: 0.2858; min similarity: 0.1268
     Marked events: a) Election Day (November 8, 2016); b) Thanksgiving Day (November 24, 2016); c) Christmas Day (December 25, 2016); d) Travel Ban comes into effect (January 27, 2017)
  30. Three stories from each news site (k = 3)
     • Build-up to significant events more transparent
     Max similarity: 0.3566; mean similarity: 0.2160; min similarity: 0.1248
     Marked events: a) to d) as on slide 29
  31. Lowest similarity but clearest synchronicity (k = 10)
     • Decline and rise of story synchronicity transparent
     Max similarity: 0.2786; mean similarity: 0.1608; min similarity: 0.1150
     Marked events: a) to d) as on slide 29
  32. Similarity goes down as number of stories goes up
     Marked events: a) to d) as on slide 29
  33. Travel Ban: highest similarity (January 29, 2017)
     • Similarity score is 0.5037 when k = 1
     • Highest similarity score regardless of k value
  34. Did not find national holiday synchronicity
     • Overshadowed by:
       • Continuing political stories
       • Sudden tragedies
     • Interpreting synchronicity requires justification via web archives
     Figures: CBS homepage, December 25, 2016 (Christmas Day); New York Times homepage, November 11, 2016 (Veterans Day)
  35. What we found
     • Similarity values peak after a significant event starts
     • Events not known in advance have a delay in synchronization
     • Introducing more stories generally means similarity goes down
     • Political events are more likely to have higher similarity than national holidays, based on our dataset
  36. Future work
     • Extend date range of experiment
     • Check news similarity multiple times per day (3AM, 12PM, etc.)
     • Compare aggregated archived news in quality
     • Analyze how splash titles of homepages differ from actual article titles
  37. Takeaway
     • Using CSS selectors we can mine top archived news stories
     • Story position, font size, and image size on a homepage aid researchers in determining ranking of stories
     • Cosine similarity can be used to evaluate a collection of news stories
     • USA Today highly values Christmas as a Hero story
  38. Measuring News Similarity Across Ten U.S. News Sites
     • Parser: https://github.com/oduwsdl/top-news-selectors
     • Dataset: https://github.com/grantat/news-similarity
     • Data Collection & Visualization Scripts: https://github.com/grantat/news-similarity-core
     • Preprint: https://arxiv.org/abs/1806.09082
  39. Supplementary Slides
  40. Problems with finding “top news”
     • RSS feeds are sorted by publish date
     • We can’t go back in time with RSS
     • No APIs for supplying ranked stories
     https://abcnews.go.com/abcnews/topstories
  41. Coverage beyond targeted timeline
     • Our parser fails to cover these days
