Social Media News Mining and Automatic Content Analysis of News

Presentation at the Tow Center for Digital Journalism, Columbia University. November 14th, 2013.

VIDEO: http://new.livestream.com/accounts/1079539/events/2542929

http://towcenter.org/events/conversation-with-carlos-castillo/


Transcript

  • 1. Social Media News Mining. Carlos Castillo (@ChaToX), Gilad Lotan (@gilgul)
  • 2. Social Media News Mining & Automatic Content Analysis of News Carlos Castillo – Qatar Computing Research Institute Nov 14th, 2013
  • 3. Outline • Social media around news: 1. Predictive analytics using social media; 2. Crowds and curators • Automatic content analysis of news: 3. TV news via closed captions; 4. Online news in international media. Carlos Castillo – chato@acm.org http://www.chato.cl/research/
  • 4. Communication scholars vs. Computer scientists • Media and communication scholars – Start from high-level questions • Computer scientists – Start from low-level observations • We need to find a middle ground – To a large extent, we are still not there – I am certainly still not there
  • 5. Collaborators • Gianmarco de Francisci Morales – Yahoo! • Mohammed El-Haddad – Al Jazeera • Sandra González-Bailón – University of Pennsylvania • Nasir Khan – Al Jazeera • Mounia Lalmas – Yahoo! • Janette Lehmann – Pompeu Fabra University & Yahoo! • Marcelo Mendoza – Yahoo! • Jürgen Pfeffer – CMU • Matt Stempeck – MIT Civic Media • Diego Sáez-Trumper – Pompeu Fabra University • Ethan Zuckerman – MIT Civic Media
  • 6. Topic 1 of 4: Predictive analytics using social media. Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck: Characterizing the Life Cycle of Online News Stories Using Social Media Reactions. To appear in Proc. of Computer Supported Cooperative Work and Social Computing (CSCW). Baltimore, MD, USA. February 2014. See also: demo at http://fast.qcri.org/
  • 7. Pirates abduct ship’s crew off Nigerian coast October 17th, 2012
  • 8. Usage analysis (in news) online • Aikat (1998) – Bursts, short dwell times, weekday != weekend • Crane and Sornette (2008), Yang and Leskovec (2011), Lehmann et al. (2012) – Behavioral classes of attention online • Lotan, Gaffney, and Meyer (SocialFlow, 2011) – Al Jazeera, BBC, CNN, The Economist, Fox News, NY Times • … and many others!
  • 9. News In-Depth
  • 10. News examples: ● Dozens killed in India bus-crash blaze (Oct 30th, 2013) ● Kenyan army admits soldiers looted mall (Oct 30th, 2013). In-Depth examples: ● Sex selective abortions worry Azerbaijanis (Oct 29th, 2013) ● Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)
  • 11. News: intense first hour. In-Depth: longer shelf-life.
  • 12. Average visitation/sharing profiles (figure panels: News, In-Depth)
  • 13. Types of news visitation profiles (12 h): Decreasing (78%), Steady (9%), Increasing (3%), Rebounding (10%)
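
The four profile types on this slide can be illustrated with a small rule-based labeller. This is only a sketch: the slides do not say how the profiles were assigned, so the function name `classify_profile`, the half-versus-half comparison, and the `tolerance` and `rebound_factor` thresholds are all assumptions made for illustration.

```python
def classify_profile(visits, tolerance=0.2, rebound_factor=2.0):
    """Label a series of visit counts (e.g. one value per hour for 12 hours).

    All thresholds are hypothetical; they are not taken from the talk.
    """
    if len(visits) < 4:
        raise ValueError("need at least 4 observations")
    half = len(visits) // 2
    first, second = sum(visits[:half]), sum(visits[half:])
    # Second half clearly busier than the first half.
    if second > first * (1 + tolerance):
        return "increasing"
    # Second half roughly as busy as the first half.
    if second >= first * (1 - tolerance):
        return "steady"
    # Traffic fell overall; look for a second peak after a trough.
    trough = visits[0]
    for v in visits[1:]:
        if v >= rebound_factor * max(trough, 1):
            return "rebounding"
        trough = min(trough, v)
    return "decreasing"


if __name__ == "__main__":
    print(classify_profile([100, 60, 40, 30, 20, 15, 10, 8, 6, 5, 4, 3]))
    print(classify_profile([100, 50, 20, 10, 5, 5, 40, 30, 10, 5, 3, 2]))
```

A real system would more likely cluster the normalized time series than hand-pick thresholds; the rules here only make the four labels concrete.
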
  • 14. Prediction of visits • Short-term traffic is to a large extent correlated with long-term traffic • Social media signals are correlated with traffic and shelf-life – More reactions → more traffic – More discussion → longer shelf-life • Can we predict 7 days after 30 minutes?
  • 15. Results (traffic predictions): improved predictions using social media variables
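
The idea of forecasting 7-day traffic from the first 30 minutes can be sketched as a one-variable regression in log space. This is a minimal illustration, not the model from the talk: the log-log linear form, the function names `fit_loglog` and `predict_total`, and the toy numbers are assumptions, and the social media variables mentioned on the slide are left out.

```python
import math


def fit_loglog(early_visits, total_visits):
    """Least-squares fit of log1p(total) = intercept + slope * log1p(early)."""
    xs = [math.log1p(e) for e in early_visits]
    ys = [math.log1p(t) for t in total_visits]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_total(early, slope, intercept):
    """Predict long-term visits from visits observed in the early window."""
    return math.expm1(intercept + slope * math.log1p(early))


if __name__ == "__main__":
    # Made-up training pairs: (visits in first 30 min, visits after 7 days).
    slope, intercept = fit_loglog([9, 99, 999], [99, 9999, 999999])
    print(predict_total(99, slope, intercept))
```

Adding social media reactions would turn this into a multiple regression with one more log-transformed column per signal.
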
  • 16. http://fast.qcri.org/
  • 17. Predictions are updated as new information arrives. Predictive models are re-trained every 24 hours. Traffic to many (but not all) articles is easy to predict. Don't remove over-achievers, promote under-achievers. http://fast.qcri.org/
  • 18. Take-home messages • Decrease; Stay or Increase; Rebound – Roughly 80:10:10 ratio in first 12 hours • News vs In-Depth: different behavior – News pieces die out rapidly on the web – In-Depth pieces live longer • Visit forecasting can help make more informed editorial decisions
  • 19. Topic 2 of 4: News crowds and news curators in social media. Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Transient News Crowds in Social Media. In Proc. of International Conference on Weblogs and Social Media. Cambridge, MA, USA, July 2013. See also: blog post. Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Finding News Curators in Twitter. Social News on the Web (SNOW) workshop. Rio de Janeiro, Brazil, May 2013. See also: blog post.
  • 20. Social media users that are highly engaged with news
  • 21. Transient News Crowds
  • 22. Empirical results • Experiment with articles in BBC and AJE • People who tweeted an article within 6 hours of publication → news crowd – Follow the crowd for one week – Divide time into 12-hour slices • Most crowds disperse rapidly – They tweeted once about the same thing – Now they tweet about different things • Some crowds re-group later
  • 23. Syria allows UN to step up food aid (13 Jan 2013); French troops launch ground combat in Mali (13 Jan 2013)
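
The crowd definition on this slide (users who tweeted an article within 6 hours of publication, followed for a week in 12-hour slices) can be sketched as below. The tuple layouts and the function names `news_crowd` and `slice_activity` are assumptions for illustration; the slides do not describe the actual data structures.

```python
HOUR = 3600  # seconds


def news_crowd(article_tweets, published_at, window_hours=6):
    """Users who tweeted the article within `window_hours` of publication.

    article_tweets: iterable of (user, unix_timestamp) pairs (assumed layout).
    """
    cutoff = published_at + window_hours * HOUR
    return {user for user, ts in article_tweets if published_at <= ts <= cutoff}


def slice_activity(crowd, tweets, published_at, slice_hours=12, days=7):
    """Bucket later tweets by crowd members into fixed-length time slices.

    tweets: iterable of (user, unix_timestamp, text) triples (assumed layout).
    Returns a list with one list of (user, text) per slice.
    """
    n_slices = days * 24 // slice_hours
    slices = [[] for _ in range(n_slices)]
    for user, ts, text in tweets:
        if user not in crowd:
            continue
        index = (ts - published_at) // (slice_hours * HOUR)
        if 0 <= index < n_slices:
            slices[index].append((user, text))
    return slices
```

Comparing the text in each slice with the original story is then one way to see whether the crowd has dispersed or re-grouped.
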
  • 24. How do we find the related ones? • Machine-learning approach • Important attributes – Text similarity to original story – Exclusivity of history to this crowd • Finds 14% to 72% of related stories automatically (@ 2/3 precision)
  • 25. Application to tracking a story
  • 26. Focus on articles → focus on users. Example: which users with a large number of followers tweeted "Syria allows UN to step up food aid" (16 Jan 2013)? Twitter user (followers; tweets about ...): @RevolutionSyria (88,122; Syria), @KenanFreeSyria (13,388; Syria), @UP_food (703; Food), @KeriJSmith (8,838; Breaking news/top stories), @BreakingNews (5,662,866; Breaking news/top stories)
  • 27. News curators • Think Andy Carvin @acarvin, who was a “distant witness” of the Arab Spring
  • 28. Do we have curators in Twitter? Topic-unfocused, human: Topic-unfocused curator, disseminating news articles about diverse topics, usually breaking news/top stories (@KeriJSmith). Topic-unfocused, automatic: News aggregators, collecting news articles (e.g. from RSS feeds) and automatically posting their headlines and URLs (@BreakingNews). Topic-focused, human: Topic-focused curator, collecting interesting information with a specific focus, usually a geographic region or a topic (@KenanFreeSyria). Topic-focused, automatic: Topic-focused aggregators, automatically disseminating news with a topical focus (@UP_food, @RevolutionSyria)
  • 29. Which users do we care about? Topic-focused, human: Topic-focused curator, collecting interesting information with a specific focus, usually a geographic region or a topic (@KenanFreeSyria). Topic-focused, automatic: Topic-focused aggregators, automatically disseminating news with a topical focus (@UP_food, @RevolutionSyria)
  • 30. Manual annotation (200 users). Two pie charts with categories Focused - Human, Focused - Auto and Unfocused; Unfocused is 95% in one chart and 79% in the other (remaining labels: 8%, 13%, 3%, 2%)
  • 31. Automatically finding curators • Simple rules – UserFracURL >= 85%: automatic – UserSectionsQ >= 90%: unfocused • Complex model (AUC > 0.90) – Random forest
  • 32. Take-home messages • Twitter users quickly shift topics – But sometimes return to a topic • There are excellent news curators in Twitter – Although many of them are automatic • Automatic systems can help identify curators and follow up on news
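
The two simple rules on this slide translate directly into code. In this sketch the feature values are assumed to be fractions in [0, 1] that have already been computed; the slides name the features UserFracURL and UserSectionsQ but do not define how they are calculated, and the function name `classify_user` is hypothetical.

```python
def classify_user(user_frac_url, user_sections_q,
                  auto_threshold=0.85, unfocused_threshold=0.90):
    """Apply the two threshold rules from the slide.

    user_frac_url:   value of the UserFracURL feature, as a fraction in [0, 1]
    user_sections_q: value of the UserSectionsQ feature, as a fraction in [0, 1]
    Returns a (kind, focus) pair such as ("human", "focused").
    """
    kind = "automatic" if user_frac_url >= auto_threshold else "human"
    focus = "unfocused" if user_sections_q >= unfocused_threshold else "focused"
    return kind, focus


if __name__ == "__main__":
    print(classify_user(0.95, 0.30))  # ('automatic', 'focused')
    print(classify_user(0.40, 0.95))  # ('human', 'unfocused')
```

The random forest mentioned on the slide would replace these hand-set thresholds with a model learned from the annotated users.
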
  • 33. Topic 3 of 4: Analysis of TV news via closed captions. Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan: Says Who? Automatic Text-based Content Analysis of Television News. Workshop on Mining Unstructured Data Using NLP (UnstructureNLP). Burlingame, CA, USA. October 2013.
  • 34. Acquiring closed captions • We used data from Yahoo's IntoNow – 140 TV channels – 2MB/channel/day – Jan-Jun 2012 • Internet Archive: http://archive.org/details/tv
  • 35. Text pre-processing: input [1339302660] WHAT MORE CAN YOU ASK FOR? [1339302662] >> THIS IS WHAT NBA [1339302663] BASKETBALL IS ABOUT
  • 36. Text pre-processing: output What/WP more/JJR can/MD you/PRP ask/VB for/IN ?/. This/DT is/VBZ what/WDT NBA/NNP [entity: National_Basketball_ Association] basketball/NN is/VBZ about/IN ./. [sentiment: 0.0]
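
The first pre-processing step, turning timestamped caption lines into paragraphs, might look like the sketch below. It assumes one `[unix_timestamp] TEXT` caption per input line and treats a leading `>>` as a speaker-change marker that starts a new paragraph; the function name `parse_captions` is hypothetical, and the time- and text-based heuristics mentioned elsewhere in the talk are not implemented.

```python
import re

# One caption per line: "[1339302660] WHAT MORE CAN YOU ASK FOR?"
CAPTION_RE = re.compile(r"^\[(\d+)\]\s*(.*)$")


def parse_captions(lines):
    """Group caption lines into paragraphs, splitting on '>>' markers."""
    paragraphs = []
    for line in lines:
        match = CAPTION_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like captions
        timestamp, text = int(match.group(1)), match.group(2).strip()
        new_speaker = text.startswith(">>")
        if new_speaker:
            text = text[2:].strip()
        if new_speaker or not paragraphs:
            paragraphs.append({"start": timestamp, "text": text})
        else:
            paragraphs[-1]["text"] += " " + text
    return paragraphs


if __name__ == "__main__":
    sample = [
        "[1339302660] WHAT MORE CAN YOU ASK FOR?",
        "[1339302662] >> THIS IS WHAT NBA",
        "[1339302663] BASKETBALL IS ABOUT",
    ]
    for paragraph in parse_captions(sample):
        print(paragraph)
```

The resulting paragraphs would then go through part-of-speech tagging, entity linking and sentiment scoring to produce the annotated form shown on the next slide.
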
  • 37. Clusters by non-entity words (figure; cluster labels: General news, Sport news, General + entertainment, Sports, General + sports, Business + sports)
  • 38. Clusters by linguistic style (figure; cluster labels: General + business, General + entertainment, Sports)
  • 39. Sorting by average sentiment. Sentiment scores on TV captions go from neutral to positive. Strong positive words are used more than strong negative words? (figure labels: Mixed, Sports)
  • 40. Automatic TV ↔ online news matching • Same pre-processing is done over articles on the Yahoo! News website • Genre classification (general, sports, business, entertainment) by – Data from TV guide for closed captions – Section in Yahoo! News for web news
  • 41. Coverage by prominence TV networks with more resources can cover more stories. Some prefer to cover only prominent ones, others want some niche content.
  • 42. US military to probe “marine abuse video” January 12th, 2012
  • 43. Breaking stories vs news matching
  • 44. Average story duration: sports stories tend to have a longer life
  • 45. Newsmakers • By professional activity – Sentiments – Distributions • In relationship to news providers • Everybody is a (potential) entertainer (figure: distributions of mentions per person)
  • 46. By professional activity
  • 47. Athletes or entertainers?
  • 48. Politicians or entertainers?
  • 49. Take-home messages • Closed captions are a goldmine of data for content analysis • Automatic content analysis is feasible up to a certain extent – But we still need to learn to use it • Reduce subjectivity when trying to answer some research questions
  • 50. Topic 4 of 4: Biases in online news in international news media. Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas: Social Media News Communities: Gatekeeping, Coverage, and Statement Bias. In Proc. of Conference on Information and Knowledge Management (short paper). Burlingame, CA, USA, October 2013.
  • 51. Jonathan Stray, The Atlantic, Feb 2013; Wei-Hao Lin, PhD Thesis, CMU 2008
  • 52. Selection bias, Coverage bias, Statement bias
  • 53. Goal: discover bias in news media • 60+ news sources in English – BBC, CNN, Fox, Time, UPI, Herald Sun, Times of India, Euro News, DW English, etc. • Follow news through RSS and Twitter • Collect tweets pointing to news • No a-priori information on conflicts or divisions → unsupervised methods
  • 54. Method • “Community” of a news source – Users who tweeted at least 3 articles from that source in the last 3 days • Collect all articles posted by each – News source – Community of a news source • Compute distances and project in 2D
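
The community definition on this slide (users who tweeted at least 3 articles from a source in the last 3 days) can be sketched as follows. The tuple layout of `tweets` and the function name `source_communities` are assumptions; counting distinct article URLs rather than raw tweets is also an interpretation, since the slide does not say which was used.

```python
from collections import defaultdict

DAY = 86400  # seconds


def source_communities(tweets, now, min_articles=3, window_days=3):
    """Map each news source to the set of users in its community.

    tweets: iterable of (user, source, article_url, unix_timestamp) tuples
            (assumed layout).
    """
    start = now - window_days * DAY
    articles_by_pair = defaultdict(set)  # (source, user) -> distinct URLs
    for user, source, url, ts in tweets:
        if start <= ts <= now:
            articles_by_pair[(source, user)].add(url)
    communities = defaultdict(set)
    for (source, user), urls in articles_by_pair.items():
        if len(urls) >= min_articles:
            communities[source].add(user)
    return dict(communities)
```

Each community's tweeted articles can then be compared with the articles published by the source itself.
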
  • 55. Community overlaps (J>0.03)
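
Reading the J in "J>0.03" as the Jaccard coefficient between two communities' user sets (an assumption, though the slides use Jaccard elsewhere), the overlap graph could be computed as in this sketch. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(communities, threshold=0.03):
    """Return (source_1, source_2, J) for every pair with J above threshold.

    communities: dict mapping a source name to its set of users.
    """
    pairs = []
    for s, t in combinations(sorted(communities), 2):
        j = jaccard(communities[s], communities[t])
        if j > threshold:
            pairs.append((s, t, j))
    return pairs


if __name__ == "__main__":
    demo = {"A": {1, 2, 3}, "B": {3, 4}, "C": {9}}
    print(overlapping_pairs(demo))  # [('A', 'B', 0.25)]
```
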
  • 56. Selection bias
  • 57. Coverage bias Measure the distribution of the number of words given to each news story. Compute the 1-divergence between each pair of sources.
  • 58. Coverage bias In Twitter, coverage bias (as measured by number of tweets) is evident while selection bias is not.
  • 59. Coverage bias and partisan politics
  • 60. Sentiment analysis
  • 61. Future work: find patterns like this? “perusing TIME’s covers reveals countless examples of the publication tempting the world with critical events, ideas or figures, while dangling before Americans the chance to indulge in trite self-absorption” – David Harris Gershon
  • 62. Take-home messages • Encouraging results on fully unsupervised discovery – But results are quite shallow for now • It is frustratingly difficult to discover bias and framing – We are not happy with only quantifying or analyzing known conflicts
  • 63. Closing remarks
  • 64. Diagram labels: Journalism needs, Data availability, Computing capabilities
  • 65. Finding common ground is not easy. Diagram labels: AI-complete problems, Journalism needs, Data availability, Poorly planned projects, Computing capabilities, Overexploited
  • 66. Data analysis is easy, fun and addictive. Without good research questions, it is often useless. Computer science to support a key function of society = Applied computing at its best!
  • 67. Thank you! Carlos Castillo · chato@acm.org http://www.chato.cl/research/
  • 68. Shouldn't traditional news outlets resent social media? • We did not take their lunch • I am not pointing fingers but … • … online classified ads are to “blame”
  • 69. Data sample from Al Jazeera English • October 2012: ≈ 3M visits, ≈ 606 articles, ≈ 200K social media reactions • Open Source Web Analytics beacon – High-performance process (S4+Cassandra).
  • 70. News: less shared on Facebook. In-Depth: more shared on Facebook.
  • 71. Examples (mid-2012). Decreasing (78%): ● Almost all breaking news ● Sometimes delayed due to timezone differences, e.g. Hurricane Sandy. Steady or Increasing (12%): ● Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest ● Articles updated with supporting content. Rebounding (10%): ● Articles picked up by external sources or social media (typically single source of traffic) ● Background articles to new developments
  • 72. Predicting traffic and shelf-life online has a long history • Predicting long-term behavior and half-life from short-term observations – Observations = comments, visits, votes, … – Behavior = total comments, total visits, … – 10+ papers specifically on web traffic • Bit.ly (2011, 2012) – Studies half-life per topic and platform
  • 73. Results (shelf-life prediction): larger improvements for In-Depth articles. Still, this is a 12-hour error in predicting something with an average of 48-72 hours
  • 74. Social media users engaged with news • To what extent can they contribute to the journalistic process? • What kind of roles do they play? • 47% of journalists from 15 countries (n=478) said Twitter is a source of information for them [source]
  • 75. Manual annotation • 200 users in 20 articles • Crowdsourcing workers see: – Title of news article – Profile and description of user – Sample of 10 tweets of the user
  • 76. In relation to news providers Projection in 2D of the second component of a 3-way decomposition with a 3x2x2 core of the tensor of sources x newsmakers x style. The first component separates football from basketball.
  • 77. Text pre-processing: steps • Determine paragraph boundaries – Speech change markers, heuristics based on text and time • Apply a part-of-speech tagger – Stanford NLP tagger • Find named entity mentions • Apply sentiment analysis
  • 78. News sources • Non-entity words • Linguistic style – Prevalence of different part-of-speech classes • Overall sentiment • Coverage • Timeliness
  • 79. News matching (model) • Target: {same story, different story} • Example features: – Dot product of aboutness scores of resolved entities in the title, body – Jaccard coefficient of unresolved entities in the title, body • Logistic regression • 4 models in total, one per genre
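
The two example features and the logistic regression named on this slide can be sketched as below. The dictionaries of aboutness scores, the feature names, and especially the weights and bias in the usage example are made up for illustration; the slides give no trained coefficients.

```python
import math


def dot_product(scores_a, scores_b):
    """Dot product of two sparse {entity: aboutness_score} dictionaries."""
    return sum(w * scores_b[e] for e, w in scores_a.items() if e in scores_b)


def jaccard(a, b):
    """Jaccard coefficient of two sets of unresolved entity strings."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_probability(features, weights, bias):
    """Logistic regression: P(same story) for a caption/article pair."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    features = {
        "entity_dot": dot_product({"NBA": 0.5, "Miami_Heat": 0.2}, {"NBA": 0.4}),
        "unresolved_jaccard": jaccard({"finals", "game 7"}, {"finals"}),
    }
    # Hypothetical weights; a real model would learn them from labelled pairs.
    weights = {"entity_dot": 4.0, "unresolved_jaccard": 2.0}
    print(match_probability(features, weights, bias=-1.5))
```

With one model per genre, as the slide describes, the same feature code would be paired with four separately trained weight sets.
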