Presentation at the Tow Center for Digital Journalism, Columbia University. November 14th, 2013.
VIDEO: http://new.livestream.com/accounts/1079539/events/2542929
http://towcenter.org/events/conversation-with-carlos-castillo/
2. Social Media News Mining & Automatic Content Analysis of News
Carlos Castillo – Qatar Computing Research Institute
Nov 14th, 2013
5. Outline
• Social media around news
1. Predictive analytics using social media
2. Crowds and curators
• Automatic content analysis of news
3. TV news via closed captions
4. Online news in international media
Carlos Castillo – chato@acm.org
http://www.chato.cl/research/
6. Communication scholars vs. Computer scientists
• Media and communication scholars
– Start from high-level questions
• Computer scientists
– Start from low-level observations
• We need to find a middle ground
– To a large extent, we are still not there
– I am certainly still not there
7. Collaborators
• Gianmarco de Francisci Morales – Yahoo!
• Mohammed El-Haddad – Al Jazeera
• Sandra González-Bailón – University of Pennsylvania
• Nasir Khan – Al Jazeera
• Mounia Lalmas – Yahoo!
• Janette Lehmann – Pompeu Fabra University & Yahoo!
• Marcelo Mendoza – Yahoo!
• Jürgen Pfeffer – CMU
• Matt Stempeck – MIT Civic Media
• Diego Sáez-Trumper – Pompeu Fabra University
• Ethan Zuckerman – MIT Civic Media
8. Topic 1 of 4
Predictive analytics using social media
Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck
Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
To appear in Proc. of Computer-Supported Cooperative Work and Social Computing (CSCW).
Baltimore, MD, USA. February 2014.
See also: demo at http://fast.qcri.org/
10. Usage analysis (in news) online
• Aikat (1998)
– Bursts, short dwell times, weekday != weekend
• Crane and Sornette (2008), Yang and Leskovec (2011), Lehmann et al. (2012)
– Behavioral classes of attention online
• Lotan, Gaffney, and Meyer (SocialFlow, 2011)
– Al Jazeera, BBC, CNN, The Economist, Fox News, NY Times
• … and many others!
12. News examples
• Dozens killed in India bus-crash blaze (Oct 30th, 2013)
• Kenyan army admits soldiers looted mall (Oct 30th, 2013)
In-Depth examples
• Sex selective abortions worry Azerbaijanis (Oct 29th, 2013)
• Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)
16. Prediction of visits
• Short-term traffic is to a large extent correlated with long-term traffic
• Social media signals are correlated with traffic and shelf-life
– More reactions → more traffic
– More discussion → longer shelf-life
• Can we predict 7 days after 30 minutes?
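The forecasting question above can be sketched as a regression from early traffic to total traffic. This is only an illustration: the slides do not say which model was used, so the log-log linear fit, the function names and the training numbers below are all assumptions.

```python
import math

def fit_log_linear(early_visits, final_visits):
    """Least-squares fit of log(final) = a + b * log(early)."""
    xs = [math.log(v) for v in early_visits]
    ys = [math.log(v) for v in final_visits]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_final(a, b, early):
    """Forecast total visits from the visits seen in the early window."""
    return math.exp(a + b * math.log(early))

# Made-up history: visits in the first 30 minutes vs. visits after 7 days.
history_30min = [10, 20, 40, 80, 160]
history_7days = [100, 200, 400, 800, 1600]

a, b = fit_log_linear(history_30min, history_7days)
forecast = predict_final(a, b, 50)
```

Social media signals (reactions, discussion) would enter such a model as additional input variables; re-fitting on fresh history corresponds to re-training the model periodically.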
19. Predictions are updated as new information arrives. Predictive models are re-trained every 24 hours. Traffic to many (but not all) articles is easy to predict.
Don't remove over-achievers, promote under-achievers.
http://fast.qcri.org/
20. Take-home messages
• Three traffic patterns: Decrease, Stay or Increase, Rebound
– Roughly 80:10:10 ratio in first 12 hours
• News vs In-Depth: different behavior
– News pieces die out rapidly on the web
– In-Depth pieces live longer
• Visit forecasting can help editors make more informed decisions
21. Topic 2 of 4
News crowds and news curators in social media
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Transient News Crowds in Social Media
In Proc. of International Conference on Weblogs and Social Media.
Cambridge, MA, USA, July 2013. See also: blog post.
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Finding News Curators in Twitter
Social News on the Web (SNOW) workshop.
Rio de Janeiro, Brazil, May 2013. See also: blog post.
24. Empirical results
• Experiment with articles from BBC and Al Jazeera English (AJE)
• People who tweeted an article within 6 hours of publication → news crowd
– Follow the crowd for one week
– Divide time into 12-hour slices
• Most crowds disperse rapidly
– They tweeted once about the same thing
– Now they tweet about different things
• Some crowds re-group later
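The crowd definition and time slicing above can be sketched as follows. The tuple layouts, function names and sample tweets are hypothetical, and the sketch assumes the one-week follow-up period starts at publication time, which the slide does not specify.

```python
from collections import defaultdict

HOUR = 3600
CROWD_WINDOW = 6 * HOUR        # tweeted within 6 hours of publication
SLICE_LENGTH = 12 * HOUR       # 12-hour slices
FOLLOW_PERIOD = 7 * 24 * HOUR  # follow the crowd for one week

def news_crowd(published_at, article_tweets):
    """Users who tweeted the article within 6 hours of publication."""
    return {user for user, ts in article_tweets
            if 0 <= ts - published_at <= CROWD_WINDOW}

def slice_crowd_tweets(published_at, crowd, all_tweets):
    """Group the crowd's tweets into 12-hour slices over one week."""
    slices = defaultdict(list)
    for user, ts, text in all_tweets:
        offset = ts - published_at
        if user in crowd and 0 <= offset < FOLLOW_PERIOD:
            slices[offset // SLICE_LENGTH].append((user, text))
    return dict(slices)

# Hypothetical data: (user, timestamp) pairs for tweets linking to the article.
published_at = 0
article_tweets = [("ana", 1 * HOUR), ("bob", 5 * HOUR), ("eve", 7 * HOUR)]
crowd = news_crowd(published_at, article_tweets)

# Hypothetical data: (user, timestamp, text) for everything users tweet later.
all_tweets = [
    ("ana", 2 * HOUR, "first"),
    ("bob", 13 * HOUR, "second"),
    ("eve", 14 * HOUR, "not in crowd"),
    ("ana", 8 * 24 * HOUR, "after one week"),
]
slices = slice_crowd_tweets(published_at, crowd, all_tweets)
```

Comparing the text in each slice with the original article is then one way to see whether a crowd has dispersed or re-grouped.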
26. How do we find the related ones?
• Machine-learning approach
• Important attributes
– Text similarity to original story
– Exclusivity of history to this crowd
• Finds 14% to 72% of related stories automatically (at 2/3 precision)
27. Application to tracking a story
28. Focus on articles → focus on users
Example: which users with a large number of followers tweeted "Syria allows UN to step up food aid" (16 Jan 2013)?
Twitter user       Followers    Tweets about ...
@RevolutionSyria   88,122       Syria
@KenanFreeSyria    13,388       Syria
@UP_food           703          Food
@KeriJSmith        8,838        Breaking news/top stories
@BreakingNews      5,662,866    Breaking news/top stories
29. News curators
• Think Andy Carvin @acarvin, who was a
“distant witness” of the Arab Spring
30. Do we have curators in Twitter?
• Human, topic-unfocused: Topic-unfocused curator
– Disseminating news articles about diverse topics, usually breaking news/top stories
– Example: @KeriJSmith
• Automatic, topic-unfocused: News aggregators
– Collecting news articles (e.g. from RSS feeds) and automatically posting their corresponding headlines and URLs
– Example: @BreakingNews
• Human, topic-focused: Topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria
• Automatic, topic-focused: Topic-focused aggregators
– Automatically disseminating news with a topical focus
– Examples: @UP_food, @RevolutionSyria
31. Which users do we care about?
• Human: Topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria
• Automatic: Topic-focused aggregators
– Automatically disseminating news with a topical focus
– Examples: @UP_food, @RevolutionSyria
32. Manual annotation (200 users)
[Figure: two charts showing the share of users labeled Focused - Human, Focused - Auto and Unfocused. Values shown: 8%, 13%, 3%, 2%, 95%, 79%.]
34. Take-home messages
• Twitter users quickly shift topics
– But sometimes return to a topic
• There are excellent news curators in Twitter
– Although many of them are automatic
• Automatic systems can help identify curators and follow-up news
35. Topic 3 of 4
Analysis of TV news via closed captions
Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan:
Says Who? Automatic Text-based Content Analysis of Television News
Workshop on Mining Unstructured Data Using NLP (UnstructureNLP).
Burlingame, CA, USA. October 2013.
36. Acquiring closed captions
• We used data from Yahoo's IntoNow
– 140 TV channels
– 2 MB per channel per day
– Jan-Jun 2012
• Internet Archive: http://archive.org/details/tv
37. Text pre-processing: input
[1339302660] WHAT MORE CAN YOU ASK FOR?
[1339302662] >> THIS IS WHAT NBA
[1339302663] BASKETBALL IS ABOUT
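A minimal sketch of reading this input format, assuming the bracketed number is a Unix timestamp in seconds and that `>>` marks a change of speaker (the usual closed-caption convention). The regular expression and field names are illustrative, not the authors' parser.

```python
import re

# "[timestamp] optional >> marker, then the caption text"
CAPTION_LINE = re.compile(r"^\[(\d+)\]\s*(>>)?\s*(.*)$")

def parse_caption_line(line):
    """Split one caption line into timestamp, speaker-change flag and text."""
    match = CAPTION_LINE.match(line.strip())
    if match is None:
        return None  # line does not follow the expected format
    timestamp, marker, text = match.groups()
    return {
        "timestamp": int(timestamp),
        "speaker_change": marker is not None,
        "text": text,
    }

sample = [
    "[1339302660] WHAT MORE CAN YOU ASK FOR?",
    "[1339302662] >> THIS IS WHAT NBA",
    "[1339302663] BASKETBALL IS ABOUT",
]
parsed = [parse_caption_line(line) for line in sample]
```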
39. Clusters by non-entity words
[Figure: clusters of channels by non-entity words. Cluster labels: General news; Sport news; General + entertainment; Sports; General + sports; Business + sports.]
41. Sorting by average sentiment
Sentiment scores on TV captions go from neutral to positive. Strong positive words are used more than strong negative words?
[Figure labels: Mixed; Sports]
42. Automatic TV ↔ online news matching
• The same pre-processing is applied to articles on the Yahoo! News website
• Genre classification (general, sports, business, entertainment) is based on:
– Data from the TV guide for closed captions
– The section in Yahoo! News for web news
43. Coverage by prominence
TV networks with more resources can cover more stories. Some prefer to cover only prominent ones, others want some niche content.
44. US military to probe “marine abuse video”
January 12th, 2012
47. Newsmakers
• By professional activity
– Sentiments
– Distributions
• In relationship to news providers
• Everybody is a (potential) entertainer
[Figure: Distributions of mentions per person]
51. Take-home messages
• Closed captions are a goldmine of data for content analysis
• Automatic content analysis is feasible to a certain extent
– But we still need to learn to use it
• It can reduce subjectivity when trying to answer some research questions
52. Topic 4 of 4
Biases in online news in international news media
Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas:
Social Media News Communities: Gatekeeping, Coverage, and Statement Bias
In Proc. of Conference on Information and Knowledge Management (short paper).
Burlingame, CA, USA, October 2013.
56. Goal: discover bias in news media
• 60+ news sources in English
– BBC, CNN, Fox, Time, UPI, Herald Sun, Times of India, Euro News, DW English, etc.
• Follow news through RSS and Twitter
• Collect tweets pointing to news
• No a-priori information on conflicts or divisions → unsupervised methods
57. Method
• “Community” of a news source
– Users who tweeted at least 3 articles from that source in the last 3 days
• Collect all articles posted by each
– News source
– Community of a news source
• Compute distances and project in 2D
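The community definition above can be sketched as a simple count over recent tweets. The tuple layout and sample data are hypothetical, and counting distinct article URLs (rather than tweets) is an assumption; the distance computation and 2D projection are not covered here because the slide does not describe them.

```python
from collections import defaultdict

MIN_ARTICLES = 3                 # at least 3 articles from the source
WINDOW_SECONDS = 3 * 24 * 3600   # in the last 3 days

def communities(tweets, now):
    """Map each news source to the users in its community.

    tweets: iterable of (user, source, article_url, timestamp).
    """
    seen = defaultdict(set)  # (source, user) -> distinct article URLs
    for user, source, url, ts in tweets:
        if 0 <= now - ts <= WINDOW_SECONDS:
            seen[(source, user)].add(url)
    result = defaultdict(set)
    for (source, user), urls in seen.items():
        if len(urls) >= MIN_ARTICLES:
            result[source].add(user)
    return dict(result)

DAY = 24 * 3600
now = 10 * DAY
tweets = [
    ("ana", "bbc", "u1", now - 1000),
    ("ana", "bbc", "u2", now - 2000),
    ("ana", "bbc", "u3", now - DAY),
    ("bob", "bbc", "u1", now - 1000),
    ("bob", "bbc", "u1", now - 500),      # repeat of the same article
    ("bob", "bbc", "u2", now - 100),
    ("eve", "cnn", "c1", now - 5 * DAY),  # outside the 3-day window
    ("eve", "cnn", "c2", now - 10),
    ("eve", "cnn", "c3", now - 20),
]
result = communities(tweets, now)
```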
64. Future work: find patterns like this?
“perusing TIME’s covers reveals countless
examples of the publication tempting the world
with critical events, ideas or figures, while
dangling before Americans the chance to indulge
in trite self-absorption” – David Harris Gershon
65. Take-home messages
• Encouraging results on fully unsupervised discovery
– But results are quite shallow for now
• It is frustratingly difficult to discover bias and framing
– We are not happy with only quantifying or analyzing known conflicts
68. Finding common ground is not easy.
[Figure: diagram with the labels AI-complete problems; Journalism needs; Data availability; Poorly planned projects; Computing capabilities; Overexploited.]
69. Data analysis is easy, fun and addictive.
Without good research questions,
it is often useless.
Computer science to support a key function
of society = Applied computing at its best!
72. Shouldn't traditional news outlets
resent social media?
• We did not take their lunch
• I am not pointing fingers but …
• … online classified ads are to “blame”
73. Data sample from Al Jazeera English
• October 2012
– ≈ 3M visits
– ≈ 606 articles
– ≈ 200K social media reactions
• Open Source Web Analytics beacon
– High-performance process (S4+Cassandra).
75. Examples (mid-2012)
• Decreasing (78%):
– Almost all breaking news
– Sometimes delayed due to timezone differences, e.g. Hurricane Sandy
• Steady or Increasing (12%):
– Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest
– Articles updated with supporting content
• Rebounding (10%):
– Articles picked up by external sources or social media (typically single source of traffic)
– Background articles to new developments
76. Predicting traffic and shelf-life online has a long history
• Predicting long-term behavior and half-life from short-term observations
– Observations = comments, visits, votes, …
– Behavior = total comments, total visits, …
– 10+ papers specifically on web traffic
• Bit.ly (2011, 2012)
– Studies half-life per topic and platform
78. Social media users engaged with news
• To what extent can they contribute to the journalistic process?
• What kind of roles do they play?
• 47% of journalists from 15 countries (n=478) said Twitter is a source of information for them [source]
79. Manual annotation
• 200 users in 20 articles
• Crowdsourcing workers see:
– Title of news article
– Profile and description of user
– Sample of 10 tweets of the user
80. In relation to news providers
Projection in 2D of the second component of a 3-way decomposition
with a 3x2x2 core of the tensor of sources x newsmakers x style.
The first component separates football from basketball.
81. Text pre-processing: steps
• Determine paragraph boundaries
– Speech change markers, heuristics based on text and time
• Apply a part-of-speech tagger
– Stanford NLP tagger
• Find named entity mentions
• Apply sentiment analysis
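The first step above, determining paragraph boundaries, can be sketched with two simple rules: start a new paragraph at a speaker-change marker or after a long pause. The 5-second threshold and the tuple layout are assumptions chosen for illustration, not values from the paper; the later steps (tagging, entity detection, sentiment) are not sketched here.

```python
MAX_GAP_SECONDS = 5  # hypothetical pause threshold

def split_paragraphs(captions, max_gap=MAX_GAP_SECONDS):
    """Group caption lines into paragraphs.

    captions: list of (timestamp, speaker_change, text) in time order.
    A new paragraph starts on a speaker change or after a long pause.
    """
    paragraphs = []
    current = []
    previous_ts = None
    for ts, speaker_change, text in captions:
        long_pause = previous_ts is not None and ts - previous_ts > max_gap
        if current and (speaker_change or long_pause):
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        previous_ts = ts
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

captions = [
    (1339302660, False, "WHAT MORE CAN YOU ASK FOR?"),
    (1339302662, True, "THIS IS WHAT NBA"),
    (1339302663, False, "BASKETBALL IS ABOUT"),
]
paragraphs = split_paragraphs(captions)
```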
82. News sources
• Non-entity words
• Linguistic style
– Prevalence of different part-of-speech
classes
• Overall sentiment
• Coverage
• Timeliness
83. News matching (model)
• Target: {same story, different story}
• Example features:
– Dot product of aboutness scores of resolved entities in the title, body
– Jaccard coefficient of unresolved entities in the title, body
• Logistic regression
• 4 models in total, one per genre
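The two example features and the logistic model can be sketched as below. The entity names, aboutness scores, weights and bias are made up for illustration; in practice the weights would be learned from labeled pairs, with one model trained per genre.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient of two collections of entity strings."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dot_product(scores_a, scores_b):
    """Dot product of two sparse {entity: aboutness score} vectors."""
    return sum(s * scores_b[e] for e, s in scores_a.items() if e in scores_b)

def same_story_probability(features, weights, bias):
    """Logistic regression score: probability of 'same story'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical TV segment and web article (made-up entities and scores).
tv = {"resolved": {"US Marine Corps": 0.8, "Afghanistan": 0.5},
      "unresolved": {"marines", "video"}}
web = {"resolved": {"Afghanistan": 0.6, "Pentagon": 0.4},
       "unresolved": {"marines", "video", "probe"}}

features = [
    dot_product(tv["resolved"], web["resolved"]),
    jaccard(tv["unresolved"], web["unresolved"]),
]
# Hypothetical weights and bias, not fitted to any data.
probability = same_story_probability(features, weights=[2.0, 3.0], bias=-2.0)
```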