Social Media News Mining and Automatic Content Analysis of News

Social Media News Mining
Carlos Castillo (@ChaToX)
Gilad Lotan (@gilgul)

Social Media News Mining &
Automatic Content Analysis
of News
Carlos Castillo – Qatar Computing Research Institute

Nov 14th, 2013

Outline
• Social media around news
1. Predictive analytics using social media
2. Crowds and curators

• Automatic content analysis of news
3. TV news via closed captions
4. Online news in international media

Carlos Castillo – chato@acm.org
http://www.chato.cl/research/

Communication scholars
vs. Computer scientists
• Media and communication scholars
– Start from high-level questions

• Computer scientists
– Start from low-level observations

• We need to find a middle ground
– To a large extent, we are still not there
– I am certainly still not there

Collaborators
• Gianmarco de Francisci Morales – Yahoo!
• Mohammed El-Haddad – Al Jazeera
• Sandra González-Bailón – University of Pennsylvania
• Nasir Khan – Al Jazeera
• Mounia Lalmas – Yahoo!
• Janette Lehmann – Pompeu Fabra University & Yahoo!
• Marcelo Mendoza – Yahoo!
• Jürgen Pfeffer – CMU
• Matt Stempeck – MIT Civic Media
• Diego Sáez-Trumper – Pompeu Fabra University
• Ethan Zuckerman – MIT Civic Media

Topic 1 of 4

Predictive analytics using
social media
Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck
Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
To appear in Proc. of Computer Supported Cooperative Work and Social Computing (CSCW).
Baltimore, MD, USA. February 2014.
See also: demo at http://fast.qcri.org/

Pirates abduct ship’s crew off Nigerian coast
October 17th, 2012

Usage analysis (in news) online
• Aikat (1998)
– Bursts, short dwell times, weekday != weekend

• Crane and Sornette (2008), Yang and Leskovec
(2011), Lehmann et al. (2012)
– Behavioral classes of attention online

• Lotan, Gaffney, and Meyer (SocialFlow, 2011)
– Al Jazeera, BBC, CNN, The Economist, Fox News, NY
Times

• … and many others!

News examples
● Dozens killed in India bus-crash blaze (Oct 30th, 2013)
● Kenyan army admits soldiers looted mall (Oct 30th, 2013)

In-Depth examples
● Sex selective abortions worry Azerbaijanis (Oct 29th, 2013)
● Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)

News: intense first hour
In-Depth: longer shelf-life

Average visitation/sharing profiles
[Figure: two panels, News and In-Depth]

Types of news visitation profiles (12 h)
Decreasing (78%)
Steady (9%)
Increasing (3%)
Rebounding (10%)


Prediction of visits
• Short-term traffic is to a large extent
correlated with long-term traffic
• Social media signals are correlated with
traffic and shelf-life
– More reactions → more traffic
– More discussion → longer shelf-life

• Can we predict 7 days after 30 minutes?
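
One simple way to frame the question above is a log-linear regression from early counts to the 7-day total. The sketch below is only an illustration under stated assumptions, not the model behind the reported results: the feature names (`visits_30m`, `reactions_30m`), the toy training numbers and the choice of ordinary least squares are all hypothetical.

```python
import math

def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(n):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [x - factor * y for x, y in zip(a[k], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def features(visits_30m, reactions_30m):
    # Intercept plus log-transformed early counts (log1p tolerates zeros).
    return [1.0, math.log1p(visits_30m), math.log1p(reactions_30m)]

# Hypothetical history: (visits in first 30 min, social media reactions
# in first 30 min, total visits after 7 days).
history = [
    (120, 10, 2400), (300, 45, 9000), (80, 2, 1100),
    (500, 60, 14000), (150, 30, 5200), (60, 5, 1000),
]
X = [features(v, r) for v, r, _ in history]
y = [math.log1p(total) for _, _, total in history]
weights = fit_ols(X, y)

def predict_7d_visits(visits_30m, reactions_30m):
    """Predicted 7-day visits for an article observed for 30 minutes."""
    score = sum(w * x for w, x in zip(weights, features(visits_30m, reactions_30m)))
    return math.expm1(score)
```

Working in log space is a design choice for the sketch: traffic counts are heavy-tailed, so a linear fit on raw counts would be dominated by a few very popular articles.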

Results (traffic predictions)
Improved predictions using social media variables

Predictions are updated as new information arrives.
Predictive models are re-trained every 24 hours.
Traffic to many (but not all) articles is easy to predict.
Don't remove over-achievers, promote under-achievers.

http://fast.qcri.org/

Take-home messages
• Decreasing, Steady or Increasing, Rebounding
– Roughly 80:10:10 ratio in first 12 hours

• News vs In-Depth: different behavior
– News pieces die out rapidly on the web
– In-Depth pieces live longer

• Visit forecasting can help make better-informed
editorial decisions


Topic 2 of 4

News crowds and news
curators in social media
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Transient News Crowds in Social Media
In Proc. of International Conference on Weblogs and Social Media.
Cambridge, MA, USA, July 2013. See also: blog post.
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Finding News Curators in Twitter
Social News on the Web (SNOW) workshop.
Rio de Janeiro, Brazil, May 2013. See also: blog post.

Social media users that are highly engaged with news

Transient News Crowds


Empirical results
• Experiment with articles in BBC and AJE
• People who tweeted an article within 6 hours of
publication → news crowd
– Follow the crowd for one week
– Divide time into 12-hour slices

• Most crowds disperse rapidly
– They tweeted once about the same thing
– Now they tweet about different things

• Some crowds re-group later
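
The crowd definition above can be sketched in a few lines. This is a minimal illustration, assuming tweets arrive as `(user, timestamp, url)` tuples with timestamps in seconds and that the one-week follow-up starts at publication time; the tuple layout and function names are hypothetical.

```python
from collections import defaultdict

HOUR = 3600
CROWD_WINDOW = 6 * HOUR         # tweeted within 6 hours of publication
SLICE_LENGTH = 12 * HOUR        # 12-hour slices
FOLLOW_PERIOD = 7 * 24 * HOUR   # follow the crowd for one week

def news_crowd(article_url, published_at, tweets):
    """Users who tweeted the article within 6 hours of its publication."""
    return {user for user, ts, url in tweets
            if url == article_url and 0 <= ts - published_at <= CROWD_WINDOW}

def slice_crowd_activity(crowd, published_at, tweets):
    """Group the crowd's tweets from the following week into 12-hour slices.

    Returns {slice_index: [(user, url), ...]}, where slice 0 covers the
    first 12 hours after publication.
    """
    slices = defaultdict(list)
    for user, ts, url in tweets:
        offset = ts - published_at
        if user in crowd and 0 <= offset < FOLLOW_PERIOD:
            slices[offset // SLICE_LENGTH].append((user, url))
    return dict(slices)
```

Comparing the URLs in each slice against the original article is then one way to see whether a crowd has dispersed or re-grouped.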

Syria allows UN to step up food aid
13 Jan 2013

French troops launch ground combat in Mali
13 Jan 2013

How do we find the related ones?
• Machine-learning approach
• Important attributes
– Text similarity to original story
– Exclusivity of history to this crowd

• Finds 14% to 72% of related stories
automatically (@ 2/3 precision)
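
The "text similarity to original story" attribute could be computed in many ways; the slides do not say which one was used. A minimal sketch, assuming a bag-of-words cosine similarity (a swapped-in, common choice, not necessarily the one in the paper):

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Lower-cased word counts; the tokenization here is deliberately crude."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0
```

A candidate story that shares many terms with the original article scores close to 1, and one with no terms in common scores 0.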


Application to tracking a story


Focus on articles → focus on users
Example: which users with a large number of followers tweeted
Syria allows UN to step up food aid (16 Jan 2013)

Twitter user       Followers    Tweets about ...
@RevolutionSyria   88,122       Syria
@KenanFreeSyria    13,388       Syria
@UP_food           703          Food
@KeriJSmith        8,838        Breaking news/top stories
@BreakingNews      5,662,866    Breaking news/top stories


News curators
• Think Andy Carvin @acarvin, who was a
“distant witness” of the Arab Spring


Do we have curators in Twitter?
Human
• Topic-unfocused curator (@KeriJSmith)
– Disseminating news articles about diverse topics, usually breaking news/top stories
• Topic-focused curator (@KenanFreeSyria)
– Collecting interesting information with a specific focus, usually a geographic region or a topic

Automatic
• News aggregators (@BreakingNews)
– Collecting news articles (e.g. from RSS feeds) and automatically posting their corresponding headlines and URLs
• Topic-focused aggregators (@UP_food, @RevolutionSyria)
– Automatically disseminating news with a topical focus

Which users do we care about?

Human
• Topic-focused curator (@KenanFreeSyria)
– Collecting interesting information with a specific focus, usually a geographic region or a topic

Automatic
• Topic-focused aggregators (@UP_food, @RevolutionSyria)
– Automatically disseminating news with a topical focus

Manual annotation (200 users)
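[Figure: two pie charts with slices Focused - Human, Focused - Auto and Unfocused. Unfocused users account for 79% in one chart and 95% in the other; the remaining slices are 8% and 13% in the first, 3% and 2% in the second.]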


Automatically finding curators
• Simple rules
– UserFracURL >= 85%: automatic
– UserSectionsQ >= 90%: unfocused

• Complex model (AUC > 0.90)
– Random forest
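
The two simple rules translate directly into code. A minimal sketch, assuming both features are already computed per user and expressed as fractions in [0, 1]; how `UserFracURL` and `UserSectionsQ` are derived is not shown on the slide, and the function name is hypothetical.

```python
def classify_user(user_frac_url, user_sections_q):
    """Apply the two threshold rules independently.

    user_frac_url:   value of the UserFracURL feature (0.0 to 1.0)
    user_sections_q: value of the UserSectionsQ feature (0.0 to 1.0)
    Returns a (nature, focus) pair of labels.
    """
    nature = "automatic" if user_frac_url >= 0.85 else "human"
    focus = "unfocused" if user_sections_q >= 0.90 else "focused"
    return nature, focus

# Example with made-up feature values:
print(classify_user(0.95, 0.40))  # ('automatic', 'focused')
```

The rules act on separate axes, so each user receives one label for human versus automatic and one for focused versus unfocused.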


Take-home messages
• Twitter users quickly shift topics
– But sometimes return to a topic

• There are excellent news curators
in Twitter
– Although many of them are automatic

• Automatic systems can help
identify curators and follow-up news

Topic 3 of 4

Analysis of TV news via
closed captions
Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan:
Says Who? Automatic Text-based Content Analysis of Television News
Workshop on Mining Unstructured Data Using NLP (UnstructureNLP).
Burlingame, CA, USA. October 2013.

Acquiring closed captions
• We used data from
Yahoo's IntoNow
– 140 TV channels
– 2MB/channel/day
– Jan-Jun 2012

• Internet Archive:
http://archive.org/details/tv


Text pre-processing: input
[1339302660] WHAT MORE CAN YOU ASK FOR?
[1339302662] >> THIS IS WHAT NBA
[1339302663] BASKETBALL IS ABOUT
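
Input in this shape can be parsed before tagging. The sketch below rests on two assumptions that the slide does not state: the bracketed number is a Unix timestamp in seconds, and a leading `>>` marks a change of speaker. Function names are hypothetical.

```python
import re

CAPTION_LINE = re.compile(r"^\[(\d+)\]\s*(.*)$")

def parse_captions(lines):
    """Yield (timestamp, speaker_change, text) for each well-formed line."""
    for line in lines:
        match = CAPTION_LINE.match(line.strip())
        if not match:
            continue  # skip lines without a leading [timestamp]
        timestamp, text = int(match.group(1)), match.group(2)
        speaker_change = text.startswith(">>")
        if speaker_change:
            text = text[2:].lstrip()
        yield timestamp, speaker_change, text

def join_by_speaker(lines):
    """Concatenate consecutive lines, starting a new segment at each '>>'."""
    segments = []
    for timestamp, speaker_change, text in parse_captions(lines):
        if speaker_change or not segments:
            segments.append({"start": timestamp, "text": text})
        else:
            segments[-1]["text"] += " " + text
    return segments

sample = [
    "[1339302660] WHAT MORE CAN YOU ASK FOR?",
    "[1339302662] >> THIS IS WHAT NBA",
    "[1339302663] BASKETBALL IS ABOUT",
]
for segment in join_by_speaker(sample):
    print(segment["start"], segment["text"])
```

On the three sample lines this yields two segments, one per speaker, ready for case normalization and part-of-speech tagging.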


Text pre-processing: output
What/WP more/JJR can/MD you/PRP ask/VB
for/IN ?/. This/DT is/VBZ what/WDT
NBA/NNP [entity: National_Basketball_Association]
basketball/NN is/VBZ about/IN ./. [sentiment: 0.0]

Clusters by non-entity words
General news
Sport news
General + entertainment
Sports
Sports
General news
Sports
General + sports
Business + sports
Business + sports

Clusters by linguistic style
General + business

General + entertainment

Sports

Sorting by average sentiment
Sentiment scores on TV captions go from neutral to positive.

Strong positive words are used more than strong negative words?

Mixed

Sports

Automatic TV ↔ online news matching
• Same pre-processing is done over
articles on the Yahoo! News website
• Genre classification (general, sports,
business, entertainment) by
– Data from TV guide for closed captions
– Section in Yahoo! News for web news


Coverage by prominence
TV networks with more resources can cover more stories.
Some prefer to cover only prominent ones, others want some niche content.

US military to probe “marine abuse video”
January 12th, 2012

Breaking stories vs news matching

Average story duration

Sports stories tend to have a longer life

Newsmakers
• By professional
activity
– Sentiments
– Distributions

• In relationship to news
providers
• Everybody is a
(potential) entertainer

Distributions of mentions
per person

Take-home messages
• Closed captions are a goldmine of
data for content analysis
• Automatic content analysis is
feasible up to a certain extent
– But we still need to learn to use it

• Reduce subjectivity when trying to
answer some research questions

Topic 4 of 4

Biases in online news in
international news media
Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas:
Social Media News Communities: Gatekeeping, Coverage, and Statement Bias
In Proc. of Conference on Information and Knowledge Management (short paper).
Burlingame, CA, USA, October 2013.

Jonathan Stray
The Atlantic, Feb 2013

Wei-Hao Lin
PhD Thesis, CMU 2008

Selection bias

Coverage bias

Statement bias


Goal: discover bias in news media
• 60+ news sources in English
– BBC, CNN, Fox, Time, UPI, Herald Sun, Times
of India, Euro News, DW English, etc.

• Follow news through RSS and Twitter
• Collect tweets pointing to news
• No a-priori information on conflicts or
divisions → unsupervised methods

Method
• “Community” of a news source
– Users who tweeted at least 3 articles from
that source in the last 3 days

• Collect all articles posted by each
– News source
– Community of a news source

• Compute distances and project in 2D
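
The community definition above can be sketched as follows. This is an illustration only: the `(user, timestamp, source, article_url)` tuple layout is hypothetical, timestamps are assumed to be in seconds, and "at least 3 articles" is read here as 3 distinct articles.

```python
from collections import defaultdict

DAY = 86400

def communities(tweets, now, min_articles=3, window_days=3):
    """Map each news source to its community of users.

    A user belongs to a source's community if they tweeted at least
    `min_articles` distinct articles from it in the last `window_days` days.
    """
    seen = defaultdict(set)  # (source, user) -> distinct article URLs
    for user, ts, source, url in tweets:
        if now - window_days * DAY <= ts <= now:
            seen[(source, user)].add(url)
    result = defaultdict(set)
    for (source, user), urls in seen.items():
        if len(urls) >= min_articles:
            result[source].add(user)
    return dict(result)
```

Because the window slides with `now`, a user can enter and leave a community from one day to the next.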

Coverage bias
Measure the distribution of
the number of words given
to each news story.
Compute the 1-divergence
between each pair of
sources.
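
The slide does not spell out which divergence is meant, so the sketch below swaps in the Jensen-Shannon divergence as one reasonable, symmetric choice. That choice, the function names and the dictionary layout (story identifier to word count, with at least one word per source) are all assumptions.

```python
import math

def normalize(word_counts):
    """Turn {story: words given to it} into a probability distribution."""
    total = sum(word_counts.values())
    return {story: n / total for story, n in word_counts.items()}

def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two sources' coverage.

    0.0 means identical coverage distributions; 1.0 means the two sources
    devote their words to entirely different stories.
    """
    p, q = normalize(counts_a), normalize(counts_b)
    stories = p.keys() | q.keys()
    m = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in stories}

    def kl_to_mixture(x):
        return sum(x[s] * math.log2(x[s] / m[s]) for s in x if x[s] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

Applying this to every pair of sources gives a distance matrix that can then be projected in 2D.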

Coverage bias
In Twitter, coverage bias (as measured by number of tweets) is evident while selection bias is not.

Coverage bias and partisan politics

Future work: find patterns like this?
“perusing TIME’s covers reveals countless
examples of the publication tempting the world
with critical events, ideas or figures, while
dangling before Americans the chance to indulge
in trite self-absorption” – David Harris Gershon


Take-home messages
• Encouraging results on fully
unsupervised discovery
– But results are quite shallow for now

• It is frustratingly difficult to
discover bias and framing
– We are not happy with only quantifying or
analyzing known conflicts


[Figure: Venn diagram of Journalism needs, Data availability and Computing capabilities]

Finding common ground is not easy.
[Figure: Venn diagram of Journalism needs, Data availability and Computing capabilities, with regions labeled AI-complete problems, Poorly planned projects and Overexploited]

Data analysis is easy, fun and addictive.
Without good research questions,
it is often useless.
Computer science to support a key function
of society = Applied computing at its best!


Thank you!
Carlos Castillo · chato@acm.org

Shouldn't traditional news outlets
resent social media?
• We did not take their lunch
• I am not pointing fingers but …

• … online classified ads are to “blame”

Data sample from Al Jazeera English
• October 2012
– ≈ 3M visits
– ≈ 606 articles
– ≈ 200K social media reactions
• Open Source Web Analytics beacon
– High-performance process (S4+Cassandra).


News: less shared on Facebook
In-Depth: more shared on Facebook

Examples (mid-2012)
Decreasing (78%):
● Almost all breaking news
● Sometimes delayed due to timezone differences, e.g. Hurricane Sandy

Steady or Increasing (12%):
● Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest
● Articles updated with supporting content

Rebounding (10%):
● Articles picked up by external sources or social media (typically single source of traffic)
● Background articles to new developments

Predicting traffic and shelf-life online
has a long history
• Predicting long-term behavior and
half-life from short-term observations
– Observations = comments, visits, votes, …
– Behavior = total comments, total visits, …
– 10+ papers specifically on web traffic

• Bit.ly (2011, 2012)
– Studies half-life per topic and platform


Results (shelf-life prediction)

Larger improvements for In-Depth articles.
Still, this is a 12-hour error in predicting something
with an average of 48-72 hours.

Social media users engaged with news
• To what extent can they contribute to
the journalistic process?
• What kind of roles do they play?
• 47% of journalists from 15 countries
(n=478) said Twitter is a source of
information for them [source]

Manual annotation
• 200 users in 20 articles
• Crowdsourcing workers see:
– Title of news article
– Profile and description of user
– Sample of 10 tweets of the user


In relation to news providers

Projection in 2D of the second component of a 3-way decomposition
with a 3x2x2 core of the tensor of sources x newsmakers x style.
The first component separates football from basketball.

Text pre-processing: steps
• Determine paragraph boundaries
– Speech change markers, heuristics based
on text and time

• Apply a part-of-speech tagger
– Stanford NLP tagger

• Find named entity mentions
• Apply sentiment analysis

News sources
• Non-entity words
• Linguistic style
– Prevalence of different part-of-speech
classes

• Overall sentiment
• Coverage
• Timeliness

News matching (model)
• Target: {same story, different story}
• Example features:
– Dot product of aboutness scores of resolved
entities in the title, body
– Jaccard coefficient of unresolved entities in
the title, body

• Logistic regression
• 4 models in total, one per genre
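
The two example features and the logistic scoring step can be sketched as below. The weights and bias are hypothetical placeholders that a trained model would supply (one set per genre), and the dictionary layout for aboutness scores (entity to score) is an assumption.

```python
import math

def jaccard(entities_a, entities_b):
    """Jaccard coefficient of two collections of unresolved entity strings."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dot_product(scores_a, scores_b):
    """Dot product of aboutness scores over resolved entities shared by both."""
    return sum(w * scores_b[e] for e, w in scores_a.items() if e in scores_b)

def same_story_probability(features, weights, bias):
    """Logistic regression score: P(same story | features)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Toy example: a TV segment and a web article.
tv_entities = {"National_Basketball_Association": 0.5, "Miami_Heat": 0.8}
web_entities = {"National_Basketball_Association": 0.4}
features = [
    dot_product(tv_entities, web_entities),
    jaccard(["finals", "game 7"], ["finals", "playoffs"]),
]
# Hypothetical weights; in practice these come from training.
print(same_story_probability(features, weights=[2.0, 1.5], bias=-1.0))
```

A pair is labeled "same story" when the probability exceeds a chosen threshold, which trades precision against recall.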
