Social Media News Mining
Carlos Castillo

Gilad Lotan

@ChaToX

@gilgul
Social Media News Mining & Automatic Content Analysis of News
Carlos Castillo – Qatar Computing Research Institute

Nov 14th, 2013
Outline
• Social media around news
1. Predictive analytics using social media
2. Crowds and curators

• Automatic content analysis of news
3. TV news via closed captions
4. Online news in international media

Carlos Castillo – chato@acm.org
http://www.chato.cl/research/

Communication scholars
vs. Computer scientists
• Media and communication scholars
– Start from high-level questions

• Computer scientists
– Start from low-level observations

• We need to find a middle ground
– To a large extent, we are still not there
– I am certainly still not there
Collaborators
• Gianmarco de Francisci Morales – Yahoo!
• Mohammed El-Haddad – Al Jazeera
• Sandra González-Bailón – University of Pennsylvania
• Nasir Khan – Al Jazeera
• Mounia Lalmas – Yahoo!
• Janette Lehmann – Pompeu Fabra University & Yahoo!
• Marcelo Mendoza – Yahoo!
• Jürgen Pfeffer – CMU
• Matt Stempeck – MIT Civic Media
• Diego Sáez-Trumper – Pompeu Fabra University
• Ethan Zuckerman – MIT Civic Media
Topic 1 of 4

Predictive analytics using
social media
Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck
Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
To appear in Proc. of Computer Supported Cooperative Work and Social Computing (CSCW).
Baltimore, MD, USA. February 2014.
See also: demo at http://fast.qcri.org/
Pirates abduct ship’s crew off Nigerian coast
October 17th, 2012
Usage analysis (in news) online
• Aikat (1998)
– Bursts, short dwell times, weekday != weekend

• Crane and Sornette (2008), Yang and Leskovec
(2011), Lehmann et al. (2012)
– Behavioral classes of attention online

• Lotan, Gaffney, and Meyer (SocialFlow, 2011)
– Al Jazeera, BBC, CNN, The Economist, Fox News, NY
Times

• … and many others!
News examples
• Dozens killed in India bus-crash blaze (Oct 30th, 2013)
• Kenyan army admits soldiers looted mall (Oct 30th, 2013)

In-Depth examples
• Sex selective abortions worry Azerbaijanis (Oct 29th, 2013)
• Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)
News: intense first hour
In-Depth: longer shelf-life
Average visitation/sharing profiles
[Figure: two panels, News and In-Depth]
Types of news visitation profiles (12 h)
Decreasing (78%)
Steady (9%)
Increasing (3%)
Rebounding (10%)

Prediction of visits
• Short-term traffic is to a large extent
correlated with long-term traffic
• Social media signals are correlated with
traffic and shelf-life
More reactions → more traffic
More discussion → longer shelf-life

• Can we predict 7 days after 30 minutes?
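A minimal sketch of the kind of model these bullets point to, not the authors' actual system: ordinary least squares in log space, predicting visits after 7 days from the visits and social media reactions seen in the first 30 minutes. The function names, the choice of two features, and the numbers in `history` are all assumptions made up for illustration.

```python
import math

def fit_linear(X, y):
    """Ordinary least squares via the normal equations (an intercept is added)."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def features(visits_30min, reactions_30min):
    # log1p keeps zero counts finite
    return [math.log1p(visits_30min), math.log1p(reactions_30min)]

# Hypothetical past articles: (visits in 30 min, reactions in 30 min, visits after 7 days)
history = [(120, 4, 2500), (300, 20, 9000), (50, 1, 800),
           (800, 60, 30000), (200, 5, 4000), (90, 10, 2600)]
weights = fit_linear([features(v, r) for v, r, _ in history],
                     [math.log1p(total) for _, _, total in history])

def forecast_7_days(visits_30min, reactions_30min):
    """Predicted total visits after 7 days for a new article."""
    x = features(visits_30min, reactions_30min)
    log_total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return math.expm1(log_total)
```

Working in log space is a common choice for heavy-tailed traffic counts; whether it matches the model behind the demo is not stated in the slides.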
Results (traffic predictions)
Improved predictions using social media variables
http://fast.qcri.org/
Predictions are updated as new
information arrives. Predictive
models are re-trained every 24
hours. Traffic to many (but not all)
articles is easy to predict.
Don't remove over-achievers,
promote under-achievers.

http://fast.qcri.org/
Take-home messages
• Decreasing, Steady or Increasing, Rebounding
– Roughly 80:10:10 ratio in first 12 hours

• News vs In-Depth: different behavior
– News pieces die out rapidly on the web
– In-Depth pieces live longer

• Visit forecasting can help make better-informed
editorial decisions

Topic 2 of 4

News crowds and news
curators in social media
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Transient News Crowds in Social Media
In Proc. of International Conference on Weblogs and Social Media.
Cambridge, MA, USA, July 2013. See also: blog post.
Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Finding News Curators in Twitter
Social News on the Web (SNOW) workshop.
Rio de Janeiro, Brazil, May 2013. See also: blog post.
Social media users that are highly engaged with news
Transient News Crowds

Empirical results
• Experiment with articles in BBC and AJE
• People who tweeted an article within 6 hours of
publication → news crowd
– Follow the crowd for one week
– Divide time in 12-hour slices

• Most crowds disperse rapidly
– They tweeted once about the same thing
– Now they tweet about different things

• Some crowds re-group later
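The crowd definition above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `(user, timestamp, url)` tuple layout and the function names are hypothetical; only the 6-hour window, the one-week follow-up, and the 12-hour slices come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

CROWD_WINDOW = timedelta(hours=6)    # tweeted the article within 6 hours of publication
FOLLOW_PERIOD = timedelta(days=7)    # follow the crowd for one week
SLICE_LENGTH = timedelta(hours=12)   # divide time in 12-hour slices

def news_crowd(article_url, published_at, tweets):
    """Users who tweeted the article within 6 hours of its publication.
    `tweets` is an iterable of (user, timestamp, url) tuples (assumed layout)."""
    return {user for user, ts, url in tweets
            if url == article_url
            and published_at <= ts <= published_at + CROWD_WINDOW}

def crowd_slices(crowd, published_at, tweets):
    """Group the crowd's tweets from the following week into 12-hour slices.
    Returns {slice_index: [tweets]}, with slice 0 starting at publication."""
    slices = defaultdict(list)
    for user, ts, url in tweets:
        offset = ts - published_at
        if user in crowd and timedelta(0) <= offset < FOLLOW_PERIOD:
            slices[offset // SLICE_LENGTH].append((user, ts, url))
    return dict(slices)
```

Comparing what each slice links to against the original article is then one way to see whether a crowd has dispersed or re-grouped.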
Syria allows UN to step up food aid
13 Jan 2013

French troops launch ground combat in Mali
13 Jan 2013
How do we find the related ones?
• Machine-learning approach
• Important attributes
– Text similarity to original story
– Exclusivity of history to this crowd

• Finds 14% to 72% of related stories
automatically (@ 2/3 precision)

Application to tracking a story

Focus on articles → focus on users
Example: which users with a large number of followers tweeted
Syria allows UN to step up food aid (16 Jan 2013)

Twitter user       Followers    Tweets about ...
@RevolutionSyria   88,122       Syria
@KenanFreeSyria    13,388       Syria
@UP_food           703          Food
@KeriJSmith        8,838        Breaking news/top stories
@BreakingNews      5,662,866    Breaking news/top stories

News curators
• Think Andy Carvin @acarvin, who was a
“distant witness” of the Arab Spring

Do we have curators in Twitter?
Human
• Topic-unfocused curator
– Disseminating news articles about diverse topics, usually breaking news/top stories
– Example: @KeriJSmith
• Topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria

Automatic
• News aggregators
– Collecting news articles (e.g. from RSS feeds) and automatically posting their corresponding headlines and URLs
– Example: @BreakingNews
• Topic-focused aggregators
– Automatically disseminating news with topical focus
– Examples: @UP_food, @RevolutionSyria
Which users do we care about?

Human
• Topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria

Automatic
• Topic-focused aggregators
– Automatically disseminating news with topical focus
– Examples: @UP_food, @RevolutionSyria
Manual annotation (200 users)
[Figure: category labels Focused - Human, Focused - Auto, Unfocused (each shown twice); percentages shown: 8%, 13%, 3%, 2%, 95%, 79%]
Automatically finding curators
• Simple rules
– UserFracURL >= 85%: automatic
– UserSectionsQ >= 90%: unfocused

• Complex model (AUC > 0.90)
– Random forest
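The two simple rules can be written down directly. In this sketch the inputs are assumed to be precomputed fractions in [0, 1]; how UserFracURL and UserSectionsQ are computed is not given on the slide, and treating the complement of each rule as "human" or "focused" is an assumption of this example.

```python
def classify_user(user_frac_url, user_sections_q):
    """Apply the two threshold rules to one user.

    user_frac_url   -- value of the UserFracURL feature, as a fraction in [0, 1]
    user_sections_q -- value of the UserSectionsQ feature, as a fraction in [0, 1]
    Returns a (kind, focus) pair of labels.
    """
    kind = "automatic" if user_frac_url >= 0.85 else "human"      # rule 1
    focus = "unfocused" if user_sections_q >= 0.90 else "focused"  # rule 2
    return kind, focus
```

For example, `classify_user(0.95, 0.20)` yields `("automatic", "focused")`, the category the slides call a topic-focused aggregator.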

Take-home messages
• Twitter users quickly shift topics
– But sometimes return to a topic

• There are excellent news curators
in Twitter
– Although many of them are automatic

• Automatic systems can help
identify curators and follow-up news
Topic 3 of 4

Analysis of TV news via
closed captions
Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan:
Says Who? Automatic Text-based Content Analysis of Television News
Workshop on Mining Unstructured Data Using NLP (UnstructureNLP).
Burlingame, CA, USA. October 2013.
Acquiring closed captions
• We used data from
Yahoo's IntoNow
– 140 TV channels
– 2MB/channel/day
– Jan-Jun 2012

• Internet Archive:
http://archive.org/details/tv

Text pre-processing: input
[1339302660] WHAT MORE CAN YOU ASK FOR?
[1339302662] >> THIS IS WHAT NBA
[1339302663] BASKETBALL IS ABOUT

Text pre-processing: output
What/WP more/JJR can/MD you/PRP ask/VB
for/IN ?/. This/DT is/VBZ what/WDT
NBA/NNP [entity: National_Basketball_Association]
basketball/NN is/VBZ about/IN ./. [sentiment: 0.0]
Clusters by non-entity words
[Figure: cluster labels: General news, Sport news, General + entertainment, Sports, General + sports, Business + sports]
Clusters by linguistic style
[Figure: cluster labels: General + business, General + entertainment, Sports]
Sorting by average sentiment
Sentiment scores on TV captions go from neutral to positive.

Strong positive words are used more than strong negative words?

[Figure labels: Mixed, Sports]
Automatic TV ↔ online news matching
• Same pre-processing is done over
articles on the Yahoo! News website
• Genre classification (general, sports,
business, entertainment) by
– Data from TV guide for closed captions
– Section in Yahoo! News for web news

Coverage by prominence
TV networks with more resources can cover more stories.
Some prefer to cover only prominent ones, others want some niche content.
US military to probe “marine abuse video”
January 12th, 2012
Breaking stories vs news matching
Average story duration

Sports stories tend to have a longer life
Newsmakers
• By professional activity
– Sentiments
– Distributions

• In relationship to news providers
• Everybody is a (potential) entertainer

Distributions of mentions per person
By professional activity
Athletes or entertainers?
Politicians or entertainers?
Take-home messages
• Closed captions are a goldmine of
data for content analysis
• Automatic content analysis is
feasible to a certain extent
– But we still need to learn to use it

• Reduce subjectivity when trying to
answer some research questions
Topic 4 of 4

Biases in online news in
international news media
Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas:
Social Media News Communities: Gatekeeping, Coverage, and Statement Bias
In Proc. of Conference on Information and Knowledge Management (short paper).
Burlingame, CA, USA, October 2013.
Jonathan Stray
The Atlantic, Feb 2013

Wei-Hao Lin
PhD Thesis, CMU 2008
Selection bias

Coverage bias

Statement bias

Goal: discover bias in news media
• 60+ news sources in English
– BBC, CNN, Fox, Time, UPI, Herald Sun, Times
of India, Euro News, DW English, etc.

• Follow news through RSS and Twitter
• Collect tweets pointing to news
• No a-priori information on conflicts or
divisions → unsupervised methods
Method
• “Community” of a news source
– Users who tweeted at least 3 articles from
that source in the last 3 days

• Collect all articles posted by each
– News source
– Community of a news source

• Compute distances and project in 2D
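A minimal sketch of the community definition, under stated assumptions: the `(user, source, article_url, timestamp)` tuple layout is hypothetical, "3 articles" is read as 3 distinct URLs, and the Jaccard coefficient is used here as one plausible overlap measure (the slide only says "distances"). The 2D projection step is not shown.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def communities(tweets, now, min_articles=3, window=timedelta(days=3)):
    """Map each news source to the users who tweeted at least `min_articles`
    distinct articles from it during the last `window`.
    `tweets` holds (user, source, article_url, timestamp) tuples (assumed layout)."""
    seen = defaultdict(set)                 # (source, user) -> article URLs
    for user, source, url, ts in tweets:
        if now - window <= ts <= now:
            seen[(source, user)].add(url)
    result = defaultdict(set)
    for (source, user), urls in seen.items():
        if len(urls) >= min_articles:
            result[source].add(user)
    return dict(result)

def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

With the communities in hand, `1 - jaccard(c1, c2)` gives a distance between two sources that any standard 2D embedding could take as input.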
Community overlaps (J > 0.03)
Selection bias
Coverage bias
Measure the distribution of the number of words given to each news story.
Compute the 1-divergence between each pair of sources.
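As an illustration of comparing two sources by how they distribute words across stories, the sketch below uses the Jensen-Shannon divergence. That choice is an assumption of this example: the slide names the measure only as "1-divergence", and the function names and story identifiers here are hypothetical.

```python
import math

def word_share(words_per_story, stories):
    """Fraction of a source's words devoted to each story in `stories`
    (0 for stories the source did not cover)."""
    total = sum(words_per_story.get(s, 0) for s in stories)
    if total == 0:
        raise ValueError("source has no words on any of the given stories")
    return [words_per_story.get(s, 0) / total for s in stories]

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs: 0 for identical
    distributions, 1 for distributions with disjoint support."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical word counts per story for two sources
stories = ["syria", "mali", "india"]
source_a = word_share({"syria": 800, "mali": 200}, stories)
source_b = word_share({"syria": 200, "mali": 300, "india": 500}, stories)
coverage_distance = js_divergence(source_a, source_b)
```

Jensen-Shannon is symmetric and stays finite when one source ignores a story the other covers, which is why it is used for this sketch.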
Coverage bias
In Twitter, coverage bias (as measured by number of tweets) is evident while selection bias is not.
Coverage bias and partisan politics
Sentiment analysis
Future work: find patterns like this?
“perusing TIME’s covers reveals countless
examples of the publication tempting the world
with critical events, ideas or figures, while
dangling before Americans the chance to indulge
in trite self-absorption” – David Harris Gershon

Take-home messages
• Encouraging results on fully
unsupervised discovery
– But results are quite shallow for now

• It is frustratingly difficult to
discover bias and framing
– We are not happy with only quantifying or
analyzing known conflicts

Closing remarks
Journalism needs
Data availability
Computing capabilities
Finding common ground is not easy.

Journalism needs
Data availability
Computing capabilities

AI-complete problems
Poorly planned projects
Overexploited
Data analysis is easy, fun and addictive.
Without good research questions,
it is often useless.
Computer science to support a key function
of society = Applied computing at its best!

Thank you!
Carlos Castillo · chato@acm.org
http://www.chato.cl/research/
Shouldn't traditional news outlets
resent social media?
• We did not take their lunch
• I am not pointing fingers but …

• … online classified ads are to “blame”
Data sample from Al Jazeera English
• October 2012
≈ 3M visits
≈ 606 articles
≈ 200K social media reactions
• Open Source Web Analytics beacon
– High-performance process (S4+Cassandra).

News: less shared on Facebook
In-Depth: more shared on Facebook
Examples (mid-2012)
Decreasing (78%):
• Almost all breaking news
• Sometimes delayed due to timezone differences, e.g. Hurricane Sandy

Steady or Increasing (12%):
• Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest
• Articles updated with supporting content

Rebounding (10%):
• Articles picked up by external sources or social media (typically single source of traffic)
• Background articles to new developments
Predicting traffic and shelf-life online has a long history
• Predicting long-term behavior and
half-life from short-term observations
– Observations = comments, visits, votes, …
– Behavior = total comments, total visits, …
– 10+ papers specifically on web traffic

• Bit.ly (2011, 2012)
– Studies half-life per topic and platform

Results (shelf-life prediction)

Larger improvements for In-Depth articles
Still, this is a 12-hour error in predicting something that averages 48-72 hours
Social media users engaged with news
• To what extent can they contribute to
the journalistic process?
• What kind of roles do they play?
• 47% of journalists from 15 countries
(n=478) said Twitter is a source of
information for them [source]
Manual annotation
• 200 users in 20 articles
• Crowdsourcing workers see:
– Title of news article
– Profile and description of user
– Sample of 10 tweets of the user

In relation to news providers

Projection in 2D of the second component of a 3-way decomposition
with a 3x2x2 core of the tensor of sources x newsmakers x style.
The first component separates football from basketball.
Text pre-processing: steps
• Determine paragraph boundaries
– Speech change markers, heuristics based
on text and time

• Apply a part-of-speech tagger
– Stanford NLP tagger

• Find named entity mentions
• Apply sentiment analysis
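The first step, determining paragraph boundaries, can be sketched for caption lines in the `[timestamp] TEXT` format shown on the input slide. This is an assumption-laden illustration: the 5-second pause threshold and the function name are invented here, and the heuristics actually used are not described in the slides. Tagging, entity resolution, and sentiment analysis are not shown.

```python
import re

LINE = re.compile(r"^\[(\d+)\]\s*(.*)$")   # "[unix_timestamp] caption text"

def paragraphs(caption_lines, max_gap=5):
    """Split caption lines into paragraphs at '>>' speaker-change markers,
    or when more than `max_gap` seconds pass between consecutive lines."""
    result, current, last_ts = [], [], None
    for raw in caption_lines:
        match = LINE.match(raw.strip())
        if not match:
            continue                        # skip lines without a timestamp
        ts, text = int(match.group(1)), match.group(2)
        new_speaker = text.startswith(">>")
        long_pause = last_ts is not None and ts - last_ts > max_gap
        if current and (new_speaker or long_pause):
            result.append(" ".join(current))
            current = []
        text = text.lstrip("> ").strip()    # drop the marker itself
        if text:
            current.append(text)
        last_ts = ts
    if current:
        result.append(" ".join(current))
    return result
```

On the three sample lines from the input slide this produces two paragraphs, one before the `>>` marker and one after it.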
News sources
• Non-entity words
• Linguistic style
– Prevalence of different part-of-speech
classes

• Overall sentiment
• Coverage
• Timeliness
News matching (model)
• Target: {same story, different story}
• Example features:
– Dot product of aboutness scores of resolved
entities in the title, body
– Jaccard coefficient of unresolved entities in
the title, body

• Logistic regression
• 4 models in total, one per genre
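The example features and the scoring step can be sketched as below. The dictionary keys, the pairing of a caption segment against an article's title and body, and the weights passed to the scoring function are all hypothetical; the slides do not give the full feature set or the trained coefficients.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient of two collections of entity strings."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dot(scores_a, scores_b):
    """Dot product of two sparse {entity: aboutness score} mappings."""
    return sum(v * scores_b.get(k, 0.0) for k, v in scores_a.items())

def match_features(caption, article):
    """Features for one (TV caption segment, web article) pair.
    The dictionary keys are an assumed layout for this example."""
    return [
        dot(caption["resolved"], article["title_resolved"]),
        dot(caption["resolved"], article["body_resolved"]),
        jaccard(caption["unresolved"], article["title_unresolved"]),
        jaccard(caption["unresolved"], article["body_unresolved"]),
    ]

def same_story_probability(features, weights, bias):
    """Logistic regression scoring: sigmoid of a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With one model per genre, a pair would be scored with the weights of the genre assigned to it and labeled "same story" above some probability threshold.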

More Related Content

What's hot

Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...
Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...
Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...Artificial Intelligence Institute at UofSC
 
Public Health Crisis Analytics for Gender Violence
Public Health Crisis Analytics for Gender ViolencePublic Health Crisis Analytics for Gender Violence
Public Health Crisis Analytics for Gender ViolenceHemant Purohit
 
Real-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodReal-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodMuhammad Imran
 
Automatically Rank Social Media Requests for Emergency Services using Service...
Automatically Rank Social Media Requests for Emergency Services using Service...Automatically Rank Social Media Requests for Emergency Services using Service...
Automatically Rank Social Media Requests for Emergency Services using Service...Hemant Purohit
 
NCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
NCSU invited talk: Leveraging Social Media for Tourism Marketplace CoordinationNCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
NCSU invited talk: Leveraging Social Media for Tourism Marketplace CoordinationArtificial Intelligence Institute at UofSC
 
Applying citizen science model to disaster management
Applying citizen science model to disaster managementApplying citizen science model to disaster management
Applying citizen science model to disaster managementW. David Stephenson
 
Processing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyProcessing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyMuhammad Imran
 
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Muhammad Imran
 
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...Nalaka Gunawardene
 
Evolution of the Humanitarian Data Ecosystem
Evolution of the Humanitarian Data EcosystemEvolution of the Humanitarian Data Ecosystem
Evolution of the Humanitarian Data EcosystemSara-Jayne Terp
 
Offline Activism - How successful activism facilitate social media
Offline Activism - How successful activism facilitate social mediaOffline Activism - How successful activism facilitate social media
Offline Activism - How successful activism facilitate social mediaWilson Fung
 
Presentation
PresentationPresentation
Presentationospinaan
 
Presentation
PresentationPresentation
Presentationokunka
 
Presentation
PresentationPresentation
PresentationSprowlt
 
Iscram 2007 humanitarian-foss case disaster management
Iscram 2007 humanitarian-foss case disaster managementIscram 2007 humanitarian-foss case disaster management
Iscram 2007 humanitarian-foss case disaster managementChamindra de Silva
 
Emergency Risk Communication
Emergency Risk CommunicationEmergency Risk Communication
Emergency Risk CommunicationHeather Blanchard
 
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...Keith Powell
 

What's hot (20)

Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...
Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...
Crisis Mapping, Citizen Sensing and Social Media Analytics: Leveraging Citize...
 
Public Health Crisis Analytics for Gender Violence
Public Health Crisis Analytics for Gender ViolencePublic Health Crisis Analytics for Gender Violence
Public Health Crisis Analytics for Gender Violence
 
Real-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social GoodReal-Time Processing of Social Media Content for Social Good
Real-Time Processing of Social Media Content for Social Good
 
Automatically Rank Social Media Requests for Emergency Services using Service...
Automatically Rank Social Media Requests for Emergency Services using Service...Automatically Rank Social Media Requests for Emergency Services using Service...
Automatically Rank Social Media Requests for Emergency Services using Service...
 
NCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
NCSU invited talk: Leveraging Social Media for Tourism Marketplace CoordinationNCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
NCSU invited talk: Leveraging Social Media for Tourism Marketplace Coordination
 
Applying citizen science model to disaster management
Applying citizen science model to disaster managementApplying citizen science model to disaster management
Applying citizen science model to disaster management
 
Processing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A SurveyProcessing Social Media Messages in Mass Emergency: A Survey
Processing Social Media Messages in Mass Emergency: A Survey
 
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
Coordinating Human and Machine Intelligence to Classify Microblog Communica0o...
 
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...
Social Media in Sri Lanka: Do Science and Reason Stand a Chance? - Nalaka Gun...
 
Evolution of the Humanitarian Data Ecosystem
Evolution of the Humanitarian Data EcosystemEvolution of the Humanitarian Data Ecosystem
Evolution of the Humanitarian Data Ecosystem
 
ICCM 2014 -- Ignite Talks -- Session 2
ICCM 2014 -- Ignite Talks -- Session 2ICCM 2014 -- Ignite Talks -- Session 2
ICCM 2014 -- Ignite Talks -- Session 2
 
Presentation
PresentationPresentation
Presentation
 
Presentation
PresentationPresentation
Presentation
 
Offline Activism - How successful activism facilitate social media
Offline Activism - How successful activism facilitate social mediaOffline Activism - How successful activism facilitate social media
Offline Activism - How successful activism facilitate social media
 
Presentation
PresentationPresentation
Presentation
 
Presentation
PresentationPresentation
Presentation
 
Presentation
PresentationPresentation
Presentation
 
Iscram 2007 humanitarian-foss case disaster management
Iscram 2007 humanitarian-foss case disaster managementIscram 2007 humanitarian-foss case disaster management
Iscram 2007 humanitarian-foss case disaster management
 
Emergency Risk Communication
Emergency Risk CommunicationEmergency Risk Communication
Emergency Risk Communication
 
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...
Humanitarian Diplomacy in the Digital Age: Analysis and use of digital inform...
 

Similar to Social Media News Mining and Automatic Content Analysis of News

Characterizing the Life Cycle of Online News Stories Using Social Media React...
Characterizing the Life Cycle of Online News Stories Using Social Media React...Characterizing the Life Cycle of Online News Stories Using Social Media React...
Characterizing the Life Cycle of Online News Stories Using Social Media React...Carlos Castillo (ChaTo)
 
Information Verification During Natural Disasters
Information Verification During Natural DisastersInformation Verification During Natural Disasters
Information Verification During Natural DisastersCarlos Castillo (ChaTo)
 
Using twitter for Scholarly Purposes 2013
Using twitter for Scholarly Purposes 2013Using twitter for Scholarly Purposes 2013
Using twitter for Scholarly Purposes 2013serge noiret
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?Axel Bruns
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?Axel Bruns
 
British Library Labs Presentation at UK Medical Heritage Library Live Lab
British Library Labs Presentation at UK Medical Heritage Library Live LabBritish Library Labs Presentation at UK Medical Heritage Library Live Lab
British Library Labs Presentation at UK Medical Heritage Library Live Lablabsbl
 
Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?Filter Bubbles in the Australian Twittersphere?
Filter Bubbles in the Australian Twittersphere?Axel Bruns
 
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA ProjectA Multi-Institutional Approach to ‘Big Social Data’: The TrISMA Project
A Multi-Institutional Approach to ‘Big Social Data’: The TrISMA ProjectAxel Bruns
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?Axel Bruns
 
It's Not the Technology, Stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ M...
It's Not the Technology, Stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ M...It's Not the Technology, Stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ M...
It's Not the Technology, Stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ M...Axel Bruns
 
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...
Beyond the Bubble: A Critical Review of the Evidence for Echo Chambers and Fi...Axel Bruns
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?Axel Bruns
 
Media literacy panel
Media literacy panel Media literacy panel
Media literacy panel Cody Hennesy
 
Using Twitter as a Postgraduate Researcher
Using Twitter as a Postgraduate ResearcherUsing Twitter as a Postgraduate Researcher
Using Twitter as a Postgraduate ResearcherSimon Bishop
 
Research with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical ConsiderationsResearch with Social Media Data: Stewardship & Ethical Considerations
Research with Social Media Data: Stewardship & Ethical ConsiderationsToronto Metropolitan University
 
Gatewatching 11: Echo Chambers? Filter Bubbles? Reviewing the Evidence
Gatewatching 11: Echo Chambers? Filter Bubbles? Reviewing the EvidenceGatewatching 11: Echo Chambers? Filter Bubbles? Reviewing the Evidence
Gatewatching 11: Echo Chambers? Filter Bubbles? Reviewing the EvidenceAxel Bruns
 

Similar to Social Media News Mining and Automatic Content Analysis of News (20)

Crisis Informatics (November 2013)
Crisis Informatics (November 2013)Crisis Informatics (November 2013)
Crisis Informatics (November 2013)
 
Characterizing the Life Cycle of Online News Stories Using Social Media React...
Characterizing the Life Cycle of Online News Stories Using Social Media React...Characterizing the Life Cycle of Online News Stories Using Social Media React...
Characterizing the Life Cycle of Online News Stories Using Social Media React...
 
Dressler Kristof The Right to be Forgotten and Digital Collections
Dressler Kristof The Right to be Forgotten and Digital CollectionsDressler Kristof The Right to be Forgotten and Digital Collections
Dressler Kristof The Right to be Forgotten and Digital Collections
 
Social Media Mining and Retrieval
Social Media Mining and RetrievalSocial Media Mining and Retrieval
Social Media Mining and Retrieval
 
Information Verification During Natural Disasters
Information Verification During Natural DisastersInformation Verification During Natural Disasters
Information Verification During Natural Disasters
 
Using twitter for Scholarly Purposes 2013
Using twitter for Scholarly Purposes 2013Using twitter for Scholarly Purposes 2013
Using twitter for Scholarly Purposes 2013
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?
 
Are Filter Bubbles Real?
Are Filter Bubbles Real?Are Filter Bubbles Real?
Are Filter Bubbles Real?
 
British Library Labs Presentation at UK Medical Heritage Library Live Lab
Social Media News Mining and Automatic Content Analysis of News

  • 1. Social Media News Mining. Carlos Castillo (@ChaToX), Gilad Lotan (@gilgul)
  • 2. Social Media News Mining & Automatic Content Analysis of News. Carlos Castillo – Qatar Computing Research Institute. Nov 14th, 2013
  • 5. Outline • Social media around news: 1. Predictive analytics using social media; 2. Crowds and curators • Automatic content analysis of news: 3. TV news via closed captions; 4. Online news in international media. Carlos Castillo – chato@acm.org http://www.chato.cl/research/
  • 6. Communication scholars vs. Computer scientists • Media and communication scholars – Start from high-level questions • Computer scientists – Start from low-level observations • We need to find a middle ground – To a large extent, we are still not there – I am certainly still not there
  • 7. Collaborators • Gianmarco de Francisci Morales – Yahoo! • Mohammed El-Haddad – Al Jazeera • Sandra González-Bailón – University of Pennsylvania • Nasir Khan – Al Jazeera • Mounia Lalmas – Yahoo! • Janette Lehmann – Pompeu Fabra University & Yahoo! • Marcelo Mendoza – Yahoo! • Jürgen Pfeffer – CMU • Matt Stempeck – MIT Civic Media • Diego Sáez-Trumper – Pompeu Fabra University • Ethan Zuckerman – MIT Civic Media
  • 8. Topic 1 of 4: Predictive analytics using social media. Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck: Characterizing the Life Cycle of Online News Stories Using Social Media Reactions. To appear in Proc. of Computer Supported Cooperative Work and Social Computing (CSCW). Baltimore, MD, USA. February 2014. See also: demo at http://fast.qcri.org/
  • 9. Pirates abduct ship’s crew off Nigerian coast October 17th, 2012
  • 10. Usage analysis (in news) online • Aikat (1998) – Bursts, short dwell times, weekday != weekend • Crane and Sornette (2008), Yang and Leskovec (2011), Lehmann et al. (2012) – Behavioral classes of attention online • Lotan, Gaffney, and Meyer (SocialFlow, 2011) – Al Jazeera, BBC, CNN, The Economist, Fox News, NY Times • … and many others!
  • 12. News examples • Dozens killed in India bus-crash blaze (Oct 30th, 2013) • Kenyan army admits soldiers looted mall (Oct 30th, 2013). In-Depth examples • Sex selective abortions worry Azerbaijanis (Oct 29th, 2013) • Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)
  • 13. News: intense first hour. In-Depth: longer shelf-life.
  • 14. Average visitation/sharing profiles: News, In-Depth
  • 15. Types of news visitation profiles (12 h): Decreasing (78%), Steady (9%), Increasing (3%), Rebounding (10%)
  • 16. Prediction of visits • Short-term traffic is to a large extent correlated with long-term traffic • Social media signals are correlated with traffic and shelf-life: more reactions → more traffic; more discussion → longer shelf-life • Can we predict traffic over 7 days from the first 30 minutes?
  • 19. Predictions are updated as new information arrives. Predictive models are re-trained every 24 hours. Traffic to many (but not all) articles is easy to predict. Don't remove over-achievers, promote under-achievers. http://fast.qcri.org/
  • 20. Take-home messages • Decrease, Stay or Increase, Rebound – Roughly 80:10:10 ratio in first 12 hours • News vs In-Depth: different behavior – News pieces die out rapidly on the web – In-Depth pieces live longer • Visit forecasting can help make more informed editorial decisions
  • 21. Topic 2 of 4: News crowds and news curators in social media. Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Transient News Crowds in Social Media. In Proc. of International Conference on Weblogs and Social Media. Cambridge, MA, USA, July 2013. Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Finding News Curators in Twitter. Social News on the Web (SNOW) workshop. Rio de Janeiro, Brazil, May 2013.
  • 22. Social media users that are highly engaged with news
  • 23. Transient News Crowds
  • 24. Empirical results • Experiment with articles in BBC and AJE • People who tweeted an article within 6 hours of publication → news crowd – Follow the crowd for one week – Divide time in 12-hour slices • Most crowds disperse rapidly – They tweeted once about the same thing – Now they tweet about different things • Some crowds re-group later
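The crowd definition on slide 24 can be sketched in a few lines. This is only an illustration under stated assumptions: the `(user, tweeted_at)` tuple layout, the function names, and the inclusive 6-hour boundary are hypothetical and not taken from the paper; only the 6-hour window, the one-week follow-up and the 12-hour slices come from the slide.

```python
from datetime import datetime, timedelta

CROWD_WINDOW = timedelta(hours=6)   # slide: tweeted within 6 hours of publication
FOLLOW_PERIOD = timedelta(days=7)   # slide: follow the crowd for one week
SLICE_LENGTH = timedelta(hours=12)  # slide: divide time in 12-hour slices


def news_crowd(published_at, article_tweets):
    """Return the users who tweeted the article inside the crowd window.

    article_tweets is an iterable of (user, tweeted_at) pairs; this layout
    is an assumption made for the sketch.
    """
    return {
        user
        for user, tweeted_at in article_tweets
        if timedelta(0) <= tweeted_at - published_at <= CROWD_WINDOW
    }


def slice_index(published_at, tweeted_at):
    """Return the 12-hour slice (0 to 13) of a later tweet, or None outside the week."""
    offset = tweeted_at - published_at
    if offset < timedelta(0) or offset >= FOLLOW_PERIOD:
        return None
    return offset // SLICE_LENGTH
```

Later tweets by crowd members can then be grouped by `slice_index` to see whether the crowd has dispersed or re-grouped in a given slice.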
  • 25. Syria allows UN to step up food aid (13 Jan 2013); French troops launch ground combat in Mali (13 Jan 2013)
  • 26. How do we find the related ones? • Machine-learning approach • Important attributes – Text similarity to original story – Exclusivity of history to this crowd • Finds 14% to 72% of related stories automatically (@ 2/3 precision)
  • 27. Application to tracking a story
  • 28. Focus on articles → focus on users. Example: which users with a large number of followers tweeted "Syria allows UN to step up food aid" (16 Jan 2013)? Twitter user: followers; tweets about ...
      – @RevolutionSyria: 88,122; Syria
      – @KenanFreeSyria: 13,388; Syria
      – @UP_food: 703; Food
      – @KeriJSmith: 8,838; Breaking news/top stories
      – @BreakingNews: 5,662,866; Breaking news/top stories
  • 29. News curators • Think Andy Carvin @acarvin, who was a “distant witness” of the Arab Spring
  • 30. Do we have curators in Twitter?
      – Topic-unfocused, human: Topic-unfocused curator. Disseminating news articles about diverse topics, usually breaking news/top stories. @KeriJSmith
      – Topic-unfocused, automatic: News aggregators. Collecting news articles (e.g. from RSS feeds) and automatically posting their corresponding headlines and URLs. @BreakingNews
      – Topic-focused, human: Topic-focused curator. Collecting interesting information with a specific focus, usually a geographic region or a topic. @KenanFreeSyria
      – Topic-focused, automatic: Topic-focused aggregators. Automatically disseminating news with a topical focus. @UP_food, @RevolutionSyria
  • 31. Which users do we care about?
      – Topic-focused, human: Topic-focused curator. Collecting interesting information with a specific focus, usually a geographic region or a topic. @KenanFreeSyria
      – Topic-focused, automatic: Topic-focused aggregators. Automatically disseminating news with a topical focus. @UP_food, @RevolutionSyria
  • 32. Manual annotation (200 users). [Figure: two pie charts over the classes Focused - Human, Focused - Auto and Unfocused; extracted values: 8%, 13%, 3%, 2%, 95%, 79%]
  • 33. Automatically finding curators • Simple rules – UserFracURL >= 85%: automatic – UserSectionsQ >= 90%: unfocused • Complex model (AUC > 0.90) – Random forest
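The two simple rules on slide 33 amount to a pair of threshold checks. The sketch below assumes that both features are given as fractions in [0, 1], that the rules are applied independently, and that users below a threshold get the complementary labels "human" and "focused"; the slide defines neither the features nor those complements, so treat the function and its labels as hypothetical.

```python
def classify_user(user_frac_url, user_sections_q):
    """Apply the two threshold rules from the slide to one user.

    user_frac_url and user_sections_q stand for the slide's UserFracURL and
    UserSectionsQ features, assumed here to be fractions in [0, 1].
    """
    # Slide rule: UserFracURL >= 85% means automatic (complement assumed human).
    kind = "automatic" if user_frac_url >= 0.85 else "human"
    # Slide rule: UserSectionsQ >= 90% means unfocused (complement assumed focused).
    focus = "unfocused" if user_sections_q >= 0.90 else "focused"
    return kind, focus
```

Under these assumptions, the users of interest on slide 31 are the ones this function labels "focused", whether human or automatic.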
  • 34. Take-home messages • Twitter users quickly shift topics – But sometimes return to a topic • There are excellent news curators in Twitter – Although many of them are automatic • Automatic systems can help identify curators and follow up on news
  • 35. Topic 3 of 4: Analysis of TV news via closed captions. Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan: Says Who? Automatic Text-based Content Analysis of Television News. Workshop on Mining Unstructured Data Using NLP (UnstructureNLP). Burlingame, CA, USA. October 2013.
  • 36. Acquiring closed captions • We used data from Yahoo's IntoNow – 140 TV channels – 2 MB/channel/day – Jan-Jun 2012 • Internet Archive: http://archive.org/details/tv
  • 37. Text pre-processing: input
      [1339302660] WHAT MORE CAN YOU ASK FOR?
      [1339302662] >> THIS IS WHAT NBA
      [1339302663] BASKETBALL IS ABOUT
  • 38. Text pre-processing: output
      What/WP more/JJR can/MD you/PRP ask/VB for/IN ?/.
      This/DT is/VBZ what/WDT NBA/NNP [entity: National_Basketball_Association] basketball/NN is/VBZ about/IN ./. [sentiment: 0.0]
  • 39. Clusters by non-entity words. [Figure: cluster labels General news, Sport news, General + entertainment, Sports, General + sports, Business + sports]
  • 40. Clusters by linguistic style. [Figure: cluster labels General + business, General + entertainment, Sports]
  • 41. Sorting by average sentiment. Sentiment scores on TV captions go from neutral to positive. Strong positive words are used more than strong negative words? [Figure labels: Mixed, Sports]
  • 42. Automatic TV ↔ online news matching • Same pre-processing is done over articles on the Yahoo! News website • Genre classification (general, sports, business, entertainment) by – Data from TV guide for closed captions – Section in Yahoo! News for web news
  • 43. Coverage by prominence TV networks with more resources can cover more stories. Some prefer to cover only prominent ones, others want some niche content.
  • 44. US military to probe “marine abuse video” January 12th, 2012
  • 45. Breaking stories vs news matching
  • 46. Average story duration: Sports stories tend to have a longer life
  • 47. Newsmakers • By professional activity – Sentiments – Distributions • In relationship to news providers • Everybody is a (potential) entertainer. [Figure: Distributions of mentions per person]
  • 51. Take-home messages • Closed captions are a goldmine of data for content analysis • Automatic content analysis is feasible up to a certain extent – But we still need to learn to use it • Reduce subjectivity when trying to answer some research questions
  • 52. Topic 4 of 4 Biases in online news in international news media Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas: Social Media News Communities: Gatekeeping, Coverage, and Statement Bias In Proc. of Conference on Information and Knowledge Management (short paper). Burlingame, CA, USA, October 2013.
  • 54. Jonathan Stray, The Atlantic, Feb 2013. Wei-Hao Lin, PhD Thesis, CMU 2008.
  • 55. Selection bias, Coverage bias, Statement bias
  • 56. Goal: discover bias in news media • 60+ news sources in English – BBC, CNN, Fox, Time, UPI, Herald Sun, Times of India, Euro News, DW English, etc. • Follow news through RSS and Twitter • Collect tweets pointing to news • No a-priori information on conflicts or divisions → unsupervised methods
  • 57. Method • “Community” of a news source – Users who tweeted at least 3 articles from that source in the last 3 days • Collect all articles posted by each – News source – Community of a news source • Compute distances and project in 2D
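The "community" definition on slide 57 can be sketched as a filter over tweet records. The record layout `(user, source, article_url, tweeted_at)`, the function name, and the choice to count distinct articles rather than tweets are assumptions made for illustration; only the thresholds (at least 3 articles, last 3 days) come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def source_community(tweets, source, now, min_articles=3, window=timedelta(days=3)):
    """Return users who tweeted at least min_articles articles of source within window.

    tweets is an iterable of (user, source, article_url, tweeted_at) tuples;
    this layout is hypothetical.
    """
    articles_by_user = defaultdict(set)
    for user, tweet_source, article_url, tweeted_at in tweets:
        if tweet_source == source and now - window <= tweeted_at <= now:
            # Assumption: repeated tweets of the same article count once.
            articles_by_user[user].add(article_url)
    return {
        user
        for user, articles in articles_by_user.items()
        if len(articles) >= min_articles
    }
```

The articles posted by the returned users, together with those posted by the source itself, would then feed the distance computation mentioned on the slide.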
  • 60. Coverage bias Measure the distribution of the number of words given to each news story. Compute the 1-divergence between each pair of sources.
  • 61. Coverage bias In Twitter, coverage bias (as measured by number of tweets) is evident while selection bias is not.
  • 62. Coverage bias and partisan politics
  • 64. Future work: find patterns like this? “perusing TIME’s covers reveals countless examples of the publication tempting the world with critical events, ideas or figures, while dangling before Americans the chance to indulge in trite self-absorption” – David Harris Gershon
  • 65. Take-home messages • Encouraging results on fully unsupervised discovery – But results are quite shallow for now • It is frustratingly difficult to discover bias and framing – We are not happy with only quantifying or analyzing known conflicts
  • 67. [Diagram of three areas: Journalism needs, Data availability, Computing capabilities]
  • 68. Finding common ground is not easy. [Diagram of three areas: Journalism needs, Data availability, Computing capabilities; further labels: AI-complete problems, Poorly planned projects, Overexploited]
  • 69. Data analysis is easy, fun and addictive. Without good research questions, it is often useless. Computer science to support a key function of society = Applied computing at its best!
  • 70. Thank you! Carlos Castillo · chato@acm.org http://www.chato.cl/research/
  • 72. Shouldn't traditional news outlets resent social media? • We did not take their lunch • I am not pointing fingers but … • … online classified ads are to “blame”
  • 73. Data sample from Al Jazeera English • October 2012: ≈ 3M visits, ≈ 606 articles, ≈ 200K social media reactions • Open Source Web Analytics beacon – High-performance process (S4+Cassandra).
  • 74. News: less shared on Facebook. In-Depth: more shared on Facebook.
  • 75. Examples (mid-2012)
      – Decreasing (78%): almost all breaking news; sometimes delayed due to timezone differences, e.g. Hurricane Sandy
      – Steady or Increasing (12%): ongoing news (Obama/Romney, worker strikes in SA, Syrian unrest); articles updated with supporting content
      – Rebounding (10%): articles picked up by external sources or social media (typically single source of traffic); background articles to new developments
  • 76. Predicting traffic and shelf-life online has a long history • Predicting long-term behavior and half-life from short-term observations – Observations = comments, visits, votes, … – Behavior = total comments, total visits, … – 10+ papers specifically on web traffic • Bit.ly (2011, 2012) – Studies half-life per topic and platform
  • 77. Results (shelf-life prediction): larger improvements for In-Depth articles. Still, this is a 12-hour error in predicting something with an average of 48-72 hours.
  • 78. Social media users engaged with news • To what extent can they contribute to the journalistic process? • What kind of roles do they play? • 47% of journalists from 15 countries (n=478) said Twitter is a source of information for them
  • 79. Manual annotation • 200 users in 20 articles • Crowdsourcing workers see: – Title of news article – Profile and description of user – Sample of 10 tweets of the user
  • 80. In relation to news providers Projection in 2D of the second component of a 3-way decomposition with a 3x2x2 core of the tensor of sources x newsmakers x style. The first component separates football from basketball.
  • 81. Text pre-processing: steps • Determine paragraph boundaries – Speech change markers, heuristics based on text and time • Apply a part-of-speech tagger – Stanford NLP tagger • Find named entity mentions • Apply sentiment analysis
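The first step on slide 81 can be illustrated with the caption sample from slide 37. The sketch assumes that each caption line has the form `[unix_timestamp] TEXT` and that a leading `>>` is the speech change marker; it covers only that marker, not the text- and time-based heuristics the slide mentions, and the function name is hypothetical.

```python
import re

# Assumed line layout, based on the sample on slide 37: "[1339302660] TEXT".
CAPTION_LINE = re.compile(r"^\[(\d+)\]\s*(.*)$")


def split_on_speaker_change(lines):
    """Group caption lines into segments, opening a new one at each '>>' marker.

    Returns a list of (start_timestamp, text) tuples.
    """
    segments = []
    for line in lines:
        match = CAPTION_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, text = int(match.group(1)), match.group(2).strip()
        new_speaker = text.startswith(">>")
        if new_speaker:
            text = text[2:].strip()
        if new_speaker or not segments:
            segments.append((timestamp, text))
        else:
            start, previous = segments[-1]
            segments[-1] = (start, f"{previous} {text}".strip())
    return segments
```

On the three sample lines this yields two segments, one ending at "ASK FOR?" and one joining "THIS IS WHAT NBA" with "BASKETBALL IS ABOUT", which matches the sentence grouping shown on slide 38. Tagging, entity resolution and sentiment scoring would follow as separate steps.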
  • 82. News sources • Non-entity words • Linguistic style – Prevalence of different part-of-speech classes • Overall sentiment • Coverage • Timeliness
  • 83. News matching (model) • Target: {same story, different story} • Example features: – Dot product of aboutness scores of resolved entities in the title, body – Jaccard coefficient of unresolved entities in the title, body • Logistic regression • 4 models in total, one per genre
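The two example features on slide 83 and the logistic scoring step can be sketched as follows. The input layouts (a dict from entity to aboutness score, a set of unresolved entity strings), the function names, and any weights passed to the scoring function are illustrative assumptions; the slides do not give the trained coefficients or the full feature set.

```python
import math


def dot_product(scores_a, scores_b):
    """Dot product of two sparse entity -> aboutness-score mappings."""
    return sum(
        score * scores_b[entity]
        for entity, score in scores_a.items()
        if entity in scores_b
    )


def jaccard(set_a, set_b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def same_story_probability(features, weights, bias):
    """Logistic regression score for the 'same story' class.

    weights and bias are placeholders; in the setup described on the slide
    they would be learned, with one model per genre.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A caption segment and a web article would each contribute one such feature pair for the title and one for the body, and the resulting vector would be scored by the model for their shared genre.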