Detecting Patterns in News Media Content

Ilias Flaounas
University of Bristol
January 19, 2010

Overview
1 Introduction
2 Automating News Content Analysis
Visualisation of Mediasphere
Demo: Found In Translation
3 Findings
Detection of Biases
Predicting Popular Articles
The Structure of the EU Mediasphere
4 Conclusions

Introduction
The global media system has an important role in democracy, commerce and culture.
News outlets form a vast, complex and interconnected information system.
The system operates on a global scale, with information being generated and gathered, processed, and distributed in various ways before it reaches the final users.

Terminology
News Outlet or News-media: a source that reports news such as a
newspaper, a journal, a TV or Radio station...
News-item or article: a news piece reported in a news outlet that
refers to a specific event.
Story: a collection of news items that refer to the same event.
Mediasphere: the collective ecology of the world’s media.
Corpus: a collection of news items.
Coding: the manual annotation of news-items.

Traditional approach
Analysis of news-media content is a research domain of social scientists.
Traditional studies, however, have many limitations:
few outlets per study (< 10)
small numbers of news-items (a few hundred in the best cases)
short time periods (a few days)
news-items from a single country’s media
manual annotation (‘coding’)
reliance on commercial databases such as LexisNexis and their
constraints
research is fully hypothesis driven

Examples of traditional studies
Papers published in recent issues of the Journal of Communication:
“A total of 529 stories from NBC Nightly News and 322 stories aired
on Special Report about Iraq, and 64 and 47, respectively, about
Afghanistan were analysed by two coders”
S. Aday, “Chasing the Bad News: An Analysis of 2005 Iraq and Afghanistan War
Coverage on NBC and Fox News Channel”, J. of Com. 60, 144-164 (2010).

“Our corpus of data consisted of Channel 2’s broadcasts on the eve of
MDHH between 7:30 p.m. and midnight in the years 1994-2007 [...].
All 278 items aired on the 14 examined evenings were coded.”
O. Meyers et al. “Prime Time Commemoration: An Analysis of Television
Broadcasts on Israel’s Memorial Day for the Holocaust and the Heroism”, J. of
Com. 59, 456-480 (2009).

Example of Coding Scheme
These questionnaires have
to be completed
manually.
The same questionnaire
has to be completed by
more than one coder for
the same news items.
This is a fully hypothesis
driven research model.

But nowadays...
Most media offer their content online in a convenient form.

Research Focus
In our research we undertake a large-scale traditional news-media textual
content analysis using automated techniques.

‘Large-scale’ since we analyse hundreds of outlets, typically for
extended periods of time, involving millions of news items.

‘Traditional news-media’ since we do not focus on modern online-only
news spreading means such as blogs or Twitter.

‘Textual’ since we use only the textual information of news rather
than analysing e.g. images, videos, or speech.

‘Automated’ in the sense that the analysis is performed by applying
Artificial Intelligence techniques rather than using human ‘coders’.

Relevant Work & Datasets
Europe Media Monitor (EMM)
‘Lydia’ system
Newsblaster
NewsInEssence
Google News, Yahoo! News
LexisNexis
Public Corpora: Reuters, New York Times

We are highly interested in studying the media system per se.


Automating News Content Analysis
Some automation is possible using AI approaches:
◮ Machine Learning
◮ Data Mining
◮ Natural Language Processing
Some questions about the media system can be answered for the first
time.
We apply methods that work efficiently and reliably on large-scale
data.
We can attempt a data-driven research model.

Methods Summary
RSS parsing, Web page content scraping
Text Preprocessing: Stemming, stop-words removal, TF-IDF, ...
Tagging: Support Vector Machines.
Clustering: Best Reciprocal Hit
Ranking: SVM rank
Words selection: Lasso, SVMs
Network reconstruction: χ2-test ...
NLP: Statistical Machine Translation, Sentiment Analysis, Readability
Data Visualisation: Multidimensional Scaling, Spring Embedding, ...
Statistics: correlations, significance tests...

Building & Annotating the Corpus
Our corpus in numbers:
> 1300 multilingual news sources
> 3000 news feeds
133 countries
22 languages
> 3 years of continuous monitoring
40K news items per day
> 30M news items in total


NOAM: News Outlets Analysis & Monitoring system
NOAM enables us to query the corpus at a semantic level.

Statistical Machine Translation
We apply a phrase-based Statistical Machine Translation (SMT)
approach to translate the non-English articles into English.
We use Moses, a complete phrase-based translation toolkit for
academic purposes.
We translate all non-English articles in 21 EU languages into English.
For each language pair, an instance of Moses is trained using Europarl
data and the JRC-Acquis Multilingual Parallel Corpus.
We make the working assumption that SMT does not significantly alter
the geometry of the news corpus in the vector-space
representation.
Acknowledgements to Marco Turchi for implementing the SMT module.

Articles Per Day
[Figure: number of articles per day (y-axis, roughly 0.6 × 10^4 to 2.2 × 10^4) against days since June 1st, 2009 (x-axis, 0 to 250).]
We observe:
A seven-day cycle
Local minima at weekends.

Clustering Articles
The Best Reciprocal Hit method:
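
The figure that illustrated the method is not reproduced here. As a rough sketch, assuming that a best reciprocal hit links an article from one outlet to an article from another when each is the other's most similar article, the idea could look as follows. The cosine similarity on raw word counts, the helper names and the toy headlines are all illustrative assumptions, not details from the talk.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words counts for a lower-cased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_hit(query, candidates):
    """Index of the candidate most similar to the query (first one on ties)."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_reciprocal_hits(outlet_a, outlet_b):
    """Pairs (i, j) where article i of A and article j of B pick each other."""
    if not outlet_a or not outlet_b:
        return []
    pairs = []
    for i, article in enumerate(outlet_a):
        j = best_hit(article, outlet_b)
        # Require some overlap, and require the hit to be mutual.
        if cosine(article, outlet_b[j]) > 0 and best_hit(outlet_b[j], outlet_a) == i:
            pairs.append((i, j))
    return pairs


# Hypothetical headlines from two outlets.
outlet_a = [bag("volcano ash grounds flights across europe"),
            bag("election results announced in capital")]
outlet_b = [bag("final election results announced"),
            bag("flights grounded as volcano ash spreads"),
            bag("new fashion week opens")]
pairs = best_reciprocal_hits(outlet_a, outlet_b)
```

Linked pairs can then be merged across outlets to form stories; how that merging was done in the talk is not stated.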

Outlets Per Story
[Figure: log-log plot of the number of stories (y-axis, 10^1 to 10^6) against the number of outlets per story (x-axis, 10^0 to 10^2).]
Few stories are covered by lots of media, and lots of stories are covered by
few media.


The Global Mediasphere
543 nodes, 4783 edges, colour by country

Support Vector Machines as Topic Taggers
We train on data from two well accepted corpora:
◮ Reuters
◮ NY Times
Typical text preprocessing: Stemming, stop-words removal,
bag-of-words (TF-IDF) representation...
Two-class SVMs
Cosine kernel
Maximize F0.5-Score on unseen data
Train to recognise 14 interesting news topics
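
To make the representation concrete, here is a minimal sketch of TF-IDF vectors and the cosine kernel an SVM would compute between two articles. The weighting tf · log(N/df) is one common variant and is an assumption, since the slides do not say which one was used; stemming and stop-word removal are omitted, and the three toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) for a list of tokenised documents.

    Assumed weighting: term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine_kernel(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus: two sports-like documents and one markets-like document.
docs = [
    "team wins cup final".split(),
    "team loses cup match".split(),
    "shares fall on market".split(),
]
vectors = tfidf_vectors(docs)
k_sport = cosine_kernel(vectors[0], vectors[1])   # shares "team" and "cup"
k_cross = cosine_kernel(vectors[0], vectors[2])   # no shared terms
```

A two-class SVM per topic would then be trained on the matrix of such kernel values.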

SVM Taggers
    Topic             Corpus    F0.5-Score  Precision  Recall
 1  SPORTS            Reuters   97.78       98.31      95.75
 2  MARKETS           Reuters   92.02       94.09      84.63
 3  FASHION           Reuters   83.88       94.61      71.27
 4  DISASTERS         Reuters   83.4        87.69      70.34
 5  ART               NY Times  81.67       84.9       71.38
 6  BUSINESS          NY Times  81.16       86.23      65.87
 7  INFLATION-PRICES  Reuters   77.01       81.45      63.38
 8  RELIGION          NY Times  74.95       83.57      53.59
 9  POLITICS          NY Times  73.81       76.65      64.81
10  SCIENCE           Reuters   73.63       83.72      50.62
11  WEATHER           Reuters   71.43       82.91      46.84
12  PETROLEUM         Reuters   70.67       75.14      58.73
13  ELECTIONS         Reuters   70.32       78.99      49.32
14  ENVIRONMENT       NY Times  64.29       73.48      43.7
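
For reference, the F0.5 score combines precision and recall while weighting precision more heavily. The sketch below uses the standard F-beta definition; applying it to a single precision/recall pair from the table gives a value close to the reported score, though not necessarily identical to it.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall of the SPORTS tagger, in percent, from the table.
sports_f05 = f_beta(98.31, 95.75)

# With beta = 0.5, high precision is rewarded more than high recall.
precise_tagger = f_beta(0.9, 0.5)
greedy_tagger = f_beta(0.5, 0.9)
```

Optimising F0.5 therefore favours taggers that rarely mislabel an article, at the cost of missing some on-topic ones.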

We implemented a demo to showcase the state of the art in various
disciplines of modern Artificial Intelligence.
We compare the EU countries according to the topics their media
choose to cover.
Every day we machine-translate the content of 640 EU media outlets
Annotate it using SVMs
Compare the media content of EU countries based on their Top-10 media

http://foundintranslation.enm.bris.ac.uk


Detection of Biases
The goal is to measure typical biases among different topics as they are
presented in the media:
Readability
Linguistic Subjectivity
Popularity
Gender Bias
Corpus:
500 English-language media
10 months (Jan. 1st, 2010 – Oct. 31st, 2010)
2.5M articles that appeared in the main feed

Readability
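We measure readability based
on the Flesch Reading Ease
Test (FRET).
The higher the FRET score, the
easier the text is to read.
Scores typically range from 0 to 100.
10K items per topic
FRET(article) = 206.835 − (1.015 · ASL) − (84.6 · ASW)
where ASL is the average sentence length in words and ASW is the average number of syllables per word.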
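
As an illustration of the formula, here is a minimal sketch that scores a text. The sentence splitter and the vowel-group syllable counter are crude heuristics chosen for brevity; they are assumptions, not the tools used in the study.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """206.835 - 1.015 * ASL - 84.6 * ASW for the given text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                            # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)    # syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Intergovernmental negotiations necessitated considerable "
    "institutional accommodation."
)
```

Very simple or very dense text can fall outside the usual 0 to 100 range, as both toy examples here do.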

Linguistic Subjectivity
We measure the
percentage of sentimental
adjectives over the total
number of adjectives.
Adjectives are detected by the
Stanford POS tagger.
For each adjective we check
for a SentiWordNet sentiment
score > 0.25.
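
The measure itself is a simple ratio once tagging and lexicon lookup are done. The sketch below takes already-tagged tokens (Penn Treebank tags, where adjectives start with JJ) and a word-to-score dictionary standing in for SentiWordNet; the function name, the toy lexicon and the way a single score per adjective is obtained are assumptions.

```python
def linguistic_subjectivity(tagged_tokens, sentiment_scores, threshold=0.25):
    """Fraction of adjectives whose sentiment score exceeds the threshold.

    tagged_tokens: list of (word, pos_tag) pairs, Penn Treebank tag set.
    sentiment_scores: dict word -> score in [0, 1] (stand-in for a lexicon).
    """
    adjectives = [word.lower() for word, tag in tagged_tokens if tag.startswith("JJ")]
    if not adjectives:
        return 0.0
    sentimental = [w for w in adjectives if sentiment_scores.get(w, 0.0) > threshold]
    return len(sentimental) / len(adjectives)


# Hypothetical tagged sentence and toy lexicon.
tagged = [("The", "DT"), ("terrible", "JJ"), ("storm", "NN"), ("hit", "VBD"),
          ("the", "DT"), ("northern", "JJ"), ("coast", "NN")]
toy_lexicon = {"terrible": 0.625, "northern": 0.0}
subjectivity = linguistic_subjectivity(tagged, toy_lexicon)
```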

Validation of Linguistic Subjectivity?
This is a challenge due to the lack of a gold standard.

A subset of leading news-outlets, ranked by their linguistic subjectivity.
Rank  Outlet
1     BBC
2     Times
3     NY Times
4     The Guardian
5     CBS
6     Daily Telegraph
7     Daily Star (T)
8     Independent
9     Daily Mail (T)
10    Daily Mirror (T)
11    Newsweek
12    The Sun (T)

Popularity
We measure the conditional
probability of an article to
become popular given its
topic.
We track 16 English language
outlets that provide a “Most
popular” feed.
◮ In total 108,516 articles
were popular.
◮ 36,788 articles were
popular and appeared in
the main feed.
P(Pop | Topic) = P(Topic | Pop) · P(Pop) / P(Topic)
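
A small worked example of this Bayes computation from raw counts follows. The counts are invented for illustration, since the slides do not give per-topic figures.

```python
def p_pop_given_topic(n_topic_and_pop, n_pop, n_topic, n_total):
    """P(Pop | Topic) via Bayes' rule, estimated from article counts."""
    p_topic_given_pop = n_topic_and_pop / n_pop
    p_pop = n_pop / n_total
    p_topic = n_topic / n_total
    return p_topic_given_pop * p_pop / p_topic


# Hypothetical counts: 1000 articles, 50 popular, 100 on the topic,
# 10 both popular and on the topic.
probability = p_pop_given_topic(n_topic_and_pop=10, n_pop=50,
                                n_topic=100, n_total=1000)
```

With counts from a single corpus this reduces to the share of on-topic articles that became popular (10 out of 100 here).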

Scatter plot of Topics
[Figure: pairwise scatter plots of the 14 topics (ART, BUSINESS, DISASTERS, ELECTIONS, ENVIRONMENT, FASHION, MARKETS, PETROLEUM, POLITICS, PRICES, RELIGION, SCIENCE, SPORTS, WEATHER) over four measures: Popularity, Readability, Linguistic Subjectivity and Gender Bias.]

Scatter plot of Outlets
[Figure: pairwise scatter plots of 12 outlets (BBC, CBS, Daily Mail, Daily Mirror, Daily Star, Daily Telegraph, Independent, Newsweek, NY Times, The Guardian, The Sun, Times); the surviving axis labels are Readability and Gender Bias.]


Editors want to know what stories their readers would like to read. Can we
predict which stories will become popular?


Modelling this question as
a simple binary classification problem leads to very low performance;
a ranking problem can lead to promising results. This is because
popularity is a relative concept and not an absolute one.

Predicting the Popular Articles
Month-by-month predictions, using Ranking SVM.
6 months of data.
Accuracy is the fraction of positive/negative (popular/non-popular) pairs of articles that are ranked in the correct order.
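
This pairwise notion of accuracy can be sketched as follows. The scores would come from a trained ranker such as a Ranking SVM; here they are made-up numbers, and counting ties as errors is an assumption.

```python
def pairwise_accuracy(scores_popular, scores_other):
    """Share of (popular, non-popular) pairs where the popular article scores higher."""
    pairs = [(p, n) for p in scores_popular for n in scores_other]
    if not pairs:
        raise ValueError("need at least one popular and one non-popular article")
    correct = sum(1 for p, n in pairs if p > n)   # ties count as errors
    return correct / len(pairs)


# Hypothetical ranker scores for two popular and three non-popular articles.
accuracy = pairwise_accuracy([0.9, 0.4], [0.5, 0.1, 0.2])
```

A random ranker would score about 0.5 under this measure, which makes it a natural baseline.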

Most Popular Articles
Titles of most popular articles per outlet as ranked using Ranking SVMs
for December 2009.
Outlet               Titles of Top-3 Articles
CBS                  Sources: Elin Done with Tiger
                     Tiger Woods Slapped with Ticket for Crash
                     Tiger Woods: I let my Family Down
Florida Times-Union  Pizza delivery woman killed on Westside
                     A family’s search for justice, 15 years later
                     Rants & Raves: Napolitano unqualified
NY Times             Poor Children Likelier to Get Antipsychotics
                     Surf’s Up, Way Up, and Competitors Let Out a Big Mahalo
                     Grandma’s Gifts Need Extra Reindeer
Reuters              Dubai says not responsible for Dubai World debt
                     Boeing Dreamliner touches down after first flight
                     Iran’s Ahmadinejad mocks Obama, “TV series” nuke talks
Seattle Post         Hospital: Actress Brittany Murphy dies at age 32
                     Actor Charlie Sheen arrested in Colorado
                     Charlie Sheen accused of using weapon in Aspen


EU Mediasphere
Top-10 media outlets per country
over the 27 EU countries
in 22 different languages
for a 6-month period
A total of 1.3M news items.

What patterns can we find using modern AI techniques?

The EU Mediasphere
Co-coverage network: We link two outlets if they share more stories than
expected by chance (χ² scores).

This network has 203 nodes and 6702 edges.
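
One way to score a pair of outlets, sketched here under the assumption of a 2×2 contingency table over stories (covered by both, by one only, or by neither), is the standard χ² statistic for independence. The threshold of 3.841 (p = 0.05 at one degree of freedom) and the story counts are illustrative; the talk does not state the threshold it used.

```python
def chi2_score(both, only_a, only_b, neither):
    """Chi-square statistic of a 2x2 story co-coverage table."""
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denominator = row_a * row_not_a * col_b * col_not_b
    if denominator == 0:
        return 0.0
    return n * (both * neither - only_a * only_b) ** 2 / denominator


def share_more_than_chance(both, only_a, only_b, neither, threshold=3.841):
    """True if co-coverage exceeds its expected value and the score is large."""
    n = both + only_a + only_b + neither
    expected_both = (both + only_a) * (both + only_b) / n
    return both > expected_both and chi2_score(both, only_a, only_b, neither) > threshold


# Hypothetical counts over 1000 stories.
linked = share_more_than_chance(both=30, only_a=20, only_b=10, neither=940)
unlinked = share_more_than_chance(both=2, only_a=48, only_b=38, neither=912)
```

Raising the threshold removes weaker edges, which is how a sparser version of the network can be obtained.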

A bit sparser....
This network has 197 nodes, 3386 edges and 3 connected components.
Singleton nodes are omitted.

What kind of connected components are formed?

We make the network as sparse as possible, using modularity
maximization as the stopping criterion.
The probability of two non-singleton nodes from the same country to
end up in the same connected component is 82.9% (p < 0.001).

Nationality is the major underlying criterion of what stories media
outlets choose to publish.
We will therefore work at the country level rather than the outlet level.


Which are the strongest connections between countries?
We go as sparse as possible while keeping the network connected.
This network has 27 nodes and 112 edges.

Can we explain relations of countries?

We found a significant (p < 0.001) correlation between countries’ media-content
similarity and their:
Geographical proximity, based on shared borders: 33.86%
Economic proximity, based on trade volume: 31.03%
Cultural proximity, based on Eurovision song contest voting patterns: 32.05%
UK Metro, Dec. 8, 2010: Countries that always vote for each other in the
Eurovision song contest, have a shared interest in news content, as well as
terrible music, a study has shown [...]

How ‘close’ are countries, based on common media
interests?
We use χ² scores as similarities and project countries onto a 2D plane using
Multidimensional Scaling.

We colour the Eurozone members in blue.
These countries are closer to the centre, that is, the average EU media content.

Ranking of countries
Based on their deviation from average EU media content (in 26D space).

Rank  Country         Eurozone  Accession year
1     France          Y         1957
2     Austria         Y         1995
3     Germany         Y         1957
4     Greece          Y         1981
5     Ireland         Y         1973
6     Cyprus          Y         2004
7     Slovenia        Y         2004
8     Spain           Y         1986
9     Slovakia        Y         2004
10    Italy           Y         1957
11    Belgium         Y         1957
12    Luxembourg      Y         1957
13    Bulgaria        N         2007
14    Netherlands     Y         1957
15    U. Kingdom      N         1973
16    Finland         Y         1995
17    Sweden          N         1995
18    Poland          N         2004
19    Estonia         N         2004
20    Denmark         N         1973
21    Portugal        Y         1986
22    Malta           Y         2004
23    Czech Republic  N         2004
24    Romania         N         2007
25    Latvia          N         2004
26    Hungary         N         2004
27    Lithuania       N         2004
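
A ranking of this kind can be sketched as distance from the centroid of the country vectors. The sketch assumes Euclidean distance and that rank 1 means the smallest deviation; neither is stated on the slide, and the two-dimensional toy vectors for countries "A", "B" and "C" are invented.

```python
import math


def rank_by_deviation(vectors):
    """Country names sorted by Euclidean distance from the mean vector."""
    names = list(vectors)
    dim = len(vectors[names[0]])
    centroid = [sum(vectors[name][k] for name in names) / len(names)
                for k in range(dim)]
    deviation = {name: math.dist(vectors[name], centroid) for name in names}
    return sorted(names, key=deviation.get)   # smallest deviation first


# Hypothetical 2D positions; the centroid is (2, 0).
toy_positions = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
ranking = rank_by_deviation(toy_positions)
```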

Any other important factors?
Correlations of countries’ deviation from average EU media content.
Factor              Correlation (%)  p-value
In Eurozone         70.65            <0.001
Accession Year      -49.32           0.009
GDP 2008            44.75            0.020
Population          23.05            0.247
Area                15.63            0.435
Population Density  7.45             0.712
The first three factors are significant (p < 0.05), while the rest are not.
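
For readers who want to reproduce this kind of table on their own data, here is a minimal sketch of a correlation coefficient between a per-country factor and a per-country score. Pearson correlation is an assumption (the slide does not name the coefficient), the p-value computation is left out, and the numbers are toy values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Toy data: a binary membership flag and a made-up per-country score.
in_eurozone = [1, 1, 0, 0]
toy_score = [0.9, 0.8, 0.3, 0.2]
r = pearson(in_eurozone, toy_score)
```

A binary factor such as Eurozone membership can be fed in as 0/1 values, as above.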

Discussion
EU media editors independently made a multitude of small editorial
decisions, which shaped the contents of the EU mediasphere in a way
that reflects its deep geographic, economic and cultural relations.
Detecting these subtle signals in a statistically rigorous way would be
out of the reach of traditional methods of social scientists.
This analysis demonstrates the power of the available methods for
significant automation of media content analysis.

Conclusions
Several tasks can be automated using modern AI techniques.
◮ However, the high-precision, sophisticated annotation that human
‘coders’ can achieve is not yet reachable.

Research can be conducted across multiple languages/countries.

A study can run for a long period of time, involving millions
of articles.

We can tackle questions that could not be answered previously.
◮ e.g. Which country’s media have the most negative coverage of
environmental news that also mention Barack Obama?

In the social sciences, the analysis of news media is done largely by
hand in a hypothesis-driven fashion. Is it time for social sciences to
also adopt a data-driven research model?

Future Work
Ideas under development:
Use of features such as images / audio / video.
How does SMT affect supervised/unsupervised learning?
Compare the US and the EU mediaspheres
...

Work from other members of the group
Suffix Tree - Detection of memes
Named Entities detection & disambiguation
Twitter - Events detection
Summarisation of news
Online learning algorithms for news annotation

References 1
I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J.
Lewis, N. Cristianini: “The Structure of the EU Mediasphere”, PLoS
ONE, Vol. 5(12), pp. e14243, 2010.
I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and
Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol.
5782(1), pp. 344–358, Bled, Slovenia, 2009.
O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini,
“Automating News Content Analysis: An Application to Gender Bias
and Readability”, JMLR W & CP: Workshop on Applications of
Pattern Analysis (WAPA), Vol.11, pp. 36–43, Windsor, UK, 2010.
M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N.
Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS,
Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009.

References 2
I. Flaounas, N. Fyson and N. Cristianini: “Predicting Relations in
News-Media Content among EU Countries”, 2nd International
Workshop on Cognitive Information Processing (CIP), IEEE, pp.
269–274, Elba, Italy, 2010.
E. Hensinger, I. Flaounas and N. Cristianini: “Learning the
Preferences of News Readers with SVM and Lasso Ranking”,
Artificial Intelligence Applications and Innovations, Springer, pp.
179–186, Larnaca, Cyprus, 2010.
T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting
events in a million New York Times articles”, ECML/PKDD,
Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010.
I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns
in the European Mediasphere”, IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology, pp.
527–530, Milano, Italy, 2009.

More info at: http://mediapatterns.enm.bris.ac.uk
Thank you!
