Detecting Patterns in News Media Content
Ilias Flaounas
University of Bristol
January 19, 2010
I. Flaounas (University of Bristol) January 19, 2010 1 / 57
Overview
1 Introduction
2 Automating News Content Analysis
Visualisation of Mediasphere
Demo: Found In Translation
3 Findings
Detection of Biases
Predicting Popular Articles
The Structure of the EU Mediasphere
4 Conclusions
Introduction
The global media system has an
important role in democracy,
commerce and culture.
News outlets form a vast,
complex and interconnected
information system.
The system operates on a global
scale, with information being
generated and gathered,
processed, and distributed in
various ways before it reaches
the final users.
Terminology
News Outlet or News-media: a source that reports news such as a
newspaper, a journal, a TV or Radio station...
News-item or article: a news piece reported in a news outlet that
refers to a specific event.
Story: a collection of news items that refer to the same event.
Mediasphere: the collective ecology of the world’s media.
Corpus: a collection of news items.
Coding: the manual annotation of news-items.
Traditional approach
Analysis of news-media content has traditionally been a research domain of social scientists.
Traditional studies, however, have many limitations:
few outlets per study (< 10)
small numbers of news-items (a few hundred in the best cases)
short time periods (a few days)
news-items from a single country’s media
manual annotation – ‘coding’
reliance on commercial databases such as LexisNexis and their
constraints
research is fully hypothesis-driven
Examples of traditional studies
Papers published in recent issues of the Journal of Communication:
“A total of 529 stories from NBC Nightly News and 322 stories aired
on Special Report about Iraq, and 64 and 47, respectively, about
Afghanistan were analysed by two coders”
S. Aday, “Chasing the Bad News: An Analysis of 2005 Iraq and Afghanistan War
Coverage on NBC and Fox News Channel”, J. of Com. 60, 144-164 (2010).
“Our corpus of data consisted of Channel 2’s broadcasts on the eve of
MDHH between 7:30 p.m. and midnight in the years 1994-2007[...].
All 278 items aired on the 14 examined evenings were coded.”
O. Meyers et al. “Prime Time Commemoration: An Analysis of Television
Broadcasts on Israel’s Memorial Day for the Holocaust and the Heroism”, J. of
Com. 59, 456-480 (2009).
Example of Coding Scheme
These questionnaires have
to be completed
manually.
The same questionnaire
has to be completed by
more than one coder for
the same news items.
This is a fully hypothesis
driven research model.
But nowadays...
Most media offer their content online in a convenient form.
Research Focus
In our research we undertake a large-scale traditional news-media textual
content analysis using automated techniques.
‘Large-scale’ since we analyse hundreds of outlets, typically for
extended periods of time, involving millions of news items.
‘Traditional news-media’ since we do not focus on modern online-only
news spreading means such as blogs or Twitter.
‘Textual’ since we use only the textual information of news rather
than analysing e.g. images, videos, or speech.
‘Automated’ in the sense that the analysis is performed by applying
Artificial Intelligence techniques rather than using human ‘coders’.
Relevant Work & Datasets
Europe Media Monitor (EMM)
‘Lydia’ system
Newsblaster
NewsInEssence
Google News, Yahoo! News
LexisNexis
Public Corpora: Reuters, New York Times
We are highly interested in studying the media system per se.
Automating News Content Analysis
Some automation is possible using AI approaches:
◮ Machine Learning
◮ Data Mining
◮ Natural Language Processing
Some questions about the media system can be answered for the first
time.
We apply methods that work efficiently and reliably on large-scale
data.
We can attempt a data-driven research model.
Methods Summary
RSS parsing, Web page content scraping
Text Preprocessing: Stemming, stop-words removal, TF-IDF, ...
Tagging: Support Vector Machines.
Clustering: Best Reciprocal Hit
Ranking: SVM rank
Words selection: Lasso, SVMs
Network reconstruction: χ²-test ...
NLP: Statistical Machine Translation, Sentiment Analysis, Readability
Data Visualisation: Multidimensional Scaling, Spring Embedding, ...
Statistics: correlations, significance tests...
Building & Annotating the Corpus
Our corpus in numbers:
> 1300 multilingual news sources
> 3000 news feeds
133 countries
22 languages
> 3 years of continuous monitoring
40K news items per day
> 30M news items in total
NOAM: News Outlets Analysis & Monitoring system
NOAM enables us to query the corpus at a semantic level.
Statistical Machine Translation¹
We apply a phrase-based Statistical Machine Translation (SMT)
approach to translate the non-English articles into English.
We use Moses, a complete phrase-based translation toolkit for
academic purposes.
We translate all non-English articles, written in 21 EU languages, into English.
For each language pair, an instance of Moses is trained using Europarl
data and the JRC-Acquis Multilingual Parallel Corpus.
We make the working assumption that SMT does not significantly
alter the geometry of the news corpus in the vector-space
representation.
¹ Acknowledgements to Marco Turchi for implementing the SMT module.
Articles Per Day
[Figure: number of articles per day (y-axis, ×10^4) against days since June 1st, 2009 (x-axis, 0 to 250).]
We observe:
A seven-day cycle
Local minima during weekends.
Clustering Articles
The Best Reciprocal Hit method:
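A minimal sketch of the Best Reciprocal Hit idea, under stated assumptions: two articles from different outlets are paired only if each is the other's most similar article. The whitespace tokenisation, the raw bag-of-words cosine similarity and the toy headlines are illustrative choices and are not taken from the slides. Both outlets are assumed to contain at least one article.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_hit(query: Counter, candidates: list[Counter]) -> int:
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def best_reciprocal_hits(outlet_a: list[str], outlet_b: list[str]) -> list[tuple[int, int]]:
    """Pairs (i, j) where article i of A and article j of B pick each other."""
    vecs_a = [Counter(text.lower().split()) for text in outlet_a]
    vecs_b = [Counter(text.lower().split()) for text in outlet_b]
    pairs = []
    for i, vec in enumerate(vecs_a):
        j = best_hit(vec, vecs_b)
        # Require a non-zero similarity and a reciprocal best hit.
        if cosine(vec, vecs_b[j]) > 0 and best_hit(vecs_b[j], vecs_a) == i:
            pairs.append((i, j))
    return pairs


# Hypothetical headlines from two outlets.
outlet_a = ["volcano eruption grounds flights", "election results announced tonight"]
outlet_b = ["election results expected tonight",
            "flights grounded after volcano eruption",
            "new phone released"]
pairs = best_reciprocal_hits(outlet_a, outlet_b)
print(pairs)
```

Articles linked this way across many outlets can then be grouped into stories; the third headline of the second outlet stays unpaired because no article selects it as its best hit.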
Outlets Per Story
[Figure: log-log plot of the number of stories (y-axis, 10^1 to 10^6) against the number of outlets per story (x-axis, 10^0 to 10^2).]
Few stories are covered by lots of media, and lots of stories are covered by
few media.
The Global Mediasphere
543 nodes, 4783 edges, colour by country
Support Vector Machines as Topic Taggers
We train on data from two well accepted corpora:
◮ Reuters
◮ NY Times
Typical text preprocessing: Stemming, stop-words removal,
bag-of-words (TF-IDF) representation...
Two-class SVMs
Cosine kernel
Maximize F0.5-Score on unseen data
Train to recognise 14 interesting news topics
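As a reminder of what the F0.5 criterion rewards, the sketch below computes the general F-beta score from a precision and a recall value. With beta = 0.5, precision is weighted more heavily than recall. The input values are hypothetical and are not taken from the taggers described here.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (beta < 1 favours precision)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical tagger with high precision and moderate recall.
f_half = f_beta(0.9, 0.6)           # beta = 0.5
f_one = f_beta(0.9, 0.6, beta=1.0)  # the usual F1 score, for comparison
print(round(f_half, 4), round(f_one, 4))
```

For a tagger whose precision exceeds its recall, F0.5 is higher than F1, which suits a setting where wrongly tagged articles are more costly than missed ones.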
SVM Taggers
#   Topic             Corpus    F0.5-Score (%)  Precision (%)  Recall (%)
1   SPORTS            Reuters   97.78           98.31          95.75
2   MARKETS           Reuters   92.02           94.09          84.63
3   FASHION           Reuters   83.88           94.61          71.27
4   DISASTERS         Reuters   83.40           87.69          70.34
5   ART               NY Times  81.67           84.90          71.38
6   BUSINESS          NY Times  81.16           86.23          65.87
7   INFLATION-PRICES  Reuters   77.01           81.45          63.38
8   RELIGION          NY Times  74.95           83.57          53.59
9   POLITICS          NY Times  73.81           76.65          64.81
10  SCIENCE           Reuters   73.63           83.72          50.62
11  WEATHER           Reuters   71.43           82.91          46.84
12  PETROLEUM         Reuters   70.67           75.14          58.73
13  ELECTIONS         Reuters   70.32           78.99          49.32
14  ENVIRONMENT       NY Times  64.29           73.48          43.70
Demo: Found In Translation
We implemented a demo to demonstrate the state of the art in various
disciplines of modern Artificial Intelligence.
We compare the EU countries according to what topics their media
choose to cover.
Every day we machine-translate the content of 640 EU media outlets
Annotate it using SVMs
Compare EU countries’ media content based on each country’s Top-10 media outlets
http://foundintranslation.enm.bris.ac.uk
Detection of Biases
The goal is to measure typical biases among different topics as they are
presented in the media:
Readability
Linguistic Subjectivity
Popularity
Gender Bias
Corpus:
500 English-language media
10 months (Jan. 1st, 2010 – Oct. 31st, 2010)
2.5M articles that appeared in the main feed
Readability
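We measure readability based
on the Flesch Reading Ease
Test (FRET).
The higher the FRET score, the
easier the text is to read.
Scores typically range from 0 to 100.
10K items per topic
FRET(article) = 206.835 − (1.015 · ASL) − (84.6 · ASW)
where ASL is the average sentence length in words and ASW is the average number of syllables per word.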
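The Flesch Reading Ease formula can be sketched in a few lines. The sentence splitter and the vowel-group syllable counter below are crude heuristics assumed for illustration only; a real system would use a proper tokeniser and syllable counter, and the slides do not say which were used.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * ASL - 84.6 * ASW, with heuristic sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    asl = len(words) / len(sentences)                          # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


# Hypothetical example texts.
easy = flesch_reading_ease("The cat sat on the mat. The dog ran to the park.")
hard = flesch_reading_ease(
    "Intergovernmental negotiations regarding macroeconomic "
    "stabilisation continued unsuccessfully."
)
print(easy > hard)
```

The raw formula is not clipped, so very simple text can score above 100 and very dense text below 0.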
Linguistic Subjectivity
We measure the
percentage of sentiment-bearing
adjectives over the total
number of adjectives.
Adjectives are detected with the
Stanford POS tagger.
For each adjective we check
whether it has a
SentiWordNet sentiment
score > 0.25.
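The measure above can be sketched as follows, assuming the text has already been POS-tagged with Penn Treebank tags (JJ, JJR, JJS mark adjectives). The toy lexicon is hypothetical and stands in for SentiWordNet. Treating an adjective as sentiment-bearing when either its positive or its negative score exceeds 0.25 is an assumption, since the slide does not say which score is thresholded.

```python
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}  # Penn Treebank adjective tags
THRESHOLD = 0.25                       # threshold quoted on the slide

# Hypothetical toy lexicon: adjective -> (positive score, negative score).
TOY_LEXICON = {
    "terrible": (0.0, 0.625),
    "wonderful": (0.75, 0.0),
    "annual": (0.0, 0.0),
    "local": (0.0, 0.0),
}


def is_sentimental(adjective, lexicon):
    """Assumption: either polarity score above the threshold counts."""
    pos, neg = lexicon.get(adjective.lower(), (0.0, 0.0))
    return max(pos, neg) > THRESHOLD


def linguistic_subjectivity(tagged_tokens, lexicon):
    """Fraction of adjectives that carry sentiment; 0.0 if there are no adjectives."""
    adjectives = [word for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]
    if not adjectives:
        return 0.0
    sentimental = sum(is_sentimental(word, lexicon) for word in adjectives)
    return sentimental / len(adjectives)


# Hypothetical pre-tagged sentence.
tagged = [("The", "DT"), ("annual", "JJ"), ("festival", "NN"), ("was", "VBD"),
          ("wonderful", "JJ"), ("despite", "IN"), ("terrible", "JJ"),
          ("local", "JJ"), ("weather", "NN")]
score = linguistic_subjectivity(tagged, TOY_LEXICON)
print(score)
```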
Validation of Linguistic Subjectivity?
Validation is a challenge due to the lack of a gold standard.
A subset of leading news outlets, ranked by their linguistic subjectivity:
Rank  Outlet
1     BBC
2     Times
3     NY Times
4     The Guardian
5     CBS
6     Daily Telegraph
7     Daily Star (T)
8     Independent
9     Daily Mail (T)
10    Daily Mirror (T)
11    Newsweek
12    The Sun (T)
Popularity
We measure the conditional
probability that an article
becomes popular given its
topic.
We track 16 English language
outlets that provide a “Most
popular” feed.
◮ In total 108,516 articles
were popular.
◮ 36,788 articles were
popular and appeared in
the main feed.
P(Pop | Topic) = P(Topic | Pop) · P(Pop) / P(Topic)
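A small sketch of how this conditional probability could be estimated from counts, applying Bayes' rule term by term. The topic labels and article counts are hypothetical. The popular articles are assumed to be a subset of the main-feed articles, and at least one article must be popular.

```python
from collections import Counter


def popularity_given_topic(all_topics, popular_topics):
    """Return P(Pop | Topic) for each topic via Bayes' rule.

    all_topics:     topic label of every article in the main feed
    popular_topics: topic label of every article that also became popular
    """
    n_all = len(all_topics)
    n_pop = len(popular_topics)
    p_pop = n_pop / n_all
    topic_counts = Counter(all_topics)
    pop_counts = Counter(popular_topics)
    result = {}
    for topic, count in topic_counts.items():
        p_topic = count / n_all
        p_topic_given_pop = pop_counts[topic] / n_pop
        result[topic] = p_topic_given_pop * p_pop / p_topic
    return result


# Hypothetical toy corpus: 10 articles, 4 of which became popular.
all_topics = ["SPORTS"] * 6 + ["POLITICS"] * 4
popular_topics = ["SPORTS"] * 3 + ["POLITICS"]
probabilities = popularity_given_topic(all_topics, popular_topics)
print(probabilities)
```

On such data the result equals the direct ratio of popular to total articles per topic; the Bayes form is useful when the popular feed and the main feed are counted separately.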
Scatter plot of Topics
[Figure: pairwise scatter plots of the 14 topics (ART, BUSINESS, DISASTERS, ELECTIONS, ENVIRONMENT, FASHION, MARKETS, PETROLEUM, POLITICS, PRICES, RELIGION, SCIENCE, SPORTS, WEATHER), one panel for every pair of the four measures: Popularity, Readability, Linguistic Subjectivity and Gender Bias.]
Scatter plot of Outlets
[Figure: Gender Bias (y-axis) against Linguistic Subjectivity (x-axis) for 12 outlets: BBC, CBS, Daily Mail, Daily Mirror, Daily Star, Daily Telegraph, Independent, Newsweek, NY Times, The Guardian, The Sun, Times.]
[Figure: Gender Bias (y-axis) against Readability (x-axis) for the same 12 outlets.]
[Figure: Linguistic Subjectivity (y-axis) against Readability (x-axis) for the same 12 outlets.]
Predicting Popular Articles
Editors want to know what stories their readers would like to read. Can we
predict which stories will become popular?
Modelling this question as
a simple binary classification problem leads to very low performance;
a ranking problem can lead to promising results. This is because
popularity is a relative concept and not an absolute one.
Predicting the Popular Articles
Month-by-month predictions, using Ranking SVM.
6 months of data.
Accuracy is the fraction of popular/non-popular article pairs that are ranked in the correct order.
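The pairwise accuracy used here can be sketched as follows. The scores stand for the output of some ranking model and are hypothetical. Forming every possible popular/non-popular pair is an assumption, as the slides do not say how pairs were selected.

```python
from itertools import product


def pairwise_accuracy(popular_scores, other_scores):
    """Fraction of (popular, non-popular) pairs where the popular article scores higher."""
    pairs = list(product(popular_scores, other_scores))
    if not pairs:
        raise ValueError("need at least one article of each kind")
    correct = sum(1 for pos, neg in pairs if pos > neg)
    return correct / len(pairs)


# Hypothetical ranking scores for two popular and three non-popular articles.
accuracy = pairwise_accuracy([0.9, 0.4], [0.5, 0.1, 0.3])
print(accuracy)
```

A random scorer would orient about half of the pairs correctly, so 0.5 is the natural baseline for this measure.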
Most Popular Articles
Titles of most popular articles per outlet as ranked using Ranking SVMs
for December 2009.
Outlet: titles of Top-3 articles
CBS: (1) Sources: Elin Done with Tiger; (2) Tiger Woods Slapped with Ticket for Crash; (3) Tiger Woods: I let my Family Down
Florida Times-Union: (1) Pizza delivery woman killed on Westside; (2) A family’s search for justice, 15 years later; (3) Rants & Raves: Napolitano unqualified
NY Times: (1) Poor Children Likelier to Get Antipsychotics; (2) Surf’s Up, Way Up, and Competitors Let Out a Big Mahalo; (3) Grandma’s Gifts Need Extra Reindeer
Reuters: (1) Dubai says not responsible for Dubai World debt; (2) Boeing Dreamliner touches down after first flight; (3) Iran’s Ahmadinejad mocks Obama, “TV series” nuke talks
Seattle Post: (1) Hospital: Actress Brittany Murphy dies at age 32; (2) Actor Charlie Sheen arrested in Colorado; (3) Charlie Sheen accused of using weapon in Aspen
EU Mediasphere
Top-10 media outlets per
country
over the 27 EU countries
in 22 different languages
for a 6-month period
A total of 1.3M news
items.
What patterns can we find using modern AI techniques?
The EU Mediasphere
Co-coverage network: we link two outlets if they share more stories than
expected by chance (χ² scores).
This network has 203 nodes and 6702 edges.
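One way to score a pair of outlets is a χ² statistic on the 2×2 table of stories covered by both, by only one, or by neither outlet. The sketch below is an assumption about the setup, not the exact procedure behind this network; the story identifiers and counts are hypothetical.

```python
def chi2_cocoverage(stories_a: set, stories_b: set, n_stories: int) -> tuple[float, bool]:
    """Return (chi-square score, whether the overlap exceeds the chance expectation)."""
    both = len(stories_a & stories_b)
    only_a = len(stories_a) - both
    only_b = len(stories_b) - both
    neither = n_stories - both - only_a - only_b
    # Margins of the 2x2 contingency table.
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        return 0.0, False
    chi2 = n_stories * (both * neither - only_a * only_b) ** 2 / denom
    expected_both = row_a * col_b / n_stories
    return chi2, both > expected_both


# Hypothetical data: 100 stories, each outlet covers 20, and 10 are shared.
outlet_a = set(range(0, 20))
outlet_b = set(range(10, 30))
chi2, above_chance = chi2_cocoverage(outlet_a, outlet_b, 100)
print(chi2, above_chance)
```

An edge would then be drawn when the overlap is above expectation and the score exceeds a chosen threshold; raising the threshold gives a sparser network.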
A bit sparser....
This network has 197 nodes, 3386 edges and 3 connected components.
Singleton nodes are omitted.
What kind of connected components are formed?
We sparsify the network as far as possible, using modularity
maximisation as the stopping criterion.
The probability that two non-singleton nodes from the same country
end up in the same connected component is 82.9% (p < 0.001).
Nationality is the major underlying criterion for which stories media
outlets choose to publish.
We therefore work at the country level rather than the outlet level.
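A same-country, same-component probability of this kind can be estimated by counting pairs of outlets. The sketch below assumes a simple pair-counting estimate, which may differ from the procedure actually used; the outlet names, countries and component labels are hypothetical.

```python
from itertools import combinations


def same_component_rate(country_of, component_of):
    """Among outlet pairs from the same country, the fraction sharing a component."""
    same_country_pairs = 0
    together = 0
    for a, b in combinations(sorted(country_of), 2):
        if country_of[a] == country_of[b]:
            same_country_pairs += 1
            if component_of[a] == component_of[b]:
                together += 1
    if same_country_pairs == 0:
        raise ValueError("no pair of outlets shares a country")
    return together / same_country_pairs


# Hypothetical outlets, countries and connected-component labels.
country_of = {"uk1": "UK", "uk2": "UK", "uk3": "UK", "fr1": "FR", "fr2": "FR"}
component_of = {"uk1": 0, "uk2": 0, "uk3": 1, "fr1": 1, "fr2": 1}
rate = same_component_rate(country_of, component_of)
print(rate)
```

A significance level could be attached by recomputing the rate after randomly shuffling the country labels many times, which is again an assumption about the method.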
Which are the strongest connections between countries?
We sparsify the network as far as possible while keeping it connected.
This network has 27 nodes and 112 edges.
Can we explain relations of countries?
We found a significant (p < 0.001) correlation between countries’ media-content
similarity and their:
Geographical proximity, based on shared borders: 33.86%
Economic proximity, based on trade volume: 31.03%
Cultural proximity, based on song contest voting patterns: 32.05%
UK Metro, Dec. 8, 2010: Countries that always vote for each other in the
Eurovision song contest, have a shared interest in news content, as well as
terrible music, a study has shown [...]
How ‘close’ are countries, based on common media
interests?
We use χ²-scores as similarities and project the countries onto a 2D plane using
Multidimensional Scaling.
We colour the Eurozone members
in blue.
These countries are closer to the
centre, that is, to the average
EU media content.
Ranking of countries
Based on their deviation from average EU media content (in 26D space).
Rank  Country         Euro  Accession Year
1     France          Y     1957
2     Austria         Y     1995
3     Germany         Y     1957
4     Greece          Y     1981
5     Ireland         Y     1973
6     Cyprus          Y     2004
7     Slovenia        Y     2004
8     Spain           Y     1986
9     Slovakia        Y     2004
10    Italy           Y     1957
11    Belgium         Y     1957
12    Luxembourg      Y     1957
13    Bulgaria        N     2007
14    Netherlands     Y     1957
15    United Kingdom  N     1973
16    Finland         Y     1995
17    Sweden          N     1995
18    Poland          N     2004
19    Estonia         N     2004
20    Denmark         N     1973
21    Portugal        Y     1986
22    Malta           Y     2004
23    Czech Republic  N     2004
24    Romania         N     2007
25    Latvia          N     2004
26    Hungary         N     2004
27    Lithuania       N     2004
Any other important factors?
Correlations of countries’ deviation from average EU media content.
Factor              Correlation (%)  p-value
In Eurozone         70.65            <0.001
Accession Year      -49.32           0.009
GDP 2008            44.75            0.020
Population          23.05            0.247
Area                15.63            0.435
Population Density  7.45             0.712
The first three factors are significant (p < 0.05), while the rest are not.
Discussion
EU media editors independently made a multitude of small editorial
decisions, which shaped the contents of the EU mediasphere in a way
that reflects its deep geographic, economic and cultural relations.
Detecting these subtle signals in a statistically rigorous way would be
out of reach of the traditional methods of social scientists.
This analysis demonstrates the power of the available methods for
significant automation of media content analysis.
Conclusions
Several tasks can be automated using modern AI techniques.
◮ However, the high-precision, sophisticated annotation that human
‘coders’ can achieve is not yet within reach.
Research can be conducted across multiple languages and countries.
A study can run over a long period of time and involve millions
of articles.
We can tackle questions that could not be answered previously.
◮ e.g. Which country’s media have the most negative coverage of
environmental news that also mentions Barack Obama?
In the social sciences, the analysis of news media is done largely by
hand in a hypothesis-driven fashion. Is it time for social sciences to
also adopt a data-driven research model?
Future Work
Ideas under development:
Use of features such as images / audio / video.
How does SMT affect the supervised/unsupervised learning?
Compare the US and the EU mediaspheres
...
Work from other members of the group
Suffix Tree - Detection of memes
Named Entities detection & disambiguation
Twitter - Events detection
Summarisation of news
Online learning algorithms for news annotation
References 1
I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J.
Lewis, N. Cristianini: “The Structure of the EU Mediasphere”, PLoS
ONE, Vol. 5(12), pp. e14243, 2010.
I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and
Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol.
5782(1), pp. 344–358, Bled, Slovenia, 2009.
O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini,
“Automating News Content Analysis: An Application to Gender Bias
and Readability”, JMLR W & CP: Workshop on Applications of
Pattern Analysis (WAPA), Vol.11, pp. 36–43, Windsor, UK, 2010.
M. Turchi, I. Flaounas, O. Ali, T. De Bie, T. Snowsill and N.
Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS,
Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009.
References 2
I. Flaounas, N. Fyson and N. Cristianini: “Predicting Relations in
News-Media Content among EU Countries”, 2nd International
Workshop on Cognitive Information Processing (CIP), IEEE, pp.
269–274, Elba, Italy, 2010.
E. Hensinger, I. Flaounas and N. Cristianini: “Learning the
Preferences of News Readers with SVM and Lasso Ranking”,
Artificial Intelligence Applications and Innovations, Springer, pp.
179–186, Larnaca, Cyprus, 2010.
T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting
events in a million New York Times articles”, ECML/PKDD,
Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010.
I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns
in the European Mediasphere”, IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology, pp.
527–530, Milano, Italy, 2009.
More info at: http://mediapatterns.enm.bris.ac.uk
Thank you!
  • 8.
    Example of CodingScheme These questionnaires have to be completed manually. The same questionnaire has to be completed by more than one coder for the same news items. This is a fully hypothesis driven research model. I. Flaounas (University of Bristol) January 19, 2010 7 / 57
  • 9.
    But nowadays... Most mediaoffer their content online in a convenient form. I. Flaounas (University of Bristol) January 19, 2010 8 / 57
  • 10.
    Research Focus In ourresearch we undertake a large-scale traditional news-media textual content analysis using automated techniques. I. Flaounas (University of Bristol) January 19, 2010 9 / 57
  • 11.
    Research Focus In ourresearch we undertake a large-scale traditional news-media textual content analysis using automated techniques. ‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items. I. Flaounas (University of Bristol) January 19, 2010 9 / 57
  • 12.
    Research Focus In ourresearch we undertake a large-scale traditional news-media textual content analysis using automated techniques. ‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items. ‘Traditional news-media’ since we do not focus on modern online-only news spreading means such as blogs or Twitter. I. Flaounas (University of Bristol) January 19, 2010 9 / 57
  • 13.
    Research Focus In ourresearch we undertake a large-scale traditional news-media textual content analysis using automated techniques. ‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items. ‘Traditional news-media’ since we do not focus on modern online-only news spreading means such as blogs or Twitter. ‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech. I. Flaounas (University of Bristol) January 19, 2010 9 / 57
  • 14.
    Research Focus In ourresearch we undertake a large-scale traditional news-media textual content analysis using automated techniques. ‘Large-scale’ since we analyse hundreds of outlets, typically for extended periods of time, involving millions of news items. ‘Traditional news-media’ since we do not focus on modern online-only news spreading means such as blogs or Twitter. ‘Textual’ since we use only the textual information of news rather than analysing e.g. images, videos, or speech. ‘Automated’ in the sense that the analysis is performed by applying Artificial Intelligence techniques rather than using human ‘coders’. I. Flaounas (University of Bristol) January 19, 2010 9 / 57
  • 15.
    Relevant Work &Datasets Europe Media Monitor (EMM) ‘Lydia’ system Newsblaster NewsInEssence Google News, Yahoo! News LexisNexis Public Corpora: Reuters, New York Times I. Flaounas (University of Bristol) January 19, 2010 10 / 57
  • 16.
    Relevant Work &Datasets Europe Media Monitor (EMM) ‘Lydia’ system Newsblaster NewsInEssence Google News, Yahoo! News LexisNexis Public Corpora: Reuters, New York Times We are highly interested in studying the media system per se. I. Flaounas (University of Bristol) January 19, 2010 10 / 57
  • 17.
    Overview 1 Introduction 2 AutomatingNews Content Analysis Visualisation of Mediasphere Demo: Found In Translation 3 Findings Detection of Biases Predicting Popular Articles The Structure of the EU Mediasphere 4 Conclusions I. Flaounas (University of Bristol) January 19, 2010 11 / 57
  • 18.
    Automating News ContentAnalysis Some automation is possible using AI approaches: ◮ Machine Learning ◮ Data Mining ◮ Natural Language Processing Some questions about media system can be answered for the first time. We apply methods that work efficiently and reliably on large-scale data. We can attempt a data-driven research model. I. Flaounas (University of Bristol) January 19, 2010 12 / 57
  • 19.
    Methods Summary RSS parsing,Web page content scrapping Text Preprocessing: Stemming, stop-words removal, TF-IDF, ... Tagging: Support Vector Machines. Clustering: Best Reciprocal Hit Ranking: SVM rank Words selection: Lasso, SVMs Network reconstruction: χ2-test ... NLP: Statistical Machine Translation, Sentiment Analysis, Readability Data Visualisation: Multidimensional Scaling, Spring Embedding, ... Statistics: correlations, significance tests... I. Flaounas (University of Bristol) January 19, 2010 13 / 57
  • 20.
    Building & Annotatingthe Corpus Our corpus in numbers: > 1300 multilingual news sources > 3000 news feeds 133 countries 22 languages > 3 years of continuous monitoring 40K news items per day > 30M news items in total I. Flaounas (University of Bristol) January 19, 2010 14 / 57
  • 21.
    NOAM: News OutletsAnalysis & Monitoring system I. Flaounas (University of Bristol) January 19, 2010 15 / 57
  • 22.
    NOAM: News OutletsAnalysis & Monitoring system NOAM enable us to query the corpus at semantic level. I. Flaounas (University of Bristol) January 19, 2010 15 / 57
  • 23.
    Statistical Machine Translation1 Weapplied a phrase based Statistical Machine Translation (SMT) approach for translating the non-English articles to English. We use Moses, a complete phrase based translation toolkit for academic purposes. We translate all non-English articles of 21 EU languages into English. For each language pair, an instance of Moses is trained using Europarl data and JRC-Acquis Multilingual Parallel Corpus. We make the working assumption that SMT does not alter significantly the geometry of the news corpus in the vector-space representation. 1 Acknowledgements to Marco Turchi for implementing the SMT module. I. Flaounas (University of Bristol) January 19, 2010 16 / 57
  • 24.
    Articles Per Day 050 100 150 200 250 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 x 10 4 #Days starting from June 1st, 2009 #Articles We observe: A seven days cycle Local minima during weekend days. I. Flaounas (University of Bristol) January 19, 2010 17 / 57
  • 25.
    Clustering Articles The BestReciprocal Hit method: I. Flaounas (University of Bristol) January 19, 2010 18 / 57
  • 26.
    Outlets Per Story 10 0 10 1 10 2 10 1 10 2 10 3 10 4 10 5 10 6 #Outletsper Story. #Stories. Few stories are covered by lots of media, and lots of stories are covered by few media. I. Flaounas (University of Bristol) January 19, 2010 19 / 57
  • 27.
    The Global Mediasphere I.Flaounas (University of Bristol) January 19, 2010 20 / 57
  • 28.
    The Global Mediasphere 543nodes, 4783 edges, colour by country I. Flaounas (University of Bristol) January 19, 2010 20 / 57
  • 29.
    Support Vector Machinesas Topic Taggers We train on data from two well accepted corpora: ◮ Reuters ◮ NY Times Typical text preprocessing: Stemming, stop-words removal, bag-of-words (TF-IDF) representation... Two-class SVMs Cosine kernel Maximize F0.5-Score on unseen data Train to recognise 14 interesting news topics I. Flaounas (University of Bristol) January 19, 2010 21 / 57
  • 30.
    SVM Taggers Topic CorpusF0.5-Score Precision Recall 1 SPORTS Reuters 97.78 98.31 95.75 2 MARKETS Reuters 92.02 94.09 84.63 3 FASHION Reuters 83.88 94.61 71.27 4 DISASTERS Reuters 83.4 87.69 70.34 5 ART NY Times 81.67 84.9 71.38 6 BUSINESS NY Times 81.16 86.23 65.87 7 INFLATION-PRICES Reuters 77.01 81.45 63.38 8 RELIGION NY Times 74.95 83.57 53.59 9 POLITICS NY Times 73.81 76.65 64.81 10 SCIENCE Reuters 73.63 83.72 50.62 11 WEATHER Reuters 71.43 82.91 46.84 12 PETROLEUM Reuters 70.67 75.14 58.73 13 ELECTIONS Reuters 70.32 78.99 49.32 14 ENVIRONMENT NY Times 64.29 73.48 43.7 I. Flaounas (University of Bristol) January 19, 2010 22 / 57
  • 31.
    Demo: Found InTranslation We implemented a demo to demonstrate the state of the art in various disciplines of modern Artificial Intelligence. We compare the EU countries according to what topics their media choose to cover. Everyday we machine-translate 640 EU media content Annotate them using SVMs Compare EU countries media content based on their Top-10 media I. Flaounas (University of Bristol) January 19, 2010 23 / 57
  • 32.
    Demo: Found InTranslation http://foundintranslation.enm.bris.ac.uk I. Flaounas (University of Bristol) January 19, 2010 24 / 57
  • 33.
    Overview 1 Introduction 2 AutomatingNews Content Analysis Visualisation of Mediasphere Demo: Found In Translation 3 Findings Detection of Biases Predicting Popular Articles The Structure of the EU Mediasphere 4 Conclusions I. Flaounas (University of Bristol) January 19, 2010 25 / 57
  • 34.
    Detection of Biases Thegoal is to measure typical biases among different topics as they are presented in the media: Readability Linguistic Subjectivity Popularity Gender Bias Corpus: 500 English-language media 10 months, (Jan. 1st, 2010 – Oct 31st, 2011) 2.5M articles appeared in main feed I. Flaounas (University of Bristol) January 19, 2010 26 / 57
  • 35.
    Readability We measure readabilitybased on the Flesch Reading Ease Test The higher the FRET the easier the text to read. Scores range from 0–100. 10K items per topic FRET(article) = 206.835 − (1.015 · ASL) − 84.6 · ASW I. Flaounas (University of Bristol) January 19, 2010 27 / 57
  • 36.
    Linguistic Subjectivity We measurethe percentage of sentimental adjectives over the total number of adjectives. Adjectives detection by Stanford POS tagger. We check for each adjective the presence of a SentiWordnet sentimental score > 0.25. I. Flaounas (University of Bristol) January 19, 2010 28 / 57
  • 37.
    Validation of LinguisticSubjectivity? This is a challenge due to miss of a golden standard. I. Flaounas (University of Bristol) January 19, 2010 29 / 57
  • 38.
    Validation of LinguisticSubjectivity? This is a challenge due to miss of a golden standard. A subset of leading news-outlets, ranked by their linguistic subjectivity. Rank Outlet 1 BBC 2 Times 3 NY Times 4 The Guardian 5 CBS 6 Daily Telegraph 7 Daily Star(T) 8 Independent 9 Daily Mail (T) 10 Daily Mirror (T) 11 Newsweek 12 The sun (T) I. Flaounas (University of Bristol) January 19, 2010 29 / 57
  • 39.
    Popularity We measure theconditional probability of an article to become popular given its topic. We track 16 English language outlets that provide a “Most popular” feed. ◮ In total 108,516 articles were popular. ◮ 36,788 articles were popular and appeared in the main feed. P(Pop/Topic) = P(Topic/Pop) · P(Pop) P(Topic) I. Flaounas (University of Bristol) January 19, 2010 30 / 57
  • 40.
    Scatter plot ofTopics 0 0.5 1 1.5 2 2.5 36 38 40 42 44 46 48 50 52 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Popularity Readability I. Flaounas (University of Bristol) January 19, 2010 31 / 57
  • 41.
    Scatter plot ofTopics 0 0.5 1 1.5 2 2.5 36 38 40 42 44 46 48 50 52 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Popularity Readability 0 0.5 1 1.5 2 2.5 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Popularity LinguisticSubjectivity 0 0.5 1 1.5 2 2.5 1 2 3 4 5 6 7 8 9 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Popularity GenderBias 35 40 45 50 55 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Readability LinguisticSubjectivity 35 40 45 50 55 1 2 3 4 5 6 7 8 9 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Readability GenderBias 0.01 0.02 0.03 0.04 0.05 1 2 3 4 5 6 7 8 9 ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER Linguistic Subjectivity GenderBias ART BUSINESS DISASTERS ELECTIONS ENVIRONMENT FASHION MARKETS PETROLEUM POLITICS PRICES RELIGION SCIENCE SPORTS WEATHER I. Flaounas (University of Bristol) January 19, 2010 32 / 57
  • 42.
    Scatter plot ofOutlets 0.02 0.025 0.03 0.035 0.04 0.045 2.5 3 3.5 4 4.5 5 BBC CBS Daily Mail Daily Mirror Daily Star Daily Telegraph Independent Newsweek NY Times The Guardian The Sun Times Linguistic Subjectivity GenderBias I. Flaounas (University of Bristol) January 19, 2010 33 / 57
  • 43.
    Scatter plot ofOutlets 30 35 40 45 50 55 60 2.5 3 3.5 4 4.5 5 BBC CBS Daily Mail Daily Mirror Daily Star Daily Telegraph Independent Newsweek NY Times The Guardian The Sun Times Readability GenderBias I. Flaounas (University of Bristol) January 19, 2010 34 / 57
  • 44.
    Scatter plot ofOutlets 30 35 40 45 50 55 60 0.02 0.025 0.03 0.035 0.04 0.045 BBC CBS Daily Mail Daily Mirror Daily Star Daily Telegraph Independent Newsweek NY Times The Guardian The Sun Times Readability LinguisticSubjectivity I. Flaounas (University of Bristol) January 19, 2010 35 / 57
  • 45.
    Overview 1 Introduction 2 AutomatingNews Content Analysis Visualisation of Mediasphere Demo: Found In Translation 3 Findings Detection of Biases Predicting Popular Articles The Structure of the EU Mediasphere 4 Conclusions I. Flaounas (University of Bristol) January 19, 2010 36 / 57
  • 46.
    Predicting Popular Articles Editorswant to know what stories their readers would like to read. Can we predict which stories will become popular? I. Flaounas (University of Bristol) January 19, 2010 37 / 57
  • 47.
    Predicting Popular Articles Editorswant to know what stories their readers would like to read. Can we predict which stories will become popular? Modelling this question as a simple binary classification problem leads to very low performance. I. Flaounas (University of Bristol) January 19, 2010 37 / 57
  • 48.
    Predicting Popular Articles Editorswant to know what stories their readers would like to read. Can we predict which stories will become popular? Modelling this question as a simple binary classification problem leads to very low performance. a ranking problem can lead to promising results. This is because popularity is a relevant concept and not an absolute one. I. Flaounas (University of Bristol) January 19, 2010 37 / 57
  • 49.
    Predicting the PopularArticles Month-by-month predictions, using Ranking SVM. 6 months of data. Accuracy is the correct orientation of positive/negative pairs of data. I. Flaounas (University of Bristol) January 19, 2010 38 / 57
  • 50.
    Most Popular Articles Titlesof most popular articles per outlet as ranked using Ranking SVMs for December 2009. Outlet Titles of Top-3 Articles CBS Sources: Elin Done with Tiger — Tiger Woods Slapped with Ticket for Crash — Tiger Woods: I let my Family Down Florida Times- Union Pizza delivery woman killed on Westside — A family’s search for justice, 15 years later — Rants & Raves: Napolitano unqualified NY Times Poor Children Likelier to Get Antipsychotics — Surf s Up, Way Up, and Competitors Let Out a Big Mahalo — Grandma s Gifts Need Extra Reindeer Reuters Dubai says not responsible for Dubai World debt — Boe- ing Dreamliner touches down after first flight — Iran’s Ah- madinejad mocks Obama, ”TV series” nuke talks Seattle Post Hospital: Actress Brittany Murphy dies at age 32 — Actor Charlie Sheen arrested in Colorado — Charlie Sheen accused of using weapon in Aspen I. Flaounas (University of Bristol) January 19, 2010 39 / 57
  • 51.
    Overview 1 Introduction 2 AutomatingNews Content Analysis Visualisation of Mediasphere Demo: Found In Translation 3 Findings Detection of Biases Predicting Popular Articles The Structure of the EU Mediasphere 4 Conclusions I. Flaounas (University of Bristol) January 19, 2010 40 / 57
  • 52.
    EU Mediasphere Top-10 mediaoutlets per country over the 27 EU countries in 22 different languages for a 6 months period A total of 1.3M news items. I. Flaounas (University of Bristol) January 19, 2010 41 / 57
  • 53.
    EU Mediasphere Top-10 mediaoutlets per country over the 27 EU countries in 22 different languages for a 6 months period A total of 1.3M news items. What patterns can we find using modern AI techniques? I. Flaounas (University of Bristol) January 19, 2010 41 / 57
  • 54.
    The EU Mediasphere Co-coveragenetwork: We link two outlets if they share more stories than expected by chance (χ2 − scores). I. Flaounas (University of Bristol) January 19, 2010 42 / 57
  • 55.
    The EU Mediasphere Co-coveragenetwork: We link two outlets if they share more stories than expected by chance (χ2 − scores). This network has 203 nodes and 6702 edges. I. Flaounas (University of Bristol) January 19, 2010 42 / 57
  • 56.
    A bit sparser.... Thisnetwork has 197 nodes, 3386 edges and 3 connected components. Singleton nodes are omitted. I. Flaounas (University of Bristol) January 19, 2010 43 / 57
  • 57.
    What kind ofconnected components are formed? I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 58.
    What kind ofconnected components are formed? I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 59.
    What kind ofconnected components are formed? We go as sparse as possible with stopping criterion the modularity maximization. I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 60.
    What kind ofconnected components are formed? We go as sparse as possible with stopping criterion the modularity maximization. The probability of two non-singleton nodes from the same country to end up in the same connected component is 82.9% (p < 0.001). I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 61.
    What kind ofconnected components are formed? We go as sparse as possible with stopping criterion the modularity maximization. The probability of two non-singleton nodes from the same country to end up in the same connected component is 82.9% (p < 0.001). Nationality is the major underline criterion of what stories media outlets choose to publish. I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 62.
    What kind ofconnected components are formed? We go as sparse as possible with stopping criterion the modularity maximization. The probability of two non-singleton nodes from the same country to end up in the same connected component is 82.9% (p < 0.001). Nationality is the major underline criterion of what stories media outlets choose to publish. We will work on countries level rather than outlets level. I. Flaounas (University of Bristol) January 19, 2010 44 / 57
  • 63.
    Which are thestrongest connections between countries? I. Flaounas (University of Bristol) January 19, 2010 45 / 57
  • 64.
    Which are thestrongest connections between countries? We go as sparse as possible while keeping the network connected. This network has 27 nodes and 112 edges. I. Flaounas (University of Bristol) January 19, 2010 45 / 57
  • 65.
    Can we explainrelations of countries? I. Flaounas (University of Bristol) January 19, 2010 46 / 57
  • 66.
    Can we explainrelations of countries? We found significant (p < 0.001) correlation of countries’ media-content similarity to their: Geographical proximity — based on sharing of borders 33.86% I. Flaounas (University of Bristol) January 19, 2010 46 / 57
  • 67.
    Can we explainrelations of countries? We found significant (p < 0.001) correlation of countries’ media-content similarity to their: Geographical proximity — based on sharing of borders 33.86% Economical proximity — based on trade volume 31.03% I. Flaounas (University of Bristol) January 19, 2010 46 / 57
  • 68.
    Can we explainrelations of countries? We found significant (p < 0.001) correlation of countries’ media-content similarity to their: Geographical proximity — based on sharing of borders 33.86% Economical proximity — based on trade volume 31.03% Cultural proximity — based on song contest votting patterns 32.05% I. Flaounas (University of Bristol) January 19, 2010 46 / 57
  • 69.
    Can we explainrelations of countries? We found significant (p < 0.001) correlation of countries’ media-content similarity to their: Geographical proximity — based on sharing of borders 33.86% Economical proximity — based on trade volume 31.03% Cultural proximity — based on song contest votting patterns 32.05% UK Metro, Dec. 8, 2010: Countries that always vote for each other in the Eurovision song contest, have a shared interest in news content, as well as terrible music, a study has shown [...] I. Flaounas (University of Bristol) January 19, 2010 46 / 57
  • 70.
    How ‘close’ arecountries, based on common media interests? We use χ2-scores as similarities and project countries in a 2D plane using Multidimensional Scaling. I. Flaounas (University of Bristol) January 19, 2010 47 / 57
  • 71.
    How ‘close’ arecountries, based on common media interests? We use χ2-scores as similarities and project countries in a 2D plane using Multidimensional Scaling. I. Flaounas (University of Bristol) January 19, 2010 47 / 57
  • 72.
    How ‘close’ arecountries, based on common media interests? We use χ2-scores as similarities and project countries in a 2D plane using Multidimensional Scaling. We colour the Eurozone members in blue. These countries are closer to the centre, that is the average EU-media content. I. Flaounas (University of Bristol) January 19, 2010 48 / 57
  • 73.
    Ranking of countries Basedon their deviation from average EU media content (in 26D space). I. Flaounas (University of Bristol) January 19, 2010 49 / 57
  • 74.
    Ranking of countries Basedon their deviation from average EU media content (in 26D space). Rank Country Euro A.Year 1 France Y 1957 2 Austria Y 1995 3 Germany Y 1957 4 Greece Y 1981 5 Ireland Y 1973 6 Cyprus Y 2004 7 Slovenia Y 2004 8 Spain Y 1986 9 Slovakia Y 2004 10 Italy Y 1957 11 Belgium Y 1957 12 Luxembourg Y 1957 13 Bulgaria N 2007 14 Netherlands Y 1957 15 U. Kingdom N 1973 16 Finland Y 1995 17 Sweden N 1995 18 Poland N 2004 19 Estonia N 2004 20 Denmark N 1973 21 Portugal Y 1986 22 Malta Y 2004 23 Czech Republic N 2004 24 Romania N 2007 25 Latvia N 2004 26 Hungary N 2004 27 Lithuania N 2004 I. Flaounas (University of Bristol) January 19, 2010 49 / 57
  • 75.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 76.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 Accession Year -49.32 0.009 I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 77.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 Accession Year -49.32 0.009 GDP 2008 44.75 0.020 I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 78.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 Accession Year -49.32 0.009 GDP 2008 44.75 0.020 Population 23.05 0.247 I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 79.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 Accession Year -49.32 0.009 GDP 2008 44.75 0.020 Population 23.05 0.247 Area 15.63 0.435 I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 80.
    Any other importantfactors? Correlations of countries deviation from average EU media content. Factor Correlation (%) p-values In Eurozone 70.65 <0.001 Accession Year -49.32 0.009 GDP 2008 44.75 0.020 Population 23.05 0.247 Area 15.63 0.435 Population Density 7.45 0.712 The first three factors are significant (p < 0.05), while the rest are not. I. Flaounas (University of Bristol) January 19, 2010 50 / 57
  • 81.
    Discussion EU media editorsmade independently a multitude of small editorial decisions which shaped the contents of the EU mediasphere in a way that reflects its deep geographic, economic and cultural relations. Detecting these subtle signals in a statistically rigorous way would be out of the reach of traditional methods of social scientists. This analysis demonstrates the power of the available methods for significant automation of media content analysis. I. Flaounas (University of Bristol) January 19, 2010 51 / 57
  • 82.
    Conclusions Several tasks usingmodern AI techniques can be automated. ◮ Though the high-precision & sophisticated annotation, that human ‘coders’ can achieve, is not yet reachable. I. Flaounas (University of Bristol) January 19, 2010 52 / 57
  • 83.
    Conclusions Several tasks usingmodern AI techniques can be automated. ◮ Though the high-precision & sophisticated annotation, that human ‘coders’ can achieve, is not yet reachable. Research can be conducted across multiple languages/countries. I. Flaounas (University of Bristol) January 19, 2010 52 / 57
  • 84.
    Conclusions Several tasks usingmodern AI techniques can be automated. ◮ Though the high-precision & sophisticated annotation, that human ‘coders’ can achieve, is not yet reachable. Research can be conducted across multiple languages/countries. The study can run for a long period of time period involving millions of articles. I. Flaounas (University of Bristol) January 19, 2010 52 / 57
  • 85.
    Conclusions Several tasks usingmodern AI techniques can be automated. ◮ Though the high-precision & sophisticated annotation, that human ‘coders’ can achieve, is not yet reachable. Research can be conducted across multiple languages/countries. The study can run for a long period of time period involving millions of articles. We challenge questions that could not be answered previously. ◮ e.g. Which country’s media have the most negative coverage of environmental news that also mention Barack Obama? I. Flaounas (University of Bristol) January 19, 2010 52 / 57
  • 86.
    Conclusions Several tasks usingmodern AI techniques can be automated. ◮ Though the high-precision & sophisticated annotation, that human ‘coders’ can achieve, is not yet reachable. Research can be conducted across multiple languages/countries. The study can run for a long period of time period involving millions of articles. We challenge questions that could not be answered previously. ◮ e.g. Which country’s media have the most negative coverage of environmental news that also mention Barack Obama? In the social sciences, the analysis of news media is done largely by hand in a hypothesis-driven fashion. Is it time for social sciences to also adopt a data-driven research model? I. Flaounas (University of Bristol) January 19, 2010 52 / 57
  • 87.
    Future Work Under developmentideas: Use of features such as images / audio / video. How does SMT affect the supervised/unsupervised learning? Compare the US and the EU mediaspheres ... I. Flaounas (University of Bristol) January 19, 2010 53 / 57
  • 88.
    Work from othermembers of the group Suffix Tree - Detection of memes Named Entities detection & disambiguation Twitter - Events detection Summarisation of news Online learning algorithms for news annotation I. Flaounas (University of Bristol) January 19, 2010 54 / 57
  • 89.
    References 1 I. Flaounas,M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini: “The Structure of the EU Mediasphere”, PLoS ONE, Vol. 5(12), pp. e14243, 2010. I. Flaounas, M. Turchi, T. De Bie and N. Cristianini: “Inference and Validation of Networks”, ECML/PKDD, Springer, LNCS, Vol. 5782(1), pp. 344–358, Bled, Slovenia, 2009. O. Ali, I. Flaounas, T. De Bie, N. Mosdell, J. Lewis and N. Cristianini, “Automating News Content Analysis: An Application to Gender Bias and Readability”, JMLR W & CP: Workshop on Applications of Pattern Analysis (WAPA), Vol.11, pp. 36–43, Windsor, UK, 2010. M. Turchi, I. Flaounas, O. Ali, T De Bie, T. Snowsill and N. Cristianini: “Found in Translation”, ECML/PKDD, Springer, LNCS, Vol. 5782(2), pp. 746–749, Bled, Slovenia, 2009. I. Flaounas (University of Bristol) January 19, 2010 55 / 57
  • 90.
    References 2 I. Flaounas,N. Fyson and N. Cristianini: “Predicting Relations in News-Media Content among EU Countries”, 2nd International Workshop on Cognitive Information Processing (CIP), IEEE, pp. 269–274, Elba, Italy, 2010. E. Hensinger, I. Flaounas and N. Cristianini: “Learning the Preferences of News Readers with SVM and Lasso Ranking”, Artificial Intelligence Applications and Innovations, Springer, pp. 179–186, Larnaca, Cyprus, 2010. T. Snowsill, I. Flaounas, T. De Bie and N. Cristianini: “Detecting events in a million New York Times articles”, ECML/PKDD, Springer, LNCS, Vol. 6323(3), pp. 615–618, Barcelona, Spain, 2010. I. Flaounas, M. Turchi and N. Cristianini: “Detecting Macro-Patterns in the European Mediasphere”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 527–530, Milano, Italy, 2009. I. Flaounas (University of Bristol) January 19, 2010 56 / 57
  • 91.
    More info at:http://mediapatterns.enm.bris.ac.uk Thank you! I. Flaounas (University of Bristol) January 19, 2010 57 / 57