Presented at symposium 'Social Media: Incubators of a renewed News Media Landscape?', 27 November 2015, Leuven Belgium. Presentation outlines projects PoliMedia & Newstrackers
Analyzing Published and Consumed Digital & Digitized News
1. Analyzing Published and Consumed
Digital & Digitized News
Martijn Kleppe
Vrije Universiteit Amsterdam
m.kleppe@vu.nl
www.martijnkleppe.nl
@martijnkleppe
Slides on Slideshare:
bit.ly/LeuvenKleppe
Social Media: Incubators of a renewed news media landscape
27 November 2015
Leuven
15. Link debates to news items
Intuition 1: the news item contains a topic and/or the name of a
politician and is published within a week after the debate.
Intuition 2: the more overlap in topics and named entities, the
more probable the link.
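Intuition 2 can be sketched as a simple overlap score between the terms (topics and named entities) of a debate and those of a news item. A minimal illustration, assuming both are available as sets; all names and example terms are hypothetical, and the actual PoliMedia linking algorithm is more involved than this:

```python
def link_score(debate_terms: set, news_terms: set) -> float:
    """Jaccard-style overlap between the topics/named entities of a
    debate and a news item; a higher score means a more probable link."""
    if not debate_terms or not news_terms:
        return 0.0
    return len(debate_terms & news_terms) / len(debate_terms | news_terms)

# Hypothetical example: a debate on informal care and two candidate articles
debate = {"mantelzorg", "zorgverzekering", "Schippers"}
article_a = {"mantelzorg", "Schippers", "kabinet"}
article_b = {"voetbal", "PSV"}

print(link_score(debate, article_a))  # 0.5  -> plausible link
print(link_score(debate, article_b))  # 0.0  -> no link
```

Candidates would first be restricted to items published within a week after the debate (Intuition 1), and only then ranked by this overlap score (Intuition 2).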
21. “Give me all fragments of
debates with over 60
related news items”
SELECT ?speech ?no_news_items {
  {
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }
    GROUP BY ?speech
  }
  FILTER (?no_news_items > 60)
}
SPARQL Endpoint
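For scripted access, a query like the one above can be generated for any threshold before being sent to the endpoint. A minimal sketch; the helper name is hypothetical, and the predicate URI is the one shown on the slide:

```python
def debates_with_min_coverage(threshold: int) -> str:
    """Build the slide's SPARQL query for debate fragments with more
    than `threshold` related news items (helper name is hypothetical)."""
    return f"""
SELECT ?speech ?no_news_items {{
  {{
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {{
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }}
    GROUP BY ?speech
  }}
  FILTER (?no_news_items > {threshold})
}}
""".strip()

query = debates_with_min_coverage(60)
print(query)
```

The resulting string can then be posted to the PoliMedia SPARQL endpoint with any HTTP or SPARQL client library.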
22. • Yeah! It works (but no television)
• Not perfect
• But still ok (recall: 62%; precision: 80%)
• It is open for everyone: www.polimedia.nl
• + via a SPARQL endpoint
• People actually use it
Results
23. NRC Handelsblad, Ewoud Sander, Voor al haar mantelzorgen, 14 April 2014
“Another digital source
I often use is PoliMedia.nl”
Yeah! An article in
NRC HANDELSBLAD!
24. • Yeah! It works (but no television)
• Not perfect
• But still ok (recall: 62%; precision: 80%)
• It is open for everyone: www.polimedia.nl
• + via a SPARQL endpoint
• People actually use it
• We want more: social media, television, recent data
Results
35. What? How?
What genres of news websites
do news users consume 24/7?
For what do news users
consume these websites 24/7?
How does the consumption of news websites
fit in their everyday surfing behavior?
37. The Newstracker
• Collects web activities
• Of specified & authenticated users
• Via a custom-built system
• That collects & cleans web activities
• Extracts textual & visual content of news websites
• And stores all of this as one dataset
38. The Newstracker
Web activities are a lot of data…
And monitoring everything is quite privacy-intrusive…
So selection and structure are needed, via:
• A whitelist of 4,000 websites
• Labels indicating the genre of each website
• Subgenres for News and Information websites
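The whitelist-plus-labels design can be sketched as a lookup from host name to genre and subgenre, discarding anything not on the list. A minimal illustration; the whitelist excerpt and label names are invented, not the actual Newstracker list:

```python
from urllib.parse import urlparse

# Hypothetical excerpt of the ~4,000-site whitelist with genre labels;
# the real Newstracker list and label set are not shown in this deck.
WHITELIST = {
    "nu.nl": ("News", "General News"),
    "vi.nl": ("News", "Sport"),
    "lindanieuws.nl": ("News", "Lifestyle"),
    "google.nl": ("Search", None),
}

def label_visit(url: str):
    """Return (genre, subgenre) for a whitelisted URL, else None
    (the visit is then discarded, limiting privacy intrusion)."""
    host = urlparse(url).netloc.removeprefix("www.")
    return WHITELIST.get(host)

print(label_visit("http://www.vi.nl/home.htm"))        # ('News', 'Sport')
print(label_visit("http://www.example.org/whatever"))  # None
```

Only labeled visits enter the dataset, which is how selection keeps the monitoring both manageable and less privacy-intrusive.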
46. How?

               Via Homepage   Via Referral
TOTAL              59%            41%
General News       64%            36%
Lifestyle          49%            51%
Remarkable         48%            52%
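The homepage-vs-referral split above can be computed from a click log once each visit is labeled with the site's genre and how the user arrived. A minimal sketch with invented example records; the function and field names are hypothetical:

```python
from collections import Counter

# Hypothetical click records: (site genre, arrived_via), where arrived_via
# is "homepage" or "referral" (link, search, social media).
visits = [
    ("General News", "homepage"), ("General News", "homepage"),
    ("General News", "referral"),
    ("Lifestyle", "homepage"), ("Lifestyle", "referral"),
]

def entry_shares(visits):
    """Per-genre share of visits that start at the homepage vs. arrive
    via a referral."""
    per_genre = {}
    for genre, via in visits:
        per_genre.setdefault(genre, Counter())[via] += 1
    return {
        genre: {via: n / sum(c.values()) for via, n in c.items()}
        for genre, c in per_genre.items()
    }

print(entry_shares(visits)["Lifestyle"])  # {'homepage': 0.5, 'referral': 0.5}
```

Run over the full Newstracker log, this kind of aggregation yields the table above.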
“Lindanieuws.nl is more entertainment. Sometimes I
really think ‘this makes no sense’, but it is fun to read.
It’s more entertainment than real news, the way I consume it.”
Lean-back
Snacking
48. “Fashion is my hobby”.
• Visits the same websites every day
• In the same order
• Starts at the homepage
Lean-forward
monitoring
50. Date Time URL
26-4-2015 16:53:05 http://www.vi.nl/home.htm
26-4-2015 16:54:02 http://www.vi.nl/nieuws/promes-maakt-opnieuw-het-verschil-voor-spartak.htm
26-4-2015 17:00:20 http://www.soccernews.nl/news/313971/Kramer_wil_PSV-aanvallers_verslaan:_Ik_sta_er_dichtbij
26-4-2015 17:01:51 http://www.google.nl/
26-4-2015 17:02:01 http://en.wikipedia.org/wiki/Michiel_Kramer
26-4-2015 17:02:15 http://en.wikipedia.org/wiki/Mike_van_Duinen
26-4-2015 17:02:23 http://en.wikipedia.org/wiki/Gervane_Kastaneer
26-4-2015 17:03:00 http://nl.wikipedia.org/wiki/Wilmer_Kousemaker
26-4-2015 17:03:09 http://en.wikipedia.org/wiki/Wilmer_Kousemaker
26-4-2015 17:04:15 http://nl.wikipedia.org/wiki/Benny_Kerstens
26-4-2015 17:04:39 http://nl.wikipedia.org/wiki/Aykut_Demir
“Via VI.nl I get to the Wikipedia entry of, for example,
Wesley Sneijder, and then I look at a teammate of
Sneijder and think ‘Hey! You have been playing at that
club for years’, and then I click further.”
Lean-forward
monitoring
Serendipitous
consumption
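The log excerpt above can be split into browsing sessions by looking at pauses between clicks, which helps separate the lean-forward monitoring visit from the serendipitous Wikipedia chain. A minimal sketch assuming a simple time-gap heuristic; the five-minute threshold is invented:

```python
from datetime import datetime, timedelta

def sessions(log, gap=timedelta(minutes=5)):
    """Split a time-ordered (timestamp, url) log into browsing sessions:
    a new session starts whenever the pause exceeds `gap`."""
    result, current, last = [], [], None
    for ts, url in log:
        if last is not None and ts - last > gap:
            result.append(current)
            current = []
        current.append(url)
        last = ts
    if current:
        result.append(current)
    return result

# Excerpt of the log above: VI.nl, a six-minute pause, then onward clicking
log = [
    (datetime(2015, 4, 26, 16, 53, 5), "http://www.vi.nl/home.htm"),
    (datetime(2015, 4, 26, 16, 54, 2), "http://www.vi.nl/nieuws/..."),
    (datetime(2015, 4, 26, 17, 0, 20), "http://www.soccernews.nl/..."),
    (datetime(2015, 4, 26, 17, 2, 1), "http://en.wikipedia.org/wiki/Michiel_Kramer"),
]
print([len(s) for s in sessions(log)])  # [2, 2]
```

Within a session, the sequence of domains (news site → search → Wikipedia → Wikipedia…) is what reveals the serendipitous chain.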
52. Conclusion
News consumption happens 24/7
BUT…
which website,
when,
and in which order
=
Personal interest plays an
essential role in what
users consider to be
news,
and it determines the
pattern of everyday
news consumption
55. What’s next-2?
Different user groups:
• Different age groups
• Regional News
• Tech news
• Other countries
Requires:
• Updated website whitelist
• Updated scraping templates
56. What’s next-3?
What role do form and content play?
26-4-2015 16:52:59 user28 http://www.bbc.co.uk/sport/0/football/32470569
26-4-2015 16:53:02 user28 http://www.bbc.com/sport/0/football/32470569
26-4-2015 16:53:05 user28 http://www.vi.nl/home.htm
26-4-2015 16:54:02 user28 http://www.vi.nl/nieuws/promes-maakt-opnieuw-het-verschil-voor-spartak.htm
26-4-2015 17:00:20 user28 http://www.soccernews.nl/news/313971/Kramer_wil_PSV-aanvallers_verslaan:_Ik_sta_er_dichtbij
News
+Sport
Next step:
• Automated
Content Analysis
of text on
topic + style
• Visual Content
Analysis
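The planned automated content analysis of topic and style could start from something as simple as keyword matching plus surface cues. A toy sketch; the keyword lists, the punctuation-based style heuristic, and the threshold are invented for illustration only:

```python
# Toy topic lexicons; a real system would use trained classifiers.
TOPIC_KEYWORDS = {
    "sport": {"psv", "goal", "match", "football"},
    "politics": {"parliament", "minister", "debate"},
}

def tag_topic(text: str) -> str:
    """Pick the topic whose keyword list overlaps the text most."""
    words = set(text.lower().split())
    scores = {t: len(words & kw) for t, kw in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

def tag_style(text: str) -> str:
    """Crude style cue: loose/chatty items tend to use more
    exclamation marks and questions than factual ones."""
    return "loose" if text.count("!") + text.count("?") > 1 else "factual"

item = "What a goal! PSV win the match. Who saw that coming?"
print(tag_topic(item), tag_style(item))  # sport loose
```

The visual content analysis step (who is pictured, what is the topic of the image) would require separate image-recognition tooling.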
57. Acknowledgements
The New News Consumer
www.news-use.com
Marco Otte, Hildebrand Bijleveld, Leonie Durlinger, Stefan Heijdra,
Irene Costera Meijer, Marcel Broersma, Tim Groot Kormelink, Chris Peters, Joelle Swart, Anna van Cauwenberge
59. Questions?
Martijn Kleppe
Vrije Universiteit Amsterdam
m.kleppe@vu.nl
@martijnkleppe
www.martijnkleppe.nl
www.polimedia.nl
www.news-use.com
Slides on Slideshare:
bit.ly/LeuvenKleppe
Social Media: Incubators of a renewed news media landscape
27 November 2015
Leuven
Editor's Notes
I am a media scholar/historian and a typical research question I have is this: how do media cover debates in the Dutch Parliament?
Back in the old days (let's say five years ago) I had to go to several places to find my resources.
For example to the National Library of the Netherlands/KB in The Hague where I could read the analog minutes of the Dutch Parliament
And there I had to find the old newspapers and go over them manually
And the same goes for the radio bulletins, which are great sources since, as you can see, they contain handwriting. But the horrible thing was that everything had to be done manually.
That changed when this great stuff got digitized. I can now look up recent newspapers in the LexisNexis database, for example. A great database, but for me there is one big downside: it contains neither images (in which I am particularly interested) nor whole pages, only single articles.
Another great database I can now use is the Academia database of Sound and Vision. I can search through the metadata of programmes from home or the office and watch the broadcast. But the downside is that I do not know which programmes are in there, and moreover, this system and search engine are completely different from those of LexisNexis. So I do have digitised materials, but there is still a lot of work for me, since I need to understand how these different databases work.
And this is where PoliMedia comes in. With PoliMedia we have built a portal in which you can search through the digital minutes of the Dutch Parliament on any keyword or person. But PoliMedia is linked to media databases such as the digitised newspaper collection of the KB, Television broadcasts of Sound and Vision and Radio Bulletins at the KB.
What PoliMedia does is basically the following. After you perform a query, it searches for topics and/or names in all the newspapers published within a week after a debate, and then calculates the most probable link by looking at the overlap in topics and named entities.
And that looks like this. We made an open website which you can all visit via www.polimedia.nl
You can type your query into the Google-like search box.
On the results page you will see all debates in which the query is found. On the left you can filter your results (which is something we built as well) and on the right you see the magic of PoliMedia: it automatically shows how many and which media items containing coverage of this particular debate were retrieved.
After you click on a result you will see the whole debate with your query highlighted, and on the right side the links to the relevant media items.
After clicking on one of those, you get to the newspaper item as it is stored at the National Library, in their own interface, like this one for the newspapers.
Behind PoliMedia there is a database in which all our links are available in RDF. You can also search through this data without using PoliMedia.nl, via a SPARQL endpoint. You are then more flexible and can ask more complex questions, such as: "Give me all fragments of debates with over 60 related news items".
Recall is the ratio between the number of relevant documents found and the total number of relevant documents that exist. The latter is a predefined "wish list", often called the "ground truth" or "gold standard".
Precision is the ratio between the number of relevant results (documents, hits) and the total number of results returned by the system.
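As a quick illustration of these definitions, here are hypothetical counts that reproduce the slide's figures (recall 62%, precision 80%); the numbers are invented for illustration, not the actual PoliMedia evaluation counts:

```python
def recall(relevant_found: int, relevant_total: int) -> float:
    """Share of all relevant documents (the gold standard) that were found."""
    return relevant_found / relevant_total

def precision(relevant_found: int, returned_total: int) -> float:
    """Share of returned results that are actually relevant."""
    return relevant_found / returned_total

# Invented counts: 310 links returned, 248 of them correct,
# out of 400 correct links in the gold standard.
print(recall(248, 400))     # 0.62
print(precision(248, 310))  # 0.8
```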
Now, there are already quite some tools to monitor what people do on your website.
Everyone who owns a website probably knows Google Analytics which gives very good insights into the clicks on your website.
A similar tool that a lot of publishers use is Chartbeat.
And with Chartbeat you can actually make these kind of dashboards.
This is the so called Big Board of NRC Handelsblad, a leading Dutch newspaper that made this dashboard open to the public.
But in the newsroom these dashboards are constantly shown on screens giving the editors realtime information on what their website visitors are currently reading.
We thus see a difference between the type of news website and how people end up at it.
BUT: it is tempting to draw bold conclusions, yet this does not mean everyone does it like that. This is where our qualitative analyses come in.
Arrives via Facebook!
Arrives via the homepage!
In short, we see a 24/7 pattern, but which website is visited, when, and how is determined not by the type of website but by the individual users, who all differ.
What we already have:
The URLs
We know this is a news website and we made subcategories for the News category, so that is actually already added to the file.
We have scraped the textual and visual content of the websites.
But the difficult part comes now: what do the text and image say?
And that is what we are currently working on, by deploying an automated content analysis of the text on both the topic (soccer, ADO Den Haag) and the style: how is the news item written? In a factual manner, in a loose manner, etc.
Plus we want to analyse the image: what does it tell us? Who is pictured? What is the topic?