Connecting political data to media data

Presentation at the ASCoR Spring Colloquium ‘Big Data at the University of Amsterdam’ February 18, 2014.

  1. Connecting political data to media data. Laura Hollink, VU University Amsterdam, Web & Media group. ASCoR Spring Colloquium ‘Big Data at the University of Amsterdam’, February 18, 2014.
  2. Laura Hollink, Damir Juric, Geert-Jan Houben, Martijn Kleppe, Max Kemman, Henri Beunders, Johan Oomen, Jaap Blom. Funded by Clarin-NL.
  3. Questions we want to answer
     • Which events have attracted a lot of media attention?
     • What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?
     • Has the coverage changed over time?
     • How are the events visualized (photos, layout of newspaper, etc.)?
  4. Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.
  5. Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.
     Archives of hundreds of newspapers, with a very large number of issues, or tens of millions of articles, published between 1618 and 1995 (we only use 1945-1995).
  6. Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.
     Roughly 1.8 million news bulletins from 1937-1984 (we only use 1945-1995).
     Archives of hundreds of newspapers, with a very large number of issues, or tens of millions of articles, published between 1618 and 1995 (we only use 1945-1995).
  7. PoliMedia methods
  8. Step 1: Translate the Dutch parliamentary debates to the standard structured web format RDF.
     [Figure: RDF graph of the debate nl.proc.sgd.d.194519460000002 (dated 1945-11-20, language Dutch), combining the three views shown separately on slides 9-11: the debate metadata with links to the original sources, the part-of structure (Debate, PartOfDebate, DebateContext, Speech), and the speaker (dc:source http://resolver.politicalmashup.nl/nl.m.00064) with his party and role.]
     XML by War in Parliament Project.
  9. Modeling the debates as events
     • An event has a date, a location, actors, and possibly sub-events.
     • We build on the Simple Event Model (SEM).
       • links to the original sources
       • reusing existing vocabularies
     [Figure: the node nl.proc.sgd.d.194519460000002, of rdf:type Debate, with the properties dc:id, dc:source (twice), dc:publisher, dc:language, dc:date and dc:title. The values shown are http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002, http://statengeneraaldigitaal.nl/, http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf, nl.proc.sgd.d.19720000002, "Handelingen Verenigde Vergadering...", Dutch and 1945-11-20.]
  10. • The part-of structure and chronological order of the debates.
     [Figure: the debate nl.proc.sgd.d.194519460000002 (dc:title "Handelingen Verenigde Vergadering...") hasPart nl.proc.sgd.d.194519460000002.1 (rdf:type PartOfDebate), which hasSubsequentPartOfDebate nl.proc.sgd.d.194519460000002.2. Part .1 in turn hasPart .1.1 (rdf:type DebateContext, hasText "De voorzitter opent de vergadering…") and .1.2 (rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van …"), which hasSubsequentSpeech .1.3.]
  11. • The different roles and parties that a speaker can have in his/her career.
     [Figure: the speech nl.proc.sgd.d.194519460000002.1.2 (rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van …") is coveredIn http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr and is linked through sem:hasActor to a node with hasSpeaker Speaker_00064, hasParty Party_kvp and hasRole member_of_parliament. Party_kvp has rdf:type Party, hasAcronym KVP and hasFullName Katholieke Volkspartij. The speaker has rdf:type Politician and carries foaf:firstName, foaf:lastName and rdfs:label (Joannes Antonius James Barge).]
  12. Step 2: Linking speeches in the debate to the newspaper articles that cover them
     We created a linking method to deal with our two challenges:
       1. How to link documents that are so different in nature?
       2. Can we use the structure of the debates: people, chronological order of speeches, introductions to each new topic, etc.?
     [Pipeline diagram: Debates → Detect topics in speeches / Detect Named Entities in speeches → Topics, Named Entities, Name of speaker, Date of debate → Create queries → Queries → Search newspaper archive → Candidate articles → Rank candidate articles → Links between speeches and articles.]
  13. Step 2: Linking speeches in the debate to the newspaper articles that cover them
     [Same pipeline diagram as on slide 12.]
     Intuition 1: The name of the speaker should appear in the article, and the article should be published within a week of the debate.
  14. Step 2: Linking speeches in the debate to the newspaper articles that cover them
     [Same pipeline diagram as on slide 12.]
     Intuition 1: The name of the speaker should appear in the article, and the article should be published within a week of the debate.
     Intuition 2: The more the article and the speech overlap in terms of topics and named entities, the more they are related.
  15. Evaluation: what do we use to rank the candidate articles?
     • Experiment on 150 <newspaper article, speech in debate> pairs, 2 raters, κ = 0.5
     • Compare text of candidate articles to:
       • Setting 1: Named Entities in speech
       • Setting 2: Named Entities + Topics in speech
       • Setting 3: Named Entities + Topics in speech and larger part-of-debate

     Score                                Setting 1   Setting 2   Setting 3
     I don’t know                         0.14        0.15        0.08
     0 - unrelated                        0.38        0.23        0.12
     1 - related                          0.29        0.36        0.36
     2 - explicit mention of the debate   0.19        0.26        0.44
     1 + 2                                0.48        0.62        0.80
  16. Results
     • An open data set of Dutch parliamentary debates,
     • with almost 3 million links between 450,000 speeches and URLs of 1.5 million newspaper articles and radio bulletins at the National Library,
     • accessible through a Web demonstrator and through a SPARQL endpoint.
  17. Demo
  18. SPARQL endpoint
     • A service to query a knowledge base using the SPARQL query language.
     “All speeches with more than 60 associated news items.”

       SELECT ?speech ?no_news_items
       WHERE {
         {
           SELECT ?speech (COUNT(?news) AS ?no_news_items)
           WHERE { ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news . }
           GROUP BY ?speech
         }
         FILTER (?no_news_items > 60)
       }
  19. Reflection: to what extent can we answer these questions?
     • Which events have attracted a lot of media attention?
     • What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?
     • Has the coverage changed over time?
     • How are the events visualized (photos, layout of newspaper, etc.)?
  20. Future work
     • More types of links
       • From just “coveredIn” to “quotedIn”, “coveredIn”, “backgroundOf”, “talksAbout”
     • More types of media
     • More types of (political) events
  21. Project ‘Talk of Europe / Traveling Clarin Campus’, 2014-2015. Funded by CLARIN-ERIC.
     From left to right: Max Kemman, Marnix van Berchum, Laura Hollink, Astrid van Aggelen, Steven Krauwer, Henri Beunders. (Unfortunately, Martijn Kleppe and Johan Oomen were not present to join the group pic.)
  22. Plans of ‘ToE/TTC’
     1. Publish proceedings of the EU parliamentary debates in RDF
        • hosted by DANS
     2. Organize 3 workshops/hackathons/‘Traveling Clarin Campuses’ in which we invite international partners to work with the data.
     3. In collaboration with international partners:
        • enrich with annotations, e.g. topics, structured data about people, parties, etc.
        • link to national datasets, e.g. media or national parliaments
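
Step 1 on slides 8-10 (debates as events with a part-of structure and a chronological order) can be illustrated with a small sketch. It is not the PoliMedia conversion code: the function name `debate_triples` and the `parts` dictionary are hypothetical, and identifiers and values are kept as plain strings rather than full IRIs and typed literals. The class and property names (`Debate`, `PartOfDebate`, `Speech`, `hasPart`, `hasSubsequentSpeech`, `hasSubsequentPartOfDebate`, `dc:date`, `dc:language`) are the ones that appear on the slides.

```python
# Sketch only: emits (subject, predicate, object) triples as plain strings.
# debate_triples and the shape of `parts` are assumptions made for illustration.

DEBATE = "nl.proc.sgd.d.194519460000002"


def debate_triples(debate_id, date, language, parts):
    """Return triples for a debate, its parts and its speeches.

    `parts` maps each part identifier to the list of speech identifiers
    it contains, both in chronological order.
    """
    triples = [
        (debate_id, "rdf:type", "Debate"),
        (debate_id, "dc:date", date),
        (debate_id, "dc:language", language),
    ]
    part_ids = list(parts)
    for i, part_id in enumerate(part_ids):
        triples.append((part_id, "rdf:type", "PartOfDebate"))
        triples.append((debate_id, "hasPart", part_id))
        # Chronological order between consecutive parts of the debate.
        if i + 1 < len(part_ids):
            triples.append((part_id, "hasSubsequentPartOfDebate", part_ids[i + 1]))
        speeches = parts[part_id]
        for j, speech_id in enumerate(speeches):
            triples.append((speech_id, "rdf:type", "Speech"))
            triples.append((part_id, "hasPart", speech_id))
            # Chronological order between consecutive speeches in a part.
            if j + 1 < len(speeches):
                triples.append((speech_id, "hasSubsequentSpeech", speeches[j + 1]))
    return triples


parts = {
    DEBATE + ".1": [DEBATE + ".1.2", DEBATE + ".1.3"],
    DEBATE + ".2": [],
}
triples = debate_triples(DEBATE, "1945-11-20", "Dutch", parts)
for triple in triples:
    print(triple)
```

Keeping the order as explicit `hasSubsequent...` triples, rather than relying on the numbering in the identifiers, means a query can follow the sequence of speeches without parsing identifier strings.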

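
The two intuitions behind the linking step on slides 13 and 14 can be sketched as a filter followed by a ranking. This is a simplified illustration under stated assumptions, not the method evaluated on slide 15: the function names, the article dictionaries and the sample data are invented, "within a week" is read as zero to seven days after the debate, and overlap is measured by simple substring matching of the speech's topics and named entities.

```python
from datetime import date, timedelta


def is_candidate(article, speaker_name, debate_date, window_days=7):
    """Intuition 1: speaker named in the article, published within a week."""
    delta = article["published"] - debate_date
    # Assumption: only articles published on or after the debate date count.
    in_window = timedelta(0) <= delta <= timedelta(days=window_days)
    return in_window and speaker_name.lower() in article["text"].lower()


def overlap_score(article, speech_terms):
    """Intuition 2: number of speech topics / named entities found in the article."""
    text = article["text"].lower()
    return sum(1 for term in speech_terms if term.lower() in text)


def link_speech(articles, speaker_name, debate_date, speech_terms):
    """Return ids of candidate articles, best overlap first, dropping zero overlap."""
    candidates = [a for a in articles if is_candidate(a, speaker_name, debate_date)]
    scored = [(overlap_score(a, speech_terms), a["id"]) for a in candidates]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [article_id for score, article_id in ranked if score > 0]


# Invented sample data, for illustration only.
ARTICLES = [
    {"id": "a1", "published": date(1945, 11, 21),
     "text": "Mr Barge spoke in the committee about the budget"},
    {"id": "a2", "published": date(1945, 11, 22),
     "text": "Mr Barge on the budget"},
    {"id": "a3", "published": date(1945, 12, 15),
     "text": "Barge and the committee"},
    {"id": "a4", "published": date(1945, 11, 21),
     "text": "Weather report"},
]

links = link_speech(ARTICLES, "Barge", date(1945, 11, 20), ["budget", "committee"])
print(links)
```

In this sample, `a3` falls outside the one-week window and `a4` does not mention the speaker, so only `a1` and `a2` are ranked.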