DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA
Damir Juric, Delft University of Technology
Laura Hollink, VU University Amsterdam
Geert-Jan Houben, Delft University of Technology
ICWE2013
ICWE2013 - Discovering links between political debates and media

Discovering links between political debates and media
by Damir Juric, Laura Hollink, Geert-Jan Houben
TU Delft - WIS
at ICWE 2013, Aalborg, Denmark, July 2013

  • Politics and media are heavily intertwined, and both play a role in the discussion of policy proposals and current affairs. However, a dataset that allows a joint analysis of the two does not yet exist.
  • The PoliMedia project is driven by research questions from historians about media coverage of politics across several types of media outlets. Cross-media comparisons will be conducted over a longer period of time and on different topics. The project concentrates on media coverage of the debates in the Dutch parliament and gives insight into the choices that different media make when reporting on those debates, and into how these choices change over time.
  • Go to the archives, look up the original data, and decide whether there is a link to a debate. This is cumbersome.
  • The final goal of our project was to connect all the media sources we can find with the particular parliamentary speech we are interested in. In this paper we present our method for linking the debates to the newspaper dataset.
  • The Dutch government publishes the proceedings of its parliamentary debates on two websites. Debates from 1995 until now can be found on the Officiële Bekendmakingen portal and can be downloaded as PDF or in an XML format, using an XML schema and permanent identifiers. The Staten-Generaal Digitaal portal contains the debates from before 1995, which can be accessed using the Search and Retrieval via URL (SRU) protocol. A third source for Dutch parliamentary debate data is the Political Mashup portal, created in the ongoing project War in Parliament (WIP).
  • All debates conform to the same structure, in which speakers give speeches in chronological order. The debates are split into segments according to the different themes or agenda points of the meeting. The first speaker of each segment is always the president of the House of Representatives ('voorzitter', Dutch for chairperson). The chairperson usually gives an introduction to the subject and then gives the floor to a member of parliament. Every word by every speaker is transcribed, including the names of the speakers and their party affiliation. The transcripts also contain metadata such as the date and title of the debate.
  • For newspaper data we use the historic newspaper archive of the National Library of the Netherlands, which contains the text as well as images of newspaper articles from 1618 to 1995. Metadata of the articles is available as DIDL ('Digital Item Declaration Language'), an XML dialect.
  • The semantic model for the PoliMedia project is built to satisfy the requirements of the project, i.e. the research questions from the historians. To represent parliamentary debates as events, we have created a domain-specific semantic model that enables us to express information associated with the debates, such as topics, actors, debate structure, and links to media. We created this model according to the rules of the Dutch parliament, although it can easily be adapted to parliaments in other countries, because core elements like speakers, speeches and topics are present in all parliamentary debates.
  • Top view of our semantic model in RDF. A debate and its structure are broken down into entities and the relationships between them. Entities in blue are important structural parts of the debate (like topic descriptions and speeches), and they all have their own unique identifiers.
  • The biggest challenge in our method was creating a query representation of the speech that contains enough meaning and context to retrieve, and distinguish between, the large number of media articles that cover topics from parliament and politics. We should stress that debate speeches and newspaper articles are generally completely different types of documents in style and scope, so computing document similarity directly does not work. Speeches can contain a large number of NEs and digressions, which makes it hard to identify the right context for each speech, whereas newspaper articles (especially the ones that report on topics from the parliament) are very strict and concise: words are used sparingly.
  • For each speech inside a debate segment (called PartOfDebate in our method) we extract ten words that represent one topic discussed in the speech. In addition, all speeches contained in one debate segment are concatenated into one text, and a set of ten words that represents one topic of the debate segment as a whole is extracted from that text.
  • The query is made automatically by analyzing the debate document. We create four different groups (vectors) that are joined into a single query. The data for the query comes from different structural elements of the debate, as can be seen in the picture.
  • (1) Transforming the debates to RDF, conforming to the semantic model we made for this case. (2) Topics: enriching the existing debate metadata with topics. (3) Preselection of articles: when were the candidate articles published and who spoke in the debate (timeframe and speakers)? (4) Automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from the speech text) by comparing vectors of topics and named entities. (5) Link creation: links are created between a speech and an article if the similarity score is above a threshold t (similarity measures used: cosine similarity and overlap). The automatically made links are written back into the RDF files representing the debates. Add that everything is now published in an RDF store.
  • To gain insight into the quality and added value of the various steps of the linking method described in the previous section, we performed experiments with three versions of the method. Specifically, we varied which information is used to rank the candidate articles (named entities (NEs), topics) and whether the partOf relations between speeches and larger parts of debates are used to also include information associated with these larger parts (debate segments). Experiment 1: NEs in speech. In the simplest form of our method, we rank articles based only on the NEs found in the speech. Experiment 2: NEs + topics in speech. Here, we include not only NEs but also topics detected for the speech. Experiment 3: NEs + topics in speech and debate. We include not just NEs and topics extracted from the speech itself but also NEs extracted from the debate context and topics extracted from all speeches in this context.
  • Table 1 shows the average number of relevant, partially relevant and unrelated links found in the experiments. Using just NEs from the speech (experiment 1) gives a lot of unrelated links, and thus a low precision score of 48%. In [17], the authors stated that NEs play an important role in news documents. They wanted to exploit that characteristic by considering them as the only distinguishing features of the documents. In our experiments we found that using just NEs is not enough to distinguish between newspaper articles. When we include topics extracted from the speech (experiment 2), precision increases to 62%. Finally, in experiment 3, we leverage the debate structure. We used NEs and topics from debate descriptions to create a query that is more specific than both previous queries. The resulting precision is the highest, at around 80%.
  • To calculate recall we had to conduct a different kind of evaluation. It is infeasible to manually assess the relevance of all of the close to 1 million articles in the archive. Therefore, we chose an approach where an annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query. As with our automatic approach, we limit ourselves to articles published within 7 days of the debate day. For this experiment, we arbitrarily chose five speeches, for which we retrieved a total of 115 newspaper articles. We repeated experiment 3 on this smaller set of 5 speeches/115 articles. Precision on this set was 75%, which is in line with the results of experiment 3. Recall was 62%. Another good indication of the recall of our approach is the number of links returned. Our approach resulted in 5887 links to articles when using the settings of experiment 1, 4449 when using the settings of experiment 2 and 3804 for experiment 3.
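The precision and relative-recall figures in the evaluation notes above can be illustrated with a small Python sketch. It is not the authors' evaluation code. The function names, the `min_relevant` parameter and the article identifiers are hypothetical, and the notes do not say whether "partially relevant" links (score 1) count towards precision, so that choice is left as a parameter.

```python
from typing import Iterable, Set


def precision(scores: Iterable[int], min_relevant: int = 1) -> float:
    """Share of judged links whose score reaches min_relevant.

    Scores follow the scale on the slides: 2 = relevant,
    1 = partially relevant, 0 = unrelated.
    Assumption: whether score 1 counts as a hit is configurable.
    """
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(s >= min_relevant for s in scores) / len(scores)


def relative_recall(auto_links: Set[str], manual_relevant: Set[str]) -> float:
    """Share of manually found relevant articles that the method also linked.

    manual_relevant holds the articles an annotator judged relevant after
    running a hand-written query for the same speech.
    """
    if not manual_relevant:
        return 0.0
    return len(auto_links & manual_relevant) / len(manual_relevant)
```

Recall here is relative to what the annotator's own query retrieved, not to the whole archive, which is why the notes describe it as a different kind of evaluation.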
  • ICWE2013 - Discovering links between political debates and media

    1. 1. DISCOVERING LINKS BETWEEN POLITICAL DEBATES AND MEDIA Damir Juric, Delft University of Technology Laura Hollink, VU University Amsterdam Geert-Jan Houben, Delft University of Technology ICWE2013
    2. 2. The PoliMedia project: linking politics to media
    3. 3. PoliMedia research questions • How is a person, subject or process covered & visualised by the media? • How do debates and arguments develop over a longer period of time? • Analysing the changing ideas, arguments and presentation in different media
    4. 4. Issues with current approach • Go to different archives, look up original data!
    5. 5. Goal: explicit links to different media types in one system
    6. 6. Data Sets – Debates Handelingen der Staten-Generaal or Dutch Hansard from 1945-1995 Some provenance: 1. Transcripts are made of the complete debates of the Dutch parliament. 2. Published online by the government on http://www.statengeneraaldigitaal.nl/ (1818 - 1995) and http://officielebekendmakingen.nl/ (from 1995) 3. The PoliticalMashup project has translated the government PDF and TXT files into XML, incl. URIs as identifiers, see http://politicalmashup.nl/ 4. We build on that.
    7. 7. Structure of the debate data • who, when, what • identifiers for subparts of the debate • chronological order of speakers [Figure: a debate consists of metadata and topics (Topic 1, Topic 2); each topic holds the speeches in order (Speaker 1 / Content, Speaker 2 / Content, Speaker 3 / Content, Speaker 1 / Content). Example topic description: "Aan de orde is de behandeling van: - de brief van de minister van Economische Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend." Example speech text: "Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen van ons buitenlands beleid verwezenlijkt." NE sets shown: NEs={Economische Zaken, Borssele}; NEs={Borssele, Partij van de Arbeid, D66}]
    8. 8. Data sets – Media • Newspaper articles • at the National Library of the Netherlands • Many newspapers 1950-1995 • Text + images of newspaper layout
    9. 9. All data and links expressed as RDF • We have created a semantic model to capture the datasets and the links between them. • Reusing other vocabularies • Simple Event Model (SEM) • Dublin Core • FOAF • ISOCAT
    10. 10. All data and links expressed as RDF [Figure: RDF graph of an example debate. Classes: Debate, PartOfDebate, DebateContext, Speech. Resources: nl.proc.sgd.d.194519460000002, nl.proc.sgd.d.194519460000002.1, nl.proc.sgd.d.194519460000002.1.1, nl.proc.sgd.d.194519460000002.1.2, nl.proc.sgd.d.194519460000002.1.3, nl.proc.sgd.d.194519460000002.2. Values: http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002, http://statengeneraaldigitaal.nl/, http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf, nl.proc.sgd.d.19720000002, "Handelingen Verenigde Vergadering...", Dutch, 1945-11-20, "Mijnheer de Voorzitter, de Commissie van …", "De voorzitter opent de vergadering…", http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr. Properties: rdf:type, dc:id, dc:source, dc:publisher, dc:language, dc:date, hasPart, hasSubsequentSpeech, hasSpokenText, sem:hasActor, hasText, coveredIn, hasSubsequentPartOfDebate]
    11. 11. PoliMedia linking method • Debate speeches and newspaper articles are different types of documents, so default document similarity metrics are insufficient • Speeches contain many named entities, digressions. • Newspapers are formal and concise, words are used sparingly. • The challenge: how to create a representation of the speeches that contains enough information to be used as a query to retrieve the right media articles from the archive?
    12. 12. PoliMedia linking method • Our PoliMedia linking method consists of four steps: 1. topics: enriching the existing debate metadata with topics 2. preselection of articles: when the candidate articles were published and who spoke in the debate (timeframe and speakers)? 3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from speech text) by comparing vectors of topics and named entities 4. link creation: links are created between a speech and an article if the similarity score is above a threshold t
    13. 13. Topics The MALLET topic model package • Unsupervised analysis of text • “a Topic consists of a cluster of words that frequently occur together” • [see http://mallet.cs.umass.edu/topics.php] • Input: Text, Number of iterations, Number of topics/clusters • Output: Words that cluster around one topic. • Example: • Text: a speech in a debate from 1975 • number of iterations: 2000 • number of topics: 1
    14. 14. Automatic query creation [Figure: the query for a speech is assembled from four groups, labelled TopicSet Speech, NE Speech, TopicSet Topic and NE Topic, which come from the debate structure (Debate, Metadata, Topic 1, Topic 2, Speaker 1 / Content, Speaker 2 / Content, Speaker 3 / Content, Speaker 1 / Content). Other labels: Actor, Query, Debate, "came from". Example terms, in slide order: Kombrink rente inkomstenbelasting bronheffing vereenvoudiging tarief contourennota Nederland word tussen wetgeving sociale moeten fraude fraudebestrijding vraag misbruik ten gebruik kamer misbruik fraudebestrijding ismo-rapport Contourennota Kombrink EEG Netherlandse OESO-verband Nederland Contou Engwirda Couprie Midden-Oosten Euro-kapitaalmarkt Tariefnota Staatssecretaris Regering Financiën Zwitserland Brussel Grave]
    15. 15. PoliMedia pipeline [Figure: pipeline diagram. Labels: PoliticalMashup (XML); RDF semantic model; RDF files; NERs Speech; TopicSet Speech; NERs Topic; TopicSet Topic; contextual vectors; Query; NE; Stopword removal; Topic modeling; Query content; Expanded query creation; SRU Query (actor, date range); automatic query creation; KB (preselect data); similarity calculation; ranking; filtering; article metadata]
    16. 16. Evaluation • We tried three different approaches: • Experiment 1: NEs in speech • Experiment 2: NEs + topics in speech • Experiment 3: NEs + topics in speech and debate • Two independent evaluators: reading the speeches and articles linked to them and manually assessing their relatedness • Randomly selected 20 debates from our dataset of 10,924 debates (different subjects: from fraud in the social system to the European elections) • Each experiment: random 50 speeches • In total: 150 speech-article pairs, namely 3 sets of 50 each
    17. 17. Evaluation Results: • best approach: named entities (speech + debate descriptions) and topics (speech + debate) (2: relevant, 1: partially relevant, 0: unrelated)
    18. 18. Evaluation • Relative recall: • different evaluation: annotator reads a speech, manually creates a suitable query for it, and assesses the relevance of the articles returned for that query • Experiment 3 repeated on 5 speeches/115 articles: precision 75%, recall 62% • Number of links returned with the settings of experiment 3: 3804
    19. 19. Conclusion • Creation of links between two very different datasets: a dataset of political debates and a media archive • Linking method takes advantage of: • Debate content and metadata • Named Entities and Topics from the debates • semantic partOf structure of the debates • In experiments we have shown the added value of topics and debate structure • Produced links • different in nature than those produced by e.g. ontology alignment tools • Now: coarsely typed links • Future: nature and strength of the link
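Steps 2 to 4 of the linking method in the transcript above (preselection, ranking by similarity, thresholded link creation) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PoliMedia implementation. The function names, the article dictionary layout, the whitespace tokenisation, the symmetric 7-day window and the default threshold value are all hypothetical. Only cosine similarity is shown; the overlap measure mentioned in the notes is omitted.

```python
import math
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-terms vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_speech(
    query_terms: Iterable[str],  # topic words + named entities for one speech
    speaker: str,                # surname of the speaker (preselection filter)
    debate_day: date,
    articles: List[Dict],        # hypothetical layout: {"id", "date", "text"}
    t: float = 0.3,              # assumed threshold; the slides only name it t
    window_days: int = 7,        # notes: articles within 7 days of the debate
) -> List[Tuple[str, float]]:
    """Return (article id, score) pairs above threshold t, best first."""
    query = Counter(term.lower() for term in query_terms)
    links = []
    for article in articles:
        # Preselection: keep articles near the debate day that mention the speaker.
        if abs((article["date"] - debate_day).days) > window_days:
            continue
        tokens = [w.lower() for w in article["text"].split()]
        if speaker.lower() not in tokens:
            continue
        # Ranking: compare the query vector with the article's term vector.
        score = cosine(query, Counter(tokens))
        # Link creation: keep only scores above the threshold.
        if score > t:
            links.append((article["id"], score))
    return sorted(links, key=lambda pair: pair[1], reverse=True)
```

In the experiments described in the notes, the three variants differ only in what goes into `query_terms`: NEs from the speech, NEs plus topic words from the speech, or both of those plus NEs and topic words from the surrounding debate segment.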
