Polimedia Symposium - Linking the data sets

This presentation describes a method for link discovery that connects debate content (textual documents from the parliament) at the speech level with relevant articles. Relevant articles contain not just mentions of the speakers, but mentions of the speakers in the context of the topics or events that the politicians addressed in their parliamentary speeches. The method uses semantic and information retrieval techniques to generate automatic queries that capture the context of the parliamentary speeches, and uses those queries to search newspaper, radio and video data sets for connections between speeches and the media items that cover them.

Published in: Technology


Transcript

  • 1. PoliMedia Symposium: Linking the data sets. Damir Juric (TU Delft), Amsterdam, 23.01.2013.
  • 2. Background: the PoliMedia project
      • The PoliMedia project:
          – driven by research questions from historians
          – interested in media coverage across several types of media outlets
          – cross-media comparisons
              • conducted over a longer period of time, on different topics
              • focus on the coverage of the debates in the Dutch parliament
              • insight into the different choices that different media make while reporting on those debates
          – three phases:
              • modeling phase: creating a semantic model
              • data production phase: creating links between debates and associated media sources
              • application phase: searching and navigating linked datasets
  • 3. Introduction
      • The PoliMedia semantic model needs to represent:
          – people
          – topics
          – time
          – media types
      • The model has to be expressive enough:
          – to describe events from the Dutch parliament
  • 4. Data Sets
      • Primary data set:
          – the Dutch parliamentary debates (Handelingen der Staten-Generaal, or Dutch Hansard)
          – transcripts of the speeches that politicians gave in parliament
          – this project uses data from the Political Mashup
          – all debates until the year 1995:
              • published as XML documents (OCR of satisfactory quality is used)
              • the data shows a fine-grained structure
  • 5. Data Sets
      • Secondary data set:
          – different media types:
              • newspaper articles and radio bulletins (National Library of the Netherlands)
              • newscasts (evening news and current affairs programs)
  • 6. Semantic model: Goals
      • Goal of the project:
          – to publish the links on the Web
          – to use open Web formats and standards
          – Web query language
          – unique identifiers (URIs)
      • The model has to be expressive:
          – important information regarding parliamentary debates should be easily accessible
  • 7. Debate: The structure
      [Diagram: a debate with its metadata, divided into topics (Topic 1, Topic 2); each topic holds the speeches of its speakers (Speaker 1 / Content, Speaker 2 / Content, Speaker 3 / Content). Named entities (NEs) are annotated on the topic description and on the speeches.]
      • Topic description example, NEs = {Economische Zaken, Borssele}: "Aan de orde is de behandeling van: de brief van de minister van Economische Zaken inzake Borssele (16226, nr. 26). De beraadslaging wordt geopend."
      • Speech example, NEs = {Borssele, Partij van de Arbeid, D66}: "Mijnheer de Voorzitter! Met de verdragen tot uitbreiding van de EEG met Denemarken, Engeland, Ierland en Noorwegen wordt een van de doelstellingen van ons buitenlands beleid verwezenlijkt."
  • 8. Semantic model: Description. Part of the semantic model representation of the debates dataset
  • 9. Semantic model: Description. Semantic model representation of the debates dataset
  • 10. Polimedia linking method
      • The challenge: how to create a representation of the speech that contains enough information to be used as a query for retrieving relevant media articles from the archive?
      • Debate speeches and newspaper articles are generally different types of documents that differ in style and scope (so computing plain document similarity does not work)
      • Speeches can contain a large number of NEs and digressions:
          – problem: hard to distinguish the right context for each speech
      • Newspaper articles:
          – very strict and concise
          – words are used sparingly
  • 11. Polimedia linking method
      • Our PoliMedia linking method consists of four steps:
          1. topics: enriching the existing debate metadata with topics
          2. preselection of articles: when were the candidate articles published and who spoke in the debate (timeframe and speakers)?
          3. automatic query creation: candidate articles are ranked based on similarity to the query (automatically created from the speech text) by comparing vectors of topics and named entities
          4. link creation: links are created between a speech and an article if the similarity score is above a threshold t
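Steps 2 and 4 of the linking method can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Article` record, its field names, the seven-day window after the debate and the plain substring match on the speaker name are all assumptions made for the example. The similarity scores used in step 4 are taken as given.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Article:
    """Hypothetical article record; the field names are illustrative only."""
    article_id: str
    published: date
    text: str


def preselect(articles, speaker, debate_date, window_days=7):
    """Step 2: keep articles published in the window after the debate
    that mention the speaker (window length is an assumption)."""
    end = debate_date + timedelta(days=window_days)
    return [
        a for a in articles
        if debate_date <= a.published <= end
        and speaker.lower() in a.text.lower()
    ]


def create_links(speech_id, scored_articles, threshold):
    """Step 4: create a link for every article whose score exceeds t."""
    return [
        (speech_id, article.article_id, score)
        for article, score in scored_articles
        if score > threshold
    ]
```

Preselection keeps the expensive similarity computation limited to a small candidate set, and the threshold t then decides which of the ranked candidates become links.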
  • 12. Topics
      • Topic modeling:
          – a popular tool for the unsupervised analysis of text
          – used to check models, summarize the corpus, and guide exploration of its contents
          – topic models lead to semantically meaningful decompositions of text because they tend to place high probability on words that represent concepts
      • Extracting topics from a speech:
          – ten words that represent one topic discussed in the speech are extracted
          – all speeches contained in one debate segment are concatenated into one text
          – a set of ten words that represents one topic of the debate segment as a whole is extracted from that text
      • Input: text / number of iterations / number of topics
      • Output: generic names for topics / words that cluster around one topic
      • Example:
          – test case: debate nr. 1975 / number of iterations: 2000 / number of topics: 1
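The slide does not name the topic-modeling tool, so the sketch below is only a stand-in: it concatenates the speeches of a segment and returns the ten most frequent non-stopword tokens. With a single topic, as in the test case, every token belongs to that one topic, so ranking words by frequency gives the same order as a smoothed single-topic word distribution. The stopword list and the minimum token length are assumptions for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run needs a full Dutch list.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "op", "dat", "is", "te"}


def top_topic_words(speeches, n_words=10):
    """Concatenate the speeches and return the n most frequent content words."""
    text = " ".join(speeches).lower()
    # Letters only: digits, punctuation and underscores are dropped.
    tokens = re.findall(r"[^\W\d_]+", text)
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [word for word, _ in counts.most_common(n_words)]
```

Passing a single speech yields the speech-level word set; passing all speeches of a debate segment yields the segment-level set.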
  • 13. Automatic query creation
      [Diagram: the debate structure (Metadata, Topic 1, Topic 2, Speaker / Content blocks) next to the term sets extracted for a speech: NE Speech, TopicSet Speech, NE Topic and TopicSet Topic, plus the Actor. Sample terms shown include: Staatssecretaris, Regering, Euro-kapitaalmarkt, Tariefnota, Financiën, Zwitserland, fraude, wetgeving, Brussel, inkomstenbelasting, EEG, bronheffing, misbruik, Kombrink, Contourennota, rente, fraudebestrijding, Nederland, vereenvoudiging, OESO-verband, tarief, Midden-Oosten, Engwirda, Couprie.]
  • 14. Automatic query creation
      • Generated query:
        Scholten +(text:wetsontwerp text:later text:september text:prijzen text:lonen text:ontwikkeling text:zeggen text:staatssecretaris text:gebracht text:ertoe) (title:wetsontwerp title:later title:september title:prijzen title:lonen title:ontwikkeling title:zeggen title:staatssecretaris title:gebracht title:ertoe) +(text:staatssecretaris text:huurverhoging text:jaar text:moeten text:april text:uitstel text:percentage text:nieuwe) (title:staatssecretaris title:huurverhoging title:jaar title:moeten title:april title:uitstel title:percentage title:nieuwe) +(text:regelen text:wet) (title:regelen title:wet) +text:staatssecretaris title:staatssecretaris
      • Source speech:
        "Mijnheer de Voorzitter! In de memorie van toelichting bij het voorliggende wetsontwerp zegt de Staatssecretaris, dat hij over het trendmatige huurstijgingspercentage voor 1977 nog niets kan zeggen omdat de gegevens over de te verwachten ontwikkeling van lonen en prijzen voor 1977 nog niet bekend zijn. Dit is gedateerd 14 september. Impliceert dit, wanneer er geen sprake zou zijn van een wetsontwerp tot verschuiving van de ingangsdatum, dan ook ten aanzien van de 8 procent per 1 april zou gelden dat nog afgewacht moet worden, of het dat percentage zal worden, omdat men pas later iets meer weet over de ontwikkeling van lonen en prijzen? De Staatssecretaris voelt zich door dit wetsontwerp eigenlijk gedwongen op een vrij vroeg tijdstip toch daarover iets te zeggen. Immers, een week later namelijk bij brief van 21 september komt hij wel met een bepaald concreet voorstel. Daarin stelt hij: Het overleg met de vaste commissie heeft mij ertoe gebracht ..."
      • Diagram labels: ExpandedQuery = NERs Speech, TopicSet Speech, NERs Topic, TopicSet Topic; + Speaker X = ActorFromSpeech; TimeFrame
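The query on this slide resembles Lucene-style syntax, where a leading `+` marks a required clause and `field:term` restricts a term to one field. Under that assumption, a query of the same shape can be assembled from the speaker name and the term sets as sketched below. The function names are hypothetical, and the choice to make the `text` clause required and the `title` clause optional is read off the example rather than documented.

```python
def field_clause(terms):
    """Render one term set as a required text clause plus an optional title clause."""
    text_part = " ".join(f"text:{t}" for t in terms)
    title_part = " ".join(f"title:{t}" for t in terms)
    if len(terms) > 1:
        # Multi-term sets are grouped in parentheses, as in the slide's query.
        text_part, title_part = f"({text_part})", f"({title_part})"
    return f"+{text_part} {title_part}"


def build_expanded_query(actor, term_sets):
    """Join the speaker name with one clause per non-empty term set."""
    clauses = [field_clause(terms) for terms in term_sets if terms]
    return " ".join([actor] + clauses)


# Usage: the last two term sets of the slide's example query.
query = build_expanded_query("Scholten", [["regelen", "wet"], ["staatssecretaris"]])
```

Each term set (named entities or topic words, from the speech or from the debate topic) becomes its own clause, so an article has to match something from every set rather than only from the largest one.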
  • 15. Example of the relevant article
      vvd: van dam baseert beleid teveel op rossige prognoses van planbureau
      kamermeerderheid tegen uitstel van huurverhoging
      den haag — een meerderheid van de tweede kamer voelt er niets voor de huurverhoging van volgend jaar uit te stellen van 1 april tot 1 juli. de fracties van kvp, arp, chu, vvd, ds7o en de kleine christelijke pairtijen willen de huurverhoging op 1 april laten doorgaan. staatssecretaris van dam van volkshuisvesting wil uitstel om op 1 juli volgend jaar een nieuw huurbeleid te kunnen invoeren. daarvoor zal hij op korte tetmijn drie wetsontwerpen indienen: de huurprijzenwet, de wet op de huurcommissie en een wijziging van het burgerlijk wetboek. de bewindsman zei dat met het afwijzen van uitstel in feite invoering van het nieuwe huurbeleid op 1 juli volgend jaar onmogelijk wordt gemaakt. het nieuwe stelsel zal dan pas in 1978 ingevoerd kunnen worden. „met een uitstel van drie maanden komen we precies uit", aldus de heer van dam. de arper scholten, die mede namens kvp en chu sprak, zegde de regering alle medewerking toe om de nieuwe huurwet nog in deze kabinetsperiode te behandelen, maar hij twijfelde eraan of op 1 juli 1977 het nieuwe huurbeleid al ingevoerd kan worden. de confessionelen en de vvd houden vast aan een huurverhoging van 8 procent op 1 april. staatssecretaris van dam wil pas op 1 juli zon verhoging. zou de verhoging toch op 1 april moeten ingaan, dan wil hij een verhoging van 7 procent. de bewindsman komt volgens de confessionelen te vroeg met een verlaging van de jaarlijkse huurverhoging. het d66-kamerlid nypels diende een motie in waarin hij de regering verzoekt bij verwerping van het uitstel te komen met een wetsontwerp voor 7 procent op 1 april. ook de pvdaet kombrink suggereerde deze oplossing. de heer kombrink deed een dringend beroep op de confessionelen om het uitstel te aanvaarden. de vvder de beer vond dat elr niet voldoende redenen zijn voor uitstel van de huurverhoging. de staatssecretaris baseert zijn beleid teveel op „de rossige prognoses van het centraal plan bureau", vindt de vvd.
      ook de christendemocraten vinden dat van dam teveel van prognoses uitgaat die vaak te laag zijn. de pvda is het met de regering eens dat de huren op 1 juli met 8 procent omhoog moeten. wijst de kamer dat af, dan moeten de huren op 1 april met 7 procent worden verhoogd. men moet niet alleen kijken naar de ontwikkeling van lonen en prijzen, men moet ook kijken naar het vrij besteedbare inkomen. de stijging daarvan zal in de komende jaren uiterst gering zijn", zei kombrink. cpn-woordvoerder draagstra zei dat de huren bevroren moeten worden op het huidige peil.
  • 16. Polimedia pipeline
      [Diagram of the pipeline. Labels: semantic model; RDF files; PoliticalMashup RDF (XML); KB (preselect data); Query NE; Query content; SRU query (actor, date range); article metadata; expanded query creation (NERs Speech, NERs Topic, TopicSet Speech, TopicSet Topic); topic modeling; stopword removal; automatic query creation; contextual vectors; similarity calculation; ranking; filtering.]
  • 17. Evaluation
      • We tried three different approaches:
          – Experiment 1: NEs in speech
          – Experiment 2: NEs + topics in speech
          – Experiment 3: NEs + topics in speech and debate
      • Conclusion:
          – best approach: named entities (speech + debate descriptions) and topics (speech + debate)
  • 18. Results discussion
      • Structural elements of the transcript:
          – used to create a complex and rich query from the speech
      • Treating a particular speech as part of a bigger context (the conversation) and creating a query that is a mixture of those elements:
          – higher number of related articles retrieved
      • What did we learn?
          – the definition of a link can be vague
          – simple document similarity methods do not work
          – journalists use their own "compression" methods when writing about debates
          – long speeches with dozens of NEs and topics are sometimes represented by a few concise sentences
  • 19. End
      • Thank you for listening
      • More information on polimedia.nl
  • 20. Similarity measures
      • Similarity measure: a metric that measures similarity or dissimilarity (distance) between two text strings, used for approximate string matching or comparison and in fuzzy string searching
      • Given two segments, the expanded query Q and a document D from the media archive, with a term frequency tf(t, ·) associated with each term t in Q and in D, the similarity between Q and D is computed with the cosine similarity formula, whose value varies between 0 and 1:
        $$\mathrm{CosineSimilarity}(Q,D) = \frac{\sum_{t} \mathrm{tf}(t,Q)\,\mathrm{tf}(t,D)}{\sqrt{\sum_{t} \mathrm{tf}(t,Q)^2}\;\sqrt{\sum_{t} \mathrm{tf}(t,D)^2}}$$
      • BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). Given a query Q containing keywords q_1, ..., q_n, the BM25 score of a document D is:
        $$\mathrm{BM25Score}(Q,D) = \sum_{t=1}^{n} \mathrm{IDF}(q_t)\,\frac{f(q_t,D)\,(k_1+1)}{f(q_t,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$
      • where f(q_t, D) is the term frequency of the term q_t in the document D, |D| is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters. IDF(q_t) is the inverse document frequency weight of the query term q_t.
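The cosine and BM25 measures described on this slide can be sketched over tokenized text as follows. The slide does not specify the IDF variant or the parameter values, so the smoothed, non-negative IDF and the defaults k1 = 1.2 and b = 0.75 used here are assumptions (commonly used choices, not the project's settings). Query keywords are assumed to be distinct and the corpus non-empty.

```python
import math
from collections import Counter


def cosine_similarity(query_tokens, doc_tokens):
    """Cosine similarity of raw term-frequency vectors; lies in [0, 1]."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
        math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25 score of one document; corpus is a list of token lists."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:  # keywords assumed distinct
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF variant (assumption); never negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```

Cosine similarity treats query and document symmetrically, while BM25 weights rare query terms more heavily and dampens the effect of long documents through b and avgdl.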
  • 21. Similarity measures
      • The overlap coefficient is a similarity measure related to the Jaccard index that computes the overlap between two sets, defined as follows:
        $$\mathrm{overlap}(Q,D) = \frac{|Q \cap D|}{\min(|Q|,\,|D|)}$$
      • If Q is a subset of D, or the converse, the overlap coefficient is equal to one.
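A minimal sketch of the overlap coefficient, using its standard definition |Q ∩ D| / min(|Q|, |D|) over the sets of query terms and document terms. Returning 0.0 when either set is empty is a choice made for this example, since the measure is undefined in that case.

```python
def overlap_coefficient(q_terms, d_terms):
    """Overlap coefficient of two term collections, compared as sets."""
    q, d = set(q_terms), set(d_terms)
    if not q or not d:
        return 0.0  # undefined for empty sets; 0.0 chosen for this sketch
    return len(q & d) / min(len(q), len(d))
```

Because the denominator is the size of the smaller set, a short query whose terms all appear in a long article scores one, which suits the case of concise queries matched against longer documents.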