The document summarizes the PoliticalMashup project, which aims to connect promises and actions of politicians with societal reactions by integrating large datasets. It discusses using text analytics and XML techniques on datasets like Dutch parliamentary proceedings and election manifestos to enable automated analysis. Example applications include search, entity linking, and detecting promises by ministers. It also outlines several areas for natural language processing research using the datasets, such as topic detection and modeling populist language.
Semantic Archive Integration for Holocaust Research: the EHRI Research Infras...Vladimir Alexiev, PhD, PMP
The European Holocaust Research Infrastructure (EHRI) is a large-scale EU project that involves 23 institutions and archives working on Holocaust studies, from Europe, Israel and the US. In its first phase (2011-2015) it aggregated archival descriptions and materials on a large scale and built a Virtual Research Environment (portal) for Holocaust researchers based on a graph database.
In its second phase (2015-2019), EHRI2 seeks to enhance the gathered materials using semantic approaches: enrichment, coreferencing, interlinking. Semantic integration involves four of the 14 EHRI2 work packages and helps integrate databases, free text, and metadata to interconnect historical entities (people, organizations, places, historic events) and create networks. We will present some of the EHRI2 technical work, including critical issues we have encountered.
WP10 (EAD) converts archival descriptions from various formats to standard EAD XML; transports EADs using OAI PMH or ResourceSync; ingests EADs to the EHRI database; enables use cases such as synchronization; coreferencing of textual Access Points to proper thesaurus references
WP11 (Authorities and Standards) consolidates and enlarges the EHRI authorities to render the indexing and retrieval of information more effective. It addresses Access Points in ingested EADs (normalization of Unicode, spelling, punctuation; deduplication; clustering; coreferencing to authority control), Subjects (deployment of a Thesaurus Management System in support of the EHRI Thesaurus Editorial Board), Places (coreferencing to Geonames); Camps and Ghettos (integrating data with Wikidata); Persons, Corporate Bodies (using USHMM HSV and VIAF); semantic (conceptual) search including hierarchical query expansion; interconnectivity of archival descriptions; permanent URLs; metadata quality; EAD RelaxNG and Schematron schemas and validation, etc.
WP13 (Data Infrastructures) builds up domain knowledge bases from institutional databases by using deduplication, semantic data integration, semantic text analysis. It provides the foundation for research use cases on Jewish Social Networks and their impact on the chance of survival.
WP14 (Digital Historiography Research) works on semantic text analysis (semantic enrichment), text similarity (e.g. clustering based on Neural Networks, LDA, etc), geo-mapping. It develops Digital Historiography researcher tools, including Prosopographical approaches.
Reusing historical newspapers of KB in e-humanities - Case studies and exampl...Olaf Janssen
This slide deck gives an overview of Dutch e-humanities projects that build upon the datasets of the Koninklijke Bibliotheek, the national library of the Netherlands.
It focuses on 8 projects that reuse the digitized historical newspapers (1618-1995) of the KB.
It was presented on 7-1-2014 at the Huygens Institute for the History of the Netherlands (Huygens ING for short). This is an institute of the Royal Netherlands Academy of Arts and Sciences (KNAW) where around 100 scholars work in the largest humanities institute of the Netherlands.
Keywords: biland,delpher,e-humanities,elite network shifts,hirods,historical newspapers,isher,koninklijke bibliotheek,national library of the netherlands,open data,polimedia,political mashup,reuse,sealincmedia,translantis,washp
Building the PoliMedia search system; data- and user-drivenMaxKemman
Presentation at the eHumanities group at the Meertens Institute (Amsterdam) on Thursday 18 April 2013.
Analysing media coverage across several types of media outlets is a challenging task for (media) historians. A specific example of media coverage research investigates the coverage of political debates and how the representation of topics and people changes over time. The PoliMedia project (http://www.polimedia.nl) aims to showcase the potential of cross-media analysis for research in the humanities, by 1) curating automatically detected semantic links between four data sets of different media types, and 2) developing a demonstrator application that allows researchers to deploy such an interlinked collection for quantitative and qualitative analysis of media coverage of debates in the Dutch parliament.
These two goals reflect the two perspectives on the development of a search system such as PoliMedia: data-driven and user-driven. In this presentation, Laura Hollink (VU) will present the data-driven perspective of linking between different datasets and the research questions that arise in achieving this linkage: how to combine different types of datasets and what kind of research questions are made possible by the data? Max Kemman (EUR) will present the user-driven perspective: what benefits can scholars gain from linking these datasets? What are the user requirements for the PoliMedia search system and how was the system evaluated with scholars in an eye tracking study?
Presentation given at the Erasmus Studio Lunchseminar at the Erasmus University Rotterdam, the Netherlands, Tuesday 20 January 2015. The presentation gives an overview of the project 'PoliMedia' and 'Talk of Europe' and ends with a reflection on the use and creation of open datasets for academic research purposes. Presented by Martijn Kleppe, Astrid van Aggelen & Laura Hollink.
Bringing parliamentary debates to the Semantic WebLaura Hollink
Presentation of the paper 'Bringing parliamentary debates to the Semantic Web' by Damir Juric, Laura Hollink and Geert-Jan Houben at the workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE2012) in conjunction with the 11th International Semantic Web Conference 2012 in Boston, USA.
See also the homepage of the PoliMedia project: http://polimedia.nl/
Presentation of the Sense4us project at the 2nd European TA Conference - Berlin, 26 February 2015
"Policy Making in a Complex World:
The Opportunities and Risks Presented
by New Technologies"
Using Topic Modeling to Study Everyday "Civic Talk" and Proto-political Engag...Tuukka Ylä-Anttila
We present a two-step topic modeling method of analysing political articulations in everyday proto-political "civic talk" on online social media and interpreting them in terms of cultural and political sociology.
Words and More Words: Challenges of Big Data by Prof. Edie Rasmussenwkwsci-research
Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
Introduction to Research project PoliMediaMartijn Kleppe
Presentation about our research project 'PoliMedia - Interlinking multimedia for the analysis of media coverage of political debates'. Presented at the PoliMedia symposium, 23 January 2013, Amsterdam, the Netherlands
Beyond document retrieval using semantic annotations Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents as a response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information over particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing is able to convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
1. PoliticalMashup 1
PoliticalMashup
Connecting promises and actions of politicians and how
society reacts to them
Maarten Marx
Universiteit van Amsterdam
Groningen, α-informatica, 2011-03-11
2. PoliticalMashup 2
Content
• Overview PoliticalMashup project
• Zooming in on one cultural heritage dataset
• A few example applications
• Research ideas for NLP-scientists.
3. PoliticalMashup 3
Who am I?
• Political scientist turned computer scientist
• My field:
• Theory of XML Database Systems
• Semi Structured Information Retrieval
• Cooperation with
• Tweede Kamer
• Koninklijke Bibliotheek,
• historians at NIOD, DNPP
4. PoliticalMashup 4
PoliticalMashup project
• Large-scale data integration project
• Two-year NWO-funded infrastructure project, 2010-2012
• Partners: U. Amsterdam, Groningen and Tilburg
• Ongoing with irregular funding since 2008
5. PoliticalMashup 5
Goal of PoliticalMashup
• Making huge amounts of textual data available for
• large scale automatic quantitative data and content analysis
• done by scientists from the humanities and social sciences.
6. PoliticalMashup 6
Mashup of what and how?
• 4 data sources
Promises and actions of politicians
Reactions to those in the media and among the general public
• Connect data on
Political entities
Time
Topics
7. PoliticalMashup 7
Data sources
Promises
• Election manifestos, mostly scans, DNPP
• Party websites and blogs, Archipol
• Twitter of politicians
Actions
• Parliamentary proceedings, mostly scans, KB
Reactions
• News media
• User-generated content: forums, blogs, comments on news, Twitter
8. PoliticalMashup 8
Used techniques
• Text analytics and XML DB and IR technology
• Named entity recognition and normalization
• Data mining, Machine Learning, hand-crafted rules
• Natural Language Processing, Language Models
Make implicit structure and information explicit.
16. PoliticalMashup 16
De Handelingen der Staten Generaal (Dutch
Hansards)
17. PoliticalMashup 17
About this collection
• Available metadata is very sparse
• Very rich “metadata” sits hidden inside the raw data
• Rich data model
• Meeting (1 Day)
• Topic
• Stage direction
• Scene
• Stage direction
• Speech
• Paragraph
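A minimal Python sketch of how a nested data model like the one above might be traversed. The element names (meeting, topic, scene, speech, p, stage-direction), the attributes, and the sample text are assumptions made for illustration; the actual PoliticalMashup XML schema is not shown in these slides and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring Meeting > Topic > Scene > Speech > Paragraph.
SAMPLE = """<meeting date="2011-03-10">
  <topic title="Education">
    <stage-direction>The meeting is opened.</stage-direction>
    <scene speaker="A">
      <speech speaker="A" party="X">
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </speech>
      <stage-direction>Applause.</stage-direction>
      <speech speaker="B" party="Y">
        <p>An interruption.</p>
      </speech>
    </scene>
  </topic>
</meeting>"""


def iter_paragraphs(xml_text):
    """Yield (date, topic, speaker, text) for every paragraph inside a speech."""
    meeting = ET.fromstring(xml_text)
    for topic in meeting.iter("topic"):
        for speech in topic.iter("speech"):
            for p in speech.iter("p"):
                yield (meeting.get("date"), topic.get("title"),
                       speech.get("speaker"), p.text)


rows = list(iter_paragraphs(SAMPLE))
for row in rows:
    print(row)
```

Flattening the hierarchy into one row per paragraph makes the implicit context (day, topic, speaker) explicit for every unit of text, which is the kind of hidden "metadata" the slide refers to.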
18. PoliticalMashup 18
Same data: different views
• Raw data in PDF
• XML styled with stylesheet
• Machine readable XML format
20. PoliticalMashup 20
Content and structure search
• Combine IR style keyword search with restrictions on structure.
• E.g., return speeches by Wilders about Islam
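Combining a keyword condition with a structural restriction could look roughly like the following Python sketch. It assumes a hypothetical speech element with a speaker attribute and uses plain substring matching in place of a real IR engine; it is not the query language used by the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and invented sample sentences.
SAMPLE = """<meeting>
  <speech speaker="Wilders"><p>A speech about Islam.</p></speech>
  <speech speaker="Wilders"><p>A speech about taxes.</p></speech>
  <speech speaker="Other"><p>A reply about Islam.</p></speech>
</meeting>"""


def search(xml_text, speaker, keyword):
    """Return the text of speeches by `speaker` that contain `keyword`."""
    root = ET.fromstring(xml_text)
    hits = []
    # Structural restriction: only speech elements by the given speaker.
    # (Speaker names containing quotes would need escaping; ignored here.)
    for speech in root.findall(f".//speech[@speaker='{speaker}']"):
        text = " ".join("".join(p.itertext()) for p in speech.iter("p"))
        # Content restriction: case-insensitive substring match.
        if keyword.lower() in text.lower():
            hits.append(text)
    return hits


results = search(SAMPLE, "Wilders", "islam")
print(results)
```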
21. PoliticalMashup 21
Exhaustive data collection
• Example query for NIOD historians
• Search for paragraphs about fascisme OR nazisme OR dictatuur OR (nazi AND dictatuur) OR . . .
• Return a TSV file with, for each hit: date, speakername, speakerid, speaker-party, . . .
• NIOD query
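The export step can be sketched in Python as follows. Only the query terms and column names visible on the slide are used; the elided parts of the query and column list are not filled in. The sample paragraphs, speaker names, and identifiers are invented placeholders, and word-level matching stands in for whatever retrieval engine the project actually uses.

```python
import csv
import io
import re

FIELDS = ["date", "speakername", "speakerid", "speaker-party"]

# Invented sample data, for illustration only.
PARAGRAPHS = [
    {"date": "1938-05-01", "speakername": "Speaker A", "speakerid": "id-1",
     "speaker-party": "Party X", "text": "Het fascisme is een gevaar."},
    {"date": "1950-01-01", "speakername": "Speaker B", "speakerid": "id-2",
     "speaker-party": "Party Y", "text": "De begroting is vastgesteld."},
    {"date": "1946-02-03", "speakername": "Speaker C", "speakerid": "id-3",
     "speaker-party": "Party Z", "text": "De nazi bezetting was een dictatuur."},
]


def matches(text):
    """Boolean query: fascisme OR nazisme OR dictatuur OR (nazi AND dictatuur)."""
    words = set(re.findall(r"\w+", text.lower()))
    return (
        "fascisme" in words
        or "nazisme" in words
        or "dictatuur" in words
        or ("nazi" in words and "dictatuur" in words)
    )


def hits_to_tsv(paragraphs):
    """Write one tab-separated row of metadata per matching paragraph."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(FIELDS)
    for p in paragraphs:
        if matches(p["text"]):
            writer.writerow([p[f] for f in FIELDS])
    return out.getvalue()


tsv = hits_to_tsv(PARAGRAPHS)
print(tsv)
```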
22. PoliticalMashup 22
Link the proceedings to entities
• Who is speaking?
• Who says what to whom?
Applications
• Summary of one speaker
• On old OCRed data: Linking and resolving entities
23. PoliticalMashup 23
Application: Interruption graph (Attackogram)
• MP A interrupts B ⇐⇒ A speaks during the block of B.
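The definition above translates directly into edge counting. In this Python sketch a "block" is assumed to be a pair of the MP who holds the floor and the ordered list of everyone who speaks during that block; this representation and the sample data are illustrative assumptions.

```python
from collections import Counter


def interruption_edges(blocks):
    """Count edges (A, B): A speaks during a block that belongs to B."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


# Invented example: B holds the floor and is interrupted twice by A, and so on.
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),
    ("A", ["A", "C", "A"]),
    ("C", ["C"]),
]

edges = interruption_edges(blocks)
for (interrupter, owner), count in sorted(edges.items()):
    print(f"{interrupter} -> {owner}: {count}")
```

The resulting weighted, directed edges are what a graph drawing tool would need to render the interruption graph.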
25. PoliticalMashup 25
0) Topics
• Common European thesaurus http://eurovoc.europa.eu
• detection
• classification (sentence, paragraph, speech level)
26. PoliticalMashup 26
1) Populist language in parliament
• PhD Thesis Jan Jagers (2006).
27. PoliticalMashup 27
2) Automatically detecting promises (’toezegging’) by ministers in Parliament
• https://zoek.officielebekendmakingen.nl/kst-103196.pdf (page 56)
• Eerste Kamer has a nice database online: http://www.eerstekamer.nl/toezeggingen_2
28. PoliticalMashup 28
Example
The chair: I note that we have almost reached the end of this
meeting. We still have time to run through the promises
(toezeggingen). I ask everyone to check that nothing has been
overlooked. I will do this quickly, and afterwards we will briefly
discuss the follow-up. The promises.
After the summer, the bill will be before the House.
A letter will follow to inform the House of how the loss of
expertise will be prevented.
Minister Van Bijsterveldt-Vliegenthart: I did not promise that.
Certainly not. No, I will not do that, because I did not promise
that.
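A deliberately naive baseline for this task might look for Dutch cue words derived from "toezeggen" and treat a negation in the same sentence as a denial. The Python sketch below is an assumption about one possible starting point, not a method from the project; the cue and negation lists are hand-picked and incomplete (for example, the separable form "zeg ... toe" is not covered).

```python
import re

# Hand-picked cue forms: toezegging(en), toezeggen, toegezegd.
CUE = re.compile(r"\b(?:toezeg\w*|toegezegd)\b", re.IGNORECASE)
# Hand-picked negation words: niet (not), geen (no), nooit (never).
NEGATION = re.compile(r"\b(?:niet|geen|nooit)\b", re.IGNORECASE)


def classify(sentence):
    """Label a sentence as 'denied', 'promise-cue' or 'none'."""
    if not CUE.search(sentence):
        return "none"
    if NEGATION.search(sentence):
        return "denied"
    return "promise-cue"


examples = [
    "Dat heb ik niet toegezegd.",                       # "I did not promise that."
    "Ik heb de Kamer een brief toegezegd.",             # invented: "I promised the House a letter."
    "Na de zomer ligt het wetsvoorstel bij de Kamer.",  # "After the summer, the bill will be before the House."
]
labels = [classify(s) for s in examples]
for sentence, label in zip(examples, labels):
    print(label, "|", sentence)
```

The last sentence is listed by the chair as a promise yet contains no cue word, so the baseline labels it "none". That gap is why the task calls for more than keyword spotting.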
29. PoliticalMashup 29
3) Opinion detection
• Detect opinions expressed about entities and topics. (Speaker is
known)
• Detect reported speech.
30. PoliticalMashup 30
4) Detect type of speech
• Interruption, attack, answer, speech (“betoog”), ’stage-direction’,
...
• http://data.politicalmashup.nl/debates/nl/h-ek-19961997-37-58.1-tijdslijn.html
31. PoliticalMashup 31
5) Detect “bullshit”
• Tautologies . . .
• Regels zijn regels ("rules are rules"), Op is op ("gone is gone")
• p→p
• het is wat het is ("it is what it is")
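Surface patterns of the form "X is X" can be caught with a regular expression and a backreference. This Python sketch covers only the three Dutch phrasings quoted on the slide; the patterns are illustrative assumptions and will miss most real tautologies.

```python
import re

PATTERNS = [
    # "X is X" / "X zijn X", e.g. "Regels zijn regels", "Op is op".
    re.compile(r"\b(\w+) (?:is|zijn) \1\b", re.IGNORECASE),
    # "X is wat X is", e.g. "het is wat het is".
    re.compile(r"\b(\w+) is wat \1 is\b", re.IGNORECASE),
]


def find_tautologies(text):
    """Return every substring that matches one of the tautology patterns."""
    return [m.group(0) for pattern in PATTERNS for m in pattern.finditer(text)]


found = find_tautologies("Regels zijn regels, Op is op")
print(found)
print(find_tautologies("het is wat het is"))
```

With re.IGNORECASE the backreference also matches case-insensitively, so "Regels ... regels" counts as a repetition of the same word.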
32. PoliticalMashup 32
6) Spelling normalization
• Dutch had many spelling reforms.
• Leads to lower recall.
• Search in new spelling, return results in old spellings.
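One simple way to search in the new spelling and still retrieve old-spelling text is to expand each query term with known historical variants before matching. The variant table in this Python sketch is a tiny hand-picked illustration (older Dutch spellings such as "mensch" and "zoo"), not a resource from the project; a real system would need a full lexicon or rewrite rules.

```python
import re

# Modern form -> older spellings (illustrative, far from complete).
OLD_VARIANTS = {
    "mens": ["mensch"],
    "nederlands": ["nederlandsch"],
    "zo": ["zoo"],
}


def expand(term):
    """Return the query term plus its known old-spelling variants."""
    term = term.lower()
    return [term] + OLD_VARIANTS.get(term, [])


def search(term, documents):
    """Return documents containing the term in any known spelling."""
    variants = set(expand(term))
    return [doc for doc in documents
            if variants & set(re.findall(r"\w+", doc.lower()))]


docs = ["De mensch is vrij.", "De mens is vrij.", "Het huis is groot."]
hits = search("mens", docs)
print(hits)
```

Expanding the query rather than rewriting the archive leaves the historical text untouched while recovering the recall lost to spelling reforms.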
33. PoliticalMashup 33
Lots of data available: happy to share
• Now: 15 years of Dutch Parliamentary Proceedings in rich XML
• Now: 200 more years in poorer XML, slowly getting richer.
• Parliamentary proceedings from the EU (15y), UK (75y), Spain (40y), Scandinavian countries, . . .
• Election manifestos (provincial elections 2007 and 2011)
• All tweets, blogs, Flickr and YouTube content of all Dutch national politicians from the past 1.5 years.