
Presentation SURF Research and Innovation Event 2013. February 28, The Hague University of Applied Sciences.

Maarten Marx is Assistant Professor of Computer Science at the University of Amsterdam.

  1. PoliticalMashup. Open Official Documents: Requirements and Opportunities. Maarten Marx, Universiteit van Amsterdam. Istanbul, EEOP (@LREC), 2012-05-27.
  2. Content
     • Official Documents: zoom in on a specific official publications dataset.
     • Opportunities: what makes official publications data valuable?
     • Requirements: what is needed to make official publications data reusable and interoperable?
  3. Our Leading Research Question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al 2010]
  4. W3C recommendations on Open Government Data
     • make data both machine and human readable;
     • link data, make data linkable, provide permanent identifiers for each government object and data item;
     • provide metadata using common standards (e.g. Dublin Core);
     • make the data as easy to reuse (e.g. in mashups) as possible.
     Goal of this talk: make this concrete.
  5. Value of a large data corpus
     • Consider a 200-year corpus of temperature and humidity readings in one location.
     • Value is not in the individual “documents”.
     • Value is not in the corpus as a whole.
     • Value is in the relation between the “documents”.
  6. Documents related by publication date: Google Books Ngram Viewer
  7. Properties of our Parliamentary Proceedings Dataset
  8. Longitudinal data
     • weekly measurements for over 150 years
     • very stable measurement procedure and data model
  9. Data about human behaviour
  10. Often rather boring
  11. But sometimes full of drama and excitement
  12. Loads of measurement points: 24,000 days, 450,000 topics, 7.5 million speeches
  13. Digitally available
  14. About this collection
      • very sparse available metadata
      • very rich “metadata” sits hidden inside the raw data
      • rich data model, in order: Meeting (1 day), Topic, Stage direction, Scene, Stage direction, Speech, Paragraph
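The data model listed on this slide can be sketched as nested records. This is a minimal sketch, not the PoliticalMashup schema: the nesting (a topic contains stage directions and scenes, a scene contains stage directions and speeches) is an assumption read from the order of the slide's list, and every class and field name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Speech:
    speaker: str  # hypothetical field: who is speaking
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class StageDirection:
    text: str  # hypothetical field: a non-spoken remark in the record


@dataclass
class Scene:
    # Assumed nesting: stage directions and speeches in document order.
    items: List[Union[StageDirection, Speech]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    # Assumed nesting: stage directions and scenes in document order.
    items: List[Union[StageDirection, Scene]] = field(default_factory=list)


@dataclass
class Meeting:
    date: str  # ISO date; one meeting per day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Walk the hierarchy and count the paragraphs of all speeches."""
    total = 0
    for topic in meeting.topics:
        for item in topic.items:
            if isinstance(item, Scene):
                for sub in item.items:
                    if isinstance(sub, Speech):
                        total += len(sub.paragraphs)
    return total
```

Keeping the speaker on the speech rather than on each paragraph mirrors the idea that metadata is attached once, at the level where it is known.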
  15. Very rich metadata for each word
      For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings:
      1) when it was said,
      2) who said it,
      3) in what function,
      4) speaking on behalf of which party,
      5) in which context, and
      6) who was actively present during the speech act.
  16. How to exploit the extra metadata and structure?
      • Let’s consider a simple killer app . . .
  17. Political n-gram viewer
      • From every word we know both the date and the speaker.
      • Every speaker belongs to a political party.
      • 3D n-gram viewer: political spectrum vs time vs word-count
      • Use: topic ownership, agenda setting, framing
  18. Political n-gram viewer: requirements
      • Documents: (1) metadata: date of the meeting; (2) document structure: for every spoken word, who said it.
      • Linked Data: speakers’ names are disambiguated, normalized and mapped to a database with temporal party information.
      • Completeness and correctness: few missing or wrong data, also for the distant past.
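Once date and party are known for every speech, the counting behind a political n-gram viewer is simple. The following is a minimal sketch for single words only; the input format (date, party, text) and the function name are hypothetical, and real proceedings would need proper tokenization.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each speech is (ISO date, party, text); a hypothetical toy input format.
SpeechRecord = Tuple[str, str, str]


def term_counts(speeches: Iterable[SpeechRecord], term: str) -> Dict[Tuple[int, str], int]:
    """Count occurrences of one word per (year, party)."""
    counts: Counter = Counter()
    needle = term.lower()
    for iso_date, party, text in speeches:
        year = int(iso_date[:4])
        # Naive whitespace tokenization, case-insensitive.
        tokens = text.lower().split()
        counts[(year, party)] += tokens.count(needle)
    return dict(counts)
```

The result is keyed by year and party, which gives the time axis and the political-spectrum axis; the count is the third dimension.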
  19. Is Linked (Open) Data the solution?
      • Link the speaker’s name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph, and [Spitkovsky, Chang, LREC 2012].
      • DBpedia extracts the link between a person and their party affiliation from the Wikipedia infobox.
      • Timestamped triple: Geert Wilders is party member of VVD from 1998-08-25 until 2004-09-02.
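A timestamped fact like the one on this slide is easier to handle as a record with a validity interval than as a plain triple. This is a minimal sketch: the `Membership` type and `party_on` function are hypothetical, and treating the end date as inclusive is an assumption. The only data used is the fact stated on the slide.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Membership(NamedTuple):
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership has no known end


def party_on(memberships: List[Membership], person: str, day: date) -> Optional[str]:
    """Return the party a person belonged to on a given day, if known."""
    for m in memberships:
        if m.person != person:
            continue
        # Assumption: both interval ends are inclusive.
        if m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


# The one fact stated on the slide, as a timestamped record.
FACTS = [Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2))]
```

A lookup such as `party_on(FACTS, "Geert Wilders", date(2000, 1, 1))` is what the n-gram viewer needs for each speech: the party at the time of speaking, not the current one.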
  20. DBpedia not yet reliable
      • Data extraction is difficult, even from the infobox, even from complete data: Wikipedia page of Geert Wilders; DBpedia information about Geert Wilders. Notice the values of the party and the office attributes.
      • Timestamped facts are difficult to extract and difficult to represent in RDF triples.
  21. Lesson learned: requirement on metadata and relations
      • One cannot rely on Linked Open Data for good quality metadata.
      • Official documents should be self-describing, also for facts which are obvious at publication time.
      • Compare speaker’s data in the original (OCRed) data and in the XMLified and enriched version:
        • Original
        • Part of it in XML
        • And now for human consumption
  22. A few more applications
  23. Entity Profiling and Entity Search
      • Users search for entities, not for documents. [TREC Entity Track] [Balog et al 2009]
      • Main research questions: how to collect information on entities, how to model an entity, how to rank entities.
      • (Parsimonious) language models work well as models. [Balog et al, 2009] [Hiemstra et al, 2004]
      • Entity profiling: http://www.politiekinzicht.com
      • Entity search: http://ikkieswijzer.nl
  24. Content and structure search
      • Usual advanced search combines keyword search with metadata search.
      • Extra fields are just extra filters on the returned documents.
      • With structured documents we can do search on content and structure.
      • Most useful task: rank best entry points in large documents.
      • Compare two search systems on the same data: on flat text, and on an XML representation.
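Ranking entry points means returning the best elements inside a large document rather than the document as a whole. The sketch below illustrates this with Python's standard `xml.etree.ElementTree`; the element names (`meeting`, `topic`, `speech`, `p`), the sample content, and the raw term-frequency score are all assumptions for illustration, not the schema or the ranking used in the talk.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """
<meeting date="2012-05-27">
  <topic title="Budget">
    <speech speaker="A"><p>The budget needs reform.</p><p>Budget cuts hurt.</p></speech>
    <speech speaker="B"><p>We disagree.</p></speech>
  </topic>
</meeting>
""".strip()


def rank_speeches(xml_text: str, term: str) -> List[Tuple[str, int]]:
    """Rank speeches (entry points) by how often they contain the term."""
    root = ET.fromstring(xml_text)
    needle = term.lower()
    scored = []
    for speech in root.iter("speech"):
        text = " ".join(speech.itertext()).lower()
        words = [w.strip(".,;:!?") for w in text.split()]
        score = words.count(needle)
        if score > 0:
            scored.append((speech.get("speaker", ""), score))
    # Highest score first; ties keep document order because sorted() is stable.
    return sorted(scored, key=lambda pair: -pair[1])
```

On flat text the same query could only return the whole meeting; with explicit markup it can point at the individual speech.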
  25. Lesson learned: requirement on structure
      • Make semantically important structure of documents explicit in XML markup.
      • Publish for machine readability.
      • Publish generic data, not data prepared for one use-case.
  26. Application of structure: Interruption graph (Attackogram)
      • MP A interrupts B ⇐⇒ A speaks during the block of B.
      • Combined with entity profiling: http://debat.politiekinzicht.com/
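The definition on this slide translates directly into an edge count. This is a minimal sketch under an assumed input format: each block is given as its owner plus the ordered list of everyone who speaks inside it. The `Block` type and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A block is (owner, speakers in order of speaking inside that block).
Block = Tuple[str, List[str]]


def interruption_graph(blocks: List[Block]) -> Dict[Tuple[str, str], int]:
    """Count weighted edges (interrupter, interrupted) over all blocks."""
    edges: Counter = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            # Anyone other than the owner speaking in the block interrupts.
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return dict(edges)
```

The edge weights give a directed, weighted graph; who the block belongs to is exactly the kind of structure that has to be explicit in the markup for this to work.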
  27. Exploring and exploiting official documents
      • We saw what can be done with one well-curated collection.
      • What are the key infrastructural and research questions? In what direction and how to scale this up?
        1. in time
        2. in breadth
        3. in links
  28. Scale diachronically
      • Stable data model and measurement procedure make this data very valuable for diachronic comparisons.
      • Towards the past:
        • OCR
        • consistency in structure
        • more missing data to link to
      • Towards the future:
        • remain up to date
        • legacy decisions
  29. Scale in breadth, e.g., parliamentary proceedings of all European countries
      • All describe the same “script”, so all fit in one schema.
      • Main question: how to connect the data from different countries?
        • Common structure and annotation: use the same Relax NG schema.
        • Common values on certain attributes:
          • Entities: normalize to Wikipedia concepts
          • Controlled vocabulary keywords: normalize to Eurovoc
          • Language: machine translate to English
          • Events: normalize to EMM Newsexplorer query / Wikinews query
  30. Scale in breadth: link to related datasets
      • Link on time, entities, events, topics
      • Other official publications
      • News
      • User generated content
      • (In our case) promises of political actors: election manifestos
  31. Conclusions
      • There are ample opportunities for exploiting Official Publications.
      • Preprocessing and interlinking with other datasets is difficult and does not scale well:
        • High precision and recall is needed for many applications
        • Many text analysis and data-mapping tasks [MUC, TAC]
        • Every format needs its own transformer
        • Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors
      • High investment, but if done in a general way, high return and impact.
  32. Back to our research question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?
      Lessons learned:
      • Common, open, standardized, self-describing, machine readable,
      • not tied to a single application
      • linked, linked, linked
        • Not only shared attributes
        • but more importantly, shared data values
      • also store utterly obvious facts (10 years later they aren’t)
  33. How we can help (ourselves): help improve input data at the source
      • Push at the source (in UK: open government data; in Holland: all parliamentary data is now in XML . . . )
      • Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text-classification).
      • Emphasize importance of using shared standards. Future researchers will love you.
  34. Last Question. Official Publications: are they or ?