PoliticalMashup

Presentation SURF Research and Innovation Event 2013. February 28, The Hague University of Applied Sciences.

Maarten Marx is Assistant Professor of Computer Science at the University of Amsterdam.

  1. PoliticalMashup. Open Official Documents: Requirements and Opportunities. Maarten Marx, Universiteit van Amsterdam. Istanbul, EEOP (@LREC), 2012-05-27
  2. Content
     • Official Documents: zoom in on a specific official publications dataset.
     • Opportunities: what makes official publications data valuable?
     • Requirements: what is needed to make official publications data reusable and interoperable?
  3. Our Leading Research Question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al 2010]
  4. W3C recommendations on Open Government Data
     • make data both machine and human readable;
     • link data, make data linkable, provide permanent identifiers for each government object and data item;
     • provide metadata using common standards (e.g. Dublin Core);
     • make the data as easy to reuse (e.g. in mashups) as possible.
     Goal of this talk: make this concrete.
  5. Value of a large data corpus
     • Consider a 200 year corpus of temperature and humidity readings in one location.
     • Value is not in the individual “documents”.
     • Value is not in the corpus as a whole.
     • Value is in the relation between the “documents”.
  6. Documents related by publication date: Google Books Ngram Viewer
  7. Properties of our Parliamentary Proceedings Dataset
  8. Longitudinal data
     • weekly measurements for over 150 years
     • very stable measurement procedure and data model
  9. Data about human behaviour
  10. Often rather boring
  11. But sometimes full of drama and excitement
  12. Loads of measurement points: 24,000 days, 450,000 topics, 7.5 million speeches
  13. Digitally available
  14. About this collection
     • very sparsely available metadata
     • very rich “metadata” sits hidden inside the raw data
     • Rich data model:
        • Meeting (1 day)
        • Topic
        • Stage direction
        • Scene
        • Stage direction
        • Speech
        • Paragraph
  15. Very rich metadata for each word
     For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings:
     1) when it was said,
     2) who said it,
     3) in what function,
     4) speaking on behalf of which party,
     5) in which context, and
     6) who was actively present during the speech act.
  16. How to exploit the extra metadata and structure?
     • Let’s consider a simple killer app . . .
  17. Political n-gram viewer
     • For every word we know both the date and the speaker.
     • Every speaker belongs to a political party.
     • 3D n-gram viewer: political spectrum vs time vs word-count
     • Use: topic ownership, agenda setting, framing
  18. Political n-gram viewer: requirements
     • Documents:
        1. metadata: date of the meeting
        2. document structure: for every spoken word, who said it.
     • Linked Data: speakers' names are disambiguated, normalized and mapped to a database with temporal party information.
     • Completeness and correctness: few missing or wrong data, also for the distant past.
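The counting behind such a viewer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speech records, the party names `PartyA` and `PartyB`, and the function `ngram_counts` are hypothetical and do not come from the PoliticalMashup data or code. It assumes the three requirements are already met, so every speech carries a date and a resolved party.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical, hand-made records: (meeting date, party of the speaker, text).
SPEECHES = [
    (date(2001, 3, 1), "PartyA", "the budget debate on the budget"),
    (date(2001, 6, 9), "PartyB", "the climate debate"),
    (date(2002, 1, 15), "PartyA", "climate and budget"),
]


def ngram_counts(speeches, term):
    """Count occurrences of a single-word term per year and per party."""
    counts = defaultdict(Counter)  # year -> Counter mapping party -> count
    for day, party, text in speeches:
        tokens = text.lower().split()  # naive whitespace tokenization
        counts[day.year][party] += tokens.count(term.lower())
    return counts


counts = ngram_counts(SPEECHES, "budget")
for year in sorted(counts):
    print(year, dict(counts[year]))
```

The two keys of the result (year and party) are the time and political-spectrum axes of the viewer; the stored value is the word-count axis.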
  19. Is Linked (Open) Data the solution?
     • Link each speaker's name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph, and [Spitkovsky, Chang, LREC 2012].
     • DBpedia extracts the link between a person and their party affiliation from the Wikipedia infobox.
     • Timestamped triple: Geert Wilders is party member of VVD from 1998-08-25 until 2004-09-02
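A timestamped fact like the one above needs a validity interval next to the subject, predicate and object. The sketch below is one possible in-memory representation, not the PoliticalMashup or DBpedia data model: the names `Membership` and `party_on` are hypothetical, and treating the end date as inclusive is an assumption.

```python
from datetime import date
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership is still running


MEMBERSHIPS = [
    # The single fact given on the slide.
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]


def party_on(person, day, memberships=MEMBERSHIPS):
    """Return the party of `person` on `day`, or None if no record covers it."""
    for m in memberships:
        # Assumption: the end date is inclusive.
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


print(party_on("Geert Wilders", date(2000, 1, 1)))
print(party_on("Geert Wilders", date(2005, 1, 1)))
```

A lookup of this kind is what lets a speech be attributed to the party the speaker belonged to on the day of the meeting, rather than to the speaker's current party.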
  20. DBpedia not yet reliable
     • Data extraction is difficult, even from the infobox, even from complete data (Wikipedia page of Geert Wilders; DBpedia information about Geert Wilders). Notice the values of the party and the office attributes.
     • Timestamped facts are difficult to extract and difficult to represent in RDF triples.
  21. Lesson learned: requirement on metadata and relations
     • One cannot rely on Linked Open Data for good quality metadata.
     • Official documents should be self-describing, also for facts which are obvious at publication time.
     • Compare speaker’s data in original (OCRed) data and XMLified and enriched version:
        • Original
        • Part of it in XML
        • And now for human consumption
  22. A few more applications
  23. Entity Profiling and Entity Search
     • Users search for entities, not for documents. [TREC Entity Track] [Balog et al 2009].
     • Main research questions: how to collect information on entities, how to model an entity, how to rank entities.
     • (Parsimonious) language models work well as models. [Balog et al, 2009] [Hiemstra et al, 2004]
     • Entity profiling: http://www.politiekinzicht.com
     • Entity search: http://ikkieswijzer.nl
  24. Content and structure search
     • Usual advanced search combines keyword search with metadata search.
     • Extra fields are just extra filters on the returned documents.
     • With structured documents we can do search on content and structure.
     • Most useful task: rank best entry points in large documents.
     • Compare two search systems on the same data: one on flat text, one on an XML representation.
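To make "rank best entry points" concrete, here is a minimal sketch that scores each speech element inside one large document instead of scoring the document as a whole. The element and attribute names (`meeting`, `topic`, `speech`, `speaker`, `p`), the sample text, and the plain term-frequency score are all assumptions for illustration; they are not the actual PoliticalMashup schema or ranking function.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature proceedings document.
XML = """
<meeting date="2000-01-01">
  <topic title="Budget">
    <speech speaker="A"><p>The budget is tight.</p><p>We need a budget plan.</p></speech>
    <speech speaker="B"><p>I disagree with the plan.</p></speech>
  </topic>
</meeting>
"""


def rank_speeches(xml_text, term):
    """Return (score, speaker) pairs for speeches mentioning `term`, best first."""
    root = ET.fromstring(xml_text)
    scored = []
    for speech in root.iter("speech"):
        words = " ".join(speech.itertext()).lower().split()
        # Naive scoring: raw term frequency after stripping punctuation.
        score = sum(1 for w in words if w.strip(".,;:!?") == term.lower())
        if score:
            scored.append((score, speech.get("speaker")))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps document order on ties
    return scored


print(rank_speeches(XML, "budget"))
```

On flat text the same query could only return the whole meeting; with the structure made explicit, the result points at the speech where the reader should start.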
  25. Lesson learned: requirement on structure
     • Make semantically important structure of documents explicit in XML markup.
     • Publish for machine readability.
     • Publish generic data, not data prepared for one use-case.
  26. Application of structure: Interruption graph (Attackogram)
     • MP A interrupts B ⇐⇒ A speaks during the block of B.
     • Combined with entity profiling: http://debat.politiekinzicht.com/
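The definition above translates directly into an edge count. The following sketch assumes each block is available as a pair of the block owner and the ordered list of speakers of the utterances inside it; this input shape, the sample data and the function name `interruption_graph` are hypothetical.

```python
from collections import Counter

# Hypothetical blocks: (owner of the block, speakers of the utterances in it).
BLOCKS = [
    ("B", ["B", "A", "B", "C", "B"]),
    ("A", ["A", "B", "A"]),
    ("B", ["B", "A", "B"]),
]


def interruption_graph(blocks):
    """Count directed edges (interrupter, interrupted) over all blocks."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:  # someone else speaks during the owner's block
                edges[(speaker, owner)] += 1
    return edges


graph = interruption_graph(BLOCKS)
for (interrupter, interrupted), n in sorted(graph.items()):
    print(f"{interrupter} -> {interrupted}: {n}")
```

The edge weights can then be drawn as a weighted directed graph, with MPs as nodes.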
  27. Exploring and exploiting official documents
     • We saw what can be done with one well-curated collection.
     • What are the key infrastructural and research questions? In what direction and how to scale this up?
        1. in time
        2. in breadth
        3. in links
  28. Scale diachronically
     • Stable data model and measurement procedure make this data very valuable for diachronic comparisons.
     • towards the past
        • OCR
        • consistency in structure
        • more missing data to link to
     • towards the future
        • remain up to date
        • legacy decisions
  29. Scale in breadth, e.g., parliamentary proceedings of all European countries
     • All describe the same “script”, so all fit in one schema.
     • Main question: how to connect the data from different countries?
        • Common structure and annotation: use the same Relax NG schema.
        • Common values on certain attributes:
           • Entities: normalize to Wikipedia concepts
           • Controlled vocabulary keywords: normalize to Eurovoc
           • Language: machine translate to English
           • Events: normalize to EMM Newsexplorer query / Wikinews query
  30. Scale in breadth: link to related datasets
     • Link on time, entities, events, topics
     • Other official publications
     • News
     • User generated content
     • (In our case) promises of political actors: election manifestos
  31. Conclusions
     • There are ample opportunities for exploiting Official Publications.
     • Preprocessing and interlinking with other datasets is difficult and does not scale well:
        • High precision and recall are needed for many applications
        • Many text analysis and data-mapping tasks [MUC, TAC]
        • Every format needs its own transformer
        • Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors
     • High investment, but if done in a general way, high return and impact.
  32. Back to our research question: What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?
     Lessons learned:
     • Common, open, standardized, self-describing, machine readable,
     • not tied to a single application
     • linked, linked, linked
        • Not only shared attributes
        • but more importantly, shared data values
     • also store utterly obvious facts (10 years later they aren’t)
  33. How we can help (ourselves): help improve input data at the source
     • Push at the source (in UK: open government data; in Holland: all parliamentary data is now in XML . . . )
     • Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text-classification).
     • Emphasize importance of using shared standards. Future researchers will love you.
  34. Last Question. Official Publications: are they or ?