Keynote Exploring and Exploiting Official Publications
PoliticalMashup
Open Official Documents: Requirements and Opportunities
Maarten Marx, Universiteit van Amsterdam
EEOP Workshop (@LREC), Istanbul, 2012-05-27
Content
• Official documents: zoom in on a specific official publications dataset
• Opportunities: what makes official publications data valuable?
• Requirements: what is needed to make official publications data reusable and interoperable?
Our Leading Research Question
What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al 2010]
W3C recommendations on Open Government Data
• make data both machine and human readable;
• link data, make data linkable, provide permanent identifiers for each government object and data item;
• provide metadata using common standards (e.g. Dublin Core);
• make the data as easy to reuse (e.g. in mashups) as possible.
Goal of this talk: make this concrete.
Value of a large data corpus
• Consider a 200-year corpus of temperature and humidity readings in one location.
• The value is not in the individual "documents".
• The value is not in the corpus as a whole.
• The value is in the relations between the "documents".
Documents related by publication date: the Google Books Ngram Viewer
Properties of our Parliamentary Proceedings Dataset
Longitudinal data
• weekly measurements for over 150 years
• very stable measurement procedure and data model
About this collection
• very sparse available metadata
• very rich "metadata" sits hidden inside the raw data
• rich data model:
  Meeting (1 day)
    Topic
      Stage direction
      Scene
        Stage direction
        Speech
          Paragraph
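The hierarchical data model above maps naturally onto XML. A minimal sketch in Python with the standard library, assuming hypothetical element and attribute names (meeting, topic, speech, p, speaker, party); the actual PoliticalMashup schema may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the meeting > topic > scene > speech > paragraph model
SAMPLE = """
<meeting date="1950-03-14">
  <topic title="Budget debate">
    <stage-direction>The chairman opens the meeting.</stage-direction>
    <scene>
      <speech speaker="J. Jansen" party="PvdA" role="MP">
        <p>Mr. Chairman, I object to this proposal.</p>
      </speech>
    </scene>
  </topic>
</meeting>
"""

root = ET.fromstring(SAMPLE)
# Every spoken paragraph can be attributed: the date comes from the
# enclosing meeting, speaker/party/role from the enclosing speech.
for speech in root.iter("speech"):
    for p in speech.iter("p"):
        print(root.get("date"), speech.get("speaker"),
              speech.get("party"), "->", p.text)
```

With this nesting, the "hidden" metadata (date, speaker, party) is recoverable for every paragraph by walking up the element tree.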
Very rich metadata for each word
For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings:
1) when it was said,
2) who said it,
3) in what function,
4) speaking on behalf of which party,
5) in which context, and
6) who was actively present during the speech act.
How to exploit the extra metadata and structure?
• Let's consider a simple killer app . . .
Political n-gram viewer
• For every word we know both the date and the speaker.
• Every speaker belongs to a political party.
• 3D n-gram viewer: political spectrum vs time vs word count
• Use: topic ownership, agenda setting, framing
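The aggregation behind such a viewer is simple once every word carries a date and a party. A sketch, with a hypothetical record format of (year, party, text) tuples:

```python
from collections import Counter

# Hypothetical (year, party, text) records extracted from proceedings
speeches = [
    (2008, "PvdA", "the crisis demands solidarity"),
    (2008, "VVD", "the crisis demands lower taxes"),
    (2009, "VVD", "taxes taxes taxes"),
]

def ngram_counts(term, records):
    """Count occurrences of a term, broken down by (party, year)."""
    counts = Counter()
    for year, party, text in records:
        n = text.split().count(term)
        if n:
            counts[(party, year)] += n
    return counts

print(ngram_counts("taxes", speeches))
```

Plotting these counts over the party axis and the time axis gives the 3D view named on the slide.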
Political n-gram viewer: requirements
Documents
1. metadata: the date of the meeting
2. document structure: for every spoken word, who said it
Linked data
Speakers' names are disambiguated, normalized, and mapped to a database with temporal party information.
Completeness and correctness
Few missing or wrong data points, also for the distant past.
Is Linked (Open) Data the solution?
• Link the speaker's name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also the Google Knowledge Graph, and [Spitkovsky, Chang, LREC 2012].
• DBpedia extracts the link between a person and a party affiliation from the Wikipedia infobox.
• Timestamped triple: Geert Wilders is a party member of the VVD from 1998-08-25 until 2004-09-02.
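Timestamped facts like this are awkward to represent in plain RDF triples, but trivial in a small interval table. A sketch of the temporal lookup such a database must support, using the membership interval from the slide:

```python
from datetime import date

# (person, party, start, end): closed interval of the affiliation
memberships = [
    ("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]

def party_on(person, when):
    """Return the party a person belonged to on a given date, or None."""
    for who, party, start, end in memberships:
        if who == person and start <= when <= end:
            return party
    return None

print(party_on("Geert Wilders", date(2000, 1, 1)))  # VVD
```

Attributing a speech to a party then reduces to calling this lookup with the speaker's normalized name and the meeting date.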
DBpedia is not yet reliable
• Data extraction is difficult, even from the infobox, and even from complete data: compare the Wikipedia page of Geert Wilders with the DBpedia information about Geert Wilders, and notice the values of the party and office attributes.
• Timestamped facts are difficult to extract and difficult to represent in RDF triples.
Lesson learned: requirements on metadata and relations
• One cannot rely on Linked Open Data for good-quality metadata.
• Official documents should be self-describing, also for facts which are obvious at publication time.
• Compare the speaker's data in the original (OCRed) data with the XMLified and enriched version:
  • the original
  • part of it in XML
  • and now for human consumption
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track] [Balog et al 2009].
• Main research questions: how to collect information on entities, how to model an entity, how to rank entities.
• (Parsimonious) language models work well as models [Balog et al 2009] [Hiemstra et al 2004].
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
Content and structure search
• Usual advanced search combines keyword search with metadata search.
• Extra fields are just extra filters on the returned documents.
• With structured documents we can search on content and structure.
• Most useful task: rank the best entry points in large documents.
• Compare two search systems on the same data: one on flat text, one on an XML representation.
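A minimal sketch of the difference: instead of returning a whole proceedings document for a keyword match, return the smallest meaningful unit that matches, a candidate entry point. Element names here are assumptions, not the actual schema, and no ranking is applied:

```python
import xml.etree.ElementTree as ET

# Hypothetical structured document: one meeting, two speeches
DOC = """
<meeting date="1950-03-14">
  <topic title="Budget debate">
    <speech speaker="A"><p>nothing relevant here</p></speech>
    <speech speaker="B"><p>we must cut taxes now</p></speech>
  </topic>
</meeting>
"""

def entry_points(xml_text, term):
    """Return (speaker, text) for each speech containing the term."""
    root = ET.fromstring(xml_text)
    hits = []
    for speech in root.iter("speech"):
        text = " ".join(p.text or "" for p in speech.iter("p"))
        if term in text:
            hits.append((speech.get("speaker"), text))
    return hits

print(entry_points(DOC, "taxes"))  # [('B', 'we must cut taxes now')]
```

A flat-text system can only say that the meeting matches; the structured system points directly at speaker B's speech.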
Lesson learned: requirement on structure
• Make the semantically important structure of documents explicit in XML markup.
• Publish for machine readability.
• Publish generic data, not data prepared for one use case.
Application of structure: interruption graph ("attackogram")
• MP A interrupts B ⇐⇒ A speaks during the block of B.
• Combined with entity profiling: http://debat.politiekinzicht.com/
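The definition on the slide turns directly into a graph-building pass over the speech sequence. A sketch, with a hypothetical input format marking which speaker holds the floor (the "block" holder) versus who merely interjects:

```python
from collections import defaultdict

# Hypothetical ordered speeches in one topic: (speaker, holds_floor)
speeches = [
    ("B", True), ("A", False), ("B", True),   # A interrupts B's block
    ("C", True), ("A", False), ("D", False),  # A and D interrupt C's block
]

def interruption_graph(seq):
    """Map each interrupter to the set of MPs they interrupted:
    A interrupts B iff A speaks during the block of B."""
    graph = defaultdict(set)
    holder = None
    for speaker, holds_floor in seq:
        if holds_floor:
            holder = speaker
        elif holder is not None and speaker != holder:
            graph[speaker].add(holder)
    return graph

g = interruption_graph(speeches)
```

Edge weights (interruption counts) and the entity profiles mentioned above can then be layered on top of this graph.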
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection.
• What are the key infrastructural and research questions? In which directions, and how, to scale this up?
  1. in time
  2. in breadth
  3. in links
Scale diachronically
• The stable data model and measurement procedure make this data very valuable for diachronic comparisons.
• Towards the past:
  • OCR
  • consistency in structure
  • more missing data to link to
• Towards the future:
  • remain up to date
  • legacy decisions
Scale in breadth, e.g., parliamentary proceedings of all European countries
• All describe the same "script", so all fit in one schema.
• Main question: how to connect the data from different countries?
  • Common structure and annotation: use the same Relax NG schema
  • Common values on certain attributes:
    • Entities: normalize to Wikipedia concepts
    • Controlled vocabulary keywords: normalize to Eurovoc
    • Language: machine translate to English
    • Events: normalize to an EMM NewsExplorer query / Wikinews query
Scale in breadth: link to related datasets
• Link on time, entities, events, topics:
  • other official publications
  • news
  • user-generated content
  • (in our case) the promises of political actors: election manifestos
Conclusions
• There are ample opportunities for exploiting official publications.
• Preprocessing and interlinking with other datasets is difficult and does not scale well:
  • High precision and recall are needed for many applications.
  • Many text analysis and data-mapping tasks [MUC, TAC].
  • Every format needs its own transformer.
  • Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors.
• High investment, but if done in a general way, high return and impact.
Back to our research question
What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?
Lessons learned
• Common, open, standardized, self-describing, machine readable,
• not tied to a single application,
• linked, linked, linked:
  • not only shared attributes,
  • but more importantly, shared data values;
• also store utterly obvious facts (ten years later they aren't).
How we can help (ourselves): improve input data at the source
• Push at the source (in the UK: open government data; in Holland: all parliamentary data is now in XML . . . )
• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification).
• Emphasize the importance of using shared standards. Future researchers will love you.
Last Question
Official Publications: are they or ?