PoliticalMashup                                         1




                      PoliticalMashup
              Open Official Documents: Requirements and
                           Opportunities

                             Maarten Marx

                       Universiteit van Amsterdam

                   Istanbul, EEOP (@LREC), 2012-05-27



                           Content

• Official Documents: zoom in on a specific official publications
  dataset.

• Opportunities: what makes official publications data valuable?

• Requirements: what is needed to make official publications data
  reusable and interoperable?



                  Our Leading Research Question




What is the best data format for publishing both legacy and current
parliamentary proceedings in a digitally sustainable manner?
[Marx et al 2010]



 W3C recommendations on Open Government Data

• make data both machine and human readable;

• link data, make data linkable, provide permanent identifiers for
  each government object and data item;

• provide metadata using common standards (e.g. Dublin Core);

• make the data as easy to reuse (e.g. in mashups) as possible.

                  Goal of this talk: make this concrete.



                  Value of a large data corpus

• Consider a 200 year corpus of temperature and humidity readings
  in one location.

• Value is not in the individual “documents”

• Value is not in the corpus as a whole.

• Value is in the relation between the “documents”.



                  Documents related by publication date




                          Google Books Ngram Viewer



          Properties of our Parliamentary Proceedings
                            Dataset



                     Longitudinal data

• weekly measurements for over 150 years

• very stable measurement procedure and data model



                  Data about human behaviour



                  Often rather boring



         But sometimes full of drama and excitement



                       Loads of measurement points

                  24,000 days, 450,000 topics, 7.5 million speeches



                  Digitally available



                    About this collection

• very little metadata is available

• very rich “metadata” sits hidden inside the raw data

• Rich data model:
  • Meeting (1 day)
    • Topic
      • Stage direction
      • Scene
        • Stage direction
        • Speech
          • Paragraph
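
The hierarchy listed above can be written down as a small set of record types. The following is a minimal sketch only: the class names mirror the slide, but the field names (`speaker`, `title`, `items`, and so on) and the sample meeting are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union


@dataclass
class Speech:
    speaker: str                                   # hypothetical field: name as printed
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class StageDirection:
    text: str                                      # e.g. a remark recorded by the clerk


@dataclass
class Scene:
    # A scene holds stage directions and speeches in document order.
    items: List[Union[StageDirection, Speech]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    # A topic holds stage directions and scenes in document order.
    items: List[Union[StageDirection, Scene]] = field(default_factory=list)


@dataclass
class Meeting:
    day: date                                      # one meeting covers one day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Walk the hierarchy and count every paragraph of every speech."""
    total = 0
    for topic in meeting.topics:
        for item in topic.items:
            if isinstance(item, Scene):
                for part in item.items:
                    if isinstance(part, Speech):
                        total += len(part.paragraphs)
    return total


# Invented sample data, only to show how the levels nest.
meeting = Meeting(
    day=date(2000, 1, 1),
    topics=[Topic(title="Example topic", items=[
        StageDirection("The chair opens the meeting."),
        Scene(items=[
            Speech("Speaker A", ["First paragraph.", "Second paragraph."]),
            StageDirection("Applause."),
            Speech("Speaker B", ["A reply."]),
        ]),
    ])],
)
```

Keeping stage directions as their own type, rather than as text inside a speech, is one way to keep the "who said what" structure unambiguous.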



                  Very rich metadata for each word

For every word spoken in parliament, the following facts are known
at the time of the speech act, and can often be extracted from the
written proceedings:
1)   when it was said,
2)   who said it,
3)   in what function,
4)   speaking on behalf of which party,
5)   in which context, and
6)   who was actively present during the speech act.



  How to exploit the extra metadata and structure?

• Let’s consider a simple killer app . . .



                   Political n-gram viewer
• From every word we know both the date and the speaker.

• Every speaker belongs to a political party.

• 3D n-gram viewer: political spectrum vs time vs word-count

• Use: topic ownership, agenda setting, framing



                  Political n-gram viewer: requirements

Documents
   1. Metadata: the date of the meeting.
   2. Document structure: for every spoken word, who said it.

Linked Data: speakers’ names are disambiguated, normalized and
   mapped to a database with temporal party information.

Completeness and correctness: few missing or wrong data, also for
   material from long ago.
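
Once these requirements are met, the counting behind such a viewer is simple. The sketch below assumes that each speech has already been reduced to a (meeting date, party, text) triple; the function name `term_counts`, the whitespace tokenization, and the sample speeches are all illustrative assumptions, and only single words (unigrams) are counted.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple

# (meeting date, party of the speaker, speech text); all values are invented.
SpeechRecord = Tuple[date, str, str]


def term_counts(speeches: Iterable[SpeechRecord], term: str) -> Dict[Tuple[int, str], int]:
    """Count occurrences of `term` per (year, party)."""
    counts: Counter = Counter()
    needle = term.lower()
    for day, party, text in speeches:
        # Naive tokenization: lowercase and split on whitespace.
        hits = text.lower().split().count(needle)
        if hits:
            counts[(day.year, party)] += hits
    return dict(counts)


speeches = [
    (date(1990, 3, 1), "Party A", "the budget and the budget deficit"),
    (date(1990, 6, 1), "Party B", "health care"),
    (date(1991, 1, 15), "Party B", "budget cuts"),
]
budget_counts = term_counts(speeches, "budget")
```

The resulting dictionary has one cell per year and party, which is the data a "political spectrum vs time vs word-count" plot would need.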



                  Is Linked (Open) Data the solution?

• Link a speaker’s name to a Wikipedia/DBpedia page (named entity
  disambiguation and resolution). See also Google Knowledge
  Graph and [Spitkovsky, Chang, LREC 2012].

• DBpedia extracts the link between a person and a party affiliation
  from the Wikipedia infobox.

• Timestamped triple:

                     Geert Wilders is party member of VVD
                       from 1998-08-25 until 2004-09-02
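
A timestamped fact like the one above needs a validity interval in addition to subject, predicate and object. The following sketch stores the fact from this slide as a record and looks up a party for a given day. The `Membership` type, the `party_on` helper, and the choice to treat both end dates as inclusive are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Membership:
    person: str
    party: str
    start: date   # first day of membership
    end: date     # last day of membership (treated as inclusive: an assumption)


def party_on(memberships: List[Membership], person: str, day: date) -> Optional[str]:
    """Return the party of `person` on `day`, or None if no record covers it."""
    for m in memberships:
        if m.person == person and m.start <= day <= m.end:
            return m.party
    return None


# The single fact given on the slide.
MEMBERSHIPS = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]
```

A lookup with a date outside every interval returns `None`, which makes missing temporal data visible instead of silently assigning a party.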



                     DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from
  complete data:
  • Wikipedia page of Geert Wilders
  • DBpedia information about Geert Wilders
  • Notice the values of the party and the office attributes.
  • Timestamped facts are difficult to extract and difficult to
    represent in RDF triples.



       Lesson learned: requirement on metadata and
                          relations

• One cannot rely on Linked Open Data for good quality metadata.

• Official documents should be self-describing, also for facts which
  are obvious at publication time.

• Compare the speaker data in the original (OCRed) data with the
  XMLified and enriched version:
  • Original
  • Part of it in XML
  • And now for human consumption



                  A few more applications



                  Entity Profiling and Entity Search

• Users search for entities, not for documents. [TREC Entity Track]
  [Balog et al 2009].

• Main research questions:
  • how to collect information on entities,
  • how to model an entity,
  • how to rank entities.

• (Parsimonious) language models work well as models. [Balog et
  al, 2009][Hiemstra et al, 2004]

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl
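
As background for the parsimonious language models mentioned above, here is a minimal sketch of the EM-style estimation usually described for them: terms that the background corpus already explains well lose probability mass in the entity's model and are eventually pruned. The function name, the default values for `lam`, `iterations` and `threshold`, and the toy counts are assumptions for illustration, not the settings used in the cited work.

```python
from typing import Dict


def parsimonious_lm(doc_tf: Dict[str, int], corpus_prob: Dict[str, float],
                    lam: float = 0.1, iterations: int = 20,
                    threshold: float = 1e-4) -> Dict[str, float]:
    """Estimate P(t|D) so that it concentrates on terms typical of D."""
    total = sum(doc_tf.values())
    # Start from the maximum-likelihood estimate.
    p_doc = {t: tf / total for t, tf in doc_tf.items()}
    for _ in range(iterations):
        # E-step: share of each term count attributed to the document model.
        expected = {}
        for t, p in p_doc.items():
            doc_part = lam * p
            bg_part = (1.0 - lam) * corpus_prob.get(t, 0.0)
            expected[t] = doc_tf[t] * doc_part / (doc_part + bg_part)
        # M-step: normalise, then prune terms below the threshold.
        norm = sum(expected.values())
        kept = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        if not kept:          # threshold too high for this document
            break
        z = sum(kept.values())
        p_doc = {t: p / z for t, p in kept.items()}
    return p_doc


# Toy counts: "the" is frequent everywhere, the other terms are specific.
doc_tf = {"the": 10, "parliament": 5, "budget": 5}
corpus_prob = {"the": 0.5, "parliament": 0.001, "budget": 0.001}
model = parsimonious_lm(doc_tf, corpus_prob)
```

In this toy case the common word ends up with far less weight than its raw frequency suggests, which is the property that makes such models useful as compact entity profiles.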



                  Content and structure search

• Usual advanced search combines keyword search with metadata
  search.

• Extra fields are just extra filters on the returned documents.

• With structured documents we can do search on content and
  structure.

• Most useful task: rank best entry points in large documents.

• Compare two search systems on the same data:
  • on flat text
  • on an XML representation
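
The difference between the two kinds of search can be sketched with the Python standard library. The XML fragment below is invented: the element and attribute names (`meeting`, `topic`, `scene`, `speech`, `p`, `speaker`, `party`) are assumptions that follow the data model of the talk, not the project's real schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

XML = """
<meeting date="2000-01-01">
  <topic title="Budget">
    <scene>
      <speech speaker="Speaker A" party="Party X">
        <p>We must discuss the budget today.</p>
      </speech>
      <speech speaker="Speaker B" party="Party Y">
        <p>I agree that health care matters.</p>
        <p>The budget is secondary.</p>
      </speech>
    </scene>
  </topic>
</meeting>
"""


def search_flat(xml_text: str, keyword: str) -> bool:
    """Flat-text search: only reports whether the whole document matches."""
    text = " ".join(ET.fromstring(xml_text).itertext())
    return keyword.lower() in text.lower()


def search_structured(xml_text: str, keyword: str, party: str) -> List[Tuple[str, str]]:
    """Content-and-structure search: matching paragraphs spoken by one party."""
    root = ET.fromstring(xml_text)
    hits = []
    for speech in root.iterfind(f".//speech[@party='{party}']"):
        for p in speech.iterfind("p"):
            if keyword.lower() in (p.text or "").lower():
                hits.append((speech.get("speaker"), p.text))
    return hits
```

The flat search can only say that the meeting mentions a keyword; the structured search returns the paragraph and speaker as an entry point into the document.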



              Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in
  XML markup.

• Publish for machine readability

• Publish generic data, not data prepared for one use-case.



           Application of structure: Interruption graph
                          (Attackogram)

• MP A interrupts B ⇐⇒ A speaks during the block of B.




combined with entity profiling:
http://debat.politiekinzicht.com/
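
The definition above (A interrupts B when A speaks during the block of B) translates directly into edge counting. In this sketch a block is represented as its owner plus the ordered list of people who speak in it; that representation, the function name, and the sample blocks are assumptions for illustration. In real proceedings one would probably also filter out the chair.

```python
from collections import Counter
from typing import Dict, List, Tuple

# A block is (owner, speakers of the speeches in that block, in order).
Block = Tuple[str, List[str]]


def interruption_graph(blocks: List[Block]) -> Dict[Tuple[str, str], int]:
    """Return weighted directed edges (interrupter, interrupted) -> count."""
    edges: Counter = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return dict(edges)


# Invented example: A speaks twice in B's block, C once in A's block.
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),
    ("A", ["A", "C", "A"]),
]
graph = interruption_graph(blocks)
```

The edge weights can then be drawn as arrow thickness between members of parliament.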



            Exploring and exploiting official documents

• We saw what can be done with one well-curated collection.

• What are the key infrastructural and research questions?

        In what direction and how to scale this up?
   1. in time
   2. in breadth
   3. in links



                    Scale diachronically

• Stable data model and measurement procedure make this data
  very valuable for diachronic comparisons.

• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to

• towards the future
  • remain up to date
  • legacy decisions



          Scale in breadth, e.g., parliamentary proceedings of all
                       European countries
• All describe the same “script”, so all fit in one schema.

• Main question: how to connect the data from different countries?
  • Common structure and annotation: use the same Relax NG
    schema.
  • Common values on certain attributes:
    • Entities: normalize to Wikipedia concepts.
    • Controlled vocabulary keywords: normalize to Eurovoc.
    • Language: machine translate to English.
    • Events: normalize to EMM Newsexplorer query / Wikinews
      query.



              Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User generated content

• (In our case) promises of political actors: election manifestos



                          Conclusions

• There are ample opportunities for exploiting Official Publications.

• Preprocessing and interlinking with other datasets is difficult and
  does not scale well:
  • High precision and recall are needed for many applications.
  • Many text analysis and data-mapping tasks are involved [MUC, TAC].
  • Every format needs its own transformer.
  • Linked Open Data knowledge bases are not (yet) good enough:
    create special-purpose knowledge extractors.

• High investment, but if done in a general way, high return and
  impact.



                  Back to our research question
What is the best data format for publishing both legacy and current
   parliamentary proceedings in a digitally sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine readable,

• not tied to a single application

• linked, linked, linked
  • Not only shared attributes
  • but more importantly, shared data values

• also store utterly obvious facts (10 years later they aren’t)



                  How we can help (ourselves)

                  Help improve input data at the source

• Push at the source (in UK: open government data; in Holland: all
  parliamentary data is now in XML . . . )

• Help reduce dumb cut-and-paste annotation work, so annotators
  can concentrate on tasks which are hard for machines (e.g.
  text-classification).

• Emphasize importance of using shared standards.

                     Future researchers will love you.



                       Last Question

                  Official Publications: are they [image] or [image]?

More Related Content

What's hot

Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphsSören Auer
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govUXPA International
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesOntotext
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of CongressJill MacNeice
 
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingAnalytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingOntotext
 
Industry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftIndustry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftRuleML
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseGiorgio Orsi
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfHarsh Thakkar
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise Ontotext
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archiveLewis Crawford
 
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Aditya Jami
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Giorgio Orsi
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsPeter Haase
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsOntotext
 
Sören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge GraphsSören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge Graphssemanticsconference
 

What's hot (20)

Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphs
 
Design for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.govDesign for Findability: metadata, metrics and collaboration on LOC.gov
Design for Findability: metadata, metrics and collaboration on LOC.gov
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
 
Design for Findability at the Library of Congress
Design for Findability at the Library of CongressDesign for Findability at the Library of Congress
Design for Findability at the Library of Congress
 
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data LinkingAnalytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
Analytics on Big Knowledge Graphs Deliver Entity Awareness and Help Data Linking
 
Industry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraftIndustry@RuleML2015 DataGraft
Industry@RuleML2015 DataGraft
 
Web Data Extraction: A Crash Course
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise
 
Analytics and Access to the UK web archive
Analytics and Access to the UK web archiveAnalytics and Access to the UK web archive
Analytics and Access to the UK web archive
 
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
Farirhair.ai: AI platform to mine competitive intelligence from billions of u...
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)Fairhair.ai – alan turing institute june '17 (public)
Fairhair.ai – alan turing institute june '17 (public)
 
Big Data - Gerami
Big Data - GeramiBig Data - Gerami
Big Data - Gerami
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
Discovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data PortalsDiscovering Related Data Sources in Data Portals
Discovering Related Data Sources in Data Portals
 
Semantic Web in the Digital Humanities
Semantic Web in the Digital HumanitiesSemantic Web in the Digital Humanities
Semantic Web in the Digital Humanities
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
Big data
Big dataBig data
Big data
 
Sören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge GraphsSören Auer | Enterprise Knowledge Graphs
Sören Auer | Enterprise Knowledge Graphs
 

Similar to Keynote Exploring and Exploiting Official Publications

Groningen nl pgroep
Groningen nl pgroepGroningen nl pgroep
Groningen nl pgroepmaartenmarx
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesLaura Po
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in RomaniaVlad Posea
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Madhav Mishra
 
Building the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenBuilding the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenMaxKemman
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open DataBrian Hole
 
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...Academic Registrars Council
 
Using DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationUsing DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationMartin Kaltenböck
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationMarieke van Erp
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?Li Ding
 
ESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsPeter Haase
 
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...BigData_Europe
 
An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011Abe Gong
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...CILIP MDG
 
Data.gov Overview, August 2012
Data.gov Overview, August 2012Data.gov Overview, August 2012
Data.gov Overview, August 2012Jeanne Holm
 

Similar to Keynote Exploring and Exploiting Official Publications (20)

Groningen nl pgroep
Groningen nl pgroepGroningen nl pgroep
Groningen nl pgroep
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in Romania
 
ONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational StatisticsONS Local presents: Explore Subnational Statistics
ONS Local presents: Explore Subnational Statistics
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
Applied Artificial Intelligence Unit 5 Semester 3 MSc IT Part 2 Mumbai Univer...
 
Building the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-drivenBuilding the PoliMedia search system; data- and user-driven
Building the PoliMedia search system; data- and user-driven
 
From Open Access to Open Data
From Open Access to Open DataFrom Open Access to Open Data
From Open Access to Open Data
 
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
“The HE Landscape – Making It Happen (Painlessly)” - Andy Youell, Director of...
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
 
Using DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data IntegrationUsing DBpedia for Thesaurus Management and Linked Open Data Integration
Using DBpedia for Thesaurus Management and Linked Open Data Integration
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and Visualisation
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?
 
ESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge GraphsESWC 2017 Tutorial Knowledge Graphs
ESWC 2017 Tutorial Knowledge Graphs
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
Big Data Europe: SC6 Workshop 3: The European Research Data Landscape: Opport...
 
An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011An Automated Snowball Census of the Political Web - JITP 2011
An Automated Snowball Census of the Political Web - JITP 2011
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
 
Data.gov Overview, August 2012
Data.gov Overview, August 2012Data.gov Overview, August 2012
Data.gov Overview, August 2012
 

More from maartenmarx

Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivitymaartenmarx
 
Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13maartenmarx
 
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13maartenmarx
 
Economie van de aandacht
  Economie van de aandacht  Economie van de aandacht
Economie van de aandachtmaartenmarx
 
Dans dataprijs2012
Dans dataprijs2012Dans dataprijs2012
Dans dataprijs2012maartenmarx
 
College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08maartenmarx
 
Presentation at NLDB 2012
Presentation at NLDB 2012Presentation at NLDB 2012
Presentation at NLDB 2012maartenmarx
 
Women in Dutch parliament: what they did
Women in Dutch parliament: what they didWomen in Dutch parliament: what they did
Women in Dutch parliament: what they didmaartenmarx
 
Namescape 2012 03 06
Namescape 2012 03 06Namescape 2012 03 06
Namescape 2012 03 06maartenmarx
 
voting advice slides
 voting advice slides voting advice slides
voting advice slidesmaartenmarx
 
TV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalTV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalmaartenmarx
 
networks inparliament-ccct
 networks inparliament-ccct networks inparliament-ccct
networks inparliament-ccctmaartenmarx
 
Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10maartenmarx
 

More from maartenmarx (13)

Ilja state2014expressivity
Ilja state2014expressivityIlja state2014expressivity
Ilja state2014expressivity
 
Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13Haagse Hogeschool 2012-09-13
Haagse Hogeschool 2012-09-13
 
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
Expertmeeting, E-humanities en politieke geschiedenis, Nijmegen, 2013-09-13
 
Economie van de aandacht
  Economie van de aandacht  Economie van de aandacht
Economie van de aandacht
 
Dans dataprijs2012
Dans dataprijs2012Dans dataprijs2012
Dans dataprijs2012
 
College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08College sicco van-sas-2012_10_08
College sicco van-sas-2012_10_08
 
Presentation at NLDB 2012
Presentation at NLDB 2012Presentation at NLDB 2012
Presentation at NLDB 2012
 
Women in Dutch parliament: what they did
Women in Dutch parliament: what they didWomen in Dutch parliament: what they did
Women in Dutch parliament: what they did
 
Namescape 2012 03 06
Namescape 2012 03 06Namescape 2012 03 06
Namescape 2012 03 06
 
voting advice slides
 voting advice slides voting advice slides
voting advice slides
 
TV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaalTV-slant presentatie_politicologen_etmaal
TV-slant presentatie_politicologen_etmaal
 
networks inparliament-ccct
 networks inparliament-ccct networks inparliament-ccct
networks inparliament-ccct
 
Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10Screen biographischportaal2010 12-10
Screen biographischportaal2010 12-10
 

Keynote Exploring and Exploiting Official Publications

  • 1. PoliticalMashup 1 PoliticalMashup Open Official Documents: Requirements and Opportunities Maarten Marx Universiteit van Amsterdam Istanbul, EEOP (@LREC), 2012-05-27
  • 2. PoliticalMashup 2 Content • Official Documents Zoom in on a specific official publications dataset • Opportunities What makes official publications data valuable? • Requirements What is needed to make official publications data reusable and interoperable?
  • 3. PoliticalMashup 3 Our Leading Research Question What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner? [Marx et al 2010]
  • 4. PoliticalMashup 4 W3C recommendations on Open Government Data • make data both machine and human readable; • link data, make data linkable, provide permanent identifiers for each government object and data item; • provide metadata using common standards (e.g. Dublin Core); • make the data as easy to reuse (e.g. in mashups) as possible. Goal of this talk: make this concrete.
  • 5. PoliticalMashup 5 Value of a large data corpus • Consider a 200 year corpus of temperature and humidity readings in one location. • Value is not in the individual “documents” • Value is not in the corpus as a whole. • Value is in the relation between the “documents”.
  • 6. PoliticalMashup 6 Documents related by publication date Google books Ngram viewer
  • 7. PoliticalMashup 7 Properties of our Parliamentary Proceedings Dataset
  • 8. PoliticalMashup 8 Longitudinal data • weekly measurements for over 150 years • very stable measurement procedure and data model
  • 9. PoliticalMashup 9 Data about human behaviour
  • 10. PoliticalMashup 10 Often rather boring
  • 11. PoliticalMashup 11 But sometimes full of drama and excitement
  • 12. PoliticalMashup 12 Loads of measurement points 24,000 days, 450,000 topics, 7.5 million speeches
  • 13. PoliticalMashup 13 Digitally available
  • 14. PoliticalMashup 14 About this collection • very sparse available metadata • very rich “metadata” sits hidden inside the raw data • Rich data model • Meeting (1 Day) • Topic • Stage direction • Scene • Stage direction • Speech • Paragraph
  • 15. PoliticalMashup 15 Very rich metadata for each word For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings: 1) when it was said, 2) who said it, 3) in what function, 4) speaking on behalf of which party, 5) in which context, and 6) who was actively present during the speech act.
  • 16. PoliticalMashup 16 How to exploit the extra metadata and structure? • Let’s consider a simple killer app . . .
  • 17. PoliticalMashup 17 Political n-gram viewer • From every word we know both the date and the speaker. • Every speaker belongs to a political party. • 3D n-gram viewer: political spectrum vs time vs word-count • Use: topic ownership, agenda setting, framing
  • 18. PoliticalMashup 18 Political n-gram viewer: requirements. Documents: 1. metadata: date of the meeting 2. document structure: for every spoken word, who said it. Linked Data: speakers' names are disambiguated, normalized and mapped to a database with temporal party information. Completeness and correctness: few missing or wrong data, also for the distant past.
  • 19. PoliticalMashup 19 Is Linked (Open) Data the solution? • Link the speaker's name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph, and [Spitkovsky, Chang, LREC 2012]. • DBpedia extracts the link between a person and their party affiliation from the Wikipedia infobox • Timestamped triple: Geert Wilders is a party member of VVD from 1998-08-25 until 2004-09-02
  • 20. PoliticalMashup 20 DBpedia not yet reliable • Data extraction is difficult, even from the infobox, even from complete data: Wikipedia page of Geert Wilders DBpedia information about Geert Wilders Notice the values of the party and the office attributes Timestamped facts are difficult to extract and difficult to represent in RDF triples.
  • 21. PoliticalMashup 21 Lesson learned: requirement on metadata and relations • One cannot rely on Linked Open Data for good quality metadata. • Official documents should be self-describing, also for facts which are obvious at publication time. • Compare speaker’s data in original (OCRed) data and XMLified and enriched version: • Original • Part of it in XML • And now for human consumption
  • 22. PoliticalMashup 22 A few more applications
  • 23. PoliticalMashup 23 Entity Profiling and Entity Search • Users search for entities, not for documents. [TREC Entity Track] [Balog et al 2009]. • Main research questions How to collect information on entities, how to model an entity, how to rank entities. • (Parsimonious) language models work well as models. [Balog et al, 2009][Hiemstra et al, 2004] • Entity profiling: http://www.politiekinzicht.com • Entity search: http://ikkieswijzer.nl
  • 24. PoliticalMashup 24 Content and structure search • Usual advanced search combines keyword search with metadata search. • Extra fields are just extra filters on the returned documents. • With structured documents we can do search on content and structure. • Most useful task: rank best entry points in large documents. • Compare two search systems on the same data: on flat text on an XML representation
  • 25. PoliticalMashup 25 Lesson learned: requirement on structure • Make semantically important structure of documents explicit in XML markup. • Publish for machine readability • Publish generic data, not data prepared for one use-case.
  • 26. PoliticalMashup 26 Application of structure: Interruption graph (Attackogram) • MP A interrupts B ⇐⇒ A speaks during the block of B. combined with entity profiling: http://debat.politiekinzicht.com/
  • 27. PoliticalMashup 27 Exploring and exploiting official documents • We saw what can be done with one well-curated collection. • What are the key infrastructural and research questions? In what direction and how to scale this up? 1. in time 2. in breadth 3. in links
  • 28. PoliticalMashup 28 Scale diachronically • Stable data model and measurement procedure make this data very valuable for diachronic comparisons. • towards the past • OCR • consistency in structure • more missing data to link to • towards the future • remain up to date • legacy decisions
  • 29. PoliticalMashup 29 Scale in breadth, e.g., parliamentary proceedings of all European countries • All describe the same “script”, so all fit in one schema. • Main question: how to connect the data from different countries? Common structure and annotation: use the same Relax NG schema. Common values on certain attributes: • Entities: normalize to Wikipedia concepts • Controlled vocabulary keywords: normalize to Eurovoc • Language: machine translate to English • Events: normalize to EMM Newsexplorer query / Wikinews query
  • 30. PoliticalMashup 30 Scale in breadth: link to related datasets • Link on time, entities, events, topics • Other official publications • News • User generated content • (In our case) promises of political actors: election manifestos
  • 31. PoliticalMashup 31 Conclusions • There are ample opportunities for exploiting Official Publications. • Preprocessing and interlinking with other datasets is difficult and does not scale well: • High precision and recall are needed for many applications • Many text analysis and data-mapping tasks [MUC, TAC] • Every format needs its own transformer • Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors • High investment, but if done in a general way, high return and impact.
  • 32. PoliticalMashup 32 Back to our research question What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner? Lessons learned • Common, open, standardized, self-describing, machine readable, • not tied to a single application • linked, linked, linked • Not only shared attributes • but more importantly, shared data values • also store utterly obvious facts (10 years later they aren’t)
  • 33. PoliticalMashup 33 How we can help (ourselves) Help improve input data at the source • Push at the source (in UK: open government data; in Holland: all parliamentary data is now in XML . . . ) • Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text-classification). • Emphasize importance of using shared standards. Future researchers will love you.
  • 34. PoliticalMashup 34 Last Question Official Publications: are they or ?
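
The political n-gram viewer of slides 17 to 19 needs two things per speech: the meeting date and a speaker whose party is resolved at that date. The sketch below is one possible way to do this, not the PoliticalMashup implementation. The record layout, the function names `party_on` and `ngram_counts`, and the sample speeches are assumptions made for illustration; only the Geert Wilders membership interval is taken from slide 19.

```python
from collections import Counter
from datetime import date

# Timestamped party memberships: (speaker, party, start, end).
# The single entry is the example from slide 19; the tuple layout is an assumption.
MEMBERSHIPS = [
    ("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]

# Invented sample speeches: (meeting date, speaker, text). Illustration only.
SAMPLE_SPEECHES = [
    (date(2001, 3, 1), "Geert Wilders", "Security security and taxes"),
    (date(2005, 1, 1), "Geert Wilders", "security"),
]


def party_on(speaker, day):
    """Return the party the speaker belonged to on `day`, or None if unknown."""
    for name, party, start, end in MEMBERSHIPS:
        if name == speaker and start <= day <= end:
            return party
    return None


def ngram_counts(speeches, term):
    """Count occurrences of a single-word `term` per (year, party)."""
    counts = Counter()
    needle = term.lower()
    for day, speaker, text in speeches:
        party = party_on(speaker, day)
        if party is None:
            # No membership known for this date: skip rather than guess.
            continue
        hits = text.lower().split().count(needle)
        if hits:
            counts[(day.year, party)] += hits
    return counts


counts = ngram_counts(SAMPLE_SPEECHES, "security")
```

The 2005 speech is skipped because no membership covers that date, which shows why the slides ask for complete, timestamped speaker data inside the documents themselves.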