Guest Lecture: Linked Open Data for the Humanities and Social Sciences
1. Linked Open Data
for the Humanities and Social Sciences
Use cases: linking government data to news data
in the PoliMedia and Talk of Europe projects
Laura Hollink
Centrum Wiskunde & Informatica (CWI)
KU Leuven
Guest lecture
November 10, 2016
2. Linked Open Data in the SSH?
Example question:
How did the debate about
the financial crisis in
Greece develop?
3. Searching the proceedings of the European
Parliament
"Greece" in the plenary meetings of the European Parliament
[Figure: number of mentions per year, 1999-2013]
5. Search volumes of a search engine
Frequency of the query “Greece” on Google
http://www.google.com/trends
We need:
✦open access to data
✦to combine sources
✦more complex queries
7. Linked Open Data in the SSH?
Example question:
Which political debate in the
post-war period has attracted
most media attention?
9. “De Indonesische Quaestie” (“The Indonesian Question”)
To answer this question we need to
go through all newspaper articles
about all political debates…
We need:
✦open access to data
✦to combine sources
✦more complex queries
11. Linked Open Data in the SSH?
Example question:
What are the differences
between different media?
Example question:
Has the coverage changed
over time?
12. Linked Open Data
A very brief introduction…
A method of publishing structured data on the Web
in such a way that it can be linked and queried
by computers as well as people.
✦open access to data
✦to combine sources
✦more complex queries
14. Structured data
Comparable to the data one may find in a database table:
Thing       Type   Population   Airport
Amsterdam   City   1364422      Schiphol
…           …      …            …
Represented as RDF triples:
ex:Amsterdam a ex:City .
ex:Amsterdam dbo:populationUrban "1364422"^^xsd:integer .
ex:Amsterdam dbp:cityServed ex:Schiphol .
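The step from a table row to triples can be sketched in a few lines of Python. This is only an illustration: the column-to-property mapping and the `ex:` prefix are assumptions chosen to mirror the slide, not part of any real dataset or library.

```python
# Hypothetical mapping from table columns to RDF properties (illustration only).
COLUMN_TO_PROPERTY = {
    "Type": "rdf:type",
    "Population": "dbo:populationUrban",
    "Airport": "dbp:cityServed",
}

def row_to_triples(row):
    """Turn one table row (a dict) into (subject, predicate, object) triples."""
    subject = "ex:" + row["Thing"]
    triples = []
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row[column]
        # Integers stay literal values; other cells are treated as resource names.
        obj = value if isinstance(value, int) else "ex:" + value
        triples.append((subject, prop, obj))
    return triples

row = {"Thing": "Amsterdam", "Type": "City", "Population": 1364422, "Airport": "Schiphol"}
triples = row_to_triples(row)
for triple in triples:
    print(triple)
```

Each cell becomes one statement about the row's subject, which is why a table with n columns yields n - 1 triples per row.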
15. On the Web
Everything is identified by URIs (documents, concepts, instances, links)
http://example.org/cities#Amsterdam
http://example.org/City
http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://dbpedia.org/ontology/population
16. On the Web
Triples can be distributed over the Web
<http://example.org/cities#Amsterdam> a ex:City .
<http://example.org/cities#Amsterdam> dbo:populationUrban "1364422"^^xsd:integer .
<http://example.org/cities#Amsterdam> dbp:cityServed ex:Schiphol .
17. On the Web
Forming a graph
[Figure: the node Amsterdam connected by "is a" to City, by "has population" to “1364422”, and by "has airport" to Schiphol]
20. The Web of Data vs. the Web of Documents
Note the differences Web of Data <-> database:
• Non-unique naming assumption
• Open World assumption
• Everyone can say anything about anything
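A small Python sketch of what the non-unique naming assumption means in practice: in a database, three keys are three rows, but on the Web of Data three URIs may denote fewer than three things. The URIs and the identity statement below are hypothetical, and the merging routine is a generic union-find, not a feature of any RDF library.

```python
def count_distinct(uris, same_as):
    """Count entities after merging URIs that are stated to denote the same thing."""
    parent = {u: u for u in uris}

    def find(u):
        # Follow parent pointers until we reach the representative of the group.
        while parent[u] != u:
            u = parent[u]
        return u

    for a, b in same_as:
        parent[find(a)] = find(b)
    return len({find(u) for u in uris})

# Hypothetical URIs: two of them turn out to describe the same city.
uris = ["ex:Amsterdam", "other:A-dam", "ex:Leuven"]
before = count_distinct(uris, [])
after = count_distinct(uris, [("ex:Amsterdam", "other:A-dam")])
print(before, after)
```

Under the open world assumption the count can also grow: statements published elsewhere may mention things this graph does not, so neither number is final.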
21. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
22. Querying Linked Open Data
• SPARQL (“SPARQL Protocol And RDF Query Language”) is a W3C recommendation for querying RDF graphs.
• See http://www.w3.org/TR/rdf-sparql-query/ or http://www.w3.org/TR/sparql11-query/
Data:
:JamesDean :playedIn :Giant .
Query                          Result
:JamesDean :playedIn ?what .   :Giant
?who :playedIn :Giant .        :JamesDean
:JamesDean ?what :Giant .      :playedIn
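The idea behind these queries, matching a triple pattern with variables against the data, can be sketched in plain Python. This is a toy matcher for a single pattern, written for illustration; it is not how a SPARQL engine is implemented.

```python
def match(pattern, triples):
    """Return the variable bindings for one triple pattern; variables start with '?'."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                # A variable that occurs twice must bind to the same value.
                if binding.get(p, t) != t:
                    break
                binding[p] = t
            elif p != t:
                break
        else:
            # No position failed, so this triple matches the pattern.
            results.append(binding)
    return results

data = [(":JamesDean", ":playedIn", ":Giant")]
r_what = match((":JamesDean", ":playedIn", "?what"), data)
r_who = match(("?who", ":playedIn", ":Giant"), data)
r_pred = match((":JamesDean", "?what", ":Giant"), data)
print(r_what, r_who, r_pred)
```

A variable may stand in any position, including the predicate, which is what the third query uses.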
23. Two example projects of Linked Open Data in SSH:
data modelling and linking in the PoliMedia and
Talk of Europe projects
26. Transcriptions of all 9,294 meetings of the Dutch
parliament between 1945 and 1995, consisting of
1,208,903 speeches.
Roughly 1.8 million news bulletins between 1937 and 1984.
(We only use 1945-1995.)
Archives of hundreds of newspapers, with a very large number of
issues and tens of millions of articles between 1618 and 1995.
(We only use 1945-1995.)
29. Step 1: Translate the Dutch parliamentary debates
to the standard structured web format RDF
[Figure: RDF graph of the debate nl.proc.sgd.d.194519460000002]
Elements visible in the figure:
• Debate (rdf:type Debate): "Handelingen Verenigde Vergadering...", with dc:id (shown as nl.proc.sgd.d.19720000002), dc:source (http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002 and http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf), dc:publisher (http://statengeneraaldigitaal.nl/), dc:language (Dutch) and dc:date (1945-11-20).
• PartOfDebate: nl.proc.sgd.d.194519460000002.1, connected to the debate by hasPart and to nl.proc.sgd.d.194519460000002.2 by hasSubsequentPartOfDebate.
• DebateContext: nl.proc.sgd.d.194519460000002.1.1, with hasText "De voorzitter opent de vergadering…".
• Speech: nl.proc.sgd.d.194519460000002.1.2, with hasSpokenText "Mijnheer de Voorzitter, de Commissie van …", connected to nl.proc.sgd.d.194519460000002.1.3 by hasSubsequentSpeech and to a newspaper article (http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr) by coveredIn.
• Speaker: Speaker_00064, with properties hasSpeaker, sem:hasActor, hasParty (Party_kvp) and hasRole (member_of_parliament).
• Party (rdf:type Party): hasAcronym KVP, hasFullName Katholieke Volkspartij.
• Politician (rdf:type Politician): foaf:firstName Joannes Antonius James, foaf:lastName Barge, rdfs:label, dc:source http://resolver.politicalmashup.nl/nl.m.00064.
30. Step 2: Discovering links between politics and news
[Figure: linking pipeline]
• Input: debates, with the date of the debate and the name of the speaker
• Detect topics in speeches -> topics
• Detect named entities in speeches -> named entities
• Create queries -> queries
• Search newspaper archive -> candidate articles
• Rank candidate articles -> links between speeches and articles
Intuition 1: The name of the speaker should
appear in the article and the article should
be published within a week of the debate.
Intuition 2: The more the article and the
speech overlap in terms of topics and
named entities, the more they are related.
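The two intuitions can be sketched as a filter followed by a ranking. The sketch below makes several assumptions that the slides do not state: "within a week" is taken in either direction, overlap is measured with the Jaccard coefficient, and the speeches and articles are hypothetical records with pre-extracted topics and entities.

```python
from datetime import date

def is_candidate(speech, article, window_days=7):
    """Intuition 1: speaker name occurs in the article, published within a week."""
    days_apart = abs((article["date"] - speech["date"]).days)
    name_found = speech["speaker"].lower() in article["text"].lower()
    return name_found and days_apart <= window_days

def overlap(speech, article):
    """Intuition 2: Jaccard overlap of topics and named entities (an assumed measure)."""
    a = speech["topics"] | speech["entities"]
    b = article["topics"] | article["entities"]
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def link(speech, articles):
    """Keep candidate articles and rank them by overlap, best first."""
    candidates = [art for art in articles if is_candidate(speech, art)]
    return sorted(candidates, key=lambda art: overlap(speech, art), reverse=True)

# Hypothetical records, for illustration only.
speech = {"speaker": "Barge", "date": date(1945, 11, 20),
          "topics": {"budget"}, "entities": {"Indonesia"}}
articles = [
    {"id": "a1", "date": date(1945, 11, 22), "text": "Mr Barge replied ...",
     "topics": {"sports"}, "entities": set()},
    {"id": "a2", "date": date(1945, 11, 21), "text": "Barge spoke about the budget",
     "topics": {"budget"}, "entities": {"Indonesia"}},
    {"id": "a3", "date": date(1946, 3, 1), "text": "Barge ...",
     "topics": {"budget"}, "entities": {"Indonesia"}},
]
ranked = [art["id"] for art in link(speech, articles)]
print(ranked)
```

Filtering first keeps the expensive comparison to a small candidate set; the ranking then decides which of those candidates become links.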
35. Representation of links
As a single triple:
architecten skos:exactMatch architects
As a link with provenance:
Link 001
concept1: architecten
concept2: architects
link type: skos:exactMatch
link method: manual
author: L. Hollink
• This is an example of the “design
pattern” referred to as n-ary
relations or relations as classes.
• It allows us to save provenance
information about the statements
we create.
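A minimal Python sketch of the pattern: the link becomes a node of its own, so that method and author can be attached to it. The `ex:` property names mirror the labels in the figure and are assumptions, not the vocabulary actually used in the project.

```python
def link_to_triples(link_id, concept1, concept2, link_type, method, author):
    """Represent a link as its own node so that provenance can be attached to it."""
    return [
        (link_id, "ex:concept1", concept1),
        (link_id, "ex:concept2", concept2),
        (link_id, "ex:linkType", link_type),
        (link_id, "ex:linkMethod", method),
        (link_id, "ex:author", author),
    ]

# The plain triple has no place to record who created it or how.
plain = ("ex:architecten", "skos:exactMatch", "ex:architects")

# The n-ary version carries the same link plus its provenance.
nary = link_to_triples("ex:Link001", "ex:architecten", "ex:architects",
                       "skos:exactMatch", "manual", "L. Hollink")
for triple in nary:
    print(triple)
```

The cost of the pattern is more triples per link and longer queries; the benefit is that every link can be filtered by how and by whom it was made.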
36. Evaluation of Links
Recall that we aim to use the links to answer a research
question.
Can we still do that if there are errors in the links?
How many errors are acceptable?
We need to know the quality!
How would you determine the quality of the links?
1. Manually rating (a sample of) mappings
• relatively cheap and easy to interpret
• only precision, no recall
2. Comparison to manually found links
• precision and recall
• more expensive! (but: crowdsourcing?)
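Comparison to manually found links yields both measures. A short Python sketch, using hypothetical (speech, article) pairs: precision is the share of automatically found links that are correct, recall is the share of the manually found links that were also found automatically.

```python
def precision_recall(found, reference):
    """Compare automatically found links with a manually created reference set."""
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical links, each a (speech, article) pair.
found = {("s1", "a1"), ("s1", "a2"), ("s2", "a3"), ("s2", "a9")}
reference = {("s1", "a1"), ("s2", "a3"), ("s2", "a4")}
precision, recall = precision_recall(found, reference)
print(precision, recall)
```

Rating only a sample of the found links gives the first number but not the second, because the links that were missed never appear in the sample.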
40. Evaluation of links in PoliMedia
How good are the links?
• We ask 2 raters to manually score pairs of
newspaper articles and speeches.
• A pilot study showed that we needed more
than a 2-point scale.
• Inter-rater agreement: 0.5 -> acceptable,
but not high.
• Score: 80% (scores 1 + 2 in Setting 3)
Score                               Setting 1   Setting 2   Setting 3
I don’t know                        0.14        0.15        0.08
0 - unrelated                       0.38        0.23        0.12
1 - related                         0.29        0.36        0.36
2 - explicit mention of the debate  0.19        0.26        0.44
1 + 2                               0.48        0.62        0.80
How many links did we miss?
• We ask the raters to manually search the
archives of the National Library for related articles.
• Score: 62%
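The slides do not say which statistic the inter-rater agreement of 0.5 refers to. One common choice for two raters is Cohen's kappa, sketched below under that assumption; the ratings are hypothetical and use the 0/1/2 scale from the table.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: agreement corrected for chance agreement."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability that both raters pick the same category at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two raters for the same eight speech-article pairs.
rater_1 = [0, 1, 2, 2, 1, 0, 2, 1]
rater_2 = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

Raw agreement overstates reliability when one category dominates, which is why a chance-corrected statistic is usually reported. The function is undefined when chance agreement equals 1, for example when both raters give every pair the same single score.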
43. Results
• An open data set of Dutch parliamentary debates,
• with almost 3 million links between 450,000 speeches and 1.5 million
newspaper articles and radio bulletins at the National Library,
• accessible through a Web demonstrator and through a SPARQL endpoint.
51. Online database:
SPARQL endpoint
• A service to query a knowledge
base using the SPARQL query
language.
“All speeches with more
than 60 associated news
items.”
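A sketch of how such a question could be sent to a SPARQL endpoint from Python. The endpoint URL and the `pm:` namespace are hypothetical placeholders, and `coveredIn` is borrowed from the data model shown earlier; the query in the actual demonstrator may look different. Only the request URL is built here, nothing is sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# Speeches with more than 60 associated news items (assumed vocabulary).
QUERY = """
PREFIX pm: <http://example.org/polimedia#>
SELECT ?speech (COUNT(?item) AS ?n)
WHERE { ?speech pm:coveredIn ?item . }
GROUP BY ?speech
HAVING (COUNT(?item) > 60)
"""

def build_request_url(endpoint, query):
    """In the SPARQL Protocol a query can be passed as the 'query' parameter of a GET."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)
print(url)
```

The aggregation happens on the server: GROUP BY collects the news items per speech and HAVING keeps only the groups above the threshold.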
56. The European Parliament as Linked Open Data
Laura Hollink Centrum Wiskunde & Informatica, Amsterdam
Astrid van Aggelen VU University Amsterdam
Martijn Kleppe Erasmus University Rotterdam
Henri Beunders Erasmus University Rotterdam
Jill Briggeman Erasmus University Rotterdam
Max Kemman University of Luxembourg
57. Talk of Europe goals
• To publish the entire plenary debates of the European
Parliament as Linked Open Data
• To improve access to the data
• To enable large scale analysis across time spans.
‣ To residents of the European Union, access to the proceedings
of the European Parliament is a formal right.
58. Step 1: Translate the
European parliamentary
debates to Linked
Open Data
60. 14M RDF statements about the 30K
speeches in 23 languages by 3K
speakers in 1K session days that
were held in the EU parliament
between 1999 and 2014
62. How to relate a speech to the party of the speaker?
A first attempt:
lp:eu/plenary/2009-10-21/Speech_140 lpv:speaker lp:EUmember_1023 .
lp:EUmember_1023 lpv:hasParty lp:EUParty/SomeParty .
Why is this not a good solution?
1. A person might be a member of more than one party (at different times).
2. Since there is no link between a speech and a party, queries for all speeches
spoken by the members of a certain party become very complicated.
66. How to relate a speech to the party of the
speaker?
Political functions of the speaker:
lp:EUmember_1023 lpv:politicalFunction lp:political-Function101 , lp:political-Function102 .
lp:political-Function101 rdf:type lpv:PoliticalFunction ;
    lpv:institution lp:EUParty/NI ;
    lpv:role lp:Role/member ;
    lpv:beginning "20071114"^^xsd:date ;
    lpv:end "20111126"^^xsd:date .
lp:political-Function102 rdf:type lpv:PoliticalFunction ;
    lpv:institution lp:EUCommittee/Committee_on_Legal_Affairs ;
    lpv:role lp:Role/substitute ;
    lpv:beginning "20090716"^^xsd:date ;
    lpv:end "20111126"^^xsd:date .
The speech:
lp:eu/plenary/2009-10-21/Speech_140 lpv:speaker lp:EUmember_1023 ;
    lpv:spokenAs lp:political-Function101 , lp:political-Function102 .
Note: this is another example of the
design pattern called n-ary relations or
relations as classes.
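With a period attached to every political function, the functions that apply to a given speech follow from its date. The Python sketch below uses the values shown on the slide; that the lpv:spokenAs links are derived in exactly this way is an assumption.

```python
from datetime import date, datetime

# Political functions of lp:EUmember_1023, with the values shown on the slide.
FUNCTIONS = [
    {"id": "lp:political-Function101", "institution": "lp:EUParty/NI",
     "role": "lp:Role/member", "beginning": "20071114", "end": "20111126"},
    {"id": "lp:political-Function102",
     "institution": "lp:EUCommittee/Committee_on_Legal_Affairs",
     "role": "lp:Role/substitute", "beginning": "20090716", "end": "20111126"},
]

def to_date(compact):
    """Parse a date written as YYYYMMDD, the notation used on the slide."""
    return datetime.strptime(compact, "%Y%m%d").date()

def spoken_as(speech_date, functions):
    """Return the functions whose period contains the date of the speech."""
    day = date.fromisoformat(speech_date)
    return [f for f in functions if to_date(f["beginning"]) <= day <= to_date(f["end"])]

active = spoken_as("2009-10-21", FUNCTIONS)
parties = [f["institution"] for f in active if f["institution"].startswith("lp:EUParty/")]
print([f["id"] for f in active], parties)
```

Once these links are stored with the speech, "all speeches by members of a certain party" needs no date comparison at query time.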
75. Linking Members of Parliament to Wikipedia /
DBpedia
• String matching is the most important feature in the linking process.
• “nearly all [alignment systems] use a string similarity metric” [12]
• Stop-word removal and stemming are not helpful, nor is using WordNet synonyms. [12]
[12] Cheatham, M., & Hitzler, P. String similarity metrics for ontology alignment. ISWC 2013.
http://www.dbpedia.org/page/Judith_Sargentini
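A minimal sketch of name matching with a string similarity metric. It uses `SequenceMatcher` from Python's standard `difflib` module as a stand-in; the metric, the threshold and the candidate list are assumptions, not the ones used in Talk of Europe.

```python
from difflib import SequenceMatcher

def normalise(label):
    """Lower-case a label and turn DBpedia-style underscores into spaces."""
    return label.replace("_", " ").lower()

def similarity(a, b):
    """Similarity between two labels, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate, or None if nothing is similar enough."""
    score, label = max((similarity(name, c), c) for c in candidates)
    return label if score >= threshold else None

# Candidate DBpedia resource names (a hypothetical shortlist).
candidates = ["Judith_Sargentini", "Catherine_Stihler"]
match_found = best_match("Judith Sargentini", candidates)
no_match = best_match("Someone Else", candidates)
print(match_found, no_match)
```

The threshold trades precision against recall: a high value misses spelling variants, a low value links different people with similar names.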
76. Example query 1: speeches that contain a certain
keyword
Query: all speeches that contain the phrase “open data”
…. So let us go for open data, let us
go for utilisation of all the instruments
available to that end! …..
…. but there too governments are
encouraging the use of open data to
increase transparency, accountability
and citizen participation ….
…. We already have many open data
projects in the Member States and
local authorities…..
77. Example 2: speeches that contain a certain
keyword by date
"Slovenia" in the plenary meetings of the European Parliament
[Figure: number of mentions per year, 1999-2013]
79. Example 2: speeches that contain a certain keyword
by date
[Figure: mentions of 'human rights'; frequency per year, 1999-2013]
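The counts behind such a plot can be computed along these lines. The sketch assumes each speech is available as a (date, text) pair with an ISO date and counts speeches that contain the keyword; the sample speeches are hypothetical.

```python
from collections import Counter

def mentions_per_year(speeches, keyword):
    """Count, per year, the speeches whose text contains the keyword."""
    counts = Counter()
    for iso_date, text in speeches:
        if keyword.lower() in text.lower():
            counts[iso_date[:4]] += 1  # the year is the first four characters
    return dict(sorted(counts.items()))

# Hypothetical speeches, for illustration only.
speeches = [
    ("1999-09-14", "Human rights must be respected ..."),
    ("1999-10-05", "On the budget ..."),
    ("2009-10-21", "... a resolution on human rights ..."),
    ("2009-11-11", "... human rights and open data ..."),
]
per_year = mentions_per_year(speeches, "human rights")
print(per_year)
```

Raw counts are sensitive to how many speeches were held in a year, so dividing by the yearly total may be needed before comparing years.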
80. Example 3: speeches that contain a certain keyword
by country
[Figure: mentions of 'human rights' by country, for AT BE BG CY CZ DE DK EE ES FI FR GB GR HR HU IE IT LT LU LV MT NL PL PT RO SE SI SK]
81. Example 4: the number of speeches per EU
country
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX lpv: <http://purl.org/linkedpolitics/vocabulary#>
# Number of speeches per country of representation of the speaker
SELECT ?c (COUNT(?x) AS ?count)
WHERE {
  ?x rdf:type <http://purl.org/linkedpolitics/vocabulary/eu/plenary/Speech> .
  ?x lpv:speaker ?p .
  ?p lpv:countryOfRepresentation ?c .
} GROUP BY ?c LIMIT 50
82. Example 5: include data from an external source
Query: MEPs that were born outside Europe.
Members of Parliament
(DBpedia contains info on
birthplace, birth date, schools,
careers, residence, family, etc.)
84. Intermezzo: one-question Quiz
Reasoning on the Web of Data
Question: What can we conclude from this graph?
A. Stihler is a member of exactly 3 parties
B. Stihler is a member of at least 3 parties
C. Stihler is a member of at most 3 parties
D. None of the above
E. All of the above
F. Other, namely ….
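<http://purl.org/linkedpolitics/EUmember_4545> foaf:name "Catherine Stihler" .
<http://purl.org/linkedpolitics/EUmember_4545> :memberOf <http://purl.org/linkedpolitics/EUParty/PES> .
<http://purl.org/linkedpolitics/EUmember_4545> :memberOf <http://dbpedia.org/resource/Party_of_European_Socialists> .
<http://purl.org/linkedpolitics/EUmember_4545> :memberOf <http://dbpedia.org/resource/Progressive_Alliance_of_Socialists_and_Democrats> .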
85. Results
• An open data set of EU parliamentary debates,
• with links to other sources on the Web of Data
• accessible through a SPARQL endpoint
86. Reflection: to what extent can we now answer
these questions?
How did the debate about the
financial crisis in Greece
develop?
Which political event has
attracted most media
attention?
What are the differences
between different media?
Has the coverage changed
over time?
We can, but:
• what is the influence of the selection of newspapers
available at the National Library?
• what was the quality of the digitisation process (OCR)?
• How good is our linking approach (based on
automatically detected entities and topics)?
• How much can we trust the quality of external sources?
➡ How to handle these uncertainties is one of our research
questions. We call this Tool Criticism
88. Research directions at CWI
Transparent, reproducible analysis of large volumes of connected,
heterogeneous, multimodal data.
1. How do we automatically link heterogeneous datasets?
2. How do we interpret links between datasets of different quality and certainty?
3. How do we handle the fact that knowledge evolves?
4. How do we design interfaces that allow scholars to study the datasets
• including the links between them?
• while assessing the reliability of the findings?
Data Science - Big Data - Web of Data
90. PoliMedia demo: http://polimedia.nl/
PoliMedia project video: https://youtu.be/u24oRCj7xrQ
Talk of Europe project: http://talkofeurope.eu/
Talk of Europe data: purl.org/linkedpolitics
Talk of Europe project video: https://youtu.be/GxA53gkCe0o
My website: http://homepages.cwi.nl/~hollink/
A. van Aggelen, L. Hollink, M. Kemman, M. Kleppe & H. Beunders. The
debates of the European Parliament as Linked Open Data. Semantic Web
Journal. In press, 2016.
M. Kleppe, L. Hollink, J. Oomen, M. Kemman, D. Juric, J. Blom, H.
Beunders. PoliMedia - Improving the Analyses of Radio & Newspaper
coverage of Political Debates. First prize winner of the LinkedUp Veni
Competition, presented at the Open Knowledge Conference (OKCon),
Geneva, September 2013.
I’d be happy to answer any questions!