Data Interlinking 
together with Crowd Workers 
Cristina Sarasua 
2nd DBpedia Community Meeting, Leipzig 
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Image: 
http://www.w3.org/DesignIssues/diagrams/lod/597992118v2_350x350_Back.jpg 
Scenario for data interlinking 
Music data integration 
What for? 
• A: Extending the description of resources, enabling richer queries
  d1:song1 owl:sameAs dbpedia:song1
  d1:song1 o:wasPlayedIn dbpedia:Leipzig
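How a link extends a resource description can be sketched with a toy in-memory triple set. This is only an illustration: the `describe` helper and the artist triple attached to `dbpedia:song1` are assumptions made for the example, and the lookup follows `owl:sameAs` one hop in one direction only.

```python
# Toy triple store; the artist triple on dbpedia:song1 is illustrative.
triples = {
    ("d1:song1", "ma:title", "UFO"),
    ("d1:song1", "owl:sameAs", "dbpedia:song1"),
    ("dbpedia:song1", "prop:artist", "dbpedia:Coldplay"),
}


def describe(resource, triples):
    """Return (property, value) pairs of a resource and its direct owl:sameAs targets."""
    equivalents = {resource} | {
        o for s, p, o in triples if s == resource and p == "owl:sameAs"
    }
    return {(p, o) for s, p, o in triples if s in equivalents and p != "owl:sameAs"}


description = describe("d1:song1", triples)
print(sorted(description))
```

Without the owl:sameAs triple, a query for the artist of d1:song1 would find nothing; with it, the DBpedia description becomes reachable.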
The Problem 
D1
d1:song1 a ma:AudioTrack;
    owl:sameAs ? ;
    ma:title "UFO";
    ma:locator musicexample:s1896.mp3^^xsd:anyURI;
    ma:hasKeyword d1:colplay .
DBpedia
dbpedia:U.F.O._(song) a dbpedia-owl:Work;
    a dbpedia-owl:Song;
    dc:title "U.F.O.";
    prop:artist dbpedia:Coldplay .
dbpedia:UFO_(band) a dbpedia-owl:Band;
    prop:name "U.F.O." .
Automatic:
• Goal: typed link to create (e.g. owl:sameAs)
• Information to analyse (i.e. attribute values)
• Decision criterion (e.g. Levenshtein distance with threshold 2)
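The decision criterion above can be sketched as follows. This is a minimal illustration rather than the implementation behind the talk: the function names, the punctuation-stripping step, and the reading of the threshold as "distance of at most 2" are all assumptions.

```python
import string


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(title: str) -> str:
    """Assumed preprocessing: lower-case and drop ASCII punctuation."""
    return title.lower().translate(str.maketrans("", "", string.punctuation))


def is_candidate(title_a: str, title_b: str, threshold: int = 2) -> bool:
    """Propose a link when the normalised titles are within the threshold."""
    return levenshtein(normalise(title_a), normalise(title_b)) <= threshold


# Titles taken from the example slides.
print(levenshtein("UFO", "U.F.O."))   # distance on the raw titles
print(is_candidate("UFO", "U.F.O."))  # decision after normalisation
```

A title-only criterion like this accepts every resource titled "U.F.O.", whether it is a song or a band, which is why the attribute values alone may not be enough to decide.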
D1
d1:song1 a ma:AudioTrack;
    owl:sameAs ? ;
    ma:title "UFO";
    ma:locator musicexample:s1896.mp3^^xsd:anyURI;
    ma:hasKeyword d1:colplay .
DBpedia
dbpedia:U.F.O._(song) a dbpedia-owl:Work;
    a dbpedia-owl:Song;
    dc:title "U.F.O.";
    prop:artist dbpedia:Coldplay .
dbpedia:UFO_(band) a dbpedia-owl:Band;
    prop:name "U.F.O." .
Human to guide the process
D1
d1:song1 a ma:AudioTrack;
    owl:sameAs ? ;
    ma:title "Soon";
    ma:locator musicexample:s1896.mp3^^xsd:anyURI .
DBpedia
dbpedia:Transatlantic_KK a dbpedia-owl:Work;
    a dbpedia-owl:Album;
    dc:title "Soon";
    dbprop:artist dbpedia:Delorean_(band) .
dbpedia:Soon_(Tanya_Tucker_song) a dbpedia-owl:Work;
    a dbpedia-owl:MusicalWork;
    dc:title "Soon";
    dbprop:artist dbpedia:Tanya_Tucker .
Human to correct
D1
d1:song1 a ma:AudioTrack;
    o:wasPlayedIn ? ;
    ma:title "UFO";
    ma:locator musicexample:s1896.mp3^^xsd:anyURI;
    ma:hasKeyword d1:colplay .
DBpedia
dbpedia:Leipzig a dbpedia-owl:Place;
    rdfs:label "Leipzig" .
Human to create new links
Human:
• Creative and proactive
• Listen / watch / search
• Process / associate / draw more complicated conclusions
The Approach 
Crowd-powered data interlinking 
• Building a system that
  – Combines algorithmic and human computation
  – Systematically involves humans via microtasks
  – Considers the aforementioned types of links
  – Covers schema- and instance-level links
Automatic interlinking
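One way to combine algorithmic and human computation is to let the algorithm settle the clear cases and send the uncertain candidates to the crowd as microtasks. The sketch below assumes a hypothetical similarity score in [0, 1] produced by an automatic matcher and two illustrative thresholds; the class, the function, the scores and the threshold values are all invented for the example and do not describe the system presented in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateLink:
    source: str     # resource in the local data set
    target: str     # resource in the target data set
    predicate: str  # type of link to create
    score: float    # hypothetical similarity score from an automatic matcher


def route(candidates, accept_at=0.9, reject_below=0.3):
    """Split candidates into auto-accepted, auto-rejected and crowd-reviewed."""
    accepted, rejected, for_crowd = [], [], []
    for link in candidates:
        if link.score >= accept_at:
            accepted.append(link)
        elif link.score < reject_below:
            rejected.append(link)
        else:
            for_crowd.append(link)  # uncertain: becomes a microtask
    return accepted, rejected, for_crowd


# Scores are made up for illustration.
candidates = [
    CandidateLink("d1:song1", "dbpedia:U.F.O._(song)", "owl:sameAs", 0.95),
    CandidateLink("d1:song1", "dbpedia:UFO_(band)", "owl:sameAs", 0.60),
    CandidateLink("d1:song1", "dbpedia:Leipzig", "owl:sameAs", 0.05),
]
accepted, rejected, for_crowd = route(candidates)
print([link.target for link in for_crowd])
```

The thresholds control how much work reaches the crowd: widening the uncertain band raises cost but lets humans catch more algorithmic mistakes.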
Overview 
It worked! Quick and inexpensive
See CrowdMAP [Sarasua et al., 2012] 
A microtask 
Challenge #1: It has to work with ANYONE 
Challenge #2: We still want a data-independent solution 
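A data-independent microtask can be generated from whatever property-value pairs the two resources have, without hard-coding any vocabulary. The sketch below is illustrative only: the input layout, the question wording and the answer options are assumptions, not the interface shown in the talk.

```python
def render_description(uri, properties):
    """Format any resource as generic 'property: value' lines."""
    lines = [f"Resource: {uri}"]
    lines += [f"  {prop}: {value}" for prop, value in sorted(properties.items())]
    return "\n".join(lines)


def build_microtask(source, target, predicate):
    """Build a plain-text validation question for one candidate link."""
    (source_uri, source_props), (target_uri, target_props) = source, target
    return {
        "question": f"Are these two resources connected by '{predicate}'?",
        "left": render_description(source_uri, source_props),
        "right": render_description(target_uri, target_props),
        "options": ["Yes", "No", "I don't know"],
    }


task = build_microtask(
    ("d1:song1", {"ma:title": "UFO", "ma:hasKeyword": "d1:colplay"}),
    ("dbpedia:U.F.O._(song)", {"dc:title": "U.F.O.", "prop:artist": "dbpedia:Coldplay"}),
    "owl:sameAs",
)
print(task["question"])
print(task["left"])
print(task["right"])
```

Because the renderer only iterates over property-value pairs, the same function works for any pair of data sets; making the raw property names understandable to anyone remains the harder part.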
Ongoing work 
Picture: Icon made by Freepik from http://www.flaticon.com 
How to improve?
How to optimize the process? 
Crowdsourcing approaches 
• Additional incentives so that workers process more links, faster (e.g. display the number of links left)
• Let them explain to others: write the argument for the decision
• Show a similar link: decide by comparison
How to optimize the process? 
Crowdsourcing approaches 
• Additional incentives so that workers process more links, faster (e.g. display the number of links left)
• Let them explain to others: write the argument for the decision
• Show a similar link: decide by comparison
Challenge #3: How to decide what is an analogous link here? (danger of bias?)
  predicate rdf:type · false positive / negative
How to optimize the process? 
Data-oriented approaches 
• Test and instructing links: targeted selection 
• Scheduled sequences of links to process, so that they make more sense
How to optimize the process? 
Data-oriented approaches 
• Test and instructing links: targeted selection 
Challenge #4: How to build that programmatically?
  data analysis / data + crowd / data + expert
• Scheduled sequences of links to process
• Validate vs. identify microtasks: difficult case, rare / easy case, common
How to optimize the process? (II)
Data-oriented approaches
• Test and instructing links: targeted selection
• Scheduled sequences of links to process, so that they make more sense
• Validate vs. identify microtasks
Challenge #5: How to predict how suitable a worker will be for processing a particular link?
Which features of links influence the prediction?
• Previous cross-platform experience (CrowdWorkCV), see also [Sarasua et al., 2013]
• Ranking a list of suitable links based on training links
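Ranking links by suitability could, for instance, score each open link by how much its features overlap with training links the worker has already handled well. The sketch below uses Jaccard similarity over hypothetical feature sets; the feature choice, the scoring rule and all names are assumptions for illustration, not the method under study.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_links(open_links: dict, training_links: list) -> list:
    """Order open links by their best similarity to any training link."""
    def suitability(features):
        return max((jaccard(features, t) for t in training_links), default=0.0)

    return sorted(open_links, key=lambda name: suitability(open_links[name]),
                  reverse=True)


# Hypothetical features: link predicate, resource type, topical domain.
training_links = [{"owl:sameAs", "Song", "music"}]
open_links = {
    "link_a": {"owl:sameAs", "Place", "geography"},
    "link_b": {"owl:sameAs", "Song", "music"},
    "link_c": {"o:wasPlayedIn", "Song", "music"},
}
print(rank_links(open_links, training_links))
```

Which features actually carry predictive signal is exactly the open question raised on this slide, so the feature sets here are placeholders.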
How to optimize the process? (II)
Data-oriented approaches
• Test and instructing links: targeted selection
• Scheduled sequences of links to process, so that they make more sense
Challenge #6: How should we assess a priori whether (and approximately to what extent) we need crowdsourcing for a particular pair of data sets?
Closing 
Take-away messages 
• Yes, microtask crowdsourcing lets you involve humans in processing large amounts of data; it is cost-effective and fast
• Research shows it is a feasible complement to data interlinking algorithms
• BUT do not underestimate microtask management
Coming soon … 
http://github.com/criscod 
Open question: wouldn't crowd-powered data interlinking enrich this table?
[Schmachtenberg et al., 2014] 
Thank you for your attention! 
Contact: 
Cristina Sarasua 
Institute for Web Science and Technologies 
Universität Koblenz-Landau 
csarasua@uni-koblenz.de 
References 
• Sarasua, C.: Crowdsourced Interlinking on the Web of Data. In: 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW), Doctoral Symposium. (2012)
• Sarasua, C., Simperl, E., Noy, N.F.: CrowdMAP: Crowdsourcing ontology alignment with microtasks. In: Proceedings of the 11th International Semantic Web Conference (ISWC). (2012)
• Sarasua, C., Thimm, M.: Microtask available, send us your CV! In: Proceedings of the International Workshop on Crowd Work and Human Computation (CrowdWork 2013). (2013)
• Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Semantic Web Conference (ISWC 2014), RDB Track, Riva del Garda, Italy. (2014)