This document discusses using microtask crowdsourcing to enhance linked data applications. It describes how crowdsourcing can be used in various components of the linked data integration process, including data cleansing, vocabulary mapping, and entity interlinking. Specific crowdsourcing applications and systems are discussed that address tasks like assessing the quality of DBpedia triples, entity linking with ZenCrowd, and understanding natural language queries with CrowdQ. The results show that crowdsourcing can often improve the output of automated techniques for various linked data tasks and help integrate and enhance large linked data sources.
This presentation looks in detail at SPARQL (SPARQL Protocol and RDF Query Language) and introduces approaches for querying and updating semantic data. It covers the SPARQL algebra, the SPARQL protocol, and provides examples for reasoning over Linked Data. We use examples from the music domain, which can be directly tried out and run over the MusicBrainz dataset. This includes gaining some familiarity with the RDFS and OWL languages, which allow developers to formulate generic and conceptual knowledge that can be exploited by automatic reasoning services in order to enhance the power of querying.
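To make the querying idea concrete, here is a minimal sketch of how a SPARQL basic graph pattern is evaluated, written in plain Python; the triples, prefixes and property names are invented for illustration and are not the real MusicBrainz vocabulary:

```python
# Minimal sketch of SPARQL basic graph pattern (BGP) matching, to
# illustrate how a query like
#   SELECT ?title WHERE { ?rec ex:maker ex:beatles . ?rec ex:title ?title . }
# is evaluated.  All URIs and data below are made up for the example.

triples = {
    ("ex:abbey_road", "ex:maker", "ex:beatles"),
    ("ex:abbey_road", "ex:title", "Abbey Road"),
    ("ex:help",       "ex:maker", "ex:beatles"),
    ("ex:help",       "ex:title", "Help!"),
    ("ex:thriller",   "ex:maker", "ex:jackson"),
    ("ex:thriller",   "ex:title", "Thriller"),
}

def match(pattern, binding):
    """Yield extended bindings for one triple pattern ('?x' marks a variable)."""
    for triple in triples:
        b = dict(binding)
        ok = True
        for p, t in zip(pattern, triple):
            if p.startswith("?"):            # variable: bind, or check old binding
                if b.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:                     # constant: must match exactly
                ok = False
                break
        if ok:
            yield b

def bgp(patterns):
    """Join the solutions of all triple patterns (the SPARQL BGP semantics)."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b)]
    return bindings

# ?rec is a record by the Beatles; project its ?title.
solutions = bgp([("?rec", "ex:maker", "ex:beatles"),
                 ("?rec", "ex:title", "?title")])
print(sorted(b["?title"] for b in solutions))  # ['Abbey Road', 'Help!']
```

A real engine implements the full SPARQL algebra (OPTIONAL, FILTER, UNION, and so on) on top of exactly this kind of pattern join.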
This presentation covers the whole spectrum of Linked Data production and exposure. After a grounding in the Linked Data principles and best practices, with special emphasis on the VoID vocabulary, we cover R2RML, operating on relational databases, Open Refine, operating on spreadsheets, and GATECloud, operating on natural language. Finally we describe the means to increase interlinkage between datasets, especially the use of tools like Silk.
This presentation focuses on providing means for exploring Linked Data. In particular, it gives an overview of current visualization tools and techniques, looking at semantic browsers and applications for presenting the data to the end user. We also describe existing search options, including faceted search, concept-based search and hybrid search, based on a mix of using semantic information and text processing. Finally, we conclude with approaches for Linked Data analysis, describing how available data can be synthesized and processed in order to draw conclusions.
This presentation gives details on technologies and approaches towards exploiting Linked Data by building LD applications. In particular, it gives an overview of popular existing applications and introduces the main technologies that support implementation and development. Furthermore, it illustrates how data exposed through common Web APIs can be integrated with Linked Data in order to create mashups.
Big Linked Data - Creating Training Curricula (EUCLID project)
This presentation includes an overview of the basic rules to follow when developing training and education curricula for Linked Data and Big Linked Data
Usage of Linked Data: Introduction and Application Scenarios (EUCLID project)
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge for how data can be published over the Web, how it can be queried, and what are the possible use cases and benefits. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
This presentation addresses the main issues of Linked Data and scalability. In particular, it provides details on approaches and technologies for clustering, distributing, sharing, and caching data. Furthermore, it addresses the means for publishing data through cloud deployment and the relationship between Big Data and Linked Data, exploring how some of the solutions can be transferred to the context of Linked Data.
Existing data management approaches assume control over schema, data and data generation, which is not the case in open, de-centralised environments such as the Web. The lack of control means that there are social processes necessary to generate 'ordo ab chao' and hence a new life cycle model is necessary.
Based on our experience in Linked Data publishing and consumption over the past years, we have identified the involved parties and fundamental phases, which provide for a multitude of so-called Linked Data life cycles.
If you want to hear me speak to the slides, you might want to check out the following videos on YouTube:
Part 1: http://www.youtube.com/watch?v=AFJSMKv5s3s
Part 2: http://www.youtube.com/watch?v=G6YJSZdXOsc
Part 3: http://www.youtube.com/watch?v=OagzNpDEPJg
As described in the April NISO/DCMI webinar by Dan Brickley, schema.org is a search-engine initiative aimed at helping webmasters use structured data markup to improve the discovery and display of search results. Drupal 7 makes it easy to mark up HTML pages with schema.org terms, allowing users to quickly build websites with structured data that can be understood by Google and displayed as Rich Snippets.
Improved search results are only part of the story, however. Data-bearing documents become machine-processable once you find them. The subject matter, important facts, calendar events, authorship, licensing, and whatever else you might like to share are there for the taking. Sales reports, RSS feeds, industry analysis, maps, diagrams and process artifacts can now connect back to other data sets to provide linkage to context and related content. The key to this is the adoption of standards for both the data model (RDF) and the means of weaving it into documents (RDFa). Drupal 7 has become the leading content platform to adopt these standards.
This webinar will describe how RDFa and Drupal 7 can improve how organizations publish information and data on the Web for both internal and external consumption. It will discuss what is required to use these features and how they impact publication workflow. The talk will focus on high-level and accessible demonstrations of what is possible. Technical people should learn how to proceed while non-technical people will learn what is possible.
The slideset used to conduct an introduction/tutorial
on DBpedia use cases, concepts and implementation
aspects held during the DBpedia community meeting
in Dublin on the 9th of February 2015.
(slide creators: M. Ackermann, M. Freudenberg
additional presenter: Ali Ismayilov)
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud (Ontotext)
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage (Ontotext)
Scholars, book researchers and museum directors who try to find the underlying connections between resources face many issues. Scholars in particular continuously emphasize the role of the digital humanities and the value of linked data in cultural heritage information systems.
The Information Workbench - Linked Data and Semantic Wikis in the Enterprise (Peter Haase)
The Information Workbench is a platform for Linked Data applications in the enterprise. Targeting the full life-cycle of Linked Data applications, it facilitates the integration and processing of Linked Data following a Data-as-a-Service paradigm.
In this talk we present how we use Semantic Wiki technologies in the Information Workbench for the development of user interfaces for interacting with the Linked Data. The user interface can be easily customized using a large set of widgets for data integration, interactive visualization, exploration and analytics, as well as the collaborative acquisition and authoring of Linked Data. The talk will feature a live demo illustrating an example application, a Conference Explorer integrating data about the SMWCon conference, publications and social media.
We will also present solutions and applications of the Information Workbench in a variety of other domains, including the Life Sciences and Data Center Management.
A set of slides that provides a high-level overview of the W3C Linked Data Platform specification presented at the 4th Linked Data in Architecture and Construction Workshop.
For a more detailed and technical version of the presentation, please refer to
http://www.slideshare.net/nandana/learning-w3c-linked-data-platform-with-examples
LDAC 2016 programme
http://smartcity.linkeddata.es/LDAC2016/#programme
The W3C Linked Data Platform (LDP) specification describes a set of best practices and a simple approach for a read-write Linked Data architecture, based on HTTP access to web resources that describe their state using the RDF data model. This presentation provides a set of simple examples that illustrate how an LDP client can interact with an LDP server in the context of a read-write Linked Data application, i.e., how to use the LDP protocol for retrieving, updating, creating and deleting Linked Data resources.
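As an illustration of those interactions, the sketch below assembles the HTTP messages an LDP client would send. The container URL, resource name, Turtle payloads and ETag value are placeholders, while the Slug, Link and If-Match headers are the ones the LDP specification builds on:

```python
# Sketch of LDP client requests, assembled as plain HTTP message text.
# Host, paths, payloads and the ETag value are placeholders.

LDP_RESOURCE = '<http://www.w3.org/ns/ldp#Resource>; rel="type"'

def request(method, path, headers, body=""):
    """Render one HTTP/1.1 request as text."""
    lines = [f"{method} {path} HTTP/1.1", "Host: example.org"]
    lines += [f"{k}: {v}" for k, v in headers.items()]
    return "\r\n".join(lines) + "\r\n\r\n" + body

# Create a new resource inside a container (POST) ...
create = request("POST", "/container/", {
    "Slug": "resource1",                 # hint for the new resource's URI
    "Link": LDP_RESOURCE,                # type of the resource being created
    "Content-Type": "text/turtle",
}, "<> a <http://example.org/Thing> .")

# ... then replace its state (PUT), guarding against lost updates.
update = request("PUT", "/container/resource1", {
    "If-Match": '"etag-from-last-GET"',  # ETag obtained from a prior GET
    "Content-Type": "text/turtle",
}, "<> a <http://example.org/OtherThing> .")

print(create.splitlines()[0])  # POST /container/ HTTP/1.1
```

Retrieval and deletion follow the same shape with GET (plus an Accept header such as `text/turtle`) and DELETE.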
The Power of Semantic Technologies to Explore Linked Open Data (Ontotext)
The presentation given by Atanas Kiryakov, Ontotext’s CEO, at the first edition of Graphorum (http://graphorum2017.dataversity.net/) – a new forum that taps into the growing interest in Graph Databases and Technologies. Graphorum is co-located with the Smart Data Conference, organized by the digital publishing platform Dataversity.
The presentation demonstrates the capabilities of Ontotext’s own approach to contributing to the discipline of more intelligent information gathering and analysis by:
- graphically exploring the connectivity patterns in big datasets;
- building new links between identical entities residing in different data silos;
- getting insights into what types of queries can be run against various linked data sets;
- reliably filtering information based on relationships, e.g., between people and organizations, in the news;
- demonstrating the conversion of tabular data into RDF.
Learn more at http://ontotext.com/.
Conference Live: Accessible and Sociable Conference Semantic Data (Anna Lisa Gentile)
In this paper we describe Conference Live, a semantic Web application to browse conference data. Conference Live is a Web and mobile application based on conference data from the Semantic Web Dog Food server, which provides facilities to browse papers and authors at a specific conference. Available data for the specific conference is enriched with social features (e.g. integrated Twitter accounts of paper authors), scheduling features (calendar information is attached for paper presentations and social events), the possibility to check and add feedback on each paper, and the possibility to vote for papers if the conference includes sessions where participants can vote, as is popular e.g. for poster sessions. As a use case we report on the usage of the application at the Extended Semantic Web Conference (ESWC) in May 2014.
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing (Maribel Acosta Deibe)
Best Student Paper Award at the 8th International Conference on Knowledge Capture (K-CAP 2015).
http://tinyurl.com/hare-paper
Abstract:
Due to the semi-structured nature of RDF data, missing values affect answer completeness of queries that are posed against RDF. To overcome this limitation, we present HARE, a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. We propose a model that exploits the characteristics of RDF in order to estimate the completeness of portions of a data set. The completeness model, complemented by crowd knowledge, is used by the HARE query engine to decide on the fly which parts of a query should be executed against the data set or via crowd computing. To evaluate HARE, we created and executed a collection of 50 SPARQL queries against the DBpedia data set. Experimental results clearly show that our solution accurately enhances answer completeness.
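A toy sketch of the routing idea described in the abstract: estimate how complete the data set is for a predicate, and fall back to the crowd when the estimate is low. The estimator below (the fraction of same-class subjects that carry the predicate) is a deliberate simplification of the paper's completeness model, and all triples are invented:

```python
# Toy hybrid routing: answer from the dataset when it looks complete
# enough, otherwise route the triple pattern to the crowd.
# Data and threshold are invented for illustration.

triples = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:population", "3600000"),
    ("ex:Paris",  "rdf:type", "ex:City"),
    ("ex:Paris",  "ex:population", "2100000"),
    ("ex:Lyon",   "rdf:type", "ex:City"),   # population value is missing
]

def has(s, p):
    """Does subject s have any value for predicate p?"""
    return any(t[0] == s and t[1] == p for t in triples)

def completeness(p, cls="ex:City"):
    """Fraction of subjects of class `cls` that have predicate `p`."""
    subjects = [s for s, pr, o in triples if pr == "rdf:type" and o == cls]
    return sum(has(s, p) for s in subjects) / len(subjects)

def route(s, p, threshold=0.9):
    """Decide on the fly where to evaluate the pattern (s, p, ?o)."""
    if has(s, p) or completeness(p) >= threshold:
        return "dataset"
    return "crowd"

print(route("ex:Berlin", "ex:population"))  # dataset (value is present)
print(route("ex:Lyon", "ex:population"))    # crowd (missing, predicate ~67% complete)
```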
(The HARE logo is based on the artwork by icons8 (https://icons8.com/).)
Twitter: @crowdsem, #crowdsem2013
1st International Workshop on “Crowdsourcing the Semantic Web” in conjunction with the 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, in Sydney, Australia. This interactive workshop takes stock of the emergent work and charts the research agenda, with interactive sessions to brainstorm ideas and potential applications of collective intelligence to solving AI-hard semantic web problems.
Semantic Data Management in Graph Databases: ESWC 2014 Tutorial (Maribel Acosta Deibe)
In this tutorial we present the basics of graph database frameworks and their applicability in semantic data management. The tutorial targets any conference attendee interested in learning about the current limited graph-based capabilities of existing RDF engines, existing graph database techniques, and extensions to RDF data management approaches in order to provide efficient graph-based access to linked data.
Managing a company of geeks is quite particular. These employees of a special kind want new technologies, time to experiment, knowledge sharing, pair programming, a part in company strategy, and real freedom of speech.
Speaker: Luc Legardeur, President of Xebia, at Devoxx France 2015
A Talk on Java Annotations by Olivier Croisier (Zenika) at the Paris JUG (Zenika)
Discover Java annotations as you have never seen them before! Olivier Croisier, Java expert, gives a two-hour talk on annotations, aimed at developers and architects. It covers their use, development, and manipulation at compile time and at run time through Annotation Processors and Reflection.
* Introduction: history, use cases and limitations
* Overview of the available annotations
* Using annotations
* Developing a custom annotation: structure, properties and meta-annotations
* Compile-time tooling: pluggable annotation processors
* Runtime tooling: Reflection
* Annotation injection
* Conclusion
SmartCities increase citizens’ quality of life and improve the efficiency and quality of the services provided by governing entities and businesses.
“The city must become like the Internet, i.e. enabling creative development and easy deployment of applications which aim to empower the citizen” - THE APPS FOR SMART CITIES MANIFESTO
This view can be achieved by leveraging:
Available infrastructure such as Open Government Data and deployed sensor networks in cities
Citizens’ participation through apps in their smartphones
The IES CITIES project promotes user-centric mobile micro-services that exploit open data and generate user-supplied data
Hypothesis: Users may help improve, extend and enrich the open data on which micro-services are based
Its platform aims to:
Facilitate the generation of citizen-centric apps that exploit urban data in different domains
Enable user supplied data to complement, enrich and enhance existing datasets about a city
Over the last few years we have observed the emergence of hybrid human-machine information systems which are able to both scale over large amount of data as well as to maintain high-quality data processing intrinsic in human intelligence.
In this talk I will focus on the use of human intelligence at scale by means of crowdsourcing to deal with Big Data problems. We will look specifically at how to deal with variety in data by means of Human Computation while still being able to operate on large data volumes.
First, I will introduce the area of micro-task crowdsourcing, also providing an overview of the different research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will provide examples of such hybrid systems for entity linking and disambiguation using crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.
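One common building block of such hybrid systems is the aggregation of redundant micro-task answers, for example by majority vote. The sketch below is a minimal illustration with three invented worker answers for an entity-linking task; real systems typically also weight workers by an estimated reliability:

```python
from collections import Counter

def majority(answers):
    """Return the most frequent answer and its support ratio."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Three workers judged which KB entity the mention "Paris" refers to
# (the answers are invented for the example).
answers = ["dbpedia:Paris", "dbpedia:Paris", "dbpedia:Paris_(Texas)"]
label, support = majority(answers)
print(label)  # dbpedia:Paris, with 2/3 of the votes
```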
Slides from our tutorial on Linked Data generation in the energy domain, presented at the Sustainable Places 2014 conference on October 2nd in Nice, France
This poster represents 4 months of work on the MSc project while doing a double degree at Heriot-Watt University.
A £50 prize was awarded for this work.
(http://lod2.eu/BlogPost/webinar-series) In this Webinar Michael Martin presents CubeViz - a faceted browser for statistical data utilizing the RDF Data Cube vocabulary, which is the state of the art in representing statistical data in RDF. This vocabulary is compatible with SDMX and is increasingly being adopted. Based on the vocabulary and the encoded Data Cube, CubeViz generates a faceted browsing widget that can be used to interactively filter the observations to be visualized in charts. Based on the selected structure, CubeViz offers suitable chart types and options which can be selected by users.
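The kind of faceted filtering CubeViz performs over Data Cube observations can be sketched with plain dictionaries standing in for qb:Observation resources; the dimensions and figures below are invented:

```python
# Each dict stands in for one qb:Observation; "area" and "year" play
# the role of dimensions, "value" the measure.  All numbers are made up.
observations = [
    {"area": "DE", "year": 2011, "value": 80.3},
    {"area": "DE", "year": 2012, "value": 80.5},
    {"area": "FR", "year": 2011, "value": 65.1},
    {"area": "FR", "year": 2012, "value": 65.3},
]

def facet_filter(obs, **selected):
    """Keep only observations matching every selected dimension value."""
    return [o for o in obs
            if all(o[dim] == val for dim, val in selected.items())]

# Select one slice of the cube: the DE series over time, ready to chart.
slice_de = facet_filter(observations, area="DE")
print([(o["year"], o["value"]) for o in slice_de])  # [(2011, 80.3), (2012, 80.5)]
```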
If you are interested in Linked (Open) Data principles and mechanisms, LOD tools & services and concrete use cases that can be realised using LOD then join us in the free LOD2 webinar series!
Big Data to SMART Data: process scenario
A scenario implementing a pipeline that transforms raw data into exploitable, representative data, covering stream processing, distributed systems, messaging, storage in a NoSQL environment, and graphical data visualization within a Big Data ecosystem, using the following technologies:
Apache Storm, Apache Zookeeper, Apache Kafka, Apache Cassandra, Apache Spark and Data-Driven Documents (D3.js).
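As a conceptual stand-in for that pipeline, the toy sketch below wires a producer, a stream processor and a key-value store together in memory; in a real deployment these roles are played by Kafka, Storm/Spark and Cassandra respectively:

```python
from queue import Queue

# Toy, in-memory stand-ins for the pipeline stages:
topic = Queue()   # message broker (Kafka-like topic)
store = {}        # NoSQL-style key-value store (Cassandra-like)

# Produce raw events onto the topic (invented click events).
for event in ["click:home", "click:about", "click:home"]:
    topic.put(event)

# Stream-process: count clicks per page and persist the aggregates
# (the Storm/Spark role, reduced to a loop).
while not topic.empty():
    page = topic.get().split(":")[1]
    store[page] = store.get(page, 0) + 1

print(store)  # {'home': 2, 'about': 1}
```

The aggregated `store` is what a D3.js front end would then visualize.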
Using the Semantic Web Stack to Make Big Data Smarter (Matheus Mota)
This presentation will discuss how just a few parts of the Semantic Web Cake can already boost your analytics by making your (big) data smarter and even more connected.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I have been wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and take you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
2. Architecture of Linked Data Applications

[Diagram: a three-tier architecture with a Presentation Tier, a Logic Tier, and a Data Tier. The Data Tier holds an Integrated Dataset, accessed by the upper tiers through a Data Access Component and exposed again through a Republication Component. A Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing, R2R Transformation) consolidates data arriving through a SPARQL wrapper, a Physical Wrapper, and an LD Wrapper from heterogeneous sources: RDF/XML, Web Data accessed via APIs, SPARQL Endpoints, Relational Data, and Linked Data.]

EUCLID – Microtask crowdsourcing applications for Linked Data
3. Data Tier: Data Integration Component

• Consolidates the data retrieved from heterogeneous sources.
• This component may operate at:
  – Schema level: performs vocabulary mappings in order to translate data into a single unified schema. Links correspond to RDFS properties or OWL property and class axioms. (CH 2)
  – Instance level: performs entity linking, e.g., entity resolution via owl:sameAs links. (CH 3)
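A schema-level vocabulary mapping like the one described above can be sketched as a simple predicate-rewriting step. The vocabulary prefixes and equivalences below (mo:, dc:, schema:) are illustrative assumptions, not mappings from the EUCLID material:

```python
# Minimal sketch of schema-level integration: rewrite source predicates
# into a single unified vocabulary. The mapping table is hypothetical.
VOCAB_MAP = {
    "mo:performer": "schema:byArtist",   # assumed equivalent properties
    "foaf:maker":   "schema:byArtist",
    "dc:title":     "schema:name",
}

def map_triples(triples):
    """Rewrite the predicate of each (s, p, o) triple into the unified schema,
    leaving unmapped predicates untouched."""
    return [(s, VOCAB_MAP.get(p, p), o) for s, p, o in triples]

triples = [("ex:track1", "dc:title", "Imagine"),
           ("ex:track1", "mo:performer", "ex:JohnLennon")]
print(map_triples(triples))
```

Instance-level integration would instead add owl:sameAs links between resources, rather than rewriting predicates.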
4. Data Tier (2)

The data integration component can be enhanced by including microtask crowdsourcing approaches:
• Cleansing or data assessment: assessment of DBpedia triples
• Vocabulary mapping: CrowdMAP
• Interlinking: ZenCrowd
5. Other Crowdsourcing-based Solutions for Linked Data Tasks

• Query understanding: CrowdQ
• Ontology population: OntoGame
• Linked Data curation: Urbanopoly
• …
7. Assessing DBpedia Triples

[Diagram: triples {s p o .} from the dataset are classified as correct, or as incorrect together with the quality issue found.]

1. Selecting LD quality issues generated by erroneous extraction mechanisms that can be detected by the crowd
2. Selecting the appropriate crowdsourcing approaches
3. Designing and generating the interfaces to present the data to the crowd
8. Selecting LD Quality Issues to Crowdsource

Three categories of quality problems occur pervasively in DBpedia [Zaveri2013] and can be crowdsourced:
• Incorrect object
  Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.
• Incorrect data type
  Example: dbpedia:Torishima_Izu_Islands foaf:name “鳥島”@en.
• Incorrect link to “external Web pages”
  Example: dbpedia:John-Two-Hawks dbpediaowl:wikiPageExternalLink <http://cedarlakedvd.com/>
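The first two issue categories can be pre-filtered automatically before sending triples to the crowd. The heuristics below are our own illustrative assumptions (not part of the EUCLID material): a date-valued property whose object is not a full date, and an @en literal with no Latin letters at all.

```python
import re

# Hypothetical pre-filters for triples worth crowdsourcing.

def suspicious_object(pred, value):
    """Flag a date-like predicate whose object is not a full ISO date."""
    if "dateOfBirth" in pred:
        return not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value)
    return False

def suspicious_language_tag(value, tag):
    """Flag an @en literal that contains no ASCII letters at all."""
    return tag == "en" and not re.search(r"[A-Za-z]", value)

print(suspicious_object("dbprop:dateOfBirth", "3"))   # → True (the slide's example)
print(suspicious_language_tag("鳥島", "en"))           # → True
```

Triples flagged this way would then be shown to workers for confirmation, since such heuristics produce false positives.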
10. Presenting the Data to the Crowd

Microtask interfaces: MTurk tasks
• Incorrect object: selection of foaf:name or rdfs:label to extract human-readable descriptions; real object values extracted automatically from Wikipedia infoboxes
• Incorrect data type: link to the Wikipedia article via foaf:isPrimaryTopicOf
• Incorrect outlink: preview of external pages by implementing an HTML iframe
11. Results

                          Object values   Data types   Interlinks
  Linked Data experts         0.7151        0.8270       0.1525
  MTurk (majority voting)     0.8977        0.4752       0.9412

• Both forms of crowdsourcing can be applied to detect certain LD quality issues
• The effort of LD experts must be applied to those tasks demanding domain-specific skills
• The MTurk crowd is exceptionally good at performing comparisons of data entries
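The majority-voting aggregation applied to the MTurk judgements can be sketched in a few lines; the worker answers below are made up for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among a triple's worker judgements."""
    return Counter(answers).most_common(1)[0][0]

# Made-up judgements: each triple assessed by three workers.
judgements = {
    "triple-1": ["correct", "correct", "incorrect"],
    "triple-2": ["incorrect", "incorrect", "correct"],
}
for triple, answers in judgements.items():
    print(triple, "->", majority_vote(answers))
```

In practice one would also record inter-worker agreement, so that low-agreement triples can be escalated to LD experts.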
13. ZenCrowd: Entity Linking by the Crowd

• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework

[Diagram: crowd, machines, and algorithms combined.]
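ZenCrowd's dynamic assessment of workers can be illustrated with a much-simplified weighted vote, where each worker's judgement counts in proportion to an estimated reliability. The decision rule and the numbers below are our own toy assumptions, not the actual ZenCrowd probabilistic network:

```python
# Toy sketch: accept a candidate entity link if the reliability-weighted
# vote is positive. Reliabilities would, in ZenCrowd, be estimated and
# updated by the probabilistic reasoning framework.
def weighted_decision(votes, reliability):
    """votes: {worker: True/False for 'link is correct'};
    reliability: {worker: estimated probability of a correct answer}."""
    score = 0.0
    for worker, vote in votes.items():
        r = reliability.get(worker, 0.5)   # unknown workers count as random
        score += r if vote else -r
    return score > 0

votes = {"w1": True, "w2": True, "w3": False}
reliability = {"w1": 0.9, "w2": 0.6, "w3": 0.95}
print(weighted_decision(votes, reliability))   # → True (0.9 + 0.6 - 0.95 > 0)
```

Note how the highly reliable dissenter (w3) nearly outweighs two agreeing workers, which a plain majority vote could not express.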
14. Example: RDFa Enrichment

HTML:

<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

After RDFa enrichment, the mentions are linked to http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram (owl:sameAs fbase:Instagram):

<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

[Diagram also lists candidate entities Google and Android.]
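An application consuming such RDFa-enriched HTML can recover the entity annotations with a standard-library parser. This is a minimal sketch handling only the `about` attribute, not a full RDFa processor:

```python
from html.parser import HTMLParser

class RDFaEntityParser(HTMLParser):
    """Collect the RDFa 'about' subjects found in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.entities = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "about":          # RDFa subject annotation
                self.entities.append(value)

parser = RDFaEntityParser()
parser.feed('<p><span about="http://dbpedia.org/resource/Facebook">'
            '<cite property="rdfs:label">Facebook</cite> ...</span></p>')
print(parser.entities)   # → ['http://dbpedia.org/resource/Facebook']
```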
15. ZenCrowd Architecture

[Diagram: input HTML pages are processed by entity extractors; algorithmic matchers produce micro matching tasks, which the micro-task manager publishes on a crowdsourcing platform. A decision engine with a probabilistic network combines the algorithmic results with the workers' decisions, looking entities up in an LOD index over the LOD Open Data Cloud, and outputs HTML+RDFa pages.]

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
17. Lessons Learnt

• Crowdsourcing + probabilistic reasoning works!
• But:
  – Different worker communities perform differently
  – Many low-quality workers
  – Completion time may vary (based on reward)
• Need to find the right workers for your task (see the WWW 2013 paper)
18. ZenCrowd Summary

• ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic approaches
• 4%–35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches
• http://exascale.info/zencrowd/
• Follow-up work (VLDBJ):
  – Also used for instance matching across datasets
  – 3-way blocking with the crowd
20. Motivation
• Web Search Engines can answer simple factual
queries directly on the result page
• Users with complex information needs are
often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!
21. CrowdQ

• CrowdQ is the first system that uses crowdsourcing to:
  – Understand the intended meaning
  – Build a structured query template
  – Answer the query over Linked Open Data

Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
23. CrowdQ Architecture

• Off-line: query template generation with the help of the crowd
• On-line: query template matching using NLP and search over open data

[Diagram: a keyword query from the user first passes a complex-query classifier; simple queries go to vertical selection or unstructured search. Complex queries undergo off-line query decomposition (POS + NER tagging) and are matched against existing query templates and answer types in a query template index; where needed, a crowd manager sends template-generation tasks to the crowdsourcing platform. Matched structured queries drive structured LOD search over the LOD Open Data Cloud, and a result joiner with answer composition produces the final SERP; a query log feeds the off-line stage.]
24. Hybrid Human-Machine Pipeline

Q = birthdate of actors of forrest gump

• Query annotation: “birthdate” (noun), “actors” (noun), “forrest gump” (named entity)
• Verification: Is “forrest gump” this entity in the query?
• Entity relations: Which is the relation between “actors” and “forrest gump”? Schema element: starring
• Verification: Is the relation between Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors? Answer: starring (<dbpedia-owl:starring>)
25. Structured Query Generation

Q = birthdate of actors of forrest gump

SELECT ?y ?x
WHERE { ?y <dbpedia-owl:birthdate> ?x .
        ?z <dbpedia-owl:starring> ?y .
        ?z <rdfs:label> "Forrest Gump" }

[Figure: results from BTC09]
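Once the crowd has produced a query template, answering a new keyword query reduces to filling the template's entity slot. The template string and slot mechanism below are an illustrative sketch, not CrowdQ's actual implementation:

```python
# Hypothetical query template with one entity slot; doubled braces keep
# the SPARQL braces literal under str.format().
TEMPLATE = """SELECT ?y ?x
WHERE {{ ?y <dbpedia-owl:birthdate> ?x .
         ?z <dbpedia-owl:starring> ?y .
         ?z <rdfs:label> "{entity}" }}"""

def instantiate(template, entity):
    """Fill the template's entity slot to obtain an executable SPARQL query."""
    return template.format(entity=entity)

print(instantiate(TEMPLATE, "Forrest Gump"))
```

The resulting string could then be submitted to a SPARQL endpoint; a real system would also escape the entity label and bind answer types from the template.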
28. Taste IT! Try IT!

• Restaurant review Android app developed in the Insemtives project
• Uses DBpedia concepts to generate structured reviews
• Uses mechanism design/gamification to configure incentives
• User study: 2,274 reviews by 180 reviewers referring to 900 restaurants, using 5,667 DBpedia concepts

[Chart: number of reviews and number of semantic annotations (type of cuisine, dishes) per venue category: cafe, fast food, pub, restaurant.]

https://play.google.com/store/apps/details?id=insemtives.android&hl=en

11/11/2013
32. Problems and Challenges

• What is feasible, and how can tasks be optimally translated into microtasks?
  – Examples: data quality assessment for technical and contextual features; subjective vs. objective tasks (also in modeling); open-ended questions
• What to show to users
  – Natural language descriptions of Linked Data/SPARQL
  – How much context
  – What form of rendering
  – What about links?
• How to combine with automatic tools
  – Which results to validate: low precision (no fun for gamers...), low recall (vs. all possible questions)
• How to embed it into an existing application
  – Tasks are fine-granular, perceived as an additional burden on top of the actual functionality
• What to do with the resulting data?
  – Integration into existing practices
  – Vocabularies!
34. For exercises, quiz and further material visit our website: http://www.euclid-project.eu

• Course eBook
• Other channels: @euclid_project, euclidproject