HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing

HARE:
An Engine for Enhancing Answer Completeness
of SPARQL Queries via Crowdsourcing
Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal

Motivation (1)
3
Due to the semi-structured nature of RDF,
incomplete values cannot be easily detected.

Motivation (2)
4
SELECT DISTINCT ?drug WHERE {
?drug rdf:type dbo:Drug .
?drug dbo:atcPrefix “C01” .
?drug dbp:routesOfAdministration ?route .
}
Retrieve drugs that are annotated with the prefix “C01” (Cardiac Therapy) in the Anatomical
Therapeutic Chemical (ATC) classification system and which have known routes of administration.
47 drugs
(v. 2016)

Motivation (2)
5
}
Retrieve drugs that are annotated with the prefix “C01” (Cardiac Therapy) in the Anatomical
Therapeutic Chemical (ATC) classification system and which have known routes of administration.
98 drugs
(v. 2016)
(There are 48 drugs without
routes of administration)

Motivation (3)
6
Examples of drugs (with ATC prefix “C01”) with no routes of administration in
All images licensed under Fair use via Wikipedia.
dbr:Acadesine dbr:Acetyldigitoxin dbr:Dimetofrine dbr:Flecainide
(v. 2016)
Intravenous administration,
for treating leukemia.
Source: PubChem
Also used in doping (sports).
Source: PubMed
Oral administration,
Source: DrugBank
No route found. No route found.

Problem Definition
7
Given an RDF dataset D and a SPARQL query Q against D. Consider D* the
virtual dataset that contains all the data that should be in D.
P1) Identifying portions P of Q that yield missing values.
P2) Resolving missing values.
[[P]]D [[P]]D*⊂
µ [[P]]D µ [[P]]D*∉ ∈
Do not belong to
the solution of P
Should belong to
the solution of P

Our Approach:
HARE (Hybrid SPARQL Query Engine)
8

}
HARE Overview
9
{?drug à dbr:Ibuprofen}
{?drug à dbr:Flecainide}
Query Engine
RDF Completeness
Model
Microtask Manager
{?drug à dbr:Acadesine}
{?drug à dbr:Ibuprofen}
{?drug à dbr:Flecainide}
{?drug à dbr:Acadesine}
Crowd Knowledge
CKB+ CKB- CKB~
D
τ

HARE
• A hybrid machine/human SPARQL query engine that is able to enhance
the size of query answers.
• Based on a novel RDF completeness model, HARE implements query
optimization and execution techniques:
P1) Identifying portions of queries that yield missing values.
• HARE resorts to microtask crowdsourcing:
P2) Resolving missing values.
10

RDF Completeness Model (1)
• Relies on the Local Closed World Assumption (LCWA).
• Estimates the local completeness of resources with respect to other
resources in an RDF graph that belong to the same classes.
11
rdf:type
dbp:routesOf
Administration
rdf:type
rdf:type
dbo:Drug
dbr:Procainamide
dbr:Flecainide
dbr:Bretylium
Local Completeness
dbp:routesOf
Administration
dbp:routesOf
Administration
dbp:routesOf
Administration
dbp:routesOf
Administration

① Multiplicity of an RDF Resource
Number of objects that a resource has for a certain predicate.
12
MOD(dbr:Procainamide, dbp:routesOfAdministration) = 3
dbr:Procainamide
dbp:routesOfAdministration
dbr:Intravenous
dbr:Intramuscular_injection
dbr:Oral_administration

② Aggregated Multiplicity of a Class
Given a predicate, median number of distinct objects that have all the
resources that belong to a class.
13
AMOD(dbo:Drug, dbp:routesOfAdministration) = 3
MOD(dbr:Procainamide, dbp:routesOfAdministration) = 3
MOD(dbr:Bretylium, dbp:routesOfAdministration) = 2

③ Local Completeness of an RDF Resource
Given a predicate, the completeness of an RDF resource is determined by
the aggregated predicate multiplicity of the classes that it belongs to.
14
CompD(dbr:Procainamide | dbp:routesOfAdministration) =
CompD(dbr:Bretylium | dbp:routesOfAdministration) =
CompD(dbr:Flecainide | dbp:routesOfAdministration) =
3
3
2
3
①Computed in
Computed in ②
0
3

Crowd Knowledge Bases (1)
• The knowledge collected from the crowd is captured in three KBs:
• CKB+, CKB–, CKB~ are fuzzy RDF datasets composed of 4-tuples:
15
CKB~
CKB+
CKB–
(subject, predicate, object, membership_degree)
RDF triple

Crowd Knowledge Bases (2)
16
Types of Crowd Knowledge Bases
(dbr: Acadesine, dbp:routesOfAdministration, _:o2, 0.78)
“Flecainide is administered orally.”
(dbr:Flecainide, dbp:routesOfAdministration, dbr:Oral_administration, 0.9)
“Flecainide does not have a (known) route of administration.”
(dbr:Flecainide, dbp:routesOfAdministration, _:o1, 0.05)
“I am not sure if Acadesine has a route of administration.”
CKB+
CKB-
CKB~
Contradiction (C)
Unknownness (U)

Query Engine (1)
• The engine computes the probability of crowdsourcing a triple pattern t in
query Q, denoted PCROWD(t).
• If PCROWD(t) is greater than a user threshold τ, then the query engine
crowdsources the triple pattern t.
• α is a score weight between 0.0 and 1.0.
17
PCROWD (t) =
α (1 – Comp(t)) + (1 – α) max {max{m+, m–}, min{C(t), 1 – U(t)}}
Estimated
incompleteness
Crowd
unreliability
Crowd
confidence

Query Engine (2)
• The engine combines mappings obtained from the dataset D and fuzzy
mappings from the crowd stored in CKB+.
• We define a fuzzy set semantics for SPARQL.
18
({?drug à dbr:Isoprenaline, ?route à dbr:Intravenous}, 0.94)
{?drug à dbr:Isoprenaline, ?route à dbr:Inhalation}
CKB+
D
The complexity of computing the mapping set of a SPARQL query under fuzzy set semantics is
the same as under set semantics.
The HARE query engine does not increase the time complexity of computing the mapping set of
a SPARQL query.
Corollary
Theorem

Microtask Manager (1)
19
• Receives triple patterns to crowdsource.
• Creates human tasks.
• Submits tasks to the crowdsourcing platform.
(dbr:Flecainide, dbp:routesOfAdministration, ?route)

dbr:Flecainide
Microtask Manager (2)
20
rdfs:label
Flecainide acetate (/flɛˈkeɪnaɪd/
US dict: fle·kā′·nīd) is a classic Ic
antyarrhythmic agent (...)
rdfs:comment
wiki-commons:Special:FilePath/
Flecainide_structure.svg
foaf:depiction
http://en.wikipedia.org/
wiki/Flecainide
foaf:isPrimaryTopicOf
“Flecainide“@en
“routes of administration“@en
RDF Graphs:

Experimental Settings
• Benchmark: 50 queries against (English version, 2014).
• Ten queries in five different knowledge domains:
History, Life Sciences, Movies, Music, and Sports.
• Implementation details:
• Dataset (queries executed directly against the dataset).
• HARE (our proposed approach).
• HARE BL (generates microtask interfaces replacing URIs by labels).
• Crowdsourcing configuration:
• The crowd is reached via CrowdFlower.
• Four different triple patterns per task, 0.07 US$ per task (Sep. 2015).
• At least 3 answers were collected per task.
22

Overview of the Results
• Benchmark: 50 queries against (English version, 2014).
• Ten queries in five different knowledge domains:
History, Life Sciences, Movies, Music, and Sports.
• Implementation details:
• Dataset (queries executed directly against the dataset).
• HARE (our proposed approach).
• HARE BL (generates microtask interfaces replacing URIs by labels).
• Crowdsourcing configuration:
• The crowd is reached via CrowdFlower.
• Four different triple patterns per task, 0.07 US$ per task (Sep. 2015).
• At least 3 answers were collected per task.
23
Total triple patterns crowdsourced: 1,004
Total answers collected from the crowd: 3,163
75%-98% of the crowd answers
were produced in 12 minutes

0
500
1000
1500
0.00 0.25 0.50 0.75 1.00
τ
Crowdsourcedtriplepatterns
Sports
History
LifeSciences
Music
Movies
24
# Crowdsourced Triple Patterns per Domain
The RDF completeness model considerably reduces the
number of triple patterns to crowdsource (τ >= 0.5).
Effectiveness of the RDF Completeness Model

Completeness of Query Answers
Sports Music Life Sciences Movies History
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9Q10
0.00
0.25
0.50
0.75
1.00
Query
Recall
Dataset HARE−BL HARE
25
Recall of tested approaches w.r.t. D* per SPARQL query
Recall varies across queries and knowledge domains.
Completing answers in certain domains is more challenging.

Sports Music Life Sciences Movies History
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9Q10
0.00
0.25
0.50
0.75
1.00
Query
Recall
Dataset HARE−BL HARE
Completeness of Query Answers
26
Recall of tested approaches w.r.t. D* per SPARQL query
HARE outperforms the other approaches across all knowledge domains.
Our RDF completeness model captures the skewed distributions of values.
Recall varies across queries and knowledge domains.
Completing answers in certain domains is more challenging.
✓ ✓✓✓✓ ✓✓✓✓✓ ✓ ✓✓✓✓ ✓✓✓✓✓ ✓ ✓✓✓✓ ✓✓✓✓✓ ✓✓✓✓ ✓✓✓✓✓ ✓ ✓✓✓✓ ✓✓✓✓✓

Quality of Crowd Answers: Precision
27
The crowd exhibits heterogeneous performance within domains.
This supports the importance of HARE triple-based approach.

Quality of Crowd Answers: Precision
28
The precision of the crowd answers is in general higher when
crowdsourcing semantically enriched tasks.

Conclusions
• HARE: Hybrid query engine against RDF data sets.
• Supports microtasks to enhance query answers on-the-fly.
• Experimental results confirmed that:
Future work
• Study further approaches to capture crowd reliability.
• Consider other quality dimensions on the knowledge collected from the
crowd.
30
3.13 – 12 times
Size of query answer
Precision
0.62 – 0.97
Crowd quality
Semantically enriched tasks

Maribel Acosta, Elena Simperl, Fabian Flöck, Maria-Esther Vidal
31
HARE:
An Engine for Enhancing Answer Completeness of
SPARQL Queries via Crowdsourcing
Size of query answer
Precision
Crowd quality
}
Crowd Knowledge
CKB+ CKB- CKB~
D

HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing

Recommended

Recommended

More Related Content

What's hot

What's hot (7)

Similar to HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing

Similar to HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing (20)

More from Maribel Acosta Deibe

More from Maribel Acosta Deibe (6)

Recently uploaded

Recently uploaded (20)

HARE: An Engine for Enhancing Answer Completeness of SPARQL Queries via Crowdsourcing