1. Predicting Query Performance and
Explaining Results to Assist Linked Data
Consumption
Candidate: Rakebul Hasan
Jury:
President: Johan Montagnat, CNRS (I3S)
Director: Fabien Gandon, INRIA
Co-director: Pierre-Antoine Champin, LIRIS, UCBL, Lyon
Reviewers:
Pascal Molli, University of Nantes
Philippe Cudré-Mauroux, University of Fribourg, Switzerland
2.
Accessing Linked Data
Dereferencing URIs: default
SPARQL Endpoints
68% data sets, 2011
98% data (triples), 2014
Consuming Linked Data
Query federation
On-the-fly dereferencing
Crawling
Integrating disparate data to support intelligent applications
Attribution: “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
3.
Results
Workload management tasks:
configuration, organization, inspection, and optimization
“How long will the query take to execute?”
History of data lifecycle:
make trust judgments, validate or invalidate results
“Why this result?”
4.
Linked Data Access/HTTP
“Why this result?”
“Show me the flow of information in the result derivation”
“Show me the summary of what happened in the result derivation process”
5.
Linked Data Access
Assistance in Querying
Query Performance
Assistance in Result Understanding
Query Results
Results Produced by Applications
6. Query Performance Prediction
Explaining SPARQL Query Results
Linked Explanations: Explanations for
Linked Data Applications
Summarizing Explanations
Outline
7.
Predictions
Statistics about data?
- Published statistics
- Few datasets
- Not detailed
How to predict query performance without using data statistics?
8.
Previously executed queries
Query Q1 took 100 ms
Query Q2 took 120 ms
Query Q3 took 200 ms
Query Q4 took 190 ms
...
Query Qm took 300 ms
Predictions
“How long will the query take to execute?”
9. Unseen query q
y is the performance metric
Query Q1 took 100 ms
Query Q2 took 120 ms
Query Q3 took 200 ms
Query Q4 took 190 ms
...
Query Qm took 300 ms
Learn
Regression
f(q) = ?
Predictions
How to model SPARQL query characteristics for machine learning algorithms?
Feature extraction
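The learning setup above can be sketched as follows. The operator vocabulary, operator counts, and timings are illustrative stand-ins (a real system would extract them from the SPARQL algebra tree of each logged query); the k-NN regression itself is the technique named on the slide.

```python
from collections import Counter
import math

# Hypothetical algebra-operator vocabulary (assumption: a real system derives
# these from the SPARQL algebra expression tree of each query).
OPS = ["bgp", "join", "leftjoin", "union", "filter", "project", "distinct", "triple"]

def features(op_counts):
    """Map a query's algebra-operator counts to a fixed-length feature vector."""
    c = Counter(op_counts)
    return [c.get(op, 0) for op in OPS]

def knn_predict(train, query_vec, k=3):
    """k-NN regression: average execution time of the k nearest training queries."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], query_vec))[:k]
    return sum(t for _, t in nearest) / k

# Toy training log: (algebra-operator counts, observed execution time in ms).
log = [
    ({"bgp": 1, "triple": 2, "project": 1}, 100.0),
    ({"bgp": 1, "triple": 2, "project": 1, "filter": 1}, 120.0),
    ({"bgp": 2, "triple": 4, "join": 1, "project": 1}, 200.0),
    ({"bgp": 2, "triple": 4, "join": 1, "project": 1, "filter": 1}, 190.0),
]
train = [(features(ops), t) for ops, t in log]

# Unseen query: compute f(q) as the k-NN average.
unseen = features({"bgp": 2, "triple": 4, "join": 1, "project": 1})
print(round(knn_predict(train, unseen), 1))  # → 163.3
```

The same vectors could be fed to any regressor; the experiments later in the talk use k-NN and SVM (nu-SVR).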
11.
Graph pattern features:
Landmarks in training queries
Similarities between the landmark queries and the query in examination
Inverting approximated graph edit distance
Clustering: k-medoids with approximated graph edit distance
[Riesen et al. 2009]
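A minimal sketch of the landmark idea, under heavy assumptions: a query pattern is modelled as a set of edges, and the symmetric-difference size stands in for the approximated graph edit distance (the thesis uses the far more refined assignment-based approximation of Riesen et al.). k-medoids is done exhaustively, which only works for tiny inputs.

```python
import itertools

# Toy stand-in for approximated graph edit distance: a query's graph pattern is
# a frozenset of (subject, predicate, object) edges, and the distance is the
# size of the symmetric difference of the two edge sets.
def approx_ged(g1, g2):
    return len(g1 ^ g2)

def similarity(g1, g2):
    # "Inverting" the distance to obtain a similarity value.
    return 1.0 / (1.0 + approx_ged(g1, g2))

def k_medoids(graphs, k):
    """Exhaustive k-medoids for tiny inputs: pick the k medoids minimising the
    total distance of every graph to its closest medoid."""
    best, best_cost = None, float("inf")
    for medoids in itertools.combinations(range(len(graphs)), k):
        cost = sum(min(approx_ged(g, graphs[m]) for m in medoids) for g in graphs)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return [graphs[m] for m in best]

queries = [
    frozenset({("?p", "type", "Person")}),
    frozenset({("?p", "type", "Person"), ("?p", "near", "?l")}),
    frozenset({("?f", "type", "Film"), ("?f", "starring", "?a")}),
]
landmarks = k_medoids(queries, k=2)

# Graph-pattern feature vector of a query: similarity to each landmark.
unseen = frozenset({("?p", "type", "Person"), ("?p", "near", "?l")})
print([round(similarity(unseen, lm), 2) for lm in landmarks])  # → [0.5, 0.2]
```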
12.
Queries:
1260 training, 420 validation and 420 test queries generated from DBPSB
benchmark [Morsey et al. 2011] query templates
RDF dataset:
DBpedia
Learning models:
k-NN regression with k-D tree
SVM with nu-SVR for regression
Triple store:
Jena TDB 1.0.0
16 GB allocated memory
Commodity hardware:
Intel Xeon 2.53 GHz
48 GB RAM
Linux 2.6.32
Experiments
15.
Model                            Training Time             Avg. Prediction Time per Query
k-NN + algebra                   7.14 sec                  3.42 ms
SVM + algebra                    26.26 sec                 3.53 ms
k-NN + algebra + graph pattern   3300.33 sec (55.01 min)   47.25 ms
SVM + algebra + graph pattern    3390.71 sec (56.5 min)    98.1 ms
16. Query Performance Prediction
Explaining SPARQL Query Results
Linked Explanations: Explanations for
Linked Data Applications
Summarizing Explanations
Outline
17.
Results
History of data lifecycle:
make trust judgments, validate or invalidate results
“Why this result?”
Provenance-based query result explanation
18. • Explanations
– Provenance models (e.g. PML, W3C PROV-O)
– Presentation/UI
– Justifications
• Provenance for query results
– Relational databases
• Why, where, how provenance
– Annotation approach
– Non-annotation approach
– RDF and SPARQL
• Transform RDF and SPARQL to relational models [Theoharis et al. 2011,
Damásio et al. 2012]
• Annotation approaches: Corese/KGRAM [Corby 2012], TripleProv [Wylot 2014]
Previous Work
19. Query Result Provenance
Triple
:person1 rdf:type foaf:Person. t1
:person1 foaf:based_near "Paris". t2
:person2 rdf:type foaf:Person. t3
:person2 foaf:based_near "Paris". t4
:person3 rdf:type foaf:Person. t5
:person3 foaf:based_near "Paris". t6
:person4 rdf:type foaf:Person. t7
:person4 foaf:based_near "London". t8
select ?location
where
{
?person rdf:type foaf:Person.
?person foaf:based_near ?location.
}
location
London
Paris   (How? Why?)
Provenance for the result tuple (location=“Paris”):
How-provenance: (t1 ⊗ t2) ⊕ (t3 ⊗ t4) ⊕ (t5 ⊗ t6)
Why-provenance: {{t1, t2}, {t3, t4}, {t5, t6}}
Lineage: {t1, t2, t3, t4, t5, t6}
Geerts et al. Algebraic structures for capturing the provenance of SPARQL queries. ICDT 2013.
Green, Karvounarakis, and Tannen. Provenance semirings. PODS 2007.
Buneman, Khanna, and Tan. Why and Where: A Characterization of Data Provenance. ICDT 2001.
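For this toy dataset, why-provenance can be computed directly by enumerating the join. The sketch below hard-codes the two triple patterns of the example query (a real engine would of course evaluate arbitrary patterns):

```python
# Why-provenance for the example: every distinct derivation (witness) of the
# result tuple (location="Paris") from the two triple patterns of the query.
triples = {
    "t1": (":person1", "rdf:type", "foaf:Person"),
    "t2": (":person1", "foaf:based_near", "Paris"),
    "t3": (":person2", "rdf:type", "foaf:Person"),
    "t4": (":person2", "foaf:based_near", "Paris"),
    "t5": (":person3", "rdf:type", "foaf:Person"),
    "t6": (":person3", "foaf:based_near", "Paris"),
    "t7": (":person4", "rdf:type", "foaf:Person"),
    "t8": (":person4", "foaf:based_near", "London"),
}

def why_provenance(location):
    witnesses = []
    for id1, (s1, p1, o1) in triples.items():
        if p1 != "rdf:type" or o1 != "foaf:Person":
            continue
        for id2, (s2, p2, o2) in triples.items():
            # Join on ?person and bind ?location to the requested value.
            if p2 == "foaf:based_near" and s2 == s1 and o2 == location:
                witnesses.append({id1, id2})
    return witnesses

print(why_provenance("Paris"))  # three witnesses: {t1,t2}, {t3,t4}, {t5,t6}
```

The lineage is the union of all witnesses, and how-provenance additionally records how the witnesses combine.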
20. Non-Annotation-based Algorithm
to Compute Why-Provenance
Triple
:person1 rdf:type foaf:Person. t1
:person1 foaf:based_near "Paris". t2
:person2 rdf:type foaf:Person. t3
:person2 foaf:based_near "Paris". t4
:person3 rdf:type foaf:Person. t5
:person3 foaf:based_near "Paris". t6
:person4 rdf:type foaf:Person. t7
:person4 foaf:based_near "London". t8
select ?location
where
{
?person rdf:type foaf:Person.
?person foaf:based_near ?location.
}
location
London
Paris   (Why?)

Bind the values from the result tuple to the original query and project all variables:

SELECT *
{
  ?person rdf:type foaf:Person.
  ?person foaf:based_near ?location.
  VALUES ?location {"Paris"}
}

Results of the rewritten query:
person     location
:person1   Paris
:person2   Paris
:person3   Paris
:person1 rdf:type foaf:Person.
:person1 foaf:based_near “Paris”.
Provenance for the result tuple (location=“Paris”):
Why-provenance: {{t1, t2}, {t3, t4}, {t5, t6}}
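The rewriting step can be sketched at the string level. This is an assumption-laden simplification: the actual algorithm works on the parsed query, handles operators such as UNION, and checks the resulting triples with SPARQL ASK.

```python
# Minimal sketch of the query-rewriting step of the non-annotation algorithm:
# bind a result tuple back into the query via VALUES and project all variables.
def rewrite_for_provenance(triple_patterns, binding):
    values = " ".join(
        f'VALUES ?{var} {{"{val}"}}' for var, val in binding.items()
    )
    body = " ".join(tp + "." for tp in triple_patterns)
    return f"SELECT * {{ {body} {values} }}"

patterns = [
    "?person rdf:type foaf:Person",
    "?person foaf:based_near ?location",
]
print(rewrite_for_provenance(patterns, {"location": "Paris"}))
# → SELECT * { ?person rdf:type foaf:Person. ?person foaf:based_near ?location. VALUES ?location {"Paris"} }
```

Each result tuple of the rewritten query then yields one derivation path; iterating over all of them gives the why-provenance.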
21.
Query execution time
Number of result tuples
Provenance generation time for all result tuples
Provenance generation overhead for all result tuples
Provenance generation time per result tuple
22. Source selection: SPARQL ASK [Schwarte et al. 2011]
Nested loop join: evaluate iteratively
Exclusive grouping and bound join [Schwarte et al. 2011]
Virtual integration model (graph) [Gaignard 2013]
FedQ-LD: Federated Query Processor
[Architecture diagram: the federated query processor sends sub-queries to several data sources and combines their results to answer the query; an Explanation Facility on top of the basic query federation features answers "explain tuple" requests through a why-provenance-based explanation UI.]
Explanation-Aware Federated Query Processor Prototype
25. • Explanations for the Semantic Web
– Assumptions:
• improve users’ understanding
• improved understanding leads to improved trust
• Evaluating Explanations
– Recommender systems [Tintarev et al. 2012]
– Context-aware applications [Lim et al. 2009]
Not evaluated yet
26. • H1. Query result explanations improve user experience over having no
explanations
– User experience: understanding and trust
• User study to test our hypothesis
– Scenario: explanation-aware federated query processing
– Participants: with explanation and without explanation
– Learning: how the system works, example query with or without
explanation
– Reasoning: solve a federated query
– Survey: feeling about the system
• Setup
– Data sources/Data sets: DBpedia and LinkedMDB
– Query: British movies with American actors
– Participants: 11 total, 6 with and 5 without explanations; 8 male and 3 female; ages 22-66; all familiar with RDF and SPARQL
27.
Response about data source selection and source triple selection
[Bar charts: percentages of fully correct, partially correct, and incorrect answers, with and without explanation, for data source selection and for source triple selection.]
Participants with explanation understood the system better
28.
Confidence level of the participants about their answers
[Bar chart: percentages of participants reporting very high, high, medium, low, and very low confidence, with and without explanation.]
Participants with explanation were more confident about their answers
29.
How users feel about the system: helpful ("yes") or unhelpful ("no")
[Bar charts: percentages of "yes" and "no" answers, with and without explanation, for understanding and for making trust judgments.]
Participants felt that explanations are helpful
30. Query Performance Prediction
Explaining SPARQL Query Results
Linked Explanations: Explanations for
Linked Data Applications
Summarizing Explanations
Outline
32. Publish explanation metadata as Linked Data
Named graphs for reification and bundling metadata
Dereferenceable named graph URIs
- Statements inside the named graph
- Related statements
Linked Explanations
33. Previous approaches
Linked Data incompatibility: blank nodes
No support for data interchange standards (W3C PROV)
Ratio4TA ontology http://ns.inria.fr/ratio4ta
An extension of W3C PROV-O
33
Representing Explanation Metadata
39.
Entry point to the full explanation
Salient, abstract, and coherent information
Provide a means to filter information in large explanations
Filtering: a set of classes used in the reasoning
Inspired by text summarization [Eduard 2005] and ontology summarization [Zhang et al. 2007]
40.
Ranking Measures
Salient RDF Statements
Degree centrality of subject and object
Abstract Statements
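Degree-centrality ranking can be sketched as follows. The statements come from the evaluation example later in the talk; summing the subject and object degrees is an assumption about how the two centralities are combined.

```python
from collections import Counter

# Rank explanation statements by the degree centrality of subject and object.
statements = [
    ("Bob", "type", "ComputerScientist"),
    ("ComputerScientist", "subClassOf", "Scientist"),
    ("Bob", "bornIn", "London"),
    ("London", "partOf", "England"),
    ("England", "partOf", "UnitedKingdom"),
]

# Degree of each resource: number of statements it appears in as subject/object.
degree = Counter()
for s, _, o in statements:
    degree[s] += 1
    degree[o] += 1

def salience(stmt):
    s, _, o = stmt
    return degree[s] + degree[o]

# Most salient statements first.
for stmt in sorted(statements, key=salience, reverse=True):
    print(salience(stmt), stmt)
```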
41.
Ranking Measures
Similarity of RDF Statements
Similarity between the filtering classes and the statement
subject, predicate, object
[Corby et al 2006]
42.
Re-Ranking Measures
Subtree Weight in Proof Tree
Salience of a statement w.r.t. its position in the proof tree
considering the weights of all the statements in the current branch
[Proof-tree figure: per-statement weights (e.g. leaf weights of 1 summing to subtree weights of 3) and the resulting re-ranked salience scores in the 0.4-0.6 range.]
Coherence
Iteratively selecting the statement with the best potential contribution to the total coherence of the summary
[Figure: a coherent versus a not coherent summary.]
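The greedy coherence step can be sketched as follows. The overlap measure here (shared subject/object resources with the summary so far) is a simplified stand-in for the coherence measure of the thesis, and the statements come from the evaluation example in this talk.

```python
# Greedy coherence-driven summarization: starting from the top-ranked
# statement, repeatedly add the statement sharing the most resources with the
# summary built so far.
statements = [
    ("Bob", "type", "ComputerScientist"),
    ("ComputerScientist", "subClassOf", "Scientist"),
    ("Bob", "bornIn", "London"),
    ("London", "partOf", "England"),
    ("England", "partOf", "UnitedKingdom"),
]

def overlap(stmt, summary):
    """Number of resources the candidate shares with the current summary."""
    terms = {t for s in summary for t in (s[0], s[2])}
    return len({stmt[0], stmt[2]} & terms)

def summarize(stmts, size):
    summary = [stmts[0]]          # seed with the top-ranked statement
    rest = list(stmts[1:])
    while len(summary) < size and rest:
        best = max(rest, key=lambda st: overlap(st, summary))
        summary.append(best)
        rest.remove(best)
    return summary

for s in summarize(statements, 3):
    print(s)
```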
43. • A query
– Scientists born in United Kingdom
• Query result with explanation
– Bob because
– Bob is a Computer Scientist
– Computer Scientists are Scientists
– Bob was born in London
– London is part of England
– England is part of United Kingdom
• Rating the necessity of each explanation statement on a scale of 1 to 5
Evaluation
Inferences: RDFS type propagation,
owl:sameAs, transitivity of
gn:parentFeature
For “with filtering”: query + filtering classes (e.g. Computer Scientist,
Place) + result + explanation
44.
[Participant demographics: gender (male/female), knowledge of RDF (yes/no), backgrounds (journalism, psychology, computer science, business administration, biology, chemist, mathematician, social scientist), and an age histogram spanning 20-59.]
Analysis of ground truths (Cosine similarity)
                     Avg.    Std. dev.
Without Filtering    0.836   0.048
With Filtering       0.835   0.065
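The agreement measure used in the ground-truth analysis is plain cosine similarity between participants' rating vectors (1-5 necessity of each explanation statement). A small sketch with illustrative ratings:

```python
import math

# Cosine similarity between two participants' rating vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Illustrative ratings (1-5) of five explanation statements by two participants.
a = [5, 4, 3, 5, 2]
b = [4, 4, 3, 5, 1]
print(round(cosine(a, b), 3))
```

Averaging such pairwise similarities (and their standard deviation) over all participants yields the figures in the table above.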
48. Summary of Contributions
Query Performance Prediction
Explaining SPARQL Query Results
Linked Explanations
Summarizing Explanations
Non-annotation approach to why-provenance
Evaluating the impact of explanations
49. Perspectives
Query Performance Prediction
Query optimization
Training query generation
Explaining performance
Explaining SPARQL Query Results
How-provenance
More participants
Re-evaluating
Linked Explanations
Named graphs
Large amount of metadata
Summarizing Explanations
Effectively using the rankings for presentation
Personalized explanations: classifying users based on their usage logs
50. • Rakebul Hasan and Fabien Gandon. A Machine Learning Approach to SPARQL
Query Performance Prediction. WI 2014
• Rakebul Hasan. Generating and Summarizing Explanations for Linked Data. ESWC
2014
• Rakebul Hasan. Predicting SPARQL Query Performance and Explaining Linked
Data. PhD Symposium, ESWC 2014
• Rakebul Hasan and Fabien Gandon. Predicting SPARQL Query Performance.
Poster, ESWC 2014
• Rakebul Hasan, Kemele M. Endris and Fabien Gandon. SPARQL Query Result
Explanation for Linked Data. SWCS 2014, ISWC 2014
• Rakebul Hasan and Fabien Gandon. A Brief Review of Explanation in the Semantic
Web. ExaCt 2012, ECAI 2012
• Rakebul Hasan and Fabien Gandon. Linking Justifications in the Collaborative
Semantic Web Applications. SWCS 2012, WWW 2012
Thank You
Editor's Notes
Good afternoon everyone,
Welcome to my phd thesis defense.
I will talk about predicting query performance and explaining results to assist linked data consumption.
In recent years, we have seen a sharp growth in linked data publishing, thanks to the W3C Linking Open Data initiative.
There are two basic ways to access this data: first, dereferencing URIs, and then a large portion of this data is also available via SPARQL endpoints.
There are various approaches for integrating this disparate data to support a new generation of intelligent applications.
In this context of integrating linked data, users such as knowledge engineers can have resource intensive workloads. They may need support for workload management tasks, for example: inspecting, organizing, configuring. They may want to ask questions like “how long will the query take to execute?”
When they get the results back, they may want to know about the history of data used in the result derivation, to make more informed trust judgments, to validate or invalidate results. They may want to ask questions like “why this result?”
Additionally, in the context of applications which consume this linked data and produce their results, users can ask why an application produced a result. They may want to know the flow of the information in the result derivation.
Furthermore, they may also want to have a summarized overview of what happened in the whole derivation process.
In this thesis, our focus is twofold:
first, assisting users in querying by providing predicted query performance to have an understanding of query behaviors prior to executing the queries
second, assisting users in result understanding. Two types of results: query results, and results produced by applications that consume linked data.
In the rest of the presentation, I am going to talk about the major contributions of my thesis. First, I’ll talk about query performance prediction. Then, I’ll talk about explaining SPARQL query results. Next, I’ll talk about linked explanations. And finally I’ll talk about summarizing explanations
Traditional approaches for query cost estimation are based on data statistics. First, statistics about the underlying data are generated or extracted. Then, based on those statistics, prediction models predict the cost of the queries.
In the context of linked data, statistics about the data are often missing.
There are some approaches to publish statistics about the data. But there are very few datasets that follow these approaches. In addition, those statistics are very basic, not detailed enough for prediction models.
So the challenge is how to predict query performance without using data statistics.
------------------------
Only 32.20 % (95 out of 295) data sources provide a voiD description.
Again the scenario for us is that the users will ask how long will the query take to execute.
Our approach is to learn query performance metrics from query logs of already executed queries. We apply machine learning on those query logs to predict query performance metrics.
We do this on the querying side without using any data statistics, which makes our approach suitable for the linked data scenario.
The first step in our approach is to represent queries as vectors so that they can be used by machine learning algorithms. Here x is a vector representing a query and x will have various features. y here is the performance metric.
Then we learn a mapping function f(x)=y using regression.
And finally when we have an unseen query, we compute the value of this function f to compute our prediction.
So the challenge here for us is how to model SPARQL query characteristics for machine learning algorithms, namely feature extraction.
First we extract SPARQL algebra features for a query. We transform a SPARQL query to a SPARQL algebra expression tree. Then from the operators of this tree, we construct a vector. The frequencies and cardinalities of these operators become values for different dimensions of the vector.
Next we have the graph pattern features, which is a relative representation of the query pattern in a query, relative to the training query.
First we find some landmark queries by clustering the training queries. Cluster centers are the landmark queries. Then we compute similarity values between the landmark queries and the query in examination.
To cluster the training queries, we use k-medoids, which allows us to use an arbitrary distance function. The distance function for us is the approximated graph edit distance. A query pattern is represented by a graph. To compute the distance between two such graphs, we use approximated graph edit distance. We use an approximated solution because the exact solution for graph edit distance has exponential time complexity. The solution we use has a cubic time complexity.
To compute the similarity between two graphs, we just invert their approximated graph edit distance.
------------------------------
Graph pattern features
Landmarks in training queries: clustering using k-medoids with approximate edit distance.
Similarity values between landmark queries and the query in examination form the graph pattern feature vector. Similarity is computed by inverting approximate edit distance.
We did some experiments with this representation. We generated our training, validation, and test queries from the DBPSB benchmark query templates. DBPSB templates cover the most commonly used SPARQL query features in the queries sent to DBpedia. As learning models, we used k-NN regression with a k-d tree and SVM with nu-SVR. We used commodity hardware with the Jena triple store. We use query execution time as the performance metric.
-------------------------
DBPSB templates cover most commonly used SPARQL query features in the queries sent to DBPedia
DBpedia as RDF data set
Predicting query execution time
k-NN regression with k-D tree
SVM with nu-SVR for regression
4-core Intel Xeon 2.53 GHz CPU, 48 GB system RAM, and Linux 2.6.32 operating system.
First the predictions using only the algebra features. The figure in the top left shows the comparison between predicted and actual values using k-nn. On the x axis we have the predicted execution times. On the y axis we have the actual query execution times. On the right side, we have DBPSB templates on the x axis, and root mean squared errors on the y axis. The R-squared value for k-nn is .96645. An R-squared value close to one means the model predicts well. As you can see, some queries have large errors in this model.
Next, we used the algebra features with support vector machine. Our prediction accuracy improved a bit. Some errors went down. The R-squared value went up.
When we use both the algebraic and graph pattern features, the accuracy of k-nn worsens a bit. But SVM gives us the most accurate predictions among all the experiments. R-squared value of 0.985 and most of the predictions are close to the perfect prediction line. The errors are low for most queries.
Time required for prediction and training: when we use both types of features, the training time is very high because we need to compute the distance matrix for the training queries. However, this is an offline process, so it does not affect the amount of time required to predict.
For prediction time, again when we use both types of features, the avg. time required to predict is higher, but they are reasonable, less than 100 ms. The reason for higher time is that we have to compute the approximated graph edit distance for the query in examination.
So this was everything about query performance prediction.
Next, I am going to talk about explaining query results.
When the users get the results back, they may want to know about the history of data used in the result derivation.
We provide this kind of information by means of explanations. We provide provenance based query result explanations.
The challenge here is: how to generate provenance for SPARQL query results on SPARQL endpoints without modifying the query language, the underlying data model, or the query processing engine? Previous approaches for computing SPARQL query result provenance are annotation-based approaches, which require modifying the query language, the underlying data model, or the query processing engine to keep track of what happened during query processing, and use this meta-information to generate provenance.
But in the context of linked data, it is not possible to make these modifications.
Previous works on explanations in the semantic web literature focus on provenance models, presentation of explanations, and justification based explanations.
For provenance-related works, the relational databases field has a rich literature. Major types of provenance include why, where, and how provenance. In addition, there are two approaches to compute provenance: annotation and non-annotation. Annotation approaches keep track of the provenance-related metadata during query processing. Non-annotation approaches compute the provenance only when it is needed, by means of querying the data again.
In the RDF and SPARQL literature, there are few approaches for computing query result provenance. Some approaches are based on transforming RDF and SPARQL to relational models, and applying the provenance computation approaches of relational databases.
There are few approaches for native SPARQL query processing, but these approaches are annotation based approaches.
We propose a non-annotation based approach to compute query result provenance.
---------------
Annotation/eager approach
Extra annotations added during the query processing
Keeping traces of the source data for results
A bit about query result provenance: what we mean by query result provenance. Imagine we have this data – these triples, this query, and these results. Let's say we want to know the provenance of the result tuple Paris – why this result tuple was derived, or how this result tuple was derived. Why-provenance will give you the different derivations of the result tuple with explicit information for each derivation – each inner set is a derivation path. How-provenance will give you the derivations along with the performed operations. The theoretical foundations of these notions are described in previous works on provenance for SPARQL and relational database queries.
--------------------------
Why-provenance: all different derivations of the result tuple with explicit information for each derivation.
Intuitively, for a result tuple t for query Q on RDF graph G, lineage is the set of triples G’ which contribute to the result t
We adopt the notion of why-provenance to provide query result explanations. We propose a non-annotation based algorithm, as it’s suitable for the linked data scenario. I will show you a simplified simulation of our proposed algorithm. Let’s say for the same example as before, we want to compute why-provenance for the result tuple Paris.
In the first step, we bind the result tuple to the original query, and then we project all the variables in the new re-written query.
The results of the re-written query intuitively give us all the relevant variable bindings for the result tuple in examination.
Then we replace the variables in the triple patterns in the original query, by the corresponding values from the result tuples of the re-written query to generate provenance. Each tuple in the result tuples of the re-written query represents a derivation path. To get all the derivation paths, we iterate through all the result tuples of the re-written query.
In this example, these two triples are t1 and t2, which is the first derivation in the why-provenance.
So this is the general idea. But when there are complex operators like UNION, things get a bit more complex.
At the moment, we support
SELECT Queries without subqueries.
We do not support: FILTER (NOT) EXISTS, MINUS, property paths, aggregates.
-----------------------
A witness: replace the variables in the BGPs by the corresponding values in a result tuple of the rewritten query results. Check the existence of the resulting triples using SPARQL ASK. Why-provenance: do this for all the result tuples in the results of the rewritten query and record them. The result for this example would be: {{t1, t2}, {t3, t4}, {t5, t6}}
We did a performance evaluation of our approach with the DBPSB benchmark queries and the same setup as our query performance prediction experiments.
The first column is the query template number, then #RES is the number of result tuples, then QET is the query execution time, PGT is the provenance generation time for all result tuples, then PGO is the provenance generation overhead for all result tuples, finally PGTPR is the provenance generation time per result tuple.
So when we generate provenance for all the query result tuples, our algorithm is very costly in terms of time, the worst overhead in our case is 61,587%. But this is understandable because we have to solve a query for each of the result tuples to compute provenance. So non-annotation based algorithms are not good for generating provenance for all the query result tuples. For us the interesting measure here is the PGTPR, provenance generation time per result tuple. In our explanation scenario, we will only generate why-provenance for a result tuple when we need the provenance. As you can see here, all the PGTPR is really low. And that is why our approach is suitable for the explanation scenario.
We implemented our approach in a federated query processor prototype. We built a basic federated query processor with common query federation features like source selection, nested loop join, exclusive grouping and bound join, and virtual integration model.
Then on top of our federated query processor, we implemented our explanation facility as a plug-in. When you get the results, you can ask explanations for each of the tuples in the result. The system will respond with a why-provenance based explanation user interface.
----------------------------
Why-provenance-based explanations in the context of querying and data integration over Linked Data
Query federation is a prominent approach to consume, process, and integrate Linked Data
To give you an example, here we have the query user interface. The query here is solved from DBpedia and LinkedMDB. You have some results here, and when you click on the explain button for a result tuple, the explanation user interface will appear.
The user interface shows a derivation from the why-provenance of the selected result tuple.
The oval shapes here represent data sets, for example DBpedia and LinkedMDB. The first rectangle contains triple patterns that matched against its corresponding data set, and the second rectangle contains provenance triples in the corresponding data set.
So that was everything about how we generate and present query result explanations.
Now I am going to talk about the impact of query result explanations on users. It’s good that we have explanations, but we also need to understand whether the explanations are useful.
In the previous works in the explanations for the semantic web literature, the assumptions were that explanations improve users' understanding, and that improved understanding leads to improved trust.
But these assumptions were not evaluated to see whether they are true or not.
In other areas, for example recommender systems and context-aware applications, researchers have proposed methodologies for evaluating explanations. Our work is based on the methodologies proposed by Lim and others for evaluating explanations.
------------
What are the impacts of query result explanations?
These assumptions, however, are not evaluated
Based on the previous works on explanations for the semantic web, we hypothesize that “query result explanations improve user experience over having no explanations”. We define user experience as users understanding of the system and their perception of trust on the results.
We develop a user study to test our hypothesis. The scenario is explanation-aware federated query processing. There were two groups of participants: with explanation and without explanation. There are three sections in the study.
First, in the learning section, we gave a high level description of the system and how it works. We also gave them an example query and a result tuple for the query. The participants in the with explanation group additionally received an explanation for the result tuple they received.
The next section is the reasoning section, where we ask the participants to manually solve a federated query. The goal of the reasoning section is to examine whether the participants can apply the knowledge of how the system works learned in the learning section, so that we can examine whether there was an impact of providing explanations in the learning section.
Finally, in the survey section, we ask the participants how they feel about the system.
We used DBpedia and LinkedMDB as datasets. We used a very simple query, "British movies with American actors". 11 participants took part in the study; 6 were provided explanations and 5 were without explanations. There were 8 males and 3 females. All the participants had knowledge of RDF and SPARQL. The ages of the participants ranged from 22 to 66.
------------
users’ understanding of the system and their perception of trust on the results
Two groups of participants: with explanations and without explanations (randomly selected)
Learning section
Provided with a high level description of the system
An example of federated query and a result with or without explanation
Reasoning section
Participant were given a federated query solving task
Given a query and a result tuple
Select data sources
Select triples which contribute to the result
Confidence level on their answer choices
Survey section
How they feel about the system
Understanding
Making trust judgments
The response about data source/data set selection and source triple (i.e. provenance triple) selection for the task of solving the given federated query with a result tuple.
In the data set selection part, we can not really come to a conclusion whether there is an impact of explanations. Because the answers for both with explanation and without explanation groups are very similar.
For the source triple selection part, most participants with explanations answered correctly, but many participants without explanations answered incorrectly.
So we can say that the participants with explanation understood the system better, because most participants with explanation correctly selected the data sources and the source triples.
The confidence level of participants about their answers:
When the participants were given explanations, their confidence level about the answers was very high or high. When they were not given explanations, they said that their confidence level was only high. So here we see that the participants with explanations were more confident in their answers.
Finally, how the users feel about the system, whether they feel explanations were helpful or unhelpful.
Irrespective of whether we provided the participants explanations or not, majority of them said explanations were helpful to understand the system.
Also, irrespective of whether we provided them explanations or not, majority of them said explanations were helpful for making trust judgments.
The participants felt that the explanations helped them to better understand the system and to make better trust judgments on the results.
Overall, these results validate our hypothesis
So that was everything about explaining query results.
Now I am going to talk about Linked Explanations: explanations for linked data applications.
The scenario here is that applications consume and produce Linked Data, and applications also consume the Linked Data produced by other applications.
In this context, the scenario is really about explaining distributed reasoning. The existing approach for explaining distributed reasoning is centralized: there is a centralized metadata registry. In contrast, our approach decentralizes explanations for distributed reasoning.
We propose linked explanations for this. We provide proof-tree based explanations.
So what do we mean by linked explanations?
We publish the explanation metadata as Linked Data. The key idea here is using dereferenceable named graphs for reification and for bundling metadata.
When we dereference a named graph URI, we return the statements inside the named graph, along with the related statements: the RDF statements where the named graph URI appears as subject or object.
Linked explanations enable explanation for distributed data. To generate proof-tree-based explanations, we can follow the links and retrieve explanation metadata for the source data recursively.
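This recursive link-following can be sketched as follows. The bundle URIs and graph contents are hypothetical, and an in-memory dictionary stands in for an actual HTTP dereference of a named graph URI:

```python
# Sketch of recursively retrieving explanation metadata by following links.
# DEREF simulates dereferenceable named-graph URIs: each URI maps to the
# statements of its explanation bundle plus links to the bundles of its
# source data. All URIs and statements below are made up for illustration.
DEREF = {
    "http://example.org/expl/derived": {
        "statements": ["<:x :derivedFrom :a>", "<:x :derivedFrom :b>"],
        "sources": ["http://example.org/expl/srcA", "http://example.org/expl/srcB"],
    },
    "http://example.org/expl/srcA": {"statements": ["<:a :type :Fact>"], "sources": []},
    "http://example.org/expl/srcB": {"statements": ["<:b :type :Fact>"], "sources": []},
}

def collect_explanation(uri, seen=None):
    """Follow links from an explanation bundle to its sources, recursively."""
    if seen is None:
        seen = set()
    if uri in seen:          # guard against cycles in the link structure
        return []
    seen.add(uri)
    bundle = DEREF[uri]      # stands in for an HTTP dereference of the URI
    statements = list(bundle["statements"])
    for src in bundle["sources"]:
        statements += collect_explanation(src, seen)
    return statements
```

The collected statements can then be assembled into a proof tree for presentation.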
To publish explanation metadata as linked data, first we need a vocabulary to describe them.
Previous approaches for representing explanation metadata have some incompatibility with Linked Data with regard to blank nodes. Blank nodes are usually avoided in Linked Data because they are not globally referenceable. Previous approaches also do not use data interchange standards such as W3C PROV-O.
We propose Ratio4TA, which extends W3C PROV-O.
Extending PROV-O promotes interoperability by enabling data consumers to process explanation metadata according to the W3C PROV standards.
Ratio4TA allows describing data derivations, dependencies between input data and output data, the reasoning process, results, rules, and software applications. We can then bundle these metadata using an explanation bundle named graph.
To give you an example, applications can produce and publish the data and metadata as Linked Data as shown here.
Then consumers of the data and metadata can follow the links and retrieve explanation metadata for source data recursively.
In this way we can generate proof-tree-based explanations for data which was derived from distributed source data.
You can build some nice user interfaces on top of this. Here we generate natural-language proof trees by using the RDF labels.
This was everything about linked explanations.
Now I am going to talk about summarizing these explanations.
This is an example of a proof-tree-based explanation, with minimal information, for a derivation using data from DBpedia and GeoNames.
However, as you can see here, explanations can get large and can be overwhelming.
How can we summarize this kind of explanation to provide an entry point into the full explanation? We also want to provide a feature for filtering the information in explanations.
Our approach is inspired by text summarization and ontology summarization. We define some measures for summarizing explanations.
We rank the RDF statements in an explanation by three measures to provide summarized explanations.
First, the salience measure: the salience of an RDF statement indicates its importance.
We take the weighted average of the normalized degree centrality of the subject and the object of an RDF statement in the proof tree.
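A minimal sketch of this salience computation, with hypothetical triples and assumed equal weights for the subject and object centralities:

```python
from collections import Counter

# Salience sketch: weighted average of the normalized degree centrality of a
# statement's subject and object in the proof tree's RDF graph. The triples
# and the equal weights below are made up; predicates are ignored.
triples = [
    ("Paris", "locatedIn", "France"),
    ("France", "locatedIn", "Europe"),
    ("Paris", "type", "City"),
]

degree = Counter()
for s, _, o in triples:
    degree[s] += 1
    degree[o] += 1

n = len(degree)  # number of distinct subject/object nodes

def salience(triple, w_subj=0.5, w_obj=0.5):
    s, _, o = triple
    # normalized degree centrality of a node: degree / (n - 1)
    return w_subj * degree[s] / (n - 1) + w_obj * degree[o] / (n - 1)
```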
The next measure is abstractness.
We consider a statement that is close to the root of the corresponding proof tree to be more abstract than a statement that is far from the root.
We compute the abstractness of a statement as the inverse of the proof-tree level to which the statement belongs.
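As a one-line sketch, assuming levels start at 1 at the root:

```python
def abstractness(level):
    # Abstractness as the inverse of the statement's proof-tree level,
    # assuming the root statement is at level 1 (deeper means less abstract).
    return 1.0 / level
```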
The third measure is similarity.
The consumers of our explanations can specify a set of classes as their filtering criteria.
We rank the more similar statements to the classes given in the filtering criteria higher.
We compute the similarity between the set of filtering classes and a statement by combining the similarity scores between the classes of its subject, predicate, and object and the filtering-criteria classes.
We use the approximated query solving feature of the Corese SPARQL engine to compute the similarity between two classes.
The approximated query solving feature uses a semantic-distance-based similarity measure to compute conceptual similarity between two classes in a schema.
------
extra
We did not use the centrality of the predicate of a statement while computing salience, because the centrality values of predicates in an RDF graph often do not change: they are used directly from the schemata. In contrast, every new RDF statement changes the centrality values of its subject and object.
We use two more measures to improve the rankings produced by the combinations of three measures we presented so far.
First, Subtree Weight in Proof Tree.
This measure helps us to measure the salience of a statement w.r.t. its position in the proof tree considering the weights of all the statements in the current branch.
For example, the picture here shows the subtree-weight computation using only the salience measure to compute the ranking score.
First, the number of statements in the subtrees
The weight of the statements here is computed using the salience measure, but it can be computed by combinations of the measures I presented before.
We compute the subtree weight in the proof tree by taking the average weight of all the statements in that subtree.
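The subtree-weight computation can be sketched as follows. The tree shape and statement weights are made up; in practice the weights come from salience or a combination of the earlier measures:

```python
# Subtree-weight sketch: a proof-tree node is (statement_weight, children),
# and the subtree weight of a node is the average weight over all statements
# in the subtree rooted at that node.
def subtree_statement_weights(node):
    weight, children = node
    weights = [weight]
    for child in children:
        weights += subtree_statement_weights(child)
    return weights

def subtree_weight(node):
    weights = subtree_statement_weights(node)
    return sum(weights) / len(weights)

# Hypothetical tree: a root statement with two leaf statements below it.
leaf1 = (0.2, [])
leaf2 = (0.4, [])
root = (0.6, [leaf1, leaf2])
```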
Finally, the coherence measure for re-ranking.
The idea is to provide more coherent information in the summary.
We re-rank the explanation statements by iteratively selecting the statement with best potential contribution to the total coherence of the summary.
We consider an RDF statement x to be coherent with an RDF statement y if x is directly derived from y. For example, these two statements here are coherent, and these two statements are not.
---------------------
Previous work in text summarization [10] and ontology summarization [27] shows that coherent information is desirable in summaries.
For re-ranking a ranked list of statements, we repeatedly select the next RDF statement, n times, where n is the number of statements.
Here RL is the ranked list of RDF statements;
S is the list of RDF statements already selected into the summary;
i is the next RDF statement to be selected into S.
We re-rank RL by repeatedly selecting the next i.
As you can see, as the next statement we select the best statement considering salience and its potential contribution to the total coherence of the summary.
Again, the score(j) for a statement j can be computed by combinations of the measures I presented before.
The reward score of a statement j is its potential contribution, ranging from 0.0 to 1.0, to the total coherence of the summary if j is added to S.
The function coherent(S) returns the number of coherent statements in the summary S.
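A sketch of this greedy re-ranking. The statements, base scores, and derivation links are hypothetical, and the reward here is one way to realize a 0.0-1.0 contribution: the fraction of already-selected statements coherent with the candidate:

```python
# Greedy coherence re-ranking sketch. score(j) would in practice combine the
# measures above; here it is a fixed hypothetical base score per statement.
scores = {"s1": 0.9, "s2": 0.5, "s3": 0.6}
# derived_from[x] = set of statements x is directly derived from (coherence).
derived_from = {"s1": {"s2"}, "s2": {"s3"}, "s3": set()}

def reward(j, selected):
    # Potential contribution of j to the summary's coherence: the fraction
    # of already-selected statements coherent with j (a value in [0, 1]).
    if not selected:
        return 0.0
    coherent = sum(1 for s in selected
                   if s in derived_from[j] or j in derived_from[s])
    return coherent / len(selected)

def rerank(ranked):
    remaining, selected = list(ranked), []
    while remaining:
        # pick the statement with the best base score plus coherence reward
        best = max(remaining, key=lambda j: scores[j] + reward(j, selected))
        remaining.remove(best)
        selected.append(best)
    return selected
```

Here "s2" is pulled ahead of "s3" despite its lower base score, because "s1" is directly derived from it.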
To evaluate our summarization approach, we again did a user study.
We gave each participant a query with a result and an explanation for the result. The reasoning for the query result involves RDFS type propagation, owl:sameAs inferences, and inferences with respect to the transitivity of gn:parentFeature.
We used data from DBpedia and GeoNames.
We asked the users to rank the statements in the explanation.
To evaluate the filtering feature, we gave each participant a randomly selected class along with a query, a result, and the explanation for the result, and asked them to rank the statements in the explanation.
Of the 24 survey participants with different backgrounds, 18 had knowledge of RDF and 6 did not have any knowledge of RDF.
The ages of the participants range from 22 to 59.
20 participants were male and 4 were female.
The table at the bottom shows the total average agreement between rating vectors, measured by cosine similarity, and the standard deviations for the two scenarios: without filtering criteria and with filtering criteria.
The average agreements for both scenarios are above 0.8, which is considerably high. However, the standard deviation is higher for the scenario with filtering criteria. The reason is that the participants had to consider the highly subjective factor of similarity and therefore had more chances to disagree.
We use normalized discounted cumulative gain to evaluate ranking quality.
An nDCGp value of 1.0 means that the ranking is perfect at position p with respect to the ideal ranking.
In our study, the grade for a statement is the average of its ratings by all survey participants, which gives us the ideal ranking.
The figures show the average nDCG values of the three test cases for different rankings by different measure combinations.
The x-axis represents ranks and the y-axis represents nDCG .
For the scenario without filtering criteria (the figure on the left), three of the measure combinations produce very similar rankings to the ground truth rankings.
For the scenario with filtering criteria (the figure on the right), the same three measure combinations with added similarity measure have the best nDCG values.
This means that the participants consider central, abstract, and coherent information as necessary information in explanation summaries for the scenario without filtering criteria.
This also holds for the scenario with filtering criteria with the added observation that the participants also consider similar information as necessary information.
The nDCG values for these measure combinations are higher than 0.9 for all ranks. This means that the rankings by these measure combinations are highly similar to the ground truth rankings.
In contrast, the sentence-graph summarization ranking has low nDCG values compared to all the other rankings for the scenario without filtering criteria.
This shows that our explanation summarization algorithms produce much higher-quality rankings than the sentence-graph summarization algorithm.
-----------
Drop if no time: Discounted Cumulative Gain measures the quality of the results of an Information Retrieval (IR) system in a ranked list.
Drop if no time: It uses the assigned ratings/grades to measure the usefulness, or gain, of a ranked list of results. It penalizes high-quality results appearing lower in the list.
Drop if no time: Normalized Discounted Cumulative Gain (nDCG) allows calculating and comparing discounted cumulative gain across multiple lists of results, where each list might have a different length. nDCG values are in the interval 0.0 to 1.0.
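One common formulation of DCG and nDCG, sketched below; the log2 position discount is an assumption, and other variants exist:

```python
import math

def dcg(grades):
    # DCG with a log2 position discount (rank 1 gets discount log2(2) = 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(ranked_grades):
    # Normalize by the DCG of the ideal (descending) ordering of the grades,
    # so a perfect ranking scores 1.0.
    ideal = sorted(ranked_grades, reverse=True)
    return dcg(ranked_grades) / dcg(ideal)
```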
We evaluate the summarized explanations produced by different measure combinations by comparing them to human-generated summarized explanations (i.e. ground-truth summarized explanations) using F-score.
To generate the ground-truth summarized explanation for an explanation, we include a statement in it
if its rating is greater than or equal to the average rating of all the statements in the original explanation.
The Figures show the average F-scores for different measure combinations for summaries with different sizes for the three test cases.
The x-axis represents compression ratio CR . The y-axis represents F-scores .
For the scenario without filtering criteria (the figure on the left), the best F-score is 0.72 at CR value 0.33, achieved by the measure combinations salience + abstractness + subtree and salience + abstractness + subtree + coherence.
This is a desirable situation with a high F-score and low CR .
The sentence graph summarization performs poorly with a best F-score value of 0.34 in the CR interval 0.05 to 0.3.
For the scenario with filtering criteria (the figure on the right), the best F-score is 0.66 at CR value 0.53.
However, the F-score of 0.6 at CR value 0.3 by the measure combination salience + abstractness + similarity + coherence is more desirable because the size of the summary is smaller.
As expected, our summarization approach performs worse in the scenario with filtering criteria, where we use the similarity measure. This is due to the fact that the survey participants had to consider the highly subjective factor of similarity.
Recall reflects how many good statements the algorithm has not missed.
Precision reflects how many of the algorithm's selected statements are good.
F-score is a composite measure of recall and precision.
Gold standard summary: a statement is included if its rating is greater than or equal to the average rating of all the statements in the original explanation.
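The gold-standard construction and F-score computation can be sketched as follows, with hypothetical ratings and a hypothetical algorithm-selected summary:

```python
# Evaluation sketch: a statement enters the gold-standard summary if its
# average rating is >= the mean rating over the whole explanation; the
# algorithm's summary is then scored with precision, recall, and F-score.
ratings = {"s1": 4.2, "s2": 2.0, "s3": 3.8, "s4": 1.5}   # made-up ratings
algorithm_summary = {"s1", "s2"}                          # made-up selection

mean = sum(ratings.values()) / len(ratings)
gold = {s for s, r in ratings.items() if r >= mean}

tp = len(algorithm_summary & gold)          # correctly selected statements
precision = tp / len(algorithm_summary)
recall = tp / len(gold)
f_score = 2 * precision * recall / (precision + recall)
```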
So that was everything about summarizing explanations.
Now the conclusions.
Summary of the contributions.
I first spoke about query performance prediction. The goal was to provide predicted query-performance information prior to executing the queries, to assist users in workload-management tasks.
Then I spoke about explaining query results. The goal was to assist users in understanding SPARQL query results. I presented a non-annotation approach to generate why-provenance for SPARQL query results. I also presented a user study to evaluate the impact of query result explanations.
Next I spoke about linked explanations, which basically allow explaining distributed reasoning in the context of Linked Data.
Finally I spoke about summarizing those explanations.
Perspectives,
For query performance prediction, we would like to see how we can use our approach in query optimization, especially in scenarios where we query Linked Data, for example federated query processing over Linked Data.
There is also the question of how to generate training data. The idea is to mine query logs to extract dominant features of the queries, and then use those features to synthetically generate training queries with good coverage of the possible queries.
Finally, explaining performance: at the moment we can say that a query may take this or that amount of time, but we cannot say why it takes that amount of time. So it will be interesting to explore how to explain that aspect, which will include explaining machine learning algorithms.
Next, explaining SPARQL query results: we would like to extend our algorithm to how-provenance. We would also like to run our study with more participants; it was hard to find people motivated to solve the tasks. It may be interesting to recruit people on crowdsourcing platforms and reward them for solving the tasks. We would also like to re-design the study so we can go back to participants and ask why they answered one way and not another.
Next, linked explanations: at the moment it is not clear what the community consensus is with respect to dereferencing named graph URIs. It will be interesting to see how this develops, especially now that RDF 1.1 has adopted the notion of named graphs.
Then, describing explanations using our vocabulary produces a large amount of metadata, so we need scalable RDF data storage and processing techniques.
Finally, summarizing explanations: we would like to use our rankings for effective presentation of explanations, for example deciding whether or not to expand a branch of a proof tree while presenting it.
We would also like to provide personalized explanations. For example, we can classify users based on their usage logs and provide personalized explanations to target different types of users.
With that I will finish my presentation.
Thank you for your attention.