Predicting Query Performance and 
Explaining Results to Assist Linked Data 
Consumption 
Candidate: Rakebul Hasan 
Jury: 
President: Johan Montagnat, CNRS (I3S) 
Director: Fabien Gandon, INRIA 
Co-director: Pierre-Antoine Champin, LIRIS, UCBL, Lyon 
Reviewers: 
Pascal Molli, University of Nantes 
Philippe Cudré-Mauroux, University of Fribourg, Switzerland
2 
Accessing Linked Data 
Dereferencing URIs: default 
SPARQL Endpoints 
68% data sets, 2011 
98% data (triples), 2014 
Consuming Linked Data 
Query federation 
On-the-fly dereferencing 
Crawling 
Integrating disparate data to support intelligent applications 
Attribution: “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
3 
Results 
Workload management tasks: 
configuration, organization, inspection, and optimization 
“How long will the query take to execute?” 
History of data lifecycle: 
make trust judgments, validate or invalidate results 
“Why this result?”
4 
Linked Data Access/HTTP 
“Why this result?” 
“Show me the flow of information in the result derivation” 
“Show me the summary of what happened in the result derivation process”
5 
Linked Data Access 
Assistance in Querying 
Query Performance 
Assistance in Result Understanding 
Query Results 
Results Produced by Applications
Query Performance Prediction 
Explaining SPARQL Query Results 
Linked Explanations: Explanations for 
Linked Data Applications 
Summarizing Explanations 
Outline
7 
Predictions 
Statistics about data? 
-Published statistics 
-Few datasets 
-Not detailed 
How to predict query performance without using data statistics?
8 
Previously executed queries 
Query Q1 took 100 ms 
Query Q2 took 120 ms 
Query Q3 took 200 ms 
Query Q4 took 190 ms 
... 
Query Qm took 300 ms 
Predictions 
“How long will the query take to execute?”
Unseen query q 
9 
y is the performance metric 
Query Q1 took 100 ms 
Query Q2 took 120 ms 
Query Q3 took 200 ms 
Query Q4 took 190 ms 
... 
Query Qm took 300 ms 
Learn 
Regression 
f(q) = ? 
Predictions 
How to model SPARQL query characteristics for machine learning algorithms? 
Feature extraction
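A minimal sketch of the idea on this slide: turn the query log into (feature vector, execution time) pairs and fit a regression function f. The feature extractor below is a toy stand-in (simple keyword counts), not the algebra or graph pattern features of the thesis, and the queries and timings are illustrative.

import numpy as np

def toy_features(query_text):
    # Toy stand-in for the thesis features: a few keyword counts per query.
    q = query_text.upper()
    return [q.count("OPTIONAL"), q.count("UNION"), q.count("FILTER"), q.count("?")]

# Query log: (query text, observed execution time in ms) -- illustrative values.
log = [
    ("SELECT ?s WHERE { ?s ?p ?o }", 100.0),
    ("SELECT ?s WHERE { ?s ?p ?o . OPTIONAL { ?s ?q ?v } }", 120.0),
    ("SELECT ?s WHERE { { ?s ?p ?o } UNION { ?o ?p ?s } }", 200.0),
]

X = np.array([toy_features(q) + [1.0] for q, _ in log])   # feature vectors (+ bias term)
y = np.array([t for _, t in log])                         # performance metric

w, *_ = np.linalg.lstsq(X, y, rcond=None)                 # learn f(x) = w . x by regression
unseen = "SELECT ?x WHERE { ?x ?p ?o . FILTER(?o > 5) }"
print(np.array(toy_features(unseen) + [1.0]) @ w)         # predicted execution time (ms)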
10 
Algebra features: 
extracted from SPARQL 
algebraic expression
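A small sketch of extracting algebra features, assuming rdflib is available; the thesis uses Jena ARQ's algebra, so the operator names counted here (rdflib's Project, BGP, LeftJoin, Filter, ...) only approximate the actual feature set.

from collections import Counter
from rdflib.plugins.sparql import prepareQuery
from rdflib.plugins.sparql.parserutils import CompValue

def algebra_operator_counts(query_text):
    # Translate the query to its algebra tree and count operator occurrences.
    counts = Counter()

    def walk(node):
        if isinstance(node, CompValue):
            counts[node.name] += 1
            for child in node.values():
                walk(child)
        elif isinstance(node, (list, tuple)):
            for child in node:
                walk(child)

    walk(prepareQuery(query_text).algebra)
    return counts

print(algebra_operator_counts(
    "SELECT ?loc WHERE { ?p a <http://xmlns.com/foaf/0.1/Person> . "
    "OPTIONAL { ?p <http://xmlns.com/foaf/0.1/based_near> ?loc } }"))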
11 
Graph pattern features: 
Landmarks in training queries 
Similarities between landmark queries 
and the query in examination 
Inverting 
approximated graph edit distance 
Clustering: 
k-medoids with 
Approximated graph edit distance 
[Riesen et al 2009]
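A sketch of the graph pattern features, assuming the landmark queries have already been chosen (the thesis selects them by k-medoids clustering) and using networkx's approximated graph edit distance as a stand-in for the bipartite approximation of Riesen et al.; the feature value for each landmark is the inverted distance.

import networkx as nx

def pattern_graph(triple_patterns):
    # Represent a basic graph pattern as a directed graph (subject -> object, labelled by predicate).
    g = nx.DiGraph()
    for s, p, o in triple_patterns:
        g.add_edge(s, o, predicate=p)
    return g

def approx_ged(g1, g2):
    # The first value yielded by the generator is a cheap upper-bound approximation.
    return next(nx.optimize_graph_edit_distance(g1, g2))

def graph_pattern_features(query_graph, landmark_graphs):
    # One feature per landmark: similarity = inverted (approximated) graph edit distance.
    return [1.0 / (1.0 + approx_ged(query_graph, lg)) for lg in landmark_graphs]

landmarks = [
    pattern_graph([("?p", "rdf:type", "foaf:Person")]),
    pattern_graph([("?p", "rdf:type", "foaf:Person"), ("?p", "foaf:based_near", "?loc")]),
]
q = pattern_graph([("?x", "rdf:type", "foaf:Person"), ("?x", "foaf:name", "?n")])
print(graph_pattern_features(q, landmarks))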
12 
Queries: 
1260 training, 420 validation and 420 test queries generated from DBPSB 
benchmark [Morsey et al. 2011] query templates 
RDF dataset: 
DBpedia 
Learning models: 
k-NN regression with k-D tree 
SVM with nu-SVR for regression 
Triple store: 
Jena TDB 1.0.0 
16 GB allocated memory 
Commodity hardware: 
Intel Xeon 2.53 GHz 
48 GB RAM 
Linux 2.6.32 
Experiments
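A sketch of how the two learning models named above could be configured with scikit-learn (an assumption: the thesis's own implementation and hyperparameter values are not given here; the data below is random placeholder data with the same split sizes).

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_train = rng.random((1260, 10))            # placeholder feature vectors (1260 training queries)
y_train = rng.random(1260) * 1000.0         # placeholder execution times in ms
X_test = rng.random((420, 10))              # placeholder test queries

knn = KNeighborsRegressor(n_neighbors=5, algorithm="kd_tree", weights="distance")
svr = NuSVR(nu=0.5, C=1.0, kernel="rbf")

knn.fit(X_train, y_train)
svr.fit(X_train, y_train)
print(knn.predict(X_test)[:3], svr.predict(X_test)[:3])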
13 
Algebra Features
14 
Algebra and Graph Pattern Features
Model | Training Time | Avg. Prediction Time per Query
k-NN + algebra | 7.14 sec | 3.42 ms
SVM + algebra | 26.26 sec | 3.53 ms
k-NN + algebra + graph pattern | 3300.33 sec (55.01 min) | 47.25 ms
SVM + algebra + graph pattern | 3390.71 sec (56.5 min) | 98.1 ms
15
Query Performance Prediction 
Explaining SPARQL Query Results 
Linked Explanations: Explanations for 
Linked Data Applications 
Summarizing Explanations 
Outline
17 
Results 
History of data lifecycle: 
Modifying 
make trust judgments, validate or invalidate results 
“Why this result?” 
Provenance-based query result explanation
• Explanations 
– Provenance models (e.g. PML, W3C PROV-O) 
– Presentation/UI 
– Justifications 
• Provenance for query results 
– Relational databases 
• Why, where, how provenance 
– Annotation approach 
– Non-annotation approach 
– RDF and SPARQL 
• Transform RDF and SPARQL to relational models [Theoharis et al. 2011, 
Damásio et al. 2012] 
• Annotation approaches: Corese/KGRAM [Corby 2012], TripleProv [Wylot 2014] 
18 
Previous Work
Query Result Provenance 
Triple 
:person1 rdf:type foaf:Person. t1 
:person1 foaf:based_near "Paris". t2 
:person2 rdf:type foaf:Person. t3 
:person2 foaf:based_near "Paris". t4 
:person3 rdf:type foaf:Person. t5 
:person3 foaf:based_near "Paris". t6 
:person4 rdf:type foaf:Person. t7 
:person4 foaf:based_near "London". t8 
19 
select ?location 
where 
{ 
?person rdf:type foaf:Person. 
?person foaf:based_near ?location. 
} 
location 
London 
Paris How? Why? 
Provenance for the result tuple (location=“Paris”): 
How-provenance: (t1 t2) (t3 t4) (t5 t6) 
Why-provenance: {{t1, t2}, {t3, t4}, {t5, t6}} 
Lineage: {t1, t2, t3, t4, t5, t6} 
Geerts et al. Algebraic structures for capturing the provenance of SPARQL queries. ICDT '13 
Green, Karvounarakis, and Tannen. Provenance semirings. SIGMOD/PODS, 2007. 
Buneman, Khanna, Tan, “Why and Where: A Characterization of Data Provenance”, ICDT’01
Non-Annotation-based Algorithm 
to Compute Why-Provenance 
Triple 
:person1 rdf:type foaf:Person. t1 
:person1 foaf:based_near "Paris". t2 
:person2 rdf:type foaf:Person. t3 
:person2 foaf:based_near "Paris". t4 
:person3 rdf:type foaf:Person. t5 
:person3 foaf:based_near "Paris". t6 
:person4 rdf:type foaf:Person. t7 
:person4 foaf:based_near "London". t8 
20 
select ?location 
where 
{ 
?person rdf:type foaf:Person. 
?person foaf:based_near ?location. 
} 
location 
London 
Paris Why? 
Bind the values from the result tuple to the original query and project all variables 
SELECT * 
{ 
?person rdf:type foaf:Person. 
?person foaf:based_near ?location. 
VALUES ?location {"Paris"} 
} 
person location 
:person1 Paris 
:person2 Paris 
:person3 Paris 
results 
:person1 rdf:type foaf:Person. 
:person1 foaf:based_near “Paris”. 
Provenance for the result tuple (location=“Paris”): 
Why-provenance: {{t1, t2}, {t3, t4}, {t5, t6}}
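A minimal sketch of this rewriting step with rdflib over the in-memory example data (an assumption: the thesis prototype targets remote SPARQL endpoints and handles more operators; here only a single basic graph pattern and one bound variable are covered).

from rdflib import Graph, Literal, Namespace, RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/")

g = Graph()
for i, city in [(1, "Paris"), (2, "Paris"), (3, "Paris"), (4, "London")]:
    g.add((EX["person%d" % i], RDF.type, FOAF.Person))
    g.add((EX["person%d" % i], FOAF.based_near, Literal(city)))

triple_patterns = [("?person", "rdf:type", "foaf:Person"),
                   ("?person", "foaf:based_near", "?location")]

# Rewritten query: project all variables and bind the result tuple with VALUES.
rewritten = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * WHERE {
  ?person rdf:type foaf:Person .
  ?person foaf:based_near ?location .
  VALUES ?location { "Paris" }
}
"""

why_provenance = set()
for row in g.query(rewritten):
    # Each row of the rewritten query is one derivation: instantiate the triple patterns with it.
    bindings = {"?person": row.person.n3(), "?location": row.location.n3()}
    derivation = frozenset(tuple(bindings.get(t, t) for t in tp) for tp in triple_patterns)
    why_provenance.add(derivation)

print(why_provenance)   # one inner set per derivation of the result tuple (location = "Paris")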
21 
Query execution time 
Number of result tuples 
Provenance generation time for all result tuples 
Provenance generation overhead for all result tuples 
Provenance generation time 
per result tuple
Source selection: SPARQL ASK [Schwarte et al. 2011] 
Nested loop join: evaluate iteratively 
Exclusive grouping and bound join [Schwarte et al. 2011] 
Virtual integration model (graph) [Gaignard 2013] 
22 
FedQ-LD: Federated Query Processor 
[Architecture diagram: the federated query processor sends sub-queries to multiple data sources and combines the sub-query results; an explanation facility handles "explain tuple" requests through a why-provenance-based explanation UI] 
Basic query federation features 
Explanation-Aware Federated Query Processor Prototype
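A sketch of the ASK-based source selection step listed above (following Schwarte et al. 2011): a triple pattern is routed only to the endpoints that answer the ASK query positively. The endpoint URLs are illustrative and network access is required to run this.

from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = ["https://dbpedia.org/sparql",
             "http://data.linkedmdb.org/sparql"]   # illustrative endpoint URLs

def relevant_sources(triple_pattern):
    ask = "ASK { %s }" % triple_pattern
    selected = []
    for url in ENDPOINTS:
        endpoint = SPARQLWrapper(url)
        endpoint.setQuery(ask)
        endpoint.setReturnFormat(JSON)
        try:
            if endpoint.query().convert().get("boolean"):
                selected.append(url)
        except Exception:
            pass   # unreachable or failing endpoint: simply skip it
    return selected

print(relevant_sources("?film <http://data.linkedmdb.org/resource/movie/actor> ?actor"))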
23
IMPACT OF QUERY RESULT 
EXPLANATIONS 
24
• Explanations for the Semantic Web 
– Assumptions: 
• improve users’ understanding 
• improved understanding leads to improved trust 
• Evaluating Explanations 
– Recommender systems [Tintarev et al. 2012] 
– Context-aware applications [Lim et al. 2009] 
25 
Not evaluated yet
• H1. Query result explanations improve user experience over having no 
explanations 
– User experience: understanding and trust 
• User study to test our hypothesis 
– Scenario: explanation-aware federated query processing 
– Participants: with explanation and without explanation 
– Learning: how the system works, example query with or without 
explanation 
– Reasoning: solve a federated query 
– Survey: feeling about the system 
• Setup 
– Data sources/Data sets: DBpedia and LinkedMDB 
– Query: British movies with American actors 
– Participants: 11 total, 6 with and 5 without explanation; 8 male and 3 female; ages 22-66; knowledge of RDF and SPARQL 
26
27 
Response about data source selection and source triple selection 
[Two bar charts (data source selection; source triple selection): percentage of participants whose answers were Fully Correct, Partially Correct, or Incorrect, for the With Explanation and Without Explanation groups] 
Participants with explanation understood the system better
28 
Confidence level of the participants about their answers 
[Bar chart: percentage of participants reporting Very High, High, Medium, Low, or Very Low confidence, for the With Explanation and Without Explanation groups] 
Participants with explanation were more confident in their answers
29 
How users feel about the system: helpful ("yes") or unhelpful ("no") 
[Two bar charts (understanding; making trust judgments): percentage of Yes and No responses, for the With Explanation and Without Explanation groups] 
Participants felt that explanations are helpful
Query Performance Prediction 
Explaining SPARQL Query Results 
Linked Explanations: Explanations for 
Linked Data Applications 
Summarizing Explanations 
Outline
31 
Application 
Produces Consumes 
Results 
Explaining distributed reasoning 
Existing approach: centralized metadata registry [McGuinness et al 2003] 
Our proposal: Linked Explanations 
- Decentralized 
- Proof tree-based explanations
Publish explanation metadata as Linked Data 
Named graphs for reification and bundling metadata 
Dereferenceable named graph URIs 
- Statements inside the named graph 
- Related statements 
32 
Linked Explanations
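A minimal sketch of bundling explanation metadata in a named graph with rdflib and serializing it as TriG (an assumption: the property names follow the example shown on a later slide, and the exact Ratio4TA namespace IRI below is illustrative).

from rdflib import Dataset, Namespace

R4TA = Namespace("http://ns.inria.fr/ratio4ta/")      # assumption: exact vocabulary IRI may differ
LODAPP = Namespace("http://example.org/lodapp/")      # illustrative application namespace

ds = Dataset()
explanation = ds.graph(LODAPP.explanation1)           # the named graph is the explanation bundle
explanation.add((LODAPP.data1, R4TA.hasExplanation, LODAPP.explanation1))
explanation.add((LODAPP.reasoningProcess1, R4TA.usedData, LODAPP.inputData1))
explanation.add((LODAPP.data1, R4TA.derivedFrom, LODAPP.inputData1))

# Serving this TriG document at the named graph URI makes the explanation dereferenceable.
print(ds.serialize(format="trig"))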
Previous approaches 
Linked Data incompatibility: blank nodes 
No support for data interchange standards (W3C PROV) 
Ratio4TA ontology http://ns.inria.fr/ratio4ta 
An extension of W3C PROV-O 
33 
Representing Explanation Metadata
34
35 
# Derived data 
lodapp:data1 { 
dbpedia:Philadelphia gn:parentFeature geonames:5205788. 
} 
# Explanation metadata 
lodapp:explanation1 { 
lodapp:data1 r4ta:hasExplanation lodapp:explanation1. 
# Type declarations 
lodapp:explanation1 rdf:type r4ta:ExplanationBundle. 
lodapp:corese rdf:type r4ta:SoftwareApplication. 
.... 
.... 
# Reasoning process 
lodapp:reasoningProcess1 r4ta:performedBy lodapp:corese; 
r4ta:usedData lodapp:inputData1; 
r4ta:usedData lodapp:inputData2; 
r4ta:computed lodapp:result1; 
r4ta:produced lodapp:data1. 
# Computed result 
lodapp:result1 r4ta:resultReasoner lodapp:corese . 
# Output data 
lodapp:data1 r4ta:derivedFrom lodapp:inputData1; 
r4ta:derivedFrom lodapp:inputData2; 
r4ta:belongsTo lodapp:result1; 
r4ta:derivedBy lodapp:derivation1. 
# Data derivation 
lodapp:derivation1 r4ta:usedRule lodapp:geoFeatureRule; 
r4ta:wasInvolvedInComputing lodapp:result1; 
r4ta:derivationReasoner lodapp:corese; 
r4ta:performedAsPartOf lodapp:reasoningProcess1. 
} 
# Dbpedia data 
lodapp:inputData1 { 
dbpedia:Philadelphia owl:sameAs geonames:4560349 . 
} 
# GeoNames data 
lodapp:inputData2 { 
geonames:4560349 gn:parentFeature geonames:5205788. 
} 
Application 
Consumes 
Produces 
# Explanation metadata 
dbp-meta:explanation2 { 
lodapp:inputData1 r4ta:hasExplanation dbp-meta:explanation2. 
.... 
.... 
} 
# Explanation metadata 
geo-meta:explanation3 { 
lodapp:inputData2 r4ta:hasExplanation geo-meta:explanation3. 
.... 
.... 
}
36 
Explanation User Interfaces
Query Performance Prediction 
Explaining SPARQL Query Results 
Linked Explanations: Explanations for 
Linked Data Applications 
Summarizing Explanations 
Outline
38 
Overwhelming
39 
Entry point to the full explanation 
Salient, abstract, and coherent information 
Provide a means to filter information in large explanations 
Filtering: a set of classes used in the reasoning 
Inspired by text summarization [Eduard 2005] and 
ontology summarization [Zhang et al 2007]
40 
Ranking Measures 
Salient RDF Statements 
Degree centrality of subject and object 
Abstract Statements
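A small sketch of the salience measure named above, computed with rdflib: score a statement by the degree centrality of its subject and object in the explanation graph. The normalisation and the way the two degrees are combined are assumptions, not the exact formula of the thesis.

from collections import Counter
from rdflib import Graph, URIRef

def statement_salience(graph):
    # Degree of each node = number of statements in which it occurs as subject or object.
    degree = Counter()
    for s, p, o in graph:
        degree[s] += 1
        degree[o] += 1
    n = max(len(degree) - 1, 1)
    # Salience of a statement = combined (normalised) degree centrality of its subject and object.
    return {(s, p, o): (degree[s] + degree[o]) / (2.0 * n) for s, p, o in graph}

EX = "http://example.org/"
g = Graph()
g.add((URIRef(EX + "Bob"), URIRef(EX + "bornIn"), URIRef(EX + "London")))
g.add((URIRef(EX + "London"), URIRef(EX + "partOf"), URIRef(EX + "England")))
g.add((URIRef(EX + "England"), URIRef(EX + "partOf"), URIRef(EX + "UnitedKingdom")))
for statement, score in statement_salience(g).items():
    print(round(score, 2), statement)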
41 
Ranking Measures 
Similarity of RDF Statements 
Similarity between the filtering classes and the statement 
subject, predicate, object 
[Corby et al 2006]
42 
Re-Ranking Measures 
Subtree Weight in Proof Tree 
Salience of a statement w.r.t. its position in the proof tree 
considering the weights of all the statements in the current branch 
[Figure: example proof tree annotated with initial statement weights and the resulting subtree weights]
Coherence 
Iteratively selecting the statement with the best potential contribution to the total coherence of the summary 
[Figure: examples of a coherent and a not coherent selection of statements]
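A small sketch of the subtree weight re-ranking under one plausible reading (an assumption, not the thesis formula): the re-ranked weight of a statement is the mean of the initial ranking weights of all statements in the subtree rooted at its node in the proof tree.

def subtree_weight(tree, weights):
    # tree: node -> list of child nodes; weights: node -> initial ranking score.
    def subtree(node):
        nodes = [node]
        for child in tree.get(node, []):
            nodes.extend(subtree(child))
        return nodes

    return {node: sum(weights[n] for n in subtree(node)) / len(subtree(node))
            for node in weights}

# Tiny proof tree: a conclusion derived from two premises, one of which rests on two facts.
tree = {"conclusion": ["premise1", "premise2"], "premise1": ["fact1", "fact2"]}
weights = {"conclusion": 0.6, "premise1": 0.5, "premise2": 0.4, "fact1": 0.6, "fact2": 0.6}
print(subtree_weight(tree, weights))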
• A query 
– Scientists born in United Kingdom 
• Query result with explanation 
– Bob because 
– Bob is a Computer Scientist 
– Computer Scientists are Scientists 
– Bob was born in London 
– London is part of England 
– England is part of United Kingdom 
• Rating the necessity of each explanation statement on a scale of 1 to 5 
43 
Evaluation 
Inferences: RDFS type propagation, 
owl:sameAs, transitivity of 
gn:parentFeature 
For “with filtering”: query + filtering classes (e.g. Computer Scientist, 
Place) + result + explanation
44 
[Figures: participant demographics: gender (female/male), knowledge of RDF (yes/no), background (journalism, psychology, computer science, business administration, biology, chemist, mathematician, social scientist), and an age histogram (number of people per age bracket, 20-59)] 
Analysis of ground truths (cosine similarity): 
Without Filtering: avg. 0.836, std. dev. 0.048 
With Filtering: avg. 0.835, std. dev. 0.065
45 
Evaluating Rankings 
salience: SSL ; abstractness: SAB ; similarity: SSM 
subtree weight: SST ; coherence: SCO ; sentence graph: SSG
46 
Evaluating Summaries 
salience: SSL ; abstractness: SAB ; similarity: SSM 
subtree weight: SST ; coherence: SCO ; sentence graph: SSG
CONCLUSIONS 
47
Summary of Contributions 
Query Performance Prediction 
Explaining SPARQL Query Results 
Linked Explanations 
Summarizing Explanations 
Non-annotation approach 
to why-provenance 
Evaluating the impact of 
explanations 
48
Perspectives 
49 
Query Performance Prediction 
Query optimization 
Training query generation 
Explaining performance 
Explaining SPARQL Query Results 
How-provenance 
More participants 
Re-evaluating 
Linked Explanations 
Named graphs 
Large amount of 
metadata 
Summarizing Explanations 
Effectively using the 
rankings for presentation 
Personalized explanations 
classifying users 
based on their usage logs
• Rakebul Hasan and Fabien Gandon. A Machine Learning Approach to SPARQL 
Query Performance Prediction. WI 2014 
• Rakebul Hasan. Generating and Summarizing Explanations for Linked Data. ESWC 
2014 
• Rakebul Hasan. Predicting SPARQL Query Performance and Explaining Linked 
Data. PhD Symposium, ESWC 2014 
• Rakebul Hasan and Fabien Gandon. Predicting SPARQL Query Performance. 
Poster, ESWC 2014 
• Rakebul Hasan, Kemele M. Endris and Fabien Gandon. SPARQL Query Result 
Explanation for Linked Data. SWCS 2014, ISWC 2014 
• Rakebul Hasan and Fabien Gandon. A Brief Review of Explanation in the Semantic 
Web. ExaCt 2012, ECAI 2012 
• Rakebul Hasan and Fabien Gandon. Linking Justifications in the Collaborative 
Semantic Web Applications. SWCS 2012, WWW 2012 
50 
Thank You


Editor's Notes

  1. Good afternoon everyone, Welcome to my phd thesis defense. I will talk about predicting query performance and explaining results to assist linked data consumption.
  2. In recent years, we have seen a sharp growth in publishing linked data, thanks to the W3C Linking Open Data initiative. DON’T SAY: Click There are two basic ways to access this data: first, dereferencing URIs; then, a large portion of this data is also available via SPARQL endpoints. DON’T SAY: Click There are various approaches for integrating this disparate data to support a new generation of intelligent applications.
  3. In this context of integrating linked data, users such as knowledge engineers can have resource intensive workloads. They may need support for workload management tasks, for example: inspecting, organizing, configuring. They may want to ask questions like “how long will the query take to execute?” DON’T SAY: Click When they get the results back, they may want to know about the history of data used in the result derivation, to make more informed trust judgments, to validate or invalidate results. They may want to ask questions like “why this result?”
  4. Additionally, in the context of applications which consume this linked data and produce their results, users can ask why an application produced a result. They may want to know the flow of the information in the result derivation. DON’T SAY: Click Furthermore, they may also want to have a summarized overview of what happened in the whole derivation process.
  5. In this thesis, our focus is twofold: first, assisting users in querying by providing predicted query performance to have an understanding of query behaviors prior to executing the queries second, assisting users in result understanding. Two types of results: query results, and results produced by applications that consume linked data.
  6. In the rest of the presentation, I am going to talk about the major contributions of my thesis. First, I’ll talk about query performance prediction. Then, I’ll talk about explaining SPARQL query results. Next, I’ll talk about linked explanations. And finally I’ll talk about summarizing explanations
  7. Traditional approaches for query cost estimation are based on data statistics. First some statistics about the underlying data is generated or extracted. Then, based on those statistics, prediction models predict the cost of the queries. DON’T SAY: Click In the context of linked data, statistics about the data is often missing. DON’T SAY: Click There are some approaches to publish statistics about the data. But there are very few datasets that follow these approaches. In addition, those statistics are very basic, not detailed enough for prediction models. DON’T SAY: Click So the challenge is how to predict query performance without using data statistics. ------------------------ Only 32.20 % (95 out of 295) data sources provide a voiD description.
  8. Again the scenario for us is that the users will ask how long will the query take to execute. DON’T SAY: Click Our approach is to learn query performance metrics from query logs of already executed queries. We apply machine learning on those query logs to predict query performance metrics. DON’T SAY: Click We do this on the querying side without using any data statistics, which makes our approach suitable for the linked data scenario.
  9. The first step in our approach is to represent queries as vectors so that they can be used by machine learning algorithms. Here x is a vector representing a query and x will have various features. y here is the performance metric. DON’T SAY: Click Then we learn a mapping function f(x)=y using regression. DON’T SAY: Click And finally when we have an unseen query, we compute the value of this function f to compute our prediction. DON’T SAY: Click So the challenge here for us is how to model SPARQL query characteristics for machine learning algorithms, namely feature extraction.
  10. First we extract SPARQL algebra features for a query. We transform a SPARQL query into a SPARQL algebraic expression tree. Then, from the operators of this tree, we construct a vector. The frequencies and cardinalities of these operators become values for different dimensions of the vector.
  11. Next we have the graph pattern features, which are a relative representation of the query pattern in a query, relative to the training queries. First we find some landmark queries by clustering the training queries. Cluster centers are the landmark queries. Then we compute similarity values between the landmark queries and the query in examination. DON’T SAY: Click To cluster the training queries, we use k-medoids, which allows us to use an arbitrary distance function. The distance function for us is the approximated graph edit distance. A query pattern is represented by a graph. To compute the distance between two such graphs, we use approximated graph edit distance. We use an approximated solution because the exact solution for graph edit distance has exponential time complexity. The solution we use has cubic time complexity. DON’T SAY: Click To compute the similarity between two graphs, we just invert their approximated graph edit distance. ------------------------------ Graph pattern features Landmarks in training queries: clustering using k-medoids with approximate edit distance. Similarity values between landmark queries and the query in examination form the graph pattern feature vector. Similarity is computed by inverting approximate edit distance.
  12. We did some experiments with this representation. We generated our training, validation, and test queries from the DBPSB benchmark query templates. DBPSB templates cover most commonly used SPARQL query features in the queries sent to DBPedia. As learning models, we used k-nn regression with kd-tree and SVM with nu-svr kernel. We used commodity hardware with jena triplestore. We use query execution time as the performance metric. ------------------------- DBPSB templates cover most commonly used SPARQL query features in the queries sent to DBPedia DBpedia as RDF data set Predicting query execution time k-NN regression with k-D tree SVM with nu-SVR for regression 4 core Intel Xeon 2.53 GHz CPU, 48 GB system RAM, and Linux 2.6.32 operating system.
  13. First the predictions using only the algebra features. The figure in the top left shows the comparison between predicted and actual values using k-nn. On the x axis we have the predicted execution times. On the y axis we have the actual query execution times. On the right side, we have DBPSB templates on the x axis, and root mean squared errors on the y axis. The R-squared value for k-nn is .96645. An R-squared value close to one means the model predicts well. As you can see, some queries have large errors in this model. Next, we used the algebra features with support vector machine. Our prediction accuracy improved a bit. Some errors went down. The R-squared value went up.
  14. When we use both the algebraic and graph pattern features, the accuracy of k-nn worsens a bit. But SVM gives us the most accurate predictions among all the experiments. R-squared value of 0.985 and most of the predictions are close to the perfect prediction line. The errors are low for most queries.
  15. Time required for prediction and training: when we use both types of features, the training time is very high because we need to compute the distance matrix for the training queries. However, this is an offline process, so it does not affect the amount of time required to predict. For prediction time, again when we use both types of features, the avg. time required to predict is higher, but it is reasonable, less than 100 ms. The reason for the higher time is that we have to compute the approximated graph edit distance for the query in examination.
  16. So this was everything about query performance prediction. Next, I am going to talk about explaining query results.
  17. When the users get the results back, they may want to know about the history of data used in the result derivation. We provide this kind of information by means of explanations. We provide provenance-based query result explanations. DON’T SAY: Click The challenge here is: how to generate provenance for SPARQL query results on SPARQL endpoints without modifying the query language, the underlying data model, or the query processing engine? Previous approaches for computing SPARQL query result provenance are annotation-based approaches which require modifying the query language, the underlying data model, or the query processing engine, to keep track of what happened during the query processing and to use this meta-information to generate provenance. DON’T SAY: Click But in the context of linked data, it’s not possible to make these modifications.
  18. Previous works on explanations in the semantic web literature focus on provenance models, presentation of explanations, and justification-based explanations. For provenance-related work, the relational database field has a rich literature. Major types of provenance include why, where, and how provenance. In addition, there are two approaches to compute provenance: annotation and non-annotation. Annotation approaches keep track of the provenance-related metadata during the query processing. Non-annotation approaches compute the provenance only when it’s needed, by means of querying the data again. In the RDF and SPARQL literature, there are few approaches for computing query result provenance. Some approaches are based on transforming RDF and SPARQL to relational models, and applying the provenance computation approaches of relational databases. There are few approaches for native SPARQL query processing, but these approaches are annotation-based approaches. We propose a non-annotation-based approach to compute query result provenance. --------------- Annotation/eager approach Extra annotations added during the query processing Keeping traces of the source data for results
  19. A little bit about query result provenance, what we mean by query result provenance. Imagine we have this data – these triples, this query, and these results. Let’s say we want to know the provenance of the result tuple Paris – why this result tuple was derived, or how this result tuple was derived. Why-provenance will give you the different derivations of the result tuple with explicit information for each derivation – each of the inner sets is a derivation path. How-provenance will give you the derivations along with the performed operations. The theoretical foundations of these notions are described in previous works on provenance for SPARQL and relational database queries. -------------------------- Why-provenance: all different derivations of the result tuple with explicit information for each derivation. Intuitively, for a result tuple t for query Q on RDF graph G, lineage is the set of triples G’ which contribute to the result t
  20. We adopt the notion of why-provenance to provide query result explanations. We propose a non-annotation-based algorithm, as it is suitable for the Linked Data scenario. Here is a simplified simulation of the proposed algorithm. For the same example as before, say we want to compute the why-provenance of the result tuple Paris. In the first step, we bind the result tuple to the original query and project all the variables in the rewritten query. The results of the rewritten query intuitively give us all the relevant variable bindings for the result tuple in examination. Then we replace the variables in the triple patterns of the original query by the corresponding values from each result tuple of the rewritten query; this gives a witness, and we check the existence of the resulting triples using SPARQL ASK. Each tuple in the results of the rewritten query represents a derivation path, so to get all the derivation paths we iterate through all of them and record the witnesses. In this example, the two triples t1 and t2 form the first derivation in the why-provenance, and the full why-provenance would be {{t1, t2}, {t3, t4}, {t5, t6}}. That is the general idea, but complex operators like UNION make things more involved. At the moment, we support SELECT queries without subqueries; we do not support FILTER (NOT) EXISTS, MINUS, property paths, or aggregates.
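Below is a minimal, hedged sketch of this rewriting idea using rdflib. The data, the prefixes, and the query are illustrative only, and the VALUES-based binding plus the omission of the ASK check are simplifications of the actual rewriting rules.

```python
# Sketch of the non-annotation idea: bind the result tuple back into the query,
# project all variables, and read each solution of the rewritten query as one
# derivation path. Example data and query are hypothetical.
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:France a ex:Country ; ex:capital ex:Paris .
ex:Italy  a ex:Country ; ex:capital ex:Rome .
""", format="turtle")

# Original query: SELECT ?capital WHERE { ?c a ex:Country . ?c ex:capital ?capital }
# Result tuple under examination: ?capital = ex:Paris
rewritten = """
PREFIX ex: <http://example.org/>
SELECT ?c ?capital WHERE {
  VALUES ?capital { ex:Paris }      # bind the result tuple
  ?c a ex:Country .
  ?c ex:capital ?capital .
}
"""
for row in g.query(rewritten):
    # Each solution is one derivation path: substituting its bindings into the
    # original triple patterns yields the witness triples of that derivation.
    witness = [(row.c, "rdf:type", "ex:Country"), (row.c, "ex:capital", row.capital)]
    print(witness)
```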
  21. We did a performance evaluation of our approach with the DBPSB benchmark queries and the same setup as our query performance prediction experiments. The first column is the query template number; #RES is the number of result tuples; QET is the query execution time; PGT is the provenance generation time for all result tuples; PGO is the provenance generation overhead for all result tuples; and PGTPR is the provenance generation time per result tuple. When we generate provenance for all the query result tuples, our algorithm is very costly in terms of time; the worst overhead in our case is 61,587%. This is understandable, because we have to solve a query for each result tuple to compute its provenance. So non-annotation-based algorithms are not good for generating provenance for all the query result tuples. For us, the interesting measure is PGTPR, the provenance generation time per result tuple. In our explanation scenario, we only generate why-provenance for a result tuple when its provenance is requested. As you can see, all the PGTPR values are very low, which is why our approach is suitable for the explanation scenario.
  22. We implemented our approach in a federated query processor prototype. We built a basic federated query processor with common query federation features: source selection, nested loop join, exclusive grouping and bound join, and a virtual integration model. On top of this federated query processor, we implemented our explanation facility as a plug-in. When you get the results, you can ask for an explanation for each tuple in the result, and the system responds with a why-provenance-based explanation user interface. These are why-provenance-based explanations in the context of querying and data integration over Linked Data; query federation is a prominent approach to consume, process, and integrate Linked Data.
  23. To give you an example, here we have the query user interface. The query here is solved over DBpedia and LinkedMDB. You have some results here, and when you click on the explain button for a result tuple, the explanation user interface appears. It shows a derivation from the why-provenance of the selected result tuple. The oval shapes represent data sets, for example DBpedia and LinkedMDB. The first rectangle contains the triple patterns that matched against the corresponding data set, and the second rectangle contains the provenance triples in that data set.
  24. So that was everything about how we generate and present query result explanations. Now I am going to talk about the impact of query result explanations on users. It’s good that we have explanations, but we also need to understand whether the explanations are useful.
  25. In previous works on explanations in the Semantic Web literature, the assumptions were that explanations improve users' understanding and that improved understanding leads to improved trust. However, these assumptions were not evaluated. In other areas, for example recommender systems and context-aware applications, researchers have proposed methodologies for evaluating explanations. Our work is based on the methodologies proposed by Lim and others for evaluating explanations. The question we address is: what are the impacts of query result explanations?
  26. Based on the previous works on explanations for the Semantic Web, we hypothesize that query result explanations improve user experience over having no explanations. We define user experience as users' understanding of the system and their perception of trust in the results. We developed a user study to test this hypothesis. The scenario is explanation-aware federated query processing, with two randomly assigned groups of participants: with explanations and without explanations. The study has three sections. First, in the learning section, we gave a high-level description of the system and how it works, along with an example query and a result tuple; participants in the with-explanation group additionally received an explanation for that result tuple. Next, in the reasoning section, we asked the participants to manually solve a federated query: given a query and a result tuple, they had to select the data sources, select the triples that contribute to the result, and state their confidence in their answers. The goal of this section is to examine whether participants can apply the knowledge of how the system works acquired in the learning section, and thus whether providing explanations there had an impact. Finally, in the survey section, we asked the participants how they feel about the system, in terms of understanding it and making trust judgments. We used DBpedia and LinkedMDB as datasets and a very simple query, "British movies with American actors". 11 participants took part in the study: 6 were provided explanations and 5 were not. There were 8 males and 3 females, all with knowledge of RDF and SPARQL, and ages ranging from 22 to 66.
  27. These are the responses about data source (data set) selection and source triple (i.e. provenance triple) selection for the task of solving the given federated query with a result tuple. For the data set selection part, we cannot really conclude whether explanations had an impact, because the answers of the with-explanation and without-explanation groups are very similar. For the source triple selection part, most participants with explanations answered correctly, while many participants without explanations answered incorrectly. So we can say that the participants with explanations understood the system better, because most of them correctly selected both the data sources and the source triples.
  28. These are the confidence levels of the participants about their answers. When the participants were given explanations, their confidence in their answers was "very high" or "high". When they were not given explanations, they reported at most "high" confidence. So the participants with explanations were more confident in their answers.
  29. Finally, how the users feel about the system: whether they felt explanations were helpful or unhelpful. Irrespective of whether we provided the participants with explanations, the majority said explanations were helpful for understanding the system. Also irrespective of the group, the majority said explanations were helpful for making trust judgments. The participants felt that the explanations helped them to better understand the system and to make better trust judgments on the results. Overall, these results validate our hypothesis.
  30. So that was everything about explaining query results. Now I am going to talk about Linked Explanations: explanations for linked data applications.
  31. The scenario here is that applications consume and produce Linked Data, and applications also consume the Linked Data produced by other applications. In this context, the scenario is really one of explaining distributed reasoning. The existing approach for explaining distributed reasoning is centralized, relying on a centralized metadata registry. In contrast, our approach is to decentralize explanations for distributed reasoning. We propose linked explanations for this, and we provide proof-tree-based explanations.
  32. So what do we mean by linked explanations? We publish the explanation metadata as Linked Data. The key point here is using dereferenceable named graphs for reification and for bundling the metadata. When we dereference a named graph URI, we return the statements inside the named graph, and we also return the related statements, that is, the RDF statements in which the named graph URI is a subject or an object. Linked explanations enable explanation for distributed data: to generate proof-tree-based explanations, we can follow the links and retrieve the explanation metadata of the source data recursively.
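The following is a rough sketch of the consumption side of this idea: dereference a named graph URI, load what comes back (the graph's content plus statements about the graph), and recursively follow links to the explanations of the source data. The `derivedFrom` property and the example namespace are hypothetical stand-ins for the actual vocabulary.

```python
# Hedged sketch of recursively following linked explanation metadata with rdflib.
from rdflib import Graph, URIRef

# Hypothetical link property from a derived named graph to the named graphs
# explaining its source data; the real vocabulary term differs.
DERIVED_FROM = URIRef("http://example.org/vocab#derivedFrom")

def fetch_explanation(graph_uri, visited=None):
    """Dereference a named graph URI and recursively collect the explanation
    metadata of the source data into a single merged graph."""
    visited = set() if visited is None else visited
    if graph_uri in visited:           # avoid cycles between explanations
        return Graph()
    visited.add(graph_uri)
    g = Graph()
    g.parse(graph_uri)                 # HTTP dereferencing of the named graph URI
    for _, _, source_graph in g.triples((graph_uri, DERIVED_FROM, None)):
        g += fetch_explanation(source_graph, visited)
    return g
```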
  33. To publish explanation metadata as Linked Data, we first need a vocabulary to describe it. Previous approaches for representing explanation metadata have some incompatibility with Linked Data with regard to blank nodes; blank nodes are usually avoided in Linked Data because they are not globally referenceable. Previous approaches also do not use data interchange standards such as W3C PROV-O. We propose Ratio4TA, which extends W3C PROV-O. Extending PROV-O promotes interoperability by enabling data consumers to process explanation metadata according to the W3C PROV standards.
  34. Ratio4TA allows describing data derivations, the dependencies between input and output data, the reasoning process, results, rules, and software applications. We can then bundle this metadata using an explanation bundle named graph.
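As a purely illustrative sketch of what such bundled metadata can look like, the snippet below describes a derivation using plain W3C PROV-O terms (which Ratio4TA extends) inside a named graph. The Ratio4TA-specific classes and properties are omitted here, and all example URIs are hypothetical.

```python
# Hedged example: PROV-O statements about a derivation, bundled in a named graph.
from rdflib import Dataset, Namespace, URIRef, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")

ds = Dataset()
# The explanation bundle named graph collecting the metadata of one derivation.
bundle = ds.graph(URIRef("http://example.org/explanation/1"))
bundle.add((EX.result1, RDF.type, PROV.Entity))
bundle.add((EX.result1, PROV.wasGeneratedBy, EX.reasoningRun1))
bundle.add((EX.reasoningRun1, RDF.type, PROV.Activity))
bundle.add((EX.reasoningRun1, PROV.used, EX.sourceTriple1))
bundle.add((EX.result1, PROV.wasDerivedFrom, EX.sourceTriple1))

print(ds.serialize(format="trig"))
```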
  35. To give you an example, applications can produce and publish the data and metadata as Linked Data as shown here. Consumers of the data and metadata can then follow the links and retrieve explanation metadata for the source data recursively. In this way we can generate proof-tree-based explanations for data derived from distributed source data.
  36. You can build some nice user interfaces on top of this. Here we generate natural-language proof trees using the RDF labels.
  37. This was everything about linked explanations. Now I am going to talk about summarizing these explanations.
  38. This is an example of a proof-tree-based explanation, with minimal information, for a derivation using data from DBpedia and GeoNames. However, as you can see, explanations can get large and overwhelming.
  39. How can we summarize this kind of explanation so that the summary provides an entry point to the full explanation? We also want to provide a feature for filtering the information in explanations. Our approach is inspired by text summarization and ontology summarization: we define several measures for summarizing explanations.
  40. We rank the RDF statements in an explanation by three measures to provide summarized explanations. First, the salience measure: the salience of an RDF statement indicates its importance. We take the weighted average of the normalized degree centrality of the subject and the object of the RDF statement in the proof tree. The next measure is abstractness: we consider a statement that is close to the root of the corresponding proof tree to be more abstract than a statement that is far from the root, and we compute the abstractness of a statement by inverting the level of the proof tree at which it appears.
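Here is a small illustrative sketch of these two measures, under simplifying assumptions: the centrality is computed over a plain graph built from the explanation statements, the weighting factor is arbitrary, and the statement levels are given as a dictionary. It is not the thesis implementation.

```python
# Hedged sketch of the salience and abstractness measures.
import networkx as nx

def salience(stmt, rdf_graph, alpha=0.5):
    # Weighted average of the normalized degree centrality of subject and object.
    centrality = nx.degree_centrality(rdf_graph)
    s, _, o = stmt
    return alpha * centrality.get(s, 0.0) + (1 - alpha) * centrality.get(o, 0.0)

def abstractness(stmt, level_of):
    # A statement close to the root of the proof tree is more abstract:
    # invert its level (root level = 1).
    return 1.0 / level_of[stmt]

# Toy example: two statements represented as (subject, predicate, object) tuples.
t1 = ("ex:Paris", "ex:locatedIn", "ex:France")
t2 = ("ex:France", "rdf:type", "ex:Country")
g = nx.Graph()
g.add_edge("ex:Paris", "ex:France")
g.add_edge("ex:France", "ex:Country")
levels = {t1: 1, t2: 2}
print(salience(t1, g), abstractness(t2, levels))
```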
  41. The third measure is similarity. The consumers of our explanations can specify a set of classes as their filtering criteria, and we rank statements that are more similar to these classes higher. We compute the similarity between the set of filtering classes and a statement by combining the similarity scores between the classes of its subject, predicate, and object and the filtering criteria classes. We use the approximate query solving feature of the Corese SPARQL engine to compute the similarity between two classes; it is a semantic distance-based similarity measure of the conceptual similarity between two classes in a schema. Note that we did not use the centrality of the predicate of a statement when computing salience, because the centrality values of predicates in an RDF graph often do not change, as they come directly from the schemata, whereas every new RDF statement changes the centrality values of its subject and object.
  42. We use two more measures to improve the rankings produced by combinations of the three measures presented so far. First, the subtree weight in the proof tree. This measure captures the salience of a statement with respect to its position in the proof tree, taking into account the weights of all the statements in the current branch. The picture here shows the subtree weight computation when only salience is used as the ranking score: for each statement we take its subtree, count the statements in it, compute the weight of each statement (here with the salience measure, though any combination of the previous measures can be used), and obtain the subtree weight as the average weight of all the statements in that subtree. Finally, the coherence measure is used for re-ranking. The idea is to provide more coherent information in the summary: we re-rank the explanation statements by iteratively selecting the statement with the best potential contribution to the total coherence of the summary. We consider an RDF statement x to be coherent with an RDF statement y if x is directly derived from y; for example, these two statements here are coherent and those two are not. Previous works in text summarization [10] and ontology summarization [27] show that coherent information is desirable in summaries. For re-ranking a ranked list of statements, we repeatedly select the next RDF statement, n times, where n is the number of statements. Here RL is the ranked list of RDF statements, S is the list of statements already selected for the summary, and i is the next statement to be selected into S. As the next statement, we select the best statement considering its score (which, again, can be computed by combinations of the measures presented before) and its potential contribution to the total coherence of the summary. The reward score of a statement j is its potential contribution, ranging from 0.0 to 1.0, to the total coherence of the summary if j is added to S, and the function coherent(S) returns the number of coherent statements in the summary S.
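The following is a hedged sketch of the coherence-based re-ranking: it repeatedly picks, from the remaining statements, the one whose combined score plus potential contribution to the summary's coherence is highest. The `derived_from` mapping encoding direct derivations and the unnormalized reward (the raw increase in the coherent count rather than a [0, 1] value) are simplifications of the described procedure.

```python
# Hedged sketch of greedy coherence-aware re-ranking of explanation statements.
def rerank(ranked, score, derived_from):
    remaining, selected = list(ranked), []

    def coherent(summary):
        # Number of statements in the summary directly derived from another
        # statement that is also in the summary.
        return sum(1 for x in summary if derived_from.get(x) in summary)

    while remaining:
        def reward(cand):
            # Potential contribution of cand to the total coherence of the summary.
            return coherent(selected + [cand]) - coherent(selected)
        best = max(remaining, key=lambda s: score[s] + reward(s))
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage: statements a, b, c, where b is directly derived from a.
scores = {"a": 0.9, "b": 0.4, "c": 0.6}
print(rerank(["a", "c", "b"], scores, {"b": "a"}))   # b is pulled up next to a
```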
  43. To evaluate our summarization approach, we again did a user study. We gave each participant a query, a result, and an explanation for the result. The reasoning behind the query result involves RDFS type propagation, owl:sameAs inferences, and inferences with respect to the transitivity of gn:parentFeature, using data from DBpedia and GeoNames. We asked the participants to rank the statements in the explanation. To evaluate the filtering feature, we additionally gave each participant a randomly selected class along with a query, a result, and the explanation for the result, and asked them to rank the statements in the explanation.
  44. Of the 24 survey participants with different backgrounds, 18 had knowledge of RDF and 6 did not. The ages of the participants range from 22 to 59; 20 participants were male and 4 were female. The table at the bottom shows the total average agreement between rating vectors, measured by cosine similarity, and the standard deviations for the two scenarios: without filtering criteria and with filtering criteria. The average agreement for both scenarios is above 0.8, which is considerably high. However, the standard deviation is higher for the scenario with filtering criteria: the participants had to consider the highly subjective factor of similarity, so their ratings had more variance and they had more room to disagree.
  45. We use normalized discounted cumulative gain (nDCG) to evaluate ranking quality. Discounted cumulative gain measures the quality of a ranked list of results: it uses the assigned ratings or grades to measure the usefulness, or gain, of the list, and penalizes high-quality results appearing lower in the ranking. Normalized DCG allows comparing discounted cumulative gain across multiple lists of possibly different lengths; nDCG values lie in the interval 0.0 to 1.0, and an nDCG value of 1.0 at position p means the ranking is perfect at that position with respect to the ideal ranking. In our study, the average rating by all the survey participants for a statement is the grade for that statement, which gives us the ideal ranking. The figures show the average nDCG values of the three test cases for the rankings produced by different measure combinations; the x-axis represents ranks and the y-axis represents nDCG. For the scenario without filtering criteria (left figure), three of the measure combinations produce rankings very similar to the ground truth rankings. For the scenario with filtering criteria (right figure), the same three measure combinations with the added similarity measure have the best nDCG values. This means that the participants consider central, abstract, and coherent information as necessary in explanation summaries for the scenario without filtering criteria; this also holds for the scenario with filtering criteria, with the added observation that the participants also consider similar information as necessary. The nDCG values for these measure combinations are higher than 0.9 for all ranks, meaning these rankings are highly similar to the ground truth rankings. In contrast, the sentence graph summarization ranking has low nDCG values compared to all the other rankings for the scenario without filtering criteria, which shows that our explanation summarization algorithms produce much higher quality rankings than the sentence graph summarization algorithm.
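For reference, a minimal nDCG computation matching this description (grades are the participants' average ratings, DCG discounts lower-ranked statements, and nDCG divides by the DCG of the ideal grade-sorted ranking). This uses one common DCG formulation with linear gain; the example grades are made up.

```python
# Minimal nDCG sketch with linear gain and a log2 position discount.
import math

def dcg(grades):
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(ranked_grades, p):
    ideal = sorted(ranked_grades, reverse=True)   # grades in ideal order
    return dcg(ranked_grades[:p]) / dcg(ideal[:p])

# Grades of statements in the order produced by one measure combination.
print(round(ndcg([3.0, 2.5, 1.0, 2.0], p=3), 3))
```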
  46. We evaluate the summarized explanations produced by different measure combinations by comparing them to human-generated summarized explanations (i.e. ground truth summarized explanations) using F-score. Recall reflects how many good statements the algorithm did not miss; precision reflects how many of the algorithm's selected statements are good; F-score is a composite measure of recall and precision. To generate the ground truth summarized explanation for an explanation, we include a statement if its rating is greater than or equal to the average rating of all the statements in the original explanation. The figures show the average F-scores of the different measure combinations for summaries of different sizes over the three test cases; the x-axis represents the compression ratio CR and the y-axis represents F-scores. For the scenario without filtering criteria (left figure), the best F-score is 0.72 at a CR value of 0.33, obtained by the measure combinations salience + abstractness + subtree and salience + abstractness + subtree + coherence; this is a desirable situation with a high F-score and a low CR. The sentence graph summarization performs poorly, with a best F-score of 0.34 in the CR interval 0.05 to 0.3. For the scenario with filtering criteria (right figure), the best F-score is 0.66 at a CR value of 0.53. However, the F-score of 0.6 at a CR value of 0.3 obtained by the measure combination salience + abstractness + similarity + coherence is more desirable because the summary is smaller. As expected, our summarization approach performs worse in the scenario with filtering criteria, where we use the similarity measure, because the survey participants had to consider the highly subjective factor of similarity.
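A small sketch of this scoring procedure, with invented ratings: the ground truth summary keeps statements rated at or above the average, and a produced summary is then scored with the standard F-score.

```python
# Hedged sketch: F-score of a summarized explanation against the ground truth.
def f_score(summary, gold):
    selected_good = len(set(summary) & set(gold))
    precision = selected_good / len(summary) if summary else 0.0
    recall = selected_good / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

ratings = {"s1": 4.0, "s2": 2.0, "s3": 3.5, "s4": 1.0}        # made-up ratings
avg = sum(ratings.values()) / len(ratings)
gold = [s for s, r in ratings.items() if r >= avg]             # ground truth summary
print(f_score(["s1", "s2"], gold))                             # summary at CR = 0.5
```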
  47. So that was everything about summarizing explanations. Now the conclusions.
  48. Summary of the contributions. I first spoke about query performance prediction; the goal was to provide predicted query performance information prior to executing queries, to assist users in workload management tasks. Then I spoke about explaining query results; the goal was to assist users in understanding SPARQL query results. I presented a non-annotation approach to generate why-provenance for SPARQL query results, and a user study to evaluate the impact of query result explanations. Next, I spoke about linked explanations, which allow explaining distributed reasoning in the context of Linked Data. Finally, I spoke about summarizing those explanations.
  49. Perspectives. For query performance prediction, we would like to see how our approach can be used in query optimization, especially in scenarios where we query Linked Data, for example federated query processing over Linked Data. There is also the question of how to generate training data: the idea is to mine query logs to extract dominant features of the queries and then use those features to synthetically generate training queries with good coverage of the possible queries. Finally, there is explaining performance: at the moment we can say that a query may take a certain amount of time, but we cannot say why, so it would be interesting to explore how to explain that aspect, which would include explaining machine learning algorithms. For explaining SPARQL query results, we would like to extend our algorithm to how-provenance. We would also like to run our study with more participants; it was hard to find people motivated to solve the tasks, so it may be interesting to recruit people on crowdsourcing platforms and reward them for solving the tasks. We would also like to re-design the study so we can go back to participants and ask why they answered one thing and not another. For linked explanations, it is currently not clear what the community consensus is with respect to dereferencing named graph URIs; it will be interesting to see how this develops, especially now that RDF 1.1 has adopted the notion of named graphs. Also, describing explanations with our vocabulary produces a large amount of metadata, so we would need scalable RDF data storage and processing techniques. Finally, for summarizing explanations, we would like to use our rankings for effective presentation of explanations, for example deciding whether or not to expand a branch of a proof tree while presenting it. We would also like to provide personalized explanations, for example by classifying users based on their usage logs and providing personalized explanations targeting different types of users.
  50. With that I will finish my presentation. Thank you for your attention.