A Machine Learning Approach to SPARQL Query Performance Prediction
1. A Machine Learning Approach to SPARQL Query Performance Prediction
Rakebul Hasan
Wimmics Research Team
INRIA Sophia Antipolis
France
2. Context
Slide derived from Andreas Blumauer’s Linked Data slides
• Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
3. Context
• W3C Linking Open Data (LOD) Initiative
• An initiative to publish open data as Linked Data
• From 2 billion triples in 2007 to 30 billion triples in 2011
• Accessing Linked Data
– Dereferencing URIs
– SPARQL Endpoints
4. Context
• Querying Linked Data
– SPARQL Endpoints: SPARQL query service via HTTP, implementing the SPARQL Protocol
– 68% of the data sets provide SPARQL Endpoints as of September 2011
– As of today, 98% of the triples in the LOD cloud are accessible via SPARQL
• 57,856,463,005 out of 58,882,358,557 triples
http://stats.lod2.eu/
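To make "SPARQL query service via HTTP" concrete, here is a minimal sketch of how a client could form a SPARQL Protocol query request using GET. It only builds the request and does not send it. The endpoint URL and the query text are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol 'query via GET' request."""
    # The protocol passes the query text in the URL-encoded `query` parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed endpoint and query, for illustration only.
request = build_sparql_request(
    "http://dbpedia.org/sparql",
    "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5",
)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually execute the query against the endpoint.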
5. Context
• Understanding Query Behavior in the context of Linked Data
– Workload allocation to ensure specific QoS requirements are met
– Predicting query performance metrics
6. Query Performance Prediction
• Traditional approaches predict query performance with cost models based on statistics about the underlying data
• Data statistics are often missing in the Linked Data scenario
– Only 32.20% (95 out of 295) of the data sources provide a voiD description.
• Basic statistics, such as the number of triples, are often not detailed enough for statistics-based models
– In fact, what makes effective statistics for query cost estimation on RDF is unclear.
• Challenge
– How to predict query performance without using data statistics?
7. Understanding performance of database queries
• Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
• Akdere et al. use machine learning for predicting query execution time.
Ganapathi et al.: Predicting multiple metrics for queries: Better decisions enabled by machine learning, ICDE’09
Akdere et al.: Learning-based query performance modeling and prediction, ICDE’12
8. Predicting Query Performance
• Learn query performance from already executed queries
• Challenge: how to model SPARQL query characteristics for machine learning algorithms (feature extraction)?
9. Modeling SPARQL Query Execution
• Two types of features
– Algebra features: extracted from the SPARQL algebraic expression of a query
– Graph pattern features: a vector representation of the query pattern of a query relative to the training queries
10. Modeling SPARQL Query Execution
• Algebra features
– Jena API used to extract SPARQL algebra expressions
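As a rough illustration of turning an algebra expression into a feature vector, the sketch below counts operator occurrences in an s-expression string of the kind Jena prints for a compiled query. The operator vocabulary and the example expression are assumptions made for illustration; the slides do not list the exact features, and the talk's own extraction uses the Jena API rather than string matching.

```python
import re
from collections import Counter

# Hypothetical operator vocabulary; the slides do not enumerate the features.
OPERATORS = ["project", "filter", "bgp", "triple", "join",
             "leftjoin", "union", "distinct", "order", "slice"]


def algebra_features(expression: str) -> list:
    """Count how often each operator opens an s-expression, in OPERATORS order."""
    # An operator is the alphabetic token directly after an opening parenthesis.
    counts = Counter(re.findall(r"\(\s*([A-Za-z]+)", expression))
    return [counts[op] for op in OPERATORS]


# Hand-written example expression (an assumption, not output captured from Jena).
example = """
(project (?name)
  (filter (> ?age 30)
    (bgp
      (triple ?p <http://xmlns.com/foaf/0.1/name> ?name)
      (triple ?p <http://xmlns.com/foaf/0.1/age> ?age))))
"""
print(dict(zip(OPERATORS, algebra_features(example))))
```

A fixed operator order keeps every query's vector the same length, which is what most learning algorithms expect.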
11. Modeling SPARQL Query Execution
• Graph pattern features
– Find landmarks in training queries by clustering
• K-medoids with approximate graph edit distance
– Compute the distance between the landmark queries and the query under examination to construct a graph pattern feature vector
• Approximate graph edit distance for distance computation
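The two steps above (pick landmark queries by k-medoids clustering, then describe a query by its distances to the landmarks) can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_distance` is a stand-in for the approximate graph edit distance, queries are represented as sets of triple-pattern strings, and the training data is invented for illustration.

```python
import random


def k_medoids(items, k, distance, iterations=20, seed=0):
    """Alternating k-medoids: assign items to the nearest medoid, then re-pick medoids."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)
    for _ in range(iterations):
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            nearest = min(medoids, key=lambda m: distance(item, items[m]))
            clusters[nearest].append(i)
        # New medoid of a cluster: the member with the smallest total distance
        # to the other members (fall back to the old medoid if a cluster is empty).
        new_medoids = [
            min(members or [m],
                key=lambda c: sum(distance(items[c], items[j]) for j in members))
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [items[m] for m in sorted(medoids)]


def pattern_features(query, landmarks, distance):
    """Feature vector: distance from the query to each landmark query."""
    return [distance(query, landmark) for landmark in landmarks]


def toy_distance(a, b):
    """Stand-in for approximate graph edit distance: size of the symmetric difference."""
    return len(a ^ b)


# Invented training queries, each a set of triple patterns.
training = [
    frozenset({"?s rdf:type ?c"}),
    frozenset({"?s rdf:type ?c", "?s rdfs:label ?l"}),
    frozenset({"?s foaf:name ?n", "?s foaf:knows ?f", "?f foaf:name ?m"}),
    frozenset({"?s foaf:name ?n", "?s foaf:knows ?f"}),
]
landmarks = k_medoids(training, k=2, distance=toy_distance)
features = pattern_features(frozenset({"?s foaf:name ?n"}), landmarks, toy_distance)
print(features)
```

The length of the feature vector equals the number of landmarks, so `k` directly controls the dimensionality of the graph pattern features.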
12. Graph Edit Distance
• Minimum amount of distortion needed to transform one graph to another
– Approximated graph edit distance based on bipartite matching
• Previous research shows accurate results on classification problems
Riesen et al. “A Novel Software Toolkit for Graph Edit Distance Computation”, 9th IAPR-TC-15, GbRPR 2013
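The bipartite-matching idea can be sketched as a single assignment problem over nodes: each node of the first graph is either substituted by a node of the second graph or deleted, and leftover nodes of the second graph are inserted. The sketch below is not the cited toolkit. The unit costs, the degree-difference estimate for edge costs, and the brute-force search over permutations (standing in for an efficient assignment solver such as the Hungarian algorithm) are all simplifying assumptions, so it is only usable on very small graphs.

```python
from itertools import permutations

INF = float("inf")


def approx_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Approximate graph edit distance via one node-assignment problem.

    nodes_*: dict mapping node id -> label; edges_*: set of (source, target) pairs.
    """
    def degree(node, edges):
        return sum(node in edge for edge in edges)

    a, b = list(nodes_a), list(nodes_b)
    n, m = len(a), len(b)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            # Substitution: relabel cost plus a local estimate of edge edits.
            relabel = 0 if nodes_a[a[i]] == nodes_b[b[j]] else 1
            cost[i][j] = relabel + abs(degree(a[i], edges_a) - degree(b[j], edges_b))
        for j in range(n):
            # Deletion of node i (only the diagonal entry is allowed).
            cost[i][m + j] = 1 + degree(a[i], edges_a) if i == j else INF
    for i in range(m):
        for j in range(m):
            # Insertion of node i of the second graph (diagonal only).
            cost[n + i][j] = 1 + degree(b[i], edges_b) if i == j else INF
    # The bottom-right block stays 0: matching two dummy nodes is free.
    return min(
        sum(cost[row][col] for row, col in enumerate(assignment))
        for assignment in permutations(range(size))
    )


# Two tiny query graphs: the second lacks one node and one edge.
g1 = ({1: "?s", 2: "?o", 3: "?x"}, {(1, 2), (1, 3)})
g2 = ({1: "?s", 2: "?o"}, {(1, 2)})
print(approx_ged(*g1, *g2))
```

Because edge costs are only estimated from node degrees, the result approximates the true edit distance rather than computing it exactly.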
13. Experiments
• 1260 training, 420 validation, and 420 test queries generated from DBPSB benchmark query templates
– DBPSB templates cover the most commonly used SPARQL query features in the queries sent to DBpedia
• DBpedia as RDF data set
• Predicting query execution time
– k-NN regression with k-D tree
– SVM with nu-SVR for regression
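For intuition on the k-NN regression step, here is a minimal sketch that predicts an execution time as the mean over the k nearest training vectors. It uses a brute-force neighbor search rather than the k-d tree named on the slide, and it does not cover nu-SVR. The feature vectors and times are invented for illustration and are not results from the experiments.

```python
import math


def knn_predict(train_x, train_y, query_x, k=3):
    """Predict by averaging the targets of the k nearest training vectors."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query_x),
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Invented feature vectors and execution times in milliseconds.
train_x = [[1, 0], [2, 0], [3, 1], [8, 4], [9, 5]]
train_y = [10, 12, 15, 200, 240]

print(knn_predict(train_x, train_y, [2, 1], k=3))
```

A k-d tree changes only how the neighbors are found, not the prediction, so the brute-force version gives the same answer on small training sets.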
DBpedia: http://dbpedia.org/
DBPSB: http://aksw.org/Projects/DBPSB.html
17. Summary
• Understanding SPARQL query behavior in the Linked Data scenario
– Predicting query performance metrics
• Learn query execution times from already executed queries
– without using statistics about the underlying RDF data.
– Modeling (vector representation) SPARQL queries for machine learning algorithms
• Feature extraction
– Highly accurate predictions for common Linked Data queries
18. Future Work on QPP
• Incorporating bandwidth related features.
• Query optimization for Linked Data applications:
– in place of selectivity estimation for alternative queries?
• How to accurately predict performance for single triple patterns
– Alternative query construction for Linked Data applications: join order optimization, e.g. federated query processing over Linked Data
• How to generate training queries?
– Next slide
19. Statistical Analysis of Query Logs
• Approach to systematically generating training queries
Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries,
1st International Workshop on Usage Analysis and the Web of Data,
co-located with the 20th International World Wide Web Conference (WWW2011)
Web of Documents ->
documents were described using HTML and globally identified using URLs
Retrieval mechanism: HTTP protocol
All of these ensured the creation of a single global data space
Data on the Web
Many formats and APIs
Proprietary interfaces
No single global data space – no hyperlinks between data items within different data sources
Web of Data -> a single global data space
using RDF to publish data on the Web
links between data items within different data sources
Histograms -> on which to create histograms for effective estimation
The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph to another.
The minimum amount of distortion is the sequence of edit operations with minimum cost. The edit operations are deletions, insertions, and substitutions of nodes and edges.
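Written as a formula, the definition above is commonly stated as follows, where the symbols are notation introduced here: \Upsilon(g_1, g_2) is the set of edit paths (sequences of edit operations) that transform g_1 into g_2, and c(e_i) is the cost of edit operation e_i.

```latex
d(g_1, g_2) = \min_{(e_1, \dots, e_k) \in \Upsilon(g_1, g_2)} \sum_{i=1}^{k} c(e_i)
```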
Refining training queries from query logs by considering the statistically significant characteristics
Bootstrapping: starting with an initial set of properties, resources, and literals, then generating training queries by permutations and combinations of the statistically significant features
Simplifying the pattern features: join features, triple pattern features, pattern graph features to represent query patterns
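The bootstrapping idea in the notes above (start from a small set of properties, resources, and literals, then combine them into training queries) might look like the sketch below. The template, the prefixed names, and the choice of a plain Cartesian product are assumptions for illustration; the notes do not specify the generation procedure.

```python
from itertools import product

# Hypothetical single-triple-pattern template; doubled braces are literal braces.
TEMPLATE = "SELECT ?s WHERE {{ ?s {prop} {obj} }}"

# Invented seed terms; real seeds would come from query log analysis.
properties = ["rdf:type", "dbo:birthPlace"]
objects = ["dbo:Person", "dbr:France", '"Paris"@en']


def generate_queries(template, properties, objects):
    """Fill the template with every (property, object) combination."""
    return [
        template.format(prop=prop, obj=obj)
        for prop, obj in product(properties, objects)
    ]


queries = generate_queries(TEMPLATE, properties, objects)
for query in queries:
    print(query)
```

The generated strings omit PREFIX declarations, which would need to be added before sending them to an endpoint.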