A Machine Learning Approach to SPARQL Query Performance Prediction

Published in: Software
  • Web of Documents: documents are described using HTML and globally identified using URLs, and the retrieval mechanism is HTTP. Together these create a single global data space.

    Data on the Web: many formats and APIs, proprietary interfaces, and no single global data space, because there are no hyperlinks between data items in different data sources.

    Web of Data: a single global data space, created by using RDF to publish data on the Web with links between data items in different data sources.
  • Histograms: it is unclear on what to build histograms for effective cost estimation over RDF data
  • The graph edit distance between two graphs is the minimum amount of distortion needed to transform one graph into the other.
    This minimum is the total cost of the cheapest sequence of edit operations, where the edit operations are deletions, insertions, and substitutions of nodes and edges.
  • Refining training queries from query logs by considering their statistically significant characteristics

    Bootstrapping: start with an initial set of properties, resources, and literals, then generate training queries by permutations and combinations of the statistically significant features

    Simplifying the pattern features: use join features, triple pattern features, and pattern graph features to represent query patterns
  • A Machine Learning Approach to SPARQL Query Performance Prediction

    1. A Machine Learning Approach to SPARQL Query Performance Prediction
       Rakebul Hasan, Wimmics Research Team, INRIA Sophia Antipolis, France
    2. Context (slide derived from Andreas Blumauer’s Linked Data slides)
       • Linked Data Principles
         1. Use URIs as names for things.
         2. Use HTTP URIs, so that people can look up those names.
         3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
         4. Include links to other URIs, so that they can discover more things.
    3. Context
       • W3C Linking Open Data (LOD) Initiative
       • An initiative to publish open data as Linked Data
       • From 2 billion triples in 2007 to 30 billion triples in 2011
       • Accessing Linked Data
         – Dereferencing URIs
         – SPARQL Endpoints
    4. Context
       • Querying Linked Data
         – SPARQL Endpoints: SPARQL query service via HTTP implementing the SPARQL Protocol
         – 68% of the data sets provide SPARQL Endpoints as of September 2011
         – As of today, 98% of the triples in the LOD cloud are accessible via SPARQL
           • 57,856,463,005 out of 58,882,358,557 triples (http://stats.lod2.eu/)
    5. Context
       • Understanding Query Behavior in the context of Linked Data
         – Workload allocation to ensure specific QoS requirements are met
         – Predicting query performance metrics
    6. Query Performance Prediction
       • Traditional approaches use cost models based on statistics about the underlying data to predict query performance
       • Data statistics are often missing in the Linked Data scenario
         – Only 32.20% (95 out of 295) of data sources provide a voiD description.
       • Basic statistics, such as the number of triples, are often not detailed enough for statistics-based models
         – In fact, what makes effective statistics for query cost estimation on RDF is unclear.
       • Challenge
         – How to predict query performance without using data statistics?
    7. Understanding performance of database queries
       • Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning.
       • Akdere et al. use machine learning for predicting query execution time.
       Ganapathi et al., Predicting multiple metrics for queries: Better decisions enabled by machine learning, ICDE’09
       Akdere et al., Learning-based query performance modeling and prediction, ICDE’12
    8. Predicting Query Performance
       • Learn query performance from already executed queries
       • Challenge: how to model SPARQL query characteristics for machine learning algorithms (feature extraction)?
    9. Modeling SPARQL Query Execution
       • Two types of features
         – Algebra features: extracted from the SPARQL algebra expression of a query
         – Graph pattern features: a vector representation of the query pattern of a query relative to the training queries
    10. Modeling SPARQL Query Execution
        • Algebra features
          – Jena API to extract SPARQL algebra expressions
    11. Graph pattern features
        – Find landmarks in training queries by clustering
          • K-medoids with approximate graph edit distance
        – Compute the distance between the landmark queries and the query under examination to construct a graph pattern feature vector
          • Approximate graph edit distance for distance computation
    12. Graph Edit Distance
        • Minimum amount of distortion needed to transform one graph to another
          – Approximate graph edit distance based on bipartite matching
            • Previous research shows accurate results with classification problems
        Riesen et al., “A Novel Software Toolkit for Graph Edit Distance Computation”, 9th IAPR-TC-15, GbRPR 2013
    13. Experiments
        • 1260 training, 420 validation, and 420 test queries generated from DBPSB benchmark query templates
          – DBPSB templates cover the most commonly used SPARQL query features in the queries sent to DBpedia
        • DBpedia as RDF data set
        • Predicting query execution time
          – k-NN regression with k-d tree
          – SVM with nu-SVR for regression
        DBpedia: http://dbpedia.org/
        DBPSB: http://aksw.org/Projects/DBPSB.html
    14. Algebra Features
    15. Algebra and graph pattern features
    16. Time Required for Training and Predictions
    17. Summary
        • Understanding SPARQL query behavior in the Linked Data scenario
          – Predicting query performance metrics
            • Learn query execution times from already executed queries, without using statistics about the underlying RDF data
          – Modeling (vector representation) SPARQL queries for machine learning algorithms
            • Feature extraction
          – Highly accurate predictions for common Linked Data queries
    18. Future Work on QPP
        • Incorporating bandwidth-related features.
        • Query optimization for Linked Data applications:
          – in place of selectivity estimation for alternative queries?
        • How to accurately predict performance for single triple patterns
          – Alternative query construction for Linked Data applications
          – Join order optimization, e.g. federated query processing over Linked Data
        • How to generate training queries?
          – Next slide
    19. Statistical Analysis of Query Logs
        • Approach to systematically generating training queries
        Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente: An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011)
    20. Thank you
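
Slides 9 and 10 describe algebra features as values extracted from the SPARQL algebra expression of a query. The deck uses the Jena API for this and does not list the exact features, so the following is only a minimal Python sketch under stated assumptions: the algebra expression is already available as text in an s-expression style, the operator vocabulary `OPERATORS` is hypothetical, and the nesting depth is added as one more illustrative feature.

```python
import re
from collections import Counter

# Hypothetical operator vocabulary; the slides do not list the operators used.
OPERATORS = ["bgp", "triple", "join", "leftjoin", "union", "filter", "project", "distinct"]

def algebra_features(algebra: str) -> list[int]:
    """Count how often each operator opens a sub-expression, then append the nesting depth."""
    counts = Counter(re.findall(r"\(\s*([a-z]+)", algebra))
    depth = max_depth = 0
    for ch in algebra:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return [counts[op] for op in OPERATORS] + [max_depth]

# Illustrative algebra expression (an assumption, not taken from the deck).
example = """
(project (?name)
  (filter (> ?age 30)
    (leftjoin
      (bgp (triple ?p <http://xmlns.com/foaf/0.1/name> ?name)
           (triple ?p <http://xmlns.com/foaf/0.1/age> ?age))
      (bgp (triple ?p <http://xmlns.com/foaf/0.1/mbox> ?m)))))
"""
features = algebra_features(example)
print(features)
```

Each query becomes a fixed-length integer vector, which is the form a regression model such as k-NN or nu-SVR expects as input.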

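
Slide 12 and the speaker notes define graph edit distance as the cost of the cheapest sequence of node and edge edit operations, approximated through bipartite matching. The sketch below illustrates that idea in plain Python and is not the toolkit cited on the slide. It assumes unit costs for every edit operation, uses node degree as the local edge-structure estimate, and solves the assignment by brute force over permutations, which is only workable for very small graphs. The function name `approx_ged` and the toy graphs are hypothetical.

```python
from itertools import permutations

INF = float("inf")

def degree(node, edges):
    """Number of directed edges that touch the node."""
    return sum(node in e for e in edges)

def approx_ged(labels1, edges1, labels2, edges2):
    """Upper bound on graph edit distance from one optimal node assignment (unit costs)."""
    n1, n2 = list(labels1), list(labels2)
    n, m = len(n1), len(n2)
    size = n + m
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:   # substitute n1[i] by n2[j]
                cost[i][j] = (labels1[n1[i]] != labels2[n2[j]]) + abs(
                    degree(n1[i], edges1) - degree(n2[j], edges2))
            elif i < n:           # delete n1[i]; only its own deletion slot is allowed
                cost[i][j] = 1 + degree(n1[i], edges1) if j - m == i else INF
            elif j < m:           # insert n2[j]; only its own insertion slot is allowed
                cost[i][j] = 1 + degree(n2[j], edges2) if i - n == j else INF
            # remaining cells pair two dummy nodes and keep cost 0
    # Brute-force assignment over all permutations: tiny graphs only.
    best = min(permutations(range(size)),
               key=lambda p: sum(cost[i][p[i]] for i in range(size)))
    mapping = {n1[i]: n2[best[i]] for i in range(n) if best[i] < m}
    # Cost of the complete edit path implied by the node mapping.
    total = sum(labels1[u] != labels2[v] for u, v in mapping.items())
    total += (n - len(mapping)) + (m - len(mapping))  # node deletions and insertions
    mapped = {(mapping[u], mapping[v]) for u, v in edges1
              if u in mapping and v in mapping}
    kept = len(mapped & edges2)
    total += (len(edges1) - kept) + (len(edges2) - kept)  # edge deletions and insertions
    return total

# Toy query pattern graphs: node labels are term kinds, edges are directed.
labels_a = {"a": "var", "b": "lit", "c": "uri"}
edges_a = {("a", "b"), ("a", "c")}
labels_b = {"x": "var", "y": "lit"}
edges_b = {("x", "y")}
d = approx_ged(labels_a, edges_a, labels_b, edges_b)
print(d)
```

In this example the best assignment maps `a` to `x` and `b` to `y`, so the implied edit path deletes node `c` and the edge into it. A practical implementation would replace the permutation search with a polynomial-time assignment solver.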
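
Slide 11 builds graph pattern features in two steps: cluster the training queries with k-medoids to find landmark queries, then describe a query by its distances to those landmarks. The following sketch shows both steps under loudly stated assumptions. Queries are reduced to sets of predicate names, and the symmetric set difference stands in for the approximate graph edit distance. The initial medoids are simply the first k items so the run is reproducible. All names and data are hypothetical.

```python
def k_medoids(items, k, dist, max_iter=100):
    """Alternating k-medoids: assign items to the nearest medoid, then re-pick each medoid."""
    medoids = list(range(k))  # deterministic start: the first k items (an assumption)
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in range(len(items)):
            nearest = min(medoids, key=lambda m: dist(items[i], items[m]))
            clusters[nearest].append(i)
        # The new medoid of a cluster minimizes the summed distance to its members.
        new_medoids = [
            min(members or [m],
                key=lambda c: sum(dist(items[c], items[o]) for o in members))
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [items[m] for m in medoids]

def pattern_features(query, landmarks, dist):
    """Graph pattern feature vector: distance from the query to every landmark."""
    return [dist(query, lm) for lm in landmarks]

def toy_distance(a, b):
    """Stand-in for approximate graph edit distance: size of the symmetric difference."""
    return len(a ^ b)

# Hypothetical training queries, each reduced to the set of predicates it uses.
training = [
    frozenset({"type"}),
    frozenset({"population"}),
    frozenset({"type", "name"}),
    frozenset({"type", "name", "birthPlace"}),
    frozenset({"population", "country"}),
    frozenset({"population", "country", "area"}),
]
landmarks = k_medoids(training, k=2, dist=toy_distance)
vector = pattern_features(frozenset({"type", "name", "population"}), landmarks, toy_distance)
print(landmarks, vector)
```

The vector length equals the number of landmarks, so it can be concatenated with algebra features before training a regression model.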