
A Machine Learning Approach to SPARQL Query Performance Prediction


  1. A Machine Learning Approach to SPARQL Query Performance Prediction. Rakebul Hasan, Wimmics Research Team, INRIA Sophia Antipolis, France
  2. Context • Linked Data Principles: 1. Use URIs as names for things. 2. Use HTTP URIs, so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). 4. Include links to other URIs, so that they can discover more things. (Slide derived from Andreas Blumauer’s Linked Data slides.)
  3. Context • W3C Linking Open Data (LOD) Initiative: an initiative to publish open data as Linked Data • From 2 billion triples in 2007 to 30 billion triples in 2011 • Accessing Linked Data – Dereferencing URIs – SPARQL Endpoints
  4. Context • Querying Linked Data – SPARQL Endpoints: a SPARQL query service over HTTP that implements the SPARQL Protocol – 68% of the data sets provide SPARQL Endpoints as of September 2011 – As of today, 98% of the triples in the LOD cloud are accessible via SPARQL • 57,856,463,005 out of 58,882,358,557 triples (http://stats.lod2.eu/)
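To make "SPARQL query service via HTTP" concrete, here is a minimal sketch of how a client could build a SPARQL Protocol request URL, where the query text travels in the `query` parameter of a GET request. The endpoint URL and the example query are assumptions chosen for illustration; the slide does not name either. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint for illustration: DBpedia's public SPARQL service.
ENDPOINT = "https://dbpedia.org/sparql"

# Hypothetical example query; not taken from the slides.
QUERY = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 5"


def build_query_url(endpoint: str, query: str) -> str:
    """Return a 'query via GET' URL: the SPARQL text goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

A client would then issue an HTTP GET on this URL, typically asking for a result format such as `application/sparql-results+json` in the `Accept` header.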
  5. Context • Understanding query behavior in the context of Linked Data – Workload allocation to ensure specific QoS requirements are met – Predicting query performance metrics
  6. Query Performance Prediction • Traditional approaches use cost models based on statistics about the underlying data to predict query performance • Data statistics are often missing in the Linked Data scenario – Only 32.20% (95 out of 295) of data sources provide a voiD description • Basic statistics, such as the number of triples, are often not detailed enough for statistics-based models – In fact, what makes effective statistics for query cost estimation on RDF is unclear • Challenge – How to predict query performance without using data statistics?
  7. Understanding performance of database queries • Ganapathi et al. predict performance metrics of database queries prior to query execution using machine learning • Akdere et al. use machine learning to predict query execution time • References: Ganapathi et al., Predicting multiple metrics for queries: Better decisions enabled by machine learning, ICDE’09; Akdere et al., Learning-based query performance modeling and prediction, ICDE’12
  8. Predicting Query Performance • Learn query performance from already executed queries • Challenge: how to model SPARQL query characteristics for machine learning algorithms (feature extraction)?
  9. Modeling SPARQL Query Execution • Two types of features – Algebra features: extracted from the SPARQL algebra expression of a query – Graph pattern features: a vector representation of the query pattern of a query relative to the training queries
  10. Modeling SPARQL Query Execution • Algebra features – Jena API to extract SPARQL algebra expressions
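One way to picture algebra features is as counts of algebra operators plus the nesting depth of the expression. The sketch below works on a hand-written S-expression string in the style Jena prints for SPARQL algebra. The operator vocabulary, the depth feature, and the example expression are all assumptions for illustration; the slides do not list the exact features used.

```python
import re
from collections import Counter

# Hypothetical operator vocabulary; the slides do not enumerate the features.
OPERATORS = ["bgp", "triple", "join", "leftjoin", "union",
             "filter", "project", "distinct", "order", "slice"]


def algebra_features(expr: str) -> list:
    """Return operator counts (in OPERATORS order) followed by the max nesting depth."""
    counts = Counter()
    depth = max_depth = 0
    # Match either "(" with an optional operator name right after it, or ")".
    for match in re.finditer(r"\(\s*([A-Za-z]*)|\)", expr):
        if match.group(0) == ")":
            depth -= 1
            continue
        depth += 1
        max_depth = max(max_depth, depth)
        name = match.group(1).lower()
        if name in OPERATORS:
            counts[name] += 1
    return [counts[op] for op in OPERATORS] + [max_depth]


# Hand-written algebra expression, for illustration only.
EXPR = ("(project (?s) (filter (> ?pop 1000) "
        "(bgp (triple ?s ?p ?o) (triple ?s ?q ?pop))))")

features = algebra_features(EXPR)
print(features)
```

The resulting fixed-length vector can be fed to any regression model regardless of how large or deeply nested the original query is.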
  11. • Graph pattern features – Find landmarks in the training queries by clustering • K-medoids with approximate graph edit distance – Compute the distance between each landmark query and the query under examination to construct a graph pattern feature vector • Approximate graph edit distance for distance computation
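The two steps above can be sketched as follows: cluster the training queries with k-medoids to obtain landmark queries, then describe any query by its distances to those landmarks. This sketch swaps in a Jaccard distance over sets of triple-pattern strings as a stand-in for the approximate graph edit distance used in the talk, and the toy queries are made up. It also assumes the training queries are pairwise distinct.

```python
import random


def jaccard_distance(a, b):
    """Stand-in distance between two queries given as sets of triple patterns."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def k_medoids(items, k, dist, max_iter=20, seed=0):
    """Plain alternating k-medoids; returns the k medoid items (the landmarks)."""
    medoids = random.Random(seed).sample(range(len(items)), k)
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            nearest = min(medoids, key=lambda m: dist(item, items[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimises total distance within its cluster.
        updated = [
            min(members or [m],
                key=lambda c: sum(dist(items[c], items[j]) for j in members))
            for m, members in clusters.items()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return [items[m] for m in medoids]


def pattern_features(query, landmarks, dist):
    """One feature per landmark: the distance from the query to that landmark."""
    return [dist(query, landmark) for landmark in landmarks]


# Toy training queries (hypothetical), each a set of triple-pattern strings.
training = [
    frozenset({"?s type ?c"}),
    frozenset({"?s type ?c", "?s label ?l"}),
    frozenset({"?s type ?c", "?s label ?l", "?s comment ?t"}),
    frozenset({"?a knows ?b"}),
    frozenset({"?a knows ?b", "?b name ?n"}),
]

landmarks = k_medoids(training, 2, jaccard_distance)
features = pattern_features(frozenset({"?s type ?c", "?s label ?l"}),
                            landmarks, jaccard_distance)
print(features)
```

Because k-medoids only needs pairwise distances, the same code would accept a graph edit distance function in place of `jaccard_distance` without other changes.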
  12. Graph Edit Distance • Minimum amount of distortion needed to transform one graph into another – Approximate graph edit distance based on bipartite matching • Previous research shows accurate results on classification problems • Reference: Riesen et al., “A Novel Software Toolkit for Graph Edit Distance Computation”, 9th IAPR-TC-15, GbRPR 2013
  13. Experiments • 1260 training, 420 validation, and 420 test queries generated from DBPSB benchmark query templates – DBPSB templates cover the most commonly used SPARQL query features in the queries sent to DBpedia • DBpedia as the RDF data set • Predicting query execution time – k-NN regression with a k-d tree – SVM with nu-SVR for regression • DBpedia: http://dbpedia.org/ • DBPSB: http://aksw.org/Projects/DBPSB.html
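As an illustration of the k-NN regression idea, the sketch below predicts a query's execution time as the mean time of its k nearest training queries in feature space. It uses a brute-force neighbour search rather than the k-d tree mentioned on the slide, and the feature vectors and execution times are invented for the example.

```python
import math


def knn_predict(train_x, train_y, x, k=3):
    """Predict the mean target of the k training points closest to x (Euclidean)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up feature vectors and execution times in milliseconds.
train_x = [[1, 0], [1, 1], [2, 1], [5, 3], [6, 3]]
train_y = [10.0, 12.0, 14.0, 90.0, 110.0]

predicted_ms = knn_predict(train_x, train_y, [1, 1], k=3)
print(predicted_ms)
```

A k-d tree changes only how the neighbours are found, not which neighbours are returned, so the prediction is the same while lookups get faster on larger training sets.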
  14. Algebra Features
  15. Algebra and graph pattern features
  16. Time Required for Training and Predictions
  17. Summary • Understanding SPARQL query behavior in the Linked Data scenario – Predicting query performance metrics • Learn query execution times from already executed queries, without using statistics about the underlying RDF data – Modeling (vector representation) SPARQL queries for machine learning algorithms • Feature extraction – Highly accurate predictions for common Linked Data queries
  18. Future Work on QPP • Incorporating bandwidth-related features • Query optimization for Linked Data applications: – in place of selectivity estimation for alternative queries? • How to accurately predict performance for single triple patterns – Alternative query construction for Linked Data applications – Join order optimization, e.g. federated query processing over Linked Data • How to generate training queries? – Next slide
  19. Statistical Analysis of Query Logs • Approach to systematically generating training queries • Reference: Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, Pablo de la Fuente, An Empirical Study of Real-World SPARQL Queries, 1st International Workshop on Usage Analysis and the Web of Data, co-located with the 20th International World Wide Web Conference (WWW2011)
  20. Thank you
