Linked Data Top-K Query Processing

"Linked Data Top-K Query Processing" paper at ESWC'12.

Notes
  • Introduction: challenges in current Linked Data query processing; processing of ranked Linked Data; our contributions. Top-k: top-k query processing in a Linked Data setting; improving the threshold estimation; eager pruning of partial results.
  • Special case of federated query processing. Only HTTP lookups are available for data access. Entire sources have to be retrieved.
  • Provides strategies for computing only the k top-ranked results. Other (less relevant) results are not materialized. For computing the top-1 result, no data from Src. 2 is needed.
  • Tighter threshold estimation and early pruning of partial results.
  • For instance, scores for triples can be obtained through PageRank-inspired ranking [4]. However, no triples are indexed (i.e., each source must be scanned).
  • Join inputs must be accessible in descending score order. We store the min/max triple score per source and allow sources to be accessed in descending score order (via a scheduling strategy).
  • Given our ranking function, sorted access, and source index, we can employ a push-based rank join.
  • The threshold allows us to estimate the scores of the unseen query result bindings and to terminate early.
  • Push-based symmetric hash join operator (shj); rank-join operator with corner bound (rj-cc) [6]; rank-join operator with tighter corner bound and early pruning (rj-tc). All systems use push-based join processing and left-deep join trees. Due to network latency issues, sources were downloaded and Linked Data access was simulated on a single machine.
  • Differences are due to less input data being retrieved. Some queries (e.g., q10 or q20) perform equally because the result set is too small (i.e., all data had to be retrieved). Differences between rj-cc and rj-tc do not show properly in (a) because the evaluation ran on a local machine. The outlier q19 is due to an implementation issue. q9: early pruning saved 8% of the buffered data. However, this had no real impact on efficiency, since the main factor here is the number of sources to be retrieved.
  • (b) Average number of sources (different k, d = n). (c) Average evaluation time (different k, d = n). (d) Average evaluation time (different n, k = 10). (e) Average evaluation time with varying number of triple patterns (k = 1, d = n).
  • ("Seen" and "output" buffers.) That is, we prune any partial result whose (partial) score, together with the maximal possible score for the unevaluated query part, is ≤ the currently smallest top-k score.

    1. Top-k Linked Data Query Processing
        Andreas Wagner, Duc Thanh Tran, Günter Ladwig, Andreas Harth, and Rudi Studer
        Institute of Applied Informatics and Formal Description Methods (AIFB)
        KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu
    2. Introduction and Motivation; Top-k Linked Data Query Processing; Evaluation Results
    3. INTRODUCTION & MOTIVATION
    4. Linked Data Query Processing
        [Figure: a Linked Data query processing engine retrieves data from sources via HTTP URI lookups.]
        Problems: efficiency and scalability
    5. Top-K Query Processing
        Users are usually interested in only a few results.
        Top-K query processing addresses the efficiency and scalability issues.
        Src. 1: ex:beatles foaf:name "The Beatles"; ex:album ex:sgt_pepper; ex:album ex:help.
        Src. 2: ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".
        Src. 3: ex:help foaf:name "Help!"; ex:song "Help!".
        Query: SELECT * WHERE { ex:beatles ex:album ?album . ?album ex:song ?song . }
    6. Contributions
        Transfer top-k query processing to the Linked Data setting.
        Linked Data specific improvements of the top-k approach.
        Evaluation using real-world data.
    7. TOP-K LINKED DATA QUERY PROCESSING
    8. Top-K Query Processing in a Linked Data Setting (1) – Requirements (1)
        Source index mapping triple patterns to sources containing bindings (e.g., [1,2]).
        Ranking function determining the relevance of triple pattern bindings.
        TP1: ex:beatles ex:album ?album .
        TP2: ?album ex:song ?song .
        [Figure: the source index of the query processing engine maps TP1 to Src. 1 (score ∈ [0,1]) and TP2 to Src. 2 (score ∈ [1,2]) and Src. 3 (score ∈ [2,3]).]
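A source index of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `SourceEntry`, `SOURCE_INDEX`, and `sources_for` are hypothetical names, and the score intervals are the ones from the running example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEntry:
    """A source that contains bindings for a triple pattern (hypothetical type)."""
    source: str
    min_score: float  # lowest triple score found in the source
    max_score: float  # highest triple score found in the source


# Hypothetical source index: triple pattern -> sources with their score ranges.
SOURCE_INDEX = {
    "ex:beatles ex:album ?album": [SourceEntry("Src. 1", 0, 1)],
    "?album ex:song ?song": [
        SourceEntry("Src. 2", 1, 2),
        SourceEntry("Src. 3", 2, 3),
    ],
}


def sources_for(pattern):
    """Return the sources that may contain bindings for the given pattern."""
    return SOURCE_INDEX.get(pattern, [])


for entry in sources_for("?album ex:song ?song"):
    print(entry.source, entry.min_score, entry.max_score)
```

Storing the score range next to each source is what later allows sources to be visited in descending score order without retrieving them first.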
    9. Top-K Query Processing in a Linked Data Setting (2) – Requirements (2)
        Sorted access on each join input: a scheduling strategy orders the sources so that bindings arrive with descending scores.
        TP1: ex:beatles ex:album ?album
        TP2: ?album ex:song ?song
        [Figure: scheduling order 1: Src. 1 (score ∈ [0,1]); 2: Src. 3 (score ∈ [2,3]); 3: Src. 2 (score ∈ [1,2]).]
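Sorted access over whole sources can be sketched as a lazy generator. The code below is an assumption-laden illustration: `sorted_access`, `load`, and `DATA` are hypothetical names, `load` stands in for an HTTP lookup, the scheduling rule (visit sources by descending maximal score) is assumed, and the score 2 for the "Lucy" triple is an assumption consistent with the range of Src. 2.

```python
import heapq


def sorted_access(sources, load):
    """Yield (score, triple) pairs in descending score order.

    sources: list of (name, min_score, max_score) from the source index.
    load:    callable name -> list of (score, triple); stands in for an HTTP lookup.
    """
    # Assumed scheduling strategy: visit sources by descending maximal score.
    pending = sorted(sources, key=lambda s: s[2], reverse=True)
    heap = []  # max-heap of buffered triples, stored as (-score, triple)
    while pending or heap:
        # Load further sources while one of them could still contain a triple
        # that scores higher than the best buffered one.
        while pending and (not heap or -heap[0][0] < pending[0][2]):
            name, _, _ = pending.pop(0)
            for score, triple in load(name):
                heapq.heappush(heap, (-score, triple))
        if heap:
            neg_score, triple = heapq.heappop(heap)
            yield -neg_score, triple


# Illustrative data for TP2 (?album ex:song ?song).
DATA = {
    "Src. 2": [(2, 'ex:sgt_pepper ex:song "Lucy"')],
    "Src. 3": [(3, 'ex:help ex:song "Help!"')],
}
loaded = []


def load(name):
    loaded.append(name)  # record which sources were actually retrieved
    return DATA[name]


stream = sorted_access([("Src. 2", 1, 2), ("Src. 3", 2, 3)], load)
first = next(stream)
print(first, loaded)
```

Because the generator is lazy, asking for the first binding retrieves only Src. 3; Src. 2 is loaded only if a second binding is requested.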
    10. Top-K Query Processing in a Linked Data Setting (3) – Push Bound Rank Join (1)
        Scheduling strategy: load sources 1 and 3.
        Sorted access for ex:beatles ex:album ?album (TP1): Src. 1.
        Sorted access for ?album ex:song ?song (TP2): Src. 3.
        Seen triples (TP1): score 1: ex:beatles ex:album ex:sgt_pepper; score 1: ex:beatles ex:album ex:help.
        Seen triples (TP2): score 3: ex:help ex:song "Help!".
        Output queue of query bindings, ordered by score.
    11. Top-K Query Processing in a Linked Data Setting (4) – Push Bound Rank Join (2)
        Threshold: 4
        Output queue: score 4: ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
        Seen triples (TP1): score 1: ex:beatles ex:album ex:sgt_pepper; score 1: ex:beatles ex:album ex:help.
        Seen triples (TP2): score 3: ex:help ex:song "Help!".
        Found query binding with score ≥ threshold: STOP. Src. 2 has not been loaded.
    12. Improving the Threshold Estimation (1)
        Threshold estimation: Threshold = max { max_1 + min_2 , max_2 + min_1 }, where max_i is the upper bound for the seen bindings of input i (seen triples of TP i) and min_i is the upper bound for its unseen bindings.
        We improve the threshold estimation with star-shaped entity query bounds and look-ahead bounds.
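The rank join with this threshold can be sketched on the running example. Note the assumptions: the slides describe a push-based operator, whereas this sketch uses a simplified pull-style loop that alternates between the two inputs; `rank_join` and its argument layout are hypothetical; max_i is taken as the highest seen score and min_i as the last seen score of input i; and the score 2 for the "Lucy" triple is assumed.

```python
import math


def rank_join(tp1, tp2, k):
    """Join TP1 and TP2 bindings on ?album and return the top-k results.

    tp1: list of (score, album), sorted by descending score.
    tp2: list of (score, album, song), sorted by descending score.
    Returns (top-k results, (bindings read from tp1, bindings read from tp2)).
    """
    seen1, seen2, results = [], [], []
    i = j = 0
    while i < len(tp1) or j < len(tp2):
        # Alternate between the inputs; fall back to the one with data left.
        if i < len(tp1) and (i <= j or j >= len(tp2)):
            score, album = tp1[i]
            i += 1
            seen1.append((score, album))
            results += [(score + s2, album, song)
                        for s2, a2, song in seen2 if a2 == album]
        else:
            score, album, song = tp2[j]
            j += 1
            seen2.append((score, album, song))
            results += [(s1 + score, album, song)
                        for s1, a1 in seen1 if a1 == album]
        results.sort(reverse=True)

        # max_i: highest seen score; min_i: last seen score, which bounds the
        # unseen ones. An input that has not been read yet is unbounded.
        max_1 = seen1[0][0] if seen1 else math.inf
        min_1 = seen1[-1][0] if seen1 else math.inf
        max_2 = seen2[0][0] if seen2 else math.inf
        min_2 = seen2[-1][0] if seen2 else math.inf
        threshold = max(max_1 + min_2, max_2 + min_1)

        # Stop once the k-th best result cannot be beaten by unseen bindings.
        if len(results) >= k and results[k - 1][0] >= threshold:
            break
    return results[:k], (i, j)


tp1 = [(1, "ex:sgt_pepper"), (1, "ex:help")]
tp2 = [(3, "ex:help", "Help!"), (2, "ex:sgt_pepper", "Lucy")]
top, reads = rank_join(tp1, tp2, k=1)
print(top, reads)  # [(4, 'ex:help', 'Help!')] (2, 1)
```

For k = 1 the loop stops after reading a single TP2 binding, so the binding from Src. 2 is never touched. For k = 2 both inputs are read to the end, because the second result (score 3) stays below the threshold of 4.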
    13. Improving the Threshold Estimation (2): Star-shaped Entity Query Bounds
        Observation: results for entity queries come from one single source.
        Idea: upper-bound the scores of triple pattern bindings via the maximal possible triple score.
        Entity query: ?x ex:song ?y ; foaf:name ?z
        Src. 2 (score ∈ [1,2]): ex:sgt_pepper foaf:name "Sgt. Pepper"; ex:song "Lucy".
        Src. 3 (score ∈ [2,3]): ex:help foaf:name "Help!"; ex:song "Help!".
        Upper bound for triple bindings: 3 (for each of the two triple patterns). Upper bound for entity query bindings: 3 + 3.
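The entity query bound amounts to simple arithmetic, sketched below. The function name `entity_query_bound` and the `SOURCE_MAX` table are hypothetical; the maximal scores are the upper ends of the score ranges in the running example.

```python
def entity_query_bound(source_max_score, num_patterns):
    """Upper bound for a star-shaped entity query answered from one source.

    Every triple pattern binding comes from the same source, so each one
    scores at most source_max_score.
    """
    return num_patterns * source_max_score


# Maximal triple score per source (upper end of the score range).
SOURCE_MAX = {"Src. 2": 2, "Src. 3": 3}

# The entity query has two triple patterns: ex:song and foaf:name.
bounds = {src: entity_query_bound(m, 2) for src, m in SOURCE_MAX.items()}
print(bounds)  # {'Src. 2': 4, 'Src. 3': 6}
```

The bound for Src. 3 reproduces the 3 + 3 from the slide, and any entity binding from Src. 2 can score at most 4.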
    14. Improving the Threshold Estimation (3): Look-ahead Bounds
        Idea: provide a more accurate upper bound for the scores of the unseen bindings via the "next possible" score.
        Seen triples (TP1), sorted access for ex:beatles ex:album ?album: score 1: ex:beatles ex:album ex:sgt_pepper; score 1: ex:beatles ex:album ex:help. max_1 = 1, min_1 = 1.
        Seen triples (TP2), sorted access for ?album ex:song ?song: score 3: ex:help ex:song "Help!" (Src. 3). max_2 = 3, min_2 = 3.
        Look-ahead: the next source for TP2 is Src. 2 with score ∈ [1,2], so min_2 = 2 instead of 3.
        Threshold without look-ahead: max { 1 + 3 , 1 + 3 } = 4; with look-ahead, the min_2 term drops from 3 to 2.
        Output queue: score 4: ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
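One way to read the look-ahead bound is sketched below. This is an interpretation of the slide, not the authors' code: `look_ahead_bound` is a hypothetical helper that takes the last seen score and the best scores that could still arrive (for example, the maximal score of the next scheduled source).

```python
def look_ahead_bound(last_seen_score, next_possible_scores):
    """Upper bound for the unseen bindings of one join input.

    Without look-ahead the bound is the last seen score. With look-ahead it
    is the best score that can still arrive, if that is known to be lower.
    """
    if not next_possible_scores:
        return last_seen_score  # nothing known about what comes next
    return min(last_seen_score, max(next_possible_scores))


max_1, min_1 = 1, 1  # TP1: both seen triples have score 1
max_2 = 3            # TP2: best seen score
min_2 = look_ahead_bound(3, [2])  # next source, Src. 2, has maximal score 2

threshold = max(max_1 + min_2, max_2 + min_1)
print(min_2, threshold)  # 2 4
```

In this example the threshold stays at 4, because the other corner (max_2 + min_1 = 3 + 1) dominates; the tighter min_2 only pays off when the max_1 + min_2 corner is the larger one.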
    15. EVALUATION
    16. Evaluation – Setting
        We implemented three systems: push-based symmetric hash join operator [2,5]; standard top-k operator [6]; improved top-k operator.
        Query set: 20 queries (8 FedBench and 12 own queries) with varying result size (1 to ~10,000) and complexity (2 to 5 triple patterns).
        Data set: ~2,000,000 triples, distributed over ~700,000 sources.
        Parameters: k ∈ {1, 5, 10, 20} and score distributions ∈ {uniform, normal, exponential}.
    17. Evaluation – Results (1): Overall Results
        [Figure: overview of processing times for all queries (k = 1, d = n).]
        Top-k strategies lead to a runtime improvement of 35% on average (compared to standard Linked Data processing).
        Tighter bounding leads to further improvements of 12% on average (compared to standard top-k processing).
    18. Evaluation – Results (2): Effect of K and Score Distributions
    19. CONCLUSION
    20. Conclusion
        We showed that top-k processing techniques are applicable to the Linked Data setting.
        Top-k strategies lead to significant time savings for small values of k (in our experiments 35% on average).
        We showed that our improved top-k strategy leads to further runtime advantages (in our experiments 12% on average).
    21. QUESTIONS
    22. REFERENCES
    23. References
        [1] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries for on-demand queries over linked data. In World Wide Web, 2010.
        [2] G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC, 2010.
        [3] M. Wu, L. Berti-Equille, A. Marian, C. M. Procopiuc, and D. Srivastava. Processing top-k join queries. Proc. VLDB Endow., pages 860–870, 2010.
        [4] A. Harth, S. Kinsella, and S. Decker. Using naming authority to rank data and ontologies for web search. In ISWC, pages 277–292, 2009.
        [5] G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In ESWC, 2011.
        [6] K. Schnaitter and N. Polyzotis. Optimal algorithms for evaluating rank joins in database systems. ACM Trans. Database Syst., 35:6:1–6:47, 2010.
    24. BACKUP SLIDES
    25. Early Pruning of Partial Results
        Motivation: top-k join processing can be quite costly in terms of memory consumption.
        Idea: prune partial query results that cannot contribute to a final top-k result.
        Entity query: ?x ex:song ?y ; foaf:name ?z
        Currently known top-2 results (output queue): rank 6: ex:help foaf:name "Help!" . ex:help ex:song "Help!" . Rank 4: ex:sgt_pepper foaf:name "Sgt. Pepper" . ex:sgt_pepper ex:song "Lucy" .
        Currently known partial results: rank 1: ex:sgt_pepper ex:song "Getting Better".
        With an upper bound of 3 for triple bindings, the maximal score of this partial result is 3 + 1 = 4, which is ≤ the smallest top-2 score (4).
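The pruning test can be sketched as a single predicate. The name `can_prune` and its parameters are hypothetical; the rule follows the slide and its note: a partial result is dropped when its score plus the maximal possible score of the unevaluated query part is ≤ the currently smallest top-k score.

```python
def can_prune(partial_score, remaining_upper_bound, top_k_scores, k):
    """Return True if a partial result cannot reach the final top-k."""
    if len(top_k_scores) < k:
        return False  # fewer than k results known, so nothing can be pruned yet
    kth_best = sorted(top_k_scores, reverse=True)[k - 1]
    return partial_score + remaining_upper_bound <= kth_best


# Slide example: top-2 scores are 6 and 4, the partial result has score 1,
# and the unevaluated triple pattern can contribute at most 3.
print(can_prune(1, 3, [6, 4], k=2))  # True: 1 + 3 = 4 <= 4
print(can_prune(2, 3, [6, 4], k=2))  # False: 2 + 3 = 5 > 4
```

Pruning on ≤ rather than < follows the slide; it is safe when ties with the k-th result do not need to be reported.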
