Linked Data Query Processing Strategies
Linked Data Query Processing Strategies: Presentation Transcript

  • Linked Data Query Processing Strategies
      • Günter Ladwig, Thanh Tran
      • International Semantic Web Conference 2010, Shanghai
      • Institute of Applied Informatics and Formal Description Methods (AIFB)
      • KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association, www.kit.edu
  • Contents
      • Introduction
      • Challenges
      • Contributions
      • Linked Data Query Processing Strategies
      • Stream-based Query Processing
      • Corrective Source Ranking
      • Evaluation
      • Conclusion
      • November 11th, 2010, ISWC 2010, Shanghai, China; Institute of Applied Informatics and Formal Description Methods (AIFB)
  • What is Linked Data?
      • Linked Data Principles
          • Use URIs to identify things
          • Use HTTP URIs that allow dereferencing
          • Dereferencing a URI provides information about the thing in a standard format (RDF)
          • Include links to other, related URIs
      • Linked Data Query Processing
          • Evaluate queries directly over Linked Data
          • Dereference Linked Data URIs during query processing
  • Challenges
      • Volume of the source collection
          • Each URI is a potential data source
      • Dynamics of the source collection
          • Sources may change rapidly over time
          • Sources might only be discovered at run-time
      • Heterogeneity of sources, source descriptions and access methods
          • Sources vary in size
          • Descriptions of sources vary in completeness
          • Access methods: URI lookup, SPARQL endpoints, local cache, ...
  • Contributions
      • Discussion of Linked Data Query Processing strategies
          • Mixed strategy, combining local indexes and run-time discovery
      • Stream-based Query Processing
          • Data can arrive at any time and in any order
          • Suited to deal with network latency
      • Corrective Source Ranking
          • Deals with different types of source descriptions
          • Ranking is refined at run-time
  • LINKED DATA QUERY PROCESSING STRATEGIES
  • Top-down Query Evaluation
      • Example query:
        SELECT ?paper ?author WHERE {
          ?paper swrc:author ?author .
          ?paper swc:isPartOf ?proc .
          ?proc swc:relatedToEvent <http://sw.org/eswc/2010> .
        }
      • Diagram labels: probe local source index, select and rank sources, retrieve sources, join data
      • Example index entry (Source URI, Score): http://sw.org/person/AB, 0.87
      • Local index, assumed to be complete
      • Selection and ranking of sources
      • No run-time discovery
      • Fast, only relevant sources are retrieved
      • Not up-to-date, index size may become very large
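The top-down steps (probe the local index, then select and rank sources) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the index layout (predicate mapped to per-source scores), the summing of scores, and every URI and score except `http://sw.org/person/AB` with 0.87 are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical local source index: predicate -> {source URI: score}.
# Only http://sw.org/person/AB with score 0.87 appears on the slide;
# the other entries are made up for illustration.
SOURCE_INDEX = {
    "swrc:author": {"http://sw.org/person/AB": 0.87, "http://sw.org/paper/1": 0.40},
    "swc:isPartOf": {"http://sw.org/paper/1": 0.55},
    "swc:relatedToEvent": {"http://sw.org/proc/eswc/2010": 0.90},
}


def select_sources(query_predicates, index):
    """Probe the index for each query predicate and rank sources by summed score."""
    scores = defaultdict(float)
    for predicate in query_predicates:
        for source, score in index.get(predicate, {}).items():
            scores[source] += score
    # Highest score first; ties are broken by URI so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = select_sources(
    ["swrc:author", "swc:isPartOf", "swc:relatedToEvent"], SOURCE_INDEX
)
for source, score in ranked:
    print(f"{score:.2f}  {source}")
```

Only sources returned by `select_sources` would be retrieved, which is why this strategy is fast but cannot see anything missing from the index.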
  • Bottom-up Query Evaluation
      • Example query:
        SELECT ?paper ?author WHERE {
          ?paper swrc:author ?author .
          ?paper swc:isPartOf ?proc .
          ?proc swc:relatedToEvent <http://semweb.org/eswc/2010> .
        }
      • Retrieve source: <http://sw.org/proc/eswc/2010> swc:relatedToEvent <http://sw.org/eswc/2010> . ...
      • Discover new sources: swc:paper1 swc:isPartOf <http://sw.org/proc/eswc/2010> . ...
      • Sources are discovered at run-time through links
      • Answers can be incomplete as links might not be discoverable
      • Slower, as unnecessary sources are retrieved
      • Always up-to-date
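Run-time discovery through links can be sketched as a breadth-first traversal. The sketch below simulates dereferencing with an in-memory dictionary instead of HTTP; the `WEB` contents loosely follow the triples on the slide, and the URIs `http://sw.org/paper1` and `http://sw.org/person/AB` as well as the `discover` function are assumptions for illustration only.

```python
from collections import deque

# Simulated Web: URI -> triples returned when that URI is dereferenced.
# Loosely based on the slide; the paper and person URIs are hypothetical.
WEB = {
    "http://sw.org/eswc/2010": [
        ("http://sw.org/proc/eswc/2010", "swc:relatedToEvent", "http://sw.org/eswc/2010"),
    ],
    "http://sw.org/proc/eswc/2010": [
        ("http://sw.org/paper1", "swc:isPartOf", "http://sw.org/proc/eswc/2010"),
    ],
    "http://sw.org/paper1": [
        ("http://sw.org/paper1", "swrc:author", "http://sw.org/person/AB"),
    ],
}


def discover(seed_uris, web, query_predicates):
    """Dereference URIs breadth-first and follow links found in matching triples."""
    queue = deque(seed_uris)
    seen = set(seed_uris)
    triples = []
    while queue:
        uri = queue.popleft()
        for s, p, o in web.get(uri, []):  # an unknown URI simply yields no data
            if p not in query_predicates:
                continue
            triples.append((s, p, o))
            for linked in (s, o):
                if linked.startswith("http://") and linked not in seen:
                    seen.add(linked)
                    queue.append(linked)
    return seen, triples


sources, data = discover(
    ["http://sw.org/eswc/2010"],
    WEB,
    {"swrc:author", "swc:isPartOf", "swc:relatedToEvent"},
)
print(len(sources), "sources visited,", len(data), "triples collected")
```

The traversal starts from the URI mentioned in the query, so any source that is not reachable through links stays invisible, which matches the incompleteness caveat on the slide.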
  • Mixed Strategy
      • Combination of top-down and bottom-up strategies
          • Partial local index of sources, not assumed to be complete
          • New sources are discovered at run-time
          • Addresses volume and dynamics of Linked Data
      • Corrective Source Ranking
          • Deals with heterogeneous source descriptions
      • Stream-based Query Processing
          • Deals with the unpredictable nature of Linked Data access
  • STREAM-BASED QUERY PROCESSING
  • Stream-based Query Processing
      • Network latency
          • Do not block!
          • Evaluation driven by incoming data
      • Compile-time
          • Construct query plan
          • Probe local index for sources
      • Run-time
          • Rank sources
          • Retrieve sources
          • Push data into query plan
          • Discover new sources
      • Figure: query plan with joins over worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n) producing results; samples passed to the source ranker; ranked sources (Source 1, score: 1.0; Source 2, score: 0.7; ...); source retrievers 1 and 2; local source index
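The "do not block" idea can be sketched with concurrent retrievers whose results are pushed onward as soon as each one finishes. This is only an illustration under stated assumptions: the source names, latencies and triples in `SOURCES` are invented, `time.sleep` stands in for network latency, and a thread pool is just one possible way to realise the source retrievers on the slide.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sources: name -> (simulated latency in seconds, triples).
SOURCES = {
    "source1": (0.02, ["t1", "t2"]),
    "source2": (0.01, ["t3"]),
}


def retrieve(name):
    """Stand-in for dereferencing a source; sleeps to simulate network latency."""
    latency, triples = SOURCES[name]
    time.sleep(latency)
    return name, triples


def run(ranked_sources, push, workers=2):
    """Retrieve sources concurrently and push data as soon as any source arrives."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(retrieve, name) for name in ranked_sources]
        for future in as_completed(futures):  # completion order, not rank order
            name, triples = future.result()
            for triple in triples:
                push(name, triple)


arrived = []
run(["source1", "source2"], lambda name, triple: arrived.append((name, triple)))
print(arrived)
```

Because data is handed to `push` in completion order, a slow source does not hold back tuples from a faster one.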
  • Push-based Symmetric Hash Join
      • Operation
          • Maintains a hash table for each input
          • Arriving tuples are inserted into one hash table and then the other is probed for join combinations
      • Push-based
          • Tuples are pushed into operators from the leaves to the root of the query plan
          • Execution driven by incoming tuples instead of results
          • Results reported as soon as input tuples arrive
          • Tuples can arrive on all inputs in any order
      • Example (pushed on left: t7 with key b)
          • Left input hash table: a: t1, t3; b: t2, t7
          • Right input hash table: b: t4, t5; c: t6
          • Insert t7 into the left table, probe the right table, push output (t7, t4) and (t7, t5)
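A push-based symmetric hash join can be sketched in a few lines. The class and method names below are illustrative rather than taken from the authors' system, and the arrival order of t1 to t6 is an assumption, since the slide only shows the state of the two hash tables when t7 is pushed.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Join operator that keeps one hash table per input and never blocks."""

    def __init__(self, emit):
        self.tables = (defaultdict(list), defaultdict(list))  # (left, right)
        self.emit = emit  # callback that receives each joined pair

    def push(self, side, key, tup):
        """Insert tup into its own table, then probe the opposite table.

        side is 0 for the left input and 1 for the right input.
        """
        self.tables[side][key].append(tup)
        for other in self.tables[1 - side].get(key, []):
            # Always emit pairs as (left tuple, right tuple).
            self.emit((tup, other) if side == 0 else (other, tup))


results = []
join = SymmetricHashJoin(results.append)

# Assumed arrival order leading to the hash table state shown on the slide.
for side, key, tup in [
    (0, "a", "t1"), (0, "b", "t2"), (0, "a", "t3"),
    (1, "b", "t4"), (1, "b", "t5"), (1, "c", "t6"),
]:
    join.push(side, key, tup)

already_emitted = len(results)
join.push(0, "b", "t7")  # the step illustrated on the slide
print(results[already_emitted:])
```

Each arriving tuple does constant work per match and produces output immediately, so results are reported regardless of which input delivers data first.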
  • CORRECTIVE SOURCE RANKING
  • Corrective Source Ranking
      • Prefer more relevant sources
      • Relevancy of a source is based on
          • Current query
          • Any available intermediate results
          • Overall optimization goal
      • Define a set of source features and derive concrete source metrics
      • Not all metrics are available for all sources (heterogeneity)
      • Refine previously computed metrics using newly discovered information
  • Source Features and Metrics
      • A source is more relevant if it contains data that contributes to answers of the query
          • Triple Pattern Cardinality
          • Join Pattern Cardinality
      • Cardinalities stored in local index
      • Some patterns have high cardinality for all or many sources (e.g. )
      • These patterns do not discriminate sources
  • Source Features and Metrics
      • Adopt TF-IDF concept to obtain weights for triple patterns
          • Importance positively correlates with how often bindings to a pattern occur in a source (i.e. cardinality)
          • Importance negatively correlates with how often its bindings occur in all sources of the source collection S
      • Triple Frequency – Inverse Source Frequency (TF-ISF)
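The transcript does not preserve the TF-ISF formula itself, so the sketch below assumes the textbook TF-IDF form: the cardinality of a pattern in a source, multiplied by the logarithm of the number of sources divided by the number of sources containing the pattern. The function name, the `CARDS` data and the two patterns are illustrative assumptions, not values from the slides.

```python
import math


def tf_isf(cardinalities, pattern, source):
    """TF-IDF-style weight of a triple pattern for one source.

    cardinalities maps each source to {pattern: cardinality}.
    Assumed form: tf * log(|S| / number of sources containing the pattern).
    """
    n_sources = len(cardinalities)
    containing = sum(1 for cards in cardinalities.values() if cards.get(pattern, 0) > 0)
    if containing == 0:
        return 0.0
    tf = cardinalities[source].get(pattern, 0)
    return tf * math.log(n_sources / containing)


# Hypothetical cardinalities: one pattern occurs everywhere, one in a single source.
CARDS = {
    "s1": {"?x rdf:type ?y": 100, "?x swrc:author ?y": 10},
    "s2": {"?x rdf:type ?y": 80},
    "s3": {"?x rdf:type ?y": 120},
}

common = tf_isf(CARDS, "?x rdf:type ?y", "s1")
rare = tf_isf(CARDS, "?x swrc:author ?y", "s1")
print(common, rare)
```

Under this assumed form, a pattern that occurs in every source gets weight zero however large its cardinality, which is the non-discriminative case described on the preceding slide.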
  • Source Features and Metrics - Links
      • Source linked from many other sources is more relevant
      • Relevance is higher when these links match query predicates
      • Links are only discovered at run-time
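A weighted in-link count is one simple way to express this metric. In the sketch below the weights `match_weight` and `other_weight`, the function name and the sample links are all assumptions chosen for illustration; the slides do not state how links are weighted.

```python
def link_score(links, source, query_predicates, match_weight=2.0, other_weight=1.0):
    """Score a source by its incoming links, favouring links that match the query.

    links is an iterable of (from_source, predicate, to_source) tuples
    discovered so far; the two weights are hypothetical parameters.
    """
    score = 0.0
    for origin, predicate, target in links:
        if target != source or origin == source:
            continue  # ignore links to other sources and self-links
        score += match_weight if predicate in query_predicates else other_weight
    return score


# Hypothetical links discovered at run-time.
LINKS = [
    ("s1", "swc:isPartOf", "s3"),
    ("s2", "rdfs:seeAlso", "s3"),
    ("s1", "rdfs:seeAlso", "s2"),
]

score_s3 = link_score(LINKS, "s3", {"swc:isPartOf"})
score_s2 = link_score(LINKS, "s2", {"swc:isPartOf"})
print(score_s3, score_s2)
```

Since links only become known while sources are being retrieved, a score like this has to be recomputed as the list of discovered links grows.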
  • Metric Correction and Refinement
      • During query processing new information becomes available: intermediate join results, links
      • Refine and correct previously computed metrics
      • Important in the case of non-discriminative patterns
      • Instantiate triple pattern of a join with samples of intermediate results to obtain better join size estimates
      • Example: take intermediate results in the SHJ operator and perform triple pattern cardinality lookups
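Sampling-based join size estimation can be sketched as follows. The scaling rule (matches in the sample multiplied by the ratio of all bindings to sampled bindings) is an assumption, as are the function name, the `lookup` callback that stands in for an index cardinality lookup, and the example bindings.

```python
import random


def estimate_join_size(bindings, lookup, sample_size, rng):
    """Estimate how many join results a source would contribute.

    bindings: intermediate results taken from a join operator's hash table.
    lookup:   returns the indexed cardinality of the triple pattern once it is
              instantiated with one binding (hypothetical index interface).
    """
    if not bindings:
        return 0.0
    sample = rng.sample(bindings, min(sample_size, len(bindings)))
    matched = sum(lookup(value) for value in sample)
    # Scale the sampled count up to the full set of bindings.
    return matched * len(bindings) / len(sample)


# Hypothetical cardinalities of knows(<binding>, ?y) in one source.
CARDINALITY = {"ex:a": 2, "ex:b": 0, "ex:c": 1, "ex:d": 1}
BINDINGS = list(CARDINALITY)

exact = estimate_join_size(BINDINGS, CARDINALITY.get, sample_size=4, rng=random.Random(0))
rough = estimate_join_size(BINDINGS, CARDINALITY.get, sample_size=2, rng=random.Random(0))
print(exact, rough)
```

With the sample covering all bindings the estimate equals the true join size; smaller samples trade accuracy for fewer index lookups.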
  • Ranking at Run-time
      • Optimization goal: early result reporting
      • Indexed sources: triple and join pattern cardinality, TF-ISF, weighted links, sampled join size estimates
      • Discovered sources: weighted links
      • Ranking has to be refined at run-time
      • Parameters influencing behavior and cost of the ranking process
          • Invalid Score Threshold: ranking is performed when the number of sources with invalid scores passes a threshold
          • Sample Size: larger samples for join size estimation give better estimates but are also more costly
          • Resampling Threshold: cache join size estimates and perform sampling only when the hash table of the join operator grows past a given threshold
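The two thresholds can be read as simple triggers. The sketch below is one possible interpretation, with assumptions stated plainly: "passes a threshold" is taken to mean strictly greater than, hash table growth is measured since the last sampling, and the class and method names are invented for illustration.

```python
class RankingPolicy:
    """Decides when to re-rank sources and when to resample join sizes."""

    def __init__(self, invalid_score_threshold, resampling_threshold):
        self.invalid_score_threshold = invalid_score_threshold
        self.resampling_threshold = resampling_threshold
        self.invalid = set()          # sources whose scores are out of date
        self.size_at_last_sample = 0  # join hash table size when last sampled

    def invalidate(self, source):
        """Mark a source's score as invalid, e.g. after new links were found."""
        self.invalid.add(source)

    def should_rerank(self):
        return len(self.invalid) > self.invalid_score_threshold

    def mark_reranked(self):
        self.invalid.clear()

    def should_resample(self, hash_table_size):
        """Resample only if the table grew past the threshold since last time."""
        if hash_table_size - self.size_at_last_sample > self.resampling_threshold:
            self.size_at_last_sample = hash_table_size
            return True
        return False


policy = RankingPolicy(invalid_score_threshold=2, resampling_threshold=100)
policy.invalidate("s1")
policy.invalidate("s2")
rerank_after_two = policy.should_rerank()
policy.invalidate("s3")
rerank_after_three = policy.should_rerank()
resample_decisions = [policy.should_resample(size) for size in (50, 150, 200, 260)]
print(rerank_after_two, rerank_after_three, resample_decisions)
```

Raising either threshold lowers ranking overhead at the price of working with staler scores and estimates.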
  • EVALUATION
  • Evaluation
      • Systems: top-down (TD), bottom-up (BU), mixed (MI)
      • 8 queries over various datasets (DBpedia, Geonames, NYT, Freebase, ...)
      • To make the approaches comparable, sources were restricted to those discoverable by the BU approach
          • ~6200 sources, containing ~500k triples
      • Sources hosted on local proxy server with artificial delay of 2 seconds
      • 25% of sources were randomly chosen to construct index for MI
  • Results
      • Overall early result reporting
          • 25% results: MI 8.7s, BU 15.1s
          • 50% results: MI 12.8s, BU 22.0s
          • Improvement of ~42%
      • Detailed results for two queries:

                                     Query 1                       Query 6
                                BU        MI        TD        BU        MI        TD
        25% Results        24810.5   10300.0   11038.0    8222.5    4743.5    5545.0
        50% Results        43464.5   40782.0   15787.0   10961.5    7650.5    5634.0
        Total              84066.5   86895.5   44323.5   24086.0   20711.0   16469.0
        Src. Selection         0.0     853.0    1444.5       0.0    1331.0    1863.5
        Ranking               25.5    2404.0     411.5      23.5     292.5     335.0
        #Sources               622       612       154       236        92        49
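The reported improvement of roughly 42% is consistent with the relative time savings of MI over BU at the two checkpoints. The short check below uses only the four timings from the slide; averaging the two savings is an assumption about how the figure was obtained.

```python
def improvement(baseline, improved):
    """Relative time saved by the improved system compared with the baseline."""
    return (baseline - improved) / baseline


at_25 = improvement(baseline=15.1, improved=8.7)   # BU vs MI at 25% of results
at_50 = improvement(baseline=22.0, improved=12.8)  # BU vs MI at 50% of results
average = (at_25 + at_50) / 2

print(f"25%: {at_25:.1%}  50%: {at_50:.1%}  average: {average:.1%}")
```

Both checkpoints land within about half a percentage point of 42%, so the summary figure holds whichever of the two is taken.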
  • Result Arrival Times
  • Ranking Heuristics
  • Conclusion
      • Mixed strategy for Linked Data Query Processing
          • Partial knowledge available beforehand, incorporated with source discovery at run-time
      • Corrective Source Ranking
          • Metrics for source relevancy
          • Refinement of ranking at run-time
      • Stream-based Query Processing
      • Early results reported on average 42% faster
      • Future work
          • Adapt query plan to changing properties of incoming data
          • Query local and remote data