Opportunistic Linked Data Querying through Approximate Membership Metadata

Between URI dereferencing and the SPARQL protocol lies a largely unexplored axis of possible interfaces to Linked Data, each with its own combination of trade-offs. One of these interfaces is Triple Pattern Fragments, which allows clients to execute SPARQL queries against low-cost servers, at the cost of higher bandwidth. Increasing a client’s efficiency means lowering the number of requests, which can be achieved, among other ways, through additional metadata in responses.
We noted that typical SPARQL query evaluations against Triple Pattern Fragments involve a significant portion of membership subqueries, which check the presence of a specific triple rather than a variable pattern. This paper studies the impact of providing approximate membership functions, i.e., Bloom filters and Golomb-coded sets, as extra metadata. In addition to reducing HTTP requests, such functions make it possible to reach full result recall earlier when temporarily allowing lower precision. Half of the tested queries from a WatDiv benchmark test set could be executed with up to a third fewer HTTP requests with only marginally higher server cost. Query times, however, did not improve, likely due to slower metadata generation and transfer. This indicates that approximate membership functions can partly improve the client-side query process with minimal impact on the server and its interface.
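
To make the notion of an approximate membership function concrete, the following is a minimal Bloom filter sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated in the talk: the class name `BloomFilter`, the SHA-256 based double hashing (the slides mention MurmurHash3 instead), and the `dbpedia:Artist_…` subject names are all hypothetical. The sizing uses the standard formulas for m bits and k hash functions given n items and a target false positive probability p.

```python
import hashlib
import math


class BloomFilter:
    """Bit-array Bloom filter sized for n items at false positive rate p."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        n, p = expected_items, false_positive_rate
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(1, math.ceil(-n * math.log(p) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / n * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Assumption: double hashing over one SHA-256 digest stands in for
        # the Murmur hash functions named in the slides.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hash_count)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly a member"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical fragment with 135 subject bindings and p = 1/1024.
subjects = [f"dbpedia:Artist_{i}" for i in range(135)]
amf = BloomFilter(expected_items=len(subjects), false_positive_rate=1 / 1024)
for subject in subjects:
    amf.add(subject)

# Count how many of 1000 non-members are wrongly reported as possible members.
non_members = [f"dbpedia:Other_{i}" for i in range(1000)]
false_positives = sum(1 for s in non_members if s in amf)
```

A client holding such a filter can answer "is this subject possibly in the fragment?" locally. A negative answer is always correct, so only positive answers may still need an exact membership request.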

Published in: Engineering

  1. 1. Opportunistic Linked Data Querying through Approximate Membership Metadata Miel Vander Sande
  2. 2. “Solve a query for a client,
 and it will be happy for a day.
 Teach a client to SPARQL,
 and it’ll query happily ever after.” - Confucius, 431 BC
  3. 3. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  4. 4. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  5. 5. Interaction between client & server.
 The hunt for trade-offs: What can we learn?
 [Figure: axis of interfaces offered by the server, from data dump to SPARQL endpoint. Data dump: low server cost, high client cost, high availability, high bandwidth, out-of-date data. SPARQL endpoint: high server cost, low client cost, low availability, low bandwidth, live data.]
  6. 6. Linked Data Fragments are
 a uniform view on Linked Data interfaces.
 [Figure: axis of interfaces offered by the server, from data dump to SPARQL endpoint]
 Every Linked Data interface
 offers specific fragments
 of a Linked Data set.
  7. 7. Each type of Linked Data Fragment
 is defined by three characteristics.
 data: What triples does it contain?
 metadata: What do we know about it?
 controls: How to access more data?
  8. 8. Each type of Linked Data Fragment
 is defined by three characteristics.
 Example: data dump
 data: all dataset triples
 metadata: number of triples, file size
 controls: (none)
  9. 9. Each type of Linked Data Fragment
 is defined by three characteristics.
 Example: SPARQL query result
 data: triples matching the query
 metadata: (none)
 controls: (none)
  10. 10. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  11. 11. You have to start somewhere:
 Triple Pattern Fragments.
 [Figure: axis of interfaces showing data dump, Linked Data documents, triple pattern fragments and SPARQL query results. Triple pattern fragments: low server cost, high availability, live data, high bandwidth.]
 Verborgh, R., Hartig, O.,…: Querying datasets on the Web with high availability. ISWC 2014
  12. 12. data (first 100) controls (other fragments) metadata (total count)
  13. 13. Triple pattern fragment servers
 enable clients to be intelligent.
 controls: The RDF representation explains:
 “you can query by triple pattern”.
 <http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
   hydra:template "http://fragments.dbpedia.org/2014/en{?subject,predicate,object}";
   hydra:mapping [ hydra:variable "subject"; hydra:property rdf:subject ],
                 [ hydra:variable "predicate"; hydra:property rdf:predicate ],
                 [ hydra:variable "object"; hydra:property rdf:object ]
 ].
  14. 14. Triple pattern fragment servers
 enable clients to be intelligent.
 metadata: The RDF representation explains:
 “this is the number of matches”.
 <#fragment> void:triples 8141.
  15. 15. How can intelligent clients
 solve SPARQL queries over fragments?
 Give them a SPARQL query.
 Give them a URL of any dataset fragment.
 They look inside the fragment
 to see how to access the dataset and use the metadata
 to decide how to plan the query.
  16. 16. The client splits the query
 into the available fragments.
 SELECT ?artist ?name WHERE {
   ?artist a dbpedia-owl:Artist;
           rdfs:label ?name;
           dbpedia-owl:birthPlace dbpedia:Padua.
   FILTER LANGMATCHES(LANG(?name), "EN")
 }
  17. 17. The client gets the fragments
 and inspects their metadata.
 ?artist a dbpedia-owl:Artist. (first 100 triples; total count 96,000)
 ?artist rdfs:label ?name. (first 100 triples; total count 12,000,000)
 ?artist dbont:birthPlace dbpedia:Padua. (first 100 triples; total count 135)
  18. 18. The metadata enables the client
 to choose the right starting point.
 ?artist a dbpedia-owl:Artist. (96,000)
 ?artist rdfs:label ?name. (12,000,000)
 ?artist dbont:birthPlace dbpedia:Padua. (135)
 dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
 dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.
 dbp:Alberto_Benettin a dbont:Artist.
 dbp:Alberto_Benettin rdfs:label ?name.
  19. 19. For some patterns, many requests are of type “is this triple in the dataset?”
 [Figure: fraction of membership queries (0% to 100%) for 20 WatDiv queries: L1 to L5, S1 to S7, F1 to F5, C1 to C3;
 linear (L), star (S), snowflake-shaped (F) and complex (C)]
  20. 20. Advancing in selector and/or metadata dimensions.
 [Figure: two axes. Selector: from simple questions to complex questions. Metadata: from no information for the client to extensive useful information for the client. Triple Pattern Fragments: low server cost, high availability, live data, high bandwidth.]
  21. 21. Advancing in selector and/or metadata dimensions.
 [Figure: selector and metadata axes with Triple Pattern Fragments and Substring search]
 J. Van Herwegen et al.: Substring Filtering for Low-Cost Linked Data Interfaces
 Last talk of this session!
  22. 22. Advancing in selector and/or metadata dimensions.
 [Figure: selector and metadata axes with Triple Pattern Fragments, Substring search and Approximate Membership
 Function (AMF)]
  23. 23. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  24. 24. Extend the TPF response with a compact representation of all possible mappings.
 metadata: Triple Pattern Fragments + Approximate Membership Function (AMF)
 Approximate set membership assessment with a predefined false positive probability.
 Bloom filter / Golomb-coded set
  25. 25. “Can we reduce the number of HTTP requests?” “Can we reduce the total execution time?” “What is the overhead on server CPU load?”
  26. 26. Bloom Filter and Golomb-coded set (GCS)
 [Figure: Bloom filter: members n0 (dbpedia:Alberto_Benettin), n1 (dbpedia:Alberto_Bigon), …, nx set bits in an initially empty bit array of size m through hashes k0, k1, …, kx. Golomb-coded set: members n0 and n1 set bits through a hash k; the resulting bit array is Golomb coded (geometric distribution).]
  27. 27. Server enables clients to avoid
 membership requests.
 metadata: “this BloomFilter with false positive probability X and hash function Y represents the presence of all bindings for ?s”.
 <#fragment> void:triples 96300. # existing count metadata
 _:membershipFunction a ms:BloomFilter; # AMF metadata
   ms:hashSize 524288;
   ms:hashFunction <MyMurmur1>, <MyMurmur2>;
   ms:memberCollection [ ms:sourceCollection <#fragment>; ms:projectedProperty rdf:subject ];
   ms:falsePositiveRate 0.05;
   ms:falseNegativeRate 0.0;
   ms:binaryRepresentation "QmF...ZTY"^^xsd:base64Binary.
  28. 28. Client filters non-members locally
 with one extra (cached) request
 GET ?artist dbont:birthPlace dbpedia:Padua. (135)
   dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua. …
 GET ?artist a dbont:Artist. (Approx. Membership Filter)
 GET dbpedia:Alberto_Benettin a dbont:Artist. 0
 GET dbpedia:Alberto_Bigon a dbont:Artist. 1
 GET dbpedia:Alberto_Da_Zara a dbont:Artist. 1
 GET dbpedia:Alberto_Gallo a dbont:Artist. 0
 GET dbpedia:Alberto_Bigon a dbont:Artist. 1
 GET …
  29. 29. We evaluated for request count, server cost and speedup in a Web setting.
 250 queries from 125 diverse WatDiv templates on Amazon EC2 machine
 WatDiv 100M triples dataset
 BloomFilter: MurmurHash3, GCS: FNV-1
 1 HTTP Cache with 1 Mbps
 p = 1/1024 (0.1%), 1/128 (1%), 1/64 (1.6%)
 Timeout: 3 min
  30. 30. We evaluated for request count, server cost and speedup in a Web setting.
 250 queries from 125 diverse WatDiv templates on Amazon EC2 machine
 vs. vanilla TPF server & client
 2 client algorithms:
 Original “greedy” algorithm
 Optimized join-tree algorithm*
 * Van Herwegen et al.: Query Execution Optimization for Clients of Triple Pattern Fragments. ESWC 2015
  31. 31. > 50% of the queries have fewer requests,
 < 20% have more requests.
 Percentage of queries (p = 1/1024):
 Greedy Bloom: 59% fewer requests, 6% more requests, 35% equal
 Greedy GCS: 62% fewer requests, 5% more requests, 33% equal
 Optimized Bloom: 49% fewer requests, 18% more requests, 33% equal
 Optimized GCS: 50% fewer requests, 17% more requests, 32% equal
  32. 32. Queries with relatively many HTTP requests (45,000+ / query) benefit greatly
 [Figure: difference in number of requests (0 to 16,000), fewer vs. more requests, for Greedy Bloom, Greedy GCS, Optimized Bloom and Optimized GCS; annotation: < 35]
  33. 33. No queries show a reduction in execution time; a third even show an increase.
 Percentage of queries (p = 1/1024):
 Greedy Bloom: 0% lower execution time, 16% higher, 84% equal
 Greedy GCS: 0% lower execution time, 31% higher, 69% equal
 Optimized Bloom: 0% lower execution time, 33% higher, 67% equal
 Optimized GCS: 0% lower execution time, 38% higher, 62% equal
  34. 34. Server remains low-cost, as impact is
 very acceptable (< 6%).
 [Figure: server CPU load (%), axis 0 to 30, for Original, Bloom (1/1024), Bloom (1/128), Bloom (1/64), GCS (1/1024), GCS (1/128) and GCS (1/64); bar labels: 11.1, 10.8, 10.2, 14.9, 11.2, 10.8, 9.2]
  35. 35. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  36. 36. During execution, a result candidate that passes the membership filter is already correct with probability 1 - p. Can we be opportunistic here, and temporarily allow imprecise results?
  37. 37. “Can we reduce the time to 100% recall?”
 [Figure: two timelines, “only allow certain results” and “temporarily allow uncertain results”, each marked with start execution (0% recall), 1st result computed, n < r results computed and r results computed; the opportunistic timeline additionally passes through r + f results computed (100% recall) before reaching 100% precision.]
 Fig. 2. This SPARQL query execution timeline compares regular and opportunistic query execution, assuming r total query results and f false positives. Note how both approaches achieve 100% recall and precision at a shared point in the end, but there exists a period during which only opportunistic execution reaches 100% recall (shaded).
 … need to be discarded. The user thus sees the photos faster than if they had only been retrieved after full precision was achieved. This example …
  38. 38. Temporarily allowing <100% precision
 can reduce the time to 100% recall by 1/3.
 [Figure: execution time (s), axis 0 to 140, until 100% recall and until 100% precision, for Greedy + Bloom (p = 1/1024)]
 Number of revoked results was 0 or 1.
  39. 39. Linked Data Fragments: a uniform view
 on publishing Linked Data
 Exploring the axis: selector and metadata
 Approximate Membership Metadata
 Querying through Approximate Membership Metadata
 Opportunistic Querying
  40. 40. Approximate Membership Metadata
 is a nuanced debate
 For some query types, bandwidth decreases considerably for TPF query execution.
 For larger fragments, real-time computation hurts execution time.
 We expect gains with
 pre-caching and out-of-band delivery.
 Opportunistic querying is a promising direction for further exploration.
  41. 41. No one size fits all, explore the axis.
 Find metrics that fit your use case.
 Client & Server load
 Request & Response size
 Protocol (HTTP) impact
 …
 [Figure labels: TRIPLE PATTERN fragments, data, APPR. MEM. FILT.]
 Try your own trade-off server at our demo (and get a nice cup of coffee).
 Start serving Linked Data like a barista
  42. 42. Opportunistic Linked Data Querying through Approximate Membership Metadata Miel Vander Sande
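
The opportunistic execution discussed in slides 36 to 38 can be sketched as a two-phase loop: candidates that pass the approximate membership filter are emitted right away as tentative results, and exact membership requests later confirm or revoke them. The Python sketch below is only an illustration under assumptions. The function `opportunistic_join`, the event labels, and the stand-in callables are hypothetical, and `dbpedia:Alberto_Gallo` is deliberately planted as a false positive to show a revocation.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def opportunistic_join(
    candidates: Iterable[str],
    might_contain: Callable[[str], bool],
    confirm: Callable[[str], bool],
) -> Iterator[Tuple[str, str]]:
    """Yield (event, candidate) pairs for an opportunistic membership join."""
    pending: List[str] = []
    # Phase 1: filter locally with the AMF; no HTTP request per candidate.
    for candidate in candidates:
        if might_contain(candidate):
            pending.append(candidate)
            yield ("tentative", candidate)  # full recall, precision below 100%
    # Phase 2: exact membership requests restore 100% precision.
    for candidate in pending:
        yield ("confirmed" if confirm(candidate) else "revoked", candidate)


# Subjects bound by "?artist dbont:birthPlace dbpedia:Padua." (from the slides).
candidates = [
    "dbpedia:Alberto_Benettin",
    "dbpedia:Alberto_Bigon",
    "dbpedia:Alberto_Da_Zara",
    "dbpedia:Alberto_Gallo",
]
# Assumed ground truth for "?artist a dbont:Artist."
artists = {"dbpedia:Alberto_Bigon", "dbpedia:Alberto_Da_Zara"}
# Stand-in for a Bloom filter: all true members plus one planted false positive.
amf_positive = artists | {"dbpedia:Alberto_Gallo"}

requested: List[str] = []  # records which exact membership requests were made


def exact_membership_request(subject: str) -> bool:
    requested.append(subject)
    return subject in artists


events = list(
    opportunistic_join(
        candidates,
        might_contain=lambda s: s in amf_positive,
        confirm=exact_membership_request,
    )
)
```

In this sketch the client issues three exact membership requests instead of four, and all true results are already visible after the first phase, which is the period in which only opportunistic execution has reached 100% recall.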
