TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

Given the heterogeneity of the data one can find on the Linked Data cloud, being able to trace back the provenance of query results is rapidly becoming a must-have feature of RDF systems. While provenance models have been extensively discussed in recent years, little attention has been given to the efficient implementation of provenance-enabled queries inside data stores. This paper introduces TripleProv: a new system extending a native RDF store to efficiently handle such queries. TripleProv implements two different storage models to physically co-locate lineage and instance data, and for each of them implements algorithms for tracing provenance at two granularity levels. In the following, we present the overall architecture of our system, its different lineage storage models, and the various query execution strategies we have implemented to efficiently answer provenance-enabled queries. In addition, we present the results of a comprehensive empirical evaluation of our system over two different datasets and workloads.

Transcript

  • 1. 23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2). (1) eXascale Infolab, University of Fribourg, Switzerland; (2) Web & Media Group, VU University Amsterdam, Netherlands
  • 2. Outline ➢ Motivation ➢ Provenance Polynomials ➢ System ➢ Results
  • 3. Data Provenance “Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.” How a query answer was derived: what data was combined to produce the result.
  • 4. Data Integration ➢ Integrated and summarized data ➢ Trust, transparency, and cost ➢ Capability to pinpoint the exact source from which the result was selected ➢ Capability to trace back the complete list of sources and how they were combined to deliver a result
  • 5. Querying Distributed Data Sources How exactly was the answer derived?
  • 6. Application: Post-query Calculations ➢ Scores or probabilities for query result ➢ Result ranking ➢ Compute trust ➢ Information quality based on used sources
  • 7. Application: Query Execution ➢ Modify query strategies on the fly ➢ Restrict results to certain subset of sources ➢ Restrict results w.r.t. queries over provenance ➢ Access control, only certain sources will appear ➢ Detect if result would be valid when removing certain source
  • 8. Provenance Polynomials ➢ Ability to characterize ways each source contributed ➢ Pinpoint the exact source to each result ➢ Trace back the list of sources the way they were combined to deliver a result
  • 9. Graph-based Query: select ?lat ?long ?g1 ?g2 ?g3 ?g4 where { graph ?g1 {?a [] "Eiffel Tower" . } graph ?g2 {?a inCountry FR . } graph ?g3 {?a lat ?lat . } graph ?g4 {?a long ?long . } } Result rows (?lat ?long ?g1 ?g2 ?g3 ?g4): lat long l1 l2 l4 l4, lat long l1 l2 l4 l5, lat long l1 l2 l5 l4, lat long l1 l2 l5 l5, lat long l1 l3 l4 l4, lat long l1 l3 l4 l5, lat long l1 l3 l5 l4, lat long l1 l3 l5 l5, lat long l2 l2 l4 l4, lat long l2 l2 l4 l5, lat long l2 l2 l5 l4, lat long l2 l2 l5 l5, lat long l2 l3 l4 l4, lat long l2 l3 l4 l5, lat long l2 l3 l5 l4, lat long l2 l3 l5 l5, lat long l3 l2 l4 l4, lat long l3 l2 l4 l5, lat long l3 l2 l5 l4, lat long l3 l2 l5 l5, lat long l3 l3 l4 l4, lat long l3 l3 l4 l5, lat long l3 l3 l5 l4, lat long l3 l3 l5 l5
  • 10. TripleProv Results: result: lat, long; provenance polynomial: (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
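
The contrast between the row-per-combination output of the graph-based query and the single polynomial can be sketched in a few lines of Python. This is only an illustration, not TripleProv code: the list-of-lists encoding and the helper names `expand` and `render` are assumptions made for this example. Each inner list stands for one ⊕-group and the outer list for the ⊗-join of those groups.

```python
from itertools import product

# Hypothetical encoding: each inner list is one ⊕-group (alternative sources
# for one triple pattern); the outer list is the ⊗-join of those groups.
polynomial = [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]]


def expand(poly):
    """Enumerate every source combination that the polynomial summarizes."""
    return list(product(*poly))


def render(poly):
    """Print the polynomial in the notation used on the slides."""
    return " ⊗ ".join("(" + " ⊕ ".join(group) + ")" for group in poly)


rows = expand(polynomial)
print(render(polynomial))
print(len(rows))  # 3 * 2 * 2 * 2 combinations
```

A polynomial with group sizes 3, 2, 2, 2 stands for 24 source combinations, the same number of rows the graph-based query on slide 9 returns for a single (lat, long) result.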
  • 11. Polynomial Operators ➢ Union (⊕) ○ a constraint or projection satisfied by multiple sources: l1 ⊕ l2 ⊕ l3 ○ multiple entities satisfy a set of constraints or projections ➢ Join (⊗) ○ sources joined to handle a constraint or a projection ○ OS and OO joins between a few sets of constraints: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
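
One way to picture the two operators is as a small expression tree. The sketch below is a hypothetical model written for this transcript, not the data structure used inside TripleProv: the class names `Source`, `Union`, and `Join` and the helper `lineage` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A single lineage identifier such as l1."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Union:
    """⊕: the same constraint or projection is satisfied by several sources."""
    terms: tuple

    def __str__(self) -> str:
        return "(" + " ⊕ ".join(str(t) for t in self.terms) + ")"


@dataclass(frozen=True)
class Join:
    """⊗: sources are combined to satisfy a constraint or projection."""
    factors: tuple

    def __str__(self) -> str:
        return " ⊗ ".join(str(f) for f in self.factors)


def lineage(expr) -> set:
    """Collect every source identifier mentioned anywhere in the expression."""
    if isinstance(expr, Source):
        return {expr.name}
    children = expr.terms if isinstance(expr, Union) else expr.factors
    return set().union(*(lineage(child) for child in children))


l1, l2, l3, l4 = (Source(f"l{i}") for i in range(1, 5))
poly = Join((Union((l1, l2)), Union((l3, l4))))
print(poly)
print(sorted(lineage(poly)))
```

Keeping the tree shape, rather than only the flat set returned by `lineage`, is what preserves how the sources were combined.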
  • 12. Example Polynomial select ?lat ?long where { ?a [] "Eiffel Tower" . ?a inCountry FR . ?a lat ?lat . ?a long ?long . } (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
  • 13. Example Polynomial select ?l ?long ?lat where { ?p name "Krebs, Emil" . ?p deathPlace ?l . ?c [] ?l . ?c featureClass P . ?c inCountry DE . ?c long ?long . ?c lat ?lat . } [(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5)] ⊗ [(l6 ⊕ l7) ⊗ (l8) ⊗ (l9 ⊕ l10) ⊗ (l11 ⊕ l12) ⊗ (l13)]
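
The applications listed earlier (checking whether a result survives the removal of a source, computing trust) can be read as evaluating such a polynomial under different interpretations of ⊕ and ⊗. The sketch below makes several assumptions that are not taken from the slides: ⊗ is treated as associative so the bracketed polynomial can be flattened into one list of ⊕-groups, validity is computed with ⊕ as "any" and ⊗ as "all", trust is computed with ⊕ as max and ⊗ as min, and the trust values themselves are invented for illustration.

```python
# Flattened form of the polynomial from the query above (assumes ⊗ is associative).
polynomial = [
    ["l1", "l2", "l3"], ["l4", "l5"],              # first bracket
    ["l6", "l7"], ["l8"], ["l9", "l10"],           # second bracket
    ["l11", "l12"], ["l13"],
]


def still_valid(poly, removed):
    """Boolean reading: every ⊗-factor needs at least one remaining source."""
    return all(any(src not in removed for src in group) for group in poly)


def trust_score(poly, trust):
    """Assumed trust reading: best source per ⊕-group, weakest group overall."""
    return min(max(trust[src] for src in group) for group in poly)


# Hypothetical trust values: everything at 0.9 except l8.
trust = {f"l{i}": 0.9 for i in range(1, 14)}
trust["l8"] = 0.5

print(still_valid(polynomial, removed={"l1", "l2"}))  # l3 still covers the group
print(still_valid(polynomial, removed={"l8"}))        # l8 has no alternative
print(trust_score(polynomial, trust))
```

Under this reading, sources that stand alone in a ⊗-factor, such as l8 and l13, are the ones whose removal invalidates the result.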
  • 14. Granularity Levels ➢ source-level: the sources of the triples ➢ triple-level: all pieces of data used to answer the query (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
  • 15. System Architecture
  • 16. Native Data Model ➢ Semantically co-located data ➢ Template based molecules
  • 17. Various Physical Storage Models Differences: ➢ ease of implementation ➢ memory consumption ➢ query execution ➢ interference with the original concept of molecule 1) SPOL 2) LSPO 3) SLPO 4) SPLO
  • 18. Annotated Triples ➢ Annotated provenance ➢ Quadruples ➢ Easy to implement ➢ Source data repeated for each triple
  • 19. Co-located Elements ➢ Data grouped by source ➢ Physically co-located ➢ Avoids duplication of the same source inside a molecule ➢ Data about a given subject co-located in one molecule ➢ More difficult to implement
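
The trade-off between the two storage models can be sketched with plain Python containers. This is a toy model under stated assumptions, not the physical layout of the system: the sample triples, the literal values, and the function name `co_locate` are invented for illustration. The annotated model is shown as a list of quadruples, and the co-located model as a per-subject molecule whose triples are grouped by source.

```python
from collections import defaultdict

# Annotated model: one quadruple (subject, predicate, object, source) per triple.
# The data below is made up for illustration.
quads = [
    ("EiffelTower", "inCountry", "FR", "l1"),
    ("EiffelTower", "lat", "48.858", "l1"),
    ("EiffelTower", "long", "2.294", "l1"),
    ("EiffelTower", "inCountry", "FR", "l2"),
]


def co_locate(quads):
    """Group triples by subject, then by source, so each source id is stored once."""
    molecules = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj, source in quads:
        molecules[subject][source].append((predicate, obj))
    return molecules


molecules = co_locate(quads)

# Number of source identifiers each layout has to store.
annotated_source_ids = len(quads)
co_located_source_ids = sum(len(by_source) for by_source in molecules.values())
print(annotated_source_ids, co_located_source_ids)

# Filtering by source is a direct lookup in the co-located layout.
print(molecules["EiffelTower"]["l1"])
```

In this toy data the annotated layout repeats l1 three times while the co-located layout stores it once, at the price of a nested structure that is harder to build and maintain.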
  • 20. Experiments How expensive is it to trace provenance? What is the overhead on query execution time?
  • 21. Datasets ➢ Two collections of RDF data gathered from the Web ○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud ○ Web Data Commons (WDC): RDFa and Microdata extracted from the Common Crawl ➢ Typical collections gathered from multiple sources ➢ Sampled subsets of ~110 million triples each; ~25GB each
  • 22. Workloads ➢ 8 queries defined for BTC ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009. ➢ Two additional queries with UNION and OPTIONAL clauses ➢ 7 new queries for WDC http://exascale.info/tripleprov
  • 23. Results: overhead of tracking provenance compared to the vanilla version of the system for the BTC dataset. [Figure with four panels: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
  • 24. Conclusions ➢ provenance overhead is considerable but acceptable, on average about 60-70% ➢ the most suitable storage model depends on data and workload characteristics ➢ annotated: more appropriate for heterogeneous datasets and workloads retrieving provenance ➢ co-located: more appropriate for homogeneous datasets and workloads filtering by source
  • 25. Future Work ➢ Distributed version ➢ Dynamic storage model ➢ Adaptive query execution strategies ➢ PROV output ➢ Queries over provenance
  • 26. Summary ➢ TripleProv: an efficient triplestore tracking provenance ➢ Two storage models ➢ Fine-grained multilevel provenance tracing ➢ Formal provenance polynomials ➢ Experimental evaluation http://exascale.info/tripleprov
  • 27. Loading & Memory [Figures for two datasets: Billion Triple Challenge, Web Data Commons]
  • 28. Results: overhead of tracking provenance compared to the vanilla version of the system for the WDC dataset. [Figure with four panels: source-level SLPO, source-level SPOL, triple-level SLPO, triple-level SPOL]
  • 29. Polynomials: multiple records [(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)] ⊕ [(l5 ⊕ l7) ⊗ (l4) ⊗ (l13 ⊕ l17) ⊗ (l28)] ⊕ [(l4) ⊗ (l1 ⊕ l2) ⊗ (l3 ⊕ l7) ⊗ (l8 ⊕ l9 ⊕ l4)]