TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea
Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2)
(1) eXascale Infolab, University of Fribourg, Switzerland
(2) Web & Media Group, VU University Amsterdam, Netherlands

Outline
➢ Motivation
➢ Provenance Polynomials
➢ System
➢ Results

Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
How a query answer was derived: what data was
combined to produce the result.

Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to pinpoint the exact
source from which the result was
selected
➢ Capability to trace back the
complete list of sources and how
they were combined to deliver a
result

Querying Distributed Data Sources
How exactly was the answer derived?

Application: Post-query Calculations
➢ Scores or probabilities for query results
➢ Result ranking
➢ Compute trust
➢ Information quality based on used sources

Application: Query Execution
➢ Modify query strategies on the fly
➢ Restrict results to a certain subset of sources
➢ Restrict results w.r.t. queries over provenance
➢ Access control: only certain sources will appear
➢ Detect whether a result would remain valid if a certain
source were removed

Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined
to deliver a result

Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
Result row: lat, long, l1, l2, l4, l4

TripleProv Results
result:
lat, long
provenance polynomial:
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
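
To make the contrast with the graph-based query concrete, the sketch below expands the polynomial into the individual source combinations. It assumes each factor corresponds to one triple pattern of the query, in order, and that every combination yields the same (lat, long) answer; under those assumptions the graph-based query could return up to one row per combination, while the polynomial stays a single compact expression. The list-of-lists encoding is only an illustration, not TripleProv's internal format.

```python
from itertools import product

# The polynomial from the slide as a list of unions (assumed: one per triple pattern).
polynomial = [
    ["l1", "l2", "l3"],  # ?a [] "Eiffel Tower"
    ["l4", "l5"],        # ?a inCountry FR
    ["l6", "l7"],        # ?a lat ?lat
    ["l8", "l9"],        # ?a long ?long
]

# Each combination corresponds to one possible (?g1, ?g2, ?g3, ?g4) binding.
rows = list(product(*polynomial))

print(len(rows))  # 24
print(rows[0])    # ('l1', 'l4', 'l6', 'l8')
```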

Polynomial Operators
➢ Union (⊕)
○ constraint or projection satisfied with multiple sources
l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object-subject (OS) and object-object (OO) joins between sets of constraints
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
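
One possible way to hold such a polynomial in memory is as a join of unions. The sketch below is a minimal illustration; the names `union`, `join`, and `render` are hypothetical helpers introduced here, not part of TripleProv.

```python
from typing import List

# Hypothetical representation: a polynomial is a join (⊗) of factors,
# and each factor is a union (⊕) of source labels.
Factor = List[str]
Polynomial = List[Factor]


def union(*sources: str) -> Factor:
    """Sources that can each satisfy the same constraint or projection."""
    return list(sources)


def join(*factors: Factor) -> Polynomial:
    """Factors that must all be satisfied together."""
    return [list(f) for f in factors]


def render(poly: Polynomial) -> str:
    """Format the polynomial in the notation used on the slides."""
    return " ⊗ ".join("(" + " ⊕ ".join(f) + ")" for f in poly)


example = join(union("l1", "l2"), union("l3", "l4"))
print(render(example))  # (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
```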

Example Polynomial
select ?lat ?long where {
?a [] "Eiffel Tower" .
?a inCountry FR .
?a lat ?lat .
?a long ?long .
}
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
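
A polynomial of this shape can answer the question of whether the result survives when sources are removed or access is restricted: the result stays derivable as long as every join factor keeps at least one source. The sketch below assumes a list-of-lists encoding of the polynomial; `restrict`, `is_valid`, and `valid_without` are hypothetical helper names.

```python
from typing import List, Set

Polynomial = List[List[str]]  # join of unions, an illustrative encoding

eiffel: Polynomial = [
    ["l1", "l2", "l3"],
    ["l4", "l5"],
    ["l6", "l7"],
    ["l8", "l9"],
]


def restrict(poly: Polynomial, allowed: Set[str]) -> Polynomial:
    """Keep only the sources the caller is allowed to use."""
    return [[s for s in factor if s in allowed] for factor in poly]


def is_valid(poly: Polynomial) -> bool:
    """A result is derivable only if no join factor is empty."""
    return all(len(factor) > 0 for factor in poly)


def valid_without(poly: Polynomial, removed: Set[str]) -> bool:
    """Check whether the result survives the removal of some sources."""
    remaining = {s for factor in poly for s in factor} - removed
    return is_valid(restrict(poly, remaining))


print(valid_without(eiffel, {"l1"}))        # True
print(valid_without(eiffel, {"l4", "l5"}))  # False
```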

Example Polynomial
select ?l ?long ?lat where
{
?p name "Krebs, Emil" .
?p deathPlace ?l .
?c [] ?l .
?c featureClass P .
?c inCountry DE .
?c long ?long .
?c lat ?lat .
}
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5)]
⊗
[(l6 ⊕ l7) ⊗ (l8) ⊗ (l9 ⊕ l10) ⊗ (l11 ⊕ l12) ⊗ (l13)]
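
As an illustration of a post-query calculation, the polynomial above can be evaluated with numeric trust values per source. The sketch below makes several assumptions: the trust values are invented for the example, ⊕ is interpreted as max (the best alternative source counts) and ⊗ as min (a join is only as trustworthy as its weakest factor), and the two bracketed groups are flattened into one list of factors because ⊗ is associative. Other interpretations are equally possible.

```python
from typing import Dict, List

# Flattened factors of the polynomial for the "Krebs, Emil" query.
polynomial: List[List[str]] = [
    ["l1", "l2", "l3"], ["l4", "l5"],
    ["l6", "l7"], ["l8"], ["l9", "l10"], ["l11", "l12"], ["l13"],
]

# Hypothetical trust values, made up for this example.
trust: Dict[str, float] = {
    "l1": 0.9, "l2": 0.4, "l3": 0.7, "l4": 0.8, "l5": 0.6,
    "l6": 0.5, "l7": 0.95, "l8": 0.7, "l9": 0.3, "l10": 0.85,
    "l11": 0.9, "l12": 0.2, "l13": 0.75,
}


def trust_score(poly: List[List[str]], values: Dict[str, float]) -> float:
    """Evaluate the polynomial with ⊕ as max and ⊗ as min (an assumed choice)."""
    return min(max(values[s] for s in factor) for factor in poly)


print(trust_score(polynomial, trust))  # 0.7
```

Here the score is bounded by l8, which is the only source for its factor, so raising the trust of any other source would not change the result.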

Granularity Levels
➢ source-level: sources of the triples
➢ triple-level: all pieces of data used to answer the query
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)

Native Data Model
➢ Semantically co-located data
➢ Template-based molecules

Various Physical Storage Models
Differences:
➢ ease of implementation
➢ memory consumption
➢ query execution
➢ interference with the original concept of molecule
1) SPOL 2) LSPO 3) SLPO 4) SPLO
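
The four names can be read as nesting orders. The sketch below assumes S, P, O, and L stand for subject, predicate, object, and lineage (source), and uses plain nested dictionaries as a stand-in for the actual molecule structures, which the slides do not detail. The `nest` helper and the sample values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, lineage)
POSITION = {"S": 0, "P": 1, "O": 2, "L": 3}


def nest(quads: Iterable[Quad], order: str) -> Dict:
    """Build a nested dict whose levels follow `order`, e.g. "SLPO"."""
    root: Dict = {}
    for quad in quads:
        node = root
        for letter in order:
            node = node.setdefault(quad[POSITION[letter]], {})
    return root


# Sample data invented for the illustration.
quads = [
    ("eiffel", "lat", "48.858", "l6"),
    ("eiffel", "long", "2.294", "l6"),
    ("eiffel", "inCountry", "FR", "l4"),
]

slpo = nest(quads, "SLPO")  # lineage sits right below the subject
spol = nest(quads, "SPOL")  # lineage is attached at the end of each triple

print(list(slpo["eiffel"]))                    # ['l6', 'l4']
print(list(spol["eiffel"]["lat"]["48.858"]))   # ['l6']
```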

Annotated Triples
➢ Annotated provenance
➢ Quadruples
➢ Easy to implement
➢ Source data repeated
for each triple

Co-located Elements
➢ Data grouped by source
➢ Physically co-located
➢ Avoids duplication of the
same source inside a
molecule
➢ Data about a given subject
co-located in one molecule
➢ More difficult to implement
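
The trade-off between the two layouts can be sketched with toy data. The code below is an illustration under assumptions, not the system's storage code: the sample triples are invented, the annotated layout is a flat list of quadruples, and the co-located layout groups a subject's data by source.

```python
from collections import defaultdict

# Sample (subject, predicate, object, source) data invented for the illustration.
quads = [
    ("eiffel", "lat", "48.858", "l6"),
    ("eiffel", "long", "2.294", "l6"),
    ("eiffel", "inCountry", "FR", "l4"),
]

# Annotated: one quadruple per triple, so the source label is stored each time.
annotated = list(quads)

# Co-located: within a subject's molecule, data is grouped by source.
co_located = defaultdict(lambda: defaultdict(list))
for s, p, o, src in quads:
    co_located[s][src].append((p, o))


def from_source_annotated(subject, source):
    """Filtering by source requires scanning the quadruples."""
    return [(p, o) for s, p, o, src in annotated if s == subject and src == source]


def from_source_co_located(subject, source):
    """Filtering by source is a direct lookup in the grouped layout."""
    return list(co_located.get(subject, {}).get(source, []))


labels_annotated = len(annotated)
labels_co_located = sum(len(by_source) for by_source in co_located.values())

print(labels_annotated, labels_co_located)     # 3 2
print(from_source_co_located("eiffel", "l6"))  # [('lat', '48.858'), ('long', '2.294')]
```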

Experiments
How expensive is it to trace
provenance?
What is the overhead on query
execution time?

Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked
Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata
extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each

Workloads
➢ 8 queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large RDF
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov

Results
[Figure: overhead of tracking provenance compared to the vanilla version of the system, BTC dataset. Series: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated.]

Conclusions
➢ Provenance overhead is considerable but acceptable,
on average about 60-70%
➢ The most suitable storage model depends on data and
workload characteristics
➢ Annotated: more appropriate for heterogeneous datasets
and workloads retrieving provenance
➢ Co-located: more appropriate for homogeneous datasets
and workloads filtering by source

Future Work
➢ Distributed version
➢ Dynamic storage model
➢ Adaptive query execution strategies
➢ PROV output
➢ Queries over provenance

Summary
➢ TripleProv: an efficient triplestore tracking provenance
➢ Two storage models
➢ Fine-grained multilevel provenance tracing
➢ Formal provenance polynomials
➢ Experimental evaluation
http://exascale.info/tripleprov

Loading & Memory
[Figures: one panel for Billion Triple Challenge, one for Web Data Commons]

Results
[Figure: overhead of tracking provenance compared to the vanilla version of the system, WDC dataset. Series: source-level SLPO, source-level SPOL, triple-level SLPO, triple-level SPOL.]

Polynomials: multiple records
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)]
⊕
[(l5 ⊕ l7) ⊗ (l4) ⊗ (l13 ⊕ l17) ⊗ (l28)]
⊕
[(l4) ⊗ (l1 ⊕ l2) ⊗ (l3 ⊕ l7) ⊗ (l8 ⊕ l9 ⊕ l4)]
