TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store

23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea
Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2)
1) eXascale Infolab, University of Fribourg, Switzerland
2) Web & Media Group, VU University Amsterdam, Netherlands
Outline
➢ Motivation
➢ Provenance Polynomials
➢ System
➢ Results
Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
How a query answer was derived: what data was
combined to produce the result.
Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to pinpoint the exact
source from which the result was
selected
➢ Capability to trace back the
complete list of sources and how
they were combined to deliver a
result
Querying Distributed Data Sources
How exactly was the answer derived?
Application: Post-query Calculations
➢ Scores or probabilities for query result
➢ Result ranking
➢ Compute trust
➢ Information quality based on used sources
Application: Query Execution
➢ Modify query strategies on the fly
➢ Restrict results to a certain subset of sources
➢ Restrict results w.r.t. queries over provenance
➢ Access control: only certain sources will appear
➢ Detect whether a result would remain valid if a certain
source were removed
Provenance Polynomials
➢ Characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and how they were combined
to deliver a result
Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
lat long l1 l2 l4 l4,
lat long l1 l2 l4 l5,
lat long l1 l2 l5 l4,
lat long l1 l2 l5 l5,
lat long l1 l3 l4 l4,
lat long l1 l3 l4 l5,
lat long l1 l3 l5 l4,
lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4,
lat long l2 l2 l4 l5,
lat long l2 l2 l5 l4,
lat long l2 l2 l5 l5,
lat long l2 l3 l4 l4,
lat long l2 l3 l4 l5,
lat long l2 l3 l5 l4,
lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4,
lat long l3 l2 l4 l5,
lat long l3 l2 l5 l4,
lat long l3 l2 l5 l5,
lat long l3 l3 l4 l4,
lat long l3 l3 l4 l5,
lat long l3 l3 l5 l4,
lat long l3 l3 l5 l5,
TripleProv Results
result:
lat, long
provenance polynomial:
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
Polynomial Operators
➢ Union (⊕)
○ constraint or projection satisfied with multiple sources
l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object-subject (OS) and object-object (OO) joins between sets of constraints
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
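The two operators can be illustrated with a minimal Python sketch. It assumes a polynomial in "join of unions" form, stored as a list of factors where each factor is a list of source labels; this representation and the helper names `format_polynomial` and `expand` are illustrative assumptions, not TripleProv's internal data structures.

```python
from itertools import product

# Assumed representation (not TripleProv's own): the outer list is a
# join (⊗) of factors, each inner list is a union (⊕) of source labels.

def format_polynomial(poly):
    """Render a join-of-unions polynomial with the ⊕ and ⊗ symbols."""
    return " ⊗ ".join("(" + " ⊕ ".join(factor) + ")" for factor in poly)

def expand(poly):
    """List every source combination: one source picked from each factor."""
    return list(product(*poly))

poly = [["l1", "l2"], ["l3", "l4"]]
text = format_polynomial(poly)
combos = expand(poly)

# The four-factor polynomial from the "TripleProv Results" slide.
big = [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]]
big_count = len(expand(big))
```

Expanding the four-factor polynomial yields 3 × 2 × 2 × 2 = 24 combinations, the same number of rows the graph-based query lists explicitly, while the polynomial states them in a single line.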
Example Polynomial
select ?lat ?long where {
?a [] "Eiffel Tower" .
?a inCountry FR .
?a lat ?lat .
?a long ?long .
}
(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)
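One application listed earlier is detecting whether a result stays valid when a source is removed. A hedged sketch of that check, assuming a Boolean reading of the polynomial (a union needs at least one surviving source, a join needs every factor to survive); the function name `still_derivable` and the list-of-lists representation are assumptions made for illustration.

```python
def still_derivable(poly, removed):
    """Return True if every joined factor keeps at least one surviving source.

    poly is a join (outer list) of unions (inner lists of source labels);
    removed is the collection of sources assumed to be unavailable.
    """
    removed = set(removed)
    return all(any(src not in removed for src in factor) for factor in poly)

# Polynomial of the Eiffel Tower example above.
poly = [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]]

without_l4 = still_derivable(poly, {"l4"})             # l5 still covers that factor
without_l8_l9 = still_derivable(poly, {"l8", "l9"})    # last factor has no source left
```

Under this reading, dropping l4 alone leaves the result derivable, whereas dropping both l8 and l9 empties one factor and invalidates it.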
Example Polynomial
select ?l ?long ?lat where
{
?p name "Krebs, Emil" .
?p deathPlace ?l .
?c [] ?l .
?c featureClass P .
?c inCountry DE .
?c long ?long .
?c lat ?lat .
}
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5)]
⊗
[(l6 ⊕ l7) ⊗ (l8) ⊗ (l9 ⊕ l10) ⊗ (l11 ⊕ l12) ⊗ (l13)]
Granularity Levels
➢ source-level: sources of the triples
➢ triple-level: all pieces of data used to answer the query
(l1 ⊕ l2) ⊗ (l3 ⊕ l4)
System Architecture
Native Data Model
➢ Semantically co-located data
➢ Template-based molecules
Various Physical Storage Models
Differences:
➢ ease of implementation
➢ memory consumption
➢ query execution
➢ interference with the original concept of molecule
1) SPOL 2) LSPO 3) SLPO 4) SPLO
Annotated Triples
➢ Annotated provenance
➢ Quadruples
➢ Easy to implement
➢ Source data repeated
for each triple
Co-located Elements
➢ Data grouped by source
➢ Physically co-located
➢ Avoids duplication of the
same source inside a
molecule
➢ Data about a given subject
co-located in one molecule
➢ More difficult to implement
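The difference between the two storage models can be sketched with toy in-memory structures. This is only an illustration under stated assumptions: the quads, the subject name `eiffel`, the placeholder values, and the `colocate` helper are hypothetical, and the nested dictionaries stand in for, but do not reproduce, the physical molecule layout.

```python
from collections import defaultdict

# Annotated model (assumed toy form): each triple carries its source,
# so the source label is stored once per triple (a quadruple).
quads = [
    ("eiffel", "label", "Eiffel Tower", "l1"),
    ("eiffel", "inCountry", "FR", "l1"),
    ("eiffel", "lat", "LAT_VALUE", "l2"),    # placeholder value
    ("eiffel", "long", "LONG_VALUE", "l2"),  # placeholder value
]

def colocate(quads):
    """Group data by subject (one molecule each), then by source inside it."""
    molecules = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj, source in quads:
        molecules[subject][source].append((predicate, obj))
    return {s: dict(by_source) for s, by_source in molecules.items()}

molecules = colocate(quads)

# How many times a source label is stored in each layout.
annotated_source_slots = len(quads)
colocated_source_slots = sum(len(by_source) for by_source in molecules.values())
```

In this toy data the annotated layout stores a source label four times, while the co-located layout stores each source once per molecule, which is the duplication the co-located model avoids.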
Experiments
How expensive is it to trace
provenance?
What is the overhead on query
execution time?
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked
Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata
extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
Workloads
➢ 8 queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large RDF
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
Results
Overhead of tracking provenance compared to the
vanilla version of the system for the BTC dataset
[Figure; series: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
Conclusions
➢ Provenance overhead is considerable but acceptable,
on average about 60-70%
➢ The most suitable storage model depends on data and
workload characteristics
➢ Annotated: more appropriate for heterogeneous datasets
and workloads retrieving provenance
➢ Co-located: more appropriate for homogeneous datasets
and workloads filtering by source
Future Work
➢ Distributed version
➢ Dynamic storage model
➢ Adaptive query execution strategies
➢ PROV output
➢ Queries over provenance
Summary
➢ TripleProv: an efficient triplestore tracking provenance
➢ Two storage models
➢ Fine-grained multilevel provenance tracing
➢ Formal provenance polynomials
➢ Experimental evaluation
http://exascale.info/tripleprov
Loading & Memory
[Figures; panels: Billion Triple Challenge, Web Data Commons]
Results
Overhead of tracking provenance compared to the
vanilla version of the system for the WDC dataset
[Figure; series: source-level SLPO, source-level SPOL, triple-level SLPO, triple-level SPOL]
Polynomials: multiple records
[(l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9)]
⊕
[(l5 ⊕ l7) ⊗ (l4) ⊗ (l13 ⊕ l17) ⊗ (l28)]
⊕
[(l4) ⊗ (l1 ⊕ l2) ⊗ (l3 ⊕ l7) ⊗ (l8 ⊕ l9 ⊕ l4)]
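A polynomial over multiple records can also feed the post-query calculations mentioned earlier, such as a trust score. The sketch below is one possible interpretation and not a method stated in the slides: it assumes a max/min reading in which a union takes the most trusted alternative and a join is limited by its weakest factor. The trust values and the function names `record_score` and `result_score` are hypothetical.

```python
def record_score(record, trust):
    """Score one record: best source per union, weakest factor across the join."""
    return min(max(trust[src] for src in factor) for factor in record)

def result_score(records, trust):
    """Score a union of records: the best-scoring record wins."""
    return max(record_score(record, trust) for record in records)

# The three records of the polynomial above, as joins of unions.
records = [
    [["l1", "l2", "l3"], ["l4", "l5"], ["l6", "l7"], ["l8", "l9"]],
    [["l5", "l7"], ["l4"], ["l13", "l17"], ["l28"]],
    [["l4"], ["l1", "l2"], ["l3", "l7"], ["l8", "l9", "l4"]],
]

# Hypothetical per-source trust values in [0, 1], chosen for illustration only.
trust = {
    "l1": 0.9, "l2": 0.5, "l3": 0.7, "l4": 0.6, "l5": 0.8, "l6": 0.4,
    "l7": 0.3, "l8": 0.2, "l9": 0.95, "l13": 0.5, "l17": 0.1, "l28": 0.85,
}

per_record = [record_score(record, trust) for record in records]
overall = result_score(records, trust)
```

Other interpretations (for example probabilities or counts) would swap in different operations for ⊕ and ⊗ while keeping the same traversal of the polynomial.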