Executing Provenance-Enabled Queries over Web Data

eXascale Infolab
24th International World Wide Web Conference, 21st May 2015, Florence, Italy
Executing Provenance-Enabled
Queries over Web Data
Marcin Wylot (1), Philippe Cudré-Mauroux (1), and Paul Groth (2)
1) eXascale Infolab, University of Fribourg, Switzerland
2) Elsevier Labs, Amsterdam, Netherlands
Research Question
How can RDF databases efficiently support provenance-enabled queries?
Outline
➢ Motivation
➢ Provenance-Enabled Queries
➢ Query Execution Strategies
➢ Results
Data Provenance
“Provenance is information about
entities, activities, and people involved
in producing a piece of data or thing, which can be used to form
assessments about its quality, reliability or trustworthiness.”
Which pieces of data were combined to produce
the result?
Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to store and track
provenance data (WWW 2014)
➢ Capability to tailor queries with
provenance information (WWW
2015)
Querying Linked Data
use data following a provenance specification
Provenance-Enabled Query
A Workload Query is a query producing results a user is
interested in. These results are referred to as workload
query results.
A Provenance Query is a query that selects a set of data
from which the workload query results should originate.
A Provenance-Enabled Query is a pair consisting of a
Workload Query and a Provenance Query, producing results
a user is interested in (as specified by the Workload Query)
and originating only from data pre-selected by the
Provenance Query.
Provenance-Enabled Query: Example
SELECT ?t WHERE {
?a <type> <article> .
?a <tag> <Obama> .
?a <title> ?t . }
➢ ensure that the articles come from sources attributed to the government
SELECT ?ctx WHERE {
?ctx prov:wasAttributedTo <government> . }
➢ ensure that the data used to produce the answer was associated with a
“SeniorEditor” and a “Manager”
SELECT ?ctx WHERE {
?ctx prov:wasGeneratedBy <articleProd> .
<articleProd> prov:wasAssociatedWith ?ed .
?ed rdf:type <SeniorEditor> .
<articleProd> prov:wasAssociatedWith ?m .
?m rdf:type <Manager> . }
Executing Provenance-Enabled Queries
A workload query and a provenance query are given as input to
a triplestore, which produces results for both queries and
then combines them to obtain the final results.
TripleProv: Query Execution Pipeline
input: a provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those which were derived
from data specified by the provenance query
Physical Storage Models
A molecule collocates objects related
to a given subject; it is composed of a
subject and a series of predicates and
objects related to that subject.
Extended for provenance data, a
molecule collocates the context values
with the predicate-object pairs.
This avoids the duplication of the
same context value, while at the same
time collocating all data about a given
subject in one structure.
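A minimal Python sketch of this layout, not TripleProv's actual storage code: quads are grouped by subject and then by context value, so each context value is stored once per molecule. The `build_molecules` helper and the sample quads are hypothetical.

```python
from collections import defaultdict

# Toy quads: (subject, predicate, object, context). All values are made up.
quads = [
    ("article1", "type", "article", "ctxA"),
    ("article1", "tag", "Obama", "ctxA"),
    ("article1", "title", "Title one", "ctxB"),
    ("article2", "type", "article", "ctxB"),
]

def build_molecules(quads):
    """Group quads by subject, then by context value.

    Each molecule maps a context value to the predicate-object pairs
    that carry it, so the context value is not repeated per pair.
    """
    molecules = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj, context in quads:
        molecules[subject][context].append((predicate, obj))
    return molecules

molecules = build_molecules(quads)
```

Here `molecules["article1"]` holds two context values, with both `ctxA` pairs stored under a single key.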
[Figure: an RDF molecule, the basic data unit]
Query Execution Strategies
1. Post-Filtering
2. Query Rewriting
3. Full Materialization
4. Pre-Filtering
5. Partial Materialization
Post-Filtering
➢ the baseline strategy
➢ executes both the workload and the provenance query independently.
➢ the provenance and workload queries can be executed in any order
➢ the results from the provenance query are used to filter a posteriori
the results of the workload query based on their provenance
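The baseline can be sketched in a few lines of Python over toy quads. This is an illustration under assumptions, not the system's code: the function names and data are hypothetical, and an answer is kept only if every quad used to derive it comes from an allowed context.

```python
# Toy data: (subject, predicate, object, context); all names are illustrative.
quads = [
    ("a1", "type", "article", "ctxGov"),
    ("a1", "tag", "Obama", "ctxGov"),
    ("a1", "title", "Gov title", "ctxGov"),
    ("a2", "type", "article", "ctxBlog"),
    ("a2", "tag", "Obama", "ctxBlog"),
    ("a2", "title", "Blog title", "ctxBlog"),
]
prov_triples = [
    ("ctxGov", "prov:wasAttributedTo", "government"),
    ("ctxBlog", "prov:wasAttributedTo", "blogger"),
]

def provenance_query(triples):
    """Return the context values attributed to the government."""
    return {s for s, p, o in triples
            if p == "prov:wasAttributedTo" and o == "government"}

def workload_query(quads):
    """Titles of articles tagged Obama, with the contexts each answer used."""
    results = []
    for a, p1, o1, c1 in quads:
        if (p1, o1) != ("type", "article"):
            continue
        for s2, p2, o2, c2 in quads:
            if (s2, p2, o2) != (a, "tag", "Obama"):
                continue
            for s3, p3, title, c3 in quads:
                if (s3, p3) == (a, "title"):
                    results.append((title, {c1, c2, c3}))
    return results

def post_filter(results, allowed):
    """Keep only answers derived exclusively from allowed contexts."""
    return [title for title, used in results if used <= allowed]

# The two queries are independent, so they could run in either order.
allowed_contexts = provenance_query(prov_triples)
final_results = post_filter(workload_query(quads), allowed_contexts)
```

Note that the workload query still touches every quad; the provenance information only prunes answers at the very end.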
Query Rewriting
1. execute the provenance query
2. rewrite the query plan; add provenance constraints
3. return restricted results
➢ efficient from the provenance query
execution side
➢ can be suboptimal from the
workload query execution side
A triplestore can implement this strategy in two ways: either by
modifying the query execution process, or by rewriting the workload
queries to include constraints on the named graphs.
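As a hedged illustration of the second option, the Python helper below rewrites a workload query into SPARQL 1.1 text that restricts each triple pattern to the named graphs returned by the provenance query. The helper `rewrite_with_contexts` is hypothetical, and using one graph variable per pattern is an assumption made here so that different patterns may match in different allowed graphs.

```python
def rewrite_with_contexts(select_vars, triple_patterns, contexts):
    """Wrap each triple pattern in a GRAPH clause limited to `contexts`."""
    values = " ".join(f"<{c}>" for c in sorted(contexts))
    lines = [f"SELECT {' '.join(select_vars)} WHERE {{"]
    for i, pattern in enumerate(triple_patterns):
        graph_var = f"?g{i}"
        # VALUES binds the graph variable to the allowed context values only.
        lines.append(f"  VALUES {graph_var} {{ {values} }}")
        lines.append(f"  GRAPH {graph_var} {{ {pattern} }}")
    lines.append("}")
    return "\n".join(lines)

# Patterns taken from the workload query in the example slide.
patterns = ["?a <type> <article> .", "?a <tag> <Obama> .", "?a <title> ?t ."]
rewritten = rewrite_with_contexts(["?t"], patterns, {"ctxGov"})
print(rewritten)
```

The size of the rewritten query grows with the number of context values, which is one way the workload side can become suboptimal.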
Full Materialization
We implemented basic view mechanisms in TripleProv. These
mechanisms allow us to project, materialize, and use as a
secondary structure the portions of the molecules that follow
the provenance specification.
1. execute the provenance query
2. materialize data for the provenance query
3. execute workload queries on the materialized view
➢ This strategy will outperform all other
strategies when executing the workload
queries, since they are executed as is on
the relevant subset of the data.
➢ Materializing all potential tuples based on
the provenance query can be expensive,
both in terms of storage space and latency.
➢ Implementing this strategy requires either manually
materializing the relevant tuples and modifying the
workload queries accordingly, or using a triplestore
that supports materialized views.
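A small Python sketch of the idea, assuming a toy quad list rather than TripleProv's molecule-based views: the view is built once from the allowed contexts, and the workload function then runs unchanged on it. `materialize_view` and `titles_tagged` are hypothetical names.

```python
# Toy data: (subject, predicate, object, context); all names are illustrative.
quads = [
    ("a1", "type", "article", "ctxGov"),
    ("a1", "tag", "Obama", "ctxGov"),
    ("a1", "title", "Gov title", "ctxGov"),
    ("a2", "type", "article", "ctxBlog"),
    ("a2", "tag", "Obama", "ctxBlog"),
    ("a2", "title", "Blog title", "ctxBlog"),
]

def materialize_view(quads, allowed_contexts):
    """Copy every quad whose context satisfies the provenance query."""
    return [q for q in quads if q[3] in allowed_contexts]

def titles_tagged(quads, tag):
    """Workload query, written with no knowledge of provenance."""
    tagged = {s for s, p, o, _ in quads if p == "tag" and o == tag}
    return sorted(o for s, p, o, _ in quads if p == "title" and s in tagged)

# Materialization cost is paid once; it grows with the selected data.
view = materialize_view(quads, {"ctxGov"})
restricted_titles = titles_tagged(view, "Obama")
```

The same `titles_tagged` call on the full `quads` list returns both titles, which shows that the restriction comes entirely from the view.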
Pre-Filtering
➢ A dedicated provenance index
collocates, for each context
value, the ids (or hashes) of all
tuples belonging to this context.
➢ The index is created upfront
when the data is loaded.
1. execute the provenance query
2. execute workload queries, including early filtering with the provenance index
➢ The provenance index is
looked up during the query
execution to filter molecules
that are compatible with the
provenance specification.
➢ This strategy requires creating
a new index structure in the
system and modifying both the
loading and the query execution
processes.
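A minimal sketch of such an index in Python, under the assumption that tuple ids are simply positions in a toy quad list; `build_provenance_index` and `scan` are hypothetical names, not TripleProv functions.

```python
from collections import defaultdict

# Toy data: (subject, predicate, object, context); all names are illustrative.
quads = [
    ("a1", "tag", "Obama", "ctxGov"),
    ("a1", "title", "Gov title", "ctxGov"),
    ("a2", "tag", "Obama", "ctxBlog"),
    ("a2", "title", "Blog title", "ctxBlog"),
]

def build_provenance_index(quads):
    """Built once at load time: context value -> ids of its tuples."""
    index = defaultdict(set)
    for tuple_id, (_, _, _, context) in enumerate(quads):
        index[context].add(tuple_id)
    return index

def scan(quads, index, allowed_contexts, predicate):
    """Scan for a predicate, skipping tuples outside the allowed contexts."""
    valid_ids = set()
    for context in allowed_contexts:
        valid_ids |= index.get(context, set())
    return [(s, o) for i, (s, p, o, _) in enumerate(quads)
            if i in valid_ids and p == predicate]

provenance_index = build_provenance_index(quads)
titles = scan(quads, provenance_index, {"ctxGov"}, "title")
```

Because the index is built at load time, the provenance query itself adds no materialization step before the workload queries run.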
Partial Materialization
➢ This strategy introduces a trade-off between the performance of
the provenance query and that of the workload queries.
➢ While executing the provenance query, the system builds a
temporary structure maintaining the ids of all molecules
belonging to the context values returned by the provenance
query.
1. execute the provenance query and partially materialize molecules
2. execute workload queries, including early filtering based on the pre-materialized set of molecules
➢ The system dynamically (and efficiently)
looks up all molecules and can filter
them out early if they do not
appear in the temporary structure.
➢ Query processing operations can be
executed faster on a reduced number of
elements.
➢ The implementation of this strategy
requires the introduction of an additional
data structure at the core of the system,
and the adjustment of the query
execution process in order to use it.
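The following Python sketch illustrates the temporary structure under simplifying assumptions: molecules are plain dictionaries keyed by subject, the structure is a set of subject ids, and a molecule that passes the early check is still filtered by context internally. All names are hypothetical.

```python
# Toy molecules: subject -> {context value: [(predicate, object), ...]}.
molecules = {
    "a1": {"ctxGov": [("tag", "Obama"), ("title", "Gov title")]},
    "a2": {"ctxBlog": [("tag", "Obama"), ("title", "Blog title")]},
    "a3": {"ctxGov": [("tag", "Obama")], "ctxBlog": [("title", "Mixed title")]},
}

def relevant_molecules(molecules, allowed_contexts):
    """Temporary structure: ids of molecules holding any allowed context."""
    return {subject for subject, by_context in molecules.items()
            if not allowed_contexts.isdisjoint(by_context)}

def titles_tagged(molecules, relevant, allowed_contexts, tag):
    """Workload query that only inspects pre-selected molecules."""
    titles = []
    for subject in sorted(relevant):
        pairs = [pair
                 for context, context_pairs in molecules[subject].items()
                 if context in allowed_contexts
                 for pair in context_pairs]
        if ("tag", tag) in pairs:
            titles += [o for p, o in pairs if p == "title"]
    return titles

allowed = {"ctxGov"}
relevant = relevant_molecules(molecules, allowed)
restricted_titles = titles_tagged(molecules, relevant, allowed, "Obama")
```

Molecule `a2` is never inspected by the workload query, and `a3` is inspected but yields no title because its title comes from a context outside the provenance specification.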
Experiments
What is the most efficient query
execution strategy for provenance-enabled
queries?
Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked Open Data
cloud
○ Web Data Commons (WDC): RDFa and Microdata extracted from the
Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC), so that
the provenance queries do not modify the result sets of the workload
queries
Workloads
➢ Queries defined for BTC
○ T. Neumann and G. Weikum. Scalable join processing on very large rdf
graphs. In Proceedings of the 2009 ACM SIGMOD International
Conference on Management of data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL
clauses
➢ 7 new, varied queries for WDC
http://exascale.info/provqueries
Results for BTC
➢ Full Materialization: 44x faster
than the vanilla version of the
system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization
executes a provenance query and
materializes data 475 times faster
than Full Materialization
➢ The Query Rewriting and
Post-Filtering strategies perform
significantly slower
Results for Representative Scenario
➢ original BTC dataset
➢ no added triples
➢ output changes due to
provenance specification
➢ all provenance-aware strategies
show higher performance gains
in the more realistic scenario
smaller number of context values from the provenance query
smaller number of relevant molecules to inspect
Data Analysis
➢ How many context values refer
to how many triples? How
selective are they?
➢ 6'819'826 unique context values
in the BTC dataset.
➢ The majority of the context
values are highly selective.
➢ average selectivity
○ 5.8 triples per context value
○ 2.3 molecules per context value
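As a rough consistency check, the reported averages can be multiplied out in Python. This assumes the context-value count and the average refer to the same sampled BTC subset, which the slides do not state explicitly; the variable names are illustrative.

```python
# Figures taken from the slides.
unique_context_values = 6_819_826
avg_triples_per_context = 5.8

# Estimated number of triples covered by all context values.
estimated_triples = unique_context_values * avg_triples_per_context
print(f"{estimated_triples / 1e6:.1f} million triples")
```

The estimate is about 39.6 million triples, in line with the sampled subset size of ~40 million triples given for the datasets.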
Conclusions
➢ Querying provenance data does not necessarily
introduce a performance overhead.
➢ Queries tailored with provenance data can be executed
faster.
➢ Provenance information is highly selective.
➢ Partial Materialization represents the best trade-off for
provenance-enabled queries, but it introduces a
materialization cost and is not trivial to implement.
Summary
➢ provenance-enabled queries: to tailor queries with
provenance information
➢ five provenance-aware query execution strategies
➢ TripleProv: an efficient triplestore that can store, track,
and query provenance
➢ experimental evaluation and data analysis
★ http://exascale.info/provqueries
★ http://exascale.info/tripleprov
❖ email: marcin@exascale.info
❖ twitter: @mwylot
1 of 28

Recommended

Analysis of the “KDD Cup-1999” Datasets by
Analysis of the  “KDD Cup-1999”  DatasetsAnalysis of the  “KDD Cup-1999”  Datasets
Analysis of the “KDD Cup-1999” DatasetsRafsanjani, Muhammod
1.1K views13 slides
Poster (1) by
Poster (1)Poster (1)
Poster (1)Daniel Osei
65 views1 slide
chemengine karthi acs sandiego rev1.0 by
chemengine karthi acs sandiego rev1.0chemengine karthi acs sandiego rev1.0
chemengine karthi acs sandiego rev1.0Muthukumarasamy Karthikeyan
145 views1 slide
Bio Linux by
Bio LinuxBio Linux
Bio Linuxguest294984
1.1K views24 slides
ChemEngine ACS by
ChemEngine ACSChemEngine ACS
ChemEngine ACSMuthukumarasamy Karthikeyan
236 views17 slides
Ijricit 01-002 enhanced replica detection in short time for large data sets by
Ijricit 01-002 enhanced replica detection in  short time for large data setsIjricit 01-002 enhanced replica detection in  short time for large data sets
Ijricit 01-002 enhanced replica detection in short time for large data setsIjripublishers Ijri
113 views3 slides

More Related Content

What's hot

Journals analysis ppt by
Journals analysis pptJournals analysis ppt
Journals analysis pptMuhammad Heikal
1.5K views28 slides
A cyber physical stream algorithm for intelligent software defined storage by
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storageMade Artha
228 views5 slides
Transparency in the Data Supply Chain by
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply ChainPaul Groth
1.2K views43 slides
Integrative information management for systems biology by
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biologyNeil Swainston
759 views45 slides
Slice for Distributed Persistence (JavaOne 2010) by
Slice for Distributed Persistence (JavaOne 2010)Slice for Distributed Persistence (JavaOne 2010)
Slice for Distributed Persistence (JavaOne 2010)Pinaki Poddar
980 views62 slides
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES by
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESkevig
15 views7 slides

What's hot(16)

A cyber physical stream algorithm for intelligent software defined storage by Made Artha
A cyber physical stream algorithm for intelligent software defined storageA cyber physical stream algorithm for intelligent software defined storage
A cyber physical stream algorithm for intelligent software defined storage
Made Artha228 views
Transparency in the Data Supply Chain by Paul Groth
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply Chain
Paul Groth1.2K views
Integrative information management for systems biology by Neil Swainston
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biology
Neil Swainston759 views
Slice for Distributed Persistence (JavaOne 2010) by Pinaki Poddar
Slice for Distributed Persistence (JavaOne 2010)Slice for Distributed Persistence (JavaOne 2010)
Slice for Distributed Persistence (JavaOne 2010)
Pinaki Poddar980 views
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES by kevig
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCESFINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
FINDING OUT NOISY PATTERNS FOR RELATION EXTRACTION OF BANGLA SENTENCES
kevig15 views
Drug Repurposing using Deep Learning on Knowledge Graphs by Databricks
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge Graphs
Databricks563 views
Ontology based clustering algorithms by Ikutwa
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
Ikutwa1.2K views
Integrating vague association mining with markov model by ijsc
Integrating vague association mining with markov modelIntegrating vague association mining with markov model
Integrating vague association mining with markov model
ijsc368 views
Applying soft computing techniques to corporate mobile security systems by Paloma De Las Cuevas
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systems
Iaetsd a survey on one class clustering by Iaetsd Iaetsd
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
Iaetsd Iaetsd192 views
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS by Editor IJCATR
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
Editor IJCATR473 views
Similarity distance measures by thilagasna
Similarity  distance measuresSimilarity  distance measures
Similarity distance measures
thilagasna2.2K views
“Open Data Web” – A Linked Open Data Repository Built with CKAN by Chengjen Lee
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
Chengjen Lee2.2K views
International Journal of Engineering Research and Development (IJERD) by IJERD Editor
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor362 views
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ... by IRJET Journal
IRJET-  	  Optimal Number of Cluster Identification using Robust K-Means for ...IRJET-  	  Optimal Number of Cluster Identification using Robust K-Means for ...
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
IRJET Journal12 views

Similar to Executing Provenance-Enabled Queries over Web Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data by
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataeXascale Infolab
713 views48 slides
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f... by
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...The Statistical and Applied Mathematical Sciences Institute
425 views45 slides
Query aware determinization of uncertain objects by
Query aware determinization of uncertain objectsQuery aware determinization of uncertain objects
Query aware determinization of uncertain objectsSoftroniics india
121 views6 slides
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ... by
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuantUniversity
793 views44 slides
Analytics of analytics pipelines: from optimising re-execution to general Dat... by
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
61 views38 slides
C4Bio paper talk by
C4Bio paper talkC4Bio paper talk
C4Bio paper talkPaolo Missier
782 views22 slides

Similar to Executing Provenance-Enabled Queries over Web Data(20)

Efficient, Scalable, and Provenance-Aware Management of Linked Data by eXascale Infolab
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
eXascale Infolab713 views
Query aware determinization of uncertain objects by Softroniics india
Query aware determinization of uncertain objectsQuery aware determinization of uncertain objects
Query aware determinization of uncertain objects
Softroniics india121 views
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ... by QuantUniversity
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuantUniversity793 views
Analytics of analytics pipelines: from optimising re-execution to general Dat... by Paolo Missier
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Paolo Missier61 views
Recording and Reasoning Over Data Provenance in Web and Grid Services by Martin Szomszor
Recording and Reasoning Over Data Provenance in Web and Grid ServicesRecording and Reasoning Over Data Provenance in Web and Grid Services
Recording and Reasoning Over Data Provenance in Web and Grid Services
Martin Szomszor1.3K views
IEEE 2015 Java Projects by Vijay Karan
IEEE 2015 Java ProjectsIEEE 2015 Java Projects
IEEE 2015 Java Projects
Vijay Karan90 views
A Web Extraction Using Soft Algorithm for Trinity Structure by iosrjce
A Web Extraction Using Soft Algorithm for Trinity StructureA Web Extraction Using Soft Algorithm for Trinity Structure
A Web Extraction Using Soft Algorithm for Trinity Structure
iosrjce121 views
Final report group2 by George Sam
Final report group2Final report group2
Final report group2
George Sam245 views
Offsite presentation original by sally.de
Offsite presentation originalOffsite presentation original
Offsite presentation original
sally.de79 views
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central by Paolo Missier
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Paolo Missier1.3K views
Svm Classifier Algorithm for Data Stream Mining Using Hive and R by IRJET Journal
Svm Classifier Algorithm for Data Stream Mining Using Hive and RSvm Classifier Algorithm for Data Stream Mining Using Hive and R
Svm Classifier Algorithm for Data Stream Mining Using Hive and R
IRJET Journal45 views

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction by
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictioneXascale Infolab
287 views30 slides
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S... by
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...eXascale Infolab
167 views16 slides
Representation Learning on Complex Graphs by
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex GraphseXascale Infolab
539 views33 slides
A force directed approach for offline gps trajectory map by
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapeXascale Infolab
459 views12 slides
Cikm 2018 by
Cikm 2018Cikm 2018
Cikm 2018eXascale Infolab
871 views18 slides
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit... by
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...eXascale Infolab
787 views20 slides

More from eXascale Infolab(20)

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction by eXascale Infolab
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
eXascale Infolab287 views
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S... by eXascale Infolab
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
eXascale Infolab167 views
Representation Learning on Complex Graphs by eXascale Infolab
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
eXascale Infolab539 views
A force directed approach for offline gps trajectory map by eXascale Infolab
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
eXascale Infolab459 views
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit... by eXascale Infolab
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
eXascale Infolab787 views
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous... by eXascale Infolab
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
eXascale Infolab1.2K views
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans by eXascale Infolab
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
eXascale Infolab687 views
SANAPHOR: Ontology-based Coreference Resolution by eXascale Infolab
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
eXascale Infolab1.1K views
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data by eXascale Infolab
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
eXascale Infolab4K views
The Dynamics of Micro-Task Crowdsourcing by eXascale Infolab
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
eXascale Infolab1.6K views
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu... by eXascale Infolab
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
eXascale Infolab3.1K views
CIKM14: Fixing grammatical errors by preposition ranking by eXascale Infolab
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
eXascale Infolab1.7K views
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series) by eXascale Infolab
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
eXascale Infolab663 views

Recently uploaded

MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdf by
MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdfMODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdf
MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdfKerryNuez1
24 views5 slides
"How can I develop my learning path in bioinformatics? by
"How can I develop my learning path in bioinformatics?"How can I develop my learning path in bioinformatics?
"How can I develop my learning path in bioinformatics?Bioinformy
23 views13 slides
Pollination By Nagapradheesh.M.pptx by
Pollination By Nagapradheesh.M.pptxPollination By Nagapradheesh.M.pptx
Pollination By Nagapradheesh.M.pptxMNAGAPRADHEESH
16 views9 slides
Disinfectants & Antiseptic by
Disinfectants & AntisepticDisinfectants & Antiseptic
Disinfectants & AntisepticSanket P Shinde
10 views36 slides
별헤는 사람들 2023년 12월호 전명원 교수 자료 by
별헤는 사람들 2023년 12월호 전명원 교수 자료별헤는 사람들 2023년 12월호 전명원 교수 자료
별헤는 사람들 2023년 12월호 전명원 교수 자료sciencepeople
37 views30 slides
MILK LIPIDS 2.pptx by
MILK LIPIDS 2.pptxMILK LIPIDS 2.pptx
MILK LIPIDS 2.pptxabhinambroze18
7 views15 slides

Recently uploaded(20)

MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdf by KerryNuez1
MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdfMODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdf
MODULE-9-Biotechnology, Genetically Modified Organisms, and Gene Therapy.pdf
KerryNuez124 views
"How can I develop my learning path in bioinformatics? by Bioinformy
"How can I develop my learning path in bioinformatics?"How can I develop my learning path in bioinformatics?
"How can I develop my learning path in bioinformatics?
Bioinformy23 views
Pollination By Nagapradheesh.M.pptx by MNAGAPRADHEESH
Pollination By Nagapradheesh.M.pptxPollination By Nagapradheesh.M.pptx
Pollination By Nagapradheesh.M.pptx
MNAGAPRADHEESH16 views
별헤는 사람들 2023년 12월호 전명원 교수 자료 by sciencepeople
별헤는 사람들 2023년 12월호 전명원 교수 자료별헤는 사람들 2023년 12월호 전명원 교수 자료
별헤는 사람들 2023년 12월호 전명원 교수 자료
sciencepeople37 views
Nitrosamine & NDSRI.pptx by NileshBonde4
Nitrosamine & NDSRI.pptxNitrosamine & NDSRI.pptx
Nitrosamine & NDSRI.pptx
NileshBonde413 views
Conventional and non-conventional methods for improvement of cucurbits.pptx by gandhi976
Conventional and non-conventional methods for improvement of cucurbits.pptxConventional and non-conventional methods for improvement of cucurbits.pptx
Conventional and non-conventional methods for improvement of cucurbits.pptx
gandhi97618 views
RemeOs science and clinical evidence by PetrusViitanen1
RemeOs science and clinical evidenceRemeOs science and clinical evidence
RemeOs science and clinical evidence
PetrusViitanen136 views
Experimental animal Guinea pigs.pptx by Mansee Arya
Experimental animal Guinea pigs.pptxExperimental animal Guinea pigs.pptx
Experimental animal Guinea pigs.pptx
Mansee Arya15 views
Synthesis and Characterization of Magnetite-Magnesium Sulphate-Sodium Dodecyl... by GIFT KIISI NKIN
Synthesis and Characterization of Magnetite-Magnesium Sulphate-Sodium Dodecyl...Synthesis and Characterization of Magnetite-Magnesium Sulphate-Sodium Dodecyl...
Synthesis and Characterization of Magnetite-Magnesium Sulphate-Sodium Dodecyl...
GIFT KIISI NKIN22 views
How to be(come) a successful PhD student by Tom Mens
How to be(come) a successful PhD studentHow to be(come) a successful PhD student
How to be(come) a successful PhD student
Tom Mens473 views
A Ready-to-Analyze High-Plex Spatial Signature Development Workflow for Cance... by InsideScientific
A Ready-to-Analyze High-Plex Spatial Signature Development Workflow for Cance...A Ready-to-Analyze High-Plex Spatial Signature Development Workflow for Cance...
A Ready-to-Analyze High-Plex Spatial Signature Development Workflow for Cance...
InsideScientific49 views

Executing Provenance-Enabled Queries over Web Data

  • 1. 24th International World Wide Web Conference, 21st May 2015, Florence, Italy Executing Provenance-Enabled Queries over Web Data Marcin Wylot1 , Philippe Cudré-Mauroux1 , and Paul Groth2 1) eXascale Infolab, University of Fribourg, Switzerland 2) Elsevier Labs, Amsterdam, Netherlands
  • 2. Research Question How RDF databases can efficiently support provenance-enabled queries? 2
  • 3. Outline ➢ Motivation ➢ Provenance-Enabled Queries ➢ Query Execution Strategies ➢ Results 3
  • 4. Data Provenance “Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.” Which pieces of data were combined to produce the result? 4
  • 5. Data Integration ➢ Integrated and summarized data ➢ Trust, transparency, and cost ➢ Capability to store and track provenance data (WWW 2014) ➢ Capability to tailor queries with provenance information (WWW 2015) 5
  • 6. Querying Linked Data use data following a provenance specification 6
  • 7. Provenance-Enabled Query A Workload Query is a query producing results a user is interested in. These results are referred to as workload query results. A Provenance Query is a query that selects a set of data from which the workload query results should originate. A Provenance-Enabled Query is a pair consisting of a Workload Query and a Provenance Query, producing results a user is interested in (as specified by the Workload Query) and originating only from data pre-selected by the Provenance Query. 7
  • 8. Provenance-Enabled Query: Example SELECT ?t WHERE { ?a <type> <article> . ?a <tag> <Obama> . ?a <title> ?t . } ➢ ensure that the articles come from sources attributed to the government SELECT ?ctx WHERE { ?ctx prov:wasAttributedTo <government> . } ➢ ensure that the data used to produce the answer was associated a “SeniorEditor” and a “Manager” SELECT ?ctx WHERE { ?ctx prov:wasGeneratedBy <articleProd>. <articleProd> prov:wasAssociatedWith ?ed . ?ed rdf:type <SeniorEdior> . <articleProd> prov:wasAssociatedWith ?m . ?m rdf:type <Manager> . } 8
  • 9. Executing Provenance-Enabled Queries A workload query and a provenance query are given as input to a triplestore, which produces results for both queries and then combines them to obtain the final results.
  • 10. TripleProv: Query Execution Pipeline ➢ input: a provenance-enabled query ➢ execute the provenance query ➢ optionally pre-materialize or co-locate data ➢ optionally rewrite the workload queries ➢ execute the workload queries ➢ output: the workload query results, restricted to those derived from data specified by the provenance query
  • 11. Physical Storage Models The RDF molecule is the basic data unit. A molecule collocates objects related to a given subject; it is composed of a subject and a series of predicates and objects related to that subject. Extended for provenance data, a molecule collocates the context values with the predicate-object pairs. This avoids duplicating the same context value, while at the same time collocating all data about a given subject in one structure.
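The molecule layout described on this slide can be pictured with a small in-memory sketch. The nested-dictionary shape (subject, then context value, then predicate-object pairs), the function name `build_molecules`, and the sample data are illustrative assumptions; they are not TripleProv's actual physical format.

```python
from collections import defaultdict


def build_molecules(quads):
    """Group (subject, predicate, object, context) tuples into molecules.

    Assumed toy layout: subject -> context -> list of (predicate, object),
    so each context value is stored once per molecule.
    """
    molecules = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj, context in quads:
        molecules[subject][context].append((predicate, obj))
    return molecules


# Hypothetical sample data, loosely modeled on the article example.
quads = [
    ("article1", "type", "article", "ctx1"),
    ("article1", "tag", "Obama", "ctx1"),
    ("article1", "title", "Speech", "ctx2"),
    ("article2", "type", "article", "ctx2"),
]

molecules = build_molecules(quads)
for subject, by_context in molecules.items():
    for context, pairs in by_context.items():
        print(subject, context, pairs)
```

Grouping by context inside each molecule is what keeps a context value from being repeated next to every predicate-object pair.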
  • 12. Query Execution Strategies 1. Post-Filtering 2. Query Rewriting 3. Full Materialization 4. Pre-Filtering 5. Partial Materialization
  • 13. Post-Filtering ➢ the baseline strategy ➢ executes the workload query and the provenance query independently ➢ the two queries can be executed in any order ➢ the results of the provenance query are used to filter, a posteriori, the results of the workload query based on their provenance
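A minimal sketch of post-filtering, assuming toy data held as in-memory tuples and a workload query that records which context values contributed to each result. All names (`DATA`, `PROVENANCE`, `run_workload_query`, `post_filter`) and the sample values are hypothetical.

```python
from itertools import product

# Hypothetical quads: (subject, predicate, object, context).
DATA = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Speech", "ctx1"),
    ("a2", "type", "article", "ctx2"),
    ("a2", "tag", "Obama", "ctx2"),
    ("a2", "title", "Rumour", "ctx2"),
]
# Hypothetical provenance triples: (context, predicate, agent).
PROVENANCE = [
    ("ctx1", "wasAttributedTo", "government"),
    ("ctx2", "wasAttributedTo", "blog"),
]


def run_provenance_query(provenance, agent):
    """Contexts attributed to the given agent."""
    return {c for c, p, a in provenance if p == "wasAttributedTo" and a == agent}


def run_workload_query(data):
    """Titles of articles tagged Obama, each with the contexts it was derived from."""
    results = []
    for subject in sorted({s for s, _, _, _ in data}):
        facts = [(p, o, c) for s, p, o, c in data if s == subject]
        types = [c for p, o, c in facts if (p, o) == ("type", "article")]
        tags = [c for p, o, c in facts if (p, o) == ("tag", "Obama")]
        titles = [(o, c) for p, o, c in facts if p == "title"]
        for type_ctx, tag_ctx, (title, title_ctx) in product(types, tags, titles):
            results.append((title, frozenset({type_ctx, tag_ctx, title_ctx})))
    return results


def post_filter(results, allowed_contexts):
    """Keep results whose contributing contexts are all allowed."""
    return [title for title, used in results if used <= allowed_contexts]


# The two queries are independent, so either could run first.
allowed = run_provenance_query(PROVENANCE, "government")
unfiltered = run_workload_query(DATA)
print(post_filter(unfiltered, allowed))
```

The workload query does all of its work on the full data set before any provenance constraint is applied, which is why this serves as the baseline.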
  • 14. Query Rewriting Steps: execute the provenance query; rewrite the query plan, adding provenance constraints; return the restricted results. ➢ efficient from the provenance query execution side ➢ can be suboptimal from the workload query execution side. Triplestores can implement it in two ways: either by modifying the query execution process, or by rewriting the workload queries to include constraints on the named graphs.
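The second implementation option, rewriting the workload query to constrain named graphs, might look like the sketch below. The function name, the fresh `?g0`, `?g1`, ... variables, and the choice of one `VALUES` clause per `GRAPH` clause are assumptions made for illustration; the slides do not show the rewritten form TripleProv or other triplestores actually use.

```python
def rewrite_workload_query(select_vars, triple_patterns, allowed_contexts):
    """Wrap every triple pattern in a GRAPH clause restricted to allowed contexts."""
    values = " ".join(f"<{ctx}>" for ctx in sorted(allowed_contexts))
    lines = [f"SELECT {' '.join(select_vars)} WHERE {{"]
    for i, pattern in enumerate(triple_patterns):
        graph_var = f"?g{i}"  # hypothetical fresh variable per pattern
        lines.append(f"  VALUES {graph_var} {{ {values} }}")
        lines.append(f"  GRAPH {graph_var} {{ {pattern} }}")
    lines.append("}")
    return "\n".join(lines)


# Assumed result of the provenance query: two context values.
rewritten = rewrite_workload_query(
    ["?t"],
    ["?a <type> <article> .", "?a <tag> <Obama> .", "?a <title> ?t ."],
    {"ctx1", "ctx3"},
)
print(rewritten)
```

Each pattern gets its own graph variable so that the triples of one answer may come from different allowed contexts. The list of allowed contexts is inlined into the query, which hints at why this can be costly on the workload side when the provenance query returns many values.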
  • 15. Full Materialization We implemented basic view mechanisms in TripleProv. These mechanisms allow us to project, materialize, and use as a secondary structure the portions of the molecules that follow the provenance specification.
  • 16. Full Materialization Steps: execute the provenance query; materialize data for the provenance query; execute workload queries on the materialized view. ➢ This strategy will outperform all other strategies when executing the workload queries, since they are executed as is on the relevant subset of the data. ➢ Materializing all potential tuples based on the provenance query can be expensive, both in terms of storage space and latency. ➢ Implementing this strategy requires either manually materializing the relevant tuples and modifying the workload queries accordingly, or using a triplestore that supports materialized views.
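As a rough illustration of full materialization, the sketch below copies the tuples that satisfy an already computed provenance result into a separate view and then runs an unmodified, provenance-unaware workload query on it. The names `materialize_view` and `titles_tagged`, the sample quads, and the hard-coded set of allowed contexts are assumptions.

```python
def materialize_view(quads, allowed_contexts):
    """Copy the triples whose context satisfies the provenance query."""
    return [(s, p, o) for s, p, o, ctx in quads if ctx in allowed_contexts]


def titles_tagged(triples, tag):
    """Workload query, unaware of provenance: titles of articles with a tag."""
    tagged = {s for s, p, o in triples if (p, o) == ("tag", tag)}
    articles = {s for s, p, o in triples if (p, o) == ("type", "article")}
    wanted = tagged & articles
    return sorted(o for s, p, o in triples if p == "title" and s in wanted)


# Hypothetical quads: (subject, predicate, object, context).
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Speech", "ctx1"),
    ("a2", "type", "article", "ctx2"),
    ("a2", "tag", "Obama", "ctx2"),
    ("a2", "title", "Rumour", "ctx2"),
    ("a3", "type", "article", "ctx1"),
    ("a3", "tag", "Obama", "ctx2"),
    ("a3", "title", "Mixed", "ctx1"),
]

allowed = {"ctx1"}  # assumed output of the provenance query
view = materialize_view(quads, allowed)
print(titles_tagged(view, "Obama"))
```

The copy step is where the storage and latency cost arises; once the view exists, any number of workload queries can reuse it unchanged.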
  • 17. Pre-Filtering ➢ A dedicated provenance index collocates, for each context value, the ids (or hashes) of all tuples belonging to this context. ➢ The index is created upfront when the data is loaded.
  • 18. Pre-Filtering Steps: execute the provenance query; execute workload queries, including early filtering with the provenance index. ➢ The provenance index is looked up during query execution to keep only the molecules that are compatible with the provenance specification. ➢ This strategy requires creating a new index structure in the system and modifying both the loading and the query execution processes.
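The provenance index and its use during query execution can be sketched as follows. This toy version filters at the granularity of tuple ids and assumes the provenance query has already returned its context values; `load`, `match`, and the sample data are hypothetical names and values.

```python
from collections import defaultdict


def load(quads):
    """Load tuples and build the provenance index up front: context -> tuple ids."""
    tuples = {}
    provenance_index = defaultdict(set)
    for tuple_id, (s, p, o, ctx) in enumerate(quads):
        tuples[tuple_id] = (s, p, o)
        provenance_index[ctx].add(tuple_id)
    return tuples, provenance_index


def match(tuples, allowed_ids, predicate, obj=None):
    """Yield (subject, object) for matching tuples, filtering by id early."""
    for tuple_id, (s, p, o) in tuples.items():
        if tuple_id not in allowed_ids:
            continue  # early provenance filter via the index
        if p == predicate and (obj is None or o == obj):
            yield s, o


quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "tag", "Obama", "ctx1"),
    ("a1", "title", "Speech", "ctx1"),
    ("a2", "type", "article", "ctx2"),
    ("a2", "tag", "Obama", "ctx2"),
    ("a2", "title", "Rumour", "ctx2"),
]
tuples, provenance_index = load(quads)

allowed_contexts = {"ctx1"}  # assumed output of the provenance query
allowed_ids = set().union(*(provenance_index[c] for c in allowed_contexts))

articles = {s for s, _ in match(tuples, allowed_ids, "type", "article")}
tagged = {s for s, _ in match(tuples, allowed_ids, "tag", "Obama")}
titles = sorted(o for s, o in match(tuples, allowed_ids, "title") if s in articles & tagged)
print(titles)
```

Because the index is built at load time, the only per-query cost is the lookup and the union of id sets.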
  • 19. Partial Materialization ➢ This strategy introduces a trade-off between the performance of the provenance query and that of the workload queries. ➢ While executing the provenance query, the system builds a temporary structure maintaining the ids of all molecules belonging to the context values returned by the provenance query.
  • 20. Partial Materialization Steps: execute the provenance query and partially materialize molecules; execute workload queries, including early filtering based on the pre-materialized set of molecules. ➢ The system dynamically (and efficiently) looks up all molecules and can filter them out early if they do not appear in the temporary structure. ➢ Query processing operations can be executed faster on a reduced number of elements. ➢ Implementing this strategy requires introducing an additional data structure at the core of the system and adjusting the query execution process to use it.
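A small sketch of the idea, assuming molecules are keyed by subject and grouped by context value: the provenance query returns the allowed contexts and, as a side product, a temporary set of ids of molecules that contain data from those contexts. The function names and sample data are illustrative assumptions.

```python
def run_provenance_query(provenance, molecules, agent):
    """Return allowed contexts plus a temporary set of relevant molecule ids."""
    allowed = {c for c, p, a in provenance if p == "wasAttributedTo" and a == agent}
    relevant = {mid for mid, by_ctx in molecules.items() if not allowed.isdisjoint(by_ctx)}
    return allowed, relevant


def titles_tagged(molecules, allowed, relevant, tag):
    """Workload query that skips molecules missing from the temporary set."""
    titles = []
    for mid, by_ctx in molecules.items():
        if mid not in relevant:
            continue  # early filter on the temporary structure
        # A relevant molecule may still hold data from other contexts.
        pairs = [po for ctx, pos in by_ctx.items() if ctx in allowed for po in pos]
        if ("type", "article") in pairs and ("tag", tag) in pairs:
            titles.extend(o for p, o in pairs if p == "title")
    return sorted(titles)


# Hypothetical molecules: subject -> context -> (predicate, object) pairs.
molecules = {
    "a1": {"ctx1": [("type", "article"), ("tag", "Obama"), ("title", "Speech")]},
    "a2": {"ctx2": [("type", "article"), ("tag", "Obama"), ("title", "Rumour")]},
    "a3": {"ctx1": [("type", "article"), ("title", "Mixed")], "ctx2": [("tag", "Obama")]},
}
provenance = [
    ("ctx1", "wasAttributedTo", "government"),
    ("ctx2", "wasAttributedTo", "blog"),
]

allowed, relevant = run_provenance_query(provenance, molecules, "government")
print(titles_tagged(molecules, allowed, relevant, "Obama"))
```

Only molecule ids are kept in the temporary set, not copies of the data, which is what distinguishes this from full materialization in the sketch.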
  • 21. Experiments What is the most efficient query execution strategy for provenance-enabled queries?
  • 22. Datasets ➢ Two collections of RDF data gathered from the Web ○ Billion Triple Challenge (BTC): crawled from the linked open data cloud ○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl ➢ Typical collections gathered from multiple sources ➢ Sampled subsets of ~40 million triples each; ~10GB each ➢ Added provenance-specific triples (184 for WDC and 360 for BTC), so that the provenance queries do not modify the result sets of the workload queries
  • 23. Workloads ➢ Queries defined for BTC ○ T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009. ➢ Two additional queries with UNION and OPTIONAL clauses ➢ Seven new queries for WDC ➢ http://exascale.info/provqueries
  • 24. Results for BTC ➢ Full Materialization: 44x faster than the vanilla version of the system ➢ Partial Materialization: 35x faster ➢ Pre-Filtering: 23x faster ➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization ➢ Query Rewriting and Post-Filtering strategies perform significantly slower
  • 25. Results for Representative Scenario ➢ original BTC dataset ➢ no added triples ➢ output changes due to the provenance specification ➢ all provenance-aware strategies show higher performance gains in this more realistic scenario ➢ smaller number of context values from the provenance query; smaller number of relevant molecules to inspect
  • 26. Data Analysis ➢ How many context values refer to how many triples? How selective are they? ➢ 6'819'826 unique context values in the BTC dataset. ➢ The majority of the context values are highly selective. ➢ average selectivity ○ 5.8 triples per context value ○ 2.3 molecules per context value
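The two averages on this slide can be computed as in the sketch below, under the assumption that a molecule is identified by its subject. The function name and the sample quads are hypothetical, and the toy numbers are unrelated to the BTC figures.

```python
from collections import defaultdict


def context_selectivity(quads):
    """Average triples and average molecules (subjects) per context value."""
    triples_per_ctx = defaultdict(int)
    molecules_per_ctx = defaultdict(set)
    for s, p, o, ctx in quads:
        triples_per_ctx[ctx] += 1
        molecules_per_ctx[ctx].add(s)
    n = len(triples_per_ctx)
    if n == 0:
        return 0.0, 0.0
    avg_triples = sum(triples_per_ctx.values()) / n
    avg_molecules = sum(len(m) for m in molecules_per_ctx.values()) / n
    return avg_triples, avg_molecules


# Hypothetical quads: ctx1 covers 3 triples in 2 molecules, ctx2 covers 1 triple.
quads = [
    ("a1", "type", "article", "ctx1"),
    ("a1", "title", "Speech", "ctx1"),
    ("a2", "type", "article", "ctx1"),
    ("a2", "title", "Rumour", "ctx2"),
]
print(context_selectivity(quads))
```

Low averages of this kind mean that each context value returned by a provenance query touches only a few triples and molecules, which is why early filtering pays off.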
  • 27. Conclusions ➢ Querying provenance data does not necessarily introduce a performance overhead. ➢ Queries tailored with provenance data can be executed faster. ➢ Provenance information is highly selective. ➢ Partial Materialization represents the best trade-off for provenance-enabled queries, but it introduces a materialization cost and is not trivial to implement.
  • 28. Summary ➢ provenance-enabled queries: tailor queries with provenance information ➢ five provenance-aware query execution strategies ➢ TripleProv: an efficient triplestore that can store, track, and query provenance ➢ experimental evaluation and data analysis ★ http://exascale.info/provqueries ★ http://exascale.info/tripleprov ❖ email: marcin@exascale.info ❖ twitter: @mwylot