Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e., its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses the problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Provenance for Data Munging Environments
1. Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Provenance for Data Munging Environments
Information Sciences Institute – August 13, 2015
2. Outline
• What’s data munging and why is it important?
• The role of provenance
• The reality….
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
10. Solution: Tracking and exposing provenance*
* “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data”
The PROV Data Model (W3C Recommendation)
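To make the PROV idea concrete, here is a minimal sketch of what such a record could look like for a single munging step, written with the Python prov package (the package and the file names are illustrative; they are not part of the talk):

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

# The things involved: a raw input file, the cleaned output, and the person who ran the step.
raw = doc.entity('ex:raw.csv')
clean = doc.entity('ex:cleaned.csv')
alice = doc.agent('ex:alice')

# The activity that produced the output from the input.
munge = doc.activity('ex:munging-run-1')
doc.used(munge, raw)
doc.wasGeneratedBy(clean, munge)
doc.wasDerivedFrom(clean, raw)
doc.wasAssociatedWith(munge, alice)

print(doc.get_provn())  # the record in PROV-N notation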
23. References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14).
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
23
25. Challenge
• Can we capture provenance
– with a low false-positive ratio?
– without manual/obtrusive integration effort?
• We have to rely on observed provenance.
25
26. State of the art
Application
• Observed provenance systems treat programs as black-boxes.
• Can’t tell if an input file was actually used.
• Can’t quantify the influence of an input on an output.
26
30. Evaluation: tackling the n×m problem
30
• DataTracker is able to track the actual use of the input data.
• Read data ≠ Used data.
• Eliminates false positives present in other observed provenance capture methods.
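To see why this matters, here is a toy sketch in Python (purely illustrative; this is not DataTracker's implementation) of how taint propagation distinguishes data that is merely read from data that actually reaches an output:

class Tainted:
    # A value that carries the set of input sources it was derived from.
    def __init__(self, value, sources):
        self.value = value
        self.sources = set(sources)
    def __add__(self, other):
        # Any computation on tainted values propagates the union of their sources.
        return Tainted(self.value + other.value, self.sources | other.sources)

a = Tainted(10, {"input_a.csv"})  # read and actually used
b = Tainted(32, {"input_b.csv"})  # read and actually used
c = Tainted(99, {"input_c.csv"})  # read, but never used for the output

out = a + b
print(out.value, out.sources)     # 42 {'input_a.csv', 'input_b.csv'}

A black-box observer would link the output to all three inputs (the n×m derivation edges), whereas the taint labels show that input_c.csv never influenced it.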
32. Can we do good enough?
• Can taint tracking
a. become an “always-on” feature?
b. be turned on for all running processes?
• What if we want to also run other analysis code?
• Can we pre-determine the right analysis code?
32
36. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
36
37. Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so we can’t “go live”.
[Diagram: inputs, interrupts, and DMA into the virtual CPU/RAM, together with the initial RAM snapshot and the non-determinism log, make up a PANDA execution trace]
37
38. Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Diagram: analysis plugins (Plugin A, B, C) run over the PANDA execution trace with read-only access to the replayed CPU/RAM state and emit provenance into a triple store]
38
[Diagram: the W3C PROV data model – Entities, Activities, and Agents linked by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAttributedTo, wasAssociatedWith, and actedOnBehalfOf, with startedAtTime/endedAtTime (xsd:dateTime) on activities]
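As an illustration of querying such PROV/RDF triples with SPARQL, a small self-contained sketch using rdflib (the data is made up; the prototype itself stores traces in its own triple store):

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .
ex:cleaned_csv prov:wasGeneratedBy ex:cleaning_run ;
               prov:wasDerivedFrom ex:raw_csv .
ex:cleaning_run prov:used ex:raw_csv .
""")

q = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?output ?source
WHERE { ?output prov:wasDerivedFrom ?source . }
"""
for row in g.query(q):
    print(row.output, row.source)  # which outputs were derived from which inputs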
39. OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
39
40. The PROV-Tracer Plugin
• Registers for process creation/destruction events.
• Decodes executed system calls.
• Keeps track of what files are used as input/output by each process.
• Emits provenance in an intermediate format when a process terminates.
40
41. More Analysis Plugins
• ProcStrMatch plugin.
– Which processes contained string S in their memory?
• Other possible types of analysis:
– Taint tracking
– Dynamic slicing
41
42. Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional 1.1x–1.2x slowdown.
Virtualization is the dominant overhead factor.
42
43. Overhead (again) (2/2)
• QEMU is a suboptimal virtualization option.
• ReVirt – User Mode Linux (Dunlap ‘02)
– Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMware (Xu ‘07)
– Slowdown: 1.05x–2.6x rec. + ??? virt.
Virtualization slowdown is considered acceptable.
Recording overhead is fairly low.
43
44. Storage Requirements
• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt ‘14):
– 17–915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage storage requirements.
– Compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today’s technology.
44
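For a rough sense of scale (assuming the O(10MB/min) figure above): 10 MB/min × 60 × 24 ≈ 14 GB per machine per day uncompressed, i.e. on the order of tens of GB per day, which matches the “few dozens of GBs per day” mentioned in the speaker notes.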
45. Highlights
• Taint-tracking analysis is a powerful method for capturing provenance.
– Eliminates many false positives.
– Tackles the “n×m problem”.
• Decoupling provenance analysis from execution is possible through the use of VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
45
47. References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. World Wide Web Conference 2014.
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. Executing Provenance-Enabled Queries over Web Data. World Wide Web Conference 2015.
47
48. RDF is great for munging data
➢ Ability to arbitrarily add new information (schemaless)
➢ Syntaxes make it easy to concatenate new data
➢ Information has a well-defined structure
➢ Identifiers are distributed but controlled
48
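A small illustration of the concatenation point (hypothetical data): because each statement in a line-based syntax such as N-Triples is self-contained, merging two sources is just appending their files, e.g.
<http://example.org/eiffel> <http://www.w3.org/2000/01/rdf-schema#label> "Eiffel Tower" . (from source A)
<http://example.org/eiffel> <http://example.org/inCountry> <http://example.org/FR> . (from source B)
and the result is still valid RDF.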
50. Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
lat long l1 l2 l4 l4,
lat long l1 l2 l4 l5,
lat long l1 l2 l5 l4,
lat long l1 l2 l5 l5,
lat long l1 l3 l4 l4,
lat long l1 l3 l4 l5,
lat long l1 l3 l5 l4,
lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4,
lat long l2 l2 l4 l5,
lat long l2 l2 l5 l4,
lat long l2 l2 l5 l5,
lat long l2 l3 l4 l4,
lat long l2 l3 l4 l5,
lat long l2 l3 l5 l4,
lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4,
lat long l3 l2 l4 l5,
lat long l3 l2 l5 l4,
lat long l3 l2 l5 l5,
lat long l3 l3 l4 l4,
lat long l3 l3 l4 l5,
lat long l3 l3 l5 l4,
lat long l3 l3 l5 l5,
51. Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
52. Polynomials Operators
➢ Union (⊕)
○ a constraint or projection satisfied with multiple sources: l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object-subject (OS) and object-object (OO) joins between sets of constraints: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
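Putting the two operators together, for the simple star-query example walked through in the speaker notes below: if the first constraint is satisfied by sources l1, l2, or l3, the second by l4 or l5, and the two projections by l6 or l7 and l8 or l9 respectively, the resulting polynomial is (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9), the joins reflecting that all the triples were joined on the same variable ?a.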
56. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
57. Workloads
➢ 8 queries defined for BTC
○ T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
58. Results
Overhead of tracking provenance compared to the vanilla version of the system for the BTC dataset.
[Chart legend: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
59. TripleProv: Query Execution Pipeline
input: a provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those derived from data specified by the provenance query
59
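A minimal, self-contained Python sketch of this strategy over toy in-memory quads (illustrative only; this is not TripleProv's actual API):

QUADS = [  # (subject, predicate, object, context/source)
    ("eiffel", "label", "Eiffel Tower", "l1"),
    ("eiffel", "inCountry", "FR", "l2"),
    ("eiffel", "lat", "48.858", "l4"),
]

def provenance_query(trusted_sources):
    # Step 1: the provenance query selects the context values of interest.
    return {quad[3] for quad in QUADS if quad[3] in trusted_sources}

def workload_query(pattern, contexts):
    # Steps 2-4: the (possibly rewritten) workload query only touches data
    # whose context value was selected by the provenance query.
    s, p, o = pattern
    return [quad for quad in QUADS
            if quad[3] in contexts
            and (s is None or quad[0] == s)
            and (p is None or quad[1] == p)
            and (o is None or quad[2] == o)]

contexts = provenance_query({"l1", "l2"})
print(workload_query(("eiffel", None, None), contexts))  # results restricted to data from l1/l2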
60. Experiments
What is the most efficient query execution strategy for provenance-enabled queries?
60
61. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that the provenance queries do not modify the result sets of the workload queries
61
62. Results for BTC
➢ Full Materialization: 44x faster than the vanilla version of the system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization
➢ Query Rewriting and Post-Filtering strategies perform significantly slower
62
63. Data Analysis
➢ How many context values refer to how many triples? How selective are they?
➢ 6'819'826 unique context values in the BTC dataset.
➢ The majority of the context values are highly selective.
➢ Average selectivity:
○ 5.8 triples per context value
○ 2.3 molecules per context value
63
65. References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein. foxPSL: A Fast, Optimized and eXtended PSL Implementation. International Journal of Approximate Reasoning (2015).
65
66. Why logic?
- Concise & natural way to represent relations
- Declarative representation:
- Can reuse, extend, combine rules
- Experts can write rules
- First order logic:
- Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)
67. Let the reasoner munge the data.
See Sebastian Riedel’s (and others’) work towards pushing more NLP problems into the reasoner.
http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
68. Statistical Relational Learning
● Several flavors:
o Markov Logic Networks
o Bayesian Logic Programs
o Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010]
● PSL has been successfully applied to:
o Entity resolution, link prediction
o Ontology alignment, knowledge graph identification
o Computer vision, trust propagation, …
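To make the “experts can write rules” point from slide 66 concrete: a PSL program is a set of weighted first-order rules over soft truth values in [0, 1]. A commonly used illustrative example (generic PSL style, not the exact foxPSL DSL syntax) looks like:

3.0 : friend(A, B) ∧ votes(A, P) → votes(B, P)
8.0 : spouse(A, B) ∧ votes(A, P) → votes(B, P)

Inference then assigns truth values that minimize the total weighted distance to satisfaction of these rules.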
70. FoxPSL: Fast Optimized eXtended PSL
[Diagram: the foxPSL DSL – classes, ∃/partially grounded rules, and optimizations]
71. Experiments: comparison with ACO
SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
ACO = implementation of consensus optimization on GraphLab, used for grounded PSL
72. Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
• Provenance by stealth (ack Carole Goble)
• Separate provenance analysis from instrumentation.
• Future:
• The computer should do the work
73. Future Research
• Explore optimizations of taint tracking for capturing provenance.
• Provenance analysis of real-world traces (e.g. from rrshare.org).
• Tracking provenance across environments
• Traces/logs as central provenance primitive
• Declarative data munging
73
Disclosed provenance methods require knowledge of application semantics and modification of the application.
On the other hand, observed provenance methods usually have a high false-positive ratio.
Let’s look at a physical-world provenance problem.
Geologists want to know the provenance of streams flowing out of the foothills of a mountain. To do so, they pour dye on the suspected source of the stream.
We can apply a similar method, called taint tracking, to find the provenance of data streams.
Taint tracking allows us to examine the flow of data in what was previously a black box.
We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
We evaluated DataTracker with some sample programs to show that it can tackle the nxm problem and eliminate false positives present in other observed provenance capture methods.
The nxm problem is a major drawback of other observed provenance methods. In summary, it means that in the presence of n inputs and m outputs, the provenance graph will include nxm derivation edges.
Decouple analysis from execution.
Has been proposed for security analysis on mobile phones. (Paranoid Android, Portokalidis ‘10)
Execution Capture: happens in real time
Instrumentation: applied on the captured trace to generate provenance information
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
We implemented our methodology using PANDA.
PANDA is based on QEMU.
Input includes both executed instructions and data.
RAM snapshot + ND log are enough to accurately replay the whole execution.
The ND log consists of inputs to the CPU/RAM; other device state is not logged, so we can replay but we cannot “go live” (i.e. resume execution)
Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state.
Plugins are implemented as dynamic libraries.
We focus on the highlighted plugins in this presentation.
Typical information that can be retrieved through VM introspections.
In general, executing code inside the guest OS is complex.
Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
QEMU is a good choice for prototyping, but overall suboptimal as a virtualization option.
Xu et al. do not give any numbers for virtualization slowdown. They (rightfully) consider it acceptable for most cases.
1.05x is for CPU bound processing. 2.6x is for I/O bound processing.
A few dozens of GBs per day.
nowadays, as we integrate a myriad of datasets from the Web
we need a solution to:
trace which pieces of data were combined, and how, to deliver a result (previous work)
tailor the query execution process with information on data provenance, to filter the pieces of data used in processing a query (this work)
------------------
we have to deal with issues like ascertaining trust, establishing transparency, and estimating the cost of a query answer
before moving to our way of dealing with it, let’s first have a look at whether it could be done with some existing systems
let’s try to use named graphs to store the source for each triple….
- we can load quads; the 4th element is taken as the named graph
- we can even query it to retrieve some kind of provenance information….
in the picture,
g1, g2, g3, g4 - the named graphs we use to store the source of the data
as a result we have a huge list of permuted elements,
l - lineage, source of the triples used to produce a particular entity
- standard query results, enriched with named graphs
- simple list of concatenated sources
- permutations of values bound to variables referring to data used to answer the query
- no formal, compact representation of provenance
- no detailed full-fledged provenance polynomials,
and how would it be with TripleProv?? ……. voila….
the question is: How to represent provenance information?
it must fulfill three main conditions
characterize the ways each source contributed to the result
pinpoint the exact sources of each result
we need a capacity….. to trace back the list of sources and the way they were combined to deliver a result
in our polynomials, we use two logical operators
Union
constraint or projection is satisfied with multiple sources (same triple from multiple sources)
multiple entities satisfy a set of constraints or projections (the answer is composed of multiple records)
Join
sources joined to handle a set of constraints or projections; joins are based on the subject…
OS (object-subject) and OO (object-object) joins between sets of constraints
Let me now give you some examples…..
As a first example we take a simple star query
the polynomial shows that
- the first constraint was satisfied with lineage l1, l2 or l3, => Union of multiple sources, the constraint was satisfied with triples from multiple sources
- the second was satisfied with l4 or l5.
- the first projection was processed with elements having a lineage of l6 or l7,
- the second one was processed with elements from l8 or l9.
All the triples involved were joined on variable ?a, which is expressed in the polynomial…..by the join operators
TripleProv is built on top of a native RDF store named Diplodocus,
it has a modular architecture
containing 6 main subcomponents:
query executor: responsible for parsing the incoming query, rewriting the query plans, and collecting and finally returning the results along with the provenance polynomials
lexicographic tree: in charge of encoding URIs and literals into compact system identifiers and of translating them back;
type index: clusters all keys based on their RDF types;
RDF molecules: the main storage structure; it stores RDF data as very compact subgraphs, along with the source for each piece of data
molecule index: for each key we store a list of molecules where the key can be found.
the main question in the database world is: how fast is it?
we translate this into…...
how expensive it is to trace provenance…..
what is the overhead of tracking provenance
Two subsets…. sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
Typical collections gathered from multiple sources
tracking provenance for them seems to precisely address the problem we focus on:
what is the provenance of a query answer in a dataset integrated from many sources
as a workload for BTC we used
- 8 Queries from the work of Thomas Neumann, SIGMOD 2009
- two extra queries with UNION and OPTIONAL clauses
for WDC we prepared 7 various queries
they represent different kinds of typical query patterns including
star-queries up to 5 joins,
object-object joins,
object-subject joins,
and triangular joins
all of them are available on the project web page,
now we can have a quick look at the performance
on the picture you can see the overhead over the vanilla version of the system (w/o provenance) for BTC dataset
horizontal axis: queries
vertical axis: overhead
you can see results for 4 variants of the system; those are permutations of granularity levels and storage models
--------------------------------------------------------------------------------------------
Overall, the performance penalty created by tracking provenance ranges from a few percent to almost 350%.
we observe a significant difference between the two storage models implemented
-retrieving data from co-located structures takes about 10%-20% more time than from simply annotated graph nodes
caused by the additional look-ups and loops that have to be considered when reading from extra physical data containers
We also notice a difference between the two granularity levels.
the more detailed triple level requires more time
such a simple post-execution join would of course result in poor performance,
in our methods the query execution process can vary depending on the exact strategy
typically we start by executing the blue provenance query and optionally pre-materializing or co-locating data;
the green workload queries are then optionally rewritten…..
by taking into account results of the provenance query
and finally they get executed
The process returns as output the workload query results, restricted to those that follow the specification expressed in the provenance query
the main question in the database world is: how fast is it?
in our case we will try to answer the question,
what is the most efficient query execution strategy for provenance-enabled queries?
for our experiments, we used….
Two subsets sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
those are… typical collections gathered from multiple sources
executing provenance-enabled queries for them seems to precisely address the problem we focus on,
our goal is to fairly compare our provenance-aware query execution strategies and the vanilla version of the system, that's why...
for the datasets we added some triples so that the provenance queries do not change the results of workload queries
overall…
Full Materialization: 44x faster than the vanilla version of the system
Partial Materialization: 35x faster
Pre-Filtering: 23x faster
The advantage of the Partial Materialization strategy over the Full Materialization strategy…
is that for the Partial Materialization, the time to execute a provenance query and materialize data is 475 times lower.
it’s basically faster to prepare data for executing workload queries
Query Rewriting and Post-Filtering strategies perform significantly slower
to better understand the influence of provenance queries on performance,
So to find the reason for such a performance gain over the pure triple store
we analysed the BTC dataset and provenance distribution
the figure shows how many context values refer to how many triples
we found that
there are only a handful of context values that are widespread (left-hand side of the figure)
and that the vast majority of the context values are highly selective (right-hand side of the figure)
we leveraged those properties during the query execution,
our strategies prune molecules early based on their context values