Provenance in Databases and Scientific Workflows: Part I

Provenance in Databases
and Scientific Workflows
Bertram Ludäscher
ludaesch@illinois.edu
31st Brazilian Symposium on Databases
October 4-7, 2016, Salvador Bahia
Director, Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (iSchool@Illinois)
& National Center for Supercomputing Applications (NCSA)
& Department of Computer Science (CS@Illinois)

Welcome & Bem-Vindo!
• Boa tarde e bem-vindo ao tutorial sobre
proveniência!
• Welcome to the Tutorial on: Provenance in
Databases and Scientific Workflows
– Proveniência em bancos (bases) de dados e fluxos de
trabalho (workflows) científicos
• Desculpas ...
– Back to English (my 2nd language)
• Feel free to interrupt and ask questions!
• (You can also ask questions in German or Spanish ...)
Provenance @ SBBD'16

Introductions should come first …
• MS Computer Science, U Karlsruhe (K.I.T.)
• PhD Computer Science, U Freiburg, Germany
• Research Scientist, UC San Diego, SDSC
• Dept. of Computer Science, UC Davis
• University of Illinois at Urbana-Champaign
– School of Information Sciences
– Natl. Center for Supercomputing Applications
– Dept. of Computer Science

• Part I: Provenance in Scientific Workflows
– Alta Vista: Provenance everywhere!
– Provenance & Scientific Workflows
– Provenance Models and Standards (not so much)
– Provenance Tools
• Example & Demo: YesWorkflow
• Part II: Provenance in Databases
– Foundations of provenance in databases
– Why-, How-, and Why-Not provenance
Outline of the Tutorial:
A “Tour de Provenance”

1st Tour Stop: The Fine Arts
• One of these is has been sold for nearly $180 million.
• The other could be worth as much or more.
• Which is which?
• What is the difference?

Provenance - Proveniência
• Oxford English Dictionary
– The place of origin or earliest known history of something:
• an orange rug of Iranian provenance
– The beginning of something’s existence; its origin:
• they try to understand the whole universe, its provenance and fate
– A record of ownership of a work of art or an antique, used as a
guide to authenticity or quality:
• the manuscript has a distinguished provenance
• What is the origin (provenance!) of “provenance” ?

The Many Faces of Provenance
• What are those?
• Cosmology
• Geology, Stratigraphy
• Phylogeny
– the Tree of Life
• Genealogy
– your family: literally
• Academic Pedigree
– “Doktorvater” (Doktor-Mutter?)
• Etymology
• Chain of custody
– of art(ifacts)
• Yes: all about origins and history …

Provenance or Provenience?

Provenience vs Provenance
• In archaeology: special sense of location (3D), layer ..

2nd Stop: Liberal Arts & Sciences
• Can you “see provenance” in this image?
• Grand Canyon’s rock layers are a record of the early geologic history of North America.
The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more
recent human history. (By Drenaline, licensed under CC BY-SA 3.0)

Science & Natural History:
Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka.
Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on
drawing by Truth-seeker2004)

Computational Provenance
• Origin and processing history of an artifact
– usually: data (products), figures, ...
– sometimes: workflow (and script) evolution …
• Different sub-communities:
– Provenance in (scientific) workflows (Tutorial Part I)
– Provenance in databases (Tutorial Part II)
– Wait, there is more:
• ... programming languages, systems/security, …

Why should you care about provenance?
• It’s an important problem:
– reproducibility crisis, transparency, data sharing, …
• There are (still) many deeply technical and practical challenges:
– Efficient capture, management, use of provenance
– Models, semantics, query languages
– Provenance .. for others? Or provenance for self!
– Interdisciplinary work; cross-fertilization: databases, workflows,
programming languages, security, …, various scientific communities
(bioinformatics, ...)
• You have a head start here!
– Marta Mattoso, Daniel de Oliveira, Vanessa Braganholo, Juliana Freire, ...
(e.g. SBBD proceedings ..)
• … oh, and it’s also a fun topic ...

Provenance Research everywhere …

OneProblem: Reproducibility Crisis
• Different sciences have
different reproducibility
crises …
• Focus here:
Computational
Reproducibility
– R, Matlab, Python, .. scripts
– Scientific workflows, ...
• How to facilitate reproducibility
for computational and data scientists?

Use Provenance for
Transparency, Reproducibility
• What input data went into
this study?
• What methods were used?
• … with what parameter
settings, calibrations, …?
• Can we trust the data and
methods?
§ Provenance (lineage): track origin and processing history of data è
trust, data quality ~ audit trail for attribution, credit
§ Discovery of data, methodologies, experiments

Climate Change: Whodunnit?

Tracing the sources (data, code)

Provenance today: Important but hard
èmany research projects,
groups conduct R&D on
provenance methods,
tools, …
Example:
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year
analytical effort by a team of over 300
experts, overseen by a broadly
constituted Federal Advisory Committee
of 60 members. It was developed from
information and analyses gathered in
over 70 workshops and listening sessions
held across the country.”

A scientific data federation: DataONE
Data Observation Network for Earth
search.dataone.orgProvenance @ SBBD'16

DataONE Cyberinfrastucture:
Coordinating Nodes
www.dataone.org/coordinating-nodes
Coordinating Nodes
• retain complete metadata
catalog
• indexing for search
• network-wide services
• ensure content availability
(preservation)
• replication services
Components for a flexible, scalable, sustainable
network

network
DataONE Cyberinfrastructure:
Member Nodes
www.dataone.org/member-nodes
Coordinating Nodes
catalog
(preservation)
Member Nodes
• diverse institutions
• serve local community
• provide resources for
managing their data
• retain copies of data

DataONE Cyberinfrastructure:
Investigator Toolkit
www.dataone.org/investigator-toolkit
Coordinating Nodes
catalog
(preservation)
Member Nodes
• diverse institutions
• serve local community
• provide resources for
managing their data
• retain copies of data
Investigator Toolkit
network

Provenance in Action: Benefits & Impact
A DataONE search (here: “grass”) yields different packages with provenance

DataONE: Support for Provenance
Yaxing’s script with
inputs & output
products
Christopher’s
YesWorkflow
model
Christopher using
Yaxing’s outputs as
inputs for his script
Christopher’s results
can be traced back all
the way to Yaxing’s
input

REWIND: From Provenance to Reproducible Science …
Capturing provenance is crucial for
transparency, interpretation, debugging, …
=> repeatable experiments,
=> reproducible science
=> need workflow-system agnostic model

... via scientific workflows (… and scripts)

… 3rd stop: Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
è traceable data- and wf-evolution
è Reproducible Science
Trident
Workbench
VisTrails
Es war einmal …

10 Essential Functions
of a Scientific Workflow System1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in
parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what happened during workflow execution: retrospective provenance.
7. Reveal and query provenance – how workflow products were derived from inputs
via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services
themselves.
These functions (not just dataflow & actors) distinguish scientific workflow automation
from general (scientific) software development.
Src: Tim McPhillipsProvenance @ SBBD'16

Find OTUs
(OTUHunter)
Assign Taxonomy
(STAP)
Profile alignment
(STAP or Infernal)
Build phylogenetic
tree (RaxML or
Quicktree)
View tree:
Dendroscope
UniFrac: tree &
environment file
Assembled
contigs
Chimera check
(Mallard)
Diversity statistics:
Text: OUT list, Chao1, Shannon
Graphs: rarefaction curves, rank-
abundance curves
Visualization tools:
Cytoscape networks &
Heat map
WATERS:
Workflow for Alignment, Taxonomy,
Ecology of Ribosomal Sequences
(Amber Hartman; Eisen Lab; UC Davis)
+/- cipres
+/- cluster
+/- cluster
+/- cluster

Executable WATERS Workflow in Kepler

Example
Bioinformatics
Workflow:
Motif-Catcher
Marc Facciotti et al.
UC Davis Genome Center

Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with
Random Sampling in a Kepler workflow, ICCS-WS, 2012

Kepler Workflows & Decision Making
(Kruger Natl. Park, South Africa)
SANParks Matt Jones, NCEAS @ UC Santa Barbara

A Data-Streaming Workflow over Sensor Data

• Monitor and control supercomputer
simulations
– 50+ composite actors (subworkflows)
– 4 levels of hierarchy
– 1000+ atomic (Java) actors
43 actors, 3 levels
196 actors, 4 levels
30 actors
137 actors
33 actors
150
123 actors
66 actors
12 actors
Norbert Podhorszki
ORNL (then: UC Davis)
“Plumbing” workflow

Data Curation Workflows
(Filtered-Push … Kepler … Kurator projects)

So what is “provenance” (sensu W3C) ?
• Provenance refers to the sources of information, including entities
and processes, involved in producing or delivering an artifact (*)
• Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is a record that describes the people, institutions, entities,
and activities, involved in producing, influencing, or delivering a piece
of data or a thing in the world

W3C PROV Family of Specifications: Modeling
Provenance Analysis and RDF Query Processing,
Satya S. Sahoo, Praveen Rao, ISWC, October, 2015.
W3C PROV Family of Specifications: Provenance Modeling
•  W3C Recommendations
o  PROV Data Model (PROV-
DM)
o  PROV Ontology (PROV-O)
o  PROV-Constraints
o  PROV Notations (PROV-
N)
•  PROV Working Group Notes
(selected)
o  PROV-Access and Querying (AQ)
o  PROV Dictionary
o  PROV XML
o  PROV and Dublin Core Mappings
(PROV-DC)
o  PROV Semantics (using first-order logic)
(PROV-SEM)

Provenance & Semantic Web Layer Cake
Provenance Analysis and RDF Query Processing,
Satya S. Sahoo, Praveen Rao, ISWC, October, 2015.
Provenance and Semantic Web Layer Cake
•  Proof layer aka
Provenance
•  Trust is derived from
provenance
information

Back to the basics: Open Provenance
Model (OPM) => W3C Prov

W3C Prov: some finer points

Runtime Provenance
(a.k.a. traces, logs,
retrospective
provenance,
“Trace-land”)
4th Stop: Different Kinds of Data Provenance in
Workflows Workflow Modeling & Design
(a.k.a. prospective provenance
“Workflow-land”)

Another PROV Extension: OPMW

Another PROV Extension: OPMW
• Approximately:
– Workflow-Land vs Trace-Land

Provenance Standards vs Tools
• Do we need more standards to sort this out?
– … or do we already have too many “standards”?
• How should we think about provenance?
– ... in workflows and databases?
• What can we do with provenance?
– ... in workflows and databases?
• Tools to create, share, use provenance
– … not just for “provenance for others”
– ... need more “provenance for self”

A simple (simplistic?) World View:
Workflow-Land ó Trace-Land
• Not a “standard” – but helps (me!) think about
workflows (prospective provenance) and traces
(retrospective provenance). Provenance @ SBBD'16

ProvONE: PROV for scientific workflows
(Transfer station to any of several other “standard extensions”)
“Trace-Land” (retrospective provenance)
“Data-Land”
Yang Cao1
, Christopher Jones2
, Víctor Cuevas-
Vicenttín3
, Matthew B. Jones2
, Bertram
Ludäscher1
, Timothy McPhillips1
, Paolo
Missier4
, Christopher Schwalm5
, Peter
Slaughter2
, Dave Vieglais6
, Lauren Walker2
,
Yaxing Wei7
1
University of Illinois, Urbana-Champaign, 2
National Center for
Ecological Analysis and Synthesis, UCSB, 3
Universidad Popular
Autónoma del Estado de Puebla, Mexico, 4
School of Computing
Science,
Newcastle University, UK, 5
Woods Hole Research Center, Falmouth,
MA, 6
University of Kansas, Lawrence, 7
Environmental Sciences
Division, Oak Ridge National Lab, TN
Also: A. Marinho, L. Murta, C.Werner, V.Braganholo, S. Serra da Cruz,
E.Ogasawara, M. Mattoso. “ProvManager: A Provenance Management
System for Scientific Workflows.” Concurrency and Computation:
Practice and Experience 24, no. 13 (2012): 1513–1530
…
“Workflow-Land” (prospective prov.)

Provenance Sleuth or Engineer?
• Scientists are Provenance (i.e., Natural History) Sleuths
• {Computational, Computer, Information}-Scientists should
(also) be Provenance Engineers
– Ensure your “Data Tree of Life” (data provenance) is correct!
– What is the origin and processing history of your data?
• With great provenance come great questions!
– “We store everything!”
– Huh? Yes, provenance is the answer… (yawn..)
– But what is the question??
• Engineer’s Stance:
– What questions do you want to answer?
– Let’s find out what observables we need to capture, what query
language we should use, how we do that efficiently (later), …

Drilling down into “Trace-Land”:
From MoC to MoP via Observables
• Model of Computation MoC
– specification/algorithm to compute Outputs = MoC(Wf,Params,Inputs)
– a director or scheduler implements MoC
– gives rise to formal notions of
• computation (aka run) R
– Formalisms to define M?
• Model of Provenance MoP
– associate with a MoC a “default” MoP (= MoC ± Δ)
– the MoP is a “trimmed” MoC
• T = R – I + M
– Trace = Run – Ignored-observables + Modeled-observables
• Observables (of a MoC / MoP)
– functional observables (may influence output o)
• token rate, notions of firing, …
– non-functional observables (not part of M, do not influence o)
• token timestamp, size, … (unless the MoC cares about those)

M. Anand, S. Bowers,
et al., SSDBM’09
From MoCs to Models of Provenance (MoPs)

Fine-grained, Data & MoC-aware MoP
M. Anand, S. Bowers,
et al., SSDBM’09

Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most prov sensu WF work use this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– Most prov sensu DB work use this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2à t3,t4
– … or semantic annotations (sem.types)
f
A
q
t1
t2
t3
t4
X1
X2
Y1
Y2

✔ Provenance capture (Matlab, R, Python, … scientific workflow systems)
✔ Uploading, sharing, linking provenance through various provenance tools
✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own
day-to-day work.
è Prime the provenance pump and increase provenance generation
è Scientists accelerate their work via new, active uses of provenance.
But … how to prime the provenance pump??
Must support “Provenance for Self” !
Provenance
for Self?!
Provenance
for Others

From Workflows & Provenance to
Provenance for Script-based Workflows …
• What workflow tools are (most) scientists using?
– Workflow systems
– … vs scripts (Python, R, MATLAB, ...)
• What provenance tools are their?
– Workflow system support
– Tools for “workflow” scripts!?

Yes, scripts are (can be) workflows too!
Interactive Visualization

SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late
13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-
temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm
estimates joint information in tree-rings and a climate signal to identify “best” tree-ring
chronologies for climate reconstructing.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …

Provenance Support for Reproducible Science
Example: Paleoclimate Reconstruction
Science paper (OA) uses:
• open source code:
– R, PaleoCAR, …
• Is that all we need?
• What was the
“workflow”?
• Is there prospective
and/or retrospective
provenance?

GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years
?
5th Stop: YesWorkflow:
Yes, scripts are workflows, too!
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **

YesWorkflow.org
• YesWorkflow (YW)
– Started as a grass-roots effort (Kurator, SKOPE, ..)
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction
… that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction)...
– Then YW-recon: ASAP (reconstructing runtime Provenance)

YW (prospective) and
YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy J
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model ó Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries

YW annotations: Model your Workflow!

YesWorkflow: Prospective & Retrospective
Provenance … (almost) for free!
• YW annotations in
the script (R,
Python, Matlab)
are used to
recreate the
workflow view
from the script …
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_namesample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
YW!

GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years
Paleoclimate Reconstruction (EnviRecon.org)
• … explained using YesWorkflow!
Kyle B., (computational) archaeologist:
"It took me about 20 minutes to comment. Less
than an hour to learn and YW-annotate, all-told."

main
fetch_mask
input_mask_file
load_data
input_data_file standardize_with_mask
land_water_mask
NEE_data simple_diagnose
standardized_NEE_data result_NEE_pdf
Get 3 views for the price of 1!
result_NEE_pdf
input_mask_file land_water_mask
fetch_mask
input_data_file NEE_data
load_data
standardized_NEE_data
standardize_with_mask
standardize_with_mask
simple_diagnose
fetch_mask land_water_mask
load_data NEE_data
standardize_with_mask standardized_NEE_data simple_diagnose result_NEE_pdf
input_mask_file
input_data_file
Process view
Data view
Combined view

Multi-Scale Synthesis and Terrestrial Model Intercomparison
Project (MsTMIP)
fetch_drought_variable
drought_variable_1
fetch_effect_variable
effect_variable_1
convert_effect_variable_units
effect_variable_2
create_land_water_mask
land_water_mask
init_data_variables
predrought_effect_variable_1 drought_value_variable_1 recovery_time_variable_1 drought_number_variable_1
define_droughts
sigma_dv_event month_dv_length
detrend_deseasonalize_effect_variable
effect_variable_3
calculate_data_variables
recovery_time_variable_2 drought_value_variable_2 predrought_effect_variable_2 drought_number_variable_2
export_recovery_time_figure
output_recovery_time_figure
export_drought_value_variable_figure
output_drought_value_variable_figure
export_predrought_effect_variable_figure
output_predrought_effect_variable_figure
export_drought_number_variable_figure
output_drought_number_figure
input_drough_variable
input_effect_variable
Christopher Schwalm,
Yaxing Wei

Figure 4: Process workflow view of an A↵ymetrix analysis script (in R).
4 YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases.
The scripts were annoted with YW tags by scientists and script authors, using a very
modest training and mark-up e↵ort.1
Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all available
from the yw-idcc-15 repository on the YW GitHub site [Yes15].
Gene Expression Microarray Data Analysis
• [Normalize]
– Normalization of data across microarray datasets
• [SelectDEGs]
– Selection of differentially expressed genes between conditions
• [GO Analysis]
– determination of gene ontology statistics for the resulting datasets
• [MakeHeatmap]
– creation of a heatmap of the differentially expressed genes.
Tyler Kolisnik, Mark Bieda

initialize_run
run_log
sample_name sample_quality
calculate_strategy
rejected_sample accepted_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_idenergyframe_number
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
Data collection workflow (X-ray diffraction)

run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
│ ├── image_001.raw
... ...
│ └── image_030.raw
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

YW-RECON: Prospective & Retrospective
Provenance … (almost) for free!
cassette_id
sample_score_cutoff
sample_spreadsheet
calibration_image
initialize_run
run_log
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
collect_data_set
sample_id energy frame_number
raw_image
transform_images
corrected_image
collection_log
• URI-templates link conceptual entities
to runtime provenance “left behind” by
the script author …
• … facilitating provenance reconstructionProvenance @ SBBD'16

initialize_run
run_log
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
Data collection workflow: runtime data
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
... ...
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│

1. YW annotations => YW model
2. Files & Folders left by a run => runtime (meta-)data

initialize_run
run_log
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
Q1: What samples did the script run collect images
from?
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
... ...
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│


initialize_run
run_log
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
Q2: What energies were used for image collection from
sample DRT322?
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
... ...
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│


initialize_run
run_log
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
Q3: Where is the raw image of the corrected image
DRT322_11000ev_030.img? run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
... ...
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│


initialize_run
run_log
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
raw_image
transform_images
corrected_image
collection_log
sample_spreadsheet
calibration_image
cassette_id
sample_score_cutoff
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
... ...
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│

Q5: What cassette-id had the sample leading to
DRT240_10000ev_001.img?

João F. Pimentel, Saumen Dey, Timothy McPhillips,
Khalid Belhajjame, David Koop, Leonardo Murta,
Vanessa Braganholo, Bertram Ludäscher
Yin & Yang: Demonstrating complementary
provenance from noWorkflow &
YesWorkflow

Provenance of the Yin & Yang Demo
Dagstuhl’16
Reproducibility Seminar
Provenance-Week’16 Demo!
was_derived_from__via
TaPP’15, Edinburgh

Using Provenance from Script Runs
Example from the log-file:
2016-06-07 20:32:36 Wrote run/data/DRT240/DRT240_11000eV_002.img
But how was that image derived?? (“Provenance for Self!”)Provenance @ SBBD'16

module.__build_class__
simulate_data_collection
180 return
180 run_logger
201 return
201 new_image_file
230 parser
231 cassette_id
236 add_option
241 add_option
246 add_option
248 set_usage
251 parse_args
251 args
251 options
254 module.len
24 cassette_id
24 sample_score_cutoff
24 data_redundancy
24 calibration_image_file
30 exists
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
36 run_log
37 write
38 str(sample_score_cutoff)
38 write
49 str.format
49 sample_spreadsheet_file
50 spreadsheet_rows
cassette_q55_spreadsheet.csv
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format 51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
72 str.format
72 write
73 open
73 rejection_log
74 str.format
74 TextIOWrapper.write
50 spreadsheet_rows
51 str.format
51 write
50 sample_name
50 sample_quality
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
calibration.img
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer 120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format 106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
50 spreadsheet_rows
51 str.format
51 write
50 sample_name
50 sample_quality
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open 119 collection_log_file 120 module.writer 120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 collection_log
121 writer.writerow
50 spreadsheet_rows
128 return
run/run_log.txt
run/rejected_samples.txt
run/raw/q55/DRT240/e10000/image_001.raw
run/data/DRT240/DRT240_10000eV_001.img
run/collected_images.csv
noWorkflow:
not only
Workflow!
• Scripts have provenance, too!
• Transparently capture some/all
provenance from Python script
runs.
• Use filter queries to “zoom” into
relevant parts ..

230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 sample_name = 'DRT240'50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'61 num_images = 2
61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240'
92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
now helper df_style.py
now dataflow -v 55 -f
$(RETROSPECTIVE_LINEAGE_VALUE) -m simulation
| python df_style.py -d BT -e >
$(NW_FILTERED_LINEAGE_GRAPH).gv
.. auto-“make” this!
noWorkflow lineage
of an image file
Provenance information
about Python function calls,
variable assignments, etc.

initialize_run
run_log load_screening_results
calculate_strategy
accepted_samplerejected_sample num_imagesenergies
log_rejected_sample
rejection_log
collect_data_set
sample_id energyframe_number raw_image
transform_images
corrected_imagetotal_intensitypixel_count
collection_log
sample_spreadsheet
calibration_image
sample_score_cutoffdata_redundancy
cassette_id
YesWorkflow: Yes, scripts are Workflows, too!
• Use YW annotations
@begin...@end, @in,
@out to reveal hidden
conceptual workflow
(prospective provenance)
• Script isn't changed:
– annotations via comments
(=> language independent)
• For understanding and
sharing the “big picture”
• Query and visualize!

Alternate YW Views
initialize_run
load_screening_results calculate_strategy
log_rejected_sample
collect_data_set transform_images log_average_image_intensity
initialize_run
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
transform_images
collection_log
sample_spreadsheet
calibration_image
cassette_id
Process view
Data view
Workflow view

initialize_run
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
transform_images
collection_log
sample_spreadsheet
calibration_image
cassette_id
What is the lineage of “corrected_image”?
From here on “upwards”:
What led (leads) to this?
.. and what is irrelevant
and should be pruned??

collect_data_set
sample_id energy frame_number raw_image
calculate_strategy
accepted_sample num_imagesenergies
transform_images
corrected_image
sample_spreadsheet
calibration_image
sample_score_cutoff data_redundancy
cassette_id
Subgraph
resulting from
lineage query
on YW workflow
model
What is the lineage of
corrected_image?

initialize_run
calculate_strategy
log_rejected_sample
rejection_log
collect_data_set
transform_images
collection_log
sample_spreadsheet
calibration_image
cassette_id
collect_data_set
sample_id energy frame_number raw_image
calculate_strategy
accepted_sample num_imagesenergies
transform_images
corrected_image
sample_spreadsheet
calibration_image
sample_score_cutoff data_redundancy
cassette_id
180 return
180 run_logger
201 return
201 new_image_file
230 parser
231 cassette_id
236 add_option
241 add_option
246 add_option
248 set_usage
251 parse_args
251 args
251 options
254 module.len
24 cassette_id
24 sample_score_cutoff
24 data_redundancy
24 calibration_image_file
30 exists
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
33 exists
32 filepath
34 module.remove
36 run_log
37 write
38 write
49 str.format
49 sample_spreadsheet_file
50 spreadsheet_rows
cassette_q55_spreadsheet.csv
50 sample_name
50 sample_quality
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
72 str.format
72 write
73 open
73 rejection_log
74 str.format
74 TextIOWrapper.write
50 spreadsheet_rows
51 str.format
51 write
50 sample_name
50 sample_quality
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
calibration.img
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format 106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 module.writer
120 collection_log
121 writer.writerow
50 spreadsheet_rows
51 str.format
51 write
50 sample_name
50 sample_quality
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open 119 collection_log_file 120 module.writer 120 collection_log
121 writer.writerow
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 total_intensity
106 pixel_count
107 str.format
107 write
119 open
120 collection_log
121 writer.writerow
50 spreadsheet_rows
128 return
run/run_log.txt
run/rejected_samples.txt
run/collected_images.csv
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0 24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 sample_name = 'DRT240'50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'61 num_images = 2
61 energies = [10000, 11000, 12000] 91 sample_id = 'DRT240'
92 energy = 11000 92 frame_number = 292 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
lineage query
lineage query
YesWorkflow:
Conceptual workflow model
noWorkflow:
Python trace model
But how do we
bridge this gap???
Would like to use YW
model to query NW
data!

We’re off to see the Wizard of Prov ...
We're off to see the Wizard,
The wonderful Wizard of Prov!
--
We hear he is a wiz of a wiz
If ever a wiz there was.
--
If ever, oh ever, a wiz there was,
The Wizard of Prov is one because,
Because, because, because, because, because,
Because of the wonderful things he does.
• Enrich YW conceptual view
with NW Python provenance!
• Get the best of both worlds!
• How hard can it be to bridge
YW and NW …
(cf. TaPP’15 prototype)

Diamonds are forever
Bridges aren’t …

… new bridge-building can be stressful
… even if just painting over.

Habemus Pons!
We’ve got the Bridge!
The bridge is the journey..
(The journey is the destination)
Lineage of image file
in terms of YW
model, with details
from NW provenance

Secret Reproducible Sauce
• Combining provenance information from
noWorkflow and YesWorkflow
• Using all the good stuff:
– make, docker, Prolog, SQL, Graphviz
• Open source
– github.com/yesworkflow-org/yw-noworkflow
– github.com/gems-uff/yin-yang-demo
• Have a closer look at the demo!

Demo Time …

Demo Time

Provenance in Databases and Scientific Workflows: Part I

Recommended

Recommended

More Related Content

What's hot

What's hot (16)

Similar to Provenance in Databases and Scientific Workflows: Part I

Similar to Provenance in Databases and Scientific Workflows: Part I (20)

More from Bertram Ludäscher

More from Bertram Ludäscher (20)

Recently uploaded

Recently uploaded (20)

Provenance in Databases and Scientific Workflows: Part I