The document discusses computational provenance and the need for tracking data lineage and workflow processes. It presents several tools and projects that aim to capture and manage provenance information, including DataONE, SKOPE, Kurator, Whole-Tale, and YesWorkflow. The document argues that provenance is important for understanding what happened in computational and data-driven research in order to ensure transparency and reproducibility.
1. The document discusses several software design principles and best practices including SOLID principles, optional binding, lazy evaluation, and type casting.
2. It provides examples of applying single responsibility principle (SRP), dependency inversion principle (DIP), and interface segregation principle (ISP) to code.
3. Guidelines are also given for naming conventions, computed properties versus methods, and value types versus reference types.
This document discusses best practices for developing a chess game app called ChessMate. It covers topics like architecture patterns, design principles, testing practices, code quality, and project organization. Examples are provided to illustrate concepts like separation of concerns, dependency injection, protocol-oriented programming and value types vs reference types. The goal is to build a well-designed, extensible and maintainable chess app following industry standards.
The document contains settings for different hardware configurations including graphics cards, CPUs, memory amounts, and screen resolutions. It has sections defining baseline settings for resolution, anti-aliasing, anisotropic filtering, and other graphics options for various AMD/ATI graphics cards identified by vendor and device IDs, as well as sections grouping hardware by general performance levels.
This document discusses music recommender systems and algorithms. It describes association rules, slope one, and singular value decomposition (SVD) algorithms. It provides examples of applying association rules and discusses preprocessing steps like data cleaning and normalization. SVD is explained in more detail, including dimensionality reduction and using SVD for recommendations. The document concludes by outlining the full recommendation process from data collection to tracking user feedback to optimize recommendations.
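The SVD-based prediction step described above can be sketched numerically. This is a minimal illustration, not the document's actual pipeline: the rating matrix and the mean-imputation preprocessing are invented here purely to show how truncated SVD performs dimensionality reduction and fills in scores for unrated items.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); values are invented for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Simple preprocessing: impute missing entries with the per-item mean.
item_means = R.sum(axis=0) / (R != 0).sum(axis=0)
R_filled = np.where(R == 0, item_means, R)

# Dimensionality reduction: keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
# R_hat[u, i] is the predicted affinity of user u for item i,
# including items the user never rated.
```

The low-rank reconstruction smooths the ratings through the shared latent factors, which is what lets it produce scores for the zero entries.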
Building a queueing system in MongoDB and monitoring your cluster. Presentation by David Mytton at MongoSF May 2011 and MongoDB London User Group July 2011.
The document demonstrates how to analyze movie box office data using R. Key steps include:
1. Loading the data and checking its structure and variables.
2. Creating a histogram of the DAY_NUM variable to visualize its distribution.
3. Converting factors to numbers and aggregating the daily box office amounts by movie.
4. Creating a bar plot of the total box office amounts by movie to identify the highest-grossing films. Issues encountered during the process are also discussed.
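The aggregation steps above can be mirrored in plain Python (R itself is not assumed here). The rows and column names below are invented; the point is the convert-then-aggregate order, since summing factor-like strings is exactly the kind of issue such an analysis runs into.

```python
# Rows mimic a data frame whose numeric columns were read in as strings
# ("factors" in R); all values are invented for illustration.
rows = [
    {"MOVIE": "A", "DAY_NUM": "1", "BOX_OFFICE": "100"},
    {"MOVIE": "A", "DAY_NUM": "2", "BOX_OFFICE": "80"},
    {"MOVIE": "B", "DAY_NUM": "1", "BOX_OFFICE": "50"},
    {"MOVIE": "B", "DAY_NUM": "2", "BOX_OFFICE": "70"},
    {"MOVIE": "C", "DAY_NUM": "1", "BOX_OFFICE": "30"},
]

# Convert factor-like strings to numbers before any arithmetic.
for r in rows:
    r["DAY_NUM"] = int(r["DAY_NUM"])
    r["BOX_OFFICE"] = float(r["BOX_OFFICE"])

# Aggregate daily amounts by movie, then rank for the bar plot.
totals = {}
for r in rows:
    totals[r["MOVIE"]] = totals.get(r["MOVIE"], 0.0) + r["BOX_OFFICE"]

ranking = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
# ranking[0] -> ("A", 180.0), the highest-grossing movie in this toy data
```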
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv... (MongoDB)
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
This document provides an introduction to exploring and visualizing data using the R programming language. It discusses the history and development of R, introduces key R packages like tidyverse and ggplot2 for data analysis and visualization, and provides examples of reading data, examining data structures, and creating basic plots and histograms. It also demonstrates more advanced ggplot2 concepts like faceting, mapping variables to aesthetics, using different geoms, and combining multiple geoms in a single plot.
Presentation by David Mytton about monitoring MongoDB at the MongoSV conference 3rd Dec 2010.
A full blog series covering everything in this presentation is at http://blog.boxedice.com/mongodb-monitoring/
Presentation by David Mytton about monitoring MongoDB at the MongoUK conference 21st Mar 2011.
A full blog series covering everything in this presentation is at http://blog.boxedice.com/mongodb-monitoring/
An overview of the Mach-O format, in particular where constant character strings are located and where ObjC 1.0/2.0 classes and methods are defined. All of this driven by a concrete, real-world need: being able to refactor code after it has been compiled.
This document summarizes Mikhail Khludnev's presentation on custom queries in Solr. It discusses different types of custom queries like phrase queries, deeply branched vs flat queries, and the steadiness problem in earlier Lucene versions. It also covers solutions to problems like heavy leapfrog, minShouldMatch performance, and filtering performance. The document contains examples and diagrams to illustrate inverted indexes, scoring, and term-at-time vs doc-at-time searching.
Bulletin of the South Ural State University. Series: Mathematics, Mecha... (Иван Иванов)
- The article deals with surfaces of negative Gaussian curvature that can be bijectively projected onto a circle.
- The author provides sufficient conditions for the existence of an estimate of the circle radius onto which the surface can be projected.
- Specifically, if the Gaussian curvature is bounded above by a negative constant, an estimate of the minimum possible radius of the projecting circle can be determined.
Presentation given by Neil Rubens at the Centre for Database and Information Systems (Prof. Ricci), Free University of Bozen-Bolzano
For more information see http://activeintelligence.org/research/al-rs/
Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO (Altinity Ltd)
- The document summarizes a presentation about ClickHouse, an open source column-oriented database management system.
- It discusses how ClickHouse stores and indexes data to enable fast queries, how it scales horizontally across servers, and how different engines like MergeTree and ReplicatedMergeTree allow for high performance and fault tolerance.
- Examples are provided showing how ClickHouse can quickly analyze large datasets with SQL and optimize queries using its features like distributed processing, partitioning, and specialized functions.
The document is an owner's manual for Clarion multimedia stations with a 7-inch or 6.5-inch touch-panel display. It describes features such as DVD/CD/MP3 playback and covers touch-panel and remote-control operation, basic and advanced operations, specifications, and installation instructions.
Secretary of State for Environment, Food and Rural Affairs
<owl:Class rdf:about="http://reference.data.gov.uk/id/department/defra/grade/">
<rdfs:subClassOf rdf:resource="http://reference.data.gov.uk/def/central-government/CivilServicePost"/>
</owl:Class>
DEFRA is a Ministerial Department
<owl:Class rdf:about="http://reference.data.gov.uk/def/central-government/MinisterialDepartment">
<rdfs:subClassOf rdf:resource="http://reference.data.gov.uk/def/central-government/Department"/>
<r
Fighting fraud: finding duplicates at scale (Highload++ 2019) (Alexey Grigorev)
The document discusses duplicate detection in online marketplaces with large amounts of user-generated content. It describes a two-step framework for finding duplicate listings: candidate selection to identify potentially duplicate pairs, followed by candidate scoring using machine learning to identify true duplicates. Key aspects include using category, location, seller data, and image hashes to select candidate pairs, and training ML models on text and image similarity features to classify pairs as duplicates or not. Elasticsearch is used to index hashes at scale for fuzzy matching of image duplicates.
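The two-step framework described above can be sketched in a few lines. This is a toy stand-in, not the talk's actual system: the listings, the blocking key, and the 0.7 threshold are invented, and a single text-similarity feature stands in for the trained ML model.

```python
from collections import defaultdict
from difflib import SequenceMatcher

listings = [
    {"id": 1, "category": "phones", "city": "berlin", "title": "iPhone 12 64GB black"},
    {"id": 2, "category": "phones", "city": "berlin", "title": "iphone 12, 64 gb, black"},
    {"id": 3, "category": "cars",   "city": "berlin", "title": "VW Golf 2015"},
]

# Step 1: candidate selection -- only compare listings sharing a blocking key
# (here category + city; the real system also uses seller data and image hashes).
blocks = defaultdict(list)
for item in listings:
    blocks[(item["category"], item["city"])].append(item)

candidates = []
for group in blocks.values():
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            candidates.append((group[i], group[j]))

# Step 2: candidate scoring -- classify each pair as duplicate or not.
def score(a, b):
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()

duplicates = [(a["id"], b["id"]) for a, b in candidates if score(a, b) > 0.7]
```

Blocking keeps the pairwise comparison from being quadratic over the whole marketplace: only the phone listings are ever compared with each other here.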
This document discusses the dplyr package for R and its creator Romain Francois. It provides an overview of the main verbs in dplyr like filter, select, arrange, mutate, and summarise which allow manipulating data frames. It also discusses grouping data with group_by and joining data with functions like inner_join. The document emphasizes that dplyr provides a fast and convenient grammar for working with data frames using the pipe operator %>%.
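The verb grammar described above can be approximated outside R with a pandas method chain, where chaining plays the role of the %>% pipe. The data and column names below are invented; each line is annotated with the dplyr verb it mimics.

```python
import pandas as pd

# A toy data frame; all values invented for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "fish"],
    "weight": [4.0, 10.0, 12.0, 0.2],
})

result = (
    df[df["weight"] > 1]                              # filter(weight > 1)
      .assign(weight_lb=lambda d: d["weight"] * 2.2)  # mutate(weight_lb = weight * 2.2)
      .sort_values("weight")                          # arrange(weight)
      .groupby("species", as_index=False)             # group_by(species)
      .agg(mean_weight=("weight", "mean"))            # summarise(mean_weight = mean(weight))
)
```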
The Ring programming language version 1.6 book - Part 90 of 189 (Mahmoud Samir Fayed)
This document contains documentation for functions in the Ring library and SDL library related to drawing primitives, rendering, textures, windows, and surfaces. It includes functions for drawing lines, rectangles, circles and other shapes, creating and managing textures, windows and rendering contexts, and converting between pixel formats.
MongoDB Europe 2016 - Debugging MongoDB Performance (MongoDB)
Asya is back, and so is Sherlock Holmes and his techniques to gather and analyze data from your poorly performing MongoDB clusters. In this advanced talk we take a deep look at all the diagnostic data that lives inside MongoDB - how to interrogate and interpret it to help you solve those frustrating performance bottlenecks that we all face occasionally.
The document discusses Spark operations like map, filter, reduceByKey, and their execution across partitions. It provides examples of transforming RDDs with word count and joining datasets. Machine learning algorithms like linear regression are also covered, including creating labeled point datasets, training models, and evaluating predictions. Logs and errors from running Spark tests in Python are displayed.
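The word-count pattern mentioned above can be sketched without a Spark cluster (PySpark is not assumed available here): plain Python stands in for the map and reduceByKey stages, with sorting by key playing the role of the shuffle.

```python
from functools import reduce
from itertools import groupby
from operator import add

# Invented input lines standing in for an RDD of text.
lines = ["to be or not to be", "to think"]

# flatMap/map stage: split each line into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey stage: co-locate pairs by key (the "shuffle"), then fold
# each key's values with the combine function.
def reduce_by_key(kv_pairs, fn):
    keyed = sorted(kv_pairs, key=lambda kv: kv[0])
    return {k: reduce(fn, (v for _, v in group))
            for k, group in groupby(keyed, key=lambda kv: kv[0])}

counts = reduce_by_key(pairs, add)
# counts == {'be': 2, 'not': 1, 'or': 1, 'think': 1, 'to': 3}
```

In real Spark the combine function also runs per partition before the shuffle, which is why reduceByKey requires an associative, commutative operation.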
Beyond PHP - it's not (just) about the code (Wim Godden)
Most PHP developers focus on writing code. But creating Web applications is about much more than just writing PHP. Take a step outside the PHP cocoon and into the big PHP ecosphere to find out how small code changes can make a world of difference on servers and network. This talk is an eye-opener for developers who spend over 80% of their time coding, debugging and testing.
LSGAN - SIMPle (Simple Idea Meaningful Performance Level up) (Hansol Kang)
LSGAN uses an MSE (least-squares) loss in place of the standard GAN loss, which yields more realistic generated data.
A review of the LSGAN paper and a PyTorch-based implementation.
[References]
Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
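The least-squares objective can be written out directly. The discriminator scores below are made-up numbers purely for illustration; the label convention a=0 (fake), b=1 (real), c=1 (what the generator wants fakes scored as) follows the Mao et al. paper cited above.

```python
# LSGAN replaces the usual sigmoid cross-entropy GAN loss with MSE.
d_real = [0.9, 0.8]   # D's scores on real samples (invented)
d_fake = [0.2, 0.1]   # D's scores on generated samples (invented)

def mse(xs, target):
    return sum((x - target) ** 2 for x in xs) / len(xs)

a, b, c = 0.0, 1.0, 1.0
# Discriminator: push real scores toward b, fake scores toward a.
d_loss = 0.5 * mse(d_real, b) + 0.5 * mse(d_fake, a)   # = 0.025
# Generator: push fake scores toward c.
g_loss = 0.5 * mse(d_fake, c)                           # = 0.3625
```

Unlike the saturating cross-entropy loss, this penalty keeps growing with distance from the target label, which is the source of the stronger gradients the paper reports.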
Reconciling Conflicting Data Curation Actions: Transparency Through Argument... (Bertram Ludäscher)
Yilin Xia (yilinx2@illinois.edu),
Shawn Bowers (bowers@gonzaga.edu),
Lan Li (lanl2@illinois.edu), and
Bertram Ludäscher (ludaesch@illinois.edu)
Presented at IDCC-2024 in Edinburgh.
ABSTRACT. We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program PAF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool.
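The well-founded (grounded) reasoning the abstract describes can be sketched as a fixpoint computation over a tiny argumentation framework. The arguments and attack relation below are invented stand-ins for conflicting updates, not the paper's actual encoding into the logic program PAF.

```python
# Conflicting updates modeled as an abstract AF: a and b attack each other
# (a genuine conflict), c is unattacked (an uncontroversial update) and
# attacks d (an unjustified one). All names are invented for illustration.
args = {"a", "b", "c", "d"}
attacks = {("a", "b"), ("b", "a"), ("c", "d")}

def acceptable(x, s):
    """x is defended by s if every attacker of x is itself attacked from s."""
    attackers = {u for (u, v) in attacks if v == x}
    return all(any((w, u) in attacks for w in s) for u in attackers)

# Grounded (well-founded) semantics: least fixpoint of the characteristic
# function F(S) = {x in args | x is acceptable w.r.t. S}.
grounded = set()
while True:
    nxt = {x for x in args if acceptable(x, grounded)}
    if nxt == grounded:
        break
    grounded = nxt
# grounded == {"c"}: c is accepted, d is rejected (attacked by an accepted
# argument), and the mutual a/b conflict stays undecided -- exposed to users.
```

This mirrors the three outcomes the abstract lists: uncontroversial updates accepted, unjustified ones rejected, and the remaining ambiguities surfaced.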
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion (Bertram Ludäscher)
Research Seminar Talk (online) at KRR@UP (Uni Potsdam) on Dec 6, 2023, loosely based on a paper with the same title at the 7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3)
Similar to From Provenance Standards and Tools to Queries and Actionable Provenance
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion! (Bertram Ludäscher)
7th Workshop on Advances in Argumentation in Artificial Intelligence (AI3) at
AIxIA 2023: 22nd International Conference of the Italian Association for Artificial Intelligence.
Presentation of a paper by Bertram Ludäscher, Shawn Bowers, and Yilin Xia, given virtually on November 9, 2023.
[Flashback] Integration of Active and Deductive Database Rules (Bertram Ludäscher)
Slides of my PhD defense at the University of Freiburg, 1998.
Statelog and similar state-oriented extensions of Datalog have seen renewed interest subsequently, e.g., see
[Hel10] Hellerstein, J.M., 2010. The declarative imperative: experiences and conjectures in distributed logic. ACM SIGMOD Record, 39(1), pp.5-19.
[AMC+11] Alvaro, P., Marczak, W.R., Conway, N., Hellerstein, J.M., Maier, D. and Sears, R., 2011. Dedalus: Datalog in time and space. In Datalog Reloaded: First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers (pp. 262-281). Springer.
[Flashback] Statelog: Integration of Active & Deductive Database Rules (Bertram Ludäscher)
This document discusses Statelog, which integrates active and deductive database rules. Statelog allows both active rules, which trigger actions and modify the database, and deductive rules, which derive new facts. It defines the semantics of different types of rules and how they interact. Statelog guarantees termination of rule evaluation at both compile-time and runtime through techniques like state-stratification and delta-monotonicity. It can express complex temporal queries and supports features like nested transactions.
Answering More Questions with Provenance and Query Patterns (Bertram Ludäscher)
This document discusses using provenance information to improve transparency and reproducibility in research. It begins by asking questions about the input data, methods, and parameter settings used in a study in order to assess its reliability. It then provides examples of how workflow systems can capture provenance at both the design level (prospective provenance) and runtime level (retrospective provenance). These include a Kepler workflow that simulates X-ray data collection and provenance traces captured by DataONE. The document argues that provenance is a critical link between workflow modeling and runtime traces that can increase trust in research findings.
Computational Reproducibility vs. Transparency: Is It FAIR Enough? (Bertram Ludäscher)
Keynote at CLIR Workshop (Webinar): Toward Open, Reproducible, and Reusable Research. February 10, 2021. https://reusableresearch.com/
ABSTRACT. The “reproducibility crisis” has resulted in much interest in methods and tools to improve computational reproducibility. FAIR data principles (data should be findable, accessible, interoperable, and reusable) are also being adapted and evolved to apply to other artifacts, notably computational analyses (scientific workflows, Jupyter notebooks, etc.). The current focus on computational reproducibility of scripts and other computational workflows sometimes overshadows a somewhat neglected and arguably more important issue: transparency of data analysis, including data wrangling and cleaning. In this talk I will ask the question: What information is gained by conducting a reproducibility experiment? This leads to a simple model (PRIMAD) that aims to answer this question by sorting out different scenarios. Finally, I will present some features of Whole-Tale, a computational platform for reproducible and transparent computational experiments.
By Michael Gryk and Bertram Ludäscher. Presented at 2020 JCDL-SIGCM Workshop, August 1, 2020.
ABSTRACT. Conceptual models can serve multiple purposes: communication of information between stakeholders, information abstraction and generalization, and information organization for archival and retrieval. An ongoing research question is how to formally define the fit-for-purpose of a conceptual model as well as to define metrics or tests to determine whether a given model faithfully supports a designated purpose.
This paper summarizes preliminary investigations in this area by presenting toy problems along with different conceptual models for the system under study. It is argued that the different models are adequate in supporting a sophisticated query and yet they adopt different normalization schemes and will differ in expressiveness depending on the implied purpose of the models. As the subtitle suggests, this work is intended to be primarily exploratory as to the constraints a formal system would require in defining the “usefulness”, “expressiveness” and “equivalence” of conceptual models.
From Workflows to Transparent Research Objects and Reproducible Science Tales (Bertram Ludäscher)
The document discusses prospective and retrospective provenance in scientific workflows. Prospective provenance involves modeling the workflow design, while retrospective provenance records the workflow execution. The YesWorkflow and noWorkflow tools demonstrate these two types of provenance. YesWorkflow annotates scripts to recreate a workflow model from the script, while noWorkflow records step-by-step runtime logs. Combining both approaches provides a more complete view of a workflow's provenance. Maintaining provenance is important for reproducibility and understanding the origins of scientific results.
From Research Objects to Reproducible Science Tales (Bertram Ludäscher)
University of Southampton. Electronics & Computer Science. Research Seminar (Invited Talk).
TITLE: From Research Objects to Reproducible Science Tales
ABSTRACT. Rumor has it that there is a reproducibility crisis in science. Or maybe there are multiple crises? What do we mean by reproducibility and replicability anyways? In this talk I will first make an attempt at sorting out some of the terminological confusion in this area, focusing on computational aspects. The PRIMAD model is another attempt to describe different aspects of reproducibility studies by focusing on the "delta" between those studies and the original study. In addition to these more theoretical investigations, I will discuss practical efforts to create more reproducible and more transparent computational platforms such as the one developed by the Whole-Tale project: here 'tales' are executable research objects that may combine data, code, runtime environments, and narratives (i.e., the traditional "science story"). I will conclude with some thoughts about the remaining challenges and opportunities to bridge the large conceptual gaps that continue to exist despite the recognition of problems of reproducibility and transparency in science.
ABOUT the Speaker. Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the Department of Computer Science at Illinois. Until 2014 he was a professor at the Department of Computer Science at the University of California, Davis. His research interests range from practical questions in scientific data and workflow management, to database theory and knowledge representation and reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the University of Karlsruhe (now K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, in Germany.
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us (Bertram Ludäscher)
PWE: Datalog & ASP for the Rest of Us discusses using Possible Worlds Explorer (PWE) to make Datalog and Answer Set Programming (ASP) more accessible to non-experts. It covers topics like using provenance to explain query results, capturing rule firings to track provenance, representing provenance as a graph, using states to track derivation rounds, and declarative profiling of Datalog programs. The presentation advocates for tools like PWE that wrap Datalog/ASP engines to combine them with Python ecosystems and allow interactive use in Jupyter notebooks. This makes the languages more approachable and helps users build on existing work by experimenting further.
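The rule-firing provenance idea mentioned above can be sketched with a naive bottom-up evaluation of a two-rule Datalog program. The edge facts and rule names are invented for illustration; this is not PWE's actual machinery, just the pattern of recording, per derived fact, which firing produced it.

```python
# A tiny Datalog program:
#   r1: path(X,Y) :- edge(X,Y).
#   r2: path(X,Y) :- edge(X,Z), path(Z,Y).
edge = {("a", "b"), ("b", "c"), ("c", "d")}

# fact -> provenance: the rule firing (rule name + body facts) that first
# derived it. Base rule r1 seeds the relation.
path = {}
for x, y in edge:
    path[(x, y)] = ("r1", ("edge", x, y))

# Naive fixpoint: keep applying r2 until no new facts appear.
changed = True
while changed:
    changed = False
    for x, z in edge:
        for (z2, y), _ in list(path.items()):
            if z2 == z and (x, y) not in path:
                path[(x, y)] = ("r2", ("edge", x, z), ("path", z, y))
                changed = True
# path[("a","d")] now records the r2 firing that explains why a reaches d.
```

Walking these recorded firings backwards yields exactly the provenance graph of a query answer: each derived fact points at the facts it was built from.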
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise (Bertram Ludäscher)
Deductive Databases & Logic Programs: Back to the Future!
Colloquium talk on the occasion of the retirement of Prof. Dr. Georg Lausen, May 10th, 2019, Universität Freiburg, Germany
Dissecting Reproducibility: A case study with ecological niche models in th... (Bertram Ludäscher)
1) The document describes a workshop on research synthesis and reproducibility.
2) It discusses challenges with reproducibility in science and proposes provenance and conceptual tools like PRIMAD to help address these challenges.
3) The document presents a case study where an intern was able to reproduce results from a 2006 ecological niche modeling paper using the Whole Tale environment and MaxEnt software, demonstrating computational reproducibility.
Incremental Recomputation: Those who cannot remember the past are condemned ... (Bertram Ludäscher)
Talk given at "Problems and techniques for Incremental Re-computation: provenance and beyond".
A workshop co-organized with Provenance Week 2018
King's College London, 12th and 13th July, 2018
Organizers: Paolo Missier (Newcastle University), Tanu Malik (DePaul University), Jacek Cala (Newcastle University)
Abstract: Incremental recomputation has applications, e.g., in databases and workflow systems. Methods and algorithms for recomputation depend on the underlying model of computation (MoC) and model of provenance (MoP). This relation is explored with some examples from databases and workflow systems.
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations (Bertram Ludäscher)
Presentation slides of paper by Shawn Bowers, Timothy McPhillips, and Bertram Ludäscher, given by Shawn at Provenance and Annotation of Data and Processes - 7th International Provenance and Annotation Workshop, IPAW 2018, King's College London, UK, July 9-10, 2018.
The paper won the IPAW best paper award: https://twitter.com/kbelhajj/status/1017082775856467968
ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for specifying detailed dependency relationships between inputs and outputs of computation steps. These annotations are used to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of the reasoning framework using answer-set programming.
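The lineage "false positive" problem in the abstract can be illustrated with a small sketch. The step ports below (in_1, out_a, etc.) are invented for the example, not taken from the paper:

```python
# A step reads two inputs and writes two outputs, but in reality each output
# depends on only one input. Fine-grained annotations record exactly that:
ANNOTATED_DEPS = {          # output -> the inputs it actually depends on
    "out_a": {"in_1"},
    "out_b": {"in_2"},
}

def all_to_all_deps(inputs, outputs):
    """Default assumption in many systems: every output depends on every input."""
    return {out: set(inputs) for out in outputs}

# Compare the coarse assumption against the annotated ground truth.
coarse = all_to_all_deps({"in_1", "in_2"}, {"out_a", "out_b"})
false_positives = {out: coarse[out] - ANNOTATED_DEPS[out] for out in coarse}
# false_positives now lists the spurious lineage edges the default would infer.
```

Here the all-to-all assumption would wrongly report that out_a depends on in_2 and out_b on in_1; the annotations eliminate exactly those edges.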
An ontology-driven framework for data transformation in scientific workflows (Bertram Ludäscher)
Presentation given by Bertram at the Data Integration in the Life Sciences (DILS) Workshop in Leipzig, Germany, 2004.
Reference:
Bowers, Shawn, and Bertram Ludäscher. "An ontology-driven framework for data transformation in scientific workflows." In International Workshop on Data Integration in the Life Sciences (DILS), pp. 1-16. Springer, 2004.
So this isn't new -- but still relevant :-)
ABSTRACT. Ecologists spend considerable effort integrating heterogeneous data for statistical analyses and simulations, for example, to run and test predictive models. Our research is focused on reducing this effort by providing data integration and transformation tools, allowing researchers to focus on "real science," that is, discovering new knowledge through analysis and modeling. This paper defines a generic framework for transforming heterogeneous data within scientific workflows. Our approach relies on a formalized ontology, which serves as a simple, unstructured global schema. In the framework, inputs and outputs of services within scientific workflows can have structural types and separate semantic types (expressions of the target ontology). In addition, a registration mapping can be defined to relate input and output structural types to their corresponding semantic types. Using registration mappings, appropriate data transformations can then be generated for each desired service composition. Here, we describe our proposed framework and an initial implementation for services that consume and produce XML data.
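The core idea of separating semantic from structural types can be sketched roughly as follows. The toy ontology and concept names are invented for illustration and are not the paper's actual framework:

```python
# Toy ontology as child -> parent (subclass) edges; a real system would use a
# formalized ontology rather than a dict.
ONTOLOGY = {
    "TreeRingWidth": "Measurement",
    "Temperature": "Measurement",
}

def is_subtype(concept, target):
    """Walk up the subclass chain to check whether `concept` specializes `target`."""
    while concept is not None:
        if concept == target:
            return True
        concept = ONTOLOGY.get(concept)
    return False

def semantically_compatible(producer_out, consumer_in):
    """A service composition is semantically valid if the producer's output
    concept specializes the concept the consumer expects, independently of
    the structural (e.g., XML) types involved."""
    return is_subtype(producer_out, consumer_in)
```

For example, a service producing TreeRingWidth data can feed a service expecting any Measurement, even if their XML schemas differ; the registration mappings described in the abstract would then drive the structural transformation.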
The document describes the Whole Tale platform, which aims to facilitate reproducibility in computational research. Whole Tale allows researchers to package computational narratives, data, code, and provenance information into "tales" that can be shared and re-executed. Key features of Whole Tale include running interactive notebooks, versioning and sharing tales, and integrating provenance tracking tools to provide transparency into computational workflows. The speaker demonstrates several example tales and discusses upcoming Whole Tale features and applications in different domains like archaeology, astronomy, and materials science.
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion (Bertram Ludäscher)
The document discusses two ideas: 1) Embracing multiple possible worlds by using techniques like answer set programming to represent alternative scenarios rather than a single consensus view. 2) Abandoning strict adherence to technology stacks and standards ("techno-ligion") by focusing on simple powerful solutions, using natural language when possible, and paying a fee each time a complex technical term is used. It suggests using techniques like technology golf to explore problems through minimal programs instead of lengthy debates over formal representations.
6. "The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases."
Ludäscher: Queries & Actionable Provenance 6
Why we need data lineage and computational provenance
7. Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
→ understand methods, dataflow, and dependencies
(Figure: report cover, "Climate Change Impacts in the United States," U.S. National Climate Assessment, U.S. Global Change Research Program)
8. Evolution towards the Living Paper
• 1st Generation: narrative (prose)
• 2nd Generation: plus … name .. identify .. include (access to) data
• 3rd Generation: plus … name .. reference .. include code (software) .. and provenance … and exec environment (containers)
Whole Tale
Whole Tale Dashboard
12. Adding YesWorkflow to DataONE
(Figure: Yaxing's script with inputs & output products; Christopher's YesWorkflow model; Christopher using Yaxing's outputs as inputs for his script. Christopher's results can be traced back all the way to Yaxing's input.)
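The traceback pictured on the slide boils down to a transitive closure over file-level dependency edges. A minimal sketch, with invented file names standing in for the two scripts' actual DataONE artifacts:

```python
from collections import deque

# Hypothetical file-level provenance edges: output -> inputs it was derived from.
DERIVED_FROM = {
    "yaxing_products.csv": ["yaxing_input.csv"],          # Yaxing's script
    "christopher_results.csv": ["yaxing_products.csv"],   # Christopher's script
}

def trace_back(artifact):
    """Return every upstream artifact that `artifact` transitively depends on."""
    seen, queue = set(), deque([artifact])
    while queue:
        for parent in DERIVED_FROM.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

With these edges, trace_back("christopher_results.csv") reaches all the way back to "yaxing_input.csv", which is exactly the cross-script lineage the slide illustrates.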
16. SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study the rain-fed maize agriculture of the Anasazi (Four Corners; AD 600–1500). Climate change influenced the Mesa Verde migrations of the late 13th century AD. The study uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates the joint information in tree rings and a climate signal to identify the "best" tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler. A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R script …
19. YW Demo Use Cases (IDCC'17)

Domain                      | Use case                                   | Programming language | Provenance methods
----------------------------|--------------------------------------------|----------------------|-----------------------
Climate science             | C3C4                                       | MATLAB               | YW + MATLAB RunManager
Astrophysics                | LIGO                                       | Python               | YW + NW (code-level)
Protein crystal samples     | Simulate data collection                   | Python               | YW + NW (code-level)
Biodiversity data curation  | kurator-SPNHC                              | Python               | YW-recon + YW-logging
Social network analysis     | Twitter                                    | Python               | YW + NW (file-level)
Oceanography                | OHIBC Howe Sound (multi-run, multi-script) | R                    | YW + R RunManager
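Most of these use cases annotate ordinary scripts with YesWorkflow's comment keywords (@begin, @in, @out, @end), which a YW extractor reads without executing the code. A minimal hypothetical example of the style; the function, ports, and cleaning rule are invented for illustration:

```python
# YW-style annotations live entirely in comments, so the script runs unchanged;
# keeping them in sync with the code is the author's responsibility.

# @begin clean_readings
# @in raw_readings
# @out clean_readings
def clean_readings(raw_readings):
    """Drop sensor readings outside an assumed plausible range (-90..60 C)."""
    return [r for r in raw_readings if -90.0 <= r <= 60.0]
# @end clean_readings

# @begin summarize
# @in clean_readings
# @out mean_reading
def summarize(readings):
    """Compute the mean of the cleaned readings."""
    return sum(readings) / len(readings)
# @end summarize
```

From annotations like these, YW builds the prose-level workflow model (steps, ports, dataflow edges) that the table's provenance methods combine with runtime observables.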
25. Hybrid Provenance: YW Model + Runtime Observables (file level)
• The YW model can be connected with runtime observables
• → YW recon (provenance reconstruction)
• Here: what specific files were read and written, and where do they occur in the workflow?
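The reconstruction step above amounts to matching observed file paths against URI templates from the YW model. A rough sketch, assuming a simplified `{variable}` template syntax in the spirit of YW's @uri annotations; the port names and paths are invented:

```python
import re

def template_to_regex(template):
    """Turn a template like 'runs/{run_id}/clean.csv' into a compiled regex
    with one named group per {variable}."""
    parts = re.split(r"(\{\w+\})", template)
    pieces = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            pieces.append("(?P<%s>[^/]+)" % part[1:-1])  # bind the variable
        else:
            pieces.append(re.escape(part))               # literal path text
    return re.compile("".join(pieces) + "$")

def match_observed(templates, observed_paths):
    """Map each observed path to the workflow port it instantiates, plus the
    variable bindings recovered from the path."""
    matches = {}
    for port, template in templates.items():
        rx = template_to_regex(template)
        for path in observed_paths:
            m = rx.match(path)
            if m:
                matches[path] = (port, m.groupdict())
    return matches
```

Matching "runs/42/clean.csv" against the template "runs/{run_id}/clean.csv" ties that concrete file to the model's output port and recovers run_id = "42", linking runtime observables back to their place in the workflow.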