From Workflows to Transparent
Research Objects and Reproducible
Science Tales
Bertram Ludäscher
ludaesch@illinois.edu
Director, Center for Informatics Research in Science & Scholarship (CIRSS)
School of Information Sciences (iSchool@Illinois)
& National Center for Supercomputing Applications (NCSA)
& Department of Computer Science (CS@Illinois)
PARSEC Synthesis Workshop
2020-07-01
B. Ludäscher: Workflows & Provenance
Overview
• Scientific Workflows: What are we doing?
– What are they and why should you care?
• Provenance: What have we done?
– Prospective and retrospective provenance
– Better together! (e.g. YesWorkflow & noWorkflow)
• Transparent, Reproducible Research Objects:
– The Whole Tale project
• Misc (or next time .. )
– Agreeing to disagree: taxonomy alignment with Euler/X
Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
→ traceable data- and wf-evolution
→ Reproducible Science
Trident
Workbench
VisTrails
Once upon a time …
10 Essential functions of a scientific workflow system
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in
parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what happened during workflow execution: retrospective provenance.
7. Reveal retrospective provenance – how workflow products were derived from
inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services
themselves.
These functions (not just dataflow & actors) distinguish scientific workflow
automation from general scientific software development.
Src: Timothy McPhillips
Find OTUs
(OTUHunter)
Assign Taxonomy
(STAP)
Profile alignment
(STAP or Infernal)
Build phylogenetic
tree (RAxML or
Quicktree)
View tree:
Dendroscope
UniFrac: tree &
environment file
Assembled
contigs
Chimera check
(Mallard)
Diversity statistics:
Text: OTU list, Chao1, Shannon
Graphs: rarefaction curves, rank-
abundance curves
Visualization tools:
Cytoscape networks &
Heat map
WATERS:
Workflow for Alignment, Taxonomy,
Ecology of Ribosomal Sequences
(Amber Hartman; Eisen Lab; UC Davis)
+/- cipres
+/- cluster
+/- cluster
+/- cluster
Executable WATERS Workflow in Kepler
Example
Bioinformatics
Workflow:
Motif-Catcher
Marc Facciotti et al.
UC Davis Genome Center
Motif-Catcher workflow, implemented in Kepler
S Köhler et al. Improved Motif Detection in Large Sequence Sets with
Random Sampling in a Kepler workflow, ICCS-WS, 2012
A Data-Streaming Workflow over Sensor Data
• Monitor and control supercomputer
simulations
– 50+ composite actors (subworkflows)
– 4 levels of hierarchy
– 1000+ atomic (Java) actors
43 actors, 3 levels
196 actors, 4 levels
30 actors
206 actors, 4 levels
137 actors
33 actors
150
123 actors
66 actors
12 actors
243 actors, 4 levels
Norbert Podhorszki
ORNL (then: UC Davis)
“Plumbing” workflow
A Reproducibility (Transparency!) Crisis
• Does science have a
(different) reproducibility
crisis (crises)?
• Focus here:
Computational
Reproducibility
– R, Matlab, Python, .. scripts
– Scientific workflows, ...
• How to facilitate reproducibility
for computational and data scientists?
Provenance defined …
• Oxford English Dictionary
– The place of origin or earliest known history of something:
• an orange rug of Iranian provenance
– The beginning of something’s existence; its origin:
• they try to understand the whole universe, its provenance and fate
– A record of ownership of a work of art or an antique, used as a
guide to authenticity or quality:
• the manuscript has a distinguished provenance
• What is the origin (provenance!) of “provenance” ?
Provenance: keeping records …
• Grand Canyon’s rock layers are a record of the early geologic history of North America.
The ancestral puebloan granaries at Nankoweap Creek tell archaeologists about more
recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
• Not shown: computational archaeologists reconstructing past climate from multiple tree-ring
databases → computational provenance is key for transparency & reproducibility
… and Understanding what happened!
… frozen accidents
Zrzavý, Jan, David Storch, and Stanislav
Mihulka. Evolution: Ein Lese-Lehrbuch.
Springer-Verlag, 2009.
Author: Jkwchui (Based on
drawing by Truth-seeker2004)
Computational Provenance …
• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow
→ understand methods, dataflow, and dependencies
Climate Change Impacts
in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
Kurator: Data Curation Workflows
(Filtered-Push … Kepler … Kurator projects)
Runtime Provenance
(a.k.a. traces, logs,
retrospective
provenance,
“Trace-land”)
Workflow Modeling & Design
(a.k.a. prospective provenance
“Workflow-land”)
Workflows ⇔ Provenance: a critical link!
Workflow Thinking: The limits of my language
mean the limits of my world …
• Vanilla Process Network
• Functional Programming Dataflow Network
• XML Transformation Network
• Collection-oriented Modeling & Design framework (COMAD)
– Look Ma: No Shims!
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler … study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late
13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-
temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm
estimates joint information in tree-rings and a climate signal to identify “best” tree-ring
chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed
maize agricultural niche in the US Southwest. Nature
Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
Provenance Support for Reproducible Science
Example: Paleoclimate Reconstruction
Science paper (OA) uses:
• open source code:
– R, PaleoCAR, …
• Is that all we need?
• What was the
“workflow”?
• Is there prospective
and/or retrospective
provenance?
How come? What’s the data provenance?
• What input data
was used? At
what spatio-
temporal
resolution?
• How does the
model work? (ML
method)
• What code was
run (and how
many times), with
what parameter
settings to
produce which
products?
How come? Read the paper(s)!
• Papers are
(increasingly) open
access; data and
code is (increasingly)
available, e.g. on
github.
• Still: significant
hurdles to
(computationally)
build upon prior
work, data products,
etc.
YesWorkflow: Prospective & Retrospective
Provenance … (almost) for free!
• YW annotations in a
(Python, R, …) script
recreate a workflow
view from the script …
[Figure: YesWorkflow (YW) graph recreated from the annotated script]
Inputs: cassette_id, sample_score_cutoff,
sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv),
calibration_image (file:calibration.img)
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample,
collect_data_set, transform_images, log_average_image_intensity
Intermediate data: sample_name, sample_quality, rejected_sample, accepted_sample, num_images,
energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path
Data with file URIs: run_log (file:run/run_log.txt),
rejection_log (file:/run/rejected_samples.txt),
raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw),
corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img),
collection_log (file:run/collected_images.csv)
@BEGIN .. @END ..
@IN .. @OUT ..
@URI .. @LOG ..
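A minimal sketch of what such annotations can look like inside a Python script. The step and data names echo those on this slide; the function bodies, the acceptance rule, and the sample DRT110 are assumptions made up for illustration. Only the comment keywords (@BEGIN, @END, @IN, @OUT, @URI) are YesWorkflow's.

```python
# @BEGIN simulate_data_collection
# @IN sample_score_cutoff
# @IN sample_spreadsheet @URI file:cassette_{cassette_id}_spreadsheet.csv
# @OUT rejection_log @URI file:run/rejected_samples.txt


def load_screening_results(rows):
    # @BEGIN load_screening_results
    # @IN sample_spreadsheet
    # @OUT sample_name
    # @OUT sample_quality
    for sample_name, sample_quality in rows:
        yield sample_name, sample_quality
    # @END load_screening_results


def calculate_strategy(sample_name, sample_quality, sample_score_cutoff):
    # @BEGIN calculate_strategy
    # @IN sample_name
    # @IN sample_quality
    # @IN sample_score_cutoff
    # @OUT accepted_sample
    # @OUT rejected_sample
    # Assumed rule (illustration only): accept samples at or above the cutoff.
    if sample_quality >= sample_score_cutoff:
        return sample_name, None
    return None, sample_name
    # @END calculate_strategy


# @END simulate_data_collection

# Hypothetical screening results; DRT110 is invented for this example.
rows = [("DRT240", 45), ("DRT110", 3)]
decisions = [calculate_strategy(name, quality, 12.0)
             for name, quality in load_screening_results(rows)]
```

Because the annotations live in comments, the script runs exactly as before; a tool that scans the comments can rebuild the step/data graph without executing anything.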
Adding YesWorkflow to DataONE
Yaxing’s script with
inputs & output
products
Christopher’s
YesWorkflow
model
Christopher using
Yaxing’s outputs as
inputs for his script
Christopher’s results
can be traced back all
the way to Yaxing’s
input
• Data Observation Network for Earth (DataONE)
– Network of earth science data repositories (member nodes)
– Large NSF DataNet project to Discover, Share, Use …
– … earth science data: ecology, biodiversity, …
• My R&D focus: provenance tools & technologies, ProvONE:
– W3C PROV model extended to combine retrospective & prospective provenance
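To make the idea of combining the two kinds of provenance concrete, here is a small sketch. It is not ProvONE's vocabulary or serialization; the class names Step and Execution, the function upstream, and the file names are all hypothetical. A Step records what is declared (prospective), an Execution records what one run actually used and generated (retrospective), and a lineage query walks the executions backwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Prospective provenance: what a step is declared to consume and produce."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class Execution:
    """Retrospective provenance: what one run of a step actually read and wrote."""
    step: Step
    used: tuple
    generated: tuple


def upstream(artifact, executions):
    """Return every artifact that `artifact` was (transitively) derived from."""
    found = set()
    frontier = [artifact]
    while frontier:
        current = frontier.pop()
        for execution in executions:
            if current in execution.generated:
                for source in execution.used:
                    if source not in found:
                        found.add(source)
                        frontier.append(source)
    return found


# Hypothetical two-script chain: one person's output is another's input.
clean = Step("clean", ("raw",), ("cleaned",))
plot = Step("plot", ("cleaned",), ("figure",))
runs = [
    Execution(clean, ("raw.csv",), ("cleaned.csv",)),
    Execution(plot, ("cleaned.csv",), ("figure.png",)),
]
lineage = upstream("figure.png", runs)
```

Here `lineage` contains both `cleaned.csv` and `raw.csv`, which is the "trace results back to the original input" question asked of a provenance store.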
Provenance in DataONE
A DataONE search (here: “grass”) yields different packages with Data Provenance
(not covered: Semantic Search)
Exploring Provenance in DataONE
• Let’s go there → Mark Carls. 2017. Analysis of hydrocarbons following
the Exxon Valdez oil spill, Gulf of Alaska, 1989 - 2014. Gulf of Alaska
Data Portal. urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171.
DataONE: Search and Provenance Display
João F. Pimentel, Saumen Dey, Timothy McPhillips,
Khalid Belhajjame, David Koop, Leonardo Murta,
Vanessa Braganholo, Bertram Ludäscher
Yin & Yang: Demonstrating complementary
provenance from noWorkflow &
YesWorkflow
[Figure: full noWorkflow provenance graph of one simulate_data_collection run. Nodes are
line-numbered Python function calls and variable assignments (e.g. 61 calculate_strategy,
92 collect_next_image, 106 transform_image, 121 writer.writerow), repeated once per sample,
energy, and frame. Files in the graph: cassette_q55_spreadsheet.csv, calibration.img,
run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, raw images
run/raw/q55/DRT240/e10000/image_001.raw through run/raw/q55/DRT322/e11000/image_002.raw, and
corrected images run/data/DRT240/DRT240_10000eV_001.img through
run/data/DRT322/DRT322_11000eV_002.img]
noWorkflow:
not only
Workflow!
• Scripts have provenance, too!
• Transparently capture some/all
provenance from Python script
runs.
• Use filter queries to “zoom” into
relevant parts ..
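A rough sketch of what "transparent capture" means, using only the standard library. This is not how noWorkflow is implemented; the helper capture_calls and the two toy functions are assumptions for illustration. The script under observation is not modified: a trace hook records every Python-level call while it runs, and a simple filter then "zooms" into the calls of interest.

```python
import sys


def capture_calls(func, *args, **kwargs):
    """Run func and record (function name, first line number) for each Python call."""
    calls = []

    def tracer(frame, event, arg):
        if event == "call":
            calls.append((frame.f_code.co_name, frame.f_lineno))
        return None  # no line-by-line tracing inside the called frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever hook was active before
    return result, calls


# Toy stand-ins for a data collection script (hypothetical).
def transform_image(value):
    return value * 2


def collect(count):
    images = []
    for index in range(count):
        images.append(transform_image(index))
    return images


result, calls = capture_calls(collect, 2)
names = [name for name, _ in calls]
# "Zoom": keep only the calls to one function of interest.
transform_calls = [call for call in calls if call[0] == "transform_image"]
```

A real tool would also record arguments, return values, variable assignments, and file accesses; the sketch only shows that the capture can happen without touching the script.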
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img
$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv
.. auto-“make” this!
noWorkflow lineage
of an image file
Provenance information
about Python function calls,
variable assignments, etc.
[Figure: YesWorkflow graph of simulate_data_collection]
Inputs: cassette_id, sample_score_cutoff, data_redundancy, sample_spreadsheet, calibration_image
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample,
collect_data_set, transform_images, log_average_image_intensity
Data: run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images,
energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image,
total_intensity, pixel_count, collection_log
YesWorkflow: Yes, scripts are Workflows, too!
• Use YW annotations
@begin...@end, @in,
@out to reveal hidden
conceptual workflow
(prospective provenance)
• Script isn't changed:
– annotations via comments
(=> language independent)
• For understanding and
sharing the “big picture”
• Query and visualize!
[Figure: four graphs side by side. Top: the full YesWorkflow graph of simulate_data_collection
and, via a lineage query, its subgraph for corrected_image (steps load_screening_results,
calculate_strategy, collect_data_set, transform_images). Bottom: the full noWorkflow trace
graph and, via a lineage query, the filtered lineage of
run/data/DRT240/DRT240_11000eV_002.img]
YesWorkflow:
Conceptual workflow model
noWorkflow:
Python trace model
But how do we
bridge this gap???
Would like to use YW
model to query NW
data!
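One plausible way to bridge the two models, sketched under stated assumptions and not necessarily how the YW/NW bridge is built: if each YW step is known to cover a range of source lines, every line-numbered NW trace record can be assigned to the step containing it. The line ranges below are hypothetical; the trace records reuse line numbers and values shown on the noWorkflow lineage slide. The function names are invented for this example.

```python
def step_for_line(line, step_ranges):
    """Return the YW step whose (first, last) line range contains `line`, if any."""
    for step, (first, last) in step_ranges.items():
        if first <= line <= last:
            return step
    return None


def group_trace_by_step(trace, step_ranges):
    """Group (line, name, value) trace records under the YW step they fall in."""
    grouped = {}
    for line, name, value in trace:
        step = step_for_line(line, step_ranges)
        if step is not None:
            grouped.setdefault(step, []).append((name, value))
    return grouped


# Hypothetical line ranges for three YW steps (e.g. taken from @BEGIN/@END positions).
step_ranges = {
    "calculate_strategy": (55, 65),
    "collect_data_set": (85, 95),
    "transform_images": (100, 110),
}

# Trace records as (line, variable, value).
trace = [
    (230, "parser", "<optparse.OptionParser object>"),  # outside every step
    (61, "accepted_sample", "DRT240"),
    (61, "num_images", 2),
    (92, "energy", 11000),
    (106, "corrected_image_file", "run/data/DRT240/DRT240_11000eV_002.img"),
]

by_step = group_trace_by_step(trace, step_ranges)
```

With such a mapping, a question phrased in YW terms ("what did calculate_strategy produce?") can be answered from NW data.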
Habemus Pons!
We’ve got the Bridge!
The bridge is the journey..
(The journey is the destination)
Lineage of image file
in terms of YW
model, with details
from NW provenance
DwCA Taxon Lookup
Workflow
• Declare inputs, outputs, and
steps of a script (or wf) with
YW annotations to ...
– communicate provenance
graphically (via graphviz)
– combine different forms of
provenance
– query provenance
• Simple YW annotations in
comments:
– @BEGIN Step, @END Step
– @IN Data, @OUT Data
– @URI Template, @LOG Pattern
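A sketch of how a @URI template could be matched against files found after a run, so that the template variables are recovered from the paths. This illustrates the idea only and is not YesWorkflow's own matcher; the function names are hypothetical, and the assumption that a variable never spans a "/" is mine. The templates and paths follow the file naming shown earlier in the deck.

```python
import re


def template_to_regex(template):
    """Compile 'a/{x}/b_{y}.raw' into a regex with named groups x and y."""
    parts = re.split(r"\{(\w+)\}", template)  # literal, name, literal, name, ...
    seen = set()
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)
        elif part in seen:
            pattern += f"(?P={part})"  # repeated variable must match the same text
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"  # assumption: variables never contain '/'
    return re.compile(pattern)


def match_uri(template, path):
    """Return the variable bindings if `path` fits `template`, else None."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None


raw_template = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
bindings = match_uri(raw_template, "run/raw/q55/DRT240/e10000/image_001.raw")

img_template = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"
img_bindings = match_uri(img_template, "data/DRT240/DRT240_11000eV_002.img")
```

The recovered bindings (which cassette, sample, energy, and frame a file belongs to) are file-level observables that can be attached to the corresponding data node in the YW model.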
Taxon Lookup Workflow:
Data View and Process View
The story of
two individual
records
• One took the GBIF
route, while …
• … the other went
all WORMS!
Non-Marine? → GBIF
Marine? → WORMS
The aggregate story ..
• How many records were
observed as inputs or outputs
of workflow steps?
• Were there any NULL values?
How many?
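Aggregate questions like these can be answered from any per-step record log. A minimal Python sketch follows, assuming a hypothetical list of (step, port, value) observations in which `None` stands for a NULL value; the step names and records are invented for illustration and are not taken from the actual workflow.

```python
from collections import Counter

# Hypothetical runtime observations: (step, port, value); None models a NULL value.
observations = [
    ("lookup_taxon", "in", "Mytilus edulis"),
    ("lookup_taxon", "in", None),
    ("lookup_taxon", "out", "Mytilus edulis"),
    ("write_report", "in", "Mytilus edulis"),
]

def summarize(obs):
    """Count records and NULLs seen at each (step, port) pair."""
    records = Counter()
    nulls = Counter()
    for step, port, value in obs:
        records[(step, port)] += 1
        if value is None:
            nulls[(step, port)] += 1
    return records, nulls

records, nulls = summarize(observations)
```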
Hybrid Provenance:
YW Model + Runtime
Observables (file level)
• The YW model can be connected
with runtime observables
• → YW recon (provenance reconstruction)
• Here:
• What specific files were read,
written and where do they occur
in the workflow?
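One way to picture file-level reconstruction is to match a URI template against the files found after a run. The Python sketch below is an assumption-laden illustration, not YW's implementation: the template `run/{sample}/img_{frame}.raw` and the observed paths are hypothetical, and variables are assumed never to span a `/`.

```python
import re

def template_to_regex(template):
    """Turn 'run/{sample}/img_{frame}.raw' into a regex with named groups."""
    pattern = ""
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            # A template variable: capture one path segment under its name.
            pattern += f"(?P<{part[1:-1]}>[^/]+)"
        else:
            # Literal text between variables.
            pattern += re.escape(part)
    return re.compile(pattern + r"\Z")

def reconstruct(template, paths):
    """Return the variable bindings for every path that fits the template."""
    regex = template_to_regex(template)
    bindings = []
    for path in paths:
        match = regex.match(path)
        if match:
            bindings.append(match.groupdict())
    return bindings

observed = ["run/DRT240/img_001.raw", "run/DRT240/img_002.raw", "run/notes.txt"]
matches = reconstruct("run/{sample}/img_{frame}.raw", observed)
```

Each binding ties a concrete file back to the annotated data element whose template it matched, which is what lets the prospective model be queried together with what was actually read or written.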
YesWorkflow Summary
• Lightweight YW annotations can
be added easily to your scripts to
reap workflow benefits
– Documentation of what’s
important
– Visualization of dependencies
– Querying provenance (prospective,
retrospective, and hybrid)
– Independent of system or language
used (R, Python, MATLAB, workflow
tools, …)
→ make provenance actionable
→ provenance for self!
=> github.com/yesworkflow-org/yw
=> try.yesworkflow.org
YW Demo Use Cases (IDCC’17)
Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
• SKOPE: system and tools to discover, access,
analyze, visualize paleoenvironmental data
– unprecedented ability to explore provenance
(detailed, comprehensible record of computational
derivation of results)
– for researchers, tinkerers, and modelers
• Whole Tale:
– leverage & contribute to existing CI to support the
whole tale (“living paper”), from workflow run to
scholarly publication
– integrate tools & CI (DataONE, Globus, iRODS,
NDS, ...) to simplify use and promote best
practices.
– driven by science WGs (Archaeology/SKOPE,
materials science, astro, bio ..)
Project Vignettes
Whole Tale: The next step in the evolution of the
scholarly article: The “Living [Frozen?] Paper”
• 1st Generation:
– narrative (prose)
• 2nd Generation: plus …
– name .. identify .. include (access to) data
• 3rd Generation: plus …
– name .. reference .. include code (software) ..
– and provenance … and exec environment (containers)
Whole Tale
Whole Tale Dashboard
Whole Tale Vision
Tale
Data
Code
D1PROV
WT Architecture
https://dashboard.wholetale.org
Example Tale:
LIGO gravitational wave detection
(tutorial Jupyter notebook)
https://dashboard.wholetale.org
What is Whole Tale?
● NSF-funded Data Infrastructure Building Blocks (DIBBs)
project
● Platform to create, publish, and execute tales
● Simplify process of creating & verifying reproducible
computational artifacts
● https://dashboard.wholetale.org
Why Whole Tale?
● Increased reliance on computation across domains
○ new skill requirements for researchers
● Open Science changing norms and expectations
○ increased emphasis on sharing data & code
○ … with transparency and reproducibility in
mind!
○ => from sharing data to sharing research objects
○ FAIR principles
Whole Tale:
Enables Computational Science
Whole Tale & the Elements of a …
Reproducible Computational Research Platform
Easy-to-access cloud-based computational environments
Transparent access to research data
Collaborate and share with others
Export or publish executable research objects
Re-execute, review, verify, re-use
Develop, Analyze, Share, Package, Reproduce
Coming soon
Whole Tale Roles and Stakeholders
Researchers, Grad Students
Editors, Publishers
Analysis
Publish & Re-use
Verify
Badging, Verification
Scientific Software + Data Repositories
Reviewers, Curators
Develop & Analyze with Whole Tale
● Easy to access cloud-based environments
○ Your laptop in the cloud
● Popular tools
○ + … extensible!
● Work with data & code in transparent
(provenance-enabled) ways
○ Automatic data citation
○ Automatic computational provenance capture (coming soon)
Package & Reproduce with Whole Tale
● Executable Research Objects
● Publish or export to research archives
● Compatible with new norms for
reproducibility and transparency
● For verification and re-use
Whole Tale and DataONE
● Discover & access data from any DataONE repository
● Analyze data in Whole Tale
● Package & publish tales to Metacat-based repositories
● Provenance support
What exactly is (in) a Tale?
● Verifiable
● Remixable
● Standards-based
✓ Tale: Research object
○ data, code, narrative, compute environment
✓ Executable
✓ Transparent
✓ Publishable
Whole Tale Platform Overview
Research & Quantitative Computational Environments
External Data Sources
Code + Narrative
● Authenticate using your institutional identity
● Access commonly-used computational environments
● Easily customize your environment (via repo2docker)
● Reference and access externally registered data
● Create or upload your data and code
● Add metadata (including provenance information)
● Submit code, data, and environment to archival repository
● Get a persistent identifier
● Share for verification and re-use
Create tale, Analyze data, Publish Tale
Coming Soon:
Tale Creation Workflow
"Analyze in WT" or register data by URL or digital object identifier
Create a Tale, entering a name and selecting interactive environment
A container is launched based on selected environment with an empty workspace and external data mounted read-only
Create/upload code and scripts
Execute code/scripts to generate results/outputs
Export the Tale in compressed BagIt-RO format to run locally for verification.
Publish the tale to a supported repository, generating a persistent identifier.
Customize environment adding special packages/software dependencies
Re-execute in Whole Tale
Enter descriptive metadata including authors, title, description, and illustration image
schema:author
schema:name
schema:category
pav:createdBy
schema:license
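The metadata terms above can be pictured as a small JSON-LD-style record. The Python sketch below is illustrative only: all values are made up, the nesting of author and creator entries is an assumption, and the layout is not claimed to match Whole Tale's actual manifest. The two namespace URLs are the published schema.org and PAV prefixes.

```python
import json

# Hypothetical tale metadata using the terms listed on the slide.
tale_metadata = {
    "@context": {"schema": "http://schema.org/", "pav": "http://purl.org/pav/"},
    "schema:name": "Example tale",
    "schema:category": "science",
    "schema:author": [{"schema:givenName": "Ada", "schema:familyName": "Example"}],
    "pav:createdBy": {"schema:email": "ada@example.org"},
    "schema:license": "CC-BY-4.0",
}

# Terms named on the slide; treating them all as mandatory is an assumption.
SLIDE_TERMS = ["schema:author", "schema:name", "schema:category",
               "pav:createdBy", "schema:license"]

missing = [term for term in SLIDE_TERMS if term not in tale_metadata]
serialized = json.dumps(tale_metadata, indent=2, sort_keys=True)
```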
Demo: Analyzing Seal Migration Patterns
A research team is preparing to publish a
manuscript describing a computational model
for estimating animal movement paths from
telemetry data:
● Telemetry data published in Research
Workspace
● Analysis and visualization in RStudio
● Existing routines stored in local R files
● Analysis requires specialized R packages
● Publish results for the community in
DataONE
Based on: J.M. London and D.S. Johnson. Alaska bearded and spotted seal example dataset and
analysis. https://github.com/jmlondon/crwexampleakbs, 2019
Live Demo or Demo Video
Key features
Supported environments
● Extension to Binder's repo2docker
○ Jupyter, JupyterLab
○ RStudio (based on Rocker Project)
○ OpenRefine
● Coming soon:
○ MATLAB, Stata
Key features
Supported data repositories
● Register data from supported research data repositories
● Referenced data is cited
○ Ideally eventually contributing to citation counts
● Publish tales back to research repositories
Key features
Export to BagIt-RO
● BagIt: archival format
● Re-runnable in WT
● BagIt-RO
○ Open archival format
○ Research Object support
○ Extended for Big Data
tale/
  bagit.txt
  bag-info.txt
  data/
    workspace/
      run.py
      LICENSE
      requirements.txt
      output.csv
    LICENSE
  metadata/
    manifest.json
  manifest-sha1.txt
  start-here/
    README.md
  tagmanifest-sha1.txt
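The payload manifest in such a bag lists one checksum and relative path per line, which is what makes an exported tale verifiable. The Python sketch below shows the idea under simplifying assumptions: it builds a tiny throwaway bag with a single hypothetical file, and it ignores details such as percent-encoded paths and `fetch.txt` entries. For real bags, an established BagIt library is the safer choice.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_manifest(bag_dir, algorithm="sha1"):
    """Compare each payload file's digest with its manifest-<algorithm>.txt entry."""
    bag = Path(bag_dir)
    results = {}
    for line in (bag / f"manifest-{algorithm}.txt").read_text().splitlines():
        if not line.strip():
            continue
        # Each manifest line is "<checksum> <relative path>".
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.new(algorithm, (bag / rel_path).read_bytes()).hexdigest()
        results[rel_path] = (actual == expected.lower())
    return results

# Build a tiny throwaway bag so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp)
    payload = bag / "data" / "workspace" / "run.py"
    payload.parent.mkdir(parents=True)
    payload.write_text("print('hello tale')\n")
    digest = hashlib.sha1(payload.read_bytes()).hexdigest()
    (bag / "manifest-sha1.txt").write_text(f"{digest}  data/workspace/run.py\n")
    check = verify_manifest(bag)
```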
Key features
Export and Run Locally
● Natural outcome of Tale export and repo2docker
● Download a zip file (BagIt-RO)
● run-local.sh
○ Build image (repo2docker)
○ Fetch external data (bdbag)
○ Execute (Docker)
Coming soon
● Tapis/Agave data sources
● Sharing/collaboration
● Create tale from Git repository
● Image preservation
● System provenance capture
● Better user experience
Some References
(Kepler, Kurator, YesWorkflow, Whole-Tale, Reproducibility, Euler/X)
1. Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J. and Zhao, Y., 2006. Scientific
workflow management and the Kepler system. Concurrency and computation: Practice and experience, 18(10), pp.1039-1065.
2. McPhillips, T., Bowers, S., Zinn, D. and Ludäscher, B., 2009. Scientific workflow design for mere mortals. Future Generation
Computer Systems, 25(5), pp.541-551.
3. Morris, P.J., Hanken, J., Lowery, D., Ludäscher, B., Macklin, J., McPhillips, T., Wieczorek, J. and Zhang, Q., 2018. Kurator: Tools
for Improving Fitness for Use of Biodiversity Data. Biodiversity Information Science and Standards, 2, p.e26539
4. McPhillips, T., Song, T., Kolisnik, T., Aulenbach, S., Belhajjame, K., Bocinsky, R.K., Cao, Y., Cheney, J., Chirigati, F., Dey, S., Freire, J.,
Jones, C., Hanken, J., Kintigh, K.W., Kohler, T.A., Koop, D., Macklin, J.A., Missier, P., Schildhauer, M., Schwalm, C., Wei, Y., Bieda, M. and
Ludäscher, B., 2015. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from
Scripts. International Journal of Digital Curation, 10, pp.298-313.
5. McPhillips, T., Bowers, S., Belhajjame, K. and Ludäscher, B., 2015. Retrospective Provenance Without a Runtime Provenance
Recorder. In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
6. Brinckman, A., Chard, K., Gaffney, N., Hategan, M., Jones, M.B., Kowalik, K., Kulasekaran, S., Ludäscher, B., Mecum, B.D.,
Nabrzyski, J. and Stodden, V., 2019. Computing environments for reproducibility: Capturing the “Whole Tale”. Future
Generation Computer Systems, 94, pp.854-867.
7. Chard, K., Gaffney, N., Jones, M.B., Kowalik, K., Ludäscher, B., McPhillips, T., Nabrzyski, J., Stodden, V., Taylor, I., Thelen, T.,
Turk, M.J. and Willis, C., 2019. Application of BagIt-Serialized Research Object Bundles for Packaging and Re-execution of
Computational Analyses. In 2019 IEEE 15th International Conference on e-Science (e-Science). IEEE.
8. Chard, K., Gaffney, N., Jones, M.B., Kowalik, K., Ludäscher, B., Nabrzyski, J., Stodden, V., Taylor, I., Turk, M.J. and Willis, C.,
2019, June. Implementing Computational Reproducibility in the Whole Tale Environment. In Proceedings of the 2nd
International Workshop on Practical Reproducible Evaluation of Computer Systems (pp. 17-22). ACM.
9. McPhillips, T., Willis, C., Gryk, M., Nunez-Corrales, S., Ludäscher, B. 2019. Reproducibility by Other Means: Transparent
Research Objects. In 2019 IEEE 15th International Conference on e-Science (e-Science). IEEE.
10. Franz, N.M., Chen, M., Kianmajd, P., Yu, S., Bowers, S., Weakley, A.S. and Ludäscher, B., 2016. Names are not good enough:
reasoning over taxonomic change in the Andropogon complex. Semantic Web, 7(6), pp.645-667.
Whole Tale Collaboration (PI Team)
● U Illinois (NCSA) Bertram Ludäscher, Victoria Stodden, Matt Turk
○ overall lead (co-operative agreement)
○ reproducibility; provenance; open source software
development; outreach
● U Chicago (Globus) Kyle Chard
○ data transfer & storage; compute; infrastructure
● UC Santa Barbara (NCEAS) Matt Jones
○ (meta-)data publishing; provenance; repositories
● U Texas, Austin (TACC) Niall Gaffney
○ compute; HTC; “big tale”; Science Gateways
● U Notre Dame (CRC) Jarek Nabrzyski
○ UX design; UI design
• Given two taxonomies and expert
articulations, find the merged
(=aligned) taxonomy that logically
follows.
• Problems:
– underconstrained alignment:
ambiguity; many possible worlds
(PWs) …
– overconstrained: inconsistency;
no PW
• Euler uses ASP, RCC reasoning to
infer merged taxonomies;
diagnose inconsistencies; reduce
ambiguity
github.com/EulerProject/EulerX
Data Cleaning: Theory & Practice 69
Other Research Bits:
Logic-based Taxonomy Alignment in EulerX
with Prof. Nico Franz, Curator of Insects @ ASU
Is reproducibility really so complicated?
• Reproducibility crisis?
• Terminology crisis?
• Or gullibility crisis?
• What is reproducibility anyway?
• And who is responsible for it?
Towards Reproducible Science Tales 70

More Related Content

What's hot

The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)Oscar Corcho
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksCarole Goble
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Carole Goble
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Sciencedgarijo
 
Machines are people too
Machines are people tooMachines are people too
Machines are people tooPaul Groth
 
SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...Carole Goble
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceCarole Goble
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Research Data Alliance
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsPaul Groth
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Carole Goble
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...Carole Goble
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasAngelo Salatino
 

What's hot (20)

FAIRy Stories
FAIRy StoriesFAIRy Stories
FAIRy Stories
 
The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)The role of annotation in reproducibility (Empirical 2014)
The role of annotation in reproducibility (Empirical 2014)
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research Objects
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017
 
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...SEEK for Science: A Data and Model Management Platform to support Open and Re...
SEEK for Science: A Data and Model Management Platform to support Open and Re...
 
ROHub
ROHubROHub
ROHub
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data Science
 
Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...Reproducibility challenges in computational settings: what are they, why shou...
Reproducibility challenges in computational settings: what are they, why shou...
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016Reproducibility, Research Objects and Reality, Leiden 2016
Reproducibility, Research Objects and Reality, Leiden 2016
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 

Similar to From Workflows to Transparent Research Objects

YesWorkflow: Yes, Scripts can be Workflows, Too!
YesWorkflow: Yes, Scripts can be Workflows, Too!YesWorkflow: Yes, Scripts can be Workflows, Too!
YesWorkflow: Yes, Scripts can be Workflows, Too!Bertram Ludäscher
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
Kurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsKurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsBertram Ludäscher
 
DAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsDAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsBertram Ludäscher
 
From Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceFrom Data to Knowledge with Workflows & Provenance
From Data to Knowledge with Workflows & ProvenanceBertram Ludäscher
 
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)
EarthCube's OceanLink - Project Overview and Presentation Updates (March 2014)EarthCube
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
Provenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible ScienceProvenance and DataONE: Facilitating Reproducible Science
Provenance and DataONE: Facilitating Reproducible ScienceBertram Ludäscher
 
A Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & WorkflowsA Sightseeing Tour of Provenance in Databases & Workflows
A Sightseeing Tour of Provenance in Databases & WorkflowsBertram Ludäscher
 
GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)GSLIS Research Showcase Presentation (Expanded)
GSLIS Research Showcase Presentation (Expanded)Bertram Ludäscher
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...Anubhav Jain
 
YesWorkflow: Retrospective Provenance Without a Runtime Provenance Recorder
YesWorkflow: Retrospective Provenance Without a Runtime Provenance RecorderYesWorkflow: Retrospective Provenance Without a Runtime Provenance Recorder
YesWorkflow: Retrospective Provenance Without a Runtime Provenance RecorderBertram Ludäscher
 
ICSSP-Panel Austin, May 15, 2016.
ICSSP-Panel Austin, May 15, 2016.ICSSP-Panel Austin, May 15, 2016.
ICSSP-Panel Austin, May 15, 2016.Bertram Ludäscher
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesBertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseBertram Ludäscher
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Bertram Ludäscher
 

Similar to From Workflows to Transparent Research Objects (20)

YesWorkflow: Yes, Scripts can be Workflows, Too!
YesWorkflow: Yes, Scripts can be Workflows, Too!YesWorkflow: Yes, Scripts can be Workflows, Too!
YesWorkflow: Yes, Scripts can be Workflows, Too!
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Kurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere MortalsKurator: Towards Data Curation for Mere Mortals
Kurator: Towards Data Curation for Mere Mortals
 
DAIS Seminar: The Many Faces of Provenance in Databases and Workflows
DAIS Seminar: The Many Faces of Provenance in Databases and WorkflowsDAIS Seminar: The Many Faces of Provenance in Databases and Workflows





From Workflows to Transparent Research Objects

  • 1. From Workflows to Transparent Research Objects and Reproducible Science Tales. Bertram Ludäscher, ludaesch@illinois.edu. Director, Center for Informatics Research in Science & Scholarship (CIRSS), School of Information Sciences (iSchool@Illinois) & National Center for Supercomputing Applications (NCSA) & Department of Computer Science (CS@Illinois). PARSEC Synthesis Workshop, 2020-07-01. B. Ludäscher: Workflows & Provenance
  • 2. Overview • Scientific Workflows: What are we doing? – What are they and why should you care? • Provenance: What have we done? – Prospective and retrospective provenance – Better together! (e.g. YesWorkflow & noWorkflow) • Transparent, Reproducible Research Objects: – The Whole Tale project • Misc (or next time ...) – Agreeing to disagree: taxonomy alignment with Euler/X
  • 3. Scientific Workflows: ASAP • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able to handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage → traceable data- and wf-evolution → Reproducible Science. [Pictured: Trident Workbench, VisTrails. Once upon a time …]
  • 4. 10 Essential functions of a scientific workflow system 1. Automate programs and services scientists already use. 2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible. 3. Manage dataflow to, from, and between programs and services. 4. Enable scientists (not just developers) to author or modify workflows easily. 5. Predict what a workflow will do when executed: prospective provenance. 6. Record what happened during workflow execution: retrospective provenance. 7. Reveal retrospective provenance – how workflow products were derived from inputs via programs and services. 8. Organize intermediate and final data products as desired by users. 9. Enable scientists to version, share and publish their workflows. 10. Empower scientists who wish to automate additional programs and services themselves. These functions (not just dataflow & actors) distinguish scientific workflow automation from general scientific software development. Src: Timothy McPhillips
  • 5. WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis). Steps: Find OTUs (OTUHunter); Assign Taxonomy (STAP); Profile alignment (STAP or Infernal); Build phylogenetic tree (RAxML or Quicktree); Assembled contigs; Chimera check (Mallard). Outputs: View tree: Dendroscope; UniFrac: tree & environment file; Diversity statistics: Text: OTU list, Chao1, Shannon; Graphs: rarefaction curves, rank-abundance curves; Visualization tools: Cytoscape networks & heat map. Step annotations: +/- CIPRES, +/- cluster
  • 6. Executable WATERS Workflow in Kepler
  • 7. Example Bioinformatics Workflow: Motif-Catcher. Marc Facciotti et al., UC Davis Genome Center
  • 8. Motif-Catcher workflow, implemented in Kepler. S. Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012
  • 9. A Data-Streaming Workflow over Sensor Data
  • 10. “Plumbing” workflow • Monitor and control supercomputer simulations – 50+ composite actors (subworkflows) – 4 levels of hierarchy – 1000+ atomic (Java) actors. Norbert Podhorszki, ORNL (then: UC Davis). [Figure labels, subworkflow sizes: 43 actors, 3 levels; 196 actors, 4 levels; 30 actors; 206 actors, 4 levels; 137 actors; 33 actors; 150; 123 actors; 66 actors; 12 actors; 243 actors, 4 levels]
  • 11. A Reproducibility (Transparency!) Crisis • Does science have a (different) reproducibility crisis (crises)? • Focus here: Computational Reproducibility – R, Matlab, Python, ... scripts – Scientific workflows, ... • How to facilitate reproducibility for computational and data scientists?
  • 12. Provenance defined … • Oxford English Dictionary – The place of origin or earliest known history of something: • an orange rug of Iranian provenance – The beginning of something’s existence; its origin: • they try to understand the whole universe, its provenance and fate – A record of ownership of a work of art or an antique, used as a guide to authenticity or quality: • the manuscript has a distinguished provenance • What is the origin (provenance!) of “provenance”?
  • 13. Provenance: keeping records … • Grand Canyon’s rock layers are a record of the early geologic history of North America. The ancestral Puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0) • Not shown: computational archaeologists reconstructing past climate from multiple tree-ring databases → computational provenance is key for transparency & reproducibility
  • 14. … and Understanding what happened! … frozen accidents. Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009. Author: Jkwchui (Based on drawing by Truth-seeker2004)
  • 15. Computational Provenance … • Origin, processing history of artifacts – data products, figures, ... – also: underlying workflow → understand methods, dataflow, and dependencies. [Pictured: Climate Change Impacts in the United States, U.S. National Climate Assessment, U.S. Global Change Research Program]
  • 16. Kurator: Data Curation Workflows (Filtered-Push … Kepler … Kurator projects)
  • 17. Workflows ⇔ Provenance: a critical link! Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”) ⇔ Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
  • 18. Workflow Thinking: The limits of my language mean the limits of my world … • Vanilla Process Network • Functional Programming Dataflow Network • XML Transformation Network • Collection-oriented Modeling & Design framework (COMAD) – Look Ma: No Shims!
  • 19. SKOPE: Synthesized Knowledge Of Past Environments. Bocinsky, Kohler … study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction. K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618 … implemented as an R Script …
  • 20. Provenance Support for Reproducible Science. Example: Paleoclimate Reconstruction. Science paper (OA) uses: • open source code: – R, PaleoCAR, … • Is that all we need? • What was the “workflow”? • Is there prospective and/or retrospective provenance?
  • 21. How come? What’s the data provenance? • What input data was used? At what spatio-temporal resolution? • How does the model work? (ML method) • What code was run (and how many times), with what parameter settings to produce which products?
  • 22. How come? Read the paper(s)! • Papers are (increasingly) open access; data and code are (increasingly) available, e.g. on GitHub. • Still: significant hurdles to (computationally) build upon prior work, data products, etc.
  • 23. YesWorkflow: Prospective & Retrospective Provenance … (almost) for free! • YW annotations in a (Python, R, …) script recreate a workflow view from the script … Annotation keywords: @BEGIN .. @END .. @IN .. @OUT .. @URI .. @LOG .. [Figure: YW workflow view. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: cassette_id; sample_score_cutoff; sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv); calibration_image (file:calibration.img); run_log (file:run/run_log.txt); sample_name; sample_quality; rejected_sample; accepted_sample; num_images; energies; rejection_log (file:/run/rejected_samples.txt); sample_id; energy; frame_number; raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw); corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img); total_intensity; pixel_count; corrected_image_path; collection_log (file:run/collected_images.csv)]
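To make the idea on slide 23 concrete, here is a minimal sketch of how comment annotations can yield a workflow view. It is not the YesWorkflow tool or its API: the parser `extract_blocks`, the helper `dataflow_edges`, and the two-step script fragment are hypothetical, written only for illustration, and the step and data names are borrowed loosely from the figure labels on the slide.

```python
import re

# Hypothetical script fragment with YW-style comment annotations.
# The function bodies are elided; only the comments matter here.
ANNOTATED_SCRIPT = '''
# @BEGIN load_screening_results
# @IN sample_spreadsheet @URI file:cassette_{cassette_id}_spreadsheet.csv
# @OUT sample_name
# @OUT sample_quality
def load_screening_results(path):
    ...
# @END load_screening_results

# @BEGIN calculate_strategy
# @IN sample_name
# @IN sample_quality
# @IN sample_score_cutoff
# @OUT accepted_sample
# @OUT rejected_sample
def calculate_strategy(name, quality, cutoff):
    ...
# @END calculate_strategy
'''

# Toy pattern: matches the first block-level keyword on a comment line.
TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)")


def extract_blocks(text):
    """Collect the declared inputs and outputs of each annotated block."""
    blocks = {}
    current = None
    for line in text.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "BEGIN":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "END":
            current = None
        elif current is not None:
            blocks[current][tag.lower()].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link a producer to a consumer whenever an @OUT name matches an @IN name."""
    edges = []
    for producer, p in blocks.items():
        for consumer, c in blocks.items():
            for data in p["out"]:
                if data in c["in"]:
                    edges.append((producer, data, consumer))
    return edges


if __name__ == "__main__":
    for producer, data, consumer in dataflow_edges(extract_blocks(ANNOTATED_SCRIPT)):
        print(f"{producer} --{data}--> {consumer}")
```

The point of the sketch is that the annotations alone, without running the script, are enough to recover a prospective dataflow graph; the real tool additionally handles nesting, @URI templates, and @LOG, which this toy ignores.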
  • 24. Adding YesWorkflow to DataONE. Yaxing’s script with inputs & output products; Christopher’s YesWorkflow model; Christopher using Yaxing’s outputs as inputs for his script. Christopher’s results can be traced back all the way to Yaxing’s input.
  • 25. • Data Observation Network for Earth (DataONE) – Network of earth science data repositories (member nodes) – Large NSF DataNet project to Discover, Share, Use … – … earth science data: ecology, biodiversity, … • My R&D focus: provenance tools & technologies, ProvONE: – W3C PROV model extended to combine retrospective & prospective provenance
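The tracing described on slides 24 and 25 can be sketched as a small lineage query. This is a simplified illustration under stated assumptions, not the ProvONE schema or any DataONE API: only the relation names `used` and `wasGeneratedBy` come from W3C PROV, while the file names, execution ids, the `program_of` mapping, and the `upstream` function are hypothetical.

```python
from collections import defaultdict

# Retrospective provenance as PROV-style relations (hypothetical records).
# used: (execution, data it read); was_generated_by: (data, execution that wrote it)
used = [
    ("run_yaxing", "raw_input.csv"),
    ("run_christopher", "cleaned.csv"),
]
was_generated_by = [
    ("cleaned.csv", "run_yaxing"),
    ("figure.png", "run_christopher"),
]

# Prospective side: the script each execution is a run of (hypothetical names).
program_of = {
    "run_yaxing": "clean_data.py",
    "run_christopher": "plot_results.py",
}


def upstream(artifact):
    """Walk backwards from an artifact through the executions that produced it."""
    generator = dict(was_generated_by)
    inputs = defaultdict(list)
    for execution, data in used:
        inputs[execution].append(data)

    lineage = []
    seen = set()
    frontier = [artifact]
    while frontier:
        item = frontier.pop()
        execution = generator.get(item)
        if execution is None or execution in seen:
            continue  # a primary input, or an execution already visited
        seen.add(execution)
        lineage.append((item, execution, program_of[execution]))
        frontier.extend(inputs[execution])
    return lineage


if __name__ == "__main__":
    for data, execution, program in upstream("figure.png"):
        print(f"{data} was generated by {execution} (a run of {program})")
```

Joining the recorded executions with the programs they instantiate is what lets a downstream result be traced back through a collaborator's script to the original input.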
  • 26. Provenance in DataONE. A DataONE search (here: “grass”) yields different packages with Data Provenance (not covered: Semantic Search)
  • 27. Exploring Provenance in DataONE • Let’s go there → Mark Carls. 2017. Analysis of hydrocarbons following the Exxon Valdez oil spill, Gulf of Alaska, 1989 - 2014. Gulf of Alaska Data Portal. urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171.
  • 28. DataONE: Search and Provenance Display
  • 29. DataONE: Search and Provenance Display
  • 30. Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow. João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher
  • 31. noWorkflow: not only Workflow! • Scripts have provenance, too! • Transparently capture some/all provenance from Python script runs. • Use filter queries to “zoom” into relevant parts .. [Figure: full noWorkflow trace graph of simulate_data_collection, showing function calls, variable assignments, and the files read and written, e.g. run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and the raw and corrected image files for samples DRT240 and DRT322]
  • 32. noWorkflow lineage of an image file • Provenance information about Python function calls, variable assignments, etc. • Lineage query from the shell:
      $ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"
    • Makefile rule (.. auto-“make” this!):
      $(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
          now helper df_style.py
          now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv
    [Figure: filtered lineage graph for run/data/DRT240/DRT240_11000eV_002.img with recorded values, e.g. cassette_id = 'q55', sample_score_cutoff = 12.0, data_redundancy = 0.0, calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]), energy = 11000, frame_number = 2, transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')]
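A lineage query of this kind amounts to an upstream traversal of a dependency graph. The following is only a minimal sketch of that idea, not noWorkflow's implementation or data model: the `DERIVED_FROM` edges are hand-written for illustration using names that appear on the slide, and `lineage` is a hypothetical helper.

```python
from collections import deque

# Hypothetical "derived from" edges, written by hand from names on the slide.
# A real trace would be recorded by the provenance tool, not typed in.
DERIVED_FROM = {
    "run/data/DRT240/DRT240_11000eV_002.img": ["transform_image"],
    "transform_image": ["raw_image_file", "calibration.img"],
    "raw_image_file": ["collect_next_image"],
    "collect_next_image": ["cassette_id", "sample_id", "num_images", "energies"],
    "sample_id": ["accepted_sample"],
    "accepted_sample": ["calculate_strategy"],
    "num_images": ["calculate_strategy"],
    "energies": ["calculate_strategy"],
    "calculate_strategy": ["sample_name", "sample_quality", "sample_score_cutoff"],
    # An unrelated output that a lineage filter should leave out.
    "run/rejected_samples.txt": ["rejected_sample"],
}


def lineage(target, derived_from):
    """Return every node reachable upstream of target (breadth-first)."""
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


upstream = lineage("run/data/DRT240/DRT240_11000eV_002.img", DERIVED_FROM)
print(sorted(upstream))
```

Filtering the full trace down to `upstream` is what turns the large graph of the previous slide into the small lineage graph shown here.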
  • 33. YesWorkflow: Yes, scripts are Workflows, too! • Use YW annotations @begin...@end, @in, @out to reveal hidden conceptual workflow (prospective provenance) • Script isn't changed: – annotations via comments (=> language independent) • For understanding and sharing the “big picture” • Query and visualize! [Figure: YW workflow graph of simulate_data_collection with steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, and log_average_image_intensity]
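To make the comment-based annotations concrete, here is a minimal sketch. The step and data names are taken from the slide's figure; the loop body and the `extract_steps` helper are hypothetical and are not YesWorkflow's own parser, which also handles nested blocks, @URI, @LOG, and more.

```python
import re

# Annotated fragment: the YW tags live in comments, so the code is unchanged.
# This string is only parsed below, never executed.
ANNOTATED_SCRIPT = '''
# @begin load_screening_results
# @in sample_spreadsheet
# @out sample_name
# @out sample_quality
for sample_name, sample_quality in spreadsheet_rows(sample_spreadsheet_file):
    pass
# @end load_screening_results
'''

# Matches a comment tag followed by one name, e.g. "# @in sample_spreadsheet".
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)", re.IGNORECASE)


def extract_steps(source):
    """Return {step: {"in": [...], "out": [...]}} for flat (non-nested) blocks."""
    steps, current = {}, None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.group(1).lower(), match.group(2)
        if tag == "begin":
            current = name
            steps[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            steps[current][tag].append(name)
    return steps


steps = extract_steps(ANNOTATED_SCRIPT)
print(steps)
```

Because only comments are read, the same approach carries over to R or MATLAB scripts by changing the comment marker in the pattern.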
  • 34. YesWorkflow: Conceptual workflow model • noWorkflow: Python trace model • But how do we bridge this gap??? • Would like to use YW model to query NW data! [Figure: lineage query on the YW workflow graph of simulate_data_collection next to lineage query on the noWorkflow trace, both for run/data/DRT240/DRT240_11000eV_002.img]
  • 35. Habemus Pons! We’ve got the Bridge! • The bridge is the journey.. (The journey is the destination) • Lineage of image file in terms of YW model, with details from NW provenance
  • 37. DwCA Taxon Lookup Workflow • Declare inputs, outputs, and steps of a script (or wf) with YW annotations to ... – communicate provenance graphically (via Graphviz) – combine different forms of provenance – query provenance • Simple YW annotations in comments: – @BEGIN Step, @END Step – @IN Data, @OUT Data – @URI Template, @LOG Pattern
  • 38. Taxon Lookup Workflow: Data View and Process View
  • 39. The story of two individual records • One took the GBIF route, while … • … the other went all WORMS! • Non-marine? → GBIF • Marine? → WORMS
  • 40. The aggregate story .. • How many records were observed as inputs or outputs of workflow steps? • Were there any NULL values? How many?
  • 41. Hybrid Provenance: YW Model + Runtime Observables (file level) • The YW model can be connected with runtime observables • → YW recon (prov reconstruction) • Here: What specific files were read, written and where do they occur in the workflow?
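One way to picture file-level reconstruction is to match the files found after a run against a @URI template and read the template variables back out of each path. The sketch below assumes exactly that; `template_to_regex` and `reconstruct` are hypothetical helpers, not the YW recon implementation, and the template and paths are the ones shown on earlier slides.

```python
import re


def template_to_regex(template):
    """Compile 'a/{x}/b_{y}.raw' into a regex with one named group per {variable}."""
    parts, seen, position = [], set(), 0
    for match in re.finditer(r"\{(\w+)\}", template):
        # Literal text between variables must match exactly.
        parts.append(re.escape(template[position:match.start()]))
        name = match.group(1)
        if name in seen:
            # A repeated variable must repeat the same value.
            parts.append(f"(?P={name})")
        else:
            parts.append(f"(?P<{name}>[^/]+?)")
            seen.add(name)
        position = match.end()
    parts.append(re.escape(template[position:]))
    return re.compile("".join(parts))


def reconstruct(template, paths):
    """Return the variable bindings of every path that fits the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.fullmatch, paths) if m]


RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
observed_files = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_002.raw",
    "run/collected_images.csv",  # does not fit the template
]

bindings = reconstruct(RAW_IMAGE, observed_files)
print(bindings)
```

Each binding says which sample, energy, and frame a file belongs to, which is the information needed to place that file at the right data node of the workflow model.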
  • 42. YesWorkflow Summary • Lightweight YW annotations can be added easily to your scripts to reap workflow benefits – Documentation of what’s important – Visualization of dependencies – Querying provenance (prospective, retrospective, and hybrid) – Independent of system or language used (R, Python, MATLAB, workflow tools, …) → make provenance actionable → provenance for self! => github.com/yesworkflow-org/yw => try.yesworkflow.org
  • 43. YW Demo Use Cases (IDCC’17)
    Domain | Use case | Programming language | Provenance methods
    Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
    Astrophysics | LIGO | Python | YW + NW (code-level)
    Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
    Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
    Social network analysis | Twitter | Python | YW + NW (file-level)
    Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
  • 44. Project Vignettes • SKOPE: system and tools to discover, access, analyze, visualize paleoenvironmental data – unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results) – for researchers, tinkerers, and modelers • Whole Tale: – leverage & contribute to existing CI to support the whole tale (“living paper”), from workflow run to scholarly publication – integrate tools & CI (DataONE, Globus, iRODS, NDS, ...) to simplify use and promote best practices. – driven by science WGs (Archaeology/SKOPE, materials science, astro, bio ..)
  • 45. Whole Tale: The next step in the evolution of the scholarly article: The “Living [Frozen?] Paper” • 1st Generation: – narrative (prose) • 2nd Generation: plus … – name .. identify .. include (access to) data • 3rd Generation: plus … – name .. reference .. include code (software) .. – and provenance … and exec environment (containers) [Figure: Whole Tale Dashboard]
  • 47. WT Architecture • https://dashboard.wholetale.org
  • 48. Example Tale: LIGO gravitational wave detection (tutorial Jupyter notebook)
  • 49. https://dashboard.wholetale.org
  • 50. What is Whole Tale? ● NSF-funded Data Infrastructure Building Blocks (DIBBs) project ● Platform to create, publish, and execute tales ● Simplify process of creating & verifying reproducible computational artifacts ● https://dashboard.wholetale.org
  • 51. Why Whole Tale? ● Increased reliance on computation across domains ○ new skill requirements for researchers ● Open Science changing norms and expectations ○ increased emphasis on sharing data & code ○ … with transparency and reproducibility in mind! ○ => from sharing data to sharing research objects ○ FAIR principles
  • 52. Whole Tale: Enables Computational Science
  • 53. Whole Tale & the Elements of a … Reproducible Computational Research Platform • Easy-to-access cloud-based computational environments • Transparent access to research data • Collaborate and share with others • Export or publish executable research objects [Figure labels: Develop, Analyze, Share, Package, Reproduce; Re-execute, Review, Verify, Re-use; Coming soon]
  • 54. Whole Tale Roles and Stakeholders [Figure labels: Researchers, Grad Students; Editors, Publishers; Reviewers, Curators; Scientific Software + Data Repositories; Analysis; Publish & Re-use; Verify; Badging, Verification]
  • 55. Develop & Analyze with Whole Tale ● Easy to access cloud-based environments ○ Your laptop in the cloud ● Popular tools ○ + … extensible! ● Work with data & code in transparent (provenance-enabled) ways ○ Automatic data citation ○ Automatic computational provenance capture (coming soon)
  • 56. Package & Reproduce with Whole Tale ● Executable Research Objects ● Publish or export to research archives ● Compatible with new norms for reproducibility and transparency ● For verification and re-use
  • 57. Whole Tale and DataONE ● Discover & access data from any DataONE repository ● Analyze data in Whole Tale ● Package & publish tales to Metacat-based repositories ● Provenance support
  • 58. What exactly is (in) a Tale? ● Verifiable ● Remixable ● Standards-based ✓ Tale: Research object ○ data, code, narrative, compute environment ✓ Executable ✓ Transparent ✓ Publishable
  • 59. Whole Tale Platform Overview • Research & Quantitative Computational Environments • External Data Sources • Code + Narrative ● Authenticate using your institutional identity ● Access commonly-used computational environments ● Easily customize your environment (via repo2docker) ● Reference and access externally registered data ● Create or upload your data and code ● Add metadata (including provenance information) ● Submit code, data, and environment to archival repository ● Get a persistent identifier ● Share for verification and re-use [Figure labels: Create tale; Analyze data; Publish Tale; Coming Soon]
  • 60. Tale Creation Workflow • "Analyze in WT" or register data by URL or digital object identifier • Create a Tale, entering a name and selecting interactive environment • A container is launched based on selected environment with an empty workspace and external data mounted read-only • Create/upload code and scripts • Execute code/scripts to generate results/outputs • Export the Tale in compressed BagIt-RO format to run locally for verification. • Publish the tale to a supported repository, generating a persistent identifier. • Customize environment adding special packages/software dependencies • Re-execute in Whole Tale • Enter descriptive metadata including authors, title, description, and illustration image: schema:author, schema:name, schema:category, pav:createdBy, schema:license
  • 61. Demo: Analyzing Seal Migration Patterns A research team is preparing to publish a manuscript describing a computational model for estimating animal movement paths from telemetry data: ● Telemetry data published in Research Workspace ● Analysis and visualization in RStudio ● Existing routines stored in local R files ● Analysis requires specialized R packages ● Publish results for the community in DataONE 61 Based on: J.M. London and D.S.Johnson. Alaska bearded and spotted seal example dataset and analysis. https://github.com/jmlondon/crwexampleakbs, 2019 Live Demo or Demo Video
  • 62. Key features: Supported environments ● Extension to Binder's repo2docker ○ Jupyter, JupyterLab ○ RStudio (based on Rocker Project) ○ OpenRefine ● Coming soon: ○ MATLAB, Stata
  • 63. Key features: Supported data repositories ● Register data from supported research data repositories ● Referenced data is cited ○ Ideally, eventually contributing to citation counts ● Publish Tales back to research repositories
  • 64. Key features: Export to BagIt-RO ● BagIt: archival format ● Re-runnable in WT ● BagIt-RO ○ Open archival format ○ Research Object support ○ Extended for Big Data. Example bag contents (flattened directory listing): tale/ bagit.txt bag-info.txt data/ workspace/ run.py LICENSE requirements.txt output.csv LICENSE metadata/ manifest.json manifest-sha1.txt start-here/ README.md tagmanifest-sha1.txt
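
A BagIt payload manifest such as `manifest-sha1.txt` lists one checksum and one bag-relative path per line, so an exported Tale can be checked for integrity before it is re-run. The sketch below shows the idea with the Python standard library only; the helper names `sha1_of` and `verify_manifest` are hypothetical, the code ignores BagIt's percent-encoding of unusual path characters, and it is not the validator Whole Tale itself uses.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, manifest_name="manifest-sha1.txt"):
    """Check every entry of a payload manifest against the files on disk.

    Returns a list of (relative_path, problem) tuples; an empty list means
    every listed file exists and matches its recorded checksum.
    """
    bag_dir = Path(bag_dir)
    problems = []
    manifest_text = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each line is "<checksum> <path relative to the bag root>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif sha1_of(target) != expected.lower():
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

Calling `verify_manifest("tale")` on an unpacked export would return an empty list when the payload is intact. Files that a bag references remotely and has not yet fetched would show up as "missing" in this sketch.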
  • 65. Key features: Export and Run Locally ● Natural outcome of Tale export and repo2docker ● Download a zip file (BagIt-RO) ● run-local.sh ○ Build image (repo2docker) ○ Fetch external data (bdbag) ○ Execute (Docker)
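
The three steps listed for run-local.sh (build, fetch, execute) can be sketched as a short command sequence. The contents of the real script are not shown in the slides, so everything below is an assumption: the function name `local_run_commands`, the image name, the choice of `data/workspace` as the directory holding the environment files, and the port mapping are illustrative. The code only assembles and prints the commands; it does not run them.

```python
import shlex


def local_run_commands(bag_dir, image_name="tale-local"):
    """Assemble a plausible build / fetch / execute command sequence.

    Hypothetical reconstruction of the three steps on the slide, not the
    actual run-local.sh shipped with an exported Tale.
    """
    workspace = f"{bag_dir}/data/workspace"  # assumed location of environment files
    return [
        # 1. Build a Docker image from the environment description without starting it.
        ["jupyter-repo2docker", "--no-run", "--image-name", image_name, workspace],
        # 2. Download externally referenced data listed in the bag's fetch.txt.
        ["bdbag", "--resolve-fetch", "all", bag_dir],
        # 3. Start a container from the built image.
        ["docker", "run", "--rm", "-p", "8888:8888", image_name],
    ]


for command in local_run_commands("./tale"):
    print(shlex.join(command))
```

Returning argument lists instead of shell strings keeps quoting explicit, and each list could be passed directly to `subprocess.run` once the paths are confirmed against the unpacked bag.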
  • 66. Coming soon ● Tapis/Agave data sources ● Sharing/collaboration ● Create Tale from Git repository ● Image preservation ● System provenance capture ● Better user experience
  • 67. Some References (Kepler, Kurator, YesWorkflow, Whole-Tale, Reproducibility, Euler/X)
    1. Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J. and Zhao, Y., 2006. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10), pp.1039-1065.
    2. McPhillips, T., Bowers, S., Zinn, D. and Ludäscher, B., 2009. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5), pp.541-551.
    3. Morris, P.J., Hanken, J., Lowery, D., Ludäscher, B., Macklin, J., McPhillips, T., Wieczorek, J. and Zhang, Q., 2018. Kurator: Tools for Improving Fitness for Use of Biodiversity Data. Biodiversity Information Science and Standards, 2, p.e26539.
    4. McPhillips, T., Song, T., Kolisnik, T., Aulenbach, S., Belhajjame, K., Bocinsky, R.K., Cao, Y., Cheney, J., Chirigati, F., Dey, S., Freire, J., Jones, C., Hanken, J., Kintigh, K.W., Kohler, T.A., Koop, D., Macklin, J.A., Missier, P., Schildhauer, M., Schwalm, C., Wei, Y., Bieda, M. and Ludäscher, B., 2015. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. International Journal of Digital Curation, 10, pp.298-313.
    5. McPhillips, T., Bowers, S., Belhajjame, K. and Ludäscher, B., 2015. Retrospective Provenance Without a Runtime Provenance Recorder. In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
    6. Brinckman, A., Chard, K., Gaffney, N., Hategan, M., Jones, M.B., Kowalik, K., Kulasekaran, S., Ludäscher, B., Mecum, B.D., Nabrzyski, J. and Stodden, V., 2019. Computing environments for reproducibility: Capturing the “Whole Tale”. Future Generation Computer Systems, 94, pp.854-867.
    7. Chard, K., Gaffney, N., Jones, M.B., Kowalik, K., Ludäscher, B., McPhillips, T., Nabrzyski, J., Stodden, V., Taylor, I., Thelen, T., Turk, M.J. and Willis, C., 2019. Application of BagIt-Serialized Research Object Bundles for Packaging and Re-execution of Computational Analyses. In 2019 IEEE 15th International Conference on e-Science (e-Science). IEEE.
    8. Chard, K., Gaffney, N., Jones, M.B., Kowalik, K., Ludäscher, B., Nabrzyski, J., Stodden, V., Taylor, I., Turk, M.J. and Willis, C., 2019, June. Implementing Computational Reproducibility in the Whole Tale Environment. In Proceedings of the 2nd International Workshop on Practical Reproducible Evaluation of Computer Systems (pp. 17-22). ACM.
    9. McPhillips, T., Willis, C., Gryk, M., Nunez-Corrales, S. and Ludäscher, B., 2019. Reproducibility by Other Means: Transparent Research Objects. In 2019 IEEE 15th International Conference on e-Science (e-Science). IEEE.
    10. Franz, N.M., Chen, M., Kianmajd, P., Yu, S., Bowers, S., Weakley, A.S. and Ludäscher, B., 2016. Names are not good enough: reasoning over taxonomic change in the Andropogon complex. Semantic Web, 7(6), pp.645-667.
  • 68. Whole Tale Collaboration (PI Team) ● U Illinois (NCSA): Bertram Ludäscher, Victoria Stodden, Matt Turk ○ overall lead (cooperative agreement) ○ reproducibility; provenance; open source software development; outreach ● U Chicago (Globus): Kyle Chard ○ data transfer & storage; compute; infrastructure ● UC Santa Barbara (NCEAS): Matt Jones ○ (meta-)data publishing; provenance; repositories ● U Texas, Austin (TACC): Niall Gaffney ○ compute; HTC; “big tale”; Science Gateways ● U Notre Dame (CRC): Jarek Nabrzyski ○ UX design; UI design
  • 69. Other Research Bits: Logic-based Taxonomy Alignment in EulerX (with Prof. Nico Franz, Curator of Insects @ ASU) ● Given two taxonomies and expert articulations, find the merged (= aligned) taxonomy that logically follows. ● Problems: ○ underconstrained alignment: ambiguity; many possible worlds (PWs) … ○ overconstrained alignment: inconsistency; no PW ● Euler uses ASP and RCC reasoning to infer merged taxonomies, diagnose inconsistencies, and reduce ambiguity. github.com/EulerProject/EulerX
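
The link between articulations and possible worlds can be illustrated with the five RCC-5 base relations, here modeled over finite sets. This is a toy sketch and not Euler's ASP-based reasoner: the function names `rcc5` and `satisfies`, the relation labels, and the concept names `T1.A` and `T2.A` are all illustrative assumptions.

```python
def rcc5(a, b):
    """Classify two non-empty sets into exactly one RCC-5 base relation."""
    a, b = frozenset(a), frozenset(b)
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return "equals"
    if a < b:
        return "is_included_in"  # a is a proper part of b
    if a > b:
        return "includes"        # b is a proper part of a
    if a & b:
        return "overlaps"        # shared members, neither contains the other
    return "disjoint"


def satisfies(world, articulations):
    """Check whether a candidate world respects every expert articulation.

    `world` maps concept names to sets of members; each articulation is
    (concept_x, allowed_relations, concept_y), where the allowed set may
    contain several relations to express expert uncertainty.
    """
    return all(
        rcc5(world[x], world[y]) in allowed for x, allowed, y in articulations
    )


# One articulation that leaves two relations open: an underconstrained case.
articulations = [("T1.A", {"equals", "is_included_in"}, "T2.A")]
world_one = {"T1.A": {1, 2}, "T2.A": {1, 2}}
world_two = {"T1.A": {1, 2}, "T2.A": {1, 2, 3}}
print(satisfies(world_one, articulations), satisfies(world_two, articulations))
```

Both candidate worlds satisfy the single articulation, which is the ambiguity case on the slide. Adding a second articulation that no world can satisfy together with the first would correspond to the inconsistent case with no possible world.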
  • 70. Towards Reproducible Science Tales: Is reproducibility really so complicated? ● Reproducibility crisis? ● Terminology crisis? ● Or gullibility crisis? ● What is reproducibility anyway? ● And who is responsible for it?