National Center for Supercomputing Applications
University of Illinois at Urbana–Champaign
Expressing and sharing workflows
Daniel S. Katz
Assistant Director for Scientific Software & Applications, NCSA
Research Associate Professor, CS
Research Associate Professor, ECE
Research Associate Professor, iSchool
dskatz@illinois.edu, d.katz@ieee.org
@danielskatz
What’s a workflow?
• A set of tasks and dependencies between them
• Perhaps expressed as a data structure, e.g., a graph (DAG or cyclic)
• How is this different from a computer program?
• The tasks are more well-defined (inputs, outputs)
• The tasks are longer (running time O(sec) – O(hr))
• Why express it differently?
• Program (script) is a natural way of expressing a workflow
• Examples: shell scripts, programs in Swift/Parsl
• YesWorkflow annotations to help in understanding scripts
• Swift/Parsl: functions used to identify components
• Expressing it as data corresponds to the compiled (assembly)
version of the workflow
• Useful for a lot of things, but not understanding
http://swift-lang.org, http://parsl-project.org
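As a minimal sketch of the "workflow as data" view above, the same three-step workflow can be written as a DAG that maps each task to the tasks it depends on, and then ordered for execution. The task names are hypothetical, and `graphlib` is Python's standard-library topological sorter (Python 3.9+), not part of Swift or Parsl.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow expressed as data: task -> set of tasks it depends on.
workflow_dag = {
    "fetch": set(),
    "clean": {"fetch"},
    "plot": {"clean"},
}


def execution_order(dag):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow_dag)
print(order)
```

The dictionary is easy for a scheduler to process, but it says nothing about what each task does, which is the sense in which the data form is useful for many things but not for understanding.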
YesWorkflow (YW)
• Name: “Yes, scripts are (can be) workflows, too!”
• But, workflow (dataflow) usually hidden in the script
• Idea: let the script author reveal the structure by
declaring tasks (steps) and dataflow between tasks.
• This is a modeling step
• very coarse (workflow: one big black box w/ inputs & outputs)
• or rather fine (workflow has many steps, linked by dataflow)
• => a language to explain (graphically) the concepts
(relevant steps, relevant data) you want to share
• => this conceptual YW model can itself be queried; linked
with runtime observables, provenance
Credit: Bertram Ludäscher
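A rough sketch of what such declarations might look like in a Python script is shown here. The `@begin`, `@end`, `@in`, and `@out` keywords follow YesWorkflow's comment-annotation style; the step and data names are invented for illustration, and the exact keyword set should be checked against the YesWorkflow documentation. The annotations are ordinary comments, so the script runs unchanged.

```python
# @begin summarize_readings
# @in raw_readings
# @out mean_reading

# @begin drop_missing
# @in raw_readings
# @out valid_readings
def drop_missing(raw_readings):
    """Remove missing (None) entries from the input list."""
    return [r for r in raw_readings if r is not None]
# @end drop_missing


# @begin compute_mean
# @in valid_readings
# @out mean_reading
def compute_mean(valid_readings):
    """Average the remaining readings."""
    return sum(valid_readings) / len(valid_readings)
# @end compute_mean

# @end summarize_readings

mean_reading = compute_mean(drop_missing([1.0, None, 3.0]))
print(mean_reading)
```

Here the outer block is the coarse view (one box with an input and an output), and the two nested blocks are the finer view, linked by the shared name `valid_readings`.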
Parsl
• A Python-based parallel scripting library (http://parsl-project.org),
based on ideas in Swift (http://swift-lang.org)
• Tasks exposed as functions (Python or Bash)
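# Bash app: the returned string is the command line to run
@App('bash', data_flow_kernel)
def echo(message, outputs=[]):
    return 'echo {0} &> {outputs[0]}'

# Python app: the function body itself is the task
@App('python', data_flow_kernel)
def cat(inputs=[]):
    with open(inputs[0]) as f:
        return f.readlines()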
• Return values are futures
• Other tasks can be called that depend on these futures
• Will not run until futures are satisfied/filled
• Main code used to glue functions together
hello = echo("Hello World!", outputs=['hello1.txt'])
message = cat(inputs=[hello.outputs[0]])
• Fairly easy to understand
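The futures idea can be tried without Parsl installed. The following is only an analogy built on Python's standard `concurrent.futures` module, not Parsl's API: the `echo` and `cat` functions here are plain Python stand-ins, and the dependent task blocks on the upstream future inside a worker thread, whereas Parsl holds a task back until its input futures are filled.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def echo(message, path):
    """Write the message to a file and return the file's path."""
    with open(path, "w") as f:
        f.write(message + "\n")
    return path


def cat(path_future):
    """Wait until the upstream future is filled, then read its file."""
    with open(path_future.result()) as f:
        return f.readlines()


with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(max_workers=2) as pool:
    # Both calls return immediately with futures; nothing blocks here.
    hello = pool.submit(echo, "Hello World!", os.path.join(tmp, "hello1.txt"))
    message = pool.submit(cat, hello)  # depends on the `hello` future
    lines = message.result()           # the main code waits only at this point

print(lines)
```

As in the slide's example, the main code only glues the two functions together; the ordering between them comes from passing one task's future to the other.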
How to promote/share workflows
• How do we share general software?
• Libraries (units of execution with well-defined APIs)
• Source code (fork model)
• Source code repositories (GitHub), packaging
systems/repositories (PyPI, CRAN)
• How do we share data?
• Repositories (Dryad)
• For workflows
• Libraries -> sub-workflows, defined to provide well-specified
functionality
• Source code -> source code (scripts), may still be hard to
understand
• Data -> data repository for workflows (myExperiment)
www.myexperiment.org
De Roure, D., Goble, C., Stevens, R. (2009) The Design
and Realisation of the myExperiment Virtual Research
Environment for Social Sharing of Workflows. Future
Generation Computer Systems 25, pp. 561-7.
• A workflow commons for workflow sharing,
designed using Web 2.0 principles
• Launched open beta in November 2007, still
actively used
• Largest public collection of workflows, for
multiple workflow systems
• 2400+ entries in Google Scholar refer to
myExperiment
• Open source, REST API, part of Open Linked
Data cloud (66k triples) - lod-cloud.net
• Introduced “packs” which led to Research
Objects – www.researchobject.org
• Workflow collection studied in scientific
workflow and e-Science communities
• Service maintained by Manchester and Oxford
universities. Informs design of other workflow
sharing systems.
• Content stats: 10591 members, 393 groups,
3876 workflows, 1233 files, 477 packs
Credit: Carole Goble
GitHub
• Widely used for sharing software, and socially working
on/with software (and many other types of documents)
• GitHub is used for sharing workflows today
• Both scripts and data
• Borrowing from “Software vs. data in the context of
citation”
• A workflow as a program or a script is code, a creative work
• Appropriate license: OSI-approved open source (e.g., BSD)
• A workflow as a DAG is data?
• Appropriate license: Creative Commons (e.g., CC-BY)?
• So, let’s keep workflows as programs/scripts
• Use YesWorkflow with scripts
• Use GitHub to share
Katz DS, Niemeyer KE, et al. (2016) Software vs. data in the context of citation.
PeerJ Preprints 4:e2630v1 doi: 10.7287/peerj.preprints.2630v1
