Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Catherine Wise, Nicholas J Car, Ryan Fraser and Geoff Squire
Data61 and LAND & WATER
Standard Proveance Reporting and
Scie...
What are VLs?
What is VHIRL?
What is provenance?
How does VHIRL manage provenance (or not)?
How do we represent VHIRL’s ac...
What are VLs?
From https://nectar.org.au/virtual-laboratories-1, they are:
data repositories and computational tools and streamlining re...
What is VHIRL?
• Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific
workflow portal
• Gives researchers access to a cloud c...
Components of the Virtual Lab: Virtual Hazard
Impact & Risk Laboratory (VHIRL)
Data Services Processing
Services
Compute
S...
What is provenance?
From http://en.wikipedia.org/wiki/Provenance#Computer_Science:
What is provenance?
“Computer science uses the term provena...
How do we represent VLs using
standardised provenance?
• Natively tracks ‘everything’ used for scenario (re)runs
• Is not a: Data store, Software repo, Records mgt system
• Exte...
• SSSC is a web-based system to
manage code & dependencies
• Contains Problems &
Solutions that define a
workflow
• Soluti...
Scientific Solutions Software Centre (SSSC)
• Beautiful, RESTful API
this example:
http://vhirl-dev.csiro.au/scm/toolbox/2...
Mapping VHIRL to PROV 1
Input Data Process
Output
Data
Mapping VHIRL to PROV 2
Code Process
Output
Data
Config
Input Data
“Ontology Design Pattern”
Mapping VHIRL to PROV 3
Code Process
Output
Data
Config
Input Data
Who/
which
system
Who
used
Entity Activity Agent
Mapping VHIRL to PROMS
Report N
Entity Activity Agent
Reporting
System X
R.S. Report
Mapping VHIRL to PROMS
VHIRL provenance into PROMS Server
Report N
Entity Activity Agent
Reporting
System X
R.S. Report
Report N
Report N
Report ...
Modelling VHIRL’s data types
VL Run
output
data
userThe VL
Report N
managed
data
web
service
data
user
supplied
data
manag...
PROMS Reporting Toolkits
VHIRL’s native PROV output
RDF file
What work other, than
representation, is needed for
provenance?
Provenance effort (step) pyramid
Data Management
Establishing Reporting
Continued
Reporting
managed
data
web
service
data
user
supplied
data
managed
code
user
supplied
code
Data Management
output
data
all Entities ...
managed
data
web
service
data
user
supplied
data
managed
code
user
supplied
code
Data Management
VL ID’d and persisted
out...
managed
data
web
service
data
user
supplied
data
managed
code
user
supplied
code
Data Management
VL ID’d and persisted
out...
Establishing Reporting
VL
Report
Organisational
Provenance
Store
querying & redelivery
ProvenanceReportingToolkit
C#
Java
...
Establishing Reporting - Reporting Toolkits
managed
data
web
service
data
VL Run
“Grid X”
“Service Y”
“Run 456”
e1 = Entit...
Establishing Reporting - Reporting Toolkits
managed
data
web
service
data
VL Run
“Grid X”
“Service Y”
“Run 456”
e1 = Entit...
Establishing Reporting - Reporting Toolkits
managed
data
web
service
data
VL Run
“Grid X”
“Service Y”
“Run 456”
Agent
N
a0...
Establishing Reporting - Reporting Toolkits
managed
data
web
service
data
VL Run
“Grid X”
“Service Y”
“Run 456”
Agent
N
Re...
What do we get from this work?
Graph power!
Report NReporting
System X
...
URI power!
Report NReporting
System X
corporate
staff DB
temp repo
public web
service
DAP-style
repo
PROMS
instance
Distributed graphs!
GA PROMS
instance
VL PROMS
instance
Uni Prov
Store
Distributed Querying via endpoint cache
Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories
Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories
Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories
Upcoming SlideShare
Loading in …5
×

Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories

154 views

Published on

The Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal that provides researchers with access to a cloud computing environment for natural hazards eResearch tools. It allows researchers to construct experiments with data from a variety of sources and execute cloud computing processes for rapid and remote simulation and analysis. The service currently includes tools for the simulation of three major hazards affecting the Asia-Pacific region: earthquakes, tsunamis and tropical cyclones.

For scientific results, the establishment of provenance is key to reproducibility and trust. Thus the need for any virtual laboratory to provide provenance information for the tasks it manages is obvious, but the appropriate way to report and manage provenance information is not always so straightforward. Many virtual laboratories and workflow systems provide bespoke provenance management with a focus on internal system use. This has clear benefits for reproducibility within the system, but it limits the interoperability of systems. For VHIRL, a provenance solution was required that was as
interoperable with other, external, provenance systems as possible.

A related common issue facing workflow tools and virtual laboratories is the need to manage software code. With this comes well-known issues associated with code sharing: licensing, source code management, version management and dependency resolution. There are a wide selection of commonly used tools to help solve these problems, for example Git and Subversion.

A key goal of VHIRL was to externalise as much information management as was reasonable. VHIRL is a virtual laboratory: it is not designed to be a data store, software repository, or records management system. A solution was required that could hand off the management of provenance records and code to external services, with links between them, other data services and VHIRL jobs where appropriate.

Scientific software can be quite complicated and systems for managing dependencies and source vary from system to system. In order to provide the least friction for authors of software, we designed a system called the Scientific Software Solution Centre (SSSC) to manage solutions to scientific problems and deliver the solution templates, code and dependencies that enable them for use in VHIRL and other Virtual Laboratories and applications.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Standard Provenance Reporting and Scientific Software Management in Virtual Laboratories

  1. 1. Catherine Wise, Nicholas J Car, Ryan Fraser and Geoff Squire Data61 and LAND & WATER Standard Proveance Reporting and Scientifc Software Management in Virtual Labs
  2. 2. What are VLs? What is VHIRL? What is provenance? How does VHIRL manage provenance (or not)? How do we represent VHIRL’s actions to standardised provenance? What work, other than representation, is needed for provenance? What benefits do we get from this work? Outline
  3. 3. What are VLs?
  4. 4. From https://nectar.org.au/virtual-laboratories-1, they are: data repositories and computational tools and streamlining research workflows What are VLs?
  5. 5. What is VHIRL?
  6. 6. • Virtual Hazards Impact & Risk Laboratory (VHIRL) is a scientific workflow portal • Gives researchers access to a cloud computing for natural hazards research • data from a variety of sources • uses cloud computing resources • currently has tools for the earthquakes, tsunamis & tropical cyclones in the Asia-Pacific region What is VHIRL?
  7. 7. Components of the Virtual Lab: Virtual Hazard Impact & Risk Laboratory (VHIRL) Data Services Processing Services Compute Services Enablers Virtual Laboratories /AppsData Analytics Magnetics Gravity DEM eScript ANUGA NCI Petascale NCI Cloud NeCTAR Cloud Amazon Cloud Desktop Service Orchestration Provenance Metadata Auth. Coastal Inundation Tsuanmi Inundation Scenario Cyclone Wind Path Calculation Landsat Bathymetry Cyclone Wind Model Surface Wave Propagation (earthquake) TCRM Connectivity via Provenance | Melanie Ayre | eResearch Australiasia 2015, Brisbane
  8. 8. What is provenance?
  9. 9. From http://en.wikipedia.org/wiki/Provenance#Computer_Science: What is provenance? “Computer science uses the term provenance to mean the lineage of data or processes, as per data provenance. However there is a field of informatics research within computer science called provenance that studies how provenance of data and processes should be characterised, stored and used. Semantic web standards bodies, such as the World Wide Web Consortium, ratified a standard for provenance representation in 2014, known as PROV.”
  10. 10. How do we represent VLs using standardised provenance?
  11. 11. • Natively tracks ‘everything’ used for scenario (re)runs • Is not a: Data store, Software repo, Records mgt system • Externalises as much information mgt as possible • Code managed by the SSSC VHIRL’s own data management
  12. 12. • SSSC is a web-based system to manage code & dependencies • Contains Problems & Solutions that define a workflow • Solutions consists of a Toolbox • Toolboxes are code wrapped in a Python script + description of the required inputs Scientific Solutions Software Centre (SSSC) Class diagram for the SSSC
  13. 13. Scientific Solutions Software Centre (SSSC) • Beautiful, RESTful API this example: http://vhirl-dev.csiro.au/scm/toolbox/2 • Solution  prov:Plan • No RDF metadata, yet!
  14. 14. Mapping VHIRL to PROV 1 Input Data Process Output Data
  15. 15. Mapping VHIRL to PROV 2 Code Process Output Data Config Input Data “Ontology Design Pattern”
  16. 16. Mapping VHIRL to PROV 3 Code Process Output Data Config Input Data Who/ which system Who used Entity Activity Agent
  17. 17. Mapping VHIRL to PROMS Report N Entity Activity Agent Reporting System X R.S. Report
  18. 18. Mapping VHIRL to PROMS
  19. 19. VHIRL provenance into PROMS Server Report N Entity Activity Agent Reporting System X R.S. Report Report N Report N Report M Report NReporting System Y Report N Report N Report N Organisational Provenance Store reported and stored
  20. 20. Modelling VHIRL’s data types VL Run output data userThe VL Report N managed data web service data user supplied data managed code user supplied code
  21. 21. PROMS Reporting Toolkits
  22. 22. VHIRL’s native PROV output RDF file
  23. 23. What work other, than representation, is needed for provenance?
  24. 24. Provenance effort (step) pyramid Data Management Establishing Reporting Continued Reporting
  25. 25. managed data web service data user supplied data managed code user supplied code Data Management output data all Entities need to be ID’d (via URI) and persisted VL Run each VL run is reported as an Activity within a Report each VL instance has/needs an ID and is modelled as a Reporting System user each VL user is known by their login (account) details. Modelled as a Reporter The VL Report N each VL Report is ID’d and persisted in the VL Provenance Store
  26. 26. managed data web service data user supplied data managed code user supplied code Data Management VL ID’d and persisted output data cited using PROMS-O format soon to be VL ID’d and persisted, with minimal metadata recorded too SSSC ID’s and persisted perhaps SSSC ID’s and persisted, perhaps VL managed soon to be VL ID’d and persisted, if required, perhaps with time limits
  27. 27. managed data web service data user supplied data managed code user supplied code Data Management VL ID’d and persisted output data cited using PROMS-O format soon to be VL ID’d and persisted, with minimal metadata recorded too SSSC ID’s and persisted perhaps SSSC ID’s and persisted, perhaps VL managed soon to be VL ID’d and persisted, if required, perhaps with time limits Virtual Labs Service Citation Example [{ref}] {service title} {service endpoint URI} {query} {time queried} {cached copy ID} [1] “Subset of elevation” http://pid.csiro.au/service/anuga-thredds “bussleton.nc?var=elevation&spatial=bb& north=-33.06495205829679&south=- 33.551573283840156&west=114.849678 74597227&east=115.70661233971667&t emporal=all&time_start=&time_end=&hor izStride” “2014-12-15T13:15:11” http://pid.csiro.au/dataset/abcd1234
  28. 28. Establishing Reporting VL Report Organisational Provenance Store querying & redelivery ProvenanceReportingToolkit C# Java Python
  29. 29. Establishing Reporting - Reporting Toolkits managed data web service data VL Run “Grid X” “Service Y” “Run 456” e1 = Entity(title='Grid X', description='netCDF grid of property X', uri='http://eg-vl.org.au/dataset/123', downloadURL='http://eg-vl.org.au/dataset/123?_view=dl', wasAttributedTo='http://data.ga.gov.au/id/person/john.doe') Agent N Report N Report for Run 456
  30. 30. Establishing Reporting - Reporting Toolkits managed data web service data VL Run “Grid X” “Service Y” “Run 456” e1 = Entity(title='Grid X', description='netCDF grid of property X', uri='http://eg-vl.org.au/dataset/123', downloadURL='http://eg-vl.org.au/dataset/123?_view=dl', wasAttributedTo='http://data.ga.gov.au/id/person/john.doe') Agent N e2 = ServiceEntity( title='Subset of elevation', description='5km solar radiation interpolated raster service', serviceBaseUri='http://siss2.anu.edu.au/anuga/busselton.nc', query='var=elevation&spatial=bb&north=-33.06495205&south=- 33.551573283&west=114.84967874&east=115.70661233&tempor al=all&time_start=&time_end=&horizStride', queriedAtTime='2014-12-15T13:15:11' chachedCopy='http://bom.gov.au/dataset/678') Report N Report for Run 456
  31. 31. Establishing Reporting - Reporting Toolkits managed data web service data VL Run “Grid X” “Service Y” “Run 456” Agent N a0 = Activity( title='Run 456', description='Upper bound run, full Grid X use', wasAssociatedWith={VL added automatically}, startedAtTime={VL added automatically}, endedAtTime={VL added automatically}, usedEntities= [e1, e2], generatedEntities={VL added automatically})Report N Report for Run 456
  32. 32. Establishing Reporting - Reporting Toolkits managed data web service data VL Run “Grid X” “Service Y” “Run 456” Agent N Report N Report for Run 456 r0 = Report( title='Report for Run 456', description='Upper bound run, full Grid X use', startingActivity={VL added automatically}, endingActivity={VL added automatically}) rs0 = ReportSender('http://provstore.vl.org.au/report/') rs.send(r0)
  33. 33. What do we get from this work?
  34. 34. Graph power! Report NReporting System X ...
  35. 35. URI power! Report NReporting System X corporate staff DB temp repo public web service DAP-style repo PROMS instance
  36. 36. Distributed graphs! GA PROMS instance VL PROMS instance Uni Prov Store Distributed Querying via endpoint cache

×