Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Linking the prospective and retrospective provenance of scripts

599 views

Published on

Scripting languages like Python, R, andMATLAB have seen significant use across a variety of scientific domains. To assist scientists in the analysis of script executions, a number of mechanisms, e.g., noWorkflow, have been recently proposed to capture the provenance of script executions. The provenance information recorded can be used, e.g., to trace the lineage of a particular result by identifying the data inputs and the processing steps that were used to produce it. By and large, the provenance information captured for scripts is fine-grained in the sense that it captures data dependencies at the level of script statement, and do so for every variable within the script. While useful, the amount of recorded provenance information can be overwhelming for users and cumbersome to use. This suggests the need for abstraction mechanisms that focus attention on specific parts of provenance relevant for analyses. Toward this goal, we advocate that fine-grained provenance information recorded as the result of script execution can be abstracted using user-specified, workflow-like views. Specifically, we show how the provenance traces recorded by noWorkflow can be mapped to the workflow specifications generated by YesWorkflow from scripts based on user annotations. We examine the issues in constructing a successful mapping, provide an initial implementation of our solution, and present competency queries illustrating how a workflow view generated from the script can be used to explore the provenance recorded during script execution.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Linking the prospective and retrospective provenance of scripts

  1. 1. Saumen Dey, Khalid Belhajjame, David Koop, Meghan Raul, Bertram Ludäscher
  2. 2.  Scientists process and analyze their datasets using scripts.  Retrospective provenance information can be useful for scientists to analyze of script execution  E.g., noWorkflow [Murta et al., IPAW’14]  While useful, the amount of information such fine- grained traces contain can be overwhelming for end users. There is a need for abstraction techniques that focus the attention of users on the provenance information relevant for their analyses. TaPP'15 2
  3. 3. Workflow-Oriented Proposals  Zoom*UserViews [Biton et al., ICDE’08]:Workflow views are used for abstracting the retrospective provenance of (potentially complex) workflows  LabelFlow [Alper et al., IPAW’14]: Both (prospective and retrospective) workflow provenance are summarized using user annotations and reduction rules. Script-Oriented Proposals  YesWorkflow [McPhillips et al.,TaPP’15]:We have seen in the presentation by B. Ludäscher, how a URI scheme can be used to reconstruct part of the retrospective provenance, without recording it at run time.  This scheme is useful when the data products used and generated, including intermediate ones, are stored within the file system. We adopt a similar approach as Zoom*UserViews and apply it to scripts: We use the workflow specifications (i.e., prospective provenance) extracted by YesWorkflow to abstract the retrospective provenance captured by noWorkflow TaPP'15 3
  4. 4. Script Annotate Extract Workflow Specification Workflow Description YesWorkflow TaPP'15 4
  5. 5. Script Annotate Extract Workflow Specification Workflow Description YesWorkflow read_temperatureDataFile pastTemperatureData read_precipitationDataFile pastPrecipitationData simulate simulatedWeatherData extract_temperature temperatureData extract_precipitation precipitationData create_plot plot temperatureDataFile precipitationDataFile TaPP'15 5
  6. 6. Script Run Retrospective Provenance noWorkflow Annotate Extract Workflow Specification Workflow Description YesWorkflow • variable(6, 93, row, 14, ”[0.0]” , 1430231173.397779). • dependency(6, 34, 94, 93). TaPP'15 6
  7. 7. Script Link* Query Provenance User Run Retrospective Provenance noWorkflow Annotate Extract Workflow Specification Workflow Description YesWorkflow Link*: Links retrospective and prospective provenancesTaPP'15 7
  8. 8.  Variable names are not reliable IDs.  Two different variables may have the same name within a script, e.g., within the scope of different functions  noWorkflow tracks line numbers in the script to identify the variable  YesWorkflow does not provide this information  However, it provides the line numbers of the start and end of a block, when requested.  Using the two pieces of information, we are able to connect data values in noWorkflows to their correspondingYesWorkflow variables. TaPP'15 8
  9. 9.  The variables inYesWorkflows are associated with user annotations  Such annotations can be used to provide more information about the data values within noWorkflow retrospective provenance  We note here that it is possible inYesWorkflow to specify variables that are not mapped to any variables within the script TaPP'15 9
  10. 10. - d d e d r y y e - 1 # @begi n model 2 # @i n a @as past Temper at ur eDat a 3 # @i n b @as past Pr eci pi t at i onDat a 4 # @out dat a @as si mul at edWeat her 5 i f ( sum( a) / l en( a) ) < 0: 6 dat a = model 1( a, b) 7 el se: 8 dat a = model 2( a, b) 9 # @end model Figure3. Annotating anif-then-elsecontrol block using YesWork- flow model pastTemperatureData pastPrecipitationData simulatedWeather e t d e ) n - - - - d s s n - a d - e t s s 24 ## @begi n mai n 25 # @i n dat a1. dat @as t emper at ur eDat aFi l e @URI f i l e: t emp. dat 26 # @i n dat a2. dat @as pr eci pi t at i onDat aFi l e @URI f i l e: pr eci p. dat 27 # @out out put . png @as pl ot 28 29 ## @begi n r ead_f i l e_1 30 # @i n dat a1. dat @as t emper at ur eDat aFi l e @URI f i l e: t emp. dat 31 # @out a @as past Temper at ur eDat a 32 a = r ead_f i l e( " dat a1. dat " ) 33 ## @end r ead_f i l e_1 34 35 ## @begi n r ead_f i l e_2 36 # @i n dat a2. dat @as pr eci pi t at i onDat aFi l e @URI f i l e: pr eci p. dat 37 # @out b @as past Pr eci pi t at i onDat a 38 b = r ead_f i l e( " dat a2. dat " ) 39 ## @end r ead_f i l e_2 40 41 i f sum( a) / l en( a) < 0: 42 ## @begi n model _1 43 # @i n a @as past Temper at ur eDat a 44 # @i n b @as past Pr eci pi t at i onDat a 45 # @out dat a @as si mul at edWeat her 46 dat a = model 1( a, b) 47 ## @end model _1 48 el se: 49 ## @begi n model _2 50 # @i n a @as past Temper at ur eDat a 51 # @i n b @as past Pr eci pi t at i onDat a 52 # @out dat a @as si mul at edWeat her 53 dat a = model 2( a, b) 54 ## @end model _2 55 56 ## @begi n ext r act _t emper at ur e model_1 Model_2 simulatedWeather TaPP'15 10
  11. 11.  We found gaps in the noWorkflow dependency graph (when exported to prolog)  Function returns do not always link back to correct variable.  An object that is modified (e.g via a list.append call) is not always captured in the dependency graph  We also observed that noWorkflow does not capture information about the name of the script in the retrospective provenance  This is required for uniquely identifying variables. TaPP'15 11
  12. 12.  An object may be nested, i.e., it may have other objects as is children.  Any change to the child object is attributed to the parent object.  noWorkflow attributes any changes to any child object to the ultimate parent. TaPP'15 12
  13. 13. TaPP'15 13
  14. 14. TaPP'15 14
  15. 15. iDepAbs iDepAbs TaPP'15 15
  16. 16. Q1: find the temperature file used. fileName(N) :- map(V,A), A=“temperatureDataFile”, iData(V,N). Q2: show how the temperature file was used by. fileUsedBy(X,Y) :- iDepAbs(X,Y), map(X,A), A=“temperatureDataFile”. fileUsedBy(X,Y) :- fileUsedBy(X,Z). iDepAbs(Z,Y). TaPP'15 16
  17. 17.  We have implemented an approach for linking the retrospective provenance of script to a more abstracted and user defined prospective provenance.  This solution is complementary to theYesWorkflow solution for capturing retrospective provenance of data files [1]  The issues that we came across usingYesWorkflow and noWorkflow were communicated to the development teams of these tools to address them  Linking prospective and retrospective provenance for multiple scripts as opposed to a single script  Designers may organize the implementation their data analyses into multiple scripts. [1]T. M. McPhillips et al. Retrospective provenance without a run-time provenance recorder. InTapp, 2015 TaPP'15 17
  18. 18. Saumen Dey, Khalid Belhajjame, David Koop, Meghan Raul, Bertram Ludäscher

×