Repro pdiff-talk (invited, Humboldt University, Berlin)

See paper:
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience. In Press (2013).



Transcript

  • 1. Provenance and data differencing for workflow reproducibility analysis. Paolo Missier, School of Computing Science, Newcastle University, UK. Humboldt University, Berlin, March 4, 2013
  • 2. Provenance Metadata
    Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*). Provenance is a description of how things came to be, and how they came to be in the state they are in today (*).
    Why does it matter?
    • To establish quality, relevance, trust
    • To track information attribution through complex transformations
    • To describe one’s experiment to others, for understanding / reuse
    • To provide evidence in support of scientific claims
    • To enable post hoc process analysis for debugging, improvement, evolution
    (*) Definitions proposed by the W3C Incubator Group on provenance: http://www.w3.org/2005/Incubator/prov/wiki/What_Is_Provenance
  • 3. A colourful provenance graph
    [Figure: a PROV-style provenance graph for a document editing and publishing scenario, divided into a remote past, a recent past, an editing phase and a publishing phase. Entities (a paper, drafts v1 and v2, comments, publication guidelines v1 and v2, the publication WD1), agents (Bob with specializations Bob-1 and Bob-2, Alice, Charlie, the w3c consortium) and activities (reading, drafting, commenting, editing, guideline update, publication) are connected by used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf and specializationOf relations.]
  • 4. Motivation: Reproducibility in e-science
    • Setting: collaborative, open science – increasing rate of data sharing in science
    • The stick: both journals and funders demand that data be uploaded – multiple data journals and data repositories emerging
    • The carrot: data is given a DOI and is citable, so scientists get credit
      • Thomson’s Data Citation Index
      • Dryad data repository for biosciences (*)
      • The DataBib repository of research data
      • NSF Data Preservation projects: DataONE – best practices document: notebooks.dataone.org/bestpractices/
      • ... and many others
    (*) As of Jan 27, 2013, Dryad contains 2585 data packages and 7097 data files, associated with articles in 187 journals.
  • 5. General problems
    • Quality assurance – from non-malicious errors in method or data, all the way to fraud, leading to retractions in scientific publications (see e.g. http://retractionwatch.wordpress.com/)
    • Repeatability – if I replicate your experiment / repeat your process on the same data, will I get the same results?
    • Reproducibility – a more general notion: the ability for a third party who has access to the description of the original experiment and its results to reproduce those results, using a possibly different setting, with the goal of confirming or disputing the original experimenter’s claims.
  • 6. Specifically, in e-science...
    • Experimental method → scripts, programs, workflows
    • Publication = results + {program, workflow} + evidence of results
    • Repeatability, reproducibility – will I be able to run (my version of) your workflow on (my version of) your input and compare my results to yours?
    • Evidence of result: provenance of the {program, workflow} execution
    • Side note: portability issues are out of scope – VMs often solve the problem, with some limitations: not when workflows depend on third-party services, and only for limited-size data dependencies
    Main issue: workflow evolution and decay
  • 7. Mapping the reproducibility space
    [Figure: a map of scenarios along two axes – environmental variations (ED vs. ED′) and experimental variations (wf vs. wf′, d vs. d′): repeatability and results confirmation when workflow and data are unchanged; divergence analysis and debugging when the workflow becomes disfunctional or non-functioning (service updates, state changes, workflow exceptions); data and method variation leading to reproducibility analysis; and a decay region.]
    Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results.
    Approach: compare provenance traces generated during the runs: PDIFF.
    P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis. Concurrency Computat.: Pract. Exper., 2013. In press.
  • 8. Mapping the reproducibility space (same content as slide 7)
  • 9. Decay
    • Workflows that have external dependencies are harder to maintain – they may become disfunctional or break altogether.
    Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. “Why Workflows Break – Understanding and Combating Decay in Taverna Workflows.” In Procs. e-Science Conference, Chicago, 2012.
  • 10. Decay (same content as slide 9)
  • 11. Workflows and provenance traces
    Workflow (structure): a directed graph W = (T, E)
    • T: set of tasks (computational units)
    • P: set of (input, output) ports associated to each task t ∈ T
    • E ⊂ T × T: graph edges representing data dependencies; ⟨ti.pA, tj.pB⟩ ∈ E means data produced by ti on port pA ∈ P is routed to port pB ∈ P of tj
    Execution trace: tr = exec(W, d, ED, wfms)
    • A: activities
    • D: data items
    • R = {used, genBy} relations: used ⊂ A × D × P, genBy ⊂ D × A × P
    • Workflow inputs: tr.I = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (d, a, p) ∉ genBy}
    • Workflow outputs: tr.O = {d ∈ D | ∀a ∈ A, ∀p ∈ P: (a, d, p) ∉ used}
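
This model maps directly onto a small data structure. The following is a minimal sketch in Python (the slides do not prescribe any implementation, so the class and field names are illustrative only): a trace stores the used and genBy triples, and the workflow inputs and outputs fall out as the data items that are never generated, respectively never consumed.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Used:            # activity a consumed data item d on port p
        activity: str
        data: str
        port: str

    @dataclass(frozen=True)
    class GenBy:           # data item d was generated by activity a on port p
        data: str
        activity: str
        port: str

    @dataclass
    class Trace:
        used: set = field(default_factory=set)     # used ⊂ A × D × P
        gen_by: set = field(default_factory=set)   # genBy ⊂ D × A × P

        def data_items(self):
            return {u.data for u in self.used} | {g.data for g in self.gen_by}

        def inputs(self):
            # tr.I: data items never generated by any activity
            return self.data_items() - {g.data for g in self.gen_by}

        def outputs(self):
            # tr.O: data items never used by any activity
            return self.data_items() - {u.data for u in self.used}
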
  • 12. Workflow evolution
    tr = exec(W, d, ED, wfms)
    Each of the elements in an execution may evolve (semi-)independently from the others:
    tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t
    time | W  | ED  | d  | wfms  | resulting trace
    t1   | W1 | ED1 | d1 | wfms1 | tr1 = exec1(W1, ED1, d1, wfms1)
    t2   | W2 |     |    |       | tr2 = exec2(W2, ED1, d1, wfms1)
    t3   |    | ED3 | d3 |       | tr3 = exec3(W2, ED3, d3, wfms1)
    t4   |    |     |    | wfms4 | tr4 = exec4(W2, ED3, d3, wfms4)
    t5   |    | ED5 |    |       | tr5 = exec5(W2, ED5, d3, wfms4)
    Repeatability:
    • Can tr_t be computed again at some time t′ > t?
    • Requires saving ED_t, but that may be impractical (e.g. large DB state)
  • 13. Reproducibility
    Can a new version tr_t′ of tr_t be computed at some later time t′ > t, after one or more of the elements has changed?
    tr_t = exec_t(W_i, ED_j, d_h, wfms_k), with i, j, h, k < t
    tr_t′ = exec_t′(W_i′, ED_j′, d_h′, wfms_k′)
    (Evolution table as on the previous slide.)
    Potential issues:
    • W_i may not run on the new ED_j′
    • W_i may not run with wfms_k′
    • W_i′ may not run with d_h′
    • ...
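
To make the versioning concrete, each execution can be seen as a binding of the four elements to specific versions; comparing two bindings shows which elements changed between runs, and therefore which kind of variation a divergence might be attributed to. A minimal sketch (names are hypothetical; the slides do not define such a structure):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ExecConfig:
        """One execution tr_t = exec_t(W_i, ED_j, d_h, wfms_k)."""
        workflow: str      # e.g. "W2"
        environment: str   # e.g. "ED3"
        data: str          # e.g. "d3"
        wfms: str          # e.g. "wfms1"

    def changed_elements(old, new):
        """Report which of the four elements differ between two executions."""
        return {name: (getattr(old, name), getattr(new, name))
                for name in ("workflow", "environment", "data", "wfms")
                if getattr(old, name) != getattr(new, name)}

    # Example, using the evolution table of slide 12: between t2 and t4 the
    # environment, the input data and the workflow system changed; the workflow did not.
    tr2 = ExecConfig("W2", "ED1", "d1", "wfms1")
    tr4 = ExecConfig("W2", "ED3", "d3", "wfms4")
    print(changed_elements(tr2, tr4))
    # {'environment': ('ED1', 'ED3'), 'data': ('d1', 'd3'), 'wfms': ('wfms1', 'wfms4')}
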
  • 14. Data divergence analysis using provenance
    • All work done with reference to the e-Science Central WFMS
    • Assumption: workflow WFj (the new version) runs to completion – thus it produces a new provenance trace – however, it may be disfunctional relative to WFi (the original)
    • Example: only the input data changes: d ≠ d′, WFj == WFi
      tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W, ED, d′, wfms)
    [Figure: an example workflow with services S0–S4.]
    Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
  • 15. Reproducibility requires comparing datasets
    • Experimenters may validate results by deliberately altering the experimental settings (W_i′, d_j′)
    • The outcomes will not be identical, but are they similar enough to the original to conclude that the experiment was successfully reproduced? ∆D(tr_t.O, tr_t′.O)
    • Data comparison is type- and format-dependent in general
    • Example: the workflow output is a classification model computed using model builders – two models may be different but statistically equivalent
    • e-Science Central accommodates user-defined data diff blocks – these are just Java-based workflow blocks
    If ∆D(tr_t.O, tr_t′.O) > threshold: why are results diverging?
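
The actual diff blocks in e-Science Central are Java workflow blocks whose API is not shown here; the sketch below is only a Python illustration of the idea of a type-specific ∆D for the classification-model example: two models count as equivalent if their accuracy on a common holdout set differs by less than a threshold. All names (model_delta, reproduced, the threshold value) are hypothetical.

    import numpy as np

    def model_delta(model_a, model_b, holdout_X, holdout_y):
        """A type-specific Delta_D for classifiers: compare the two models by
        their behaviour on a holdout set, not by their internal representation."""
        acc_a = np.mean(model_a.predict(holdout_X) == holdout_y)
        acc_b = np.mean(model_b.predict(holdout_X) == holdout_y)
        return abs(acc_a - acc_b)

    def reproduced(model_a, model_b, holdout_X, holdout_y, threshold=0.02):
        """Declare the reproduction successful if the outputs are statistically
        close, even though the two models are not bit-identical."""
        return model_delta(model_a, model_b, holdout_X, holdout_y) <= threshold
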
  • 16. Provenance traces for two runs
    [Figure: two provenance traces, (i) Trace A and (ii) Trace B, over services S0–S4 with ports P0 and P1: inputs d1, d2, d3, intermediate data z, w, x, y and final output df, linked by used and genBy edges.]
  • 17. Delta graphs
    A delta graph is obtained as the result of a “diff” over two traces; it can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
    [Figure: (i) Trace A and (ii) Trace B from the previous slide, and (iii) the resulting delta tree pairing the diverging data items ⟨d2, d2′⟩, ⟨w, w′⟩, ⟨y, y′⟩ and ⟨dF, dF′⟩ – “This is the simplest possible delta ‘graph’!”]
  • 18. More involved workflow differences
    [Figure: two workflow versions, WA and WB, where service S of WA appears as version Sv2 in WB.]
    • S0 is followed by S0 in WA but not in WB;
    • S3 is preceded by S3 in WB but not in WA;
    • S2 in WA is replaced by a new version, S2v2, in WB;
    • S1 in WA is replaced by S5 in WB.
  • 19. The corresponding traces
    tr_t = exec_t(W, ED, d, wfms), tr_t′ = exec_t′(W′, ED′, d, wfms)
    [Figure: the provenance traces (i) Trace A and (ii) Trace B produced by WA and WB: services S / Sv2, S0, S1 / S5, S2 / S2v2, S3 and S4 with ports P0 and P1, and data items d0, d1, d2, h, k, w, y, z, x.]
  • 20. Delta graph computed by PDIFF
    [Figure: the delta graph for the two traces, with paired nodes ⟨S, Sv2⟩ (version change), ⟨d1, d1′⟩, ⟨S1, S5⟩ (service replacement), ⟨S0, S3⟩, ⟨h, h′⟩, ⟨k, k′⟩ and ⟨w, w′⟩ on the P0 and P1 branches of S2, ⟨S2, S2v2⟩ (version change), ⟨y, y′⟩ and ⟨z, z′⟩, and ⟨x, x′⟩ on the P0 and P1 branches of S4.]
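
PDIFF builds such a delta graph by starting from a pair of diverging outputs and walking both traces backwards along genBy and used edges, recording mismatched data items and mismatched services as it goes. The sketch below, reusing the Trace structure sketched under slide 11, illustrates that backward walk only; it is not the full PDIFF algorithm (which also aligns the two workflow structures, labels the kind of each mismatch, and handles services that occur more than once).

    def producer(trace, d):
        """The genBy triple that produced data item d, or None if d is a workflow input."""
        for g in trace.gen_by:
            if g.data == d:
                return g
        return None

    def pdiff_sketch(trace_a, trace_b, out_a, out_b, delta=None):
        """Walk two traces backwards from a pair of diverging outputs, collecting
        mismatched data items and mismatched services into a flat delta list."""
        if delta is None:
            delta = []
        if out_a == out_b:                      # matching data needs no explanation
            return delta
        delta.append(("data", out_a, out_b))
        gen_a, gen_b = producer(trace_a, out_a), producer(trace_b, out_b)
        if gen_a is None or gen_b is None:      # the divergence originates in the inputs
            return delta
        if gen_a.activity != gen_b.activity:    # e.g. version change or service replacement
            delta.append(("service", gen_a.activity, gen_b.activity))
        # recurse on the data consumed by the two producing activities, port by port
        ins_a = {u.port: u.data for u in trace_a.used if u.activity == gen_a.activity}
        ins_b = {u.port: u.data for u in trace_b.used if u.activity == gen_b.activity}
        for port in ins_a.keys() & ins_b.keys():
            pdiff_sketch(trace_a, trace_b, ins_a[port], ins_b[port], delta)
        return delta
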
  • 21. Summary
    • Setting: scientific results computed using workflows; openness / data sharing has the potential to accelerate science, but requires results validation and reproducibility
    • Problem: reproducibility is hard to achieve – workflow decay; evolution of data, workflow spec, dependencies, wf engine
    • Goal: support divergence analysis
    • Approach: PDIFF – comparing provenance traces generated during the runs
  • 22. Selected references
    Zhao, Jun, Jose Gomez-Perez, Khalid Belhajjame, Graham Klyne, et al. Why Workflows Break – Understanding and Combating Decay in Taverna Workflows. In Procs. e-Science Conference, Chicago, 2012.
    Cohen-Boulakia S, Leser U. Search, adapt, and reuse: the future of scientific workflows. SIGMOD Rec. Sep 2011; 40(2):6–16, doi:http://doi.acm.org/10.1145/2034863.2034865.
    Peng RD, Dominici F, Zeger SL. Reproducible Epidemiologic Research. American Journal of Epidemiology 2006; 163(9):783–789, doi:10.1093/aje/kwj093.
    Drummond C. Replicability is not Reproducibility: Nor is it Good Science. Procs. 4th Workshop on Evaluation Methods for Machine Learning, in conjunction with ICML 2009, Montreal, Canada, 2009.
    Peng R. Reproducible Research in Computational Science. Science Dec 2011; 334(6060):1226–1227.
    Schwab M, Karrenbach M, Claerbout J. Making Scientific Computations Reproducible. Computing in Science & Engineering 2000; 2(6):61–67.
    P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis. Concurrency Computat.: Pract. Exper., 2013. In press.
    Mesirov J. Accessible Reproducible Research. Science 2010; 327.