
Works 2015-provenance-mileage


YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts. Keynote at WORKS 2015: Workshop
Workflows in Support of Large-Scale Science. Sunday Nov. 15, 2015, Austin, Texas.

Published in: Data & Analytics

  1. 1. YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts! Bertram Ludäscher. Director, Center for Informatics Research in Science and Scholarship (CIRSS); Professor, Graduate School of Library and Information Science (GSLIS); Faculty affiliate, NCSA & Department of Computer Science
  2. 2. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • … vs Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less) More Provenance Mileage from Workflows and Scripts
  3. 3. Provenance Palooza • Provenance – … or provenience? • Chain of custody • Lineage • Pedigree • Genealogy • Phylogeny • History • Origin
  4. 4. Provenance Research everywhere … … and here:
  5. 5. Provenance as we all know it • Oxford English Dictionary: – coming from some particular source or quarter; origin, derivation – the history or pedigree of a work of art, manuscript, rare book, etc. – concretely, a record of the passage of an item through its various owners (“chain of custody”) • Merriam-Webster: – prov·e·nance noun ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s – the origin or source of something • Origin: – French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come
  6. 6. Provenance: Is this a real Leonardo? Lack of reliable provenance casts doubt on this …
  7. 7. Pedigree
  8. 8. Natural History: Understanding what happened … Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009. Author: Jkwchui (based on drawing by Truth-seeker2004)
  9. 9. Provenience vs Provenance
  10. 10. Archivists • Principle of provenance (respect des fonds) • Keep records of different origins separate to preserve context. Source: Society of American Archivists, http://www2.archivists.org/glossary/terms/p/provenance
  11. 11. So what is “provenance” (sensu W3C)? • Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*) • Provenance is a description of how things came to be, and how they came to be in the state they are in today (*) • Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
  12. 12. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  13. 13. Provenance => Transparency • = “Externally-facing” provenance – “Them-Provenance” • Later: “Internally-facing” provenance – “Me-Provenance”
  14. 14. Climate Change: Whodunnit?
  15. 15. Tracing the sources (data, code)
  16. 16. From “Climate Gate” to Reproducible Science
  17. 17. Data & Provenance Management: Single Model
  18. 18. Data & Provenance Management: Model Chains
  19. 19. Some things people do with “provenance” • Result validation • Result debugging (science vs wf logic) • Reproducibility and Repeatability • Explanation (derivations, traces, proof trees) • Runtime monitoring – Profiling, benchmarking • Performance Optimization (“smart re-run”) • Fault-tolerance, crash-recovery • Database view maintenance (e.g. data warehousing) • …
  20. 20. Provenance for Virtual Joint Experiments • How do we ensure that Charlie gets a complete account of the history of WC's outputs? • How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob's data v? ⇒ traces TA and TB will be critical ⇒ need to compose them to obtain TC. We can view the composition WC as a new, virtual workflow. [Figure: Alice, Bob, Charlie; (1) develop WA; (2) run RA; (3) develop WB; (4) data sharing; (5) run RB; (6) inspect provenance; (7) understand, generate WC; traces TA, TB]
  21. 21. Open Provenance Model => W3C Prov
  22. 22. W3C Prov: One size fits all?
  23. 23. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  24. 24. Types of Data Provenance • Black-box – know (next to) nothing at compile-time – at runtime: keep some data lineage – most provenance work sensu WF uses this • White-box – statically (compile-time) analyzable – q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2) – most provenance work sensu DB uses this • Grey-box – can “look inside” (some black boxes) – … e.g. because they have subworkflows – … or FP signatures: A :: t1, t2 → t3, t4 – … or semantic annotations (sem. types)
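To make the white-box case concrete, here is a minimal Python sketch that evaluates the rule from the slide and keeps, for every output tuple, the input tuples that jointly derived it. The relation contents and the function name `eval_q_with_lineage` are made up for illustration; only the rule itself comes from the slide.

```python
from itertools import product

# Hypothetical input relations for the rule on the slide:
#   q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
p = {(1, 2), (3, 4)}
r = {(1, "a"), (3, "b")}
s = {(2, "u"), (4, "v"), (2, "w")}


def eval_q_with_lineage(p, r, s):
    """Return {output tuple: set of witnesses}.

    A witness is the triple of input tuples (one each from p, r, s)
    that jointly derive the output tuple.
    """
    lineage = {}
    for (x1, x2), (rx, y1), (sx, y2) in product(p, r, s):
        if x1 == rx and x2 == sx:  # join conditions read off the rule body
            witness = (("p", x1, x2), ("r", rx, y1), ("s", sx, y2))
            lineage.setdefault((y1, y2), set()).add(witness)
    return lineage


result = eval_q_with_lineage(p, r, s)
for out, witnesses in sorted(result.items()):
    print(out, "<-", sorted(witnesses))
```

Because the rule is visible, the lineage falls out of the join itself; a black-box step would instead have to be observed at runtime.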
  25. 25. Provenance in Databases. Source: Val Tannen
  26. 26. Provenance in Databases. Source: Val Tannen
  27. 27. Provenance in Databases. Source: Val Tannen
  28. 28. Abstracting the structure of querying. Source: Val Tannen. In database provenance, tuples are either combined conjunctively (*) or disjunctively (+) ⇒ That's the core model!
  29. 29. Provenance Polynomials: One Semiring to Rule them all! (DB theory strikes!) Unifying most prior work in a simple model! Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
  30. 30. Example: Go from X to Y in 3 hops! (e.g., a = CS, b = NCSA, c = GSLIS) • Database: hop(X,Y) := hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s). • Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2, Y). • Result: 3hop(a,a, p^3+2pqr). 3hop(a,b, p^2q+q^2r). … 3hop(a,c, pqs). Note: Cannot go from c to a in 3 hops! [Figure: graph with edges a→a (p), a→b (q), b→a (r), b→c (s); 3-hop paths annotated ppp+pqr+qrp, ppq+qrq, pqs, ppr+qrr, rpq, rqs]
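The 3hop example can be checked mechanically. The sketch below is one possible encoding, not the authors' code: a polynomial is a `Counter` from monomials (sorted tuples of variables) to coefficients, joint use of tuples multiplies, and alternative derivations add. The helper names `three_hop` and `show` are hypothetical.

```python
from collections import Counter

# Annotated edge relation from the slide: hop(X, Y, annotation)
HOP = [("a", "a", "p"), ("a", "b", "q"), ("b", "a", "r"), ("b", "c", "s")]


def three_hop(hop):
    """Evaluate 3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y) with annotations.

    Each derivation contributes the product of its three edge variables;
    several derivations of the same (X, Y) are summed.
    """
    result = {}
    for x, z1, v1 in hop:
        for src2, z2, v2 in hop:
            if src2 != z1:
                continue
            for src3, y, v3 in hop:
                if src3 != z2:
                    continue
                monomial = tuple(sorted((v1, v2, v3)))
                result.setdefault((x, y), Counter())[monomial] += 1
    return result


def show(poly):
    """Render a polynomial, e.g. {('p','p','p'): 1, ('p','q','r'): 2} as p^3 + 2pqr."""
    terms = []
    for monomial, coeff in sorted(poly.items()):
        powers = Counter(monomial)
        body = "".join(v if e == 1 else f"{v}^{e}" for v, e in sorted(powers.items()))
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms)


polys = three_hop(HOP)
for (x, y), poly in sorted(polys.items()):
    print(f"3hop({x},{y}, {show(poly)}).")
# 3hop(a,a, p^3 + 2pqr).
# 3hop(a,b, p^2q + q^2r).
# 3hop(a,c, pqs).
# 3hop(b,a, p^2r + qr^2).
# 3hop(b,b, pqr).
# 3hop(b,c, qrs).
```

No tuple with first component c appears in the result, which matches the note that c cannot reach a in three hops.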
  31. 31. Provenance Polynomials: “My precious!” p^3 + 2pqr; p^3 + pqr; p + 2pqr; p + pqr; pqr; p + pqr; p
  32. 32. Provenance in Databases
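The expressions on this slide look like successively coarser views of the same polynomial. The transcript has lost the figure's labels, so the following Python sketch is only one reading of them: it starts from p^3 + 2pqr and forgets coefficients, exponents, or both, collects the variables used, and finally applies absorption. All function names are hypothetical.

```python
from collections import Counter

# p^3 + 2pqr, encoded as {monomial (sorted tuple of variables): coefficient}
POLY = Counter({("p", "p", "p"): 1, ("p", "q", "r"): 2})


def drop_coefficients(poly):
    """p^3 + 2pqr -> p^3 + pqr: remember each monomial only once."""
    return Counter({m: 1 for m in poly})


def drop_exponents(poly):
    """p^3 + 2pqr -> p + 2pqr: remember each variable only once per monomial."""
    out = Counter()
    for monomial, coeff in poly.items():
        out[tuple(sorted(set(monomial)))] += coeff
    return out


def variables_used(poly):
    """Flat set of all contributing variables (written 'pqr' on the slide)."""
    return {v for monomial in poly for v in monomial}


def absorb(poly):
    """Keep only minimal variable sets: p + pqr collapses to p."""
    sets = {frozenset(m) for m in poly}
    return {s for s in sets if not any(t < s for t in sets)}


print(drop_coefficients(POLY))                  # p^3 + pqr
print(drop_exponents(POLY))                     # p + 2pqr
print(drop_exponents(drop_coefficients(POLY)))  # p + pqr
print(sorted(variables_used(POLY)))             # ['p', 'q', 'r']
print(absorb(POLY))                             # {frozenset({'p'})}
```

Each step discards information, so the most detailed polynomial can be mapped to every coarser form but not the other way around.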
  33. 33. Negation & Why-Not Provenance • Provenance Semirings work well for: – Positive Queries (e.g., RA+) • Challenges: Handling of – set difference (~ negation) – Why-not provenance – Missing Answer provenance • A fresh look at provenance! • … using an old idea: Game semantics!
  34. 34. Provenance (or Query Evaluation) Games: “SLD-resolution game”. A(X) :– B(X,Y,Z) … not C(X,Y) … Eureka! [KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer
  35. 35. Translation: Q(I) => G_Q(I). [Figure: (a) Game template for Q_ABC: A(X) :− B(X,Y), ¬C(Y). (b) Instantiated Q_ABC game on I = {B(a,b), B(b,a), C(a)}.] Source: [KLZ13]
  36. 36. Solve G_Q(I) => Provenance! [Figure: (b) Instantiated Q_ABC game on I = {B(a,b), B(b,a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not). Figure 3: Provenance game for Q_ABC. The well-founded model of …] Source: [KLZ13]
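The "solve the game" step can be sketched with a simple backward induction. The code below is an illustration under assumptions, not the construction from [KLZ13]: the move graph follows my reading of the game template for A(X) :− B(X,Y), ¬C(Y), and position names such as `notB(a,b)` or `r(a,b)` are made-up spellings.

```python
FACTS_B = {("a", "b"), ("b", "a")}   # I = {B(a,b), B(b,a), C(a)}
FACTS_C = {"a"}
DOMAIN = ["a", "b"]


def build_game():
    """Move graph for A(X) :- B(X,Y), not C(Y), instantiated over DOMAIN."""
    moves = {}
    for x in DOMAIN:
        moves[f"A({x})"] = [f"r({x},{y})" for y in DOMAIN]      # pick a rule instance
        for y in DOMAIN:
            moves[f"r({x},{y})"] = [f"g1({x},{y})", f"g2({y})"]  # attack one goal
            moves[f"g1({x},{y})"] = [f"notB({x},{y})"]           # positive goal B(x,y)
            moves[f"notB({x},{y})"] = [f"B({x},{y})"]
            moves[f"B({x},{y})"] = [f"rB({x},{y})"] if (x, y) in FACTS_B else []
    for y in DOMAIN:
        moves[f"g2({y})"] = [f"C({y})"]                           # negated goal not C(y)
        moves[f"C({y})"] = [f"rC({y})"] if y in FACTS_C else []
    return moves


def solve_game(moves):
    """Label each position 'won' or 'lost' for the player who must move there.

    No move available: lost. Some move to a lost position: won.
    All moves lead to won positions: lost. Unlabeled positions would be drawn.
    """
    positions = set(moves) | {t for targets in moves.values() for t in targets}
    label = {}
    changed = True
    while changed:
        changed = False
        for pos in positions:
            if pos in label:
                continue
            successors = moves.get(pos, [])
            if any(label.get(t) == "lost" for t in successors):
                label[pos] = "won"
                changed = True
            elif all(label.get(t) == "won" for t in successors):
                label[pos] = "lost"
                changed = True
    return label


labels = solve_game(build_game())
print(labels["A(a)"], labels["A(b)"])  # won lost
```

The moves from a won position to a lost one are the "good moves"; following them from A(a) gives the why-provenance, and the moves out of the lost position A(b) explain why-not.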
  37. 37. Provenance ~ Query Evaluation Game. [Figure: (a) input I: graph with edges a→a (p), a→b (q), b→a (r), b→c (s) … (b) … annotated: hop(a,a): p; hop(a,b): q; hop(b,a): r; hop(b,c): s. (c) 3hop with provenance: (a,a): p^3 + 2pqr; (a,b): p^2q + q^2r; (a,c): pqs; (b,a): p^2r + qr^2; (b,b): pqr; (b,c): qrs. (d) The game provenance of 3hop(a,a) … (e) … is p^3 + 2pqr.] Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries! Source: [KLZ13]
  38. 38. Provenance ~ Query Evaluation Game … but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!! Here: not 3hop(c,a), i.e. can't go back from GSLIS (c) to CS (a). [Figure 2: Why-not provenance for 3hop(c,a) using provenance games.] Excerpt: … g_1^i in the body of r1, thus claiming that g_1^i is false and hence that the r1 instance doesn't derive t. The first player can counter and demonstrate that g_1^i is true by selecting a rule instance or fact as evidence for g_1^i. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it … Source: [KLZ13]
  39. 39. Database Provenance: Summary • Fine-grained “white-box” provenance • Solved (pretty much) for positive queries • … not so much for negation and “Why-Not” – Active area of research! • Some research prototypes … • … and some real-world implementations! • Note: Those in need of provenance often already “do it”!! – Crash recovery, auditing, concurrency control, …
  40. 40. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more (sometimes with less)
  41. 41. Scientific Workflows: ASAP! • Automation – wfs to automate computational aspects of science • Scaling (exploit and optimize machine cycles) – wfs should make use of parallel compute resources – wfs should be able to handle large data • Abstraction, Evolution, Reuse (human cycles) – wfs should be easy to (re-)use, evolve, share • Provenance – wfs should capture processing history, data lineage ⇒ traceable data- and wf-evolution ⇒ Reproducible Science. [Screenshots: Trident Workbench, VisTrails] Once upon a time …
  42. 42. Phylogenetics workflow in Kepler (2005). What some of us think of when we hear the term ‘scientific workflows’: Graphical interface • Canvas for assembling and displaying the workflow. • Library of workflow blocks (‘actors’) that can be dragged onto the canvas and connected. • Arrows that represent control dependencies or paths of data flow. • A run button. These features are not essential to managing actual scientific workflows. Source: Tim McPhillips
  43. 43. 10 Key Functions of a Sci-WFS 1. Automate programs and services scientists already use. 2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible. 3. Manage dataflow to, from, and between programs and services. 4. Enable scientists (not just developers) to author or modify workflows easily. 5. Predict what a workflow will do when executed: prospective provenance. 6. Record what actually happens during workflow execution. 7. Reveal retrospective provenance – how workflow products were derived from inputs via programs and services. 8. Organize intermediate and final data products as desired by users. 9. Enable scientists to version, share and publish their workflows. 10. Empower scientists who wish to automate additional programs and services themselves. These functions, not actors, distinguish scientific workflow automation from general scientific software development. Source: Tim McPhillips et al.
  44. 44. Yes, scripts are (can be) workflows too! Interactive Visualization
  45. 45. SKOPE: Synthesized Knowledge Of Past Environments. Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction. … implemented as an R Script … K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
  46. 46. … HPCBio Workflows @ Illinois. National Petascale Computing Facility. Broad Institute: Recommended workflow for variant analysis. Liudmila Mainzer, Victor Jongeneel, HPCBio @ Illinois. Quickly, say: #!/bin/bash
  47. 47. It's time to shift control … • … back from being consumers of someone else's (= our) tools … – “Just click here!” • … to tool makers! – Scientists who author workflows as scripts! • Go where the wild things (users!) are … – Yes, develop for “end users” … – … but don't forget the tool makers! • Can we do this together?
  48. 48. Example: AutoDrug Workflow. [Figure: stages Collect Data, Process Data, Solve Structure, Analyze Density; steps Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes; tools Blu-Ice, LABELIT, molrep, refmac, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree] Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.
  49. 49. 3D Protein Structure Determination by X-ray Crystallography: Diffraction images; Experimental electron density and protein model; Full protein structure. Source: Tim McPhillips
  50. 50. Automated Sample Handling. [Figure labels: crystal in loop; sample mounting robot; cassette shipping dewar; crystal mounting pin; sample cassette] Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely. Source: Tim McPhillips
  51. 51. Remote beam line operation. Source: Tim McPhillips
  52. 52. Outline • All things “Provenance” … • Provenance: Why should you care? • Provenance in Databases – Why-, How-, …, Why-Not Provenance • Provenance in Scientific Workflows • YesWorkflow: Doing more with Provenance! – … sometimes using less (e.g., no provenance recorder)
  53. 53. YesWorkflow: Yes, scripts are workflows, too! • Script vs Workflows/ASAP: – Automation: ***** – Scaling: ** – Abstraction: * – Provenance: ** [Figure, marked “?”: workflow graph with node labels GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
  54. 54. Enter: YesWorkflow! (yesworkflow.org) • YesWorkflow (YW) – Grass-roots effort – … meeting the scientists/users where they R! • R, Matlab, (i)Python, Jupyter, … – Scripts + simple user annotations • => Reveal the workflow model/abstraction … that underlies the (script) implementation • => YW can give us more of ASAP! – First YW: ASAP (Abstraction) ... – Then YW-recon: ASAP (reconstructing runtime Provenance)
  55. 55. Related Work, other Approaches … to bring workflow/provenance benefits to scripts: • Runtime Provenance Recorders: – use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables • file read/write, function calls, program variables & state, … – noWorkflow system • [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14] • exploit Python profiling library to capture runtime provenance => helps with "S" and "P"
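As a rough idea of what a runtime recorder observes, the sketch below uses Python's standard `sys.setprofile` hook to log which Python functions are entered while a script step runs. This is not noWorkflow's implementation, only an illustration of the profiling-hook idea; `record_calls`, `scale`, and `pipeline` are hypothetical names.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions entered)."""
    calls = []

    def profiler(frame, event, arg):
        if event == "call":  # a Python-level function is being entered
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always stop recording, even on errors
    return result, calls


def scale(x):
    return 2 * x


def pipeline(x):
    return scale(x) + 1


value, trace = record_calls(pipeline, 3)
print(value, trace)  # 7 ['pipeline', 'scale']
```

A real recorder would also capture arguments, return values, and file accesses, which is where the storage and overhead costs of this approach come from.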
  56. 56. YW (prospective) and YW-Recon (retrospective) Provenance • 1. YW: Annotate Script => YW Model – Annotate @BEGIN..@END, @IN, @OUT – Visualize, share, be happy :-) • 2. Run script – Files are read and written – Folder- & Filenames have metadata • 3. YW-Recon – Use @URI tags that link YW Model ⇔ Persisted Data – Run URI-template queries • cf. “ls -R” & RegEx matching • 4. YW-Query – Answer the user's provenance queries
  57. 57. YW annotations: Model your Workflow!
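Step 1 can be pictured with a toy extractor. The sketch below is not the YesWorkflow tool; it is a minimal Python illustration that scans comment lines for the tags named on the slide (@BEGIN, @END, @IN, @OUT, @URI) and collects one record per block. The sample script, its input/output assignments, and the function name `extract_model` are assumptions made for the example.

```python
import re

# Illustrative script text; the annotations live in ordinary comments.
SCRIPT = """\
# @BEGIN collect_data_set
# @IN accepted_sample
# @OUT raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
raw_image = None
# @END collect_data_set

# @BEGIN transform_images
# @IN raw_image
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
corrected_image = raw_image
# @END transform_images
"""

TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?", re.IGNORECASE)


def extract_model(script):
    """Return a list of blocks: {'name': ..., 'in': [(name, uri)], 'out': [(name, uri)]}."""
    blocks, stack = [], []
    for line in script.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code line, ignored by the model
        tag, name, uri = match.group(1).upper(), match.group(2), match.group(3)
        if tag == "BEGIN":
            stack.append({"name": name, "in": [], "out": []})
        elif tag == "END":
            blocks.append(stack.pop())
        elif stack:
            stack[-1][tag.lower()].append((name, uri))
    return blocks


model = extract_model(SCRIPT)
for block in model:
    print(block["name"], "in:", block["in"], "out:", block["out"])
```

Since only comments are read, the same approach works regardless of the scripting language, which is why the annotations can be added to R, Python, or Matlab scripts alike.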
  58. 58. YesWorkflow: Prospective & Retrospective Provenance … (almost) for free! • YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script … [Figure: script ⇒ YW! ⇒ workflow graph of the data collection script]
  59. 59. Voila! The Workflow revealed! [Figure: workflow graph. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: cassette_id; sample_score_cutoff; sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv); calibration_image (file:calibration.img); run_log (file:run/run_log.txt); sample_name; sample_quality; rejected_sample; accepted_sample; num_images; energies; rejection_log (file:/run/rejected_samples.txt); sample_id; energy; frame_number; raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw); corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img); total_intensity; pixel_count; corrected_image_path; collection_log (file:run/collected_images.csv)]
  60. 60. Get 3 views for the price of 1! Process view; Data view; Combined view
  61. 61. Paleoclimate Reconstruction (EnviRecon.org) • … explained using YesWorkflow! Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told." [Figure: workflow graph with node labels GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
  62. 62. Provenance Lands: Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”); Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
  63. 63. YW-RECON: Prospective & Retrospective Provenance … (almost) for free! • URI-templates link conceptual entities to runtime provenance “left behind” by the script author … • … facilitating provenance reconstruction [Figure: workflow graph of the data collection script next to a listing of run/: raw/q55/DRT240/e10000 and e11000 with image_001.raw … image_037.raw; raw/q55/DRT322/e10000 and e11000 with image_001.raw … image_030.raw; data/DRT240/DRT240_10000eV_001.img … DRT240_11000eV_037.img; data/DRT322/DRT322_10000eV_001.img … DRT322_11000eV_030.img; collected_images.csv; rejected_samples.txt; run_log.txt]
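How a URI template might be matched against files left by a run can be sketched as follows. This is an assumption-laden illustration in Python, not YW-Recon itself: each `{variable}` becomes a named regex group, a repeated variable must match the same text again, and the choice that variable values contain neither `/` nor `_` is mine. The function name `template_to_regex` is hypothetical.

```python
import re

RAW_IMAGE = "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template):
    """Compile a URI template into a regex with one named group per variable."""
    if template.startswith("file:"):
        template = template[len("file:"):]
    parts = re.split(r"\{(\w+)\}", template)  # literals at even, names at odd indexes
    pattern, seen = "", set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)
        elif part in seen:
            pattern += f"(?P={part})"          # repeated variable: same value again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/_]+)"   # assumption: no '/' or '_' in values
    return re.compile(pattern + r"\Z")


raw = template_to_regex(RAW_IMAGE).match("run/raw/q55/DRT240/e10000/image_001.raw")
print(raw.groupdict())
# {'cassette_id': 'q55', 'sample_id': 'DRT240', 'energy': '10000', 'frame_number': '001'}

img = template_to_regex(CORRECTED_IMAGE).match("data/DRT240/DRT240_10000eV_001.img")
print(img.groupdict())
# {'sample_id': 'DRT240', 'energy': '10000', 'frame_number': '001'}
```

The recovered variable bindings are what ties a concrete file back to the conceptual data item in the workflow model, without any recorder having run alongside the script.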
  64. 64. YW (prospective) and YW-Recon (retrospective) Provenance • 1. YW: Annotate Script => YW Model – Annotate @BEGIN..@END, @IN, @OUT – Visualize, share, be happy :-) • 2. Run script – Files are read and written – Folder- & Filenames have metadata • 3. YW-Recon – Use @URI tags that link YW Model ⇔ Persisted Data – Run URI-template queries • cf. “ls -R” & RegEx matching • 4. YW-Query – Answer the user's provenance queries
  65. 65. Data collection workflow (X-ray diffraction) [Figure: workflow graph with steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity and their data items and URI templates, e.g. raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw) and corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img)]
  66. 66. Data collection workflow: runtime data
[Figure: workflow graph as on slide 65, shown next to the files and folders left by a run:]
run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
1. YW annotations => YW model
2. Files & folders left by a run => runtime (meta-)data
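The link between the two points above can be sketched in code: a file URI template from the YW model is turned into a pattern, and matching a path left by a run against it recovers the template variables as runtime metadata. This is a minimal sketch under stated assumptions, not YesWorkflow's own implementation; the helper name `template_to_regex` is hypothetical, and variable values are assumed not to contain `/`.

```python
import re

def template_to_regex(template):
    """Compile a URI template like 'file:a/{x}/b_{y}.raw' into a regex.

    Assumption: each {variable} matches one or more characters other than '/'.
    A variable that occurs twice must have the same value both times.
    """
    if template.startswith("file:"):
        template = template[len("file:"):]
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    seen = set()
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)            # literal path text
        elif part in seen:
            pattern += f"(?P={part})"             # repeated variable: backreference
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"       # first occurrence: named group
    return re.compile("^" + pattern + "$")

# Templates copied from the workflow graph on slide 65.
RAW_IMAGE = template_to_regex(
    "file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw")
CORRECTED_IMAGE = template_to_regex(
    "file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img")

raw_match = RAW_IMAGE.match("run/raw/q55/DRT322/e11000/image_030.raw")
corrected_match = CORRECTED_IMAGE.match("data/DRT322/DRT322_11000eV_030.img")
print(raw_match.groupdict())
print(corrected_match.groupdict())
```

Each matched path yields one record of variable bindings, which is the kind of runtime (meta-)data the later queries operate on.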
  67. 67. Q1: What samples did the script run collect images from?
[Figure: workflow graph and run/ directory listing as on slide 66.]
  68. 68. Q2: What energies were used for image collection from sample DRT322?
[Figure: workflow graph and run/ directory listing as on slide 66.]
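Questions such as Q1 and Q2 can be answered from the raw image paths alone. The following is a minimal sketch, assuming the raw_image URI template from the workflow graph and a hand-typed subset of the paths visible in the directory listing; the function names are hypothetical and this is not YesWorkflow's query mechanism.

```python
import re

# Regex written by hand from the raw_image URI template:
# run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
RAW_IMAGE = re.compile(
    r"^run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw$")

def raw_image_records(paths):
    """Return one dict of template variables per path that matches the template."""
    return [m.groupdict() for m in map(RAW_IMAGE.match, paths) if m]

def samples_collected(records):
    """Q1: samples for which at least one raw image exists."""
    return sorted({r["sample_id"] for r in records})

def energies_for_sample(records, sample_id):
    """Q2: energies for which the given sample has at least one raw image."""
    return sorted({int(r["energy"]) for r in records if r["sample_id"] == sample_id})

# Subset of the paths shown in the listing (elided frames are not filled in).
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e10000/image_037.raw",
    "run/raw/q55/DRT240/e11000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",  # does not match the template and is ignored
]
records = raw_image_records(paths)
print(samples_collected(records))              # ['DRT240', 'DRT322']
print(energies_for_sample(records, "DRT322"))  # [10000, 11000]
```

On a real run directory, the `paths` list could instead be built with `pathlib.Path("run").rglob("*.raw")` and `as_posix()`.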
  69. 69. Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
[Figure: workflow graph and run/ directory listing as on slide 66.]
  70. 70. Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
[Figure: workflow graph and run/ directory listing as on slide 66.]
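Q3 and Q5 both start from a corrected image and ask about the raw image it came from. The corrected image name carries sample_id, energy, and frame_number but no cassette_id, so one way to reconstruct the answer is to join on the three shared variables. The sketch below assumes the two URI templates from the workflow graph and assumes that a sample id is not reused across cassettes; the function name `find_raw_images` is hypothetical.

```python
import re

# Hand-written regexes for the raw_image and corrected_image URI templates.
RAW_IMAGE = re.compile(
    r"^run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>[^/]+)/image_(?P<frame_number>[^/]+)\.raw$")
CORRECTED_NAME = re.compile(
    r"^(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img$")

def find_raw_images(corrected_name, raw_paths):
    """Return raw image paths (with their cassette ids) that share
    sample_id, energy, and frame_number with the corrected image name."""
    corrected = CORRECTED_NAME.match(corrected_name)
    if corrected is None:
        return []
    key = (corrected["sample_id"], corrected["energy"], corrected["frame_number"])
    hits = []
    for path in raw_paths:
        raw = RAW_IMAGE.match(path)
        if raw and (raw["sample_id"], raw["energy"], raw["frame_number"]) == key:
            hits.append({"raw_image": path, "cassette_id": raw["cassette_id"]})
    return hits

# Subset of the raw image paths shown in the directory listing.
raw_paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]
q3 = find_raw_images("DRT322_11000eV_030.img", raw_paths)
q5 = find_raw_images("DRT240_10000eV_001.img", raw_paths)
print(q3[0]["raw_image"])    # run/raw/q55/DRT322/e11000/image_030.raw
print(q5[0]["cassette_id"])  # q55
```

The join returns a list rather than a single path so that an ambiguous or missing match stays visible instead of being silently resolved.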
  71. 71. Querying Provenance
  72. 72. Taking YW for a spin …
•  “To document on the fly, specifically for a given workflow configuration invoked:
   –  do not insert annotations into code,
   –  but rather have code print annotations into a special log during execution,
   –  then parse that log!”
   –  Liudmila Mainzer
Source: L. Mainzer, V. Jongeneel (IGB & NCSA)
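The suggestion quoted above can be sketched as follows: the code writes annotation lines into a log while it runs, and a small parser then rebuilds the steps and their inputs and outputs from that log. The log line format, the helper names, and the example file path are assumptions made for illustration; only the keywords `@begin`, `@in`, `@out`, and `@end` are borrowed from YW annotations, and this is not how YesWorkflow itself is implemented.

```python
import io
import re

def annotate(log, keyword, *values):
    """Write one annotation line such as '@in sample_id DRT322' to the log."""
    log.write("@" + keyword + " " + " ".join(values) + "\n")

def collect_data_set(log, sample_id):
    """Toy workflow step that documents itself at run time (hypothetical paths)."""
    annotate(log, "begin", "collect_data_set")
    annotate(log, "in", "sample_id", sample_id)
    log.write("collecting images ...\n")  # ordinary log output, ignored by the parser
    annotate(log, "out", "raw_image", f"run/raw/{sample_id}/image_001.raw")
    annotate(log, "end", "collect_data_set")

ANNOTATION = re.compile(r"^@(?P<keyword>\w+)\s+(?P<name>\S+)(?:\s+(?P<value>.+))?$")

def parse_log(text):
    """Rebuild a list of executed steps with their recorded inputs and outputs."""
    steps, current = [], None
    for line in text.splitlines():
        m = ANNOTATION.match(line.strip())
        if m is None:
            continue
        keyword, name, value = m["keyword"], m["name"], m["value"]
        if keyword == "begin":
            current = {"step": name, "in": {}, "out": {}}
        elif keyword == "end" and current is not None:
            steps.append(current)
            current = None
        elif keyword in ("in", "out") and current is not None:
            current[keyword][name] = value
    return steps

log = io.StringIO()
collect_data_set(log, "DRT322")
steps = parse_log(log.getvalue())
print(steps)
```

Because the annotations are emitted during execution, they describe the configuration that was actually invoked rather than every path the code could take.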
  73. 73. Conclusions
•  Provenance
   –  … in databases
   –  … in scientific workflows
•  Scripts are (often) workflows too!
•  => Need to support provenance management for scripts and scientific workflows!
•  One size might not fit all …
   –  Use prospective and retrospective (recorded or reconstructed) provenance
•  Facilitate “insider” (or “deep”) provenance
   –  … the stuff scientists need to get their job done!
  74. 74. Deep Provenance to get the science done!
•  When reconstructing the past climate, need to know which tree-ring source was used!
[Figure: map with state labels Arizona, Colorado, New Mexico, Utah and labels CRTZ, MVNP, ESPN, LANL; legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations.]
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
  75. 75. Conclusions (Cont’d)
•  YesWorkflow: Go where the users are!
   –  … they already capture provenance through metadata!
•  Beware your level of provenance abstraction
   –  Let the user provide a workflow model easily!
•  YW-Recon:
   –  … finishing support for retrospective provenance without using a runtime provenance recorder!
   –  Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
•  Future Work:
   –  Build systems that work with the existing workflow of scientists!
   –  There are many research questions & opportunities out there!
      •  e.g.: Why-Not provenance for scientific workflows anyone?
