Tapp11 presentation

presentation from TAPP'11 workshop

  • Transcript

    • 1. Incremental workflow improvement through analysis of its data provenance
      Paolo Missier, School of Computing, Newcastle University, UK
      In collaboration with the eScience Central group at Newcastle: Prof. Paul Watson, Dr. Simon Woodman, Dr. Hugo Hiden, Dr. Jacek Cala
      TAPP’11 workshop, Heraklion, Crete, June 20-21, 2011
    • 2. Motivation
      Despite the growing momentum around provenance as a premier form of metadata, stories of successful exploitation trail advances in models and technology.
    • 3. Motivation
      Despite the growing momentum around provenance as a premier form of metadata, stories of successful exploitation trail advances in models and technology.
        • We are getting very good at recording provenance of data:
            – multiple data models (OPM, Provenir, Janus, Karma, PML, ...)
            – provenance-aware system / service architectures (PASS, Karma) ...
            – ... and workflows (Kepler, Taverna, Galaxy, VisTrails, ...)
        • But what are systems and applications really doing with it?
            – delivering value to users, i.e. in e-science and on the Web: scientific reproducibility, quality, trust?
            – optimizing system analysis and performance: enabling partial re-run of resource-intensive processes?
    • 4. Broad goal and specific research context
      A systematic study on methods and applications of mining / learning techniques applied to large corpora of provenance metadata
        • Cloud-based workflows come with a clear cost model
            – valuable testbed readily available: real e-science applications, large provenance graphs
            – dynamic optimization requires many runs
        • Reference framework: adaptive, self-managing software systems
            – systems that can dynamically adapt their behaviour in response to changing conditions in the inputs or in their environment [1,2]
      [Figure: a standard autonomic computing monitor-analyze-plan-execute (MAPE) loop. The autonomic manager continually monitors sensor readings, analyzes and plans management decisions, and executes the decisions via effectors on the managed element; the MAPE components share a central knowledge base.]
      [1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
      [2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing - degrees, models, and applications. ACM Computing Surveys (CSUR) 40, 3 (2008), 1–28.
    • 5. Broad goal and specific research context
      A systematic study on methods and applications of mining / learning techniques applied to large corpora of provenance metadata
        • An opportunistic starting point: optimization of resource-intensive, repetitive, provenance-aware e-science workflows
        • Cloud-based workflows come with a clear cost model
            – valuable testbed readily available: real e-science applications, large provenance graphs
            – dynamic optimization requires many runs
        • Reference framework: adaptive, self-managing software systems
            – systems that can dynamically adapt their behaviour in response to changing conditions in the inputs or in their environment [1,2]
      [Figure: a standard autonomic computing monitor-analyze-plan-execute (MAPE) loop. The autonomic manager continually monitors sensor readings, analyzes and plans management decisions, and executes the decisions via effectors on the managed element; the MAPE components share a central knowledge base.]
      [1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
      [2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing - degrees, models, and applications. ACM Computing Surveys (CSUR) 40, 3 (2008), 1–28.
    • 6. Research Hypothesis: Provenance traces recorded for past runs of a workflow can be used to make future runs more efficient
      Approach: Add adaptive control to an existing workflow, with provenance analysis at its core → new recommender task
    • 7. Research Hypothesis: Provenance traces recorded for past runs of a workflow can be used to make future runs more efficient
      Approach: Add adaptive control to an existing workflow, with provenance analysis at its core → new recommender task
      Applicable for instance to a “Panel of Experts” (PoE) pattern:
            – N experts are activated on the same inputs, and the best outputs are selected
      Provenance is used for incremental correlation of the inputs to the experts’ success rate:
            – Provenance of run i indicates which experts performed well on their input
            – Adaptively pruning the process space: on run i+1, use provenance of results computed by runs 1..n to select/prioritize invocation of experts
      [Diagram: input queue → Panel of Experts (Expert 1 ... Expert n) → select → merge results → output queue, with a recommender that performs online provenance analysis against a provenance case base.]
    • 8. Case study: DiscoveryBus workflow
        • QSAR: Quantitative structure-activity relationships
            – at the forefront of Chemical Engineering research
        • OpenQsar project (http://www.openqsar.com):
            – Establish correlations between the structure of molecular compounds and some of their associated activities (toxicity, solubility, etc.)
        • DiscoveryBus: a workflow implementation of OpenQSAR
            – runs on the eScience Central cloud-based workflow system
            – each dataset DS(i) is a family of structurally homogeneous molecules
            – Feature Selection extracts a few relevant features from DS(i)
            – Each learning scheme MB1...MBh generates a different predictive model for molecular activity
      Repetitive and resource-intensive: workflow execution is repeated over about 10K different input datasets
      [Diagram: DS(i) → Feature Selection → FS(i)1 ... FS(i)k → Model Discovery (MB1 ... MBh) → models M(i)11 ... M(i)1h, ..., M(i)k1 ... M(i)kh.]
    • 9. Role of provenance in DiscoveryBus
        • Connection to Panel of Experts:
            – Experts: the model builders MBi
            – Experts’ outcome: quality of the generated model (predictive power, stability)
        • Optimization goal:
            – to prioritize invocation of the MBi based on their past performance on inputs similar to FS(i)j
        • Provenance correlates the quality of output models M(i)jh to intermediate feature sets FS(i)j:
          $M^{(i)}_{jh} \xrightarrow{\mathit{wasGeneratedBy}} MB_h \xrightarrow{\mathit{used}} FS^{(i)}_j \xrightarrow{\mathit{wasDerivedFrom}} DS^{(i)}$
      [Diagram: DS(i) → Feature Selection → FS(i)1 ... FS(i)k → Model Discovery (MB1 ... MBh) → models M(i)11 ... M(i)kh.]
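The provenance chain on this slide can be read as a short graph walk from an output model back to its source dataset. The following is a minimal sketch, assuming provenance is available as (subject, relation, object) triples; the node names, the `EDGES` list and the `follow` helper are all hypothetical and are not taken from the DiscoveryBus or eScience Central implementation.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical trace for one run: model M_11 was generated by an invocation
# of builder MB_1, which used feature set FS_1, derived from dataset DS.
EDGES: List[Edge] = [
    ("M_11", "wasGeneratedBy", "MB_1@run1"),
    ("MB_1@run1", "used", "FS_1"),
    ("FS_1", "wasDerivedFrom", "DS"),
    ("M_21", "wasGeneratedBy", "MB_1@run2"),
    ("MB_1@run2", "used", "FS_2"),
    ("FS_2", "wasDerivedFrom", "DS"),
]

# The relation sequence shown on the slide.
CHAIN = ["wasGeneratedBy", "used", "wasDerivedFrom"]


def follow(edges: List[Edge], start: str, relations: List[str]) -> Optional[List[str]]:
    """Walk the given relations in order from `start`.

    Returns the list of visited nodes, or None if a step has no matching edge.
    """
    node = start
    path = [node]
    for rel in relations:
        nxt = next((o for s, r, o in edges if s == node and r == rel), None)
        if nxt is None:
            return None
        node = nxt
        path.append(node)
    return path


trace = follow(EDGES, "M_11", CHAIN)
print(trace)
```

The second element of the returned path identifies the builder invocation and the third the feature set it consumed, which is the pair the recommender needs in order to attribute a model's quality to a builder and an input.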
    • 10. The recommender
        • One Quality Matrix QMFS is associated with each Feature Set FS
        • QMFS[MBh] encodes the success history of model builder MBh in the workflow every time FS is used as input: $QM_{FS}[MB_h] = \langle G, B \rangle$
        • G (resp. B): number of times MBh has been observed to produce a good (resp. bad) model when applied to input FS
        • QMFS is updated every time FS appears in the provenance graph
        • The builders’ historical success rate (based on G) induces a dynamic partial order on the MBi
        • For each run, the Recommender:
            – intercepts FS in the flow
            – returns the partial order from QMFS
      [Diagram: input queue → Panel of Experts (Expert 1 ... Expert n) → select → merge results → output queue, with a recommender that performs online provenance analysis against a provenance case base.]
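The quality matrix and the ranking step can be sketched in a few lines. This is an illustration under stated assumptions, not the talk's implementation: the class names, the use of G / (G + B) as the success rate, the `min_observations` abstention threshold, and the representation of a feature set as a `frozenset` of feature names are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional

FeatureSet = FrozenSet[str]


class QualityMatrix:
    """Per-feature-set counts: builder -> [G, B] (good / bad models observed)."""

    def __init__(self) -> None:
        self.counts: Dict[str, List[int]] = {}

    def record(self, builder: str, good: bool) -> None:
        entry = self.counts.setdefault(builder, [0, 0])
        entry[0 if good else 1] += 1

    def success_rate(self, builder: str) -> float:
        # Assumed definition: G / (G + B); unseen builders score 0.0.
        g, b = self.counts.get(builder, [0, 0])
        return g / (g + b) if (g + b) else 0.0

    def observations(self) -> int:
        return sum(g + b for g, b in self.counts.values())


class Recommender:
    def __init__(self, min_observations: int = 1) -> None:
        self.min_observations = min_observations  # assumed abstention threshold
        self.matrices: Dict[FeatureSet, QualityMatrix] = {}

    def update(self, fs: FeatureSet, builder: str, good: bool) -> None:
        """Called whenever FS appears in the provenance of a finished run."""
        self.matrices.setdefault(fs, QualityMatrix()).record(builder, good)

    def recommend(self, fs: FeatureSet, builders: List[str]) -> Optional[List[str]]:
        """Return builders ordered by past success on fs, or None to abstain."""
        qm = self.matrices.get(fs)
        if qm is None or qm.observations() < self.min_observations:
            return None
        # sorted() is stable, so builders with equal rates keep their input order.
        return sorted(builders, key=qm.success_rate, reverse=True)


rec = Recommender(min_observations=2)
fs = frozenset({"logP", "mass"})
rec.update(fs, "MB1", True)
rec.update(fs, "MB1", False)
rec.update(fs, "MB2", True)
rec.update(fs, "MB2", True)
print(rec.recommend(fs, ["MB1", "MB2", "MB3"]))
```

Returning `None` for an unseen or thinly observed feature set mirrors the abstention behaviour described under limitations in this transcript.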
    • 11. Early experimental results
      [Chart: net accuracy of the recommender over the history of runs; axis labels and values are not recoverable from the transcript.]
        • max_attempts is the accuracy/resources trade-off parameter
            – max_attempts = n: only the first n out of H model builders are invoked
        • Chart shows net accuracy over the entire available history of runs
            – success rate / number of recommendations given
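One way to read the max_attempts trade-off is as a truncation of the recommended ordering, scored against what actually happened. The sketch below assumes a recommendation counts as a success when at least one of the first max_attempts builders is known to have produced a good model, and that abstentions are excluded from the denominator; both are assumptions made for illustration, and the `history` data is invented.

```python
from typing import List, Optional, Set, Tuple

# Each entry: (ranking returned by the recommender or None if it abstained,
#              set of builders that actually produced a good model on that run).
History = List[Tuple[Optional[List[str]], Set[str]]]


def net_accuracy(history: History, max_attempts: int) -> Optional[float]:
    """Successes divided by the number of recommendations given.

    Returns None when the recommender abstained on every run.
    """
    given = 0
    hits = 0
    for ranking, good_builders in history:
        if ranking is None:
            continue  # abstention: no recommendation was given
        given += 1
        invoked = ranking[:max_attempts]  # only the first n builders are run
        if any(b in good_builders for b in invoked):
            hits += 1
    return hits / given if given else None


# Invented example history, for illustration only.
history: History = [
    (["MB2", "MB1", "MB3"], {"MB1"}),
    (["MB1", "MB2", "MB3"], {"MB3"}),
    (None, {"MB1"}),
]

for n in (1, 2, 3):
    print(n, net_accuracy(history, n))
```

Raising max_attempts can only keep or increase this accuracy, at the cost of invoking more builders per run, which is the trade-off the slide describes.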
    • 12. Limitations and ongoing work
        • Approach suffers when the FS space is sparse
            – most FS are seen only once
        • Recommender abstains when QMFS is not sufficiently populated
        • This is the main hurdle to successful optimization
      [Chart: axis labels and values are not recoverable from the transcript.]
        • Strategy: increase space density by clustering the FS
            – needs a distance metric over the set of FS
            – hierarchical clustering should provide a way to experiment with accuracy/efficiency trade-offs
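The clustering strategy can be illustrated with a small agglomerative procedure. The talk does not name a distance metric or a linkage rule, so this sketch assumes Jaccard distance over feature names and single linkage, with a `threshold` parameter (hypothetical) controlling how aggressively feature sets are merged.

```python
from typing import FrozenSet, List

FeatureSet = FrozenSet[str]


def jaccard_distance(a: FeatureSet, b: FeatureSet) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster(feature_sets: List[FeatureSet], threshold: float) -> List[List[FeatureSet]]:
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until no pair of clusters is
    within `threshold` of each other.
    """
    clusters = [[fs] for fs in feature_sets]
    while True:
        best = None  # (distance, i, j) of the closest mergeable pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


A = frozenset({"a", "b", "c"})
B = frozenset({"a", "b", "d"})
C = frozenset({"x", "y"})

print(len(cluster([A, B, C], threshold=0.5)))
print(len(cluster([A, B, C], threshold=0.4)))
```

A larger threshold yields fewer, denser clusters, so more runs contribute to each quality matrix at the price of treating less similar feature sets as equivalent; sweeping the threshold is one way to explore the accuracy/efficiency trade-off mentioned above.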
    • 13. Short summary
        • What happens when you try to apply mining/learning techniques to large corpora of provenance metadata?
            – are there any interesting applications / use cases?
            – which techniques apply?
            – are there significant research challenges, or an orchard of low-hanging fruit?
                • privacy in provenance mining
                • which provenance models lend themselves well to mining
                • ...