Paper presentation @IPAW'08


Exploiting provenance to make sense of automated decisions in scientific workflows

  • Searching for “nuggets of quality knowledge”
  • Embedding the sub-flow requires a deployment descriptor: adapters between host flow and quality sub-flow; data and control links between host flow tasks and quality flow tasks

    1. Exploiting provenance to make sense of automated decisions in scientific workflows
        • Paolo Missier, Suzanne Embury, Richard Stapenhurst
        • Information Management Group
        • School of Computer Science
        • The University of Manchester, UK
    2. Outline
        • Setting: the problem of quality control in scientific workflows
            • the Qurator project (2004-2007)
        • Quality control is an automated decision process
            • accept/reject data based on user-defined criteria
            • part of the workflow → quality workflow
        • Role of workflow provenance in explaining automated decisions
            • why was data element X accepted/rejected?
    3. Scope of provenance analysis
        • Model-driven quality workflows:
            • automatically generated from a specification
            • makes for a predictable workflow structure
        • Services in quality workflows are semantically annotated
        • The provenance data model exploits the semantics:
            • provenance queries leverage the ontology
            • provenance elements explained in ontology terms
    4. Motivation
        • Scientific workflows accelerate the rate at which results are produced
        • Quality control on the results becomes paramount
            • automation / high throughput limit the options for systematic human inspection
            • use of public resources (data, services) may introduce noise: e.g. dirty data
        • Risk of producing invalid results
            • but: quality metrics vary with data and application domain
    5. Example: protein identification process
        [Diagram labels: Data output; Protein identification algorithm; “Wet lab” experiment; Protein Hitlist; Protein function prediction; Correct entry → true positive]
        • Evidence:
            • mass coverage (MC) measures the amount of protein sequence matched
            • Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
            • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
        • This evidence is independent of the algorithm / SW package
        • It is readily available and inexpensive to obtain
    6. Quality process components
        • The Qurator hypothesis [VLDB06]
        • quality controls have a common process representation
            • regardless of their specific data and application domain
        • Process steps: Collect evidence; Compute assertions; Evaluate conditions; Execute actions
        • Evidence:
            • mass coverage (MC)
            • Hit ratio (HR)
            • ELDP
        • Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
        • Action rules: if (score < x) then reject
        [Diagram labels: Protein identification; Protein Hitlist; Protein function prediction; Quality filtering]
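The quality assertion and action rule on this slide can be sketched in a few lines of Python. This is only an illustration, not Qurator code: the `Evidence` record, the function names, the sample values, and the threshold are hypothetical. Only the PMF score formula and the "reject if score < x" rule come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence collected for one protein hit (field names are illustrative)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: Evidence) -> float:
    """Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)."""
    return (e.hit_ratio * 100) + e.mass_coverage + (e.eldp * 10)


def quality_filter(hitlist: dict, threshold: float) -> dict:
    """Apply the action rule 'if score < threshold then reject' to every hit."""
    decisions = {}
    for protein_id, evidence in hitlist.items():
        score = pmf_score(evidence)
        decisions[protein_id] = "reject" if score < threshold else "accept"
    return decisions


# Made-up evidence values and threshold, for illustration only.
hitlist = {
    "hit_a": Evidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "hit_b": Evidence(hit_ratio=0.25, mass_coverage=5.0, eldp=1.0),
}
decisions = quality_filter(hitlist, threshold=50.0)
```

Keeping the score function separate from the rule mirrors the slide's split between computing assertions and evaluating conditions, so either one can be swapped for a different domain.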
    7. From quality processes to quality workflows
        • Approach in practice:
            • users provide a declarative specification of an abstract quality process (a “Quality View”)
            • The abstract process is automatically translated into a quality workflow
                • this makes arbitrary Taverna workflows “quality-aware”
    8. Example: original proteomics workflow
        [Figure label: Quality flow embedding point]
    9. Example: embedded quality workflow
    10. Qurator provenance component
        • Specialised for quality workflows
        • Scope:
            • workflow run
            • data being quality assessed
            • quality metrics applied to the data
            • value of metric on the data
            • evidence used to compute metrics
            • quality rules based on metrics values
            • statistics
    11. Semantics of quality processors
        [Diagram labels: upper ontology for Information Quality; extensions to the proteomics domain; services and data]
    12. Provenance model
        • Provenance elements are individuals of ontology classes
            • OWL ontology => RDF provenance data
        • Static model – RDF graph
            • workflow graph structure, services
            • auto-generated along with the quality workflow itself
        • Dynamic model – RDF graph
            • populated during workflow execution
            • RDF resources can be elements of the static model
            • data values are literals
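To make the static/dynamic split concrete, here is a minimal sketch that represents both graphs as plain Python sets of (subject, predicate, object) triples. All resource names and the `q:`, `wf:`, `svc:`, and `run:` prefixes are invented for illustration; a real deployment would hold these in an RDF store rather than in Python sets.

```python
# Static model: generated along with the quality workflow (structure, services).
# Every identifier below is hypothetical.
static_model = {
    ("wf:QualityFlow", "rdf:type", "q:QualityWorkflow"),
    ("svc:PMFScore", "rdf:type", "q:QualityAssertion"),
    ("wf:QualityFlow", "q:hasProcessor", "svc:PMFScore"),
}

# Dynamic model: populated while the workflow runs.
# Objects are either resources from the static model or literal data values.
dynamic_model = {
    ("run:binding1", "q:producedBy", "svc:PMFScore"),  # resource in static model
    ("run:binding1", "q:dataItem", "P33897"),          # literal
    ("run:binding1", "q:value", 100.0),                # literal
}

# Subjects described by the static model.
static_resources = {subject for subject, _, _ in static_model}


def links_to_static(triple):
    """True when the triple's object is a resource described in the static model."""
    return triple[2] in static_resources


# Predicates of dynamic triples that point back into the static model.
linked = sorted(t[1] for t in dynamic_model if links_to_static(t))
```

Because dynamic triples reference static resources directly, a query can move from a run-time value to the annotated service that produced it without any extra join table.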
    13. Static model (fragment)
    14. Dynamic model (fragment)
        Return all action outcomes for a given workflow and data item:
            SELECT ?action ?outcome ?workflow
            WHERE {
              ?binding data_item "P33897" .
              ?binding action name ?action .
              ?binding value ?outcome .
              ?binding workflow ?workflow .
              ?binding rdf:type "actionBinding" .
              FILTER (regex(?workflow, "4IPQF26RXW2"))
            }
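For readers less familiar with SPARQL, the query on this slide amounts to matching a fixed set of properties on each binding and then filtering on the workflow identifier. The sketch below mirrors that logic over in-memory Python records. It is not SPARQL and not the Qurator API: the record layout, the `action_name` key, and the action and outcome values are assumptions, while the data item and workflow identifier are the ones shown on the slide.

```python
import re

# Hypothetical binding records; only "P33897" and "4IPQF26RXW2" come from the slide.
bindings = [
    {"type": "actionBinding", "data_item": "P33897",
     "action_name": "filter", "value": "accept", "workflow": "4IPQF26RXW2"},
    {"type": "actionBinding", "data_item": "P33897",
     "action_name": "filter", "value": "reject", "workflow": "OTHER_RUN"},
    {"type": "assertionBinding", "data_item": "P33897",
     "action_name": "", "value": "100", "workflow": "4IPQF26RXW2"},
]


def action_outcomes(records, data_item, workflow_pattern):
    """Return (action, outcome, workflow) for matching action bindings.

    re.search is used because SPARQL's regex() also matches anywhere in the string.
    """
    return [
        (r["action_name"], r["value"], r["workflow"])
        for r in records
        if r["type"] == "actionBinding"
        and r["data_item"] == data_item
        and re.search(workflow_pattern, r["workflow"])
    ]


results = action_outcomes(bindings, "P33897", "4IPQF26RXW2")
```

The function is a single flat filter with no traversal, which matches the deck's point that a regular workflow pattern keeps provenance queries simple.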
    15. Provenance service interface
        • Java SPARQL API (Jena ARQ)
            • GUI shown earlier is an example
        • Queries are straightforward SPARQL
            • 3-layer workflow pattern => no recursion
    16. Conclusions
        • An experiment in “semantic provenance”
            • restricted to quality workflows
        • Semantic service annotations => high-level provenance query / presentation
        • Key enabler:
        • workflow is the result of a compilation step
            • regular pattern facilitates analysis / presentation
        • Speculative conclusion:
            • workflows are targets, not sources...
            • model-driven generation of workflows has benefits and will become increasingly common