Exploiting provenance to make sense of automated decisions in scientific workflows
  Paolo Missier, Suzanne Embury, Richard Stapenhurst
  Information Management Group, School of Computer Science
  The University of Manchester, UK
Outline
  • Setting: the problem of quality control in scientific workflows
      • the Qurator project (2004-2007)
  • Quality control is an automated decision process
      • accept/reject data based on user-defined criteria
      • part of the workflow => quality workflow
  • Role of workflow provenance in explaining automated decisions
      • why was data element X accepted/rejected?
Scope of provenance analysis
  • Model-driven quality workflows:
      • automatically generated from a specification
      • makes for a predictable workflow structure
  • Services in quality workflows are semantically annotated
  • The provenance data model exploits the semantics:
      • provenance queries leverage the ontology
      • provenance elements are explained in ontology terms
Motivation
  • Scientific workflows accelerate the rate at which results are produced
  • Quality control on the results becomes paramount
      • automation / high throughput limit the options for systematic human inspection
      • use of public resources (data, services) may introduce noise, e.g. dirty data
  • Risk of producing invalid results
  • but: quality metrics vary with data and application domain
Example: protein identification process
  [Diagram labels: Data output; Protein identification algorithm; “Wet lab” experiment; Protein Hitlist; Protein function prediction; Correct entry => true positive]
  • Evidence:
      • mass coverage (MC) measures the amount of protein sequence matched
      • Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
      • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
  • This evidence is independent of the algorithm / SW package
  • It is readily available and inexpensive to obtain
Quality process components
  • The Qurator hypothesis [VLDB06]: quality controls have a common process representation regardless of their specific data and application domain
  • Process components: collect evidence, compute assertions, evaluate conditions, execute actions
  • Evidence: mass coverage (MC), Hit ratio (HR), ELDP
  • Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
  • Actions / rules: if (score < x) then reject
  [Diagram labels: Protein identification; Protein Hitlist; Quality filtering; Protein function prediction]
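The quality assertion and rejection rule on this slide can be sketched in a few lines of Python. This is only an illustration: the `Evidence` class, the function names, the threshold value and all evidence values are assumptions made for the example, not part of Qurator. Only the PMF score formula and the "reject if score < x" rule come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Evidence collected for one protein hit (field names are illustrative)."""
    protein_id: str
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: Evidence) -> float:
    # Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)
    return e.hit_ratio * 100 + e.mass_coverage + e.eldp * 10


def quality_filter(hitlist, threshold):
    """Apply the rule 'if (score < x) then reject' to every hit.

    Returns a dict mapping protein id to (decision, score).
    """
    decisions = {}
    for e in hitlist:
        score = pmf_score(e)
        decision = "reject" if score < threshold else "accept"
        decisions[e.protein_id] = (decision, score)
    return decisions


# Hypothetical evidence values, chosen only to exercise both outcomes.
hitlist = [
    Evidence("P33897", hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    Evidence("Q_EXAMPLE", hit_ratio=0.25, mass_coverage=5.0, eldp=1.0),
]
decisions = quality_filter(hitlist, threshold=50.0)
print(decisions)
```

Keeping the score next to the decision is a deliberate choice in this sketch: it is the kind of information a provenance record needs in order to explain later why an item was rejected.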
From quality processes to quality workflows
  • Approach in practice:
      • users provide a declarative specification of an abstract quality process (a “Quality View”)
      • the abstract process is automatically translated into a quality workflow
      • this makes arbitrary Taverna workflows “quality-aware”
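To make the "specification in, workflow out" idea concrete, here is a toy compilation step in Python. The dictionary layout, the key names and `compile_quality_view` are all hypothetical: this is not the Quality View syntax and not how Qurator targets Taverna. It only shows how a declarative description can be turned mechanically into a fixed sequence of steps.

```python
# Hypothetical declarative spec; NOT the actual Quality View syntax.
quality_view = {
    "evidence": ["HR", "MC", "ELDP"],
    "assertion": {"name": "PMFScore", "weights": {"HR": 100, "MC": 1, "ELDP": 10}},
    "action": {"reject_below": 50},
}


def compile_quality_view(view):
    """Translate the spec into an ordered list of (step_name, function)."""
    name = view["assertion"]["name"]
    weights = view["assertion"]["weights"]
    limit = view["action"]["reject_below"]

    def collect_evidence(item):
        # Keep only the evidence values named in the spec.
        return {key: item[key] for key in view["evidence"]}

    def compute_assertions(evidence):
        # Weighted sum of the evidence, as declared in the spec.
        return {name: sum(w * evidence[key] for key, w in weights.items())}

    def evaluate_conditions(assertions):
        # True when the rejection condition holds.
        return assertions[name] < limit

    def execute_actions(rejected):
        return "reject" if rejected else "accept"

    return [
        ("collect_evidence", collect_evidence),
        ("compute_assertions", compute_assertions),
        ("evaluate_conditions", evaluate_conditions),
        ("execute_actions", execute_actions),
    ]


def run(workflow, item):
    """Feed each step's output into the next step."""
    value = item
    for _, step in workflow:
        value = step(value)
    return value


workflow = compile_quality_view(quality_view)
print([step_name for step_name, _ in workflow])
print(run(workflow, {"id": "P33897", "HR": 0.5, "MC": 30, "ELDP": 2}))
```

Because every generated workflow has the same four-step shape, code that inspects or explains a run can rely on that shape rather than handle arbitrary graphs.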
Example: original proteomics workflow
  [Figure: workflow diagram marking the quality flow embedding point]
Example: embedded quality workflow
Qurator provenance component
  • Specialised for quality workflows
  • Scope:
      • workflow run
      • data being quality assessed
      • quality metrics applied to the data
      • value of metric on the data
      • evidence used to compute metrics
      • quality rules based on metrics values
      • statistics
Semantics of quality processors
  • upper ontology for Information Quality
  • extensions to the proteomics domain
      • services and data
Provenance model
  • Provenance elements are individuals of ontology classes
      • OWL ontology => RDF provenance data
  • Static model: RDF graph
      • workflow graph structure, services
      • auto-generated along with the quality workflow itself
  • Dynamic model: RDF graph
      • populated during workflow execution
      • RDF resources can be elements of the static model
      • data values are literals
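The split between the two graphs can be illustrated with plain Python triples. Everything named here (the `wf:`, `run:` and `q:` prefixes, the class and property names, the `Literal` wrapper, the score value) is invented for the example and is not the Qurator vocabulary. The point is only that dynamic-model statements refer either to resources defined in the static model or to literal data values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A data value, as opposed to a named resource."""
    value: str


# Static model: workflow structure, generated together with the workflow.
static_model = {
    ("wf:PMFScoreProcessor", "rdf:type", "q:QualityAssertion"),
    ("wf:FilterAction", "rdf:type", "q:Action"),
    ("wf:FilterAction", "q:dependsOn", "wf:PMFScoreProcessor"),
}

# Dynamic model: populated while the workflow runs.
dynamic_model = {
    ("run:b1", "rdf:type", "q:AssertionBinding"),
    ("run:b1", "q:processor", "wf:PMFScoreProcessor"),  # resource from the static model
    ("run:b1", "q:dataItem", Literal("P33897")),         # data value as a literal
    ("run:b1", "q:value", Literal("100.0")),
}


def static_references(dynamic, static):
    """Return the static-model resources that the dynamic model points at."""
    static_subjects = {s for s, _, _ in static}
    return {o for _, _, o in dynamic
            if not isinstance(o, Literal) and o in static_subjects}


def literal_values(dynamic):
    """Return the literal data values recorded during the run."""
    return {o.value for _, _, o in dynamic if isinstance(o, Literal)}


print(static_references(dynamic_model, static_model))
print(sorted(literal_values(dynamic_model)))
```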
Static model (fragment)
Dynamic model (fragment)
  • Return all action outcomes for a given workflow and data item:
      SELECT ?action ?outcome ?workflow
      WHERE {
        ?binding data_item "P33897" .
        ?binding action name ?action .
        ?binding value ?outcome .
        ?binding workflow ?workflow .
        ?binding rdf:type "actionBinding" .
        FILTER (regex(?workflow, "4IPQF26RXW2"))
      }
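What the query asks for can be mimicked over an in-memory list of triples, without a SPARQL engine. The sketch below is a hand-rolled pattern match in plain Python, not Jena ARQ. The binding ids, the action name, the outcome values and the second data item are made up; `action_name` is an assumed spelling of the predicate shown as "action name" on the slide; and each binding is assumed to carry at most one value per predicate.

```python
import re
from collections import defaultdict

# Hypothetical dynamic-model content; only "P33897" and "4IPQF26RXW2" come from the slide.
triples = [
    ("b1", "rdf:type", "actionBinding"),
    ("b1", "data_item", "P33897"),
    ("b1", "action_name", "filter"),
    ("b1", "value", "accept"),
    ("b1", "workflow", "urn:wf:4IPQF26RXW2"),
    ("b2", "rdf:type", "actionBinding"),
    ("b2", "data_item", "Q_EXAMPLE"),
    ("b2", "action_name", "filter"),
    ("b2", "value", "reject"),
    ("b2", "workflow", "urn:wf:4IPQF26RXW2"),
    ("b3", "rdf:type", "actionBinding"),
    ("b3", "data_item", "P33897"),
    ("b3", "action_name", "filter"),
    ("b3", "value", "reject"),
    ("b3", "workflow", "urn:wf:OTHER_RUN"),
]

REQUIRED = {"rdf:type", "data_item", "action_name", "value", "workflow"}


def action_outcomes(triples, data_item, workflow_pattern):
    """Return (action, outcome, workflow) rows for one data item and workflow."""
    # Group the properties of each binding (assumes single-valued predicates).
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o

    rows = []
    for props in by_subject.values():
        if not REQUIRED <= props.keys():
            continue  # a binding missing any pattern would not match the query
        if (props["rdf:type"] == "actionBinding"
                and props["data_item"] == data_item
                and re.search(workflow_pattern, props["workflow"])):
            rows.append((props["action_name"], props["value"], props["workflow"]))
    return rows


rows = action_outcomes(triples, "P33897", "4IPQF26RXW2")
print(rows)
```

Each row answers the question "which action was applied to this data item in this run, and with what outcome?", which is the explanation the provenance component is meant to supply.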
Provenance service interface
  • Java SPARQL API (Jena ARQ)
      • GUI shown earlier is an example
  • Queries are straightforward SPARQL
      • 3-layer workflow pattern => no recursion
Conclusions
  • An experiment in “semantic provenance”, restricted to quality workflows
  • Semantic service annotations => high-level provenance query / presentation
  • Key enabler: workflow is the result of a compilation step
      • regular pattern facilitates analysis / presentation
  • Speculative conclusion: workflows are targets, not sources...
      • model-driven generation of workflows has benefits and will become increasingly common

Paper presentation @IPAW'08

Editor's Notes

  • #6 Searching for “nuggets of quality knowledge”
  • #10 Embedding the sub-flow requires a deployment descriptor: adapters between the host flow and the quality subflow, plus data and control links between host flow tasks and quality flow tasks