Paper presentation @IPAW'08

Exploiting provenance to make sense of automated decisions in scientific workflows

  • Searching for “nuggets of quality knowledge”
  • Embedding the sub-flow requires a deployment descriptor:
      • adapters between host flow and quality subflow
      • data and control links between host flow tasks and quality flow tasks

Presentation Transcript

  • Exploiting provenance to make sense of automated decisions in scientific workflows
      • Paolo Missier, Suzanne Embury, Richard Stapenhurst
      • Information Management Group
      • School of Computer Science
      • The University of Manchester, UK
  • Outline
    • Setting: the problem of quality control in scientific workflows
      • the Qurator project (2004-2007)
    • Quality control is an automated decision process
      • accept/reject data based on user-defined criteria
      • part of the workflow => quality workflow
    • Role of workflow provenance in explaining automated decisions
      • why was data element X accepted/rejected?
  • Scope of provenance analysis
    • Model-driven quality workflows:
      • automatically generated from a specification
      • makes for a predictable workflow structure
    • Services in quality workflows are semantically annotated
    • The provenance data model exploits the semantics:
      • provenance queries leverage the ontology
      • provenance elements explained in ontology terms
  • Motivation
    • Scientific workflows accelerate the rate at which results are produced
    • Quality control on the results becomes paramount
      • automation / high throughput limit the options for systematic human inspection
      • use of public resources (data, services) may introduce noise: e.g. dirty data
    • Risk of producing invalid results
      • but: quality metrics vary with data and application domain
  • Example: protein identification process
    • Pipeline: “Wet lab” experiment => data output => protein identification algorithm => Protein Hitlist => protein function prediction
    • Correct entry => true positive
    • Evidence:
      • mass coverage (MC) measures the amount of protein sequence matched
      • Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
      • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
    • This evidence is independent of the algorithm / SW package
    • It is readily available and inexpensive to obtain
  • Quality process components
    • The Qurator hypothesis [VLDB06]
    • quality controls have a common process representation
      • regardless of their specific data and application domain
    • Quality assertion:
      • PMF score = (HR × 100) + MC + (ELDP × 10)
    • Evidence:
      • mass coverage (MC)
      • Hit ratio (HR)
      • ELDP
    • Action rules:
      • if (score < x) then reject
    • Process steps: collect evidence, compute assertions, evaluate conditions, execute actions
    • Pipeline: protein identification => Protein Hitlist => quality filtering => protein function prediction
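The assertion and action rule on this slide can be sketched in a few lines of Python. This is an illustration only, not Qurator code: the `Evidence` class, the `quality_filter` function, the threshold value, the second hit identifier and all evidence values are hypothetical. Only the score formula and the "reject below a threshold" rule come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: Evidence) -> float:
    """Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)."""
    return e.hit_ratio * 100 + e.mass_coverage + e.eldp * 10


def quality_filter(hitlist, threshold):
    """Apply the action rule 'if (score < x) then reject' to every hit.

    Returns two lists of (protein_id, score): accepted and rejected.
    """
    accepted, rejected = [], []
    for protein_id, evidence in hitlist.items():
        score = pmf_score(evidence)
        if score < threshold:
            rejected.append((protein_id, score))
        else:
            accepted.append((protein_id, score))
    return accepted, rejected


# Evidence values below are invented for illustration.
hitlist = {
    "P33897": Evidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "EXAMPLE_HIT": Evidence(hit_ratio=0.25, mass_coverage=5.0, eldp=0.0),
}
accepted, rejected = quality_filter(hitlist, threshold=50.0)
```

Keeping the score next to each decision is what later lets a provenance query answer "why was data element X rejected?".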
  • From quality processes to quality workflows
    • Approach in practice:
    • users provide a declarative specification of an abstract quality process (a “Quality View”)
    • The abstract process is automatically translated into a quality workflow
      • this makes arbitrary Taverna workflows “quality-aware”
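The translation step can be pictured as a small compiler from a declarative specification to an ordered task list. The sketch below is hypothetical throughout: the dictionary layout, the task names and `compile_quality_view` are assumptions made for illustration, and the slides do not show the actual Quality View syntax or the Taverna workflow that Qurator generates.

```python
def compile_quality_view(view):
    """Translate a declarative quality view into an ordered list of tasks.

    The output always follows the same layered pattern:
    collect evidence, compute assertions, evaluate conditions, execute actions.
    """
    tasks = []
    for name in view["evidence"]:
        tasks.append(("collect_evidence", name))
    for name in view["assertions"]:
        tasks.append(("compute_assertion", name))
    for condition, _action in view["rules"]:
        tasks.append(("evaluate_condition", condition))
    for _condition, action in view["rules"]:
        tasks.append(("execute_action", action))
    return tasks


# Hypothetical quality view for the proteomics example.
view = {
    "evidence": ["HR", "MC", "ELDP"],
    "assertions": ["PMF score"],
    "rules": [("score < x", "reject")],
}
workflow = compile_quality_view(view)
```

Because every specification compiles to the same layered shape, the structure of the generated workflow is predictable, which is the property the provenance analysis relies on.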
  • Example: original proteomics workflow
    • Quality flow embedding point
  • Example: embedded quality workflow
  • Qurator provenance component
    • Specialised for quality workflows
    • Scope:
      • workflow run
      • data being quality assessed
      • quality metrics applied to the data
      • value of metric on the data
      • evidence used to compute metrics
      • quality rules based on metrics values
      • statistics
  • Semantics of quality processors
    • upper ontology for Information Quality
    • extensions to the proteomics domain
    • services and data
  • Provenance model
    • Provenance elements are individuals of ontology classes
      • OWL ontology => RDF provenance data
    • Static model – RDF graph
      • workflow graph structure, services
      • auto-generated along with the quality workflow itself
    • Dynamic model – RDF graph
      • populated during workflow execution
      • RDF resources can be elements of the static model
      • data values are literals
  • Static model (fragment)
  • Dynamic model (fragment)
    • Return all action outcomes for a given workflow and data item:
        SELECT ?action ?outcome ?workflow
        WHERE {
          ?binding data_item "P33897" .
          ?binding action name ?action .
          ?binding value ?outcome .
          ?binding workflow ?workflow .
          ?binding rdf:type "actionBinding" .
          FILTER (regex(?workflow, "4IPQF26RXW2"))
        }
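To show what the query on this slide selects, here is a minimal pure-Python stand-in that matches the same pattern against an in-memory list of triples. It is not the Jena ARQ interface and uses no RDF library. The triples, the binding identifiers, the `action_name` spelling of the predicate, the `run/...` and `run/OTHER` workflow values and the `action_outcomes` function are all assumptions made for illustration. For simplicity it also assumes each binding has at most one value per predicate.

```python
import re

# Hypothetical fragment of the dynamic model as (subject, predicate, object).
triples = [
    ("b1", "rdf:type", "actionBinding"),
    ("b1", "data_item", "P33897"),
    ("b1", "action_name", "filter"),
    ("b1", "value", "reject"),
    ("b1", "workflow", "run/4IPQF26RXW2"),
    # Same data item, different workflow run: excluded by the regex filter.
    ("b2", "rdf:type", "actionBinding"),
    ("b2", "data_item", "P33897"),
    ("b2", "action_name", "filter"),
    ("b2", "value", "accept"),
    ("b2", "workflow", "run/OTHER"),
    # Same run, different data item: excluded by the data_item pattern.
    ("b3", "rdf:type", "actionBinding"),
    ("b3", "data_item", "EXAMPLE_HIT"),
    ("b3", "action_name", "filter"),
    ("b3", "value", "accept"),
    ("b3", "workflow", "run/4IPQF26RXW2"),
]

REQUIRED = ("rdf:type", "data_item", "action_name", "value", "workflow")


def action_outcomes(triples, data_item, workflow_pattern):
    """Return (action, outcome, workflow) for every matching action binding."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    results = []
    for props in by_subject.values():
        # Like a SPARQL basic graph pattern, every predicate must be present.
        if not all(key in props for key in REQUIRED):
            continue
        if (props["rdf:type"] == "actionBinding"
                and props["data_item"] == data_item
                and re.search(workflow_pattern, props["workflow"])):
            results.append(
                (props["action_name"], props["value"], props["workflow"])
            )
    return results


outcomes = action_outcomes(triples, "P33897", "4IPQF26RXW2")
```

The lookup is a single flat match per binding, with no traversal of the workflow graph.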
  • Provenance service interface
    • Java SPARQL API (Jena ARQ)
      • GUI shown earlier is an example
    • Queries are straightforward SPARQL
      • 3-layer workflow pattern => no recursion
  • Conclusions
    • An experiment in “semantic provenance”
      • restricted to quality workflows
    • Semantic service annotations => high-level provenance query / presentation
    • Key enabler:
      • workflow is the result of a compilation step
      • regular pattern facilitates analysis / presentation
    • Speculative conclusion:
      • workflows are targets, not sources...
      • model-driven generation of workflows has benefits and will become increasingly common