Paper presentation @IPAW'08
Exploiting provenance to make sense of automated decisions in scientific workflows

  • Searching for “nuggets of quality knowledge”
  • Embedding the sub-flow requires a deployment descriptor:
    • adapters between the host flow and the quality sub-flow
    • data and control links between host-flow tasks and quality-flow tasks

Paper presentation @IPAW'08: Presentation Transcript

  • Exploiting provenance to make sense of automated decisions in scientific workflows
      • Paolo Missier, Suzanne Embury, Richard Stapenhurst
      • Information Management Group
      • School of Computer Science
      • The University of Manchester, UK
  • Outline
    • Setting: the problem of quality control in scientific workflows
      • the Qurator project (2004-2007)
    • Quality control is an automated decision process
      • accept /reject data based on user-defined criteria
      • implemented as part of the workflow itself: the quality workflow
    • Role of workflow provenance in explaining automated decisions
      • why was data element X accepted/rejected?
  • Scope of provenance analysis
    • Model-driven quality workflows:
      • automatically generated from a specification
      • makes for a predictable workflow structure
    • Services in quality workflows are semantically annotated
    • The provenance data model exploits the semantics:
      • provenance queries leverage the ontology
      • provenance elements explained in ontology terms
  • Motivation
    • Scientific workflows accelerate the rate at which results are produced
    • Quality control on the results becomes paramount
      • automation / high throughput limit the options for systematic human inspection
      • use of public resources (data, services) may introduce noise: e.g. dirty data
    • Risk of producing invalid results
      • but: quality metrics vary with data and application domain
  • Example: protein identification process
    • Pipeline: “wet lab” experiment → protein identification algorithm → protein hitlist → protein function prediction
    • A correct entry in the hitlist is a true positive
    • Evidence:
      • mass coverage (MC) measures the amount of protein sequence matched
      • hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
      • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
    • This evidence is independent of the algorithm / SW package, and is readily available and inexpensive to obtain
  • Quality process components
    • The Qurator hypothesis [VLDB06]
    • quality controls have a common process representation
      • regardless of their specific data and application domain
    • Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
    • Evidence:
      • mass coverage (MC)
      • hit ratio (HR)
      • ELDP
    • Action rules: if (score < x) then reject
    • Process stages: collect evidence → compute assertions → evaluate conditions → execute actions
    • Instantiated in the example: protein identification → protein hitlist → quality filtering → protein function prediction
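To make the four-stage process concrete, here is a minimal Python sketch of the PMF example; function names, field names, the sample hits, and the threshold are all illustrative, not taken from the Qurator implementation:

```python
# Hedged sketch of the generic quality process: collect evidence,
# compute a quality assertion, then execute an action rule.
# All names and the threshold value are illustrative assumptions.

def collect_evidence(hit):
    # Evidence from the slides: hit ratio (HR), mass coverage (MC), ELDP.
    return {"HR": hit["HR"], "MC": hit["MC"], "ELDP": hit["ELDP"]}

def compute_assertion(ev):
    # Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
    return ev["HR"] * 100 + ev["MC"] + ev["ELDP"] * 10

def execute_action(score, threshold):
    # Action rule: if (score < x) then reject, else accept.
    return "reject" if score < threshold else "accept"

hitlist = [
    {"id": "P33897", "HR": 0.4, "MC": 30, "ELDP": 2},   # hypothetical values
    {"id": "Q99999", "HR": 0.1, "MC": 5, "ELDP": 0},
]
for hit in hitlist:
    score = compute_assertion(collect_evidence(hit))
    print(hit["id"], score, execute_action(score, threshold=50))
```

The point of the sketch is the shape, not the numbers: the same collect / assert / act structure applies regardless of the data and application domain, which is the Qurator hypothesis.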
  • From quality processes to quality workflows
    • Approach in practice:
    • users provide a declarative specification of an abstract quality process (a “Quality View”)
    • The abstract process is automatically translated into a quality workflow
      • this makes arbitrary Taverna workflows “quality-aware”
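A toy illustration of the compilation idea; the real Quality View language and the Taverna workflows Qurator generates are far richer, and every name below is an assumption for illustration only:

```python
# Toy sketch: "compiling" a declarative Quality View specification into
# an ordered list of workflow tasks following the fixed three-stage
# pattern (collect evidence, compute assertion, execute action).
# The spec keys and task labels are illustrative, not Qurator's syntax.

QUALITY_VIEW = {
    "evidence": ["HR", "MC", "ELDP"],
    "assertion": "PMFScore",
    "action": "filter(score < 50)",
}

def compile_quality_view(spec):
    tasks = [f"collect:{e}" for e in spec["evidence"]]
    tasks.append(f"assert:{spec['assertion']}")
    tasks.append(f"act:{spec['action']}")
    return tasks

print(compile_quality_view(QUALITY_VIEW))
```

Because the output always follows the same fixed pattern, the structure of every generated quality workflow is predictable, which is what later makes provenance queries over it simple.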
  • Example: original proteomics workflow, with the quality-flow embedding point marked
  • Example: embedded quality workflow
  • Qurator provenance component
    • Specialised for quality workflows
    • Scope:
      • workflow run
      • data being quality-assessed
      • quality metrics applied to the data
      • value of each metric on the data
      • evidence used to compute the metrics
      • quality rules based on metric values
      • statistics
  • Semantics of quality processors
    • upper ontology for Information Quality
    • extensions to the proteomics domain: services and data
  • Provenance model
    • Provenance elements are individuals of ontology classes
      • OWL ontology => RDF provenance data
    • Static model – RDF graph
      • workflow graph structure, services
      • auto-generated along with the quality workflow itself
    • Dynamic model – RDF graph
      • populated during workflow execution
      • RDF resources can be elements of the static model
      • data values are literals
  • Static model (fragment)
  • Dynamic model (fragment)
    • Example query: return all action outcomes for a given workflow and data item:

      SELECT ?action ?outcome ?workflow
      WHERE {
        ?binding data_item "P33897" .
        ?binding action_name ?action .
        ?binding value ?outcome .
        ?binding workflow ?workflow .
        ?binding rdf:type "actionBinding" .
        FILTER (regex(?workflow, "4IPQF26RXW2"))
      }
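To show the shape of this lookup without an RDF store, here is a stdlib-only Python emulation over an in-memory triple list; the predicate names and sample triples are simplified assumptions, not the actual Qurator RDF vocabulary:

```python
import re

# Simplified dynamic-model triples: (subject, predicate, object).
# Predicate names and values are illustrative, not Qurator's vocabulary.
triples = [
    ("b1", "data_item", "P33897"),
    ("b1", "action_name", "reject_filter"),
    ("b1", "value", "reject"),
    ("b1", "workflow", "run-4IPQF26RXW2"),
    ("b1", "rdf:type", "actionBinding"),
    ("b2", "data_item", "Q99999"),
    ("b2", "rdf:type", "actionBinding"),
]

def action_outcomes(data_item, workflow_pattern):
    """Mimic the SPARQL query: find action bindings for a data item
    whose workflow id matches a regex, and return (action, outcome)."""
    results = []
    # Subjects that carry the requested data item.
    for b in {s for s, p, o in triples
              if p == "data_item" and o == data_item}:
        props = {p: o for s, p, o in triples if s == b}
        if props.get("rdf:type") == "actionBinding" and \
           re.search(workflow_pattern, props.get("workflow", "")):
            results.append((props["action_name"], props["value"]))
    return results

print(action_outcomes("P33897", "4IPQF26RXW2"))
```

Each `?binding` variable in the SPARQL query corresponds to a subject here, and the FILTER clause corresponds to the `re.search` on the workflow identifier.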
  • Provenance service interface
    • Java SPARQL API (Jena ARQ)
      • GUI shown earlier is an example
    • Queries are straightforward SPARQL
      • 3-layer workflow pattern => no recursion
  • Conclusions
    • An experiment in “semantic provenance”
      • restricted to quality workflows
    • Semantic service annotations => high-level provenance query / presentation
    • Key enabler: the workflow is the result of a compilation step
      • regular pattern facilitates analysis / presentation
    • Speculative conclusion:
      • workflows are targets, not sources...
      • model-driven generation of workflows has benefits and will become increasingly common