Paper presentation @IPAW'08
Exploiting provenance to make sense of automated decisions in scientific workflows

  • Searching for “nuggets of quality knowledge”
  • Embedding the sub-flow requires a deployment descriptor: adapters between the host flow and the quality subflow, plus data and control links between host flow tasks and quality flow tasks
  • Transcript

    • 1. Exploiting provenance to make sense of automated decisions in scientific workflows
        • Paolo Missier, Suzanne Embury, Richard Stapenhurst
        • Information Management Group
        • School of Computer Science
        • The University of Manchester, UK
    • 2. Outline
      • Setting: the problem of quality control in scientific workflows
        • the Qurator project (2004-2007)
      • Quality control is an automated decision process
        • accept/reject data based on user-defined criteria
        • part of the workflow => quality workflow
      • Role of workflow provenance in explaining automated decisions
        • why was data element X accepted/rejected?
    • 3. Scope of provenance analysis
      • Model-driven quality workflows:
        • automatically generated from a specification
        • makes for a predictable workflow structure
      • Services in quality workflows are semantically annotated
      • The provenance data model exploits the semantics:
        • provenance queries leverage the ontology
        • provenance elements explained in ontology terms
    • 4. Motivation
      • Scientific workflows accelerate the rate at which results are produced
      • Quality control on the results becomes paramount
        • automation / high throughput limit the options for systematic human inspection
        • use of public resources (data, services) may introduce noise: e.g. dirty data
      • Risk of producing invalid results
        • but: quality metrics vary with data and application domain
    • 5. Example: protein identification process
      • Diagram labels: “Wet lab” experiment, data output, protein identification algorithm, Protein Hitlist, protein function prediction
      • Correct entry => true positive
      • Evidence:
        • mass coverage (MC) measures the amount of protein sequence matched
        • hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
        • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
      • This evidence is independent of the algorithm / SW package
      • It is readily available and inexpensive to obtain
    • 6. Quality process components
      • The Qurator hypothesis [VLDB06]
      • quality controls have a common process representation
        • regardless of their specific data and application domain
      • Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
      • Evidence:
        • mass coverage (MC)
        • hit ratio (HR)
        • ELDP
      • Action rules: if (score < x) then reject
      • Process steps (diagram labels): collect evidence, compute assertions, evaluate conditions, execute actions
      • Diagram context: protein identification, Protein Hitlist, quality filtering, protein function prediction
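The quality assertion and action rule on this slide can be sketched in Python. This is a minimal illustration, not the Qurator implementation: the `Evidence` class, the function names, the sample hits, and the threshold of 50 standing in for the slide's unspecified `x` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Quality evidence for one protein hit (field names are illustrative)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(ev: Evidence) -> float:
    """Quality assertion from the slide: (HR x 100) + MC + (ELDP x 10)."""
    return ev.hit_ratio * 100 + ev.mass_coverage + ev.eldp * 10


def apply_action(ev: Evidence, threshold: float) -> str:
    """Action rule from the slide: if (score < x) then reject."""
    return "reject" if pmf_score(ev) < threshold else "accept"


# Hypothetical hit list; names and values are made up for illustration.
hitlist = {
    "hit_a": Evidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "hit_b": Evidence(hit_ratio=0.25, mass_coverage=5.0, eldp=1.0),
}

# Assumed threshold x = 50; the slide leaves x unspecified.
decisions = {name: apply_action(ev, threshold=50.0) for name, ev in hitlist.items()}
print(decisions)
```

Keeping the score computation separate from the accept/reject rule mirrors the slide's split between assertions and actions, so the threshold can change without touching the metric.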
    • 7. From quality processes to quality workflows
      • Approach in practice:
      • users provide a declarative specification of an abstract quality process (a “Quality View”)
      • The abstract process is automatically translated into a quality workflow
        • this makes arbitrary Taverna workflows “quality-aware”
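One way to picture the translation step is as a small compiler from a declarative specification to an ordered task list. The sketch below is hypothetical: the dictionary layout of the Quality View, the task-name strings, and `compile_quality_view` are assumptions made for illustration and do not reflect Qurator's actual specification language or the Taverna workflow format.

```python
from typing import Dict, List


def compile_quality_view(view: Dict[str, List[str]]) -> List[str]:
    """Translate a declarative quality view into an ordered list of task names.

    The fixed ordering follows the common quality process: collect evidence,
    compute assertions, evaluate conditions, execute actions.
    """
    tasks = [f"collect_evidence:{e}" for e in view["evidence"]]
    tasks += [f"compute_assertion:{a}" for a in view["assertions"]]
    tasks += [f"evaluate_condition:{c}" for c in view["conditions"]]
    tasks += [f"execute_action:{act}" for act in view["actions"]]
    return tasks


# Hypothetical Quality View built from the proteomics example in the slides.
quality_view = {
    "evidence": ["HR", "MC", "ELDP"],
    "assertions": ["PMF score"],
    "conditions": ["score < x"],
    "actions": ["reject"],
}

workflow_tasks = compile_quality_view(quality_view)
for task in workflow_tasks:
    print(task)
```

Because every generated task list has the same shape, any later analysis can rely on that structure rather than inspecting arbitrary hand-built workflows.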
    • 8. Example: original proteomics workflow
      • Diagram label: quality flow embedding point
    • 9. Example: embedded quality workflow
    • 10. Qurator provenance component
      • Specialised for quality workflows
      • Scope:
        • workflow run
        • data being quality assessed
        • quality metrics applied to the data
        • value of metric on the data
        • evidence used to compute metrics
        • quality rules based on metrics values
        • statistics
    • 11. Semantics of quality processors
      • upper ontology for Information Quality
      • extensions to the proteomics domain
      • services and data
    • 12. Provenance model
      • Provenance elements are individuals of ontology classes
        • OWL ontology => RDF provenance data
      • Static model – RDF graph
        • workflow graph structure, services
        • auto-generated along with the quality workflow itself
      • Dynamic model – RDF graph
        • populated during workflow execution
        • RDF resources can be elements of the static model
        • data values are literals
    • 13. Static model (fragment)
    • 14. Dynamic model (fragment)
      • Return all action outcomes for a given workflow and data item:
        SELECT ?action ?outcome ?workflow
        WHERE {
          ?binding data_item "P33897" .
          ?binding action name ?action .
          ?binding value ?outcome .
          ?binding workflow ?workflow .
          ?binding rdf:type "actionBinding" .
          FILTER (regex(?workflow, "4IPQF26RXW2"))
        }
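To make the logic of this query concrete without an RDF store, the sketch below mimics it in plain Python over an in-memory list of triples. The triple data, the binding identifiers, the action and outcome values, and the `action_outcomes` helper are hypothetical; the slides describe the actual provenance service as a Java SPARQL API (Jena ARQ), which this does not reproduce.

```python
import re
from collections import defaultdict

# Hypothetical dynamic-model triples: (subject, predicate, object).
# Predicate names follow the slide; all values are made up except the
# data item and workflow identifier quoted in the query.
TRIPLES = [
    ("b1", "rdf:type", "actionBinding"),
    ("b1", "data_item", "P33897"),
    ("b1", "action name", "filter"),
    ("b1", "value", "accept"),
    ("b1", "workflow", "wf-4IPQF26RXW2"),
    ("b2", "rdf:type", "actionBinding"),
    ("b2", "data_item", "P33897"),
    ("b2", "action name", "filter"),
    ("b2", "value", "reject"),
    ("b2", "workflow", "wf-OTHERRUN"),
    ("b3", "rdf:type", "actionBinding"),
    ("b3", "data_item", "Q99999"),
    ("b3", "action name", "filter"),
    ("b3", "value", "reject"),
    ("b3", "workflow", "wf-4IPQF26RXW2"),
]


def action_outcomes(triples, data_item, workflow_pattern):
    """Return (action, outcome, workflow) rows for matching action bindings.

    Assumes each binding has at most one value per predicate.
    """
    props = defaultdict(dict)
    for subject, predicate, obj in triples:
        props[subject][predicate] = obj

    required = ("action name", "value", "workflow")
    rows = []
    for node in props.values():
        if not all(key in node for key in required):
            continue  # a missing property means the graph pattern does not match
        if (node.get("rdf:type") == "actionBinding"
                and node.get("data_item") == data_item
                and re.search(workflow_pattern, node["workflow"])):
            rows.append((node["action name"], node["value"], node["workflow"]))
    return rows


rows = action_outcomes(TRIPLES, "P33897", "4IPQF26RXW2")
print(rows)
```

Each condition in the `if` corresponds to one triple pattern or the `FILTER` clause of the query; no recursion over the graph is needed because every property hangs directly off the binding node.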
    • 15. Provenance service interface
      • Java SPARQL API (Jena ARQ)
        • GUI shown earlier is an example
      • Queries are straightforward SPARQL
        • 3-layer workflow pattern => no recursion
    • 16. Conclusions
      • An experiment in “semantic provenance”
        • restricted to quality workflows
      • Semantic service annotations => high-level provenance query / presentation
      • Key enabler: the workflow is the result of a compilation step
        • regular pattern facilitates analysis / presentation
      • Speculative conclusion:
        • workflows are targets, not sources...
        • model-driven generation of workflows has benefits and will become increasingly common