Invited talk @Roma La Sapienza, April '07

Modelling and computing the quality of information in e-science
(a Qurator talk)

  • From traditional DQ to the biologist’s problem of defining quality based on data semantics
  • Data produced for the first time: mention the evolution of experimental techniques. Its production is not streamlined, and there is no agreement on how to define its quality
  • Searching for “nuggets of quality knowledge”
  • Here is the compilation model for mapping bound views to a sub-workflow
  • Embedding the sub-flow requires a deployment descriptor: adapters between the host flow and the quality sub-flow, and data and control links between host-flow tasks and quality-flow tasks
  • Activated during execution of the quality sub-flow – blocks the workflow for the duration of the interaction
  • Our quality view specification language allows users to define abstract quality processes. Evidence types are ontology classes; evidence values are class individuals, represented by variables. These variables are bound to values at runtime; the values themselves are either fetched from a repository of persistent annotations, or computed on demand by annotation functions (our use cases include examples of both). This process step abstracts away the issue of annotation lifetime. Assertions are computed by services, which are also represented by ontology classes; the tagName is the single output of the service (one for each input data item). Finally, the action step contains the condition/action pairs. Conditions are expressed over the variables introduced earlier, which define their scope. The semantics of the action step is that the expression is evaluated for each data item, and the corresponding action is taken, e.g. the item is sent to a specific channel
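  • The runtime semantics just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual Qurator API: the function names, the evidence variables, and the classifier thresholds are all assumptions made for the example.

```python
# Illustrative sketch of quality-view runtime semantics (NOT the Qurator API).
# Per data item: bind evidence variables, compute the assertion's output tag,
# evaluate the condition, and route the item to a channel.

def fetch_or_compute_evidence(item):
    """Bind evidence variables: fetch persistent annotations if present,
    otherwise default/compute on demand (annotation functions assumed)."""
    return {"Coverage": item.get("Coverage", 0),
            "PeptidesCount": item.get("PeptidesCount", 0)}

def score_classifier(evidence):
    """Assumed quality-assertion service: one output tag per input item."""
    if evidence["Coverage"] > 20:
        return "q:high"
    if evidence["Coverage"] > 12:
        return "q:mid"
    return "q:low"

def run_quality_view(items):
    channels = {"accept": [], "reject": []}
    for item in items:
        evidence = fetch_or_compute_evidence(item)
        tag = score_classifier(evidence)  # tagName = ScoreClass
        # Condition/action pair, evaluated once per data item:
        if tag in {"q:high", "q:mid"} and evidence["Coverage"] > 12:
            channels["accept"].append(item)
        else:
            channels["reject"].append(item)
    return channels
```

  The routing step mirrors the condition/action semantics of the action step: each item is sent to exactly one channel.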
  • Benefits of this model: the ability to share definitions within a community; consistency checking through reasoning (cite previous papers?); flexibility
  • From right to left: data/knowledge layer, framework services, quality views management, targeted compiler(s)

    1. Modelling and computing the quality of information in e-science. Paolo Missier, Suzanne Embury, Mark Greenwood, School of Computer Science, University of Manchester, UK; Alun Preece, Binling Jin, Department of Computing Science, University of Aberdeen, UK. http://www.qurator.org Roma, 3/4/07
    2. Quality of data: data quality control in the data management practice
       • Main driver, historically: data cleaning for
         • Integration: use of the same IDs across data sources
         • Warehousing, analytics: restore completeness, reconcile referential constraints, cross-validation of numeric data by aggregation
       • Focus:
         • Record de-duplication, reconciliation, “linkage” (ample literature; see e.g. the Nov 2006 issue of IEEE TKDE)
         • Consistency of data across sources
         • Managing uncertainty in databases (Trio, Stanford)
    3. Common quality issues
       • Completeness: not missing any of the results
       • Correctness: each data item should reflect the actual real-world entity that it is intended to model (the actual address where you live, the correct balance in your bank account…)
       • Timeliness: delivered in time for use by a consumer process (e.g. stock information)
       • …
    4. Taxonomy for data quality dimensions
    5. Our motivation: quality in public e-science data
       • Large volumes of data in many public repositories (GenBank, UniProt, EnsEMBL, Entrez, dbSNP)
       • Increasingly creative uses for this data
       Problem: using third-party data of unknown quality may result in misleading scientific conclusions
    6. Some quality issues in biology
       • “Quality” covers a broader spectrum of issues than traditional DQ
       • “X% of database A may be wrong (unreliable), but I have no easy way to test that”
       • “This microarray data looks ok but is testing the wrong hypothesis”
       • The output from this sequence-matching algorithm produces false positives
       • …
       Each of these issues calls for a separate testing procedure; difficult to generalize
    7. Correctness in biology - examples
       • Data type: UniProt protein annotation; creation process: manual curation; correctness: functional annotation f for protein p is correct if function f can reliably be attributed to p
       • Data type: transcriptomics, gene expression report (up/down-regulation); creation process: microarray data analysis; correctness: no false positives, no false negatives
       • Data type: qualitative proteomics, protein identification; creation process: generate peptide peak lists, match peak lists (e.g. Imprint); correctness: no false positives (every protein in the output is actually present in the cell sample)
    8. Defining quality in e-science is challenging
       • In-silico experiments express cutting-edge research: experimental data is liable to change rapidly, and definitions of quality are themselves experimental
       • Scientists’ quality requirements are often just a hunch: quality tests are missing or based on experimental heuristics, and definitions of quality criteria are personal and subjective
       • Quality controls are tightly coupled to data processing: often implicit and embedded in the experiment, and not reusable
       “Quality”: personal criteria for data acceptability
    9. Research goals
       • Make personal definitions of quality explicit and formal: identify a common denominator for quality concepts, expressed as a conceptual model for Information Quality
       • Elicit “nuggets” of latent quality knowledge from the experts
       • Make existing data processing quality-aware: define an architectural framework that accommodates personal definitions of quality; compute quality levels and expose them to the user
    10. Example: protein identification
        Pipeline: “wet lab” experiment → protein identification algorithm → data output (protein hitlist) → protein function prediction. A correct entry is a true positive.
        Evidence:
        • Mass coverage (MC) measures the amount of protein sequence matched
        • Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
        • ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
        This evidence is independent of the algorithm / SW package; it is readily available and inexpensive to obtain.
    11. Correctness of protein identification
        Estimator function (computes a score rather than a probability):
        PMF score = (HR × 100) + MC + (ELDP × 10)
        Prediction performance, comparing 3 models: ROC curve, true positives vs false positives
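        The estimator above is a plain linear combination of the three evidence values, so it can be written directly as a function; the weights are those given in the talk, while the example hitlist values are made up for illustration:

```python
# PMF score from the slide: a score, not a probability.
# hr = hit ratio, mc = mass coverage, eldp = digestion-completeness measure.
def pmf_score(hr, mc, eldp):
    return (hr * 100) + mc + (eldp * 10)

# Ranking a hitlist by score (illustrative data: (protein, HR, MC, ELDP)):
hits = [("P1", 0.5, 25, 2), ("P2", 0.25, 10, 1)]
ranked = sorted(hits, key=lambda h: pmf_score(*h[1:]), reverse=True)
```

        Because the score only induces a ranking, comparing models on ROC curves (as the slide does) is the natural way to assess it.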
    12. Quality process components
        Pipeline with quality control: “wet lab” experiment → protein identification algorithm → data output (protein hitlist) → quality filtering → protein function prediction.
        • Evidence: mass coverage (MC), hit ratio (HR), ELDP
        • Quality assertion: PMF score = (HR × 100) + MC + (ELDP × 10)
        Goal: to automatically add the additional filtering step in a principled way
    13. Quality Assertions
        • QA(D): any function of evidence (metadata for D) that computes equivalence classes on D
        • Score model (total or partial order)
        • Classification model: [figure: dataset D partitioned into quality-equivalent regions A, B, C]
        • Actions associated to regions: e.g. accept/reject, but possibly more
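        Under the classification model, a quality assertion is just a map from evidence to regions, with an action attached to each region. A minimal sketch, in which the region names, thresholds, and actions are all illustrative assumptions rather than values from the talk:

```python
# Sketch of QA(D) under the classification model: each data item is mapped
# to a quality-equivalent region, and each region carries an action.
# Regions A/B/C, the thresholds, and the actions are illustrative.

REGION_ACTIONS = {"A": "accept", "B": "flag-for-review", "C": "reject"}

def qa_region(score):
    """Assign a region from an evidence-derived score (thresholds assumed)."""
    if score >= 80:
        return "A"
    if score >= 40:
        return "B"
    return "C"

def partition(dataset, score_of):
    """Compute the equivalence classes (regions) that QA induces on D."""
    regions = {"A": [], "B": [], "C": []}
    for d in dataset:
        regions[qa_region(score_of(d))].append(d)
    return regions
```

        The score model is the special case where the regions collapse into a total or partial order.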
    14. Layered definition of Quality
        • Data sources (DB, Env): long-lived, reusable
        • Annotation functions, producing quality evidence annotations: commodities
        • Quality Assertion functions (QA), embodying custom quality knowledge: expert-defined
        • Quality Views (QV), defining acceptability regions: dynamic, user-controlled
    15. Abstract Quality Views
        An operational definition for personal quality:
        • Formulate a quality assertion on the dataset: i.e. a ranking of proteins by PMF score (“quality knowledge, possibly subjective”)
        • Identify the underlying evidence necessary to compute the assertion: the variables used to compute the score (HR, MC, ELDP); objective, inexpensive
        • Define annotation functions that compute evidence values: functions that compute HR, MC, ELDP
        • Define quality regions on the ranked dataset: in this case, intervals of acceptability
        • Associate actions to each region
    16. Computable quality views as commodities
        Cost-effective quality-awareness for data processing:
        • Reuse of high-level definitions of quality views
        • Compilation of abstract quality views into quality components
        Qurator architectural framework: abstract quality views → binding and compilation → executable quality process (runtime environment, data-specific quality services)
    17. Quality hypotheses discovery and testing
        Cycle: quality model definition → abstract quality view → targeted compilation → target-specific quality component → deployment → quality-enhanced user environment; execution on test data → performance assessment → back to quality model definition.
        Multiple target environments: workflow, query processor
    18. Experimental quality
        • Making data processing quality-aware using Quality Views: query, browsing, retrieval, data-intensive workflows
        • Discovery and validation of “nuggets of quality knowledge”: Quality View + model testing on test datasets → embedding quality views and flow-through testing
    19. Execution model for Quality views
        Binding → compilation → executable component:
        • Sub-flow of an existing workflow
        • Query processing interceptor
        The QV compiler takes an abstract quality view and, using the services registry and service implementations of the Qurator quality framework, produces a quality workflow embedded in the host workflow (host workflow: D → D'; quality view applied on D')
    20. Example: original proteomics workflow (Taverna workflow; quality flow embedding point)
    21. Example: embedded quality workflow
    22. Interactive conditions / actions
    23. Generic quality process pattern
        Collect evidence (fetch persistent annotations; compute on-the-fly annotations):
          <variables>
            <var variableName="Coverage" evidence="q:Coverage"/>
            <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
          </variables>
        Compute assertions (classifiers):
          <QualityAssertion serviceName="PIScoreClassifier"
                            serviceType="q:PIScoreClassifier"
                            tagSemType="q:PIScoreClassification"
                            tagName="ScoreClass"/>
        Evaluate conditions, execute actions:
          <action>
            <filter>
              <condition>
                ScoreClass in {"q:high", "q:mid"} and Coverage > 12
              </condition>
            </filter>
          </action>
    24. Reference (semantic) model
        The layered picture again (data sources DB/Env, annotation functions producing quality evidence annotations, Quality Assertion functions embodying custom quality knowledge, Quality Views defining acceptability regions), now grounded in a Common Semantic Model (IQ Ontology)
    25. A semantic model for quality concepts
        • Quality “upper ontology” (OWL): quality evidence types
        • Evidence meta-data model (RDF): evidence annotations are class instances
    26. Main taxonomies and properties
        Properties:
        • assertion-based-on-evidence: QualityAssertion → QualityEvidence
        • is-evidence-for: QualityEvidence → DataEntity
        Class restrictions:
        • MassCoverage ⊑ ∃ is-evidence-for . ImprintHitEntry
        • PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . HitScore
        • PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . MassCoverage
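        These restrictions are what make consistency checking mechanical: an assertion declares which evidence types it is based on, so a tool can flag any assertion whose required evidence has no annotator. A minimal sketch of that check, with illustrative names and plain dictionaries standing in for the ontology:

```python
# Sketch of ontology-driven consistency checking (names illustrative;
# plain dicts stand in for the OWL ontology and service registry).

REQUIRES = {  # assertion -> evidence types it is based on
    "PIScoreClassifier": {"HitScore", "MassCoverage"},
}
ANNOTATORS = {  # evidence type -> annotation functions that can produce it
    "MassCoverage": ["coverage_annotator"],
}

def unsatisfied_requirements(requires, annotators):
    """Report assertions with unsatisfied input requirements, i.e. required
    evidence types for which no annotator is registered."""
    return {a: sorted(ev for ev in needed if not annotators.get(ev))
            for a, needed in requires.items()
            if any(not annotators.get(ev) for ev in needed)}
```

        Here the check would report that PIScoreClassifier lacks an annotator for HitScore, which is exactly the kind of inconsistency the ontology-driven UI on the next slide surfaces.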
    27. The ontology-driven user interface
        Detecting inconsistencies: no annotators for this evidence type; unsatisfied input requirements for a Quality Assertion
    28. Qurator architecture
    29. Quality-aware query processing
    30. Research issues
        Quality modelling:
        • Provenance as evidence: can data/process provenance be turned into evidence?
        • Experimental elicitation of new Quality Assertions: seeking new collaborations with biologists!
        • Classification with uncertainty: data elements belong to a quality class with some probability
        • Computing Quality Assertions with limited evidence: evidence may be expensive and sometimes unavailable; robust classification / score models
        Architecture:
        • Metadata management model: Quality Evidence is a type of metadata with known features…
    31. Summary
        • For complex data types, there is often no single “correct” and agreed-upon definition of quality of data
        • Qurator provides an environment for fast prototyping of quality hypotheses: based on the notion of “evidence” supporting a quality hypothesis, with support for an incremental learning cycle
        • Quality views offer an abstract model for making data processing environments quality-aware: compiled into executable components and embedded; Qurator provides an invocation framework for Quality Views
        Publications: http://www.qurator.org
        Qurator is registered with OMII-UK
