Modelling and computing the quality of information in e-science
  Paolo Missier, Suzanne Embury, Mark Greenwood, School of Computer Science, University of Manchester, UK
  Alun Preece, Binling Jin, Department of Computing Science, University of Aberdeen, UK
  http://www.qurator.org
  Aberdeen, 24/1/07
Quality of data
  The need for data quality control is rooted in the data management practice
Common quality issues
Taxonomy for data quality dimensions
Our motivation: quality in public e-science data
  Public databases: GenBank, UniProt, EnsEMBL, Entrez, dbSNP
  Problem: using third-party data of unknown quality may result in misleading scientific conclusions
Some quality issues in biology
  Each of these issues calls for a separate testing procedure
  Difficult to generalize
Correctness in biology: examples
  Data type: UniProt protein annotation
    Creation process: manual curation
    Correctness: functional annotation f for p is correct if function f can reliably be attributed to p
  Data type: transcriptomics, gene expression report (up/down-regulation)
    Creation process: microarray data analysis
    Correctness: no false positives, no false negatives
  Data type: qualitative proteomics, protein identification
    Creation process: generate peptide peak lists, match peak lists (e.g. Imprint)
    Correctness: no false positives, i.e. every protein in the output is actually present in the cell sample
Defining quality in e-science is challenging
Research goals
  Elicit “nuggets” of latent quality knowledge from the experts
Example: protein identification
  Pipeline: “wet lab” experiment, protein identification algorithm, data output (protein hitlist), protein function prediction
  Correct entry → true positive
  Evidence:
    Mass coverage (MC) measures the amount of protein sequence matched
    Hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
    ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
  This evidence is independent of the algorithm / software package
  It is readily available and inexpensive to obtain
Correctness of protein identification
  Estimator function (computes a score rather than a probability):
    PMF score = (HR x 100) + MC + (ELDP x 10)
  Prediction performance, comparing 3 models: ROC curve (true positives vs false positives)
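The estimator on this slide is simple enough to write down directly. The following is a minimal sketch, assuming the three evidence values are already available as plain numbers; the `HitEvidence` class, the function name, and the sample values are illustrative and do not come from the Qurator code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HitEvidence:
    """Evidence for one entry of the protein hitlist (hypothetical container)."""
    hit_ratio: float      # HR
    mass_coverage: float  # MC
    eldp: float           # ELDP


def pmf_score(e: HitEvidence) -> float:
    """PMF score = (HR x 100) + MC + (ELDP x 10), as given on the slide."""
    return e.hit_ratio * 100 + e.mass_coverage + e.eldp * 10


# Made-up evidence values, used only to show how hits can be ranked by score.
hits = {
    "hit_a": HitEvidence(hit_ratio=0.5, mass_coverage=30.0, eldp=2.0),
    "hit_b": HitEvidence(hit_ratio=0.25, mass_coverage=10.0, eldp=1.0),
}
ranked = sorted(hits, key=lambda name: pmf_score(hits[name]), reverse=True)
```

Because the result is a score rather than a probability, it is only meaningful for ranking hits or for comparison against thresholds.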
Quality process components
  Pipeline: “wet lab” experiment, protein identification algorithm, data output (protein hitlist), quality filtering, protein function prediction
  Goal: to automatically add the additional filtering step in a principled way
  Quality assertion: PMF score = (HR x 100) + MC + (ELDP x 10)
Quality Assertions
  Regions over the data set D: reject < analyze < accept
  Actions are associated with regions
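One way to read this slide is that a quality assertion maps each data item to one of the ordered regions, and an action is then attached to each region. The sketch below assumes a numeric score and two cut-off values; the thresholds (40 and 70) and all names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Dict, List

REGIONS = ("reject", "analyze", "accept")  # ordered as on the slide
THRESHOLDS = (40.0, 70.0)                  # hypothetical region boundaries


def classify(score: float) -> str:
    """Return the region a score falls into; a boundary value goes to the upper region."""
    return REGIONS[bisect_right(THRESHOLDS, score)]


def partition(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Split the data set into one group of item ids per region."""
    groups: Dict[str, List[str]] = {region: [] for region in REGIONS}
    for item, score in scores.items():
        groups[classify(score)].append(item)
    return groups


# Made-up scores for three hitlist entries.
groups = partition({"P1": 85.0, "P2": 55.0, "P3": 10.0})
```

An action (for example discarding, forwarding for manual inspection, or passing on unchanged) could then be looked up per region and applied to the corresponding group.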
Abstract quality views
Computable quality views as commodities
  Qurator architectural framework: abstract quality views; binding and compilation; executable quality process
Quality hypotheses discovery and testing
  Figure labels: quality model definition; quality model; abstract quality view; execution on test data; performance assessment; targeted compilation; target-specific quality component; deployment; quality-enhanced user environment
Experimental quality
  Discovery and validation of “quality nuggets” (quality view, model testing, test datasets)
  Embedding quality views and flow-through testing
Execution model for Quality views
  Components: host workflow, abstract quality view, QV compiler, embedded quality workflow
  Host workflow: D → D'
  Quality view on D'
  Qurator quality framework: services registry, services implementation
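The execution model can be pictured as function composition: the host workflow maps D to D', and the quality view then runs on D'. The sketch below is only an illustration of that idea in plain Python, not the QV compiler itself; the function names, the toy host workflow, and the coverage threshold are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Hit = Dict[str, object]  # one element of D' in this toy example


def embed_quality_view(
    host: Callable[[List[Tuple[str, float]]], List[Hit]],
    quality_view: Callable[[List[Hit]], List[Hit]],
) -> Callable[[List[Tuple[str, float]]], List[Hit]]:
    """Return a workflow that runs the host (D -> D') and then the quality view on D'."""
    def embedded(data: List[Tuple[str, float]]) -> List[Hit]:
        return quality_view(host(data))
    return embedded


def host_workflow(data: List[Tuple[str, float]]) -> List[Hit]:
    """Toy stand-in for the host workflow: turns raw pairs into hit records."""
    return [{"protein": name, "coverage": coverage} for name, coverage in data]


def coverage_view(hits: List[Hit]) -> List[Hit]:
    """Toy quality view: keep hits whose coverage exceeds a hypothetical threshold."""
    return [h for h in hits if h["coverage"] > 12]


workflow = embed_quality_view(host_workflow, coverage_view)
result = workflow([("P1", 25.0), ("P2", 8.0)])
```

Keeping the quality view separate from the host workflow means the same view could, in principle, be attached to different workflows that produce the same kind of output.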
Example: original proteomics workflow
  Figure: Taverna workflow; quality flow embedding point
Example: embedded quality workflow
Interactive conditions / actions
Generic quality process pattern
  1. Collect evidence (persistent evidence)
     - Fetch persistent annotations
     - Compute on-the-fly annotations
       <variables>
         <var variableName="Coverage" evidence="q:Coverage"/>
         <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
       </variables>
  2. Compute assertions (classifiers)
       <QualityAssertion serviceName="PIScoreClassifier"
           serviceType="q:PIScoreClassifier"
           tagSemType="q:PIScoreClassification"
           tagName="ScoreClass"
  3. Evaluate conditions
  4. Execute actions
       <action>
         <filter>
           <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
         </filter>
       </action>
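The four steps of the pattern can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Qurator implementation: the classifier below is a stand-in for PIScoreClassifier with invented thresholds, the `q:low` class and the sample evidence values are made up, and only the filter condition is taken from the slide.

```python
from typing import Callable, Dict, List

Evidence = Dict[str, float]


def collect_evidence(item: str,
                     persistent: Dict[str, Evidence],
                     on_the_fly: Dict[str, Callable[[str], float]]) -> Evidence:
    """Step 1: fetch persistent annotations, then compute any missing ones on the fly."""
    evidence = dict(persistent.get(item, {}))
    for name, annotator in on_the_fly.items():
        if name not in evidence:
            evidence[name] = annotator(item)
    return evidence


def score_class(evidence: Evidence) -> str:
    """Step 2: stand-in classifier; the thresholds 10 and 5 are hypothetical."""
    count = evidence["PeptidesCount"]
    if count >= 10:
        return "q:high"
    if count >= 5:
        return "q:mid"
    return "q:low"


def condition(tags: Dict[str, str], evidence: Evidence) -> bool:
    """Step 3: ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return tags["ScoreClass"] in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_process(items: List[str],
                        persistent: Dict[str, Evidence],
                        on_the_fly: Dict[str, Callable[[str], float]]) -> List[str]:
    """Step 4: the filter action keeps only the items that satisfy the condition."""
    kept = []
    for item in items:
        evidence = collect_evidence(item, persistent, on_the_fly)
        tags = {"ScoreClass": score_class(evidence)}
        if condition(tags, evidence):
            kept.append(item)
    return kept


# Made-up annotations; P3 has no stored PeptidesCount, so it is computed on the fly.
persistent = {
    "P1": {"Coverage": 25.0, "PeptidesCount": 12},
    "P2": {"Coverage": 8.0, "PeptidesCount": 11},
    "P3": {"Coverage": 30.0},
    "P4": {"Coverage": 15.0, "PeptidesCount": 6},
}
on_the_fly = {"PeptidesCount": lambda item: 3}
kept = run_quality_process(["P1", "P2", "P3", "P4"], persistent, on_the_fly)
```

In this toy run P2 fails on coverage and P3 falls into the lowest class, so only P1 and P4 pass the filter.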
A semantic model for quality concepts
  Quality “upper ontology” (OWL)
  Quality evidence types
  Evidence meta-data model (RDF)
  Evidence annotations are class instances
Main taxonomies and properties
  assertion-based-on-evidence: QualityAssertion → QualityEvidence
  is-evidence-for: QualityEvidence → DataEntity
  Class restriction: MassCoverage    is-evidence-for . ImprintHitEntry
  Class restriction: PIScoreClassifier    assertion-based-on-evidence . HitScore
                     PIScoreClassifier    assertion-based-on-evidence . MassCoverage
The ontology-driven user interface
  Detecting inconsistencies: no annotators for this Evidence type
  Detecting inconsistencies: unsatisfied input requirements for Quality Assertion
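The two kinds of inconsistency named on this slide can be illustrated with plain dictionaries standing in for the ontology. This is a simplified sketch, not the OWL-based mechanism used by the tool: the data structures, function names, and the annotator name `ImprintAnnotator` are hypothetical.

```python
from typing import Dict, List, Set


def evidence_without_annotators(required: Dict[str, Set[str]],
                                annotators: Dict[str, List[str]]) -> List[str]:
    """Evidence types that some assertion needs but no annotator provides."""
    needed = set().union(*required.values())
    return sorted(e for e in needed if not annotators.get(e))


def unsatisfied_assertions(required: Dict[str, Set[str]],
                           annotators: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each quality assertion, the required evidence types that cannot be supplied."""
    report = {}
    for assertion, evidence_types in required.items():
        missing = sorted(e for e in evidence_types if not annotators.get(e))
        if missing:
            report[assertion] = missing
    return report


# Assertion name -> evidence types it is based on.
required = {"PIScoreClassifier": {"HitScore", "MassCoverage"}}
# Evidence type -> annotators able to produce it (names are made up).
annotators = {"HitScore": ["ImprintAnnotator"], "MassCoverage": []}

no_annotator = evidence_without_annotators(required, annotators)
unsatisfied = unsatisfied_assertions(required, annotators)
```

A user interface could run such checks whenever a quality view is edited and flag the offending evidence types or assertions before the view is compiled.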
Qurator architecture
Quality-aware query processing
Research issues
Summary
  Publications: http://www.qurator.org
  Qurator is registered with OMII-UK

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 

Invited talk @Aberdeen, '07: Modelling and computing the quality of information in e-science

  • 1. Modelling and computing the quality of information in e-science. Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK); Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK). http://www.qurator.org. Aberdeen, 24/1/07
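  • 2. Quality of data: the need for data quality control is rooted in the data management practice
  • 3. Common quality issues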
  • 4. Taxonomy for data quality dimensions
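  • 5. Our motivation: quality in public e-science data (GenBank, UniProt, EnsEMBL, Entrez, dbSNP). Problem: using third party data of unknown quality may result in misleading scientific conclusions
  • 6. Some quality issues in biology: each of these issues calls for a separate testing procedure; difficult to generalize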
  • 7. Correctness in biology - examples (data type / creation process / correctness)
      - Qualitative proteomics: protein identification / Generate peptide peak lists, match peak lists (e.g. Imprint) / No false positives: every protein in the output is actually present in the cell sample
      - Transcriptomics: gene expression report (up/down-regulation) / Microarray data analysis / No false positives, no false negatives
      - UniProt protein annotation / Manual curation / Functional annotation f for p is correct if function f can reliably be attributed to p
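  • 8. Defining quality in e-science is challenging
  • 9. Research goals: elicit “nuggets” of latent quality knowledge from the experts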
  • 10. Example: protein identification
      - Diagram labels: “wet lab” experiment; protein identification algorithm; data output; protein hitlist; protein function prediction
      - Correct entry: true positive
      - Evidence: mass coverage (MC) measures the amount of protein sequence matched
      - Evidence: hit ratio (HR) gives an indication of the signal-to-noise ratio in a mass spectrum
      - Evidence: ELDP reflects the completeness of the digestion that precedes the peptide mass fingerprinting
      - This evidence is independent of the algorithm / software package; it is readily available and inexpensive to obtain
  • 11. Correctness of protein identification
      - Estimator function (computes a score rather than a probability): PMF score = (HR × 100) + MC + (ELDP × 10)
      - Prediction performance, comparing 3 models: ROC curve (true positives vs false positives)
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19. Example: original proteomics workflow (Taverna workflow; quality flow embedding point)
  • 22. Generic quality process pattern
      - Collect evidence: fetch persistent annotations (persistent evidence); compute on-the-fly annotations
        <variables> <var variableName="Coverage" evidence="q:Coverage"/> <var variableName="PeptidesCount" evidence="q:PeptidesCount"/> </variables>
      - Compute assertions (classifiers)
        <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass"
      - Evaluate conditions, execute actions
        <action> <filter> <condition> ScoreClass in {"q:high", "q:mid"} and Coverage > 12 </condition> </filter> </action>
  • 23. A semantic model for quality concepts: quality “upper ontology” (OWL); quality evidence types; evidence annotations are class instances; evidence meta-data model (RDF)
  • 24. Main taxonomies and properties
      - assertion-based-on-evidence: QualityAssertion → QualityEvidence
      - is-evidence-for: QualityEvidence → DataEntity
      - Class restriction: MassCoverage  is-evidence-for . ImprintHitEntry
      - Class restriction: PIScoreClassifier  assertion-based-on-evidence . HitScore
      - Class restriction: PIScoreClassifier  assertion-based-on-evidence . MassCoverage
  • 25. The ontology-driven user interface
      - Detecting inconsistencies: no annotators for this evidence type
      - Detecting inconsistencies: unsatisfied input requirements for a quality assertion
  • 28.
  • 29.
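
The estimator on slide 11 can be sketched in a few lines of Python. This is only an illustration of the formula PMF score = (HR × 100) + MC + (ELDP × 10): the class name, field names, hit identifiers and evidence values below are all hypothetical, and the sketch assumes HR, MC and ELDP are already available as plain numbers in whatever units the slide intends.

```python
from dataclasses import dataclass


@dataclass
class HitEvidence:
    """Evidence for one protein hit (field names are illustrative, not from the slides)."""
    hit_ratio: float      # HR: indication of signal-to-noise ratio in the mass spectrum
    mass_coverage: float  # MC: amount of protein sequence matched
    eldp: float           # ELDP: reflects completeness of the digestion


def pmf_score(ev: HitEvidence) -> float:
    """Score (not a probability): (HR x 100) + MC + (ELDP x 10)."""
    return ev.hit_ratio * 100 + ev.mass_coverage + ev.eldp * 10


# Hypothetical evidence values, invented purely to exercise the formula.
hits = {
    "hit_a": HitEvidence(hit_ratio=0.5, mass_coverage=20.0, eldp=3.0),
    "hit_b": HitEvidence(hit_ratio=0.25, mass_coverage=10.0, eldp=1.0),
}

# Rank the hit list by decreasing score.
ranked = sorted(hits, key=lambda name: pmf_score(hits[name]), reverse=True)
```

Because the result is a score rather than a probability, it is only meaningful for ranking or thresholding hits against each other, which is how the later quality-view slides use it.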

Editor's Notes

  1. From traditional DQ to the biologist’s problem of defining quality based on data semantics
  2. Data produced for the first time (mention the evolution of experimental techniques); its production is not streamlined; there is no agreement on how to define its quality
  3. Searching for “nuggets of quality knowledge”
  4. Here is the compilation model for mapping bound views to a sub-workflow
  5. Embedding the sub-flow requires a deployment descriptor: adapters between the host flow and the quality sub-flow, plus data and control links between host flow tasks and quality flow tasks
  6. Activated during execution of the quality sub-flow – blocks the workflow for the duration of the interaction
  7. Our quality view specification language allows users to define abstract quality processes. Evidence types are ontology classes. Evidence values are class individuals, which are represented by variables. These variables are bound to values at runtime; the values themselves are either fetched from a repository of persistent annotations or computed on demand by annotation functions. In our use cases, we have found examples of both. This process step abstracts away from the issue of annotation lifetime. Assertions are computed by services, which are represented by ontology classes too. The tagName is the single output of the service (one for each input data item). Finally, the action step contains the condition/action pairs; here conditions are expressed on the variables introduced earlier, which define the scope. The semantics of the action step is that the expression is evaluated for each data item and the corresponding action is taken, e.g. the item is sent to a specific channel
  8. Benefits of this model: ability to share definitions within a community; consistency checking through reasoning (cite previous papers?); flexibility
  9. From right to left: data / knowledge layer; framework services; quality views management; targeted compiler(s)
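
The three-step process described in note 7 (collect evidence, compute assertions, evaluate conditions and act) can be sketched as follows. This is a minimal illustration in Python, not the Qurator implementation: the function names, the in-memory annotation store, the stand-in classifier and its thresholds are all assumptions made for the example. Only the variable names (Coverage, ScoreClass), the tag values q:high / q:mid and the filter condition are taken from slide 22.

```python
from typing import Callable, Dict, List

Evidence = Dict[str, float]


def collect_evidence(item: str,
                     store: Dict[str, Evidence],
                     annotators: Dict[str, Callable[[str], float]]) -> Evidence:
    """Step 1: fetch persistent annotations, computing missing ones on demand."""
    evidence = dict(store.get(item, {}))
    for name, annotate in annotators.items():
        if name not in evidence:
            evidence[name] = annotate(item)
    return evidence


def score_class(evidence: Evidence) -> str:
    """Step 2: hypothetical stand-in for a classifier service such as
    PIScoreClassifier. The thresholds 80 and 50 are invented for this sketch."""
    score = evidence["HitScore"]
    if score >= 80:
        return "q:high"
    if score >= 50:
        return "q:mid"
    return "q:low"


def passes_filter(evidence: Evidence, tag: str) -> bool:
    """Step 3: the condition from slide 22:
    ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return tag in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_view(items: List[str],
                     store: Dict[str, Evidence],
                     annotators: Dict[str, Callable[[str], float]]) -> List[str]:
    """Evaluate the condition for each data item and keep those that pass."""
    accepted = []
    for item in items:
        evidence = collect_evidence(item, store, annotators)
        if passes_filter(evidence, score_class(evidence)):
            accepted.append(item)
    return accepted


# Hypothetical data: P1 and P2 have persistent Coverage; P3 and P4 get it on the fly.
store = {
    "P1": {"HitScore": 90, "Coverage": 30},
    "P2": {"HitScore": 60, "Coverage": 10},
    "P3": {"HitScore": 20},
    "P4": {"HitScore": 55},
}
annotators = {"Coverage": lambda item: 15.0}  # placeholder on-demand annotation function
result = run_quality_view(["P1", "P2", "P3", "P4"], store, annotators)
```

Keeping evidence collection separate from classification mirrors the point in note 7 that the process abstracts away from annotation lifetime: the filter sees the same variables whether a value was stored or computed on demand.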