Data Quality
Jeremy Debattista
ADAPT Centre, Trinity College Dublin
This research has received funding from the Irish Research Council Government of Ireland Postdoctoral Fellowship award (GOIPD/2017/1204)
and the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded by
the European Regional Development Fund.
www.adaptcentre.ie
How many of you...
… check product reviews before purchasing?
Image and Reviews taken from
https://www.amazon.co.uk/Echo-Dot-Smart-Speaker-Alexa/dp/B0792KWK57/
How many of you...
… check TripAdvisor to find the right restaurant?
Images taken from TripAdvisor.com
Quality: A definition from a Personal Perspective
Crowd Image by James Cridland, taken from https://www.flickr.com/photos/jamescridland/613445810/. Licensed under CC-BY 2.0
What does quality mean to you?
Quality: A definition
Robert Pirsig
Joseph Juran
Philip Crosby
Quality: A definition – Pirsig’s Perspective
Robert Pirsig
… the result of care
Zen and the Art of Motorcycle Maintenance (1974)
Photo taken from: https://www.goodreads.com
Quality: A definition – Juran’s Perspective
… fitness for use
Quality Control Handbook (1974)
Joseph Juran
Photo taken from: https://www.toolshero.com
Quality: A definition – Crosby’s Perspective
… conformance to requirements
Quality Is Free: The Art of Making Quality Certain. Mentor (1979)
Philip Crosby
Photo taken from: https://ceopedia.org
Data Quality – What is data quality?
What characterised good quality for the
datasets you needed to perform a task?
Data Quality Definition
Quality in terms of data is:
• A multi-dimensional concept
• Characterised for a particular task
• Measured by a variety of quality measures, subjective or objective, for different tasks
• e.g. Accessibility, Trustworthiness, Consistency
High quality data = data that is fit for its intended use.
Data Quality – Why is it important?
Data Quality – A Strategy for Organisations
• Data Quality is expensive
• Data Quality is not just about assessing but also about improving.
Figure from Ismael Caballero, Jorge Merino, Manuel Serrano, Mario Piattini, Data Quality for Big Data: Addressing Veracity and Value, 2016
Data Quality – Identify problems early!
A simplistic view of the semantic publishing process
Figure (stages shown): (Un/semi-)structured data sources; Processing/Uplifting; Schemas; Mapping; Transform; Fusion; Semantic (Knowledge) Graph
Stage: (Un/semi-)structured data sources
• Potentially external data
• No structure and context to the data
• Certification of quality?
Stage: Schemas
• Gives context to raw data
• Drives the resulting knowledge graphs
• Should be free of contradictions and incorrect definitions
Stage: Mapping
• Incorrect/Incomplete mappings (e.g. typos)
• Catch errors here, as otherwise errors in your KG will multiply
Stage: Fusion
• Are external data sources fit for the task at hand?
Stage: Semantic (Knowledge) Graph
• Any quality issues not dealt with before will definitely be here
• Big data, time consuming, more expensive to clean
Linked Data Quality Metrics
Figure from: A. J. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A survey.
Linked Data Quality Metrics - Accessibility
Are Linked Data resources readily available for re-use in
different applications and contexts?
Example Metrics:
• Availability of SPARQL endpoints and RDF Data Dumps
• Dereferenceability of resources
• Indication of machine/human readable license
• Links to external datasets
• Correct usage of hash/slash URIs
Linked Data Quality Metrics - Intrinsic
Metrics related to the correctness and coherence of the data,
independent of the user's context.
Example Metrics:
• Syntactically valid dataset
• Incorrect datatype specification (e.g. "23.42"^^xsd:integer)
• Outlier detection
• Correct domain and range definition
• Data conciseness
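The incorrect datatype metric above can be illustrated with a minimal sketch. It assumes literals arrive as (lexical form, datatype IRI) pairs and checks only a handful of XSD datatypes with simplified patterns; the names `LEXICAL_PATTERNS`, `is_valid_literal`, and `datatype_conformance` are hypothetical and do not come from any framework mentioned in these slides.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Simplified lexical patterns for a few XSD datatypes (assumption: this is
# not the full XSD grammar, and whitespace normalisation is ignored).
LEXICAL_PATTERNS = {
    XSD + "integer": re.compile(r"[+-]?\d+"),
    XSD + "decimal": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)"),
    XSD + "boolean": re.compile(r"true|false|1|0"),
}


def is_valid_literal(lexical, datatype):
    """Return True/False for covered datatypes, None when the datatype is not covered."""
    pattern = LEXICAL_PATTERNS.get(datatype)
    if pattern is None:
        return None
    return pattern.fullmatch(lexical) is not None


def datatype_conformance(literals):
    """Share of checkable literals whose lexical form matches their datatype."""
    results = [is_valid_literal(lexical, datatype) for lexical, datatype in literals]
    checked = [r for r in results if r is not None]
    # With nothing to check, report full conformance rather than dividing by zero.
    return sum(checked) / len(checked) if checked else 1.0
```

For the example on the slide, `is_valid_literal("23.42", XSD + "integer")` is False, while the same lexical form typed as `xsd:decimal` passes.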
Linked Data Quality Metrics - Contextual
Metrics that depend on the task at hand.
Example Metrics:
• Trustworthiness of data
• Identification of timely data
• Provenance information
Linked Data Quality Metrics - Representational
How well is the data represented in terms of common best
practices and guidelines?
Example Metrics:
• Re-using existing vocabularies
• Usage of undefined classes/properties
• Provide different serialisation formats for the data
• Use of multiple languages
ISO/IEC 25012 Standard
• Every metric identified in the research was mapped to the ISO/IEC 25012 Model:
  • The Inherent Category: measures intrinsic quality characteristics.
  • The System Category: measures the degree of quality when the system is used.
  • The Inherent-System Category: includes metrics covering both aspects.
http://iso25000.com/index.php/en/iso-25000-standards/iso-25012
Problems with Assessing the Quality of Big Datasets
• Metrics classified in Zaveri et al. did not take into consideration time
and space complexity
• Efficient computation of impractical quality metrics when assessing
big datasets
• Solving intractable problems?
• Trade-off? Faster computation time against metric’s value precision
Probabilistic Techniques for Assessing Datasets
• Sampling
• Reservoir sampling
• Stratified sampling
• Bloom Filters
• Random Walks/Markov Chains
• Clustering
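As an illustration of the first technique in the list, here is a minimal sketch of reservoir sampling (the classic "Algorithm R") used to approximate a ratio-style metric over a stream that is too large to hold in memory. The function names and the idea of estimating a metric as a ratio over the sample are assumptions made for this example, not the implementation used in the referenced work.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_ratio(stream, predicate, k, rng=None):
    """Approximate the share of items satisfying `predicate` using a size-k sample."""
    sample = reservoir_sample(stream, k, rng)
    if not sample:
        return 0.0
    return sum(1 for item in sample if predicate(item)) / len(sample)
```

A larger `k` costs more memory but gives an estimate closer to the exact value, which is the trade-off between computation cost and metric precision raised on the previous slide.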
Quality Assessment – A Conceptual Methodology
1. Identify Quality Measures for the task at hand
• What are the important characteristics of my task?
2. Re-use or define quality metrics
3. Prepare the quality assessment
a) Access point of dataset in question
b) External Resources such as gold standard
4. Run the quality assessment
5. Represent the assessment results
a) Immediate use
b) Mid-to-long term use
Linked Data Quality Frameworks – Over the Years
Framework       | Scalability | Extensibility  | Quality Metadata | Quality Report | Collaboration | Cleaning Support | Last Update
Flemming        | X           | X              | X                | HTML           | X             | X                | 2010
LinkQA          | ✓           | Java           | X                | HTML           | X             | X                | 2011
Sieve           | ✓           | XML            | ✓ (Optional)     | X              | X             | ✓                | 2014
RDFUnit         | ✓           | SPARQL         | ✓ (DQV)          | HTML or RDF    | X             | X                | 2017
TripleCheckMate | N/A         | X              | X                | X              | ✓             | X                | 2013
LiQuate         | N/A         | Bayesian Rules | X                | X              | X             | X                | 2014
TRELLIS         | N/A         | X              | X                | X              | ✓             | X                | 2005
tRDF/tSPARQL    | ✓           | tSPARQL Rules  | X                | X              | X             | X                | 2014
WIQA            | N/A         | WIQA PL        | X                | X              | X             | X                | 2009
Luzzu           | ✓           | Java or LQML   | ✓ (daQ)          | RDF            | X             | X                | 2018
Luzzu – A Quality Assessment Framework for Linked Data
• Four Principles:
1. Extensibility
2. Scalability
3. Interoperability
4. Customisability
Architecture figure (labels): Luzzu; Thread Pool; Metrics Identification; List Metrics Impl. Library (Metric 1, Metric 2, Metric 3, …, Metric n); Dataset / SPARQL Endpoint; Stream Processing <s,p,o>; Quality Metadata; Quality Problem Report
Try it out:
http://www.github.com/Luzzu/Framework
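The stream-processing idea in the figure, where every <s,p,o> triple is passed once to each selected metric, can be sketched as follows. This is a single-threaded illustration only and is not Luzzu's API: the classes `ExternalLinkRatio` and `LabelCoverage`, their definitions, and the `assess` function are hypothetical, and triples are assumed to be plain string tuples.

```python
from urllib.parse import urlparse

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class ExternalLinkRatio:
    """Share of triples whose object is an IRI on a different host than the subject."""

    def __init__(self):
        self.total = 0
        self.external = 0

    def compute(self, s, p, o):
        self.total += 1
        # Assumption: any object starting with "http" is treated as an IRI.
        if o.startswith("http") and urlparse(o).netloc != urlparse(s).netloc:
            self.external += 1

    def value(self):
        return self.external / self.total if self.total else 0.0


class LabelCoverage:
    """Share of distinct subjects that carry at least one rdfs:label."""

    def __init__(self):
        self.subjects = set()
        self.labelled = set()

    def compute(self, s, p, o):
        self.subjects.add(s)
        if p == RDFS_LABEL:
            self.labelled.add(s)

    def value(self):
        return len(self.labelled) / len(self.subjects) if self.subjects else 0.0


def assess(triples, metrics):
    """Feed each triple to every metric in one pass, then collect the values."""
    for s, p, o in triples:
        for metric in metrics:
            metric.compute(s, p, o)
    return {type(metric).__name__: metric.value() for metric in metrics}
```

Because each metric only keeps running counters or sets, the dataset never has to be loaded in full; adding a metric means adding one more object with `compute` and `value` methods.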
W3C Data Quality Vocabulary (DQV)
• Policies: express the policies or agreements a dataset follows, as defined by data quality concerns
• Annotations: provide ratings, certificates, feedback, etc.
• Feedback: comments from data consumers on a dataset (imagine comments on TripAdvisor)
https://www.w3.org/TR/vocab-dqv/
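Besides the annotation types listed above, DQV can also attach metric values to a dataset as quality measurements. The sketch below builds such a statement as a Turtle string using the DQV terms `dqv:hasQualityMeasurement`, `dqv:QualityMeasurement`, `dqv:isMeasurementOf`, and `dqv:value`. The helper name `dqv_measurement` and the example.org IRIs are hypothetical, and the function does no IRI escaping or validation.

```python
def dqv_measurement(dataset_iri, metric_iri, value, datatype="xsd:double"):
    """Return a Turtle snippet attaching one DQV quality measurement to a dataset."""
    # Assumption: the IRIs are already valid and need no escaping.
    return (
        "@prefix dqv: <http://www.w3.org/ns/dqv#> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
        "\n"
        f"<{dataset_iri}> dqv:hasQualityMeasurement [\n"
        "    a dqv:QualityMeasurement ;\n"
        f"    dqv:isMeasurementOf <{metric_iri}> ;\n"
        f'    dqv:value "{value}"^^{datatype}\n'
        "] .\n"
    )


# Hypothetical dataset and metric IRIs, for illustration only.
print(dqv_measurement("http://example.org/dataset",
                      "http://example.org/metrics/labelCoverage",
                      0.85))
```

In practice an RDF library would be the safer way to produce this output; plain string formatting is used here only to keep the example free of dependencies.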
Web of Data Quality - Aggregated
Dataset (http://)                | Aggregated Quality Score | Pos
zbw.eu                           | 84.72%                   | 1st
id.sgcb.mcu.es                   | 83.91%                   | 2nd
kdata.kr                         | 82.22%                   | 3rd
morelab.deusto.es                | 80.12%                   | 4th
mapasinteractivos.didactalia.net | 74.18%                   | 5th
...
citeseer.rkbexplorer.com         | 48.31%                   | 126th
prefix.cc                        | 46.64%                   | 127th
kent.zpr.fer.hr                  | 46.61%                   | 128th
transport.data.gov.uk            | 45.09%                   | 129th
lingvoj.org                      | 41.41%                   | 130th
Web of Data Quality – Accessibility Category
Examples: Availability of Resources, Licensing, Server Performance
Lessons Learned:
• Average Conformance: 30%
• Standard Deviation: 19%
• Low usage of Machine-Readable Licences (17 out of 131 datasets) and Human-Readable Licences (11 out of 131 datasets)
Web of Data Quality – Contextual Category
Examples: Provenance of Data, Human Comprehensibility
Lessons Learned:
• Average Conformance: 13%
• Standard Deviation: 13%
• Poor conformance w.r.t. basic provenance information (e.g. creator of dataset) and traceability of data (predicates defining origin of data)
• Publishers should put more effort into human-readable labels and descriptions of resources
Web of Data Quality – Intrinsic Category
Examples: Syntactic Validity, Consistency, Conciseness
Lessons Learned:
• Average Conformance: 77%
• Standard Deviation: 13%
• Overall high conformance for almost all metrics
• Conformance towards the usage of correct domain or range datatypes should be improved (average conformance ≈ 60%)
Web of Data Quality – Representational Category
Examples: Interoperability, Versatility, Interpretability, Data Representation
Lessons Learned:
• Average Conformance: 63%
• Standard Deviation: 14%
• Data publishers should re-use more existing terms (average conformance ≈ 34%)
Linked Open Data Cloud – A Dataset Portal
Dataset Portal: http://luzzu.adaptcentre.ie
Conclusions
• Quality is different for everyone
• Cost vs need for assessment
• Detect quality issues earlier!
• SoTA evolved to meet the consumers' need to characterise fitness for intended use
• The quality of the Web of Data is not bad, but needs to improve
References
• J. Debattista, S. Auer, C. Lange. Luzzu - A Methodology and Framework for Linked Data Quality Assessment. ACM Journal of Data and Information Quality, 8(1), November 2016.
• J. Debattista, S. Londoño, C. Lange, S. Auer. Quality Assessment of Linked Datasets using Probabilistic Approximation. In 12th European Semantic Web Conference Proceedings, 2015, 221-236, Springer.
• J. Debattista. Scalable Quality Assessment of Linked Data. (Thesis) Universitäts- und Landesbibliothek Bonn, 2017.
• A. J. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer. Quality Assessment for Linked Data: A survey. Semantic Web Journal, 2015.
• J. Debattista, C. Lange, S. Auer. Representing dataset quality metadata using multi-dimensional views. In Proceedings of the 10th International Conference on Semantic Systems (SEMANTiCS '14), 92-99, ACM.
• S. McGurk, J. Debattista, C. Abela. Towards Ontology Quality Assessment. 4th Workshop on Linked Data Quality (LDQ).
Data Quality
@jerdeb
jeremy.debattista@adaptcentre.ie
Question Time!

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 

Recently uploaded (20)

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

Data Quality

  • 1. Data Quality. Jeremy Debattista, ADAPT Centre, Trinity College Dublin. This research has received funding from the Irish Research Council Government of Ireland Postdoctoral Fellowship award (GOIPD/2017/1204) and the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106) and co-funded by the European Regional Development Fund.
  • 2. How many of you... check product reviews before purchasing? Image and reviews taken from https://www.amazon.co.uk/Echo-Dot-Smart-Speaker-Alexa/dp/B0792KWK57/
  • 3. How many of you... check TripAdvisor to find the right restaurant? Images taken from TripAdvisor.com
  • 4. Quality: A definition from a Personal Perspective. What does quality mean to you? Crowd image by James Cridland, taken from https://www.flickr.com/photos/jamescridland/613445810/. Licensed under CC-BY 2.0
  • 5. Quality: A definition. Robert Pirsig, Joseph Juran, Philip Crosby
  • 6. Quality: A definition – Pirsig’s Perspective. Robert Pirsig: "… the result of care". Zen and the Art of Motorcycle Maintenance (1974). Photo taken from: https://www.goodreads.com
  • 7. Quality: A definition – Juran’s Perspective. Joseph Juran: "… fitness for use". Quality Control Handbook (1974). Photo taken from: https://www.toolshero.com
  • 8. Quality: A definition – Crosby’s Perspective. Philip Crosby: "… conformance to requirements". Quality is Free: The Art of Making Quality Certain. Mentor book. (1979). Photo taken from: https://ceopedia.org
  • 9. Data Quality – What is data quality? What characterised good quality for the datasets you needed to perform a task?
  • 10. Quality in terms of data is:
      • A multi-dimensional concept
      • Characterised for a particular task
      • Assessed with a variety of quality measures, subjective or objective, for different tasks
      • e.g. Accessibility, Trustworthiness, Consistency
      Data Quality Definition: high-quality data = data that is fit for its intended use.
  • 11. Data Quality – Why is it important? [Figure: DATA]
  • 12. Data Quality – A Strategy for Organisations
      • Data quality is expensive
      • Data quality is not just about assessing but also about improving
      Figure from Ismael Caballero, Jorge Merino, Manuel Serrano, Mario Piattini, Data Quality for Big Data: Addressing Veracity and Value, 2016
  • 13. Data Quality – Identify problems early! A simplistic view of the semantic publishing process: (Un/semi-)structured data sources; Processing/Uplifting; Schemas; Mapping; Transform; Fusion; Semantic (Knowledge) Graph
  • 14. Data Quality – Identify problems early!
      • Potentially external data
      • No structure and context to the data
      • Certification of quality?
  • 15. Data Quality – Identify problems early!
      • Gives context to raw data
      • Drives the resulting knowledge graphs
      • Should be free of contradictions and incorrect definitions
  • 16. Data Quality – Identify problems early!
      • Incorrect/incomplete mappings (e.g. typos)
      • Catch errors here, as otherwise errors in your KG will multiply
  • 17. Data Quality – Identify problems early!
      • Are external data sources fit for the task at hand?
  • 18. Data Quality – Identify problems early!
      • Any quality issues not dealt with earlier will surface here
      • Big data: time-consuming and more expensive to clean
  • 19. Linked Data Quality Metrics. Figure from: A. J. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A survey.
  • 20. Linked Data Quality Metrics - Accessibility. Are Linked Data resources readily available to be re-used in different applications/contexts? Example metrics:
      • Availability of SPARQL endpoints and RDF data dumps
      • Dereferenceability of resources
      • Indication of machine/human readable license
      • Links to external datasets
      • Correct usage of hash/slash URIs
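As a rough illustration of the last accessibility metric, the following Python sketch only classifies resource URIs as hash or slash URIs and reports the share of each. It does not decide which style is "correct" for a given dataset; that rule is not stated on the slide. The function names `uri_style` and `style_ratio` and the example.org URIs are hypothetical.

```python
from urllib.parse import urlsplit


def uri_style(uri: str) -> str:
    """Return 'hash' if the URI carries a fragment identifier, else 'slash'.

    Hypothetical helper for illustration only.
    """
    return "hash" if urlsplit(uri).fragment else "slash"


def style_ratio(uris):
    """Return the fraction of hash and slash URIs in an iterable of URIs."""
    counts = {"hash": 0, "slash": 0}
    for uri in uris:
        counts[uri_style(uri)] += 1
    total = counts["hash"] + counts["slash"]
    if total == 0:
        return {"hash": 0.0, "slash": 0.0}
    return {style: n / total for style, n in counts.items()}


if __name__ == "__main__":
    sample = [
        "http://example.org/vocab#Person",     # hash URI
        "http://example.org/resource/Dublin",  # slash URI
    ]
    print(style_ratio(sample))
```

A metric built on top of this would still need a policy, for example a dataset-size threshold, to judge whether the observed style is appropriate.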
  • 21. Linked Data Quality Metrics - Intrinsic. Metrics related to the correctness and coherence of the data, independent of the user’s context. Example metrics:
      • Syntactically valid dataset
      • Incorrect datatype specification (e.g. “23.42”^^xsd:integer)
      • Outlier detection
      • Correct domain and range definition
      • Data conciseness
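The incorrect-datatype example above can be checked mechanically. The Python sketch below tests whether a literal's lexical form matches the declared XSD datatype, using regular expressions for three datatypes only. The function name `has_valid_lexical_form` and the pattern table are assumptions made for illustration; the sketch ignores whitespace handling and every other datatype, and it is not how any particular framework implements the metric.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Lexical-form patterns for a few XSD datatypes (deliberately incomplete).
LEXICAL_PATTERNS = {
    XSD + "integer": re.compile(r"[+-]?[0-9]+"),
    XSD + "decimal": re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)"),
    XSD + "boolean": re.compile(r"true|false|1|0"),
}


def has_valid_lexical_form(lexical: str, datatype_iri: str):
    """Return True/False for covered datatypes, None if the datatype is not covered."""
    pattern = LEXICAL_PATTERNS.get(datatype_iri)
    if pattern is None:
        return None  # this sketch makes no claim about other datatypes
    return pattern.fullmatch(lexical) is not None


if __name__ == "__main__":
    # The slide's example: "23.42" declared as xsd:integer is a datatype error.
    print(has_valid_lexical_form("23.42", XSD + "integer"))  # False
    print(has_valid_lexical_form("23.42", XSD + "decimal"))  # True
```

A dataset-level metric could then be the share of typed literals for which this check returns True.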
  • 22. Linked Data Quality Metrics - Contextual. Metrics that depend on the task at hand. Example metrics:
      • Trustworthiness of data
      • Identification of timely data
      • Provenance information
  • 23. Linked Data Quality Metrics - Representational. How well is the data represented in terms of common best practices and guidelines? Example metrics:
      • Re-using existing vocabularies
      • Usage of undefined classes/properties
      • Provide different serialisation formats for the data
      • Use of multiple languages
  • 24. ISO/IEC 25012 Standard. Every metric identified in the research was mapped to the ISO/IEC 25012 model:
      • The Inherent Category: measures intrinsic quality characteristics.
      • The System Category: measures the degree of quality when the system is used.
      • The Inherent-System Category: includes metrics covering both aspects.
      http://iso25000.com/index.php/en/iso-25000-standards/iso-25012
  • 25. Problems with Assessing the Quality of Big Datasets
      • Metrics classified in Zaveri et al. did not take time and space complexity into consideration
      • Efficient computation of otherwise impractical quality metrics when assessing big datasets
      • Solving intractable problems?
      • Trade-off? Faster computation time against the precision of the metric's value
  • 26. Probabilistic Techniques for Assessing Datasets
      • Sampling
          • Reservoir sampling
          • Stratified sampling
      • Bloom Filters
      • Random Walks/Markov Chains
      • Clustering
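To make the first technique concrete, here is a minimal Python sketch of reservoir sampling (the classic "Algorithm R") over a stream of triples, followed by an approximate metric computed on the sample. The names `reservoir_sample` and `estimate_ratio`, the toy triples, and the example predicate are hypothetical; the slides do not say which variant of reservoir sampling the author used.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_ratio(sample, predicate):
    """Estimate the share of items satisfying predicate from a sample."""
    if not sample:
        return 0.0
    return sum(1 for item in sample if predicate(item)) / len(sample)


if __name__ == "__main__":
    # Toy stream of (subject, predicate, object) triples; values are made up.
    triples = ((f"s{i}", "p", "literal" if i % 4 else "") for i in range(100_000))
    sample = reservoir_sample(triples, k=1_000, rng=random.Random(42))
    print(estimate_ratio(sample, lambda t: t[2] != ""))
```

The sample size `k` is the knob behind the trade-off on the previous slide: a smaller reservoir is cheaper to evaluate but gives a less precise estimate of the metric's value.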
  • 27. Quality Assessment – A Conceptual Methodology
      1. Identify quality measures for the task at hand
          • What are the important characteristics of my task?
      2. Re-use or define quality metrics
      3. Prepare the quality assessment
          a) Access point of dataset in question
          b) External resources such as gold standard
      4. Run the quality assessment
      5. Assessment representation
          a) Immediate use
          b) Mid-to-long term use
  • 28. Linked Data Quality Frameworks – Over the Years

      Framework         | Scalability | Extensibility  | Quality Metadata | Quality Report | Collaboration | Cleaning Support | Last Update
      Flemming          | X           | X              | X                | HTML           | X             | X                | 2010
      LinkQA            | ✓           | Java           | X                | HTML           | X             | X                | 2011
      Sieve             | ✓           | XML            | ✓ (Optional)     | X              | X             | ✓                | 2014
      RDF Unit          | ✓           | SPARQL         | ✓ (DQV)          | HTML or RDF    | X             | X                | 2017
      Triple Check Mate | N/A         | X              | X                | X              | ✓             | X                | 2013
      LiQuate           | N/A         | Bayesian Rules | X                | X              | X             | X                | 2014
      TRELLIS           | N/A         | X              | X                | X              | ✓             | X                | 2005
      tRDF/tSPARQL      | ✓           | tSPARQL Rules  | X                | X              | X             | X                | 2014
      WIQA              | N/A         | WIQA PL        | X                | X              | X             | X                | 2009
      Luzzu             | ✓           | Java or LQML   | ✓ (daQ)          | RDF            | X             | X                | 2018
  • 29. Luzzu – A Quality Assessment Framework for Linked Data
      • Four principles: 1. Extensibility 2. Scalability 3. Interoperability 4. Customisability
      • Architecture (figure): Dataset / SPARQL Endpoint; Stream Processing <s,p,o>; Luzzu Thread Pool; Metrics Identification List; Metrics Impl. Library (Metric 1, Metric 2, Metric 3, … Metric n); Quality Metadata; Quality Problem Report
      Try it out: http://www.github.com/Luzzu/Framework
  • 31. W3C Data Quality Vocabulary (DQV). https://www.w3.org/TR/vocab-dqv/
  • 32. W3C Data Quality Vocabulary (DQV)
      • Policies: express policies or agreements that a dataset follows, defined by some data quality concerns
      • Annotations: provide ratings, certificates, feedback, etc.
      • Feedback: comments from data consumers on a dataset (imagine comments in TripAdvisor)
      https://www.w3.org/TR/vocab-dqv/
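Besides policies, annotations and feedback, DQV can also record metric results as `dqv:QualityMeasurement` resources. The Python sketch below formats one such measurement as Turtle using the DQV terms `dqv:computedOn`, `dqv:isMeasurementOf` and `dqv:value`. The helper name `measurement_turtle`, all example.org IRIs, and the value 0.75 are placeholders invented for illustration, not results from the talk.

```python
DQV_PREFIXES = (
    "@prefix dqv: <http://www.w3.org/ns/dqv#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def measurement_turtle(measurement_iri, dataset_iri, metric_iri, value):
    """Serialise a single DQV quality measurement as a Turtle string.

    Hypothetical helper: it does no IRI escaping and assumes `value`
    is a finite float.
    """
    return (
        f"{DQV_PREFIXES}\n"
        f"<{measurement_iri}> a dqv:QualityMeasurement ;\n"
        f"    dqv:computedOn <{dataset_iri}> ;\n"
        f"    dqv:isMeasurementOf <{metric_iri}> ;\n"
        f'    dqv:value "{value}"^^xsd:double .\n'
    )


if __name__ == "__main__":
    print(
        measurement_turtle(
            "http://example.org/measurement/1",               # placeholder IRI
            "http://example.org/dataset/demo",                # placeholder IRI
            "http://example.org/metric/dereferenceability",   # placeholder IRI
            0.75,                                             # placeholder value
        )
    )
```

In practice an RDF library would be a safer way to produce such metadata; plain string formatting is used here only to keep the sketch dependency-free.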
  • 33. Web of Data Quality - Aggregated
  • 34. Web of Data Quality - Aggregated

      Dataset (http://)                | Aggregated Quality Score | Pos
      zbw.eu                           | 84.72%                   | 1st
      id.sgcb.mcu.es                   | 83.91%                   | 2nd
      kdata.kr                         | 82.22%                   | 3rd
      morelab.deusto.es                | 80.12%                   | 4th
      mapasinteractivos.didactalia.net | 74.18%                   | 5th
      ...
      citeseer.rkbexplorer.com         | 48.31%                   | 126th
      prefix.cc                        | 46.64%                   | 127th
      kent.zpr.fer.hr                  | 46.61%                   | 128th
      transport.data.gov.uk            | 45.09%                   | 129th
      lingvoj.org                      | 41.41%                   | 130th
  • 35. Web of Data Quality – Accessibility Category
  • 36. Web of Data Quality – Accessibility Category. Examples: Availability of Resources, Licensing, Server Performance. Lessons learned:
      • Average conformance: 30%
      • Standard deviation: 19%
      • Low usage of machine-readable licences (17 out of 131 datasets) and human-readable licences (11 out of 131 datasets)
  • 37. Web of Data Quality – Contextual Category
  • 38. Web of Data Quality – Contextual Category. Examples: Provenance of Data, Human Comprehensibility. Lessons learned:
      • Average conformance: 13%
      • Standard deviation: 13%
      • Poor conformance w.r.t. basic provenance information (e.g. creator of dataset) and traceability of data (predicates defining origin of data)
      • More effort is needed from publishers towards human labelling and description of resources
  • 39. Web of Data Quality – Intrinsic Category
  • 40. Web of Data Quality – Intrinsic Category. Examples: Syntactic Validity, Consistency, Conciseness. Lessons learned:
      • Average conformance: 77%
      • Standard deviation: 13%
      • Overall high conformance for almost all metrics
      • Conformance towards the usage of correct domain or range datatypes should be improved (average conformance ≈ 60%)
  • 41. Web of Data Quality – Representational Category
  • 42. Web of Data Quality – Representational Category. Examples: Interoperability, Versatility, Interpretability, Data Representation. Lessons learned:
      • Average conformance: 63%
      • Standard deviation: 14%
      • Data publishers should re-use more existing terms (average conformance ≈ 34%)
  • 43. Linked Open Data Cloud – A Dataset Portal. Dataset Portal: http://luzzu.adaptcentre.ie
  • 44. Conclusions
      • Quality is different for everyone
      • Cost vs need for assessment
      • Detect quality issues earlier!
      • SoTA evolved to meet the consumers' need to characterise fitness for intended use
      • The quality of the Web of Data is not bad, but it needs to improve
  • 45. References
      • J. Debattista, S. Auer, C. Lange. Luzzu - A Methodology and Framework for Linked Data Quality Assessment. ACM Journal of Data and Information Quality, 8(1), November 2016.
      • J. Debattista, S. Londoño, C. Lange, S. Auer. Quality Assessment of Linked Datasets using Probabilistic Approximation. In 12th European Semantic Web Conference Proceedings, 2015, 221-236, Springer.
      • J. Debattista. Scalable Quality Assessment of Linked Data. (Thesis) Universitäts- und Landesbibliothek Bonn, 2017.
      • A. J. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A survey. Semantic Web Journal, 2015.
      • J. Debattista, C. Lange, S. Auer. Representing dataset quality metadata using multi-dimensional views. In Proceedings of the 10th International Conference on Semantic Systems (SEMANTiCS ’14), 92-99, ACM.
      • S. McGurk, J. Debattista, C. Abela. Towards Ontology Quality Assessment. 4th Workshop on Linked Data Quality (LDQ).