This presentation is supplementary material for the following article: Nikiforova, A., & Bicevskis, J. (2019). An Extended Data Object-driven Approach to Data Quality Evaluation: Contextual Data Quality Analysis. In ICEIS (1) (pp. 274-281).
The research extends a data object-driven approach to data quality evaluation so that the quality of a data object can be analysed in the scope of multiple data objects. The previously presented approach was used to analyse one particular data object, focusing mainly on syntactic analysis. With the extension, the quality of the primary data object can be analysed against an unlimited number of secondary data objects. This makes a more comprehensive, in-depth contextual analysis of the data object possible. The analysis was applied to open data, and the results were compared with those obtained previously, underlining the importance and benefits of the extension.
1. AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUAL DATA QUALITY ANALYSIS
21st International Conference on Enterprise Information Systems (ICEIS),
Heraklion, Crete – Greece, 2019
Anastasija Nikiforova, Janis Bicevskis
Faculty of Computing, University of Latvia
Anastasija.Nikiforova@lu.lv
2. QUALITY AND DATA QUALITY
“Quality” is a desirable goal to be achieved through management of the production process.
«Data quality» is a relative concept, largely dependent on specific requirements resulting from the data use.
Source: Bičevska (2018)
Source: ISO 9001:2015: Quality management principles.
Data quality weaknesses can lead to huge losses:
• 2016: Decisions resulting from bad data cost the US economy $3.1 trillion per year (IBM)
• 2017: Organizations believe poor data quality to be responsible for an average of $15 million per year in losses (Gartner)
!!! The same data may be of sufficient quality in one case BUT completely useless under other circumstances.
3. RELATED RESEARCH
«Dimensions are not defined in a measurable and formal way»
-Batini et al., 2016, DAMA, 2019, Huang et al., 1999, Eppler, 2006
«…Even amongst data quality professionals the key data quality dimensions are not universally agreed. This state of affairs has led to much confusion within the data quality community and is even more bewildering for those who are new to the discipline and more importantly to business stakeholders…»
-DAMA, 2019
General studies on data and information quality define different dimensions of quality and their groupings, as well as data assessment methodologies.
Assessments of specific industry data and information quality propose sector-specific methods.
• Cancer registry, Healthcare, Manufacturing, Chemical Hazard Risk Assessments, etc.
BUT!!!
There is no consensus on data quality dimensions and their usability.
How to relate a particular dimension (and which one?) to a particular use case?
Dimensions of the same name can have different semantics in different studies.
Problem: the necessity to involve data quality experts at every stage of the data quality analysis process
Solution: a data object-driven approach to data quality evaluation (Bicevskis, Bicevska, Nikiforova, Oditis, 2018)
4. MAIN PRINCIPLES OF THE PROPOSED SOLUTION
TDQM data quality lifecycle: Data quality definition -> Data quality measuring -> Data quality analysis -> Data quality improvement
Each specific application can have its own specific DQ checks;
DQ requirements can be formulated on several levels:
• from informal text in natural language
• to an automatically executable model, SQL statements or program code;
DQ can be checked at various stages of data processing;
The DQ definition language is a graphical DSL:
• the diagrams are easy to read, create, understand and edit, even by non-IT and non-Data Quality professionals;
• syntax and semantics can be easily applied to any new IS.
5. DATA QUALITY MODEL (instead of dimensions)
1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object
• primary data object - the initial DO whose quality is analysed;
• secondary data object - a DO that determines the context for analysis of the primary DO.
* Many objects of the same structure form a class of data objects.
2. DATA QUALITY REQUIREMENTS - conditions that must be met for a data object to be considered of high quality.
** May contain informal or formalized implementation-independent descriptions of conditions.
3. DATA QUALITY MEASURING PROCESS - procedures that should be performed to evaluate the data object's quality.
!!! All three components are defined by using a graphical domain-specific language (DSL)**
** Three DSL families were developed as graphic languages based on the possibilities of the modelling platform DIMOD.
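The three components of the model can be pictured in code. The following is a minimal Python sketch, not the authors' graphical DSL: the names `DataObject`, `Requirement` and `measure`, as well as the sample values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Component 1: a data object is a set of parameter values describing a real-life object.
@dataclass
class DataObject:
    name: str                          # class of the data object, e.g. "Company"
    values: Dict[str, Optional[str]]   # parameter name -> value

# Component 2: a requirement is a condition the data object must satisfy.
@dataclass
class Requirement:
    description: str                       # informal, implementation-independent text
    check: Callable[[DataObject], bool]    # formalized, executable form

# Component 3: the measuring process applies every requirement and records the outcome.
def measure(obj: DataObject, requirements: List[Requirement]) -> Dict[str, bool]:
    return {req.description: req.check(obj) for req in requirements}

company = DataObject("Company", {"CompanyName": "EXAMPLE LTD", "RegAddressCountry": None})
requirements = [
    Requirement("CompanyName is filled", lambda o: bool(o.values.get("CompanyName"))),
    Requirement("RegAddressCountry is filled", lambda o: bool(o.values.get("RegAddressCountry"))),
]
report = measure(company, requirements)
print(report)
```

Keeping the informal description next to the executable check mirrors the idea that a requirement may exist at several levels of formality at once.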
6. DATA QUALITY ANALYSIS. STEP-BY-STEP GUIDE
0-1. Definition of the use case
0-2. Analysis of source data
1-1. Definition of the primary data object
1-2. Definition of the secondary data object(-s)
1-3. Primary and secondary data objects linking
2-1. Primary data object quality specification
2-2. Primary and secondary data objects linking conditions
3. Data quality measuring process
defined using graphical DSL
4-1. Analysis of the results
4-2. Data quality improvement (MS DQS)
7. APPROBATION. DATA SETS
Use cases:
1. company search/identification (by its name, registration number, incorporation date);
2. contacting by post (by address and postal code)
Company registers of: United Kingdom (UK), Latvia (LV), Estonia (EE), Norway (NOR)
Global Open Data Index: UK: 1st place; LV: 18th place; EE: -; NOR: 1st place
Country          # of columns   # of columns with quality problems (number, %)
United Kingdom   55             15 (27.3%)
Latvia           22             11 (50%)
Estonia          14             7 (50%)
Norway           42             8 (19%)
8. APPROBATION. RESULTS
1) company identification (by its name, registration number and incorporation date)
Country   Identification   Name           Reg. Nr.   Incorporation date
UK        -                1 (0.0001%)    0          3 invalid (0.0004%)
Latvia    -                10 (0.0025%)   0          94 NULL (0.02%)
Estonia   +                0              0          -
Norway    -                0              0          9 doubtful (0.001%)
2) contacting by post (by its address and postal code)
Country   Contacting by post   Address                                Postal code
UK        -                    7 514 NULL (1%); 4 invalid (0.0005%)   12 151 (1.6%)
Latvia    -                    366 (0.09%)                            20 498 (5.16%)
Estonia   -                    29 918 (11.24%)                        22 621 (8.5%)
Norway    -                    68 128 (6.2%)                          14 683 (1.3%)
Mainly syntactic analysis was done: analysis in scope of one data object.
!!! A more in-depth and comprehensive analysis should be done: analysis in scope of multiple data objects.
9. APPROBATION. ADDITIONAL CHECKS OF «COMPANIES HOUSE» (UK)
TOTAL: 128 different values that possibly contain data quality problems
Various names indicating the same country: USA; United States; United States of America; Northern Ireland; Republic of Ireland; Ireland; Virgin Island; British Virgin Island; Virgin Islands, British; Scotland; Scotland UK; …
??? Which of them is valid?
#   Type of issue                                         Example
1   various names indicating the same country             USA, United States and United States of America, etc.
2   names of dissolved countries                          Czechoslovakia, Yugoslavia, USSR
3   values indicating administrative division or region   Wales, Scotland, England & Wales, England, …
4   not countries at all                                  “SW7”, “EAST SUSSEX”, “BWI”, “DE 19901”
The single data object analysis indicates the mere existence of the data quality problem without detecting all the defective records.
The secondary data object is needed!!!
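A small Python sketch can illustrate why a check confined to a single data object is not enough. The regular expression below is an assumed syntactic rule (letters, spaces and a few punctuation marks), not the rule used in the study; the sample values are taken from this slide.

```python
import re

# Assumed syntactic rule: a country name consists of letters, spaces, '&', ',' and '.'.
COUNTRY_SYNTAX = re.compile(r"[A-Za-z &,.]+")

def looks_like_country(value: str) -> bool:
    """Purely syntactic test; it knows nothing about which countries exist."""
    return COUNTRY_SYNTAX.fullmatch(value) is not None

values = ["USA", "United States", "USSR", "EAST SUSSEX", "SW7", "DE 19901"]
suspicious = [v for v in values if not looks_like_country(v)]
print(suspicious)
```

Under this rule only "SW7" and "DE 19901" are flagged, while "USSR" and "EAST SUSSEX" pass. Deciding about those requires a reference list of countries, which is the role of the secondary data object.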
10. DATA OBJECT
• A data object is platform-independent.
• The checking of parameter values is a local and formal process.
• The quality check for the values of one DO parameter is an examination of the properties of the individual values, e.g. whether:
• (1) a text string may serve as a value of the field Name,
• (2) the value of the field Address is a correct address.
• Can be formulated at different levels of abstraction:
• from the formal language grammar
• to definitions of variables in programming languages.
[Diagram labels: Primary DO, Secondary DO]
11. DATA QUALITY SPECIFICATION
• Quality conditions are defined only for the primary data object.
• DQ requirements are defined by using logical expressions.
• The names of DO attributes/fields serve as operands in the logical expressions.
• Both syntactical and semantical data quality can be analysed according to unified principles.
Quality specification diagram for the primary DO (each Assess Field step is shown with OK and NO outcomes and a SendMessage action):
Assess Field "CountryOfOrigin": checkValueExists(CountryOfOrigin)
Assess Field "URI": checkValueExists(URI); checkValueURI(URI, 'http://business.data.gov.uk.id/company/$CompanyName')
Assess Field "CompanyNumber": checkValueExists(CompanyNumber); checkValueDigits(8)
Assess Field "RegAddress AddressLine1": checkValueExists(RegAddress AddressLine1)
Assess Field "IncorporationDate": checkValueExists(IncorporationDate); checkValueDate(IncorporationDate, "DD/MM/YYYY")
Assess Field "RegAddress AddressPostCode": checkValueExists(RegAddress AddressPostCode)
Assess Field "CompanyName": checkValueExists(CompanyName)
Assess Field "RegAddressCountry": checkValueExists(RegAddressCountry)
Secondary DO fields: ShortName, OfficialName, ISO2, ISO3, UNDP
Link between primary and secondary DOs (informal rule): checkCountryOfOriginName(Country, CountryOfOrigin); checkRegAddressCountryName(Country, RegAddressCountry)
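The check names in the specification suggest simple predicates over field values. The Python sketch below shows how three of them might behave; the function names echo the diagram, but their bodies and signatures are assumptions, since the slides do not define the semantics of the DSL. In particular, `check_value_digits` takes the value explicitly, and the format string "%d/%m/%Y" stands in for "DD/MM/YYYY".

```python
import re
from datetime import datetime
from typing import Optional

def check_value_exists(value: Optional[str]) -> bool:
    """True when the field holds a non-empty value."""
    return value is not None and value.strip() != ""

def check_value_digits(value: str, length: int) -> bool:
    """True when the value consists of exactly `length` digits (assumed meaning)."""
    return re.fullmatch(r"[0-9]{%d}" % length, value) is not None

def check_value_date(value: str, fmt: str = "%d/%m/%Y") -> bool:
    """True when the value parses as a calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Hypothetical record: a well-formed number and an impossible date (31 February).
record = {"CompanyNumber": "01234567", "IncorporationDate": "31/02/2019"}
results = {
    "CompanyNumber": check_value_exists(record["CompanyNumber"])
    and check_value_digits(record["CompanyNumber"], 8),
    "IncorporationDate": check_value_exists(record["IncorporationDate"])
    and check_value_date(record["IncorporationDate"]),
}
print(results)
```

Each entry of `results` corresponds to one Assess Field step: a true value plays the role of the OK outcome and a false value the role of the NO outcome.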
12. DATA QUALITY MEASURING PROCESS
The activities to be taken to select data object values from data sources.
One or more steps to evaluate the quality of the data, each of which describes one test for the compliance of the data object with a specific quality specification.
+ For contextual checks: gather values of the secondary DOs from the data sources if the parameter indicating the secondary DO's value in scope of the defined quality condition is true:
1. read/write operations from the data source into a database,
2. connection of primary and secondary data objects via appropriate parameters.
The steps to improve data quality, automatically or manually triggering changes in the data source.
The language describing the quality evaluation process involves verification activities for a particular DO that can be defined:
• informally as a natural language text,
• using UML activity diagrams,
• in a dedicated DSL.
Additionally, processing instances of DO classes may require looping constructs, similar to the iterators used in C#.
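The measuring process described on this slide can be sketched as a loop over the instances of a DO class. The Python example below uses a generator as an analogue of a C# iterator; the function names, the in-memory country list and the sample records are hypothetical, and the secondary DO is gathered only when a contextual check is requested.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]

def load_countries() -> Set[str]:
    """Stand-in for reading the secondary DO from its data source (hypothetical data)."""
    return {"LATVIA", "GREECE", "UNITED STATES OF AMERICA"}

def measure(records: Iterable[Record], contextual: bool) -> Iterator[Tuple[int, str, bool]]:
    """Yield one test-protocol entry (record index, check name, passed) per check."""
    # The secondary DO is gathered only when a contextual check is requested.
    countries = load_countries() if contextual else None
    for index, record in enumerate(records):
        country = record.get("CountryOfOrigin")
        yield index, "CountryOfOrigin exists", bool(country)
        if countries is not None and country:
            # Link primary and secondary DOs through the country-name parameter.
            yield index, "CountryOfOrigin is a known country", country.upper() in countries

companies: List[Record] = [
    {"CountryOfOrigin": "Latvia"},
    {"CountryOfOrigin": "DE 19901"},
    {"CountryOfOrigin": None},
]
protocol = list(measure(companies, contextual=True))
for entry in protocol:
    print(entry)
```

Because `measure` is a generator, protocol entries are produced one record at a time, so a large register would not need to be held in memory as a whole.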
13. DATA QUALITY MEASURING PROCESS
• A concrete DO or a class of DOs is used as an input for a quality verification process.
• The quality verification process creates a test protocol.
In case of SQL:
• the SELECT statement specifies the target DO,
• the WHERE clause specifies quality requirements,
• + the JOIN clause links primary and secondary DOs.
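The SQL mapping on this slide can be tried out with Python's built-in `sqlite3` module. The sketch below is an illustration under assumptions: the table layouts, the sample rows and the case-insensitive match on the short name are invented for the example and are not taken from the study. Here the WHERE clause selects the records that violate the requirement, so the query result serves as the test protocol.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (CompanyNumber TEXT, CompanyName TEXT, CountryOfOrigin TEXT);
CREATE TABLE country (ShortName TEXT, ISO2 TEXT, ISO3 TEXT);
""")
# Primary DO: hypothetical company records.
conn.executemany("INSERT INTO company VALUES (?, ?, ?)", [
    ("00000001", "A LTD", "Latvia"),
    ("00000002", "B LTD", "DE 19901"),
    ("00000003", "C LTD", None),
])
# Secondary DO: a tiny country reference list.
conn.executemany("INSERT INTO country VALUES (?, ?, ?)", [
    ("Latvia", "LV", "LVA"),
    ("Greece", "GR", "GRC"),
])

# SELECT names the target DO, JOIN links it to the secondary DO,
# and WHERE keeps the records that fail the requirement.
defective = conn.execute("""
    SELECT c.CompanyNumber, c.CountryOfOrigin
    FROM company AS c
    LEFT JOIN country AS k ON UPPER(c.CountryOfOrigin) = UPPER(k.ShortName)
    WHERE c.CountryOfOrigin IS NULL OR k.ShortName IS NULL
    ORDER BY c.CompanyNumber
""").fetchall()
conn.close()
print(defective)
```

A LEFT JOIN is used so that companies without a matching country remain in the result and can be reported, instead of silently disappearing as they would with an inner join.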
14. SINGLE vs MULTIPLE DATA OBJECT ANALYSIS
Results in scope of single data object
TOTAL: 128 different values that possibly contain data quality problems
Various names indicating the same country: USA; United States; United States of America; Northern Ireland; Republic of Ireland; Ireland; Virgin Island; British Virgin Island; Virgin Islands, British; Scotland; Scotland UK; …
??? Which of them is valid?
Results in scope of multiple data objects
Country Of Origin          Short Name                 Official Name                  ISO3   ISO2
…                          …                          …                              …      …
DE 19901                   NULL                       NULL                           NULL   NULL
GREECE                     Greece                     the Hellenic Republic          GRC    GR
…                          …                          …                              …      …
LATVIA                     Latvia                     the Republic of Latvia         LVA    LV
…                          …                          …                              …      …
United States of America   United States of America   the United States of America   USA    US
…                          …                          …                              …      …
TOTAL: 48 different values that definitely have data quality problems
Invalid names: BERMUDA; BWI; …; CZECHOSLOVAKIA; DE 19901; EAST SUSSEX; ENGLAND; ENGLAND & WALES; GIBRALTAR; Great Britain; HOLLAND; …; JERSEY; …; ST VINCENT; NORTHERN IRELAND; REPUBLIC OF IRELAND; REPUBLIC OF NIGERIA; …; SCOTLAND UK; SOUTH KOREA; SW7; TADJIKISTAN; TAIWAN; TURKS & CAICOS ISLANDS; UNITED STATES; UK; USSR; VENEZUELA; VIETNAM; VIRGIN ISLANDS; WEST GERMANY; YEMEN ARAB REPUBLIC; YUGOSLAVIA
• Analysis of 2 parameters containing names of countries against 4 representations of countries’ names and their subdivisions.
• Although this problem was observed in 27.6% of records, it could be solved by making just 48 corrections.
• All values of “CountryOfOrigin” and 73 of 74 values of “RegAddress Country” conform to one standard, i.e., the short name.
In scope of a single data object, ONLY 13 instead of 48 invalid values were detected!!!
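Checking a value against several representations of country names can be sketched as follows. This Python example reuses the three reference rows shown on this slide; the function name `matched_representation` and the case-insensitive comparison are assumptions made for illustration, and the outcome for any particular value need not coincide with how the study classified it.

```python
from typing import Dict, List, Optional

# Secondary DO: reference rows copied from the table on this slide.
COUNTRIES: List[Dict[str, str]] = [
    {"ShortName": "Greece", "OfficialName": "the Hellenic Republic",
     "ISO3": "GRC", "ISO2": "GR"},
    {"ShortName": "Latvia", "OfficialName": "the Republic of Latvia",
     "ISO3": "LVA", "ISO2": "LV"},
    {"ShortName": "United States of America",
     "OfficialName": "the United States of America", "ISO3": "USA", "ISO2": "US"},
]

def matched_representation(value: str) -> Optional[str]:
    """Return the first representation the value matches, or None if it matches none."""
    needle = value.strip().upper()
    for representation in ("ShortName", "OfficialName", "ISO3", "ISO2"):
        if any(row[representation].upper() == needle for row in COUNTRIES):
            return representation
    return None

observed = ["LATVIA", "USA", "UNITED STATES", "DE 19901"]
classified = {value: matched_representation(value) for value in observed}
print(classified)
```

Recording which representation matched, and not only whether a match exists, is what allows a statement such as "values conform to one standard, i.e., the short name" to be derived from the results.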
15. A FEW REMARKS
Data quality analysis in the context of multiple data objects was applied to 23 «external» open datasets:
• 22 different secondary DOs were used;
• 21 of 23 datasets (91.3%) have at least a few data quality issues that weren't detected previously;
• initial version: indicated records potentially containing data quality problems, which makes further analysis a very resource-consuming process;
• proposed extension of the approach: detects only the records that certainly have a data quality problem.
The initial analysis flagged 128 values:
• only 13 values with a data quality problem were identified instead of 48;
• 115 values didn't have data quality problems (false positives).
In this particular case, the results of the analysis were improved by 72.9%.
!!! The proposed structure eliminated the necessity of additional in-depth quality analysis, as well as of writing complex queries and analysing the results individually.
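The figures on this slide can be cross-checked with a few lines of Python. The reading used here is an assumption: 72.9% is taken to be the share of the 48 invalid values that the single-object analysis did not identify, and 115 the number of flagged values that turned out to be fine.

```python
invalid_found_extended = 48   # values with definite problems (multiple data objects)
invalid_found_initial = 13    # invalid values identified by the single-object analysis
flagged_initial = 128         # values the initial analysis marked as suspicious

# Invalid values that only the extended approach identified.
missed = invalid_found_extended - invalid_found_initial
# Share of the invalid values gained by the extension, in percent.
improvement = round(100 * missed / invalid_found_extended, 1)
# Flagged values that did not have a data quality problem.
not_defective = flagged_initial - invalid_found_initial

print(missed, improvement, not_defective)
```

Under this reading the numbers on the slide are mutually consistent: 35 of 48 is 72.9%, and 128 minus 13 gives the 115 values without problems.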
16. RESULTS
A data object-driven approach to data quality evaluation:
• 3 components: data object, quality specification, quality measuring process, all defined using graphical DSLs;
• provides the ability to analyse «foreign»/«external» data without the involvement of data holders (higher level of abstraction);
• very intuitive: suitable even for non-IT and non-DQ experts.
The contextual quality analysis significantly improves data quality analysis results:
• makes it possible to analyse a real data object’s quality within the context of multiple data objects;
• detects the records that certainly have a data quality problem;
• the number of possible checks where the proposed extended approach can yield valuable results is very high.
Both syntactical and contextual data quality are analysed according to unified description principles:
• the diagram’s structure remained easy to read, create, understand and edit.
Users’ participation in [open] data quality analysis using the presented approach benefits not only the users themselves but also data holders when users share their feedback, as data holders are not even aware of data quality problems.
17. FUTURE WORK
• application and evaluation of the extended approach in cases of complex data object structure, including supplementing data objects when a direct connection between the primary and the secondary data objects is not possible,
• detection of possible limitations of the proposed extended approach,
• ensuring the possibility to evaluate the evolution of data sets,
• assessment of the possibility to provide users with suggestions for data improvement,
• developing data quality theory.
18. THANK YOU!
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv
Article: Nikiforova, A., & Bicevskis, J. (2019). An Extended Data Object-driven Approach to Data Quality
Evaluation: Contextual Data Quality Analysis. In ICEIS (1) (pp. 274-281).