
Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context


This presentation is a supplementary material for the following article -> Nikiforova, A. (2019). Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context. In IADIS International Conference e-Health (pp. 119-126).
This research analyses the quality of open health data that are freely available and can be used by anyone for their own purposes. The quality of open data is crucial, because poor quality can lead to unreliable decision-making and financial losses; the quality of open health data plays an even more critical role. Despite its importance, this topic is rarely discussed. Therefore, the previously proposed data object-driven approach to data quality evaluation is applied to open health data in Latvia in order to (a) evaluate their quality, highlighting common quality issues that should be considered by both users and data publishers, and (b) demonstrate that the approach is suitable for this purpose, as it is simple enough to involve users without IT or data quality knowledge (domain experts) in analysing the quality of the data they examine for their own purposes. The proposed solution appears useful for establishing communication between data users and publishers, thereby improving the overall quality of data.


  1. 1. ANALYSIS OF OPEN HEALTH DATA QUALITY USING DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: INSIGHTS FROM A LATVIAN CONTEXT 13th Multi Conference on Computer Science and Information Systems 11th International Conference on e-Health 17 – 19 July 2019, Porto, Portugal Anastasija Nikiforova Faculty of Computing, University of Latvia Anastasija.Nikiforova@lu.lv
  2. 2. (The New York Times, The Economist, WIRED) Def. I: «Open data» are data that anyone can access, use and share.  The popularity of open data continuously increases.  For instance, European Data Portal collects more than 800 thousand data sets. OPEN DATA The aggregate economic impact from applications based on open data across the EU27 economy is estimated to be €140 billion annually. Open Government Data (OGD):  impact economic growth,  improving government services,  reducing fraud,  reducing wastes. The McKinsey Global Institute report estimated that open data could add over $3 trillion annually in total value to the global economy.
  3. 3. 8 PRINCIPLES OF OPEN DATA
     • A number of studies indicate the existence of data quality problems in open data: Ferney et al., 2017; Kerr et al., 2007; Kuk and Davies, 2011; Martin, 2014; Nikiforova, 2018a, 2018b; Nikiforova and Bicevskis, 2019; Vetrò et al., 2016; etc.
     • OGD: the quality aspect takes only the 4th place by popularity after policy, benefit and risk, although quality can impact these aspects. (Klein et al., 2018)
     • Data quality appears to be one of the most problematic dimensions for open data portals.
     Def. II: «Quality» is a desirable goal to be achieved through management of the production process.
     Def. III: «Data quality» is a relative concept, largely dependent on specific requirements resulting from the data use.
     (Sunlight Foundation, 2007), (European Data Portal, 2018)
     Open data must be: 1. complete, 2. timely, 3. primary, 4. accessible, 5. non-discriminatory, 6. licence-free, 7. machine-processable, 8. non-proprietary.
     And what about data quality???
  4. 4. THE STATE OF OPEN DATA IN LATVIA
     Latvia:
     • is one of 70 countries participating in the Open Government Partnership, an international platform for domestic reformers committed to making their governments more open, accountable, and responsive to citizens;
     • is a fast-tracker (among beginners, followers, fast-trackers, trend-setters) according to the Open Data Maturity report;
     • has the highest open data maturity compared with its neighbours in the Baltic States and Scandinavian countries.
     In 2017 the Latvian Ministry of Environmental Protection and Regional Development launched the new Latvian Open Data Portal:
     • at the moment of its launch: 33 data sets from 13 data publishers;
     • in July 2018: 139 data sets from 41 publishers;
     • in June 2019: 228 data sets from 62 publishers.
     Quality is Latvia's weakest aspect among impact, policy, portal, and quality (only 62%, while the average rate for all analysed countries is 71%).
     Open data maturity of the Latvian open data portal:
     • in 2016: 31st,
     • in 2017: 20th,
     • in 2018: 12th.
     As for the quality aspect: 11th place with just 370 out of 520 points.
  5. 5. OPEN HEALTH(CARE) DATA I  Aims and possible uses of open health(care) data can be very different, since health data and information are characterized by multiple number of possible applications, uses and users.  The volume of health(care) data continuously increasing, and it is expected to grow dramatically in the years ahead.  Open health(care) data is one of the most popular categories of open data. (Cabitza and Batini, 2016) Health and healthcare data are very broad concepts*, this research focuses on one subdomain - open health data. *Def. IV: «Health care data» are items of knowledge about an individual patient or a group of patients. *Def. V: «Health data» are any representation of facts related to the health of single individuals or entire populations and that is suitable for communication, interpretation or processing by manual or electronic means; (World Health Organization, 2003) Abdelhak M, Grostick S, Hanken MA, 2012) Healthcare is characterized by highly complex labor- and skill intensive services where the actors involved still rely primarily on paper tools, their own cognition (competencies and memory), and other traditional methods. (Cabitza & Batini, 2016) HUMAN FACTOR!!!
  6. 6. OPEN HEALTH(CARE) DATA II  Between 56% and 79% of Internet users seek health information online:  - 35%,  - 42%, with the lowest proportion in the Southern countries:  - 30%,  - 23%. (Andreassen et al., 2007)  Open health(care) data must be of high quality, as they:  are needed for health(care) planning and administrative purposes:  can be useful searching data on medications, their dose, contraindications and other information available for the wide audience. • provide a sampling frame for medical research, • facilitate quality assurance of the health(care) services, • etc. • form the basis for health and medicines authority’s hospital statistics, or health economic calculations, • provide authorities with data to support hospital planning, • monitor the frequency of various diseases and treatments, The list of researches discussing quality of health(care) data in many countries comes to the one conclusion – health(care) data have data quality problems.
  7. 7. Assumption: as the level of details of “open” data might be lower in comparison with “closed” data stored in databases, quality checks can be simpler. open data are usually used by wide audience that may not have deep knowledge in IT or data quality areas a solution should be simple enough ensuring particular users with possibility to take part in the analysis of «third-party» open data for their own purposes OPEN [HEALTH] DATA QUALITY Solution: previously proposed user-oriented data object-driven approach (Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, 2019) !!! The same data may be sufficiently qualitative in one case BUT completely useless under other circumstances.
  8. 8.  General studies on data and information quality - define different dimensions of quality and their groupings. ✘ The key data quality dimensions are not universally*; ✘ There is no agreement on their meanings and usability **; ✘ Each dimension can be supplied with one or more metrics that varies from one solution to another; ✘ The number of different data quality dimensions, their definitions and grouping are often useful for only particular solution. Question: How to relate particular dimension (and which one?) to a particular use- case??? RELATED RESEARCHES Problem: necessity to involve data quality experts at every stage of data quality analysis process. Solution: data object-driven approach to data quality evaluation. (Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, Bicevskis, 2019) * «… This state of affairs has led to much confusion within the data quality community and is even more bewildering for those who are new to the discipline and more importantly to business stakeholders…» (DAMA UK, 2018) ** In different proposals, dimensions of the same name can have different semantics and vice versa. (Batini, 2016) Example I: (Kerr, et al., 2007): New Zealand’s healthcare data:  6 data quality dimensions,  24 characteristics  69 data quality criteria. Example II: (Dahbi et al., 2018; Weiskopf et al., 2013):  2 data quality dimensions: accuracy and completeness
  9. 9. MAIN PRINCIPLES OF THE PROPOSED SOLUTION
     TDQM data quality lifecycle: data quality definition, data quality measuring, data quality analysis, data quality improvement.
     • Each specific application can have its own specific DQ checks;
     • DQ requirements can be formulated on several levels: from informal text in natural language to an automatically executable model, SQL statements or program code;
     • DQ can be checked in various stages of the data processing;
     • The DQ definition language is a graphical DSL:
        • the diagrams are easy to read, create, understand and edit even by non-IT and non-DQ experts;
        • syntax and semantics can be easily applied to any new IS.
  10. 10. DATA QUALITY MODEL instead of dimensions
      1. DATA OBJECT (DO): the set of values of the parameters that characterize a real-life object.
         • primary data object: the initial DO whose quality is analysed;
         • secondary data object: a DO that determines the context for analysis of the primary DO.
         * Many objects of the same structure form a class of data objects.
      2. DATA QUALITY REQUIREMENTS: conditions that must be met for a data object to be considered of high quality.
         May contain informal or formalized implementation-independent descriptions of conditions.
      3. DATA QUALITY MEASURING PROCESS: procedures that should be performed to evaluate the data object's quality.
      !!! All three components are defined by using a graphical domain specific language (DSL)**
      ** Three DSL families were developed as graphic languages based on the possibilities of the modelling platform DIMOD.
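The three components of the data quality model can be pictured in ordinary code. The following is only a minimal sketch, not the graphical DSL used in the presentation: the names `DataObject`, `Requirement` and `measure`, as well as the sample records, are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Component 1: a data object, modelled here (as an assumption) as a plain
# mapping from parameter name to value.
DataObject = Dict[str, Optional[str]]


@dataclass
class Requirement:
    """Component 2: one data quality requirement, a named condition on a data object."""
    name: str
    condition: Callable[[DataObject], bool]


def measure(objects: List[DataObject], requirements: List[Requirement]) -> List[dict]:
    """Component 3: the measuring process. Every object of the class is tested
    against every requirement; violations are collected in a protocol."""
    protocol = []
    for index, obj in enumerate(objects):
        for req in requirements:
            if not req.condition(obj):
                protocol.append({"record": index, "failed": req.name})
    return protocol


# Hypothetical records of a primary data object class.
products = [
    {"product_id": "P1", "original_name": "Aspirin"},
    {"product_id": None, "original_name": "Ibuprofen"},
]
requirements = [
    Requirement("product_id exists", lambda o: o.get("product_id") not in (None, "")),
    Requirement("original_name exists", lambda o: o.get("original_name") not in (None, "")),
]
protocol = measure(products, requirements)
print(protocol)
```

Keeping the requirements separate from the measuring loop mirrors the idea that each use-case can supply its own checks without changing the process.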
  11. 11. DATA QUALITY ANALYSIS OF OPEN HEALTH(CARE) DATA
      • 15 data sets from 7 different data publishers;
      • 15 primary data objects; 11 secondary data objects were involved in the data quality analysis and applied to 35 parameters of the primary data objects;
      • The most frequent data quality issues:
         ✘ contextual data quality issues;
         ✘ empty values (completeness);
         ✘ multiple notations for the same object within one data object and even one parameter;
         ✘ issues in interrelated parameters.
      ✘ only 6 out of 15 data sets are updated as frequently as promised;
      ✘ only 8 out of 15 data sets are supplied with an explanation of parameters;
      ✔ almost all available data sets are provided in a machine-readable format: the most popular open data format is .xlsx (53.3%), while 26.7% are .zip archives containing data sets in .xlsx and .csv format;
      ✘ 1 data set cannot be considered open data.
  12. 12. Medicinal_Product ISO3 varchar ISO2 varchar OfficialName varchar ShortName varchar Country Code (ISO-3166-1) varchar ShortName_LV varchar OfficialName_LV varchar pharmaceutical_form varchar original_name varchar product_id varchar exp_country_en varchar marketing_authorisation_holder varchar exp_country_lv varchar atc_code varchar authorisation_procedure enumerable {Eiropas centralizētā reģistrācijas procedūra, Nacionālā reģistrācijas procedūra, ...} summary_of_product_ characteristics varchar - pattern Country_LV ATC ATC_code varchar  Data object is platform-independent.  The checking of parameter values is local and formal process.  The quality checking for one of the DO parameters values is an examination of properties of the individual values, e.g. whether:  a text string may serve as a value of the field Name,  value of the field Address is a correct address.  Can be formulated at different levels of abstraction:  from the formal language grammar  to definitions of variables in programming languages. DATA OBJECT Secondary DO Primary DO
  13. 13. DATA QUALITY SPECIFICATION
      • Quality conditions are defined only for the primary data object.
      • DQ requirements are defined by using logical expressions.
      • The names of DO attributes/fields serve as operands in the logical expressions.
      • Both syntactical and semantical data quality can be analysed according to unified principles.
      [Diagram: quality specification of the primary DO "Medicinal_Product". It consists of "Assess Field" steps with OK and NO outcomes and SendMessage steps. The contextual checks are linked to the secondary DOs by an informal rule.]
      Assess Field "product_id": checkValueExists(product_id)
      Assess Field "original_name": checkValueExists(original_name)
      Assess Field "pharmaceutical_form": checkValueExists(pharmaceutical_form)
      Assess Field "marketing_authorisation_holder": checkValueExists(marketing_authorisation_holder); checkMarketing_authorisation_holderName(Country, marketing_authorisation_holder)
      Assess Field "exp_country_en": checkValueExists(exp_country_en); checkExp_country_enName(Country, exp_country_en)
      Assess Field "exp_country_lv": checkValueExists(exp_country_lv); checkExp_country_lvName(Country_LV, exp_country_lv)
      Assess Field "atc_code": checkValueExists(atc_code); checkAtc_codeName(ATC, atc_code)
      Assess Field "authorisation_procedure": checkValueExists(authorisation_procedure); checkValueEnumerable(authorisation_procedure)
      Assess Field "summary_of_product_characteristics": checkValueExists(summary_of_product_characteristics); checkValueSummary_of_product_characteristics(Summary_of_product_characteristics, 'https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=%'+'&src=description')
      Secondary DOs and their parameters: Country (ISO3, ISO2, OfficialName, ShortName); Country_LV (Code (ISO-3166-1), ShortName_LV, OfficialName_LV); ATC (ATC_code).
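The local (non-contextual) checks named in the specification, such as checkValueExists and checkValueEnumerable, could be written as small predicates. This is a hedged sketch rather than the author's implementation: the Python function names, the `assess` helper and the sample record are assumptions, and the list of permitted procedures contains only the three values that appear in the slides (the original list is elided with "...").

```python
import re
from typing import Dict, List, Optional

# Only the three procedure names visible in the slides; the full list is elided there.
ALLOWED_PROCEDURES = {
    "Eiropas centralizētā reģistrācijas procedūra",
    "Nacionālā reģistrācijas procedūra",
    "Decentralizētā reģistrācijas procedūra",
}

# Regular-expression equivalent of the SQL LIKE pattern '...pdf.php?id=%&src=description'.
SUMMARY_PATTERN = (
    re.escape("https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=")
    + r".*"
    + re.escape("&src=description")
)


def check_value_exists(value: Optional[str]) -> bool:
    """True when the value is present and not blank."""
    return value is not None and value.strip() != ""


def check_value_enumerable(value: Optional[str], allowed: set) -> bool:
    """True when the value belongs to the list of permitted values."""
    return value in allowed


def check_value_pattern(value: Optional[str], pattern: str) -> bool:
    """True when the whole value matches the given pattern."""
    return value is not None and re.fullmatch(pattern, value) is not None


def assess(record: Dict[str, Optional[str]]) -> List[str]:
    """Run the local checks on one record and return the failed ones."""
    failed = []
    if not check_value_exists(record.get("product_id")):
        failed.append("product_id: value missing")
    if not check_value_enumerable(record.get("authorisation_procedure"), ALLOWED_PROCEDURES):
        failed.append("authorisation_procedure: not in permitted list")
    if not check_value_pattern(record.get("summary_of_product_characteristics"), SUMMARY_PATTERN):
        failed.append("summary_of_product_characteristics: pattern mismatch")
    return failed


# Hypothetical record with a procedure name that is not in the permitted list.
record = {
    "product_id": "P1",
    "authorisation_procedure": "Nacionālā procedūra",
    "summary_of_product_characteristics":
        "https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=123&src=description",
}
failures = assess(record)
print(failures)
```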
  14. 14. DATA QUALITY MEASURING PROCESS The activities to be taken to select data object values from data sources. One or more steps to evaluate the quality of the data, each of which describes one test for the compliance of the data object with a specific quality specification. + Gather values of the secondary DOs from the data sources if the parameter indicating the secondary DO’s value in scope of defined quality condition is true: 1. read/ write operations from data source into database, 2. connection of primary and secondary data objects via appropriate parameters The steps to improve data quality automatically or manually triggering changes in the data source. For contextual checks  The language describing the quality evaluation process involves verification activities for a particular DO that can be defined:  informally as a natural language text,  using UML activity diagrams,  in the own DSL.  Additionally, processing of DO classes instances may require looping constructions, similar to iterator used in C#.
  15. 15. DATA QUALITY MEASURING PROCESS
      • A concrete DO or a class of DOs is used as an input for a quality verification process.
      • The quality verification process creates a test protocol.
      In case of SQL:
      • the SELECT statement specifies the target DO;
      • the WHERE clause specifies quality requirements;
      + the JOIN clause links primary and secondary DOs.
      [Diagram: data for "Medicinal_Product", "Country", "Country_LV" and "ATC" are read from the data sources and written into the DB. The "Assess Field" steps below have OK and NO outcomes; the diagram also contains SendMessage steps.]
      Assess Field "product_id":
        SELECT * FROM [dbo].[Medicinal_product] WHERE [product_id] IS NULL
      Assess Field "original_name":
        SELECT * FROM [dbo].[Medicinal_product] WHERE [original_name] IS NULL
      Assess Field "pharmaceutical_form":
        SELECT * FROM [dbo].[Medicinal_product] WHERE [pharmaceutical_form] IS NULL
      Assess Field "marketing_authorisation_holder":
        SELECT * FROM [dbo].[Medicinal_product]
        LEFT JOIN [dbo].[country]
          ON [dbo].[country].[Short name] = (RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2))
          OR [dbo].[country].[Official name] = (RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2))
          OR [dbo].[country].[ISO3] = (RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2))
        WHERE [dbo].[country].[Short name] IS NULL AND [dbo].[country].[Official name] IS NULL AND [dbo].[country].[ISO3] IS NULL
      Assess Field "exp_country_en":
        SELECT * FROM [dbo].[Medicinal_product]
        LEFT JOIN [dbo].[country]
          ON [dbo].[country].[Short name] = (exp_country_en)
          OR [dbo].[country].[Official name] = (exp_country_en)
          OR [dbo].[country].[ISO3] = (exp_country_en)
        WHERE [dbo].[country].[Short name] IS NULL AND [dbo].[country].[Official name] IS NULL AND [dbo].[country].[ISO3] IS NULL
      Assess Field "exp_country_lv":
        SELECT * FROM [dbo].[Medicinal_product]
        LEFT JOIN [dbo].[country_lv]
          ON [dbo].[country_lv].[Code (ISO-3166-1)] = (exp_country_lv)
          OR [dbo].[country_lv].[ShortName_LV] = (exp_country_lv)
          OR [dbo].[country_lv].[LongName_LV] = (exp_country_lv)
        WHERE [dbo].[country_lv].[Code (ISO-3166-1)] IS NULL AND [dbo].[country_lv].[ShortName_LV] IS NULL AND [dbo].[country_lv].[LongName_LV] IS NULL
      Assess Field "atc_code":
        SELECT product_id,
               REPLACE(SUBSTRING(atc_code, CHARINDEX(';', atc_code), LEN(atc_code)), ';', '') AS atc1,
               LEFT(atc_code, CHARINDEX(';', atc_code) - 1) AS atc2
        INTO #atc_divided
        FROM [dbo].[Medicinal_product]
        WHERE LEFT(atc_code, CHARINDEX(';', atc_code) - 0) NOT LIKE '';
        SELECT product_id FROM [dbo].[Medicinal_product]
        LEFT JOIN [dbo].[ATC] ON [dbo].[ATC].[ATC_code] = [dbo].[Medicinal_product].[atc_code]
        WHERE [dbo].[ATC].[ATC_code] IS NULL
        EXCEPT SELECT product_id FROM #atc_divided
      Assess Field "authorisation_procedure":
        SELECT * FROM [dbo].[Medicinal_product]
        WHERE authorisation_procedure IS NULL
           OR (authorisation_procedure NOT LIKE 'Eiropas centralizētā reģistrācijas procedūra'
               AND authorisation_procedure NOT LIKE 'Nacionālā reģistrācijas procedūra'
               AND ...
               AND authorisation_procedure NOT LIKE 'Decentralizētā reģistrācijas procedūra')
      Assess Field "summary_of_product_characteristics":
        SELECT * FROM [dbo].[Medicinal_product]
        WHERE summary_of_product_characteristics IS NULL
           OR summary_of_product_characteristics NOT LIKE 'https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=%'+'&src=description'
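The contextual checks on this slide follow one pattern: LEFT JOIN the primary data object to a secondary one and keep the rows for which no match was found. The sketch below reproduces that pattern on a toy database. It is an illustration under assumptions: it uses SQLite through Python's standard library instead of the T-SQL dialect of the slides, and the simplified table layout, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medicinal_product (product_id TEXT, exp_country_en TEXT);
CREATE TABLE country (short_name TEXT, official_name TEXT, iso3 TEXT);
INSERT INTO medicinal_product VALUES
    ('P1', 'Latvia'), ('P2', 'LVA'), ('P3', 'Latvija?'), ('P4', NULL);
INSERT INTO country VALUES ('Latvia', 'Republic of Latvia', 'LVA');
""")

# Rows of the primary data object whose country value matches none of the
# notations known in the secondary data object.
query = """
SELECT p.product_id
FROM medicinal_product AS p
LEFT JOIN country AS c
       ON c.short_name = p.exp_country_en
       OR c.official_name = p.exp_country_en
       OR c.iso3 = p.exp_country_en
WHERE c.short_name IS NULL AND c.official_name IS NULL AND c.iso3 IS NULL
ORDER BY p.product_id
"""
invalid_ids = [row[0] for row in conn.execute(query)]
print(invalid_ids)
conn.close()
```

Note that a record with an empty country value ('P4') is also reported, because NULL never equals a value of the secondary data object; whether that counts as a contextual or a completeness issue is a design decision for the person defining the check.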
  16. 16. APPROBATION. RESULTS
      Publisher | Dataset | Context issues/ context total | Empty/ Total | Multiple notation/ Total | Clean/ Total
      Centre for Disease Prevention and Control | Incidence of 2nd type diabetes in Latvia | - | 0/6 | 0/6 (0) | 6/6
      Ministry of Welfare | Distribution of persons receiving tech aid by AT | 2/2 (100%) | 3/7 (43%) | 0/7 (0) | 2/7
      Ministry of Welfare | Number of social service providers | 2/2 (100%) | 22/27 (82%) | 10/27 (37%) | 4/27
      Ministry of Welfare | Persons with disabilities by the severity of the disability and AT | 2/2 (100%) | 0/23 (0) | 0/23 (0) | 20/23
      Ministry of Welfare | Number of children with disabilities by AT | 2/2 (100%) | 0/10 (0) | 0/10 (0) | 8/10
      State Labour Inspectorate | Accidents at work | (0-1/1) (0-100%) | 1/10 (10%) | 0/10 (0) | 8/10
      State Labour Inspectorate | Occupational diseases confirmed | 4/5 (80%) | 2/11 (18%) | 1/11 (9%) | 9/11
      National Blood Donor Centre | National Blood Donor Center Statistics | - | 0/4 (0) | 0/4 (0) | 4/4
      State Agency of Medicines | Register of licensed pharmaceutical companies | 1/2 (50%) | 17/38 (45%) | 0/38 (0) | 19/38
      State Agency of Medicines | Medicines consumption statistics | 3/3 (100%) | 5/8 (63%) | 2/8 (25%) | 0/8
      State Agency of Medicines | Medicinal Product Register of Latvia | 4/9 (44%) | 21/41 (51%) | 1/41 (2%) | 14/41
      Food and Veterinary Service | Food supplements register | 2/2 (100%) | 30/35 (86%) | 4/35 (11%) | 5/35
      Food and Veterinary Service | Dietary foodstuffs register | 2/2 (100%) | 19/22 (87%) | 4/22 (18%) | 3/22
  17. 17. DATA QUALITY ANALYSIS OF OPEN HEALTH(CARE) DATA: CONTEXTUAL ISSUES
      • Only 1 data set out of 12 (8.3%) did not have any data quality issues ("Accidents at work"); however, some manipulations were needed in order to achieve this result.
        Data set "Accidents at work", value: «88.3332-03» = data set «Work codes», value I: "8332" AND value II: "03".
      • In total, 25 out of 35 parameters (71.4%) had at least a few data quality issues.
      Example II: 4 data sets published by the Ministry of Welfare:
      • the [ATTU code] and [City, county] parameters are supposed to store the code of the administrative territory and the city, which must correspond to the secondary data object "Classification of Administrative Territories and Territorial Units";
        ✘ 3 values are invalid, i.e. not available in the secondary data set: "Total", "Abroad" and "Address isn't specified".
      • Possibly, the data publisher is aware of this, as these values make sense;
      BUT!!! This data quality problem can easily go unnoticed and can lead to inaccurate data analysis results.
  18. 18. DATA QUALITY ANALYSIS OF OPEN HEALTH(CARE) DATA: CONTEXTUAL ISSUES
      • Example I: "Number of social service providers" data set: 3 parameters: [Service with accommodation], [Service without accommodation] and [Service with and without accommodation].
        From the users' viewpoint: [Service with and without accommodation] = [Service with accommodation] + [Service without accommodation]
        BUT!!! For 95 records this assumption does not hold.
      • Example II: "Number of children with disabilities by administrative territory" data set: for 121 records this assumption does not hold.
      At least two possible explanations: 1) there are data quality problems; 2) these fields aren't interconnected, and the sum of the values of the first two parameters does not necessarily have to equal the value of the 3rd parameter. Which of these options???
      Sum rules shown on the slide:
        [1# group] = [18-29 years 1# group] + [30-44 years 1# group] + … + [>=65 years 1# group];
        [2# group] = [18-29 years 2# group] + [30-44 years 2# group] + … + [>=65 years 2# group];
        [3# group] = [18-29 years 3# group] + [30-44 years 3# group] + … + [>=65 years 3# group]
      !!! Data publishers must provide a brief explanation of the parameters and of how the numerical data were obtained.
      Another problem for 4 out of 15 data sets (26.7%): a different number of interrelated values, which may appear in different ways: (a) values in different languages, (b) ID number and name, (c) name and supplementary data such as type, country, phone number of representatives.
      Dataset | Context issues/ context total
      Incidence of 2nd type diabetes in Latvia | 0/0
      Distribution of persons receiving tech aid by AT | 2/2 (100%)
      Number of social service providers | 2/2 (100%)
      Persons with disabilities by the severity of the disability … | 2/2 (100%)
      Number of children with disabilities by AT | 2/2 (100%)
      Accidents at work | (0-1/1) (0-100%)
      Occupational diseases confirmed | 4/5 (80%)
      National Blood Donor Center Statistics | 0/0
      Register of licensed pharmaceutical companies | 1/2 (50%)
      Medicines consumption statistics | 3/3 (100%)
      Medicinal Product Register of Latvia | 4/9 (44%)
      Food supplements register | 2/2 (100%)
      Dietary foodstuffs register | 2/2 (100%)
      Veterinary medicinal product register | 1/3 (33%)
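The users' assumption about interrelated parameters (a total equals the sum of its parts) can be tested mechanically. The following is a minimal sketch: the parameter names are taken from the slide, while the function name `violates_sum_rule`, the `id` field and the sample values are assumptions made for illustration.

```python
from typing import Dict, List

PARTS = ["Service with accommodation", "Service without accommodation"]
TOTAL = "Service with and without accommodation"


def violates_sum_rule(record: Dict[str, int], parts: List[str], total: str) -> bool:
    """True when the total parameter differs from the sum of its parts."""
    return sum(record[name] for name in parts) != record[total]


# Hypothetical records; the second one breaks the assumed rule (1 + 1 != 3).
records = [
    {"id": 1, "Service with accommodation": 3, "Service without accommodation": 2,
     "Service with and without accommodation": 5},
    {"id": 2, "Service with accommodation": 1, "Service without accommodation": 1,
     "Service with and without accommodation": 3},
]
violations = [r["id"] for r in records if violates_sum_rule(r, PARTS, TOTAL)]
print(violations)
```

A non-empty result does not prove a data quality problem on its own: as the slide notes, the fields may simply not be interconnected, which only the data publisher can confirm.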
  19. 19. DATA QUALITY ANALYSIS OF OPEN HEALTH(CARE) DATA: COMPLETENESS
      • For 136 out of 167 (81.4%) analysed parameters at least one value was empty.
      • The number of empty values per parameter varies from 1 to all values of a certain parameter.
      • The total share of empty values in the analysed data sets is 15%.
      • The problem of empty values appears even for the primary data of the data sets:
        Example: "Dietary foodstuffs register" data set:
        ✘ 4 records don't have [Name] and [ProducerName].
      • This issue is almost "traditional" in many sectors and countries.
      • However, some studies demonstrate that a high level of data completeness can be achieved.
      (Schmidt et al., 2015) (Oliveira, 2016) (Wanner et al., 2018) (Tomic, 2015) (Yi, 2019) (Sigurdardottir, 2012) (Larsen, 2009)
      Dataset | Empty/ Total
      Incidence of 2nd type diabetes in Latvia | 0/6
      Distribution of persons receiving tech aid by AT | 3/7 (43%)
      Number of social service providers | 22/27 (82%)
      Persons with disabilities by the severity of the disability … | 0/23 (0)
      Number of children with disabilities by AT | 0/10 (0)
      Accidents at work | 1/10 (10%)
      Occupational diseases confirmed | 2/11 (18%)
      National Blood Donor Center Statistics | 0/4 (0)
      Register of licensed pharmaceutical companies | 17/38 (45%)
      Medicines consumption statistics | 5/8 (63%)
      Medicinal Product Register of Latvia | 21/41 (51%)
      Food supplements register | 30/35 (86%)
      Dietary foodstuffs register | 19/22 (87%)
      Veterinary medicinal product register | 16/26 (62%)
      NOTE: 28 of the 136 detected empty values may not be quality issues; however, as long as the data publisher gives no notes on their nullability, there is no certainty that they are unproblematic, as empty values may have different interpretations.
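Figures of the kind reported on this slide (parameters with at least one empty value, overall share of empty values) can be derived with a simple count. This is a sketch under assumptions: a value counts as empty when it is None or a blank string, and the function name `empty_counts` and the sample rows are hypothetical.

```python
from typing import Dict, List, Optional


def empty_counts(rows: List[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Count empty values (None or blank strings) per parameter."""
    counts: Dict[str, int] = {}
    for row in rows:
        for name, value in row.items():
            is_empty = value is None or value.strip() == ""
            counts[name] = counts.get(name, 0) + (1 if is_empty else 0)
    return counts


# Hypothetical records; the second one lacks both name parameters.
rows = [
    {"Name": "A", "ProducerName": "X", "Form": "tablet"},
    {"Name": "", "ProducerName": None, "Form": "capsule"},
    {"Name": "C", "ProducerName": "Z", "Form": "syrup"},
]
counts = empty_counts(rows)
affected = [name for name, count in counts.items() if count > 0]
total_cells = sum(len(row) for row in rows)
empty_share = sum(counts.values()) / total_cells
print(counts, affected, round(empty_share, 3))
```

Whether a value such as '0' or a placeholder string should also count as empty depends on the data set, which is why an explanation of the parameters from the publisher matters.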
  20. 20. DATA QUALITY ANALYSIS OF OPEN HEALTH(CARE) DATA: MULTIPLE NOTATIONS FOR A SINGLE OBJECT
      • Multiple notations for a single object within a single data set and even a single parameter:
        ✘ in 6 out of 15 data sets (40%), in 22 out of 167 parameters (13.2%).
      • This problem is also widespread in many sectors and even countries, e.g. OGD of the UK (Kuk and Davies, 2011).
      May appear in different ways, such as:
      • a different name for one country, for instance, (a) USA vs. United States vs. United States of America; (b) Northern Ireland vs. Republic of Ireland vs. Ireland; (c) Scotland vs. Scotland UK, etc.;
      • a different name for the type of preparation, ingredient or unit size, for instance, (a) singular, (b) plural, (c) shortened form, (d) with a spelling mistake, etc.;
      • different patterns for one value, for instance, phone or registration number: with or without (1) code or (2) delimiter; type of delimiter, etc.;
      • different notations indicating the absence of a value: NULL and '0'**
        Do both NULL and '0' values have the same meaning??? '0' can indicate a value that is equal to zero, while NULL can mean that the value isn't known.
      ** often called "heterogeneity"
      Dataset | Multiple notation/ Total
      Incidence of 2nd type diabetes in Latvia | 0/6 (0)
      Distribution of persons receiving tech aid by AT | 0/7 (0)
      Number of social service providers | 10/27 (37%)
      Persons with disabilities by the severity of the disability … | 0/23 (0)
      Number of children with disabilities by AT | 0/10 (0)
      Accidents at work | 0/10 (0)
      Occupational diseases confirmed | 1/11 (9%)
      National Blood Donor Center Statistics | 0/4 (0)
      Register of licensed pharmaceutical companies | 0/38 (0)
      Medicines consumption statistics | 2/8 (25%)
      Medicinal Product Register of Latvia | 1/41 (2%)
      Food supplements register | 4/35 (11%)
      Dietary foodstuffs register | 4/22 (18%)
      Veterinary medicinal product register | 0/26 (0)
      In 5 out of 8 cases the problem could be solved by mechanisms that control the list of permissible values.
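One way to detect multiple notations for a single object is to map every value to a canonical name through a controlled vocabulary and report the objects that occur under more than one spelling. The sketch below assumes such a vocabulary exists; the `CANONICAL` mapping, the function name `group_notations` and the sample values are hypothetical, with the country variants taken from the slide's own example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical controlled vocabulary: normalised variant -> canonical name.
CANONICAL = {
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}


def group_notations(values: Iterable[str]) -> Dict[str, List[str]]:
    """Return canonical names that appear under more than one notation."""
    groups = defaultdict(set)
    for value in values:
        key = " ".join(value.lower().split())  # ignore case and extra spaces
        canonical = CANONICAL.get(key)
        if canonical is not None:
            groups[canonical].add(value)
    return {name: sorted(variants) for name, variants in groups.items()
            if len(variants) > 1}


values = ["USA", "United States", "United States of America", "Latvia", "USA"]
duplicates = group_notations(values)
print(duplicates)
```

Values missing from the vocabulary (here "Latvia") are silently skipped in this sketch; a fuller check would report them separately so that the list of permissible values can be extended.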
  21. 21. RESULTS I
      • Despite the importance of data quality, the quality of open data is not always one of the main areas of analysis and evaluation of open data.
      • Open health(care) data have a high number of different data quality problems; however, data publishers (who provide data used in their IS) are probably not even aware of them. The most frequent are:
        ✘ contextual data quality issues;
        ✘ empty values even for primary data;
        ✘ multiple notations for the same object within one data object and even a single parameter;
        ✘ issues in interrelated parameters.
  22. 22. RESULTS II
      • Such an analysis and the data object-driven approach to data quality evaluation can be applied not only to open health(care) data but also to other structured and semi-structured data; this solution is effective in many domains.
      • The advantages of the approach used:
        • it can be applied to "third-party" data sets without any information on how data were accrued and processed: it is an external mechanism with a higher level of abstraction;
        • it can be used even by users without IT and DQ knowledge.
      • The use of open data brings significant benefits to data providers: because of the many possible use-cases, data users address various challenges that can rarely be solved by data providers alone. This can improve data quality not only at the national level, but also at the international level.
  23. 23. THANK YOU! For more information, see ResearchGate See also anastasijanikiforova.com For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv Article: Nikiforova, A. (2019). Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context. In IADIS International Conference e-Health (pp. 119-126).
