
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business

This presentation is supplementary material for the research paper "Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business" (authored by Anastasija Nikiforova and Natalija Kozmina), presented at the International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based).
Read the paper here: Nikiforova, A., & Kozmina, N. (2021, November). Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business. In 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA) (pp. 66-73). IEEE. https://ieeexplore.ieee.org/abstract/document/9660802?casa_token=LFJa20LrXAwAAAAA:wVwhTcCPWqxdloAvDQ3-l98KkkLx70xzG3zNvIIkJbC6wvJ4VxwX_VGc3mmW_7c1T-QJlOtTiao


  1. STAKEHOLDER-CENTRED IDENTIFICATION OF DATA QUALITY ISSUES: KNOWLEDGE THAT CAN SAVE YOUR BUSINESS
     The International Conference on Intelligent Data Science Technologies and Applications (IDSTA2021), November 15-16, 2021, Tartu, Estonia (web-based)
     Anastasija Nikiforova, Natalija Kozmina
     “Innovative Information Technologies” Laboratory, Programming Department, Faculty of Computing, University of Latvia
  2. AIM & RESEARCH QUESTIONS
     (RQ1) What are the main data quality issues to be considered when conducting data quality analysis?
     (RQ2) What do users with advanced data quality knowledge think of the list of data quality issues and requirements defined as a result of the literature analysis, i.e., are all these issues important in their view?
     (RQ3) Are the data quality requirements identified while answering the previous RQs valid for real-world data?
     (RQ4) What is the list of data quality requirements to be included in the data quality analysis and in the specification of the data quality tool?
     The goal of this study is to determine the most common data quality issues (i.e., defects) that affect users' experience with data and their reuse of the data, as well as their intent to use the data in the future, potentially resulting in financial losses for businesses.
     • In 2019, 19% of businesses lost customers because they used inaccurate or incomplete data (Global Marketing Alliance, “The cost of bad data: have you done the math?”, 2020).
     • The 2020 edition of “Magic Quadrant for Data Quality Solutions” found that organizations estimate the average cost of poor data quality at more than $12 million per year (Gartner, Magic Quadrant for Data Quality Solutions, 2020).
  3. RELATED RESEARCH
     General studies on data and information quality define different dimensions of quality and their groupings:
     ✘ The key data quality dimensions are not universally agreed upon*;
     ✘ There is no agreement on their meanings and usability**;
     ✘ Each dimension can be supplied with one or more metrics that vary from one solution to another;
     ✘ The number of different data quality dimensions, their definitions and their grouping are often useful for only a particular solution.
     * «… This state of affairs has led to much confusion within the data quality community and is even more bewildering for those who are new to the discipline and more importantly to business stakeholders…» (DAMA UK, 2018)
     ** In different proposals, dimensions of the same name can have different semantics and vice versa. (Batini, 2016)
     Question: how can a particular dimension (and which one?) be related to a particular use case?
  4. RESEARCH DESIGN
  5. DATA QUALITY DIMENSIONS: 2-STEP IDENTIFICATION
     Step Ia: results of the literature review
     • (Laranjeiro et al., 2015): 22 studies
     • (Scannapieco et al., 2002): 6 studies
     • (ISO/IEC, 2008)
     • (Torchiano et al., 2017)
     • (Rafique et al., 2012)
     • (Askham et al., 2013)
     • (Utamachant et al., 2018)
     • (Wang and Strong, 1996)
     Step Ib: results of the brainstorming session, identifying and removing duplicates (30 DQ-users)
     1. accuracy/correctness; 2. objectivity; 3. reputation/traceability; 4. believability/credibility; 5. timeliness; 6. completeness; 7. relevancy; 8. value-added; 9. interpretability; 10. access security; 11. currentness; 12. representational consistency; 13. consistency/concise representation; 14. accessibility; 15. precision; 16. efficiency; 17. recoverability; 18. portability; 19. response time; 20. adequacy; 21. confidentiality (privacy, security); 22. understandability (ease of understanding, interpretability)
     Step II: results of DELPHI analysis (12 experts)
     1. accuracy/correctness; 2. traceability; 3. believability/credibility; 4. timeliness, currentness; 5. completeness; 6. consistency; 7. accessibility; 8. confidentiality/privacy, security; 9. understandability (ease of understanding, clarity, interpretability)
  6. DATA QUALITY DIMENSIONS AND ASSOCIATED DATA QUALITY ISSUES IDENTIFIED (PART I)

     | Dimension* | Level (DT = data, DS = data set) | Data quality issue associated |
     |---|---|---|
     | accuracy/correctness | DT | Incorrect/inaccurate values that do not belong to the domain; Misspelling; Precision; Special characters; Duplicates/uniqueness violations; Incorrect references; Different aggregation levels |
     | traceability | DS, DT | untraceable |
     | believability/credibility | DS | non-credible |
     | timeliness, currentness | DS, DT | Outdated temporal data |
     | completeness | DT | Missing value |
     | ... | ... | ... |

     *For the definition of each dimension used, please refer to the article.
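Several of the data-level (DT) issues listed above lend themselves to simple automated checks. The following is a minimal sketch, not the authors' tool: the sample records, the column names, the set of missing-value markers and the set of "special" characters are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample records; column names and values are illustrative only.
ROWS = [
    {"id": "1", "name": "Riga", "population": "632614"},
    {"id": "2", "name": "Daugavpils", "population": ""},
    {"id": "2", "name": "Liep?ja", "population": "68945"},
]

MISSING = {"", "n/a", "null", "-"}          # assumed markers of a missing value
SPECIAL = re.compile(r"[?#$%^*<>{}|\\~]")   # assumed set of suspicious characters


def missing_values(rows, column):
    """Return indexes of rows whose value in `column` is absent or a missing marker."""
    return [i for i, r in enumerate(rows)
            if str(r.get(column, "")).strip().lower() in MISSING]


def duplicates(rows, column):
    """Return values of `column` that occur more than once (uniqueness violations)."""
    counts = Counter(r.get(column) for r in rows)
    return [value for value, n in counts.items() if n > 1]


def special_characters(rows, column):
    """Return indexes of rows whose value in `column` contains a suspicious character."""
    return [i for i, r in enumerate(rows)
            if SPECIAL.search(str(r.get(column, "")))]
```

Each function covers one issue from the table (completeness, duplicates/uniqueness violations, special characters), so the checks can be reported separately per dimension.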
  7. DATA QUALITY DIMENSIONS AND ASSOCIATED DATA QUALITY ISSUES IDENTIFIED (PART II)

     | Dimension | Level (DT = data, DS = data set) | Data quality issue associated |
     |---|---|---|
     | ... | ... | ... |
     | consistency | DS, DT | Different representations (intra-relational constraint); Different word orderings between values of one attribute; Use of synonyms / multiple notation for one object in scope of one attribute; Use of synonyms / multiple notation for one object in scope of different datasets; Different encoding formats; Wrong data type; Different aggregation levels; Different units; Special characters |
     | accessibility | DS | Special characters; Misspelling; Different encoding formats; Different aggregation levels; Different units; Use of synonyms / multiple notation for one object in scope of different datasets; Bulk download |
     | confidentiality/privacy, security | DS | unsecure / non-confidential |
     | understandability (ease of understanding, clarity, interpretability) | DS, DT | unclear |
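Two of the consistency issues above, "different word orderings between values of one attribute" and "multiple notation for one object in scope of one attribute", can be approximated by normalising values before comparing them. This is a rough sketch under assumptions: the normalisation rule (lower-casing, dropping punctuation, sorting tokens) and the sample values are invented here, and true synonyms (for example an abbreviation versus the full word) would need a dictionary that this sketch does not have.

```python
import re
from collections import defaultdict


def order_insensitive_key(value):
    """Lower-case the value, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"\w+", value.lower())
    return " ".join(sorted(tokens))


def inconsistent_representations(values):
    """Group raw values that collapse to the same key but are written differently."""
    groups = defaultdict(set)
    for value in values:
        groups[order_insensitive_key(value)].add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Hypothetical values of one attribute, invented for illustration.
SCHOOLS = [
    "Riga Secondary School",
    "Secondary School, Riga",
    "riga secondary school",
    "Jelgava Gymnasium",
]

variants = inconsistent_representations(SCHOOLS)
```

Here `variants` holds one group with the three spellings of the same school, while the value that has a single representation is left out.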
  8. DATA QUALITY DIMENSIONS: STEP III
     Step I: results of the literature review
     • (Laranjeiro et al., 2015): 22 studies
     • (Scannapieco et al., 2002): 6 studies
     • (ISO/IEC, 2008)
     • (Torchiano et al., 2017)
     • (Rafique et al., 2012)
     • (Askham et al., 2013)
     • (Utamachant et al., 2018)
     • (Wang and Strong, 1996)
     Step II: results of the brainstorming session, identifying and removing duplicates (30 DQ-users)
     1. accuracy/correctness; 2. objectivity; 3. reputation/traceability; 4. believability/credibility; 5. timeliness; 6. completeness; 7. relevancy; 8. value-added; 9. interpretability; 10. access security; 11. currentness; 12. representational consistency; 13. consistency/concise representation; 14. accessibility; 15. precision; 16. efficiency; 17. recoverability; 18. portability; 19. response time; 20. adequacy; 21. confidentiality (privacy, security); 22. understandability (ease of understanding, interpretability)
     Step III: results of DELPHI analysis (12 experts)
     1. accuracy/correctness; 2. traceability; 3. believability/credibility; 4. timeliness, currentness; 5. completeness; 6. consistency; 7. accessibility; 8. confidentiality/privacy, security; 9. understandability (ease of understanding, clarity, interpretability)
  9. RESULTS OF APPLYING DATA QUALITY REQUIREMENTS TO OPEN GOVERNMENT DATA (part I)

     | Data quality problem in question | Frequency of checks (data sets) | Frequency of issues in DS (#defective data sets / #total) | Frequency of issues (#defective parameters / #total) |
     |---|---|---|---|
     | QD1: Incorrect/inaccurate values that do not belong to the domain | 40.00% | 16.67% | 15.38% |
     | QD1: Misspelling | 86.67% | 7.69% | 3.33% |
     | QD1: Precision | 40.00% | 0 | 0 |
     | QD1: Special characters | 10% | 13.33% | 25.93% |
     | QD1: Duplicates / uniqueness violations | 93.33% | 28.57% | 18.18% |
     | QD1: Incorrect references | 80.00% | 16.67% | 13.33% |
     | QD1: Different aggregation levels | 80.00% | 16.67% | 13.33% |
     | QD2: Traceability (DT) | 66.67% | 0 | 0 |
     | QD2: Traceability (DS) | 93.33% | 14.29% | 6.67% |
     | QD3: Believability/credibility | 100% | 13.33% | 2.27% |
     | QD4: Outdated temporal data (DT) | 93.33% | 7.14% | 10.00% |
     | QD4: Outdated temporal data (DS) | 93.33% | 64.29% | 28.82% |
     | QD5: Completeness | 93.33% | 64.29% | 28.82% |
     | ... | ... | ... | ... |
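The three percentage columns can be read as ratios over per-data-set check results. The sketch below is one plausible reading, not a method confirmed by the slides: it assumes that "#total" in the second ratio means the data sets on which the check could be carried out. That reading is at least consistent with the Misspelling row, where 26 of 30 data sets checked gives 86.67% and 2 defective out of 26 gives 7.69%; the counts 26 and 2 are back-calculated here, and the `CheckResult` structure and the parameter counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one data quality check on one data set (hypothetical structure)."""
    applicable: bool            # could the check be carried out on this data set?
    params_checked: int = 0     # parameters (columns) the check was applied to
    params_defective: int = 0   # parameters in which the issue was found


def pct(part, total):
    """Percentage rounded to two decimals; 0.0 when there is nothing to divide by."""
    return round(100 * part / total, 2) if total else 0.0


def summarise(results):
    """Compute the three frequency columns for one data quality issue."""
    checked = [r for r in results if r.applicable]
    defective_sets = [r for r in checked if r.params_defective > 0]
    return {
        "frequency_of_checks": pct(len(checked), len(results)),
        "issues_in_data_sets": pct(len(defective_sets), len(checked)),
        "issues_in_parameters": pct(
            sum(r.params_defective for r in checked),
            sum(r.params_checked for r in checked),
        ),
    }


# Assumed example: 30 data sets, 26 checked, 2 of them defective.
# The per-data-set parameter counts (5 checked, 1 defective) are invented.
demo = (
    [CheckResult(True, 5, 1)] * 2
    + [CheckResult(True, 5, 0)] * 24
    + [CheckResult(False)] * 4
)
summary = summarise(demo)
```

Keeping the applicability flag separate from the defect counts is what lets the first column (how often a check was possible) differ from the second (how often it failed).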
  10. | Data quality problem in question | Frequency of checks (data sets) | Frequency of issues in DS (#defective data sets / #total) | Frequency of issues (#defective parameters / #total) |
      |---|---|---|---|
      | QD6: Different representations (intra-relational constraint) | 86.67% | 61.54% | 61.90% |
      | QD6: Different word orderings between values of one attribute | 93.33% | 42.86% | 25.00% |
      | QD6: Use of synonyms / multiple notation for one object in scope of one attribute | 86.67% | 61.54% | 61.90% |
      | QD6: Use of synonyms / multiple notation for one object in different datasets | 93.33% | 50.00% | 26.32% |
      | QD6: Different encoding formats | 80.00% | 0 | 0 |
      | QD6: Wrong data type | 86.67% | 7.69% | 0.80% |
      | QD6: Different aggregation levels | 46.67% | 57.14% | 25.93% |
      | QD6: Different units | 53.33% | 25.00% | 21.74% |
      | QD6: Special characters | 46.67% | 57.14% | 25.93% |
      | QD7: Special characters | 86.67% | 7.69% | 8.57% |
      | QD7: Misspelling | 90.00% | 6.67% | 8.33% |
      | QD7: Different encoding formats | 33.33% | 0 | 0 |
      | QD7: Different aggregation levels | 80.00% | 8.33% | 10.00% |
      | QD7: Different units | 80.00% | 16.67% | 21.74% |
      | QD7: Use of synonyms / multiple notation for one object in scope of different datasets | 86.67% | 30.77% | 21.74% |
      | QD7: Bulk download | 100.00% | 20.00% | 20.00% |
      | QD8: Confidentiality/privacy, security | 0 | 0 | 0 |
      | QD9: Understandability (DT) | 100.00% | 20.00% | 11.76% |
      | QD9: Understandability (DS) | 100.00% | 66.67% | 25.93% |
  11. RESULTS
      This study has raised and answered 4 research questions:
      • the list of main data quality issues to be considered when conducting data quality analysis was identified in the course of the literature analysis and then filtered during the brainstorming session;
      • through the DELPHI analysis with 12 experts, the list was reduced to 9 data quality dimensions and 15 data quality issues mapped to each other, with the issues divided into two categories depending on their level, i.e., the data level and the data set level;
      • the validity of the data quality issues identified was examined by applying the list of data quality requirements set in RQ1 and RQ2 to 30 real open government data sets from the Latvian open government data portal;
      • 14 data quality issues to be transformed into requirements for the web-based tool under development have been identified, with 6 more that appear only in some cases (<10% of data sets) to be considered for implementation.
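The split between issues that become requirements and issues that are only "to be considered" can be sketched as a threshold on the share of defective data sets. The 10% boundary comes from the slide; treating exactly 10% and above as a requirement, handling 0% as a separate case, and the three labels used below are assumptions of this sketch. The four sample shares are copied from the "Frequency of issues in DS" column of the results table.

```python
THRESHOLD = 10.0  # per cent of data sets; the slide only states "<10%" for the optional group


def classify(share_of_defective_sets):
    """Map the share of defective data sets (in per cent) to an assumed priority label."""
    if share_of_defective_sets == 0:
        return "not observed"
    if share_of_defective_sets < THRESHOLD:
        return "consider"
    return "requirement"


# Share of defective data sets (%) for four rows of the results table.
OBSERVED = {
    "QD1: Duplicates / uniqueness violations": 28.57,
    "QD1: Misspelling": 7.69,
    "QD1: Precision": 0.0,
    "QD5: Completeness": 64.29,
}

priorities = {issue: classify(share) for issue, share in OBSERVED.items()}
```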
  12. CONCLUSIONS I
      The concept and topic of “data quality” has attracted researchers for more than three decades, and its popularity is unlikely to fade, since data are an integral part of both our lives and business. With the popularity of open government data, their value is now higher than ever.
      The paradigm according to which data quality control and management are performed in closed systems is no longer valid. This leads to the modification of existing, and the development of new, data quality dimensions, their classifications, data quality issues, etc.
  13. CONCLUSIONS II
      The results showed that most of the defects are characteristic of the OGD available to every stakeholder. The OGD have data quality issues which, as demonstrated by OGD-related studies, have a negative impact on users’ readiness and willingness to re-use these data for their own purposes, such as innovative services and solutions.
      Let's keep in mind that data are worth reusing only if they are usable in terms of both their value and their quality; otherwise they bring losses to businesses.
      Further studies on the topic include the development of a web-based data quality analysis tool, for which the knowledge obtained during this study will serve as a specification of the functionality to be covered.
  14. DATA AVAILABILITY
      Data are available in Open Access (under CC-BY).
      DOI: https://doi.org/10.5281/zenodo.4604656
      https://www.eosc-hub.eu/open-science-info
  15. THANK YOU FOR YOUR ATTENTION! QUESTIONS?
      For more information, see ResearchGate. See also anastasijanikiforova.com
      For questions or any other queries, contact me via email: Anastasija.Nikiforova@lu.lv
