"Methodology for Assessment of Linked Data Quality: A Framework" at Workshop on Linked Data Quality
Paper: https://dl.dropboxusercontent.com/u/2265375/LDQ/ldq2014_submission_3.pdf
User Interface Design by Sketching: A Complexity Analysis of Widget Representations - Jean Vanderdonckt
User interface design by sketching, as well as other sketching activities, typically involves sketching objects through representations that should combine meaningfulness for end users with ease of recognition for the recognition engines. To investigate this relationship, a multi-platform user interface design tool has been developed that enables designers to sketch design ideas at multiple levels of fidelity, with multistroke gestures supporting widget representations and commands. A usability analysis of these activities, as they are submitted to a recognition engine, suggests that the level of fidelity, the amount of constraints imposed on the representations, and the visual difference of representations positively impact the sketching activity as a whole. Implications for further sketch representations in user interface design and beyond are provided, based on usability guidelines.
Data Usability Assessment for Remote Sensing Data: Accuracy of Interactive Data Quality Interpretation - Beniamino Murgante
Data Usability Assessment for Remote Sensing Data: Accuracy of Interactive Data Quality Interpretation
Erik Borg, Bernd Fichtelmann - German Aerospace Center, German Remote Sensing Data Center
Hartmut Asche - Department of Geography, University of Potsdam
Leveraging DBpedia for Adaptive Crowdsourcing in Linked Data Quality Assessment - Umair ul Hassan
Crowdsourcing has emerged as a powerful paradigm for quality assessment and improvement of Linked Data. A major challenge of employing crowdsourcing for quality assessment in Linked Data is the cold-start problem: how to estimate the reliability of crowd workers and assign the most reliable workers to tasks? We address this challenge by proposing a novel approach for generating test questions from DBpedia based on the topics associated with quality assessment tasks. These test questions are used to estimate the reliability of new workers. Subsequently, tasks are dynamically assigned to reliable workers to help improve the accuracy of collected responses. Our proposed approach, ACRyLIQ, is evaluated using workers hired from Amazon Mechanical Turk on two real-world Linked Data datasets. We validate the proposed approach in terms of accuracy and compare it against a baseline that estimates reliability using gold-standard tasks. The results demonstrate that our proposed approach achieves high accuracy without using gold-standard tasks.
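The cold-start mechanism the abstract describes can be sketched in a few lines: estimate each new worker's reliability from DBpedia-derived test questions, then route tasks only to workers above a reliability threshold. This is an invented minimal illustration (names, formula and threshold are assumptions), not the ACRyLIQ implementation.

```python
# Minimal sketch of cold-start worker routing (not the ACRyLIQ code):
# reliability = fraction of test questions answered correctly; tasks
# then go only to workers above a reliability threshold.

def estimate_reliability(answers, gold):
    """answers, gold: dicts mapping question id -> answer."""
    correct = sum(1 for q, a in answers.items() if gold.get(q) == a)
    return correct / len(gold) if gold else 0.0

def assign_tasks(tasks, reliability, threshold=0.7):
    reliable = sorted((w for w, r in reliability.items() if r >= threshold),
                      key=reliability.get, reverse=True)
    if not reliable:
        return {}
    # Spread tasks round-robin over the reliable workers.
    return {t: reliable[i % len(reliable)] for i, t in enumerate(tasks)}

gold = {"q1": "Berlin", "q2": "Dublin"}
reliability = {
    "w1": estimate_reliability({"q1": "Berlin", "q2": "Dublin"}, gold),  # 1.0
    "w2": estimate_reliability({"q1": "Paris", "q2": "Dublin"}, gold),   # 0.5
}
print(assign_tasks(["t1", "t2"], reliability))  # both tasks -> w1
```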
Linked Data Quality assessment applied to and integrated into the Linked Data generation and publication workflow. Presented at the Data Quality tutorial, a satellite event of SEMANTICS2016.
Assessing and Refining Mappings to RDF to Improve Dataset Quality - andimou
RDF dataset quality assessment is currently performed primarily after data is published. However, there is neither a systematic way to incorporate its results into the dataset nor to incorporate the assessment into the publishing workflow. Adjustments are manually (but rarely) applied. Moreover, the root cause of the violations, which often lies in the mappings that specify how the RDF dataset will be generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for RDF datasets stemming originally from (semi-)structured data (e.g., CSV, XML, JSON). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the RDF dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as DBpedia, and newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an RDF dataset in the observed cases.
METHODS, MATHEMATICAL MODELS, DATA QUALITY ASSESSMENT AND RESULT INTERPRETATION: SOLUTIONS DEVELOPED IN THE IFEDH FRAMEWORK - HTAi Bilbao 2012
METHODS, MATHEMATICAL MODELS, DATA QUALITY ASSESSMENT AND RESULT INTERPRETATION: SOLUTIONS DEVELOPED IN THE IFEDH FRAMEWORK
G. Zauner
dwh Simulation Services
Vienna, Austria
A brief introduction to Data Quality rule development and implementation covering:
- What are Data Quality Rules?
- Examples of Data Quality Rules.
- What are the benefits of rules?
- How can I create my own rules?
- What alternative approaches are there to building my own rules?
The presentation also includes a very brief overview of our Data Quality Rule services. For more information on these, please contact us.
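To ground the list above, here is what a data quality rule can look like in practice: a named predicate evaluated per record, with violations collected for reporting. This is a minimal sketch; the field names ("email", "age") and thresholds are invented, not taken from the presentation.

```python
# Minimal sketch of data quality rules as named predicates over records;
# field names and thresholds are invented for the example.
import re

RULES = {
    "email_format": lambda r: re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+",
                                       r.get("email", "")) is not None,
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}

def check(records):
    """Return (record index, failed rule name) pairs for all violations."""
    return [(i, name)
            for i, rec in enumerate(records)
            for name, rule in RULES.items()
            if not rule(rec)]

records = [{"email": "ann@example.org", "age": 34},
           {"email": "broken-at-example", "age": 190}]
print(check(records))  # [(1, 'email_format'), (1, 'age_in_range')]
```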
FAIR Data Prototype - Interoperability and FAIRness through a novel combination of Web technologies - Mark Wilkinson
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, which was tasked with building a prototype infrastructure fulfilling the FAIR Principles for scholarly data publishing. We show how the Linked Data Platform, the RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
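As a rough illustration of the Triple Pattern Fragments part of this stack, a TPF server exposes each triple pattern as a plain HTTP resource that clients can fetch and page through. The sketch below follows the interface of the public Linked Data Fragments demo servers; the endpoint URL and parameter names are assumptions, and the paper's prototype may expose things differently.

```python
# Hedged sketch: fetch one Triple Pattern Fragment over plain HTTP.
# Endpoint and parameter names follow the public Linked Data Fragments
# demo servers; the paper's own prototype may differ.
import urllib.parse
import urllib.request

def fetch_fragment(server, subject=None, predicate=None, obj=None):
    params = {k: v for k, v in
              (("subject", subject), ("predicate", predicate), ("object", obj))
              if v is not None}
    url = server + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"Accept": "text/turtle"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example: the fragment for all triples about one DBpedia resource.
print(fetch_fragment("http://fragments.dbpedia.org/2016-04/en",
                     subject="http://dbpedia.org/resource/Tim_Berners-Lee")[:300])
```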
This paper is supplementary material for the following article: Bicevskis, J., Nikiforova, A., Bicevska, Z., Oditis, I., & Karnitis, G. (2019, October). A step towards a data quality theory. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) (pp. 303-308). IEEE.
Data quality issues have been topical for many decades. However, a unified data quality theory has not been proposed yet, since many concepts associated with the term “data quality” are not straightforward enough. The paper proposes a user-oriented data quality theory based on clearly defined concepts. The concepts are defined by using three groups of domain-specific languages (DSLs): (1) the first group uses the concept of a data object to describe the data to be analysed, (2) the second group describes the data quality requirements, and (3) the third group describes the process of data quality evaluation. The proposed idea proved to be simple enough, but at the same time very effective in identifying data defects, despite the different structures of data sets and the complexity of data. Approbation of the approach demonstrated several advantages: (a) a graphical data quality model allows defining of data quality even by non-IT and non-data quality professionals, (b) data quality model is not related to the information system that has accumulated data, i.e., this approach lets users analyse the “third-party” data, and (c) data quality can be described at least at two levels of abstraction - informally, using natural language, or formally, including executable program routines or SQL statements.
In this fast-paced data-driven world, the fallout from a single data quality issue can cost thousands of dollars in a matter of hours. To catch these issues quickly, system monitoring for data quality requires a different set of strategies from other continuous regression efforts. Like a race car pit crew, you need detection mechanisms that not only don’t interfere with what you are monitoring but also allow for strategic analysis off-track. You need to use every second your subject is at rest to repair and clean up problems that could affect performance. As the systems in race cars vary, the tools and resources available to the data quality professional vary from one organization to the next. You need to be able to leverage the tools at hand to implement your solutions. Shauna Ayers and Catherine Cruz Agosto show you how to develop testing strategies to detect issues with data integration timing, operational dependencies, reference data management, and data integrity—even in production systems. See how you can leverage this testing to provide proactive notification alerts and feed business intelligence dashboards to communicate the health of your organization’s data systems to both operation support and non-technical personnel.
5 Practical Steps to a Successful Deep Learning Research - Brodmann17
Deep Learning has gained huge popularity over the last several years, especially due to its remarkable progress in many domains.
Many resources are out there, including open-source implementations of recent research advancements. This vast availability is somewhat misleading, because when one actually wants to create a Deep Learning based product, one soon realizes that there is a large gap between these open-source implementations and a real production-grade Deep Learning product. Closing this gap can take months of work and large costs, especially in manpower and compute power.
In this talk I draw on my experience leading the research at Brodmann17 to cover several aspects we have found important for building Deep Learning based computer vision products.
Research on product quality control of multi varieties and small batch based ... - IRJESJOURNAL
ABSTRACT: This paper mainly studies the application of statistical process control in a multi-variety, small-batch production environment. It puts forward a method of quality control based on Bayesian theory. First, Bayesian theory is used to estimate the parameters of the production process. Then a Bayesian model, built on this parameter estimation, is used to control multi-variety, small-batch production. A Bayesian control model identification method is proposed. Finally, an example is given to verify the feasibility of the method. The results show that this method can serve as a quality control method for many kinds of small-batch products.
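The Bayesian estimation step described in the abstract can be illustrated with the simplest conjugate case: modelling a batch's defect rate with a Beta prior, possibly pooled from similar varieties, and updating it as inspected units arrive. This is a generic Beta-Binomial sketch under assumed numbers, not the paper's actual model.

```python
# Generic Beta-Binomial sketch of Bayesian monitoring of a defect
# rate for a small batch; an illustration, not the paper's model.

def update(alpha, beta, defects, inspected):
    """Conjugate update of a Beta(alpha, beta) prior on the defect rate."""
    return alpha + defects, beta + (inspected - defects)

def posterior_mean(alpha, beta):
    return alpha / (alpha + beta)

# Weak prior with mean 0.05, e.g. pooled from similar product varieties.
a, b = 1.0, 19.0
a, b = update(a, b, defects=3, inspected=20)
print(f"posterior mean defect rate: {posterior_mean(a, b):.3f}")  # 0.100
# A control rule might flag the process when this mean crosses a
# limit derived from the batch's quality target.
```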
Presentation for I-Semantics 2013 conference on "User-driven Quality Evaluation of DBpedia", link to full paper: http://svn.aksw.org/papers/2013/ISemantics_DBpediaDQ/public.pdf.
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvements in Open Data (Catalogues) - Martin Kaltenböck
Presentation of the ADEQUATe Project in the course of the Workshop on Quality Assessment and Improvements in Open Data (Catalogues), taking place at the annual open data conference Switzerland (that took place 14 June 2016 in Lausanne, see: http://www.opendata.ch).
Workshop speakers / facilitators: Johann Höchtl (Danube University Krems), Jürgen Umbrich (University of Economics, Vienna), Martin Kaltenböck (Semantic Web Company).
More info: http://www.adequate.at
Mechanisms for Data Quality and Validation in Citizen Science - Andrea Wiggins
Presentation for a paper on ways to improve data quality for citizen science. Presentation delivered by Nathan Prestopnik at a workshop on citizen science at eScience 2011.
International Journal of Mathematics and Statistics Invention (IJMSI) - inventionjournals
International Journal of Mathematics and Statistics Invention (IJMSI) is an international journal intended for professionals and researchers in all fields of mathematics and statistics. IJMSI publishes research articles and reviews across the whole field of Mathematics and Statistics, covering new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Recommender Systems Fairness Evaluation via Generalized Cross Entropy - Vito Walter Anelli
Fairness in recommender systems has been considered with respect to sensitive attributes of users (e.g., gender, race) or items (e.g., revenue in a multistakeholder setting). Regardless, the concept has been commonly interpreted as some form of equality – i.e., the degree to which the system is meeting the information needs of all its users in an equal sense. In this paper, we argue that fairness in recommender systems does not necessarily imply equality, but instead it should consider a distribution of resources based on merits and needs. We present a probabilistic framework based on generalized cross entropy to evaluate fairness of recommender systems under this perspective, where we show that the proposed framework is flexible and explanatory by allowing to incorporate domain knowledge (through an ideal fair distribution) that can help to understand which item or user aspects a recommendation algorithm is over- or under-representing. Results on two real-world datasets show the merits of the proposed evaluation framework both in terms of user and item fairness.
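To make the idea concrete, the sketch below measures how far an observed distribution of recommendation utility over user groups deviates from a domain-informed ideal, using a Tsallis-family divergence that is zero exactly at the ideal. This illustrates the evaluation idea only; the paper's exact generalized cross entropy parametrisation may differ, and all numbers are invented.

```python
# Hedged sketch of fairness as divergence from an ideal distribution.
# divergence() is a Tsallis-type divergence (0 iff the distributions
# match); the paper's generalized cross entropy may differ in form.

def divergence(fair, observed, alpha=2.0):
    """D_alpha(fair || observed); requires observed[i] > 0, alpha != 1."""
    s = sum(f**alpha * o**(1 - alpha) for f, o in zip(fair, observed))
    return (s - 1) / (alpha - 1)

ideal = [0.5, 0.5]      # domain knowledge: both groups deserve equal utility
observed = [0.8, 0.2]   # utility split the recommender actually produced
print(divergence(ideal, ideal))     # 0.0: perfectly fair under this notion
print(divergence(ideal, observed))  # 0.5625: skewed toward the first group
```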
"Using Linked Data to Evaluate the Impact of Research and Development in Europe: A Structural Equation Model" presented at ISWC 2013 (http://link.springer.com/chapter/10.1007/978-3-642-41338-4_16)
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (vertices with the same in-links) avoids duplicate computation and can likewise reduce iteration time. Road networks often contain chains that can be short-circuited before the PageRank computation, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce both the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
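As a minimal illustration of the first technique, the sketch below runs plain power-iteration PageRank but freezes a vertex once its rank stops changing by more than a tolerance; the function name and tolerances are invented for the example, and the real STICD algorithm layers the remaining optimisations (in-identical vertices, chains, SCC ordering) on top.

```python
# PageRank sketch illustrating per-vertex convergence skipping: a
# vertex whose rank has stopped moving is frozen and not recomputed.
# Assumes no dangling vertices (every vertex has at least one out-link).

def pagerank_skip(out_links, damping=0.85, tol=1e-8, max_iter=100):
    in_links = {v: [] for v in out_links}
    for u, nbrs in out_links.items():
        for v in nbrs:
            in_links[v].append(u)
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    converged = set()
    for _ in range(max_iter):
        if len(converged) == n:
            break  # every vertex is stable; stop early
        for v in out_links:
            if v in converged:
                continue  # skip work on an already-converged vertex
            new = (1 - damping) / n + damping * sum(
                rank[u] / len(out_links[u]) for u in in_links[v])
            if abs(new - rank[v]) < tol:
                converged.add(v)
            rank[v] = new
    return rank

print(pagerank_skip({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))
```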
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group ("MCG") expects demand to keep growing and supply to evolve, driven by institutional investment rotating out of offices and into work-from-home ("WFH") plays, and by the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, as illustrated by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
6. Phase I: Requirement Analysis
Step I: Use Case Analysis
- Description that best illustrates the intended usage of the dataset(s)
Two types of users:
➢ Consumers
➢ Potential consumers
7. Phase II: Quality Assessment
Step II: Identification of quality issues
➢ Based on the use case
➢ Checklist-based approach (Yes = 1, No = 0)
➢ List of quality dimensions
10. Data Quality Score
➢ Ratio, computed for each property:
DQscore = 1 - (V / T)
where V = total number of instances that violate a DQ rule, and T = total number of relevant instances
➢ Weighted, for the quality of the properties overall:
DQweightedscore = DQscore * wi / W
where wi = weight of the property, and W = sum of all weighted factors of the properties
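Read as code, the two slide formulas amount to the short sketch below; the property names, violation counts and weights are placeholder values, and the overall score sums the weighted per-property scores, which is the natural reading of the slide.

```python
# Sketch of the slide's scores: DQscore = 1 - V/T per property, then
# DQweightedscore = DQscore * wi / W summed over properties (W = sum
# of weights). Property names and numbers are placeholders.

def dq_score(violations, total):
    """Per-property score: 1 - V/T (1.0 when no relevant instances)."""
    return 1 - violations / total if total else 1.0

def weighted_dq(scores, weights):
    """Overall score: sum over properties of DQscore_i * w_i / W."""
    W = sum(weights.values())
    return sum(scores[p] * w / W for p, w in weights.items())

scores = {"dbo:birthDate": dq_score(violations=120, total=1000),  # 0.88
          "rdfs:label": dq_score(violations=5, total=1000)}       # 0.995
weights = {"dbo:birthDate": 2.0, "rdfs:label": 1.0}
print(weighted_dq(scores, weights))  # ~0.918
```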
11. Phase III: Quality Improvement
Step V: Root Cause Analysis
➢ Analyze the cause of each quality issue
➢ Helps the user interpret the results
➢ Detect whether the problem occurs in the original dataset
➢ In case the original dataset is unavailable, analyze the available dataset to determine the cause
13. Conclusion and Future Work
➢ Assessment methodology: 3 phases, 6 steps
➢ Focus on the use case
➢ Improvement phase
Future Work
➢ Application to an actual use case
➢ Build a tool