The document introduces the topic of data quality and discusses common data quality problems. It defines data quality problems as issues that exist within single data sources or across multiple sources. Specific problem types are outlined that can occur at the attribute value, tuple, relation, and cross-relation levels. Methods for detecting different types of data quality problems are also presented. Finally, the document discusses phases of the data cleaning process and available tool support.
Introduction to Data Quality
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 1.
March 19, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Quality 1 / 21
Introduction
Data is an important asset for organizations.
Data warehouses and exploration tools depend on data quality.
1. DQ Problems within a Single Data Source
1.1. DQ Problems within a Single Relation
1.1.1. An Attribute Value of a Single Tuple
Missing value.
Syntax violation.
Outdated value.
Interval violation.
Set violation.
Misspelling error.
Inadequate value for the attribute context.
Value items beyond the attribute context.
Meaningless value.
Value with imprecise or doubtful meaning.
Domain constraint violation.
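A few of the attribute-value problems above can be checked mechanically. The following is only a minimal sketch: the attribute names (`birth_date`, `age`, `gender`) and the rules attached to them are hypothetical and chosen for illustration, not taken from the slides.

```python
import re

# Hypothetical rules, assumed only for this illustration.
DATE_SYNTAX = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # expected syntax: YYYY-MM-DD
AGE_INTERVAL = (0, 120)                            # allowed interval for an age
GENDER_SET = {"F", "M"}                            # allowed set of codes


def check_attribute_value(attribute, value):
    """Return the list of problems found in one attribute value of one tuple."""
    # Missing value: nothing else can be checked on an empty value.
    if value is None or str(value).strip() == "":
        return ["missing value"]
    problems = []
    # Syntax violation: the value does not follow the expected format.
    if attribute == "birth_date" and not DATE_SYNTAX.match(str(value)):
        problems.append("syntax violation")
    # Interval violation: the numeric value falls outside the allowed range.
    if attribute == "age":
        low, high = AGE_INTERVAL
        if not (low <= value <= high):
            problems.append("interval violation")
    # Set violation: the value is not one of the allowed codes.
    if attribute == "gender" and value not in GENDER_SET:
        problems.append("set violation")
    return problems
```

Problems such as outdated or meaningless values generally need external knowledge and are not covered by this kind of rule check.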
1.1.2. The Values of a Single Attribute
Unique value violation.
Existence of synonyms.
Domain constraint violation.
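Problems over the values of a single attribute only show up when the whole column is examined. The sketch below assumes the column is a plain Python list; the `SYNONYMS` table is a hypothetical stand-in for whatever dictionary or thesaurus would be available in practice.

```python
from collections import Counter

# Hypothetical synonym table mapping a term to its canonical form.
SYNONYMS = {"prof.": "professor", "teacher": "professor"}


def unique_value_violations(values):
    """Return the values that occur more than once in a supposedly unique column."""
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)


def synonym_groups(values):
    """Group different spellings or terms that map to the same canonical form."""
    groups = {}
    for v in values:
        canonical = SYNONYMS.get(v.lower(), v.lower())
        groups.setdefault(canonical, set()).add(v)
    # Only groups with more than one distinct surface form are reported.
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}
```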
1.1.3. The Attribute Values of a Single Tuple
Semi-empty tuple.
Inconsistency among attribute values.
Domain constraint violation.
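Tuple-level problems relate several attribute values of the same record. As a rough sketch, assuming a record is a Python dictionary: the 50% threshold for a semi-empty tuple and the `quantity * unit_price = total` rule are both assumptions made for this example.

```python
def is_semi_empty(record, threshold=0.5):
    """True when at least `threshold` of the record's attributes are empty."""
    if not record:
        return True
    empty = sum(1 for v in record.values() if v in (None, ""))
    return empty / len(record) >= threshold


def has_total_inconsistency(record, tolerance=1e-9):
    """True when the (hypothetical) rule total = quantity * unit_price is broken."""
    expected = record["quantity"] * record["unit_price"]
    return abs(expected - record["total"]) > tolerance
```

The threshold would normally be set by the user, since what counts as "too empty" depends on the relation.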
1.1.4. The Attribute Values of Several Tuples
Redundancy about an entity.
Inconsistency about an entity.
Domain constraint violation.
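Redundancy and inconsistency about an entity both involve several tuples describing the same real-world object. The sketch below simplifies heavily: it assumes the entity can be identified by an exact match on one key attribute (`tax_id` in the usage note is hypothetical), whereas real duplicate detection usually needs approximate matching.

```python
from collections import defaultdict


def entity_problems(rows, key):
    """Classify entities that appear in more than one tuple.

    'redundancy'    : all tuples for the entity carry identical values.
    'inconsistency' : tuples for the entity disagree on some value.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    report = {}
    for entity, group in groups.items():
        if len(group) < 2:
            continue  # a single tuple cannot be redundant or inconsistent
        distinct = {tuple(sorted(r.items())) for r in group}
        report[entity] = "redundancy" if len(distinct) == 1 else "inconsistency"
    return report
```

For example, two identical rows with `tax_id` "1" would be reported as redundancy, while two rows with `tax_id` "2" and different cities would be reported as inconsistency.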
1.2. Relationships among Multiple Relations
Referential integrity violation.
Outdated reference.
Syntax inconsistency.
Inconsistency among related attribute values.
Circularity among tuples in a self-relationship.
Domain constraint violation.
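Two of the cross-relation problems above lend themselves to a short illustration. This is a sketch under stated assumptions: relations are lists of dictionaries, a self-relationship is given as a mapping from each tuple's id to the id it references, and all attribute names in the example are hypothetical.

```python
def referential_integrity_violations(child_rows, fk, parent_rows, pk):
    """Return child tuples whose foreign key matches no parent primary key."""
    parent_keys = {p[pk] for p in parent_rows}
    return [c for c in child_rows
            if c[fk] is not None and c[fk] not in parent_keys]


def find_cycle_members(parent_of):
    """Return the ids that take part in a cycle of a self-relationship.

    `parent_of` maps each id to the id it references (or None).
    """
    in_cycle = set()
    for start in parent_of:
        seen = []
        node = start
        # Follow references until the chain ends or revisits a node.
        while node is not None and node not in seen:
            seen.append(node)
            node = parent_of.get(node)
        if node is not None:
            # The chain walked back onto `node`: everything from there is a cycle.
            in_cycle.update(seen[seen.index(node):])
    return in_cycle
```

A typical self-relationship is an employee table whose `manager_id` refers back to the same table; a cycle there means someone is, transitively, their own manager.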
Phases of Data Cleaning
Data analysis.
Data profiling.
Data mining.
Descriptive data mining models.
Clustering, summarization, association discovery and sequence discovery.
Definition of transformation workflow and mapping rules.
Verification.
Transformation.
Backflow of cleaned data.
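To make the data profiling step of the data analysis phase more concrete, here is a minimal sketch that summarizes one column. It assumes the column is a list of comparable values; the function name and the chosen statistics are illustrative rather than what any particular profiling tool computes.

```python
def profile_column(values):
    """Summarize one column: size, missing values, distinct values, range."""
    present = [v for v in values if v not in (None, "")]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Minimum and maximum only make sense when some value is present.
    if present:
        profile["min"] = min(present)
        profile["max"] = max(present)
    return profile
```

A profile like this can hint at where to look next: a high missing count, or a minimum or maximum far outside the expected range, points to candidate quality problems.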
Tool Support
Data analysis and reengineering tools.
Data profiling - MigrationArchitect.
Data mining - WizRule and DataMiningSuite.
Data reengineering - Integrity.
Specialized cleaning tools.
Special domain cleaning - idCentric, PureIntegrate, QuickAddress, Reunion, and Trillium.
Duplicate elimination - DataCleanser, Merge/PurgeLibrary, matchIT, and MasterMerge.
ETL (Extraction, Transformation, Loading) tools.
CopyManager, DataStage, Extract, PowerMart, DecisionBase, DataTransformationService, MetaSuite, SagentSolution, and WarehouseAdministrator.
Conclusions
Identification, classification and systematization of DQ problems.
Taxonomy using a bottom-up approach.
Definition of methods to detect DQ problems, represented as binary classification trees.
Thank you!
Questions?