Introduction to Data Quality
Pradeeban Kathiravelu
INESC-ID Lisboa
Instituto Superior Técnico, Universidade de Lisboa
Lisbon, Portugal
Data Quality – Presentation 1.
March 19, 2015.
Pradeeban Kathiravelu (IST-ULisboa) Data Quality 1 / 21
Introduction
Data is an important asset for organizations.
Data warehouses and exploration tools depend on data quality.
Data Quality Problems
1. DQ Problems within a Single Data Source
1.1. DQ Problems within a Single Relation
1.1.1. An Attribute Value of a Single Tuple
Missing value.
Syntax violation.
Outdated value.
Interval violation.
Set violation.
Misspelling error.
Inadequate value to the attribute context.
Value items beyond the attribute context.
Meaningless value.
Value with imprecise or doubtful meaning.
Domain constraint violation.
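
A few of the single-value problems above can be checked mechanically. The following is a minimal sketch, not the detection method from the slides: the date pattern, the age interval, and the status set are hypothetical rules chosen only for illustration.

```python
import re

# Hypothetical rules for illustration; none of them come from the slides.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # assumed date syntax
AGE_INTERVAL = (0, 120)                              # assumed valid interval
ALLOWED_STATUS = {"single", "married", "widowed"}    # assumed value set


def is_missing(value):
    """Missing value: None or a blank string."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def violates_syntax(value, pattern=DATE_PATTERN):
    """Syntax violation: a present value that does not match the expected format."""
    return not is_missing(value) and pattern.match(str(value)) is None


def violates_interval(value, interval=AGE_INTERVAL):
    """Interval violation: a present numeric value outside [low, high]."""
    low, high = interval
    return not is_missing(value) and not (low <= value <= high)


def violates_set(value, allowed=ALLOWED_STATUS):
    """Set violation: a present value that is not in the allowed set."""
    return not is_missing(value) and value not in allowed
```

Problems such as outdated or meaningless values usually need external knowledge and are not covered by checks of this kind.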
Detecting DQ Problems
1.1.2. The Values of a Single Attribute
Uniqueness value violation.
Synonyms existence.
Domain constraint violation.
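
A uniqueness violation shows up when the values of one attribute are examined together. As a rough sketch (the function name and the choice to ignore missing values are assumptions, not part of the slides), repeated values can be counted as follows:

```python
from collections import Counter


def uniqueness_violations(values):
    """Return the values that occur more than once in a column.

    Missing values (None) are ignored here; that is an assumption of this sketch.
    """
    counts = Counter(v for v in values if v is not None)
    return sorted(v for v, n in counts.items() if n > 1)
```

For example, `uniqueness_violations(["A1", "B2", "A1", None, "C3", "B2"])` reports `"A1"` and `"B2"`.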
Detecting DQ Problems
1.1.3. The Attribute Values of a Single Tuple
Semi-empty tuple.
Inconsistency among attribute values.
Domain constraint violation.
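
Problems at the tuple level compare several attributes of the same row. The sketch below is illustrative only: the 50% threshold for a semi-empty tuple and the attribute names `hired_on` and `left_on` are hypothetical.

```python
from datetime import date


def is_semi_empty(row, threshold=0.5):
    """Semi-empty tuple: at least `threshold` of the attributes are empty.

    The threshold value is an assumption of this sketch.
    """
    if not row:
        return False
    empty = sum(1 for v in row.values() if v is None or v == "")
    return empty / len(row) >= threshold


def has_date_inconsistency(row):
    """Inconsistency among attribute values: an end date before the start date.

    `hired_on` and `left_on` are hypothetical attribute names.
    """
    start, end = row.get("hired_on"), row.get("left_on")
    return start is not None and end is not None and end < start
```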
Detecting DQ Problems
1.1.4. The Attribute Values of Several Tuples
Redundancy about an entity.
Inconsistency about an entity.
Domain constraint violation.
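
Redundancy and inconsistency about an entity can be illustrated by grouping tuples that appear to describe the same entity. This is a simplified sketch: matching on a lowercased, whitespace-normalized `name` attribute is an assumption, and real duplicate detection typically needs fuzzier matching.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def group_by_entity(rows, key="name"):
    """Redundancy: return groups of rows that share the same normalized key."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row[key])].append(row)
    return {k: v for k, v in groups.items() if len(v) > 1}


def inconsistent_attributes(group, key="name"):
    """Inconsistency: attributes on which the rows of one entity disagree."""
    attrs = set().union(*(r.keys() for r in group)) - {key}
    return sorted(a for a in attrs if len({r.get(a) for r in group}) > 1)
```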
Detecting DQ Problems
1.2. Relationships among Multiple Relations
Referential integrity violation.
Outdated reference.
Syntax inconsistency.
Inconsistency among related attribute values.
Circularity among tuples in a self-relationship.
Domain constraint violation.
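
Two of the problems above lend themselves to a short illustration: referential integrity violations and circularity in a self-relationship. The sketch below assumes rows are dictionaries and that a self-relationship is given as a mapping from each tuple's id to the id it references; both representations are hypothetical.

```python
def dangling_references(child_rows, parent_keys, fk):
    """Referential integrity violation: foreign keys with no matching parent key."""
    return [r for r in child_rows if r[fk] is not None and r[fk] not in parent_keys]


def find_cycle_members(parent_of):
    """Circularity in a self-relationship.

    `parent_of` maps each tuple id to the id it references (or None).
    Returns the set of ids that lie on a cycle.
    """
    in_cycle = set()
    for start in parent_of:
        seen = []
        node = start
        # Follow references until the chain ends or a node repeats.
        while node is not None and node not in seen:
            seen.append(node)
            node = parent_of.get(node)
        if node is not None:
            # `node` was revisited, so everything from its first visit is a cycle.
            in_cycle.update(seen[seen.index(node):])
    return in_cycle
```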
Detecting DQ Problems
2. Multiple Data Sources
Syntax inconsistency.
Different measure units.
Representation inconsistency.
Different aggregation levels.
Synonyms existence.
Homonyms existence.
Redundancy about an entity.
Inconsistency about an entity.
Domain constraint violation.
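
When sources use different measure units, values have to be converted to a common unit before they can be compared. A minimal sketch, assuming weights recorded as `(value, unit)` pairs and a hypothetical tolerance of 0.01 kg:

```python
# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition).
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def to_kg(value, unit):
    """Convert a weight to kilograms."""
    return value * KG_PER_UNIT[unit]


def same_weight(a, b, tolerance=0.01):
    """Compare two (value, unit) pairs after converting both to kilograms.

    The tolerance is an assumption of this sketch.
    """
    return abs(to_kg(*a) - to_kg(*b)) <= tolerance
```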
Detecting DQ Problems
Data Cleaning Problems (Rahm and Do)
Phases of Data Cleaning
Data analysis.
  Data profiling.
  Data mining.
    Descriptive data mining models.
      Clustering, summarization, association discovery and sequence discovery.
Definition of transformation workflow and mapping rules.
Verification.
Transformation.
Backflow of cleaned data.
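
The data profiling step of the analysis phase can be pictured as computing simple per-attribute statistics. The following is a minimal sketch with a hypothetical choice of statistics, not a description of any tool listed in this presentation:

```python
def profile_column(values):
    """Summarize one attribute: size, missing count, distinct count, and range."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }
```

A high missing count or an unexpected minimum or maximum points to attributes that deserve a closer look before transformation rules are defined.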
Tool Support
Data analysis and reengineering tools.
  Data profiling - MigrationArchitect.
  Data mining - WizRule and DataMiningSuite.
  Data reengineering - Integrity.
Specialized cleaning tools.
  Special domain cleaning - idCentric, PureIntegrate, QuickAddress, Reunion, and Trillium.
  Duplicate elimination - DataCleanser, Merge/PurgeLibrary, matchIT, and MasterMerge.
ETL (Extraction, Transformation, Loading) tools.
  CopyManager, DataStage, Extract, PowerMart, DecisionBase, DataTransformationService, MetaSuite, SagentSolution, and WarehouseAdministrator.
Conclusions
Identification, classification and systematization of DQ problems.
Taxonomy using a bottom-up approach.
Definition of methods to detect DQ problems, represented as binary classification trees.
Thank you!
Questions?
References
Oliveira, P., Rodrigues, F., Henriques, P., & Galhardas, H. (2005, June). A taxonomy of data quality problems. In Proc. 2nd Int. Workshop on Data and Information Quality (in conjunction with CAiSE 2005), Porto, Portugal.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3-13.
Barateiro, J., & Galhardas, H. (2005). A Survey of Data Quality Tools. Datenbank-Spektrum, 14(15-21), 48.
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., & Lee, D. (2003). A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery, 7, 81-99.
Müller, H., & Freytag, J.-C. (2003). Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University, Berlin.