Data Quality: A Raising Data Warehousing Concern

Characteristics of Data Warehouse
Benefits of a data warehouse
Designing of Data Warehouse
Extract, Transform, Load (ETL)
Data Quality
Classification Of Data Quality Issues
Causes Of Data Quality
Impact of Data Quality Issues
Cost of Poor Data Quality
Confidence and Satisfaction-based impacts
Impact on Productivity
Risk and Compliance impacts
Why Data Quality Influences?
Causes of Data Quality Problems
How to deal: Missing Data
Data Corruption
Data: Out of Range error
Techniques of Data Quality Control
Data warehousing security

Published in: Data & Analytics

  1. 1. Data Quality: A Raising Data Warehousing Concern Presented by: Chowdhury, Mohammad Aminul Hoque
  2. 2. Data Warehousing
  3. 3. Characteristics of Data Warehouse • A data warehouse supports management decision making • It is subject-oriented: it is organized around particular subjects, such as sales or customers, rather than around a company's ongoing operations • It is integrated: data is gathered from a variety of sources and merged into a coherent whole • It is time-variant: all data is identified with a particular time period • It is non-volatile: data is stable once loaded.
  4. 4. Benefits of a data warehouse • Maintain data history • Integrate data from multiple source systems, enabling a central view • Improve data quality by providing consistent codes and descriptions, or even by fixing bad data • Present the organization's information consistently • Provide a single common data model for all data sources • Restructure the data so that it makes sense to the users • Restructure the data so that it delivers excellent query performance • Make decision-support queries easier.
  5. 5. Designing of Data Warehouse • Approaches: top-down, bottom-up, or a combination of both • From a software engineering point of view: waterfall and spiral • Conceptual modeling of data warehouses uses dimensions and measures: 1. Star schema 2. Snowflake schema 3. Fact constellations
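
To make the star schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The retail model, table names, and column names are illustrative assumptions, not part of the original slides: one fact table holds the measures and references two dimension tables.

```python
import sqlite3

# Hypothetical star schema: a central fact table (measures) that
# references dimension tables. All names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    region     TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE fact_sales (
    store_key   INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,  -- measure
    amount      REAL      -- measure
);
INSERT INTO dim_store VALUES (1, 'Downtown', 'North'), (2, 'Harbour', 'South');
INSERT INTO dim_product VALUES (10, 'Tea'), (11, 'Coffee');
INSERT INTO fact_sales VALUES (1, 10, 2, 5.0), (1, 11, 1, 3.5), (2, 10, 4, 10.0);
""")

# A typical decision-support query: aggregate a measure by a dimension attribute.
sales_by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s USING (store_key)
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

A snowflake schema would further normalize the dimensions (for example, moving `region` into its own table), and a fact constellation would add more fact tables that share these dimensions.
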
  6. 6. Extract, Transform, Load (ETL)
  7. 7. Extract • The ETL process begins by extracting the data from the source systems • ETL Architecture Pattern • Most data warehousing projects consolidate data from different source systems • Each separate system may also use a different data organization and/or format • The goal of the extraction phase is to convert the data into a single format appropriate for transformation processing.
  8. 8. Transform • This stage applies a series of rules to the data extracted from the source to derive the data for loading into the end target • Selecting only certain columns to load • Translating coded values (e.g., a source system that stores 1 for male and 2 for female) • Encoding free-form values (e.g., mapping "Male" to "M") • Deriving a new calculated value • Sorting • Joining data from multiple sources (e.g., lookup, merge) and de-duplicating the data • Aggregation (e.g., summarizing multiple rows of data: total sales for each store, for each region, etc.)
  9. 9. Transform • Generating surrogate-key values • Transposing or pivoting • Splitting a column into multiple columns • Looking up and validating the relevant data from tables or reference files for slowly changing dimensions • Applying any form of simple or complex data validation.
  10. 10. Load • This phase loads the data into the data warehouse • The process varies widely: some data warehouses overwrite existing information with cumulative information • However, the entry of data for any one-year window is made in a historical manner, so earlier entries are kept rather than overwritten • Because the load phase interacts with a database, the constraints defined in the database schema apply, which contributes to the overall data quality performance of the ETL process • ETL can also be used to transform the data into a format suitable for a new application to use.
  11. 11. Data Quality • Data quality is an essential characteristic that determines the reliability of data for making decisions • High-quality data is complete, accurate, available, and timely.
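
Several of the transformations listed on the two Transform slides can be sketched in a few lines of plain Python. The row layout, the field names, and the `GENDER_MAP` code table below are assumptions made for illustration; a real ETL tool would express the same rules in its own mapping language.

```python
from collections import defaultdict

# Hypothetical extracted rows; field names and codes are assumptions.
rows = [
    {"id": 1, "gender": "1", "store": "A", "sales": 100.0},
    {"id": 2, "gender": "Female", "store": "A", "sales": 50.0},
    {"id": 2, "gender": "Female", "store": "A", "sales": 50.0},  # duplicate row
    {"id": 3, "gender": "2", "store": "B", "sales": 70.0},
]

# Translate coded values and encode free-form values into one target code set.
GENDER_MAP = {"1": "M", "2": "F", "male": "M", "female": "F"}


def transform(rows):
    """De-duplicate on id, keep selected columns, and normalize gender."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["id"] in seen:
            continue  # drop repeated ids, keeping the first occurrence
        seen.add(row["id"])
        gender = GENDER_MAP.get(row["gender"].strip().lower(), "U")  # "U" = unknown
        cleaned.append(
            {"id": row["id"], "gender": gender, "store": row["store"], "sales": row["sales"]}
        )
    return cleaned


def total_sales_by_store(rows):
    """Aggregate: sum the sales measure for each store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["sales"]
    return dict(totals)


cleaned = transform(rows)
totals = total_sales_by_store(cleaned)
```

Mapping unrecognized values to an explicit "U" code, rather than dropping the row, keeps the defect visible for later profiling.
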
  12. 12. Classification Of Data Quality Issues • Data quality issues at data sources • Data quality issues at the data profiling stage • Data quality issues at data staging and ETL • Data quality problems at data modelling
  13. 13. DATA SOURCE • The sources of dirty data include data entry errors and update errors • Part of the data comes from text files, part from MS Excel, and part from other sources • Some files are the result of manual consolidation of multiple files, which may compromise data quality. DATA PROFILE • A process of developing information about data instead of information from data.
  14. 14. Example of Data Profiling
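
As a minimal sketch of what profiling a single column can look like, the hypothetical helper below (`profile_column` is not from the slides) reports row count, missing values, distinct values, and the most frequent value. The sample gender column is invented for illustration.

```python
from collections import Counter


def profile_column(values):
    """Return basic profile statistics for one column of raw values."""
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),  # None and empty strings
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical source column with inconsistent representations.
gender_column = ["M", "F", "f", "Male", "", None, "M"]
report = profile_column(gender_column)
```

A distinct count of 4 for a field that should hold two or three codes is exactly the kind of "information about data" that profiling is meant to surface before integration.
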
  15. 15. DATA STAGING ETL • A data cleaning process is executed in the data staging area to improve the accuracy of the data • The data staging area is the place where all grooming is done on data after it is extracted from the source systems • It is a prime location for validating data quality from the source, and for auditing and tracking down data issues.
  16. 16. DATA MODELLING • The schema design of the data warehouse greatly influences the quality of the analysis • Operational applications use a UML class model for conceptual data modelling • Issues include slowly changing dimensions, rapidly changing dimensions, and multi-valued dimensions.
  17. 17. Causes Of Data Quality CAUSES OF DATA QUALITY PROBLEMS AT DATA SOURCES • Wrong information entered into the source system • As time passes and distance from the source increases, the chances of getting correct data decrease • Inability to cope with ageing data contributes to data quality problems • Varying timeliness of data sources • System fields designed to allow free-form entry, or fields without adequate length • Missing values in data sources • Additional columns • Use of different representation formats in data sources
  18. 18. Causes Of Data Quality CAUSES OF DATA QUALITY PROBLEMS AT DATA PROFILING • Unreliable and incomplete metadata for the data sources • User-generated SQL queries written for profiling purposes leave data quality problems undetected • Failure to evaluate data structure, data values, and data relationships before data integration propagates poor data quality • Lack of integration between data profiling and ETL causes data quality problems • Inappropriate selection of an automated profiling tool causes data quality issues • Insufficient structural analysis of the data sources in the profiling stage.
  19. 19. Cont.. CAUSES OF DATA QUALITY ISSUES AT DATA STAGING AND ETL PHASE • Different business rules across the various data sources create data quality problems • Business rules that lack currency contribute to data quality problems • Failure to capture only the changes in source files • Lack of periodic refreshing of the integrated data storage • Disabling data integrity constraints in data staging tables causes wrong data and relationships to be extracted • Purging of data from the data warehouse causes data quality problems • Inability to restart the ETL process from checkpoints without losing data • Lack of automatically generated rules that let ETL tools build mappings to detect and fix data defects • Unhandled null values cause data quality problems
  20. 20. Cont.. CAUSES OF DATA QUALITY ISSUES AT DATA WAREHOUSE SCHEMA DESIGN • Incomplete or wrong requirement analysis of the project leads to poor schema design • Lack of currency in business rules causes poor requirement analysis • The choice of dimensional modelling schema (star, snowflake, fact constellation) contributes to data quality • Late identification of slowly changing dimensions contributes to data quality problems • Late-arriving dimensions cause data quality problems • Multi-valued dimensions cause data quality problems • Incomplete or wrong identification of facts and dimensions, bridge tables, or relationship tables • Inability to support database schema refactoring causes data quality problems
  21. 21. DQ TOOLS
  23. 23. Impact of Data Quality Issues
  24. 24. Cost of Poor Data Quality
  25. 25. Confidence and Satisfaction-based impacts • Poor data quality results in low confidence in forecasting and in inconsistent operational and management reporting • This causes delayed or improper decisions • It affects the satisfaction of customers, employees, or suppliers, which leads to decreased organizational trust • Example: an international bank could not meet its customer satisfaction goals because agents in its 23 contact centres all followed different operational processes, using up to 18 different apps (many of which contained duplicate data) to serve a single customer.
  26. 26. Impact on Productivity • Workloads: increased need for reconciliation of reports • Throughput: increased time for data gathering and preparation, reduced time for direct data analysis, delays in delivering information products, lengthened production and manufacturing cycles • Output quality: mistrusted reports • Supply chain: out-of-stock items, delivery delays, missed deliveries, duplicate costs for product delivery
  27. 27. Risk and Compliance impacts • Risk and compliance impacts are associated with credit assessment, investment risk, competitive risk, capital investment and/or development, fraud and leakage, and compliance with government regulations, industry expectations, or self-imposed policies (such as privacy policies) • Example: healthcare systems deal with sensitive information about patients' health conditions. The privacy of this kind of data should be protected.
  28. 28. Examples of Data Quality Problems • A retail company found over 1m records containing a home telephone number of "000000000" and addresses containing flight numbers • An insurance company found customer records with 99/99/99 in the creation date field of the policy • A car rental company discovered duplicate agreement numbers in its European data warehouse • A healthcare company found 9 different values in the gender field • A food/beverage retail chain found that the same product was both its No. 1 and No. 2 best seller across its business
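
Defects like the ones on this slide can often be caught with simple rule checks. The sketch below is illustrative only: the function names and the set of sentinel dates are assumptions, and real rules would come from the business owners of each field.

```python
import re
from collections import Counter


def is_placeholder_phone(phone):
    """True when the number is one digit repeated, such as "000000000"."""
    return bool(re.fullmatch(r"(\d)\1+", phone))


def is_placeholder_date(text):
    """True for sentinel dates; this set of sentinels is an assumption."""
    return text in {"99/99/99", "00/00/00"}


def duplicate_keys(keys):
    """Return the keys that occur more than once, sorted."""
    return sorted(key for key, count in Counter(keys).items() if count > 1)


# Hypothetical agreement numbers from a rental system.
agreements = ["EU-1001", "EU-1002", "EU-1001", "EU-1003"]
repeated = duplicate_keys(agreements)
```
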
  29. 29. Example cont..
  30. 30. Example cont..
  31. 31. Example cont..
  32. 32. Why Data Quality Influences? • Schema design influences the quality of the analysis • Poor data handling procedures and processes • Failure to adhere to data entry and maintenance procedures • Errors in the migration process from one system to another • External and third-party data that may not fit
  33. 33. Causes of Data Quality Problems • Choice of dimensional modelling schema (star, snowflake, fact constellation) • Multi-valued dimensions • Incomplete or wrong identification of facts and dimensions, bridge tables, or relationship tables • Incomplete or missing values • Corrupted values • Out-of-range values • Wrong data • Duplicate data • Dissimilar data formats • Incompatible structures
  34. 34. Missing Data • Nonresponse: no information is provided • Improper data collection • Mistakes in data entry How to deal • Imputation • Reconstruction • Removal • Interpolation
  35. 35. Data Corruption • Undetected (silent) • Detected
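
Three of the four strategies on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions: missing values are represented as `None`, imputation uses the mean of the observed values, and interpolation is linear between the nearest known neighbours. Reconstruction is omitted because it depends on having another source to rebuild the value from.

```python
def drop_missing(values):
    """Removal: discard the missing entries."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Imputation: replace each missing entry with the mean of the observed ones."""
    present = drop_missing(values)
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def interpolate_linear(values):
    """Interpolation: fill interior gaps on a straight line between known neighbours.

    Leading and trailing gaps have no neighbour on one side and stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


# Hypothetical sensor readings with a two-value gap.
readings = [10.0, None, None, 16.0, 20.0]
filled = interpolate_linear(readings)
```

Interpolation suits ordered data such as time series, while mean imputation ignores ordering and can flatten real variation, so the choice depends on the data.
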
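
One common way to turn silent corruption into detected corruption is to store a checksum alongside the data and verify it on read. The slide does not name a technique, so the sketch below is an assumption for illustration, using SHA-256 from Python's `hashlib`; the file content is invented.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """True when the data still matches the digest recorded earlier."""
    return checksum(data) == expected_digest


# Hypothetical extract file; the digest is recorded when the file is written.
original = b"policy_id,created\n42,2015-03-01\n"
stored_digest = checksum(original)

# Simulate corruption in transit or at rest: one value changes.
corrupted = original.replace(b"42", b"43")
```
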
  36. 36. Out of Range error
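
An out-of-range check compares each value against the bounds that the business rules allow. In this minimal sketch the `out_of_range` helper, the patient records, and the 0 to 120 age bounds are all illustrative assumptions.

```python
def out_of_range(records, field, low, high):
    """Return the records whose field value falls outside [low, high]."""
    return [r for r in records if not (low <= r[field] <= high)]


# Hypothetical patient records; the valid age range is an assumed business rule.
patients = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": 212},
]
bad = out_of_range(patients, "age", 0, 120)
```
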
  37. 37. Techniques of Data Quality Control • Use the specific business rules of the various data sources • Enable data integrity constraints in data staging • Provide internal profiling or integration with third-party data profiling and cleansing tools • Automatically generate rules for ETL tools to build mappings
  38. 38. Data warehousing security • Appropriate to summaries and aggregates of data • Exploration data warehouse • Data encryption and enhancing privacy. For more information visit
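
To illustrate what enabling integrity constraints in a staging area buys, here is a minimal sketch with Python's built-in `sqlite3`. The staging tables, column names, and allowed gender codes are assumptions for illustration. Note that SQLite enforces foreign keys only when the `foreign_keys` pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE stg_customer (
    customer_id INTEGER PRIMARY KEY,
    gender      TEXT NOT NULL CHECK (gender IN ('M', 'F', 'U'))
);
CREATE TABLE stg_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES stg_customer(customer_id),
    amount      REAL CHECK (amount >= 0)
);
INSERT INTO stg_customer VALUES (1, 'F');
""")


def try_insert(sql, params):
    """Attempt one insert; report whether the constraints accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "loaded"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (100, 1, 25.0)),   # valid row
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (101, 99, 25.0)),  # unknown customer
    try_insert("INSERT INTO stg_order VALUES (?, ?, ?)", (102, 1, -5.0)),   # negative amount
    try_insert("INSERT INTO stg_customer VALUES (?, ?)", (2, "9")),         # invalid gender code
]
```

With the constraints in place, the three defective rows are rejected at load time instead of propagating wrong data and relationships into the warehouse.
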