Characteristics of Data Warehouse
Benefits of a data warehouse
Designing of Data Warehouse
Extract, Transform, Load (ETL)
Data Quality
Classification Of Data Quality Issues
Causes Of Data Quality
Impact of Data Quality Issues
Cost of Poor Data Quality
Confidence and Satisfaction-based impacts
Impact on Productivity
Risk and Compliance impacts
Why Data Quality Influences?
Causes of Data Quality Problems
How to deal: Missing Data
Data Corruption
Data: Out of Range error
Techniques of Data Quality Control
Data warehousing security
3. Characteristics of Data Warehouse
• A data warehouse supports management decision
making
• It is subject-oriented: it gives information about a particular
subject rather than about a company's ongoing operations
• It is integrated: data is gathered into the data warehouse
from a variety of sources and merged into a coherent whole
• It is time-variant: all data is identified with a
particular time period
• It is non-volatile: data is stable once loaded.
4. Benefits of a data warehouse
Maintain data history
Integrate data from multiple source systems, enabling a central
view
Improve data quality, by providing codes and descriptions, or
even fixing bad data
Present the organization's information consistently
Provide a single common data model for all data sources
Restructure the data so that it makes sense to the users
Restructure the data so that it delivers excellent query performance
Make decision-support queries easier.
5. Designing of Data Warehouse
Top-down, bottom-up approaches or a combination of both
From a software engineering point of view: waterfall and spiral
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
1. Star schema
2. Snowflake schema
3. Fact constellations
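As a rough illustration of dimensions and measures, the sketch below builds a tiny star schema in an in-memory SQLite database: one fact table holding a measure, joined to two dimension tables. The table and column names (dim_store, dim_date, fact_sales, amount) are hypothetical and chosen only for this example.

```python
import sqlite3

def build_star_schema(conn):
    # One fact table (measures) surrounded by dimension tables (context).
    # All names here are hypothetical, for illustration only.
    conn.executescript("""
        CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
        CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            store_key INTEGER REFERENCES dim_store(store_key),
            date_key  INTEGER REFERENCES dim_date(date_key),
            amount    REAL  -- the measure
        );
    """)
    conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                     [(1, "Store A", "North"), (2, "Store B", "South")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(20240101, 2024, 1), (20240201, 2024, 2)])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                     [(1, 20240101, 100.0), (1, 20240201, 50.0), (2, 20240101, 70.0)])

def sales_by_region(conn):
    # A typical decision-support query: aggregate a measure by a dimension attribute.
    return conn.execute("""
        SELECT s.region, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_store AS s ON f.store_key = s.store_key
        GROUP BY s.region
        ORDER BY s.region
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(sales_by_region(conn))
```

A snowflake schema would further normalize the dimension tables (for example, moving region into its own table), and a fact constellation would add more fact tables sharing these dimensions.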
7. Extract
The ETL process involves extracting the data from
the source systems.
ETL Architecture Pattern
Most data warehousing projects consolidate
data from different source systems
Each separate system may also use a different
data organization and/or format
The goal of the extraction phase is to convert
the data into a single format appropriate for
transformation processing.
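The extraction goal above can be sketched as follows: two sources in different formats are read into one common record format (a list of dictionaries). The source names ("crm", "billing"), the field names, and the sample data are assumptions made for this example.

```python
import csv
import io
import json

def extract_csv(text, source):
    # Read CSV text into dicts and tag each record with its source system.
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]

def extract_json(text, source):
    # Read a JSON array of objects into the same dict-based format.
    return [dict(rec, source=source) for rec in json.loads(text)]

# Hypothetical inputs standing in for two source systems.
crm_csv = "customer_id,name\n1,Alice\n2,Bob\n"
billing_json = '[{"customer_id": "3", "name": "Carol"}]'

records = extract_csv(crm_csv, "crm") + extract_json(billing_json, "billing")
for rec in records:
    print(rec)
```

Every record now has the same shape regardless of where it came from, which is what the transformation stage needs.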
8. Transform
This stage applies a series of rules to the data extracted from the source
to derive the data for loading into the end target
Selecting only certain columns to load.
Translating coded values (e.g., the source stores 1 for male and 2 for female, but the warehouse stores M and F)
Encoding free-form values (e.g., mapping "Male" to "M")
Deriving a new calculated value
Sorting
Joining data from multiple sources (e.g., lookup, merge) and
de-duplicating the data
Aggregation (e.g., summarizing multiple rows of data: total
sales for each store, for each region, etc.)
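A minimal sketch of several of the transformations listed above: selecting columns, translating coded and free-form values, de-duplicating, and aggregating. The field names, the code table, and the "U" value for unrecognized genders are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical mapping from source codes and free-form values to one encoding.
GENDER_CODES = {"1": "M", "2": "F", "Male": "M", "Female": "F"}

def transform(rows):
    seen = set()
    cleaned = []
    for row in rows:
        key = row["sale_id"]
        if key in seen:          # de-duplicate on the sale identifier
            continue
        seen.add(key)
        cleaned.append({         # keep only the columns we want to load
            "sale_id": key,
            "store": row["store"],
            "gender": GENDER_CODES.get(row["gender"], "U"),  # "U" = unknown (assumption)
            "amount": float(row["amount"]),
        })
    return cleaned

def total_sales_by_store(rows):
    # Aggregation: summarize multiple rows into one total per store.
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)

raw = [
    {"sale_id": "a1", "store": "S1", "gender": "1", "amount": "10.0", "note": "x"},
    {"sale_id": "a1", "store": "S1", "gender": "1", "amount": "10.0", "note": "dup"},
    {"sale_id": "a2", "store": "S1", "gender": "Female", "amount": "5.5", "note": ""},
    {"sale_id": "a3", "store": "S2", "gender": "?", "amount": "2.0", "note": ""},
]
clean = transform(raw)
totals = total_sales_by_store(clean)
print(clean)
print(totals)
```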
9. Transform
Generating surrogate-key values
Transposing or pivoting
Splitting a column into multiple columns
Looking up and validating the relevant data from tables or
referential files for slowly changing dimensions
Applying any form of simple or complex data validation.
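Three of the transformations on this slide can be sketched in a few lines each: generating surrogate keys, splitting a column, and pivoting. The function names and sample values are hypothetical, and splitting a name on the first space is a simplifying assumption.

```python
from itertools import count

def assign_surrogate_keys(natural_keys):
    # Map each distinct natural (source) key to a warehouse-generated integer.
    counter = count(1)
    mapping = {}
    for nk in natural_keys:
        if nk not in mapping:
            mapping[nk] = next(counter)
    return mapping

def split_full_name(full_name):
    # Split one column into two; everything after the first space is the last name.
    first, _, last = full_name.strip().partition(" ")
    return {"first_name": first, "last_name": last}

def pivot(rows, index, column, value):
    # Turn (index, column, value) rows into one nested entry per index value.
    out = {}
    for row in rows:
        out.setdefault(row[index], {})[row[column]] = row[value]
    return out

print(assign_surrogate_keys(["C-9", "C-2", "C-9"]))
print(split_full_name("Ada Lovelace"))
print(pivot(
    [{"store": "S1", "quarter": "Q1", "sales": 10},
     {"store": "S1", "quarter": "Q2", "sales": 12}],
    "store", "quarter", "sales",
))
```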
10. Load
This phase loads the data into data warehouse.
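This process varies widely. Some data warehouses may
overwrite existing information with cumulative information;
others add new data in a historical manner, so that the entries
for any one window (for example, one year) are kept as a history.
As the load phase interacts with a database, the constraints defined
in the database schema apply (e.g., uniqueness, referential integrity,
mandatory fields), which also contributes to the
overall data quality performance of the ETL process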
ETL can be used to transform the data into a format suitable
for the new application to use.
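The sketch below illustrates two load strategies (overwrite and append) against an in-memory SQLite table, and shows how constraints declared in the schema reject bad rows at load time. The table name, columns, and the `mode` parameter are assumptions made for this example.

```python
import sqlite3

def load(conn, rows, mode="append"):
    # Constraints in the schema act as a last line of data quality defence.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            sale_id TEXT PRIMARY KEY,
            store   TEXT NOT NULL,
            amount  REAL NOT NULL
        )
    """)
    if mode == "overwrite":
        conn.execute("DELETE FROM sales")  # replace existing contents
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO sales VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))  # keep rejects for auditing
    conn.commit()
    return rejected

conn = sqlite3.connect(":memory:")
rejected = load(conn, [("a1", "S1", 10.0), ("a1", "S1", 10.0), ("a2", None, 5.0)])
loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(loaded, "row(s) loaded;", len(rejected), "row(s) rejected")
```

Here the duplicate key and the missing store are both rejected by the database rather than by the ETL code itself.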
11. Data Quality
Data quality is an essential characteristic that determines the
reliability of data for making decisions.
High-quality data:
Complete
Accurate
Available
Timely
12. Classification Of Data Quality Issues
Data Quality Issues at Data Sources
Data Quality Issues at Data Profiling Stage
Data Quality issues at Data Staging ETL
Data Quality Problems at Data Modelling
13. DATA SOURCE
The sources of dirty data include data entry errors and
update errors
Part of the data comes from text files, part from MS Excel
files, and part from other sources
Some files are the result of manual consolidation of multiple
files, as a result of which data quality might be
compromised.
DATA PROFILE
• A process of developing information about data
instead of information from data.
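To make "information about data" concrete, the sketch below computes a small profile of one column: row count, missing count, distinct count, and minimum and maximum. The function name, the treatment of empty strings as missing, and the sample records are assumptions made for this example.

```python
def profile_column(rows, column):
    # Summarize a column's contents without interpreting any single record.
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]  # "" counted as missing (assumption)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

customers = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "F"},
    {"id": 3, "gender": ""},
    {"id": 4, "gender": "male"},
]
gender_profile = profile_column(customers, "gender")
print(gender_profile)
```

A distinct count of three for a field expected to hold two codes is the kind of signal profiling is meant to surface before integration.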
15. DATA STAGING ETL
• A data cleaning process is executed in the data
staging area to improve data accuracy
• The data staging area is the place where all
grooming is done on data after it is culled from
the source systems
• It is a prime location for validating the quality of data
from the sources, and for auditing and tracking down data
issues.
16. DATA MODELLING
• The schema design of the data warehouse greatly influences the
quality of the analysis
• Operational applications use UML class models
for conceptual data modelling
• Issues include slowly changing dimensions, rapidly
changing dimensions, multi-valued
dimensions, etc.
17. Causes Of Data Quality
CAUSES OF DATA QUALITY PROBLEMS AT DATA SOURCES
• Wrong information entered into source system
• As time and proximity from the source increase, the
chances for getting correct data decrease
• Inability to cope with ageing data contributes to data
quality problems
• Varying timeliness of data sources
• System fields designed to allow free-form entry (fields not
having adequate length).
• Missing values in data sources
• Additional columns
• Use of different representation formats in data sources
18. Causes Of Data Quality
CAUSES OF DATA QUALITY PROBLEMS AT DATA
PROFILING
• Unreliable and incomplete metadata of data source
• User-generated SQL queries written for data profiling
purposes leave data quality problems undetected.
• Inability to evaluate data structure, data values
and data relationships before data integration
propagates poor data quality
• Lack of integration between data profiling and
ETL causes data quality problems
• Inappropriate selection of an automated profiling tool
causes data quality issues
• Insufficient structural analysis of the data sources in
the profiling stage.
19. Causes Of Data Quality (cont.)
CAUSES OF DATA QUALITY ISSUES AT DATA STAGING AND ETL PHASE
• Different business rules across data sources create
data quality problems.
• Business rules that lack currency contribute to DQ problems
• Failure to capture only the changes in source files
• Lack of periodic refreshing of the integrated data storage
• Disabling data integrity constraints in data staging tables
causes wrong data and relationships to be extracted
• Purging of data from the data warehouse causes data quality
problems
• The inability to restart the ETL process from checkpoints
without losing data
• Lack of automatically generated rules for ETL tools to build
mappings that detect and fix data defects
• Unhandled null values cause data quality problems
20. Causes Of Data Quality (cont.)
CAUSES OF DATA QUALITY ISSUES AT DATA WAREHOUSE SCHEMA DESIGN
• Incomplete or wrong requirement analysis of the project leads to poor
schema design
• Lack of currency in business rules causes poor requirement analysis
• The choice of dimensional modelling schema
(star, snowflake, fact constellation) contributes to data
quality.
• Late identification of slowly changing dimensions contributes to data
quality problems.
• Late-arriving dimensions cause DQ problems.
• Multi-valued dimensions cause DQ problems
• Incomplete/wrong identification of facts/dimensions, bridge tables or
relationship tables
• Inability to support database schema refactoring causes data quality
problems
25. Confidence and Satisfaction-based impacts
Poor data quality results in low confidence in
forecasting and in inconsistent operational and management
reporting.
This causes delayed or improper decisions.
It affects the satisfaction of customers, employees, or
suppliers, which leads to decreased organizational trust.
Ex: An international bank could not meet
its customer satisfaction goals because agents in its 23
contact centres all followed different operational
processes, using up to 18 different apps (many of which
contained duplicate data) to serve a single customer.
26. Impact on Productivity
Workloads : Increased need for reconciliation of reports
Throughput : Increased time for data gathering and
preparation, reduced time for direct data analysis,
delays in delivering information products, lengthened
production and manufacturing cycles
Output quality : Mistrusted reports
Supply chain : Out-of-stock, delivery delays, missed
deliveries, duplicate costs for product delivery
27. Risk and Compliance impacts
Risk and compliance impacts are associated with credit
assessment, investment risk, competitive risk, capital
investment and/or development, fraud and leakage,
and compliance with government regulations, industry
expectations, or self-imposed policies (such as privacy
policies).
Ex: Healthcare systems deal with sensitive information
about patients’ health conditions. The privacy of this
kind of data should be protected.
28. Examples of Data Quality Problem
• Retail company found over 1m records contained a
home tel number of “000000000” and addresses
containing flight numbers
• Insurance company found customer records with
99/99/99 in the creation date field of the policy
• Car rental company discovered duplicate agreement
numbers in their European data warehouse
• Healthcare company found 9 different values in the
gender field
• Food/Beverage retail chain found the same product
was both their No 1 and No 2 best seller across their
business
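Checks for problems like those above can often be written as small rules. The sketch below flags placeholder phone numbers, impossible dates, and duplicate identifiers. The function names and the dd/mm/yy date format are assumptions made for this example, not rules taken from the companies mentioned.

```python
from collections import Counter
from datetime import datetime

def is_placeholder_phone(phone):
    # A number made of one repeated digit (e.g. "000000000") is almost certainly filler.
    digits = [c for c in phone if c.isdigit()]
    return bool(digits) and len(set(digits)) == 1

def is_valid_date(text, fmt="%d/%m/%y"):
    # Assumes a dd/mm/yy format; "99/99/99" fails because there is no day or month 99.
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

def duplicates(values):
    # Return the values that occur more than once, e.g. repeated agreement numbers.
    return sorted(v for v, n in Counter(values).items() if n > 1)

print(is_placeholder_phone("000000000"))
print(is_valid_date("99/99/99"))
print(duplicates(["A1", "A2", "A1"]))
```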
33. Why Data Quality Influences?
Schema Design influences the quality of the analysis
Poor data handling procedures and processes
Failure to adhere to data entry and maintenance
procedures
Errors in the migration process from one system to
another
External and third-party data that may not fit
34. Causes of Data Quality Problems
Choice of dimensional modelling schema (star, snowflake, fact constellation)
Multi-valued dimensions
Incomplete/Wrong identification of facts/dimensions, bridge tables or
relationship tables
Incomplete/missing values
Corrupted values
Out of range values
Wrong data
Duplicate data
Dissimilar data formats
Incompatible structures
35. Missing Data
Nonresponse: no information is provided
Data collected improperly
Mistakes in data entry
How to deal
• Imputation
• Reconstruction
• Deletion/Removal
• Interpolation
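Three of the options above (removal, imputation, interpolation) can be sketched on a simple numeric series in which None marks a missing value. Mean imputation and linear interpolation are chosen here only as common examples of each approach; reconstruction depends on other data sources and is not shown.

```python
from statistics import mean

def drop_missing(values):
    # Removal: discard the missing entries.
    return [v for v in values if v is not None]

def impute_mean(values):
    # Imputation: replace each missing entry with the mean of the known ones.
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def interpolate_linear(values):
    # Interpolation: fill interior gaps on a straight line between the nearest
    # known neighbours; leading or trailing gaps are left as None.
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

readings = [10.0, None, 14.0, None, None, 20.0]
print(drop_missing(readings))
print(impute_mean(readings))
print(interpolate_linear(readings))
```

Which option is appropriate depends on the data: interpolation assumes an ordered series, while mean imputation flattens variation.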
38. Techniques of Data Quality Control
Use specific business rules of various data sources
Enable data integrity constraints in data staging
Provide internal profiling or integration with third-party
data profiling and cleansing tools
Automatically generate rules for ETL tools to
build mappings
39. Data warehousing security
Appropriate to summaries and aggregates of data
Exploration data warehouse
Data encryption and enhancing privacy.
For more information visit
http://aminchowdhury.info