What is data inconsistency? 
Data inconsistency exists when different and conflicting versions of 
the same data appear in different places. Data inconsistency creates 
unreliable information, because it is difficult to determine which 
version of the information is correct. (It's difficult to make correct - 
and timely - decisions if those decisions are based on conflicting 
information.) 
Data inconsistency is likely to occur when there is data redundancy. 
Data redundancy occurs when a data file or database contains 
redundant - unnecessarily duplicated - data. That's why one major goal 
of good database design is to eliminate data redundancy.
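The redundancy-to-inconsistency path can be sketched in a few lines. This is a hypothetical illustration (the record names and values are invented, not from the text): a customer's address is stored redundantly in two places, only one copy is updated, and the two copies now conflict.

```python
# Hypothetical example: a customer's address stored redundantly in an
# orders list and a customers record (all names/values are illustrative).
orders = [
    {"order_id": 1, "customer": "Ana", "ship_address": "12 Oak St"},
    {"order_id": 2, "customer": "Ana", "ship_address": "12 Oak St"},
]
customers = {"Ana": {"address": "12 Oak St"}}

# The customer moves, but only one copy of the data is updated...
customers["Ana"]["address"] = "9 Elm Ave"

# ...so the two versions now conflict: which address is correct?
addresses = {customers["Ana"]["address"]} | {o["ship_address"] for o in orders}
print(sorted(addresses))  # two conflicting versions of the same fact
```

Eliminating the redundancy (storing the address only in `customers` and looking it up by key) would make this inconsistency impossible by construction.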
What is data validation?
• Data validation is intended to provide certain well-defined 
guarantees of fitness, accuracy, and consistency for any of 
various kinds of user input into an application or automated 
system. 
• Data validation rules can be defined and designed using any of 
various methodologies, and may be deployed in different 
contexts according to requirements.
Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from 
the source databases, transforms it and then loads it into the data warehouse.
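The extract-transform-load flow in the figure can be sketched as three small functions. This is a minimal sketch under invented assumptions (the source rows, field names, and transformations are illustrative, not from the text):

```python
# A minimal ETL sketch: extract raw rows from a "source", transform
# them into clean records, and load them into a "warehouse" list.
source_db = [("alice", "2024-01-05", "129.90"), ("bob", "2024-01-06", "45.00")]

def extract(db):
    """Extract: read raw rows from the operational source."""
    return list(db)

def transform(rows):
    """Transform: normalize names and convert amounts to numbers."""
    return [{"customer": name.title(), "date": day, "amount": float(amt)}
            for name, day, amt in rows]

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(source_db)), warehouse)
print(warehouse[0])  # a cleaned, typed record ready for analysis
```

In a real system each stage would talk to databases and handle errors and incremental loads; the point here is only the shape of the pipeline.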
• In evaluating the basics of data validation, 
generalizations can be made regarding the different 
types of validation, according to the scope, complexity, 
and purpose of the various validation operations to be 
carried out. 
• For example: 
• Data type validation; 
• Range and constraint validation; 
• Code and cross-reference validation, such as one or 
more tests against regular expressions; and 
• Structured validation (e.g., LDAP).
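The first three of these validation types can be illustrated with small checks. The field names, ranges, and reference codes below are assumptions made for the example, not rules from the text:

```python
import re

# Illustrative checks for three of the validation types listed above.

def validate_age(value):
    """Data type + range/constraint validation: an int between 0 and 130."""
    return isinstance(value, int) and 0 <= value <= 130

POSTCODE_RE = re.compile(r"^\d{5}$")

def validate_postcode(value):
    """Validation via a regular-expression test (5-digit code assumed)."""
    return bool(POSTCODE_RE.fullmatch(value))

COUNTRY_CODES = {"US", "IN", "DE"}  # a tiny cross-reference table

def validate_country(value):
    """Cross-reference validation against a set of allowed codes."""
    return value in COUNTRY_CODES

print(validate_age(34), validate_postcode("90210"), validate_country("XX"))
```

Structured validation (e.g., checking that a whole record or directory entry conforms to a schema) composes checks like these over nested data.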
Data Integration in Data Warehousing 
Generally speaking, a data integration system combines the data residing at 
different sources and provides a unified, reconciled view of these data, called the 
global schema, which can be queried by the user. In the design of a data integration system, 
an important aspect is the way in which the global schema is specified, i.e., which 
data model is adopted and what kinds of constraints on the data can be expressed. 
Moreover, a basic decision concerns how to specify the relation 
between the sources and the global schema. There are basically two approaches to 
this problem. The first approach, called global-as-view (GAV), requires that the 
global schema be expressed in terms of the data sources. More precisely, to every 
concept of the global schema, a view over the data sources is associated, so that its 
meaning is specified in terms of the data residing at the sources. In the second 
approach, called local-as-view (LAV), the global schema is specified independently 
of the sources, and the relationships between the global schema and the 
sources are established by defining every source as a view over the global schema.
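The GAV idea, "every global concept is a view over the sources", can be sketched concretely. All schemas and data below are invented for illustration; the "view" is just a Python function joining two source relations:

```python
# A toy global-as-view (GAV) sketch: the global 'product' concept is
# defined as a view over two sources (all names/data are illustrative).
source1 = [("p1", "Widget"), ("p2", "Gadget")]   # (product_id, name)
source2 = [("p1", 9.99), ("p2", 24.50)]          # (product_id, price)

def global_product():
    """GAV view: the global concept as a join over the source data."""
    prices = dict(source2)
    return [{"id": pid, "name": name, "price": prices[pid]}
            for pid, name in source1]

# The user queries the global schema; the view resolves to source data.
print(global_product()[0])
```

Under LAV the arrow would point the other way: each source would be described as a view over an independently designed global schema, and answering a query would require reasoning over those source descriptions.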
Problems in data integration 
• Inconsistencies 
• Redundancies
METADATA IN THE DATA WAREHOUSE 
• Metadata is simply defined as data about data: data that are used to 
describe other data. For example, the index of a 
book serves as metadata for the contents of the book. In other words, 
metadata is the summarized data that leads us to the 
detailed data. 
• Categories of Metadata 
• The metadata can be broadly categorized into three categories: 
• Business Metadata - This metadata includes data ownership information, 
business definitions, and change policies. 
• Technical Metadata - Technical metadata includes database system 
names, table and column names and sizes, data types and allowed 
values. Technical metadata also includes structural information such as 
primary and foreign key attributes and indices. 
• Operational Metadata - This metadata includes the currency of data and 
data lineage. Currency of data means whether the data is active, archived, or 
purged. Lineage of data means the history of the data's migration and the 
transformations applied to it.
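The three categories can be pictured as one record per warehouse table. The keys and values below are illustrative assumptions, not a standard metadata schema:

```python
# A sketch of the three metadata categories for one hypothetical
# warehouse table (all names and values are illustrative).
metadata = {
    "business": {                      # ownership, definitions, policies
        "owner": "Sales department",
        "definition": "Monthly revenue per customer",
    },
    "technical": {                     # names, types, keys, indices
        "table": "FACT_SALES",
        "columns": {"customer_id": "INTEGER", "amount": "DECIMAL(10,2)"},
        "primary_key": ["sale_id"],
    },
    "operational": {                   # currency of data and lineage
        "currency": "active",          # active / archived / purged
        "lineage": ["extracted from ORDERS", "amount converted to USD"],
    },
}
print(metadata["operational"]["currency"])
```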
Mapping 
• A basic part of the data warehouse environment 
is that of mapping from the operational 
environment into the data warehouse. The 
mapping includes a wide variety of facets, 
including, but not limited to: 
• mapping from one attribute to another, 
• conversions, 
• changes in naming conventions, 
• changes in physical characteristics of data, 
• filtering of data, etc.
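The facets above can be shown together in one small mapping step. The operational field names, the renaming, and the filter rule are all assumptions for the example:

```python
# A sketch of operational-to-warehouse mapping covering attribute
# renaming, type conversion, naming-convention changes, and filtering.
# All field names and values are invented for illustration.
operational = [
    {"CUST_NM": "ana lopez", "ORD_AMT": "129.90", "STATUS": "CANCELLED"},
    {"CUST_NM": "bob chen", "ORD_AMT": "45.00", "STATUS": "SHIPPED"},
]

def map_row(row):
    return {
        "customer_name": row["CUST_NM"].title(),  # rename + naming convention
        "order_amount": float(row["ORD_AMT"]),    # string-to-number conversion
    }

# Filtering: cancelled orders are not loaded into the warehouse.
warehouse_rows = [map_row(r) for r in operational
                  if r["STATUS"] != "CANCELLED"]
print(warehouse_rows)
```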
Retention and Purging 
• What is data retention and purging? 
• There are certain requirements to purge, archive, or delete the data in a data 
warehouse after a certain period of time, often termed the retention period of the 
data warehouse. Once the retention period is reached, data from the data 
warehouse are purged, deleted, or archived to a separate place, usually on a 
low-cost storage medium (e.g., tape). 
• Why is data purging required? 
• In an idealized scenario, we assume the data warehouse stores data for good. 
However, there are some reasons why this might not be a good idea in a real 
scenario: 
• There is a cost overhead associated with the amount of data that we store. This 
includes the cost of the storage medium, infrastructure, and human resources 
necessary to manage the data. 
• Data volume has a direct impact on the performance of a data warehouse. 
More data means more time-consuming sorting and searching operations. 
• End users of the data warehouse on the business side may not be interested in 
very old facts and figures. Data may lose its importance and relevance with the 
changing business landscape, so retaining such irrelevant data may not be required.
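A retention policy reduces to a date comparison per row. The five-year retention period and the fact rows below are assumptions for the sketch; real purge jobs would run against the database and move archived rows to cheap storage:

```python
from datetime import date, timedelta

# A sketch of retention-based purging (the period is an assumption).
RETENTION = timedelta(days=5 * 365)   # keep roughly five years of data

facts = [
    {"sale_id": 1, "sale_date": date(2010, 3, 1)},   # far past retention
    {"sale_id": 2, "sale_date": date.today()},        # current data
]

def purge(rows, today=None):
    """Split rows into kept vs archived, based on the retention period."""
    today = today or date.today()
    keep = [r for r in rows if today - r["sale_date"] <= RETENTION]
    archive = [r for r in rows if today - r["sale_date"] > RETENTION]
    return keep, archive

keep, archive = purge(facts)
print(len(keep), len(archive))
```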

Data Warehousing Terminology
