What is data inconsistency? 
Data inconsistency exists when different and conflicting versions of 
the same data appear in different places. Data inconsistency creates 
unreliable information, because it is difficult to determine which 
version of the information is correct. (It's difficult to make correct - 
and timely - decisions if those decisions are based on conflicting 
information.) 
Data inconsistency is likely to occur when there is data redundancy. 
Data redundancy occurs when a data file or database contains 
redundant - unnecessarily duplicated - data. That's why one major goal 
of good database design is to eliminate data redundancy.
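The redundancy-to-inconsistency path can be sketched in a few lines. This is a hypothetical illustration (the record names and values are invented, not from the text): a customer's address is stored redundantly in two places, only one copy is updated, and the two copies now conflict.

```python
# Hypothetical example: a customer's address stored redundantly in an
# orders list and a customers record (all names/values are illustrative).
orders = [
    {"order_id": 1, "customer": "Ana", "ship_address": "12 Oak St"},
    {"order_id": 2, "customer": "Ana", "ship_address": "12 Oak St"},
]
customers = {"Ana": {"address": "12 Oak St"}}

# The customer moves, but only one copy of the data is updated...
customers["Ana"]["address"] = "9 Elm Ave"

# ...so the two versions now conflict: which address is correct?
addresses = {customers["Ana"]["address"]} | {o["ship_address"] for o in orders}
print(sorted(addresses))  # two conflicting versions of the same fact
```

Eliminating the redundancy (storing the address only in `customers` and looking it up by key) would make this inconsistency impossible by construction.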
What is data validation?
• Data validation is intended to provide certain well-defined 
guarantees of fitness, accuracy, and consistency for any of 
various kinds of user input into an application or automated 
system. 
• Data validation rules can be defined and designed using any of 
various methodologies, and may be deployed in different 
contexts according to requirements.
Figure 1: Simple schematic for a data warehouse. The ETL process extracts information from 
the source databases, transforms it and then loads it into the data warehouse.
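The extract-transform-load flow in the figure can be sketched as three small functions. This is a minimal sketch under invented assumptions (the source rows, field names, and transformations are illustrative, not from the text):

```python
# A minimal ETL sketch: extract raw rows from a "source", transform
# them into clean records, and load them into a "warehouse" list.
source_db = [("alice", "2024-01-05", "129.90"), ("bob", "2024-01-06", "45.00")]

def extract(db):
    """Extract: read raw rows from the operational source."""
    return list(db)

def transform(rows):
    """Transform: normalize names and convert amounts to numbers."""
    return [{"customer": name.title(), "date": day, "amount": float(amt)}
            for name, day, amt in rows]

def load(rows, warehouse):
    """Load: append the cleaned rows to the warehouse."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(source_db)), warehouse)
print(warehouse[0])  # a cleaned, typed record ready for analysis
```

In a real system each stage would talk to databases and handle errors and incremental loads; the point here is only the shape of the pipeline.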
• In evaluating the basics of data validation, 
generalizations can be made regarding the different 
types of validation, according to the scope, complexity, 
and purpose of the various validation operations to be 
carried out. 
• For example: 
• Data type validation; 
• Range and constraint validation; 
• Code and cross-reference validation, such as one or 
more tests against regular expressions; and 
• Structured validation (e.g., LDAP).
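The first three of these validation types can be illustrated with small checks. The field names, ranges, and reference codes below are assumptions made for the example, not rules from the text:

```python
import re

# Illustrative checks for three of the validation types listed above.

def validate_age(value):
    """Data type + range/constraint validation: an int between 0 and 130."""
    return isinstance(value, int) and 0 <= value <= 130

POSTCODE_RE = re.compile(r"^\d{5}$")

def validate_postcode(value):
    """Validation via a regular-expression test (5-digit code assumed)."""
    return bool(POSTCODE_RE.fullmatch(value))

COUNTRY_CODES = {"US", "IN", "DE"}  # a tiny cross-reference table

def validate_country(value):
    """Cross-reference validation against a set of allowed codes."""
    return value in COUNTRY_CODES

print(validate_age(34), validate_postcode("90210"), validate_country("XX"))
```

Structured validation (e.g., checking that a whole record or directory entry conforms to a schema) composes checks like these over nested data.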
Data Integration in Data Warehousing 
Generally speaking, a data integration system combines the data residing at 
different sources and provides a unified, reconciled view of these data, called the 
global schema, which can be queried by the user. In the design of a data integration system, 
an important aspect is the way in which the global schema is specified, i.e., which 
data model is adopted and what kinds of constraints on the data can be expressed. 
Moreover, a basic decision concerns how to specify the relation 
between the sources and the global schema. There are basically two approaches to 
this problem. The first approach, called global-as-view (GAV), requires that the 
global schema be expressed in terms of the data sources. More precisely, to every 
concept of the global schema, a view over the data sources is associated, so that its 
meaning is specified in terms of the data residing at the sources. In the second 
approach, called local-as-view (LAV), the global schema is specified independently 
of the sources, and the relationships between the global schema and the 
sources are established by defining every source as a view over the global schema.
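The GAV idea, "every global concept is a view over the sources", can be sketched concretely. All schemas and data below are invented for illustration; the "view" is just a Python function joining two source relations:

```python
# A toy global-as-view (GAV) sketch: the global 'product' concept is
# defined as a view over two sources (all names/data are illustrative).
source1 = [("p1", "Widget"), ("p2", "Gadget")]   # (product_id, name)
source2 = [("p1", 9.99), ("p2", 24.50)]          # (product_id, price)

def global_product():
    """GAV view: the global concept as a join over the source data."""
    prices = dict(source2)
    return [{"id": pid, "name": name, "price": prices[pid]}
            for pid, name in source1]

# The user queries the global schema; the view resolves to source data.
print(global_product()[0])
```

Under LAV the arrow would point the other way: each source would be described as a view over an independently designed global schema, and answering a query would require reasoning over those source descriptions.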
Problems in data integration 
• Inconsistencies 
• Redundancies
METADATA IN THE DATA WAREHOUSE 
• Metadata is simply defined as data about data: data that are used to 
describe other data. For example, the index of a 
book serves as metadata for the contents of the book. In other words, 
metadata is the summarized data that leads us to the 
detailed data. 
• Categories of Metadata 
• The metadata can be broadly categorized into three categories: 
• Business Metadata - This metadata includes data ownership information, 
business definitions, and change policies. 
• Technical Metadata - Technical metadata includes database system 
names, table and column names and sizes, data types and allowed 
values. Technical metadata also includes structural information such as 
primary and foreign key attributes and indices. 
• Operational Metadata - This metadata includes the currency of data and 
data lineage. Currency of data means whether the data is active, archived, or 
purged. Lineage of data means the history of the data's migration and the 
transformations applied to it.
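The three categories can be pictured as one record per warehouse table. The keys and values below are illustrative assumptions, not a standard metadata schema:

```python
# A sketch of the three metadata categories for one hypothetical
# warehouse table (all names and values are illustrative).
metadata = {
    "business": {                      # ownership, definitions, policies
        "owner": "Sales department",
        "definition": "Monthly revenue per customer",
    },
    "technical": {                     # names, types, keys, indices
        "table": "FACT_SALES",
        "columns": {"customer_id": "INTEGER", "amount": "DECIMAL(10,2)"},
        "primary_key": ["sale_id"],
    },
    "operational": {                   # currency of data and lineage
        "currency": "active",          # active / archived / purged
        "lineage": ["extracted from ORDERS", "amount converted to USD"],
    },
}
print(metadata["operational"]["currency"])
```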
Mapping 
• A basic part of the data warehouse environment 
is that of mapping from the operational 
environment into the data warehouse. The 
mapping includes a wide variety of facets, 
including, but not limited to: 
• mapping from one attribute to another, 
• conversions, 
• changes in naming conventions, 
• changes in physical characteristics of data, 
• filtering of data, etc.
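The facets above can be shown together in one small mapping step. The operational field names, the renaming, and the filter rule are all assumptions for the example:

```python
# A sketch of operational-to-warehouse mapping covering attribute
# renaming, type conversion, naming-convention changes, and filtering.
# All field names and values are invented for illustration.
operational = [
    {"CUST_NM": "ana lopez", "ORD_AMT": "129.90", "STATUS": "CANCELLED"},
    {"CUST_NM": "bob chen", "ORD_AMT": "45.00", "STATUS": "SHIPPED"},
]

def map_row(row):
    return {
        "customer_name": row["CUST_NM"].title(),  # rename + naming convention
        "order_amount": float(row["ORD_AMT"]),    # string-to-number conversion
    }

# Filtering: cancelled orders are not loaded into the warehouse.
warehouse_rows = [map_row(r) for r in operational
                  if r["STATUS"] != "CANCELLED"]
print(warehouse_rows)
```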
Retention and Purging 
• What is data retention and purging? 
• There are certain requirements to purge, archive, or delete the data in a data 
warehouse after a certain period of time, often termed the retention period of the 
data warehouse. Once the retention period is reached, data from the data 
warehouse are purged, deleted, or archived to a separate place, usually on a 
low-cost storage medium (e.g., tape). 
• Why is data purging required? 
• In an idealized scenario, we assume the data warehouse stores data for good. 
However, there are some reasons why this might not be a good idea in a real 
scenario: 
• There is a cost overhead associated with the amount of data that we store. This 
includes the cost of the storage medium, infrastructure, and human resources 
necessary to manage the data. 
• Data volume has a direct impact on the performance of a data warehouse. 
More data means more time-consuming sorting and searching operations. 
• End users of the data warehouse on the business side may not be interested in 
very old facts and figures. Data may lose its importance and relevance with the 
changing business landscape, so retaining such irrelevant data may not be required.
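A retention policy reduces to a date comparison per row. The five-year retention period and the fact rows below are assumptions for the sketch; real purge jobs would run against the database and move archived rows to cheap storage:

```python
from datetime import date, timedelta

# A sketch of retention-based purging (the period is an assumption).
RETENTION = timedelta(days=5 * 365)   # keep roughly five years of data

facts = [
    {"sale_id": 1, "sale_date": date(2010, 3, 1)},   # far past retention
    {"sale_id": 2, "sale_date": date.today()},        # current data
]

def purge(rows, today=None):
    """Split rows into kept vs archived, based on the retention period."""
    today = today or date.today()
    keep = [r for r in rows if today - r["sale_date"] <= RETENTION]
    archive = [r for r in rows if today - r["sale_date"] > RETENTION]
    return keep, archive

keep, archive = purge(facts)
print(len(keep), len(archive))
```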

Data Warehousing Terminology
