Er. Nawaraj Bhandari
Data Warehouse/Data Mining
Data Warehouse Physical Design
Physical design is the phase of a database design following the logical design that
identifies the actual database tables and index structures used to implement the
logical design.
In the physical design, you look at the most effective way of storing and retrieving
the objects as well as handling them from a transportation and backup/recovery
perspective.
Physical design decisions are mainly driven by query performance and
database maintenance aspects.
During the logical design phase, you defined a model for your data warehouse
consisting of entities, attributes, and relationships. The entities are linked
together using relationships. Attributes are used to describe the entities. The
unique identifier (UID) distinguishes between one instance of an entity and
another.
Figure: Logical Design Compared with Physical Design
During the physical design process, you translate the expected schemas
into actual database structures.
At this time, you have to map:
■ Entities to tables
■ Relationships to foreign key constraints
■ Attributes to columns
■ Primary unique identifiers to primary key constraints
■ Unique identifiers to unique key constraints
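The mapping above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a prescribed schema: the product and sale tables and all column names are hypothetical, and SQLite stands in for whatever database the warehouse actually uses.

```python
import sqlite3

# Hypothetical fragment: a "product" entity and a "sale" entity that refers to it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE product (                      -- entity -> table
    product_id   INTEGER PRIMARY KEY,       -- primary UID -> primary key constraint
    product_code TEXT NOT NULL UNIQUE,      -- unique identifier -> unique key constraint
    product_name TEXT NOT NULL              -- attribute -> column
);
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL
               REFERENCES product(product_id),  -- relationship -> foreign key constraint
    amount     REAL NOT NULL
);
""")

conn.execute("INSERT INTO product VALUES (1, 'P-001', 'Widget')")
conn.execute("INSERT INTO sale VALUES (1, 1, 9.99)")

# A sale that points at a non-existent product violates the foreign key.
try:
    conn.execute("INSERT INTO sale VALUES (2, 42, 5.00)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The rejected insert shows the relationship from the logical model being enforced by the database rather than by application code.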
Physical Data Model
Features of physical data model include:
Specification of all tables and columns.
Specification of Foreign keys.
De-normalization may be performed if necessary.
At this level, specification of logical data model is realized in the database.
The steps for physical data model design involve:
Conversion of entities into tables.
Conversion of relationships into foreign keys.
Conversion of attributes into columns.
Changes to the physical data model based on the physical constraints.
Physical Design Objectives
Involves tradeoffs among
Ease of Administration
Physical Design Structures
Once you have converted your logical design to a physical one,
you will need to create some or all of the following structures:
■ Tables and Partitioned Tables
■ Integrity Constraints
Some of these structures require disk space. Others exist only in
the data dictionary. Additionally, the following structures may be
created for performance improvement:
■ Indexes and Partitioned Indexes
■ Materialized Views
Tablespaces
A tablespace consists of one or more datafiles, which are physical
structures within the operating system you are using.
A datafile is associated with only one tablespace.
From a design perspective, tablespaces are containers for
physical design structures.
Tables and Partitioned Tables
Tables are the basic unit of data storage. They are the
container for the expected amount of raw data in your
data warehouse.
Using partitioned tables instead of non-partitioned ones
addresses the key problem of supporting very large data
volumes by allowing you to divide them into smaller and
more manageable pieces.
Partitioning large tables improves performance because
each partitioned piece is more manageable.
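As a rough sketch of the idea, the following pure-Python model splits a sales table into monthly pieces. It is not a database feature and every name in it is hypothetical; it only shows why a query or a maintenance task that concerns one month needs to touch just one piece.

```python
from collections import defaultdict
from datetime import date

# Hypothetical range partitioning of a sales table by (year, month).
partitions = defaultdict(list)

def partition_key(sale_date):
    """Decide which partition a row belongs to."""
    return (sale_date.year, sale_date.month)

def insert_sale(sale_date, amount):
    partitions[partition_key(sale_date)].append((sale_date, amount))

def total_for_month(year, month):
    # Only the one relevant partition is scanned, not the whole table.
    return sum(amount for _, amount in partitions.get((year, month), []))

def drop_partition(year, month):
    # Removing old data is one operation on one piece, not a row-by-row delete.
    partitions.pop((year, month), None)

insert_sale(date(2023, 1, 15), 100.0)
insert_sale(date(2023, 1, 20), 50.0)
insert_sale(date(2023, 2, 3), 75.0)
```

Calling total_for_month(2023, 1) reads two rows and never looks at the February partition.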
Views
A view is a tailored presentation of the data contained in one or
more tables or other views.
A view takes the output of a query and treats it as a table.
Views do not require any space in the database.
Improving Performance with the Use of Views
A view is a virtual table that acts like a real table in queries.
The use of views as a way to improve performance:
Views can be used to combine tables, so that instead of joining tables in a
query, the query just accesses the view. The query becomes simpler, and it
becomes quicker when the view's result has been stored in advance (a plain
view re-runs the join each time it is queried).
Different SQL queries can be performed against a view.
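A small sketch of a view that combines two tables, again using sqlite3 with hypothetical table and column names. The view stores only its defining query, so selecting from it runs the join at that moment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO sale VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);

-- The view holds the join definition, not the joined rows.
CREATE VIEW sale_detail AS
    SELECT s.sale_id, p.product_name, s.amount
    FROM sale AS s JOIN product AS p ON p.product_id = s.product_id;
""")

# The query reads the view like a table and does not repeat the join logic.
rows = conn.execute(
    "SELECT product_name, SUM(amount) FROM sale_detail "
    "GROUP BY product_name ORDER BY product_name"
).fetchall()
```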
Integrity Constraints
Integrity constraints are used to enforce business rules associated
with your database and to prevent having invalid information in
the tables.
In data warehousing environments, constraints are only used for
query rewrite.
NOT NULL constraints are particularly common in data
warehouses.
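A minimal sketch of a NOT NULL constraint rejecting an invalid row, using sqlite3. The customer table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " customer_name TEXT NOT NULL)"  # every customer row must carry a name
)
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")

# A row without a name breaks the rule and is refused by the database.
try:
    conn.execute("INSERT INTO customer VALUES (2, NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True
```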
Indexes and Partitioned Indexes
Indexes are optional structures associated with tables.
Indexes are just like tables in that you can partition them, although the
partitioning strategy is not dependent upon the table structure.
Partitioning indexes makes it easier to manage the data warehouse
during refresh and improves query performance.
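The effect of an index can be observed with SQLite's EXPLAIN QUERY PLAN statement. This sketch covers an ordinary (non-partitioned) index only, since SQLite has no partitioned indexes; the table, index name, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sale (product_id, amount) VALUES (?, ?)",
    [(i % 50, float(i)) for i in range(1000)],
)

# The index is an optional structure: the table works without it, only slower.
conn.execute("CREATE INDEX idx_sale_product ON sale (product_id)")

# Ask the planner how it would run a selective query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM sale WHERE product_id = 7"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)  # last column holds the description
```

The plan description should mention idx_sale_product, showing that the lookup goes through the index instead of scanning every row.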
Materialized Views
Materialized views are query results that have been stored in
advance so long-running calculations are not necessary when you
actually execute your SQL statements.
From a physical design point of view, materialized views resemble
tables or partitioned tables and behave like indexes in that they are
used transparently and improve performance.
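SQLite has no materialized views, so the sketch below only emulates the idea with a summary table that is rebuilt on demand. The names sales_summary and refresh_sales_summary are hypothetical, and unlike a real materialized view the summary is not used transparently: the query has to name it explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
INSERT INTO sale (product_id, amount) VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

def refresh_sales_summary(conn):
    """Recompute and store the aggregate (a full refresh of the emulated view)."""
    conn.executescript("""
    DROP TABLE IF EXISTS sales_summary;
    CREATE TABLE sales_summary AS
        SELECT product_id, SUM(amount) AS total_amount, COUNT(*) AS sale_count
        FROM sale GROUP BY product_id;
    """)

refresh_sales_summary(conn)

# Later queries read the stored result instead of re-aggregating the detail rows.
summary = conn.execute(
    "SELECT product_id, total_amount, sale_count FROM sales_summary ORDER BY product_id"
).fetchall()
```

The stored result goes stale when the sale table changes, which is why a refresh step is part of the sketch.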
Data Warehouse: A Multi-Tiered Architecture
Figure: Multi-tiered architecture. Data sources feed (1) data storage, which serves (2) the OLAP engine, which in turn serves (3) front-end tools.
ETL comes from Data Warehousing and stands for Extract-Transform-Load.
ETL covers a process of how the data are loaded from the source system
to the data warehouse.
Currently, the ETL encompasses a cleaning step as a separate step. The
sequence is then Extract-Clean-Transform-Load.
The Extract step covers the data extraction from the source system and
makes it accessible for further processing.
The main objective of the extract step is to retrieve all the required data
from the source system with as few resources as possible.
The extract step should be designed in a way that it does not negatively
affect the source system in terms of performance, response time or any
kind of locking.
There are several ways to perform the extract:
Update notification - if the source system is able to provide a notification that a record has
been changed and describe the change, this is the easiest way to get the data.
Incremental extract - some systems may not be able to provide notification that an update
has occurred, but they are able to identify which records have been modified and provide
an extract of such records. During further ETL steps, the system needs to identify the changes
and propagate them down. Note that with a daily extract, we may not be able to handle
deleted records properly.
Full extract - some systems are not able to identify which data has been changed at all, so
a full extract is the only way one can get the data out of the system. The full extract
requires keeping a copy of the last extract in the same format in order to be able to identify
changes. Full extract handles deletions as well.
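The comparison that a full extract relies on can be sketched as follows. It assumes each extract is a dictionary keyed by a record identifier; the function name diff_extracts and the sample records are hypothetical.

```python
def diff_extracts(previous, current):
    """Compare the last full extract with the new one.

    Both arguments map a record id to the record's contents.
    Returns three dictionaries: (inserted, updated, deleted).
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    # Records that vanished from the source show up only by comparison,
    # which is why the full extract can handle deletions.
    deleted = {k: v for k, v in previous.items() if k not in current}
    return inserted, updated, deleted

previous = {
    1: {"name": "Ann", "city": "Pokhara"},
    2: {"name": "Bob", "city": "Kathmandu"},
    3: {"name": "Cara", "city": "Lalitpur"},
}
current = {
    1: {"name": "Ann", "city": "Pokhara"},     # unchanged
    2: {"name": "Bob", "city": "Bhaktapur"},   # updated
    4: {"name": "Dev", "city": "Kathmandu"},   # inserted; record 3 was deleted
}

inserted, updated, deleted = diff_extracts(previous, current)
```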
The cleaning step is one of the most important as it ensures the quality of
the data in the data warehouse. Cleaning should perform basic data
unification rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
Convert null values into standardized Not Available/Not Provided value
Convert phone numbers, ZIP codes to a standardized form
Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
Validate address fields against each other (State/Country, City/State, City/ZIP code,
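A few of the unification rules above can be sketched as small functions. The mapping table, the placeholder text, and the digits-only phone format are assumptions chosen for illustration; a real cleaning step would follow the target warehouse's own standards.

```python
import re

# Source spellings (lower-cased) mapped to the standard categories.
SEX_MAP = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}

def standardize_sex(value):
    """Translate the source variants to Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")

def standardize_null(value, placeholder="Not Available"):
    """Replace missing or blank values with one standard placeholder."""
    if value is None or str(value).strip() == "":
        return placeholder
    return value

def standardize_phone(value):
    """Keep only the digits (assumed target format for this sketch)."""
    return re.sub(r"\D", "", value)
```

For example, standardize_sex("M"), standardize_sex("Man") and standardize_sex("male") all yield "Male", while null and "Not Available" both become "Unknown".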
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys (candidate key), sorting,
deriving new calculated values, and applying advanced validation rules.
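The sketch below illustrates three of these transformations on made-up rows from two hypothetical sources: converting weights to one unit, assigning surrogate keys, and generating an aggregate. All names and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import count

# Conversion factors to the common unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

source_rows = [
    {"source": "A", "natural_key": "P-001", "weight": 2.0, "unit": "kg"},
    {"source": "B", "natural_key": "P-001", "weight": 500.0, "unit": "g"},
    {"source": "B", "natural_key": "P-002", "weight": 1.0, "unit": "kg"},
]

_next_key = count(1)
surrogate_keys = {}

def surrogate_key_for(natural_key):
    """Hand out one warehouse-generated integer key per source key."""
    if natural_key not in surrogate_keys:
        surrogate_keys[natural_key] = next(_next_key)
    return surrogate_keys[natural_key]

# Unit conversion plus surrogate key assignment, row by row.
transformed = [
    {
        "product_sk": surrogate_key_for(r["natural_key"]),
        "weight_kg": r["weight"] * TO_KG[r["unit"]],
    }
    for r in source_rows
]

# Aggregate: total weight per product across both sources.
total_weight_kg = defaultdict(float)
for row in transformed:
    total_weight_kg[row["product_sk"]] += row["weight_kg"]
```

Because both sources now use kilograms and share one surrogate key per product, their rows can be added together directly.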
OLAP Server Architectures
Types of OLAP Servers
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Relational OLAP (ROLAP)
Relational OLAP servers are placed between relational back-end server and
client front-end tools. To store and manage the warehouse data, the relational
OLAP uses relational or extended-relational DBMS.
ROLAP servers can be easily used with existing RDBMS.
ROLAP tools do not use pre-calculated data cubes.
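In the ROLAP style, a summary is computed from the relational tables at query time. A minimal sketch with sqlite3 follows; the sales_fact table and the rollup_by helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (region TEXT, quarter TEXT, amount REAL);
INSERT INTO sales_fact VALUES
    ('East', 'Q1', 100.0), ('East', 'Q2', 150.0),
    ('West', 'Q1', 80.0),  ('West', 'Q1', 20.0);
""")

def rollup_by(conn, dimension):
    """Aggregate the fact table along one dimension when the query is asked."""
    if dimension not in ("region", "quarter"):  # whitelist guards the interpolated column name
        raise ValueError(dimension)
    sql = (
        f"SELECT {dimension}, SUM(amount) FROM sales_fact "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(sql).fetchall()

by_region = rollup_by(conn, "region")
by_quarter = rollup_by(conn, "quarter")
```

Nothing is stored ahead of time here: each call re-reads the fact table, which is the trade-off against a pre-calculated cube.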
Multidimensional OLAP (MOLAP)
Multidimensional OLAP (MOLAP) uses array-based multidimensional storage
engines for multidimensional views of data. With multidimensional data stores,
the storage utilization may be low if the data set is sparse. Therefore, many
MOLAP servers use two levels of data storage representation to handle dense
and sparse data-sets.
MOLAP allows fastest indexing to the pre-computed summarized data.
MOLAP is easier to use and is therefore suitable for inexperienced users.
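A toy illustration of array-based storage, with made-up dimensions and facts. The dense cube reserves a cell for every (region, quarter) combination, the sparse form keeps only the non-empty cells, and the totals are computed once in advance. Real MOLAP engines use far more elaborate structures.

```python
regions = ["East", "West"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
facts = [
    ("East", "Q1", 100.0), ("East", "Q2", 150.0),
    ("West", "Q1", 80.0), ("West", "Q1", 20.0),
]

# Dense representation: one array cell per combination, empty cells hold 0.0.
dense_cube = [[0.0] * len(quarters) for _ in regions]
for region, quarter, amount in facts:
    dense_cube[regions.index(region)][quarters.index(quarter)] += amount

# Sparse representation: only cells that actually contain data are stored.
sparse_cube = {}
for region, quarter, amount in facts:
    key = (region, quarter)
    sparse_cube[key] = sparse_cube.get(key, 0.0) + amount

# Summaries pre-computed once, so later lookups need no aggregation.
region_totals = {r: sum(dense_cube[i]) for i, r in enumerate(regions)}

def cell(region, quarter):
    """Direct array lookup by dimension position."""
    return dense_cube[regions.index(region)][quarters.index(quarter)]
```

The dense cube holds eight cells for three non-empty combinations, which is the low storage utilization that sparse data causes.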
MOLAP vs. ROLAP
MOLAP | ROLAP
Information retrieval is fast. | Information retrieval is comparatively slow.
Uses sparse array to store data-sets. | Uses relational table.
MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
Maintains a separate database for data cubes. | It may not require space other than available in the Data warehouse.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers
the higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers allow large volumes of detailed information to be
stored. The aggregations are stored separately in a MOLAP store.
Distributed Data Warehouse (DDW)
Data is shared across multiple data repositories for the purpose
of OLAP. Each data warehouse may belong to one or many
organizations. The sharing implies a common format or definition of
data elements (e.g. using XML).
Distributed data warehousing encompasses a complete enterprise DW
but has smaller data stores that are built separately and joined
physically over a network, providing users with access to relevant
reports without impacting on performance.
A distributed DW, the nucleus of all enterprise data, sends relevant
data to individual data marts from which users can access information
for order management, customer billing, sales analysis, and other
reporting and analytic functions.
Data Warehouse Manager
Collects data inputs from a variety of sources, including legacy
operational systems, third-party data suppliers, and informal sources.
Assures the quality of these data inputs by correcting spelling,
removing mistakes, eliminating null data, and combining multiple
sources.
Releases the data from the data staging area to the individual data
marts on a regular schedule.
Measures the costs and benefits.
Estimates the cost and benefits
Virtual Data Warehouse
The data warehouse is a great idea, but it is complex to build and
requires investment. Why not use a cheap and fast approach
by eliminating the transformation steps of repositories for metadata
and another database?
This approach is termed the 'virtual data warehouse'. To accomplish
this, four kinds of information need to be defined:
A data dictionary containing the definitions of the various databases.
A description of the relationship among the data elements.
The description of the way user will interface with the system.
The algorithms and business rules that define what to do and how to do it.