Basics of Data Warehousing

Data Warehouse:
Bill Inmon defined the data warehouse in 1990 in the following way:
"A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process". He defined the terms in the
sentence as follows:
Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence
system. An enterprise has one data warehouse, and data marts source their information
from the data warehouse. In the data warehouse, information is stored in 3rd normal
form.
Subject Oriented:
Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time period.
Non-volatile :
Data is stable in a data warehouse. More data is added but data is never removed.
However, a single-subject data warehouse is typically referred to as a data mart,
while data warehouses are generally enterprise in scope.
Also, data warehouses can be volatile. Due to the large amount of storage required for a
data warehouse, (multi-terabyte data warehouses are not uncommon), only a certain
number of periods of history are kept in the warehouse. For instance, if three years of
data are decided on and loaded into the warehouse, every month the oldest month will be
"rolled off" the database, and the newest month added.
===============================================================
Ralph Kimball provided a much simpler definition of a data warehouse:
a data warehouse is "a copy of transaction data specifically structured for query and
analysis".
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within
the enterprise. Information is always stored in the dimensional model.
===============================================================

Steps :
• Requirement Gathering
• Physical Environment Setup
• Data Modeling
• ETL
• OLAP Cube Design
• Front End Development
• Performance Tuning
• Quality Assurance
• Rolling out to Production
• Production Maintenance
• Incremental Enhancements
Components of Dimensional Data Model :
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is
Year → Quarter → Month → Day.
Fact Table : A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure.
A dimensional model includes fact tables and lookup tables. Fact tables connect to
one or more lookup tables, but fact tables do not have direct relationships to one another.
In designing data models for data warehouses / data marts, the most commonly used
schema types are Star Schema and Snowflake Schema.
Star Schema: In the star schema design, a single object (the fact
table) sits in the middle and is radially connected to other
surrounding objects (dimension lookup tables) like a star. A star
schema can be simple or complex. A simple star consists of one fact
table; a complex star can have more than one fact table. Fact tables in
star schema are mostly in third normal form (3NF), but dimensional
tables are in de-normalized second normal form (2NF).
Snowflake Schema: The snowflake schema (sometimes called snowflake join
schema) is a more complex schema than the star schema because the tables
which describe the dimensions are normalized.
The main advantage of the snowflake schema is reduced disk storage, because the
normalized lookup tables are smaller. This can help queries that touch only those small
lookup tables, although queries that span several levels of a dimension need more joins.
The main disadvantage of the snowflake schema is the additional maintenance effort needed
due to the increased number of lookup tables.
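The star layout described above can be sketched with a small in-memory database. This is only an illustration using Python's built-in sqlite3 module; the table names (fact_sales, dim_date, dim_product), the columns, and the sample rows are all assumptions, not part of any particular warehouse.

```python
import sqlite3

# In-memory SQLite database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- The fact table holds foreign keys to the dimensions plus the measure.
    CREATE TABLE fact_sales (
        date_id      INTEGER REFERENCES dim_date(date_id),
        product_id   INTEGER REFERENCES dim_product(product_id),
        sales_amount REAL
    );
    INSERT INTO dim_date VALUES (1, '2024-01-01', '2024-01', 2024),
                                (2, '2024-02-01', '2024-02', 2024);
    INSERT INTO dim_product VALUES (10, 'Pen', 'Stationery'), (11, 'Lamp', 'Furniture');
    INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 40.0), (2, 10, 7.5);
""")

# A typical star query: join the fact table to its dimensions, group by attributes.
rows = conn.execute("""
    SELECT d.month, p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()
print(rows)
```

In a snowflake variant of the same sketch, category would move out of dim_product into its own lookup table, and the query would need one more join to reach it.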

Dimensions :
What are the types of dimension tables?
Three types of dimensions are commonly distinguished:
Conformed Dimensions, Junk Dimensions, Degenerate Dimensions
Conformed Dimension: A dimension that has exactly the same meaning and
content when referred to from different fact tables. A conformed dimension
can be shared by multiple fact tables or multiple data marts. Some
examples are the time dimension, customer dimension, and product dimension.
Junk Dimensions :
Occasionally, there are miscellaneous attributes, such as yes/no attributes or
comment attributes, that don’t fit into tight star schemas. Rather than discarding flag
fields and yes/no attributes, place them in a junk dimension. In addition, you can
handle comment and open-ended text attributes by creating a text-based junk
dimension.
A junk dimension is a convenient grouping of flags and indicators. It's helpful, but
not absolutely required, if there's a positive correlation among the values.
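One possible way to build such a junk dimension is to enumerate every combination of the flag values and give each combination a surrogate key. The sketch below assumes three made-up flags (is_gift, is_rush_order, payment_type); a real design would use whatever miscellaneous indicators the source system has.

```python
from itertools import product

# Hypothetical low-cardinality flags that do not belong to any regular dimension.
flags = {
    "is_gift": ("Y", "N"),
    "is_rush_order": ("Y", "N"),
    "payment_type": ("CASH", "CARD"),
}

# One junk-dimension row per combination of flag values, keyed by a surrogate key.
junk_dimension = {
    key: dict(zip(flags, combo))
    for key, combo in enumerate(product(*flags.values()), start=1)
}

# Reverse lookup used at load time: flag values -> surrogate key stored in the fact row.
junk_key = {tuple(row.values()): key for key, row in junk_dimension.items()}

print(len(junk_dimension))
print(junk_key[("N", "Y", "CARD")])
```

The fact table then carries a single junk key column instead of three separate flag columns.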
What is a degenerate dimension?
I have a fact table that stores insurance contracts, and one important dimension is
the year signed. The fact table has many columns, such as CUSTOMER_ID and
CONTRACT_ID, and one column YEAR_SIGNED defined as varchar(4). CUSTOMER_ID
is the foreign key to DIM_CUSTOMER, which holds all the customer data (name,
address, ...). CONTRACT_ID relates to DIM_CONTRACT, which holds all the
contract-specific information. And YEAR_SIGNED? Should I really have a
DIM_YEAR_SIGNED table with only one column? What other attributes should a
year have?
Since there are none, we do not create an explicit dimension table, and we call the
YEAR_SIGNED column a degenerate dimension.
A degenerate dimension is a dimension key stored in the fact table that is not connected to
any dimension table, i.e., it corresponds to a dimension table that would have no attributes.

Types of Facts
There are three types of facts:
• Additive: Additive facts are facts that can be summed up through all of the dimensions
in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that
we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily
basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you
can sum up this fact along any of the three dimensions present in the fact table -- date, store, and
product. For example, the sum of Sales_Amount for all 7 days in a week represents the total
sales amount for that week.
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day,
as well as the profit margin for each account for each day. Current_Balance and
Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to
add it up for all accounts (what's the total current balance for all accounts in the bank?), but it
does not make sense to add it up through time (adding up all current balances for a given
account for each day of the month does not give us any useful information). Profit_Margin is a
non-additive fact, for it does not make sense to add it up at the account level or the day level.
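The bank example can be made concrete with a few rows. The sketch below is an assumption-laden illustration: the sample balances and the helper names (total_balance_on, closing_balance, average_balance) are invented here. It shows that Current_Balance is summed across accounts, while across time it is usually reduced with "latest value" or "average" instead.

```python
# Hypothetical end-of-day snapshot rows: (date, account, current_balance).
balances = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 50.0),
]

def total_balance_on(date):
    """Summing across accounts for a single date is meaningful."""
    return sum(bal for d, _, bal in balances if d == date)

def closing_balance(account):
    """Across time, take the latest balance instead of summing."""
    rows = [(d, bal) for d, acct, bal in balances if acct == account]
    return max(rows)[1]  # ISO dates sort chronologically as strings

def average_balance(account):
    """Another meaningful reduction across time: the average daily balance."""
    rows = [bal for _, acct, bal in balances if acct == account]
    return sum(rows) / len(rows)

print(total_balance_on("2024-01-02"), closing_balance("A"), average_balance("A"))
```

Summing account A over both days would give 220.0, a number that corresponds to nothing in the bank, which is why the fact is only semi-additive.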
Types of Fact Tables
Based on the above classifications, there are two types of fact tables:

• Cumulative: This type of fact table describes what has happened over a period of time.
For example, this fact table may describe the total sales by product by store by day. The
facts for this type of fact tables are mostly additive facts. The first example presented
here is a cumulative fact table.
• Snapshot: This type of fact table describes the state of things at a particular instant in
time, and usually includes more semi-additive and non-additive facts. The second
example presented here is a snapshot fact table.
===============================================================
What are factless fact tables, and in which scenario would you use them?
Factless Fact: Some very useful fact tables don't have any facts at all.
FIGURE 1: A factless fact table for recording student attendance on a daily basis at a college.
The five dimension tables contain rich descriptions of dates, students, courses,
teachers, and facilities. There are no additive, numeric facts.
Such a table can answer questions like: Which classes were the most heavily attended?
Which classes were the most consistently attended? Which teachers taught the most students?
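Because a factless fact table has no measures, its questions are answered by counting rows. The following is a minimal sketch of the attendance case; the sample rows and identifiers (s1, t1, "Room 1", and so on) are invented for illustration.

```python
from collections import Counter

# Each row records only that an event happened:
# (date, student, course, teacher, facility). There is no numeric measure.
attendance = [
    ("2024-03-01", "s1", "Math", "t1", "Room 1"),
    ("2024-03-01", "s2", "Math", "t1", "Room 1"),
    ("2024-03-01", "s1", "Physics", "t2", "Lab"),
    ("2024-03-02", "s2", "Math", "t1", "Room 1"),
]

# "Which classes were the most heavily attended?" -> count rows per course.
attendance_by_course = Counter(course for _, _, course, _, _ in attendance)

# "Which teachers taught the most students?" -> count distinct students per teacher.
students_by_teacher = {}
for _, student, _, teacher, _ in attendance:
    students_by_teacher.setdefault(teacher, set()).add(student)

print(attendance_by_course.most_common(1))
print({t: len(s) for t, s in students_by_teacher.items()})
```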
Tools :
• Scalability: How can the system grow as your data storage needs grow?
• Parallel Processing Support:

Popular Relational Databases
• Oracle, Microsoft SQL Server, IBM DB2, Teradata, Sybase, MySQL
Popular OS Platforms
• Linux
• FreeBSD
• Microsoft Windows
ETL Tools :
• IBM WebSphere Information Integration (Ascential DataStage)
• Ab Initio
• Informatica
OLAP Tool Functionalities
1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data
source (data warehouse). When a user generates a report request, the MOLAP tool can
generate the report quickly because all data is already pre-aggregated within the
cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube,
the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool
typically comes with a 'Designer' piece, where the data warehouse administrator can
specify the relationship between the relational tables, as well as how dimensions,
attributes, and hierarchies map to the underlying database tables.
Popular Tools
• Business Objects
• Cognos
• Hyperion
• Microsoft Analysis Services
• MicroStrategy
Reporting Tool
• Business Objects (Crystal Reports)
• Cognos
• Actuate
==================================================================
Questions ?
What are MOLAP and ROLAP? What is the difference between them?

MOLAP stands for multidimensional online analytical processing and
ROLAP for relational online analytical processing. In MOLAP, data is
stored in the form of multidimensional cubes. The advantage of
this mode is excellent query performance, because
the cubes are built for fast data retrieval. All
calculations are pre-generated when the cube is created and
can be easily applied while querying data.
In ROLAP, the data is stored in relational databases, and the tool gives
the appearance of traditional OLAP slicing and dicing functionality.
The advantage of this model is that it can handle a large amount of data
and can leverage all the functionality of the relational database.
In summary, MOLAP stores aggregated values in the cube. Since the data is
already aggregated, query performance is fast.
ROLAP keeps the data in relational databases. Each query has
to access the database to retrieve the data, so
performance is slow compared to MOLAP. The data size is also larger
than in MOLAP.
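The trade-off can be sketched in a few lines. This is a toy model, not how any OLAP product is implemented: rolap_query scans the detail rows on every request, while build_cube pre-computes every combination once so that a query becomes a lookup. The data and function names are assumptions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical detail rows: (region, product, amount).
sales = [("East", "Pen", 5.0), ("East", "Lamp", 40.0), ("West", "Pen", 7.5)]

def rolap_query(region=None, product_name=None):
    """ROLAP style: scan the detail rows at query time (None means 'all')."""
    return sum(
        amt for reg, prod, amt in sales
        if region in (None, reg) and product_name in (None, prod)
    )

def build_cube(rows):
    """MOLAP style: pre-aggregate every combination, using None for 'all'."""
    cube = defaultdict(float)
    for reg, prod, amt in rows:
        for r, p in product((reg, None), (prod, None)):
            cube[(r, p)] += amt
    return cube

cube = build_cube(sales)

# Same answer either way; the cube answers with a dictionary lookup.
print(rolap_query(region="East"), cube[("East", None)])
```

The cube costs build time and storage up front, which mirrors the point that MOLAP is fast to query while ROLAP scales to larger detail data.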
===============================================================
What is BCP?
Bulk Copy Program.
Two plug-ins are automatically installed with DataStage:
1. BCPLoad plug-in: used to bulk load data into a single table in
MS SQL Server.
2. OraBulk plug-in
What is Data Mining?
Data mining is the process of finding correlations or patterns among dozens of fields
in large relational databases.
Generally, data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into useful
information: information that can be used to increase revenue, cut costs, or both.
Analysts look for patterns hidden in the data.
How can one connect two fact tables? Is it possible? How?
Fact tables cannot be connected directly. They are connected
through conformed dimensions, i.e. dimensions that both fact
tables share. Example: We_site_id.
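A common way to combine two fact tables is to aggregate each one to the grain of the shared dimension and then join the two results through that dimension. The sketch below uses Python's built-in sqlite3 module; fact_sales, fact_inventory, and dim_product are hypothetical names chosen for the illustration.

```python
import sqlite3

# In-memory SQLite database; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product    (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales     (product_id INTEGER, sales_amount REAL);
    CREATE TABLE fact_inventory (product_id INTEGER, units_on_hand INTEGER);
    INSERT INTO dim_product VALUES (1, 'Pen'), (2, 'Lamp');
    INSERT INTO fact_sales VALUES (1, 5.0), (1, 7.5), (2, 40.0);
    INSERT INTO fact_inventory VALUES (1, 300), (2, 12);
""")

# Aggregate each fact table separately, then meet at the shared dimension key.
rows = conn.execute("""
    SELECT p.name, s.total_sales, i.total_units
    FROM dim_product p
    JOIN (SELECT product_id, SUM(sales_amount) AS total_sales
          FROM fact_sales GROUP BY product_id) s ON s.product_id = p.product_id
    JOIN (SELECT product_id, SUM(units_on_hand) AS total_units
          FROM fact_inventory GROUP BY product_id) i ON i.product_id = p.product_id
    ORDER BY p.name
""").fetchall()
print(rows)
```

Joining the two fact tables row to row instead would repeat the inventory row once per sales row and inflate the totals, which is one reason the direct join is avoided.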
When should you use a STAR and when a SNOWFLAKE schema?
STAR SCHEMA:-
1. If PERFORMANCE is the priority, then go for the
star schema, since its dimension tables are DE-NORMALIZED.

2. Usually the star schema is the best option for end users due to
its simple design and navigation.
SNOWFLAKE SCHEMA:-
The snowflake schema (sometimes called snowflake join
schema) is a more complex schema than the star schema
because the tables which describe the dimensions are
normalized.
In a snowflake schema, one dimension table is
connected to another dimension table, and so on.
1. If a dimension is very sparse (i.e. most of the
possible values for the dimension have no data) and/or a
dimension has a very long list of attributes which may be
used in a query, the dimension table may occupy a
significant proportion of the database, and snowflaking may
be appropriate.
2. If STORAGE SPACE is the priority, then go
for the snowflake schema, since its dimension tables are
NORMALIZED.
What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
MOLAP
------
MOLAP (Multidimensional OLAP) provides analysis of data
stored in a multidimensional data cube.
ROLAP
------
ROLAP (Relational OLAP)
provides multidimensional analysis of data stored in a relational
database (RDBMS).
HOLAP
------
HOLAP (Hybrid OLAP) is a combination of ROLAP and MOLAP. It can
provide multidimensional analysis simultaneously of data stored in a
multidimensional database and in a relational database (RDBMS).
DOLAP
-----
DOLAP (Desktop OLAP or Database OLAP) provides multidimensional analysis
locally on the client machine, on data collected from relational or
multidimensional database servers.
What is the difference between an aggregate table and a fact table? How do you
load these two tables?
Fact tables contain millions of records, and retrieving records from a fact table takes
time, whereas an aggregate table contains a limited, summarized set of data from the
required tables, so retrieving data from it takes less time.
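As a rough illustration of loading, the fact table is typically loaded at detail grain from the source, and the aggregate table is then derived from the fact table with a GROUP BY. The sketch below uses Python's built-in sqlite3 module; fact_sales, agg_sales_month, and the sample rows are assumptions, and a real load would normally be done by the ETL tool.

```python
import sqlite3

# In-memory SQLite database; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detail-grain fact table: one row per sale.
    CREATE TABLE fact_sales (sale_date TEXT, store TEXT, sales_amount REAL);
    INSERT INTO fact_sales VALUES
        ('2024-01-03', 'S1', 10.0), ('2024-01-17', 'S1', 15.0),
        ('2024-01-05', 'S2', 20.0), ('2024-02-01', 'S1', 30.0);

    -- Aggregate table: loaded from the fact table at month/store grain.
    CREATE TABLE agg_sales_month AS
        SELECT substr(sale_date, 1, 7) AS month, store, SUM(sales_amount) AS sales_amount
        FROM fact_sales
        GROUP BY substr(sale_date, 1, 7), store;
""")

monthly = conn.execute(
    "SELECT month, store, sales_amount FROM agg_sales_month ORDER BY month, store"
).fetchall()
print(monthly)
```

Monthly reports can then read the three aggregate rows instead of scanning every detail row.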

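Which kind of index is preferred in DWH?
A bitmap index is usually the best choice for low-cardinality columns.
A B-tree index is suited to columns with unique or nearly unique values (e.g. empid), while
a bitmap index is best for columns with few, repeated values (e.g. gender m/f).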
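The idea behind a bitmap index can be sketched with Python integers used as bit strings: one bitmap per distinct value, with bit i set when row i holds that value. This is a conceptual toy, not how a database stores or compresses its bitmaps, and the columns and helper names are invented.

```python
# Hypothetical low-cardinality columns, one value per row.
gender = ["M", "F", "F", "M", "F"]
status = ["active", "active", "closed", "active", "closed"]

def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit i is set when row i has that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_from_bitmap(bitmap):
    """Decode a bitmap back into a list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

gender_idx = build_bitmap_index(gender)
status_idx = build_bitmap_index(status)

# WHERE gender = 'F' AND status = 'active' becomes a single bitwise AND.
matches = rows_from_bitmap(gender_idx["F"] & status_idx["active"])
print(matches)
```

With only a handful of distinct values there are only a handful of bitmaps, and combining predicates is cheap. A column like empid would need one bitmap per row, which is where a B-tree fits better.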
What are CUBES?
The cubes divide the data into subsets that are defined by dimensions.
Cube: mscsCampaign
  Dimensions: Advertiser, DateHour, Events, Page Group, Site, UserType
  Measures: Count Events, Distinct Users, OrdImpLeaf
Cube: mscsCampaignEvents
  Dimensions: Advertiser, DateHour, Events, Page Group, Site, UserType
  Measures: Count Events, Distinct Users
===============================================================
What are materialized views ? how they can be used in datawarehouse to increase the
performance?
MVs are segments similar to tables, in which the output of queries is stored in the
database.
The following is a common query at Acme Bank:
SELECT acc_type, SUM(cleared_bal) totbal
FROM accounts
GROUP BY acc_type;
And the following is an MV, mv_bal, for this query:
CREATE MATERIALIZED VIEW mv_bal
REFRESH ON DEMAND AS
SELECT acc_type, SUM(cleared_bal) totbal
FROM accounts
GROUP BY acc_type;

Now suppose a user wants to get the total of all account balances for the account type 'C'
and issues the following query:
SELECT SUM(cleared_bal)
FROM accounts
WHERE acc_type = 'C';
Because the mv_bal MV already contains the totals by account type, the user could have
gotten this information directly from the MV, by issuing the following:
SELECT totbal
FROM mv_bal
WHERE acc_type = 'C';
This query against the mv_bal MV would have returned results much more quickly than
the query against the accounts table, because querying the MV reads the stored
totals and does not query the source tables.
To keep the data in sync, the MV is refreshed from time to time, either manually or
automatically. There are two ways to refresh data in MVs. In one of them, the MV is
completely wiped clean and then repopulated with data from the source
tables—a process known as complete refresh. In some cases, however, when the
source tables may have changed very little, it is possible to refresh the MV only for
changed records on the source tables—a process known as fast refresh. To
use fast refresh, however, you must have created the MV as fast-refreshable.
Because it updates only changed records, fast refresh is faster than complete
refresh. (See the Oracle Database Data Warehousing Guide for more information on
refreshing MVs.)
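The difference between the two refresh modes can be illustrated with a small simulation. This is not Oracle code: the summary is a plain dictionary, change_log stands in for a materialized view log, only inserts are handled, and all names are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical source rows (acc_type, cleared_bal) and a log of rows added
# since the last refresh.
accounts = [("C", 100.0), ("S", 250.0), ("C", 40.0)]
change_log = []

def complete_refresh():
    """Rebuild the summary from scratch by rescanning every source row."""
    totals = defaultdict(float)
    for acc_type, bal in accounts:
        totals[acc_type] += bal
    change_log.clear()
    return dict(totals)

def fast_refresh(mv):
    """Apply only the logged changes to the existing summary."""
    for acc_type, bal in change_log:
        mv[acc_type] = mv.get(acc_type, 0.0) + bal
    change_log.clear()
    return mv

def insert_account(acc_type, bal):
    """Insert a source row and record it in the change log."""
    accounts.append((acc_type, bal))
    change_log.append((acc_type, bal))

mv_bal = complete_refresh()        # initial population
insert_account("C", 10.0)          # one new source row
mv_bal = fast_refresh(mv_bal)      # touches one logged row, not all accounts
print(mv_bal)
```

Both modes end with the same totals; the fast path does work proportional to the number of changes instead of the size of the source table.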
A materialized view can be either read-only, updatable, or writeable. Users cannot
perform data manipulation language (DML) statements on read-only materialized views,
but they can perform DML on updatable and writeable materialized views.
===============================================================
What is SQL*Loader and what is it used for?
SQL*Loader is a bulk loader utility used for moving data from external files into the
Oracle database.

Is there a SQL*Unloader to download data to a flat file?
Oracle does not supply any data unload utilities. Here is one workaround:
Using SQL*Plus: you can use SQL*Plus to select and format your data and then spool
it to a file.
Skipping unwanted data ?
One can skip unwanted header records or continue an interrupted load (for example if you run
out of space) by specifying the "SKIP=n" keyword. "n" specifies the number of logical rows to
skip.
sqlldr userid=ora_id/ora_passwd control=control_file_name.ctl skip=4
What is data purging?
Explain Control-M jobs in detail. How are they executed?
What is the difference between a data warehouse and an OLTP application?
What is the difference between DSS and OLTP?
What is an operational data store (ODS)?
What is snowflake schema design in a database?
What is the ETL process in data warehousing?
What are the advantages of denormalized data?
What is the difference between choosing a multidimensional database and a relational
database?
Multidimensional database: OLAP (Online Analytical Processing)
Relational database: OLTP (Online Transaction Processing)

What is the difference between E-R modelling and dimensional modelling? And what
are semi-additive facts?
ER modeling:
- focuses on making data efficient to process (insert, update, delete)
- minimizes data redundancy (ideally to zero)
Dimensional:
- focuses on making data efficient to retrieve
(for example, by reporting and analysis tools)
- allows much data redundancy
- consists of fact and dimension tables
What is the difference between an aggregate table and a materialized view?
Aggregate tables hold pre-computed totals in the form of a hierarchical multidimensional
structure.
A materialized view is a database object which caches a query result in a concrete
table and updates it from the original database tables from time to time.
Aggregate tables are used to speed up query computation, whereas materialized
views speed up data retrieval.
How many clustered indexes can you create for a table in DWH?
You can have only one clustered index per table.
==========================================================

Views
A view takes the output of a query and makes it appear like a virtual table.
All operations performed on a view will affect data in the base table and so are
subject to the integrity constraints and triggers of the base table.
A View can also be used to improve security by restricting access to a predetermined
set of rows or columns.
One view can be based on another. A view can also JOIN a view with a table, and its
query can use GROUP BY or UNION.
Read-Only vs Updatable Views
The data dictionary views
ALL_UPDATABLE_COLUMNS, DBA_UPDATABLE_COLUMNS, and
USER_UPDATABLE_COLUMNS indicate which view columns are updatable.
An updatable view lets you insert, update, and delete rows in the view and propagate
the changes to the target master table.
In order to be updatable, a view cannot contain any of the following constructs:
a set operator (such as UNION) or the DISTINCT operator, an aggregate or analytic
function, a GROUP BY, ORDER BY, CONNECT BY, or START WITH clause, a subquery
(or collection expression) in a SELECT list or, with some exceptions, a JOIN.
Views that are not updatable can be modified using an INSTEAD OF trigger.
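The two ideas above, restricting access through a view and modifying a non-updatable view through an INSTEAD OF trigger, can be tried out with Python's built-in sqlite3 module. Note the assumptions: this uses SQLite rather than Oracle (in SQLite every view is read-only, so the trigger is always needed), and the employees table, the sales_staff view, and the rows are invented for the sketch.

```python
import sqlite3

# In-memory SQLite database; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES (1, 'Ann', 'Sales', 5000), (2, 'Bob', 'IT', 6000);

    -- The view hides the salary column and exposes only Sales rows.
    CREATE VIEW sales_staff AS
        SELECT id, name FROM employees WHERE dept = 'Sales';

    -- SQLite views are read-only, so an INSTEAD OF trigger routes inserts to the base table.
    CREATE TRIGGER sales_staff_insert INSTEAD OF INSERT ON sales_staff
    BEGIN
        INSERT INTO employees (id, name, dept) VALUES (NEW.id, NEW.name, 'Sales');
    END;
""")

# The insert targets the view; the trigger writes the row to employees.
conn.execute("INSERT INTO sales_staff (id, name) VALUES (3, 'Cy')")
view_rows = conn.execute("SELECT id, name FROM sales_staff ORDER BY id").fetchall()
print(view_rows)
```

A user who is granted access only to sales_staff sees two columns and only the Sales rows, while the base table's constraints still apply to the row the trigger inserts.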
Materialized Views
Materialized views are schema objects that can be used to summarize, precompute,
replicate, and distribute data.
The existence of a materialized view is transparent to SQL, but when used for query
rewrite it improves the performance of SQL execution.
MVs are used mainly for performance improvement.
MVs enable query rewrite. In short, if you have an MV defined as "select region_id, count(*)
from sales group by region_id" and the same query is issued against the Oracle database, Oracle
will automatically rewrite the query to read from the MV instead of the sales table. In a DW
environment this is a big performance improvement. Some parameters need to
be set for this to happen (for example, QUERY_REWRITE_ENABLED, and the MV must be
created with ENABLE QUERY REWRITE).

MVs can undergo fast refresh. In short, if I have 10 million rows in the fact table and I add 500 rows,
then by making use of materialized view logs Oracle will do a fast refresh of the MV with the
extra 500 rows only.
A materialized view provides indirect access to table data by storing the results of a
query in a separate schema object, unlike an ordinary view, which does not take up
any storage space or contain any data.
An updatable materialized view lets you insert, update, and delete rows.
You can define a materialized view on a base table, partitioned table or view and you
can define indexes on a materialized view.
A materialized view can be stored in the same database as its base table(s) or in a
different database.
A materialized view log is a schema object that records changes to a master
table's data so that a materialized view defined on the master table can be refreshed
incrementally.
===================================================
Synonyms
A synonym is an alias for any table, view, materialized view, sequence,
procedure, function, or package.
A public synonym is owned by the user group PUBLIC and every user in a
database can access it.
A private synonym is in the schema of a specific user who has control over its
availability to others.
Synonyms are used to:
- Mask the real name and owner of a schema object
- Provide global (public) access to a schema object
- Provide location transparency for tables, views, or program units of a remote
database.
- Simplify SQL statements for database users
e.g. to query the table PATIENT_REFERRALS without a synonym, the name must be schema-qualified:

SELECT * FROM MySchema.PATIENT_REFERRALS;
A public synonym can be created for it:
CREATE PUBLIC SYNONYM referrals FOR
MySchema.PATIENT_REFERRALS;
After the public synonym is created, you can query with a simple SQL statement:
SELECT * FROM referrals;
