A Beginners’ Guide
By Nishant Gupta
A data warehouse is a relational database that holds a large amount of
data designed for reporting and analysis. It keeps historical data from
various sources and separates analytical data from transactional data,
which enables enterprise-wide consolidation. The source data is cleaned,
transformed and cataloged using an extraction, transportation,
transformation, and loading (ETL) solution. It then serves online
analytical processing (OLAP), data mining, client analysis tools, and
other applications that manage the process of gathering data and
delivering it to business users.
A data warehouse is a subject-oriented, integrated, time-variant and non-
volatile collection of data in support of management's decision making
process.
- W.H.Inmon
A data warehouse is a copy of transaction data specifically structured for
query and analysis.
- Ralph Kimball
Online transaction processing (OLTP) systems are the applications that
cater to the daily transactional needs of a business. In these systems,
fast turnaround/response time for storing and retrieving data is the key
requirement.
Online analytical processing (OLAP) systems are used for decision making
and analytics. Here throughput is more critical than response time when
generating results for ad hoc queries, business intelligence, relational
reporting, data mining, and forecasting/budgeting.
                     OLTP                             OLAP
Data Source          Operational data                 Consolidated data from various OLTP sources
Purpose of data      Daily business transactions      Decision support, planning, reporting & analysis
Inserts & Updates    Short & fast                     Periodic long-running batch jobs
Queries              Simple queries that return few   Often complex or ad hoc queries involving
                     records                          aggregations
Processing Speed     Very fast                        Depends on volume of data
Space Requirements   Less                             Large
Database Design      Highly normalized                Denormalized
Backup & Recovery    Regular backups & proper         Infrequent backups
                     recovery plans/policies
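Data Warehouse Architecture
[Figure: data flows from left to right through five layers. Data Sources
(operational systems (OS), flat files) feed the Staging database (STG DB),
which loads the Warehouse (raw data, summary data, metadata). The
Warehouse feeds the Data Marts (Inventory, Sales, Purchasing), which serve
Users (analytics, reporting, mining).]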
Operational Source: These are the operational systems used for business
transactions. They sit outside the data warehouse so that they can
provide the performance and availability needed for routine business
operations/transactions.
Data Staging Area: This is where data is stored and ETL processes run,
such as cleansing the data (correcting misspellings, resolving domain
conflicts, dealing with missing elements, or parsing into standard
formats), combining data from multiple sources, de-duplicating data, and
assigning warehouse keys. It should be off limits to business users and
may consist of normalized structures.
Data Presentation Area (Mart): The data presentation area is where
data is organized, stored, and made available for direct querying by
users, report writers, and other analytical applications. It consists of
dimensional models that are atomic and simple, and it may also hold
performance-enhancing summary data, or aggregates.
Data Access Tools: These are the various means provided to end users to
query or consume the data warehouse data for analytical and decision
making purposes. A tool could be a simple ad hoc query tool, a data
mining tool, or any reporting application.
Metadata: This is not the operational data itself but the configuration
data necessary for the general functioning of the data warehouse. It may
include system tables, partition settings, indexes, view definitions,
numerator/denominator definitions, inclusion/exclusion definitions,
DBMS-level security privileges and grants, business names and
definitions for the Data Mart tables and columns, constraint filters,
application template specifications, access and usage statistics, and
other user documentation.
Operational Data Store: An ODS is implemented to deliver adequate
operational reporting, especially when neither the legacy nor the on-line
transaction processing (OLTP) systems provide it. Performance-enhancing
aggregations, significant historical time series, and extensive
descriptive attribution are specifically excluded from the ODS.
Dimensional Modeling is a technique often associated with the logical
design of a data warehouse. It is a modeling method that primarily
consists of Facts & Dimensions. It has the following main features:
• It is primarily query oriented
• It is structured according to data usage rather than the business
rules/needs
• It is organized into base facts, dimensions of those facts & lookups of
those dimensions
• It is based on identification of the key grains of data and their
characteristics
• It usually comprises snapshot, grouped or summary data
• It normally has fewer joins and less join depth
• It is more extensible in nature, with new data being easily
accommodated without changing the existing structure or queries
Dimensional Models can be organized into various schemas, such as:
• Star Schema
• Snowflake Schema
• Constellation
Star Schema: It is the simplest form of data warehouse schema. In it,
one or more fact tables are connected to multiple dimension tables,
resembling a star. The center of the star consists of a large fact table
and the points of the star are the dimension tables. The primary key of
each dimension table is related to a foreign key in the fact table, and
all rows of the fact table have the same level of granularity.
Benefits:
It provides a direct & simple mapping between the business entities and
the schema design.
It provides highly optimized query performance due to simple joins
between one Fact & Dimensions
It is widely supported by various BI tools
[Figure: star schema with the Sales Fact table at the center, joined to
the Product, Time, Location, and Customer dimensions.]
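The star layout described above can be sketched with an in-memory SQLite database. This is only an illustration: the table names, column names, and sample rows are assumptions made for the example and do not come from the slides, and only two of the four dimensions are modeled to keep it short.

```python
import sqlite3

# Hypothetical tables and rows, chosen only to mirror the star layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    sales_amount REAL
);
INSERT INTO product_dim  VALUES (1, 'Appliance'), (2, 'Apparel');
INSERT INTO location_dim VALUES (10, 'CA'), (20, 'NY');
INSERT INTO sales_fact   VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 25.0);
""")

# Each dimension is exactly one join away from the central fact table.
sales_by_category = dict(conn.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchall())

sales_by_state = dict(conn.execute("""
    SELECT l.state, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN location_dim l ON l.location_key = f.location_key
    GROUP BY l.state
""").fetchall())

print(sales_by_category)
print(sales_by_state)
```

Because every dimension joins directly to the fact table, each report needs a single join per dimension it uses.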
The Snowflake Schema is an extension of the star schema in which each
point of the star explodes into more points. In other words, the
dimension tables are further normalized (3rd Normal Form) into multiple
related lookup tables, each representing a level in the dimensional
hierarchy. The snowflake schema takes its name from its resemblance to
the shape of a real snowflake. The "snowflaking" effect only affects
the dimension tables and not the fact tables.
Pros:
• It eliminates redundancy
• It requires less disk space for storage
• It represents the real world scenario in schema design
• It aids in transactional reporting via the data warehouse
Cons:
• It requires additional maintenance effort due to the increase in the
number of lookup tables
• It increases the number of joins, resulting in poorer data retrieval
performance
[Figure: snowflake schema with the Sales Fact table joined to the Product,
Time, Location, and Customer dimensions, which are further normalized into
Product Category, Month, State, and Customer Type lookup tables.]
A Fact Constellation Schema, as the name suggests, is a group of fact
tables that share multiple dimensions, forming a shape like a
constellation of stars (i.e., star schemas). It is fairly complex, so it
should be reserved for applications that are highly sophisticated and
complicated in nature.
[Figure: fact constellation with two fact tables, Sales Fact and Shipping
Fact, and the Product, Time, Location, Customer, and Shipper dimensions.]
According to Ralph Kimball, the following four steps are the backbone
of any dimensional design process:
1. Select the business process to model
2. Declare the grain of the business process
3. Choose the dimensions that apply to each fact table row
4. Identify the numeric facts that will populate each fact table row
First, we need to select the business process for which the
dimensional model is to be designed. A business process may
require more than one dimensional model. A business process is a set of
related activities shared across lines of service or business departments.
To identify the business processes of a dimensional model, we collect
the following metadata:
Business requirements & processes
Stakeholders
Source systems
Data quality related issues
Business process related glossary
Other business-related metadata
The next step is to identify and declare the grain of the model. The
grain of a table represents the most atomic level at which its rows are
defined.
Preferably we should develop dimensional models for the most atomic
information captured by a business process. Atomic data is the most
detailed information collected; such data cannot be subdivided further.
Atomic data is highly dimensional & gives us the capability to drill down
to the lowest level of details. It really helps in slicing & dicing of the
data.
For example the grain of a SALES fact table might be stated as "Sales
volume by Day by Product by Store". Each record in this fact table is
therefore uniquely defined by a day, product and store.
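A declared grain can be checked mechanically: no two fact rows may share the same combination of grain columns. The sketch below assumes hypothetical column names and sample rows, and the helper `grain_violations` is illustrative rather than part of any tool.

```python
from collections import Counter

# Hypothetical rows at the declared grain: one row per (day, product, store).
sales_rows = [
    {"day": "2010-07-22", "product": "P1", "store": "S1", "sales_volume": 5},
    {"day": "2010-07-22", "product": "P2", "store": "S1", "sales_volume": 3},
    {"day": "2010-07-23", "product": "P1", "store": "S1", "sales_volume": 4},
]

GRAIN = ("day", "product", "store")

def grain_violations(rows, grain=GRAIN):
    """Return the grain-key combinations that occur more than once."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]

violations = grain_violations(sales_rows)
print(violations)
```

An empty result means every row is uniquely identified by day, product, and store, as the grain statement requires.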
Dimensions describe the business entities of an enterprise and are often
composed of one or more hierarchies that categorize data. A dimension
contains the textual descriptors of the business, used for filtering,
grouping & labeling. Dimension data is typically collected at the lowest
level of detail and then aggregated into higher level totals that are more
useful for analysis. These natural rollups or aggregations within a
dimension table are called hierarchies.
In other words, dimension tables contain attributes that describe fact
records in the fact table. Some of these attributes provide descriptive
information; others are used to specify how fact table data should be
summarized to provide useful information. In any case, a dimension table
must contain one primary key that uniquely identifies each of its
records. Examples: Product Dimension, Location Dimension, Time Dimension,
Customer Dimension.
There are various types of dimensions, such as:
• Conformed Dimension
• Junk Dimension
• Degenerate Dimension
• Role-playing Dimension
• Slowly changing Dimension
• Rapidly changing Dimension
Conformed dimensions are those which are either identical or a perfect
subset of the most granular, detailed dimension. Conformed dimensions
have:
• Consistent dimension keys
• Consistent attribute column names
• Consistent attribute definitions
• Consistent attribute values
Dimension tables are not conformed if the attributes are labeled
differently or contain different values. If dimensions like Product or
Customer are deployed in a non-conformed manner, then different Data
Marts cannot be merged or used together.
The various flavors of conformed dimensions are:
• Exactly the same dimensions joined to every possible fact table across
data marts
• Conformed dimensions at a rolled-up level of granularity, such as
maintaining a weekly inventory snapshot along with a daily snapshot, or
maintaining sales facts at the atomic product level and forecasting facts
at the brand level. Roll-up dimensions conform to the base-level atomic
dimension if they are a strict subset of that atomic dimension
[Figure: the Brand dimension (Brand Key, Description, Category) conforms
to the Product dimension (Product Key, Prod Description, Brand
Description, SKU Number, Category).]
• Conformed dimension subsets at the same granularity
[Figure: Appliance Products and Apparel Products are subsets of the
Enterprise Product Dimension; drilling across them relies on conforming.]
In many scenarios, transactional source tables contain many
miscellaneous indicators and flags, each of which takes on a small range
of discrete values and does not fit into any existing dimension. Keeping
them in the fact table makes the fact table very large. By creating an
abstract dimension, these flags and indicators are removed from the fact
table and placed into a single dimension of their own. These dimensions
are called Junk Dimensions.
For example, we can remove 10 two-value indicators, such as the cash
versus credit payment type, from the order fact table and place them into
a single junk dimension with at most 1,024 rows (2^10), leaving a single
small surrogate key in the fact table.
Order Indicator  Payment  Payment Type  Inbound/  Commission Credit  Order Type
Key              Type     Group         Outbound  Indicator          Indicator
1                Cash     Cash          I         C                  Regular
2                Cash     Cash          I         N                  Display
3                Cash     Cash          O         C                  Regular
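The 1,024-row figure is simply the number of combinations of ten two-value indicators (2^10). A minimal sketch follows. Only the cash versus credit payment type comes from the text, and the other nine indicator names and values are placeholders invented for the example.

```python
from itertools import product

# Ten two-value indicators. Only payment_type is named in the text;
# flag_2 .. flag_10 are hypothetical placeholders.
indicators = {"payment_type": ("Cash", "Credit")}
for i in range(2, 11):
    indicators[f"flag_{i}"] = ("Y", "N")

# One junk-dimension row per combination, keyed by a small surrogate key.
junk_dim = {
    key: dict(zip(indicators, combo))
    for key, combo in enumerate(product(*indicators.values()), start=1)
}
row_count = len(junk_dim)
print(row_count)
```

In practice a junk dimension is often loaded only with the combinations that actually occur in the source data, so 1,024 is an upper bound here.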
A Degenerate dimension is a dimension that is dimensional in nature but
is stored in the fact table. It has no separate dimension table to join
to, yet it can be used to slice & dice the measures in the fact table.
For example, a dimension key such as a transaction number, invoice
number, ticket number, or bill-of-lading number has no attributes but
provides a direct reference back to a transactional system without the
overhead of maintaining a separate dimension table.
Sales Fact table
POS Transaction No (DD)
Product ID (FK)
Date Key(FK)
Store ID (FK)
Sales quantity
Gross Profit
Cost ($)
Selling Price($)
Promotion Key(FK)
[Figure: the Sales Fact table joins to the Product, Date, Store, and
Promotion dimensions.]
Role-playing in a data warehouse occurs when a single dimension
appears several times in the same fact table. The
underlying dimension may exist as a single physical table, but each of
the roles should be presented as a separately labeled view. For example,
the Date dimension can be used for the ordered date, scheduled shipping
date, shipment date, and invoice date in an order line fact. Similarly,
in the insurance domain a Customer dimension can be used as nominee,
proposer & beneficiary in a Policy Detail Fact.
Order Transaction Fact
Order Date Key(FK)
Shipped Date(FK)
Invoice Date (FK)
Order No(DD)
Order quantity
Product Key(FK)
Total Amount
Discount Amount
Net Order Amount
Order Line No(DD)
[Figure: the Order Transaction Fact joins to the Date Dimension, once per
date role, and to the Product Dimension.]
A characteristic of dimensions is that their data is relatively static.
Data may be added as new records, but existing data changes infrequently.
Slowly Changing Dimensions (SCDs) are dimensions whose data changes
slowly over a period of time, rather than on a regular schedule.
How changes to these dimensions are tracked depends on business needs and
can be achieved in various ways as per the requirement. The techniques
for handling SCDs are termed Type 0 to Type 6.
Types of SCDs
Type 0: The SCD is kept in the same form in which it was created, and
changes to existing records are ignored. It is a passive approach to
dimension value changes.
Type 1: Overwrite the Value
With Type 1, we overwrite the old attribute value in the dimension
row, replacing it with the current value. In so doing, the attribute always
reflects the most recent assignment. This is most appropriate for
correcting certain types of data errors, like a misspelled name or
address, or when there is no value in keeping the old description.
Example: Supplier state changes from CA to NY.
Old:
Supplier_Key  Supplier_Code  Supplier Name  Supplier State
123           XYZ            Max Trading    CA
New:
Supplier_Key  Supplier_Code  Supplier Name  Supplier State
123           XYZ            Max Trading    NY
Pros: Easy to maintain & fast.
Cons: No history is kept, and any aggregate fact table based on state
must be recalculated or reloaded.
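As a minimal sketch of the Type 1 overwrite, the dimension can be modeled as a dictionary keyed by surrogate key. The helper name `apply_type1` and the snake_case column names are assumptions for illustration, and the values follow the supplier example.

```python
# Dimension keyed by surrogate key; values follow the supplier example.
supplier_dim = {
    123: {"supplier_code": "XYZ", "supplier_name": "Max Trading",
          "supplier_state": "CA"},
}

def apply_type1(dim, surrogate_key, attribute, new_value):
    """Overwrite the attribute in place; the old value is not kept anywhere."""
    dim[surrogate_key][attribute] = new_value

apply_type1(supplier_dim, 123, "supplier_state", "NY")
print(supplier_dim[123])
```

The row count and surrogate key stay the same, which is why this approach is easy and fast but loses history.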
Type 2: Add a Dimension Row
The Type 2 method tracks historical data by creating multiple records for
a given business key in the dimension table, each with a separate
surrogate key and with effective dates and/or a version number. Using
Type 2, we can keep the entire history of a record because a new
record is inserted each time a change is made.
The Type 2 response is the primary technique for accurately tracking
slowly changing dimension attributes. It is extremely powerful because
the new dimension row automatically partitions history in the fact table.
Example: Supplier state changes from CA to NY
Supplier_Key  Supplier_Code  Supplier Name  Supplier State  Version
123           XYZ            Max Trading    CA              0
124           XYZ            Max Trading    NY              1
Type 2: Versioning
Pros:
History is kept, and an unlimited number of dimension changes can be
tracked
It perfectly segments fact table history because pre-change fact rows
use the pre-change surrogate key.
Cons:
It contributes to rapid growth of dimension tables
Database operations like joins become expensive
Supplier_Key  Supplier_Code  Supplier Name  Supplier State  Start Date  End Date
123           XYZ            Max Trading    CA              01/11/2009  07/22/2010
124           XYZ            Max Trading    NY              07/23/2010
Type 2: Effective Date Stamp Method
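The effective-date variant of Type 2 can be sketched as follows. It follows the text's statement that each version of a business key gets its own surrogate key. The helper `apply_type2`, the "max key plus one" key assignment, and the use of `None` for an open end date are assumptions made for the example.

```python
from datetime import date, timedelta

# One current row for business key XYZ; end_date None means "still current".
supplier_dim = [
    {"supplier_key": 123, "supplier_code": "XYZ", "supplier_name": "Max Trading",
     "supplier_state": "CA", "start_date": date(2009, 1, 11), "end_date": None},
]

def apply_type2(dim, business_key, attribute, new_value, change_date):
    """Close the current row for the business key and append a new version."""
    current = next(r for r in dim
                   if r["supplier_code"] == business_key and r["end_date"] is None)
    current["end_date"] = change_date - timedelta(days=1)
    new_row = {
        **current,
        attribute: new_value,
        # Assumed key assignment: next integer after the largest existing key.
        "supplier_key": max(r["supplier_key"] for r in dim) + 1,
        "start_date": change_date,
        "end_date": None,
    }
    dim.append(new_row)
    return new_row

apply_type2(supplier_dim, "XYZ", "supplier_state", "NY", date(2010, 7, 23))
for row in supplier_dim:
    print(row)
```

Fact rows loaded before the change keep pointing at the old surrogate key, which is how history is partitioned without touching the fact table.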
Type 3: Add a Dimension Column
Type 3 solutions track changes horizontally in the dimension table by
adding new fields to contain the old data. Often only the original and
current values are retained and intermediate values are discarded; at
other times only the previous and current values are retained. This type
is used where we want to compare the current data values with the
original or previous values in one go, for purposes like sales force
reorganizations.
Pros: Avoids multiple dimension records for a single entity.
Cons: Limited history tracking & more complex queries to access the old
values.
Supplier_Key  Supplier_Code  Supplier Name  Original Supplier State  Current Supplier State
123           XYZ            Max Trading    CA                       NY
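A minimal sketch of the "original plus current" flavor of Type 3 is shown below. The helper `apply_type3` and the snake_case column names are illustrative assumptions, and the second change to IL is added only to show that intermediate values are lost.

```python
supplier = {
    "supplier_key": 123, "supplier_code": "XYZ", "supplier_name": "Max Trading",
    "original_supplier_state": "CA", "current_supplier_state": "CA",
}

def apply_type3(row, new_state):
    """Keep the original value; overwrite only the current-value column."""
    row["current_supplier_state"] = new_state

apply_type3(supplier, "NY")
state_after_first_change = supplier["current_supplier_state"]

apply_type3(supplier, "IL")   # a later change discards the intermediate NY
print(supplier)
```

There is still a single row per supplier, but only as much history as there are columns reserved for it.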
Predictable Changes with Multiple Version Overlays:
In some cases we need to retain history for 4-5 years, or for a number
of changes that is known or predictable for a given attribute. This is
an extension of Type 3 in which we add a column for the changing
attribute each time it changes, while keeping the latest value in the
current value column. For example, if a sales rep's district is revised
every year or two, we can design the Sales Rep Dim table as shown in the
figure. If the district changes later, we only need to add another
column labeled District 2010 holding the value from the current district
column, and then overwrite the current district with the new value.
Sales Representative Dimension
Name
Key
Address
Current Sales District
District 2009
District 2008
…….
Unpredictable Changes with Single-Version Overlay:
This approach combines Types 1, 2 & 3 into a single type. It makes sense
when one has to preserve historical accuracy surrounding unpredictable
attribute changes while supporting the ability to report historical data
according to the current values. This is only possible by combining all
three types, as shown below:
In the new dimension row for the Supplier, the current state is identical
to the historical state. For all previous Supplier dimension rows, the
current state attribute is overwritten to reflect the current structure.
Supplier_Key  Supplier_Code  Supplier Name  Historical Supplier State  Current Supplier State  Current Flag  Start Date  End Date
123           XYZ            Max Trading    CA                         IL                      N
124           XYZ            Max Trading    NY                         IL                      N
125           XYZ            Max Trading    IL                         IL                      Y
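The combined behavior can be sketched as one function that applies all three techniques on each change. The helper name `apply_hybrid_change` and the key assignment are assumptions, and the start and end date columns are left out because the table above gives no values for them.

```python
supplier_dim = [
    {"supplier_key": 123, "supplier_code": "XYZ", "historical_state": "CA",
     "current_state": "CA", "current_flag": "Y"},
]

def apply_hybrid_change(dim, business_key, new_state):
    """Apply a change using the Type 1, 2 and 3 techniques together."""
    for row in dim:
        if row["supplier_code"] == business_key:
            row["current_state"] = new_state   # Type 1: overwrite on every version
            row["current_flag"] = "N"
    dim.append({                               # Type 2: new row, new surrogate key
        "supplier_key": max(r["supplier_key"] for r in dim) + 1,
        "supplier_code": business_key,
        "historical_state": new_state,         # Type 3: historical and current columns
        "current_state": new_state,
        "current_flag": "Y",
    })

apply_hybrid_change(supplier_dim, "XYZ", "NY")
apply_hybrid_change(supplier_dim, "XYZ", "IL")
for row in supplier_dim:
    print(row)
```

After the two changes the rows match the table: grouping by the historical state reports facts as they were, while grouping by the current state restates all history under IL.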
A rapidly changing dimension is a dimension in which one or more
attributes change frequently in many rows. For a rapidly changing
dimension, the dimension table can grow very large from the application
of numerous Type 2 changes.
For example, a Customer dimension with 10K records and 10 changes per
customer per year grows to 500K records in 5 years, which is acceptable.
But in a financial or insurance organization, where both the number of
changes and the customer base are huge, this could add many millions of
records over time. The result is the rapidly changing monster dimension
problem, with browsing performance and change tracking challenges.
The solution is to break off frequently analyzed or frequently changing
attributes into a separate dimension, referred to as a minidimension, &
track them as bands. An example is a separate dimension of
demographic attributes, such as age, gender, number of children, and
income level, presuming that these columns get used extensively. There
would be one row in this minidimension for each unique combination of
age, gender, number of children, and income level encountered in the
data, not one row per customer (refer to the figure below). Business
needs will determine which continuously variable attributes are suitable
for converting to bands.
Demographic Key Age Gender Income Level
1 20-24 M 0-$20000
2 20-24 M $20000-$24999
3 20-24 F 0-$20000
4 25-29 M 0-$20000
5 25-29 F 0-$20000
Every time we build a fact table row, we include two foreign keys related
to the customer: the regular customer dimension key and the
minidimension demographics key. As shown in Figure below, the
demographics key should be part of the fact table’s set of foreign keys in
order to provide efficient access to the fact table through the
demographics attributes.
[Figure: the original Customer Dim (Customer Key (PK), Customer ID (NK),
Customer Name, Address, DOB, …, Age, Gender, Income, No of Children)
becomes a smaller Customer Dim (Customer Key (PK), Customer ID (NK),
Customer Name, Address, …) plus a Cust Demo Dim (C Demo Key (PK),
C Age Band, Gender, C Income Band). The Fact Table holds Customer Key
(FK), C Demo Key (FK), more foreign keys, and facts.]
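Banding and minidimension key assignment can be sketched as below. The 5-year age bands and the first two income bands loosely follow the sample table, the top band "$25000+" and the sample customers are assumptions, and the function names are illustrative only.

```python
def age_band(age):
    """Map an exact age to a 5-year band such as '20-24' (assumed band width)."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

def income_band(income):
    """Assumed cut-offs; only the first two bands appear in the sample table."""
    if income < 20000:
        return "0-$20000"
    if income < 25000:
        return "$20000-$24999"
    return "$25000+"

demo_dim = {}   # (age band, gender, income band) -> demographics surrogate key

def demo_key(age, gender, income):
    """Return the key for this combination, adding a row only if it is new."""
    combo = (age_band(age), gender, income_band(income))
    return demo_dim.setdefault(combo, len(demo_dim) + 1)

# Hypothetical customers: (age, gender, income).
customers = [(21, "M", 15000), (23, "M", 18000), (22, "M", 21000), (24, "F", 5000)]
keys = [demo_key(*c) for c in customers]
print(keys, len(demo_dim))
```

Four customers produce only three minidimension rows, since the first two fall into the same combination of bands.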
The term minidimension applies when the demographics key is
part of the fact table composite key. If the demographics key is a foreign
key in the customer dimension, we refer to it as an outrigger, which has
to be a Type 1 attribute.
The best approach for efficiently browsing and tracking changes of key
attributes in really huge dimensions is to break off one or more
minidimensions from the dimension table, each consisting of small
clumps of attributes that have been administered to have a limited
number of values.
To store exact values instead of bands or ranges, one sometimes needs
to create a factless schema that focuses on attribute changes. In
this case the dimension & minidimension are connected via a dummy
fact table which consists only of the keys from the various dimension
tables, with no numeric measurement values.
A fact table is the primary table in a dimensional model where the numerical
performance measurements of the business are stored. It basically consists of two
types of columns: those that contain measurements and those that are foreign
keys to dimension tables. The list of dimensions defines the grain of the fact table
and tells us what the scope of the measurement is.
A row in a fact table corresponds to a measurement. A measurement is a
row in a fact table. All the measurements in a fact table must be at the same
grain.
Also Fact tables express the many-to-many relationships between
dimensions in dimensional models.
The primary key of a fact table is usually a composite key that is made up of all of
its foreign keys. Fact tables contain the content of the data warehouse and store
different types of measures like additive, non additive, and semi additive measures.
On the basis of measure :
Additive
Semi additive
Non additive
Fact less fact or Junction Fact
On the basis of measurement events:
Transactional snapshots
Periodic snapshots
Accumulating snapshots
Additive facts are measurements that can be summed across all of the
dimensions in the fact table. The most useful facts in
a fact table are numeric and additive. In the figure, three of the facts,
sales quantity, sales dollar amount, and cost dollar amount, are
fully additive across all the dimensions. We can slice and dice the
fact table with impunity, and every sum of these three facts is valid and
correct.
Retail Sales Transactions Fact
Date Key(FK)
Product Key (FK)
Store Key(FK)
Txn No
Sales Amount($)
Sales Quantity
Cost Amount($)
Gross Profit($)
Gross Margin(%)
Semi-additive facts are measurements that can be added across
some of the dimensions but not all of them. In the figure,
Bill amount is a semi-additive fact because it can be added up by date
or by store key to arrive at the total sales amount for the day or for a
given store, but adding it up across product key does not make sense.
Retail Sales Transactions Fact
Date Key(FK)
Product Key (FK)
Store Key(FK)
Txn No
Sales Amount($)
Sales Quantity
Cost Amount($)
Gross Profit($)
Gross Margin(%)
Bill Amount($)
Non-additive facts are measurements that cannot be added
across any of the dimensions in the fact table. Percentages and ratios,
such as gross margin, are non-additive. The numerator and denominator
should be stored in the fact table. The ratio can then be calculated in a
data access tool for any slice of the fact table by remembering to
calculate the ratio of the sums, not the sum of the ratios. In the
figure, Unit Price & Gross Margin are non-additive facts because they
cannot be summed along any dimension.
Retail Sales Transactions Fact
Date Key(FK)
Product Key (FK)
Store Key(FK)
Txn No
Sales Amount($)
Gross Profit($)
Gross Margin(%)
Unit Price($)
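The "ratio of the sums, not the sum of the ratios" rule is easy to check with two made-up fact rows. The amounts below are hypothetical and chosen only to make the difference obvious.

```python
# Hypothetical fact rows: (sales_amount, gross_profit).
rows = [(100.0, 50.0), (300.0, 30.0)]

total_sales = sum(sales for sales, _ in rows)       # 400.0
total_profit = sum(profit for _, profit in rows)    # 80.0

# Correct gross margin for the slice: ratio of the sums.
ratio_of_sums = total_profit / total_sales

# Incorrect: adding the per-row margins (50% and 10%) gives a meaningless 60%.
sum_of_ratios = sum(profit / sales for sales, profit in rows)

print(ratio_of_sums, sum_of_ratios)
```

The slice's true margin is 20%, which is why the additive numerator and denominator are stored and the ratio is derived at query time.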
Factless fact tables are tables that have no measurement metrics; they
merely capture the many-to-many relationships between the involved
dimension keys. They are most often used to represent events or to
provide coverage information that does not appear in other fact tables.
For example, determining which products were on promotion but did not
sell requires a separate promotion coverage fact table. We would load
one row in the fact table for each product on promotion in a store each
day (or week, since many retail promotions are a week in duration),
regardless of whether the product sold or not. This table is a factless
fact table because it has no measurements.
Promotion Coverage Fact
Date Key(FK)
Product Key (FK)
Store Key(FK)
Promotion Key (FK)
[Figure: the Promotion Coverage Fact joins to the Product, Time, Store,
and Promotion dimensions.]
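The "on promotion but did not sell" question reduces to a set difference between the coverage table and the sales fact table. In this sketch both tables are reduced to sets of hypothetical (date key, product key, store key) tuples, and the promotion key is omitted for brevity.

```python
# Keys are (date_key, product_key, store_key); all values are made up.
promotion_coverage = {
    (20100722, "P1", "S1"),
    (20100722, "P2", "S1"),
    (20100722, "P3", "S1"),
}
sales_fact_keys = {
    (20100722, "P1", "S1"),
    (20100722, "P4", "S1"),   # sold, but was not on promotion
}

# Rows present in coverage but absent from sales: promoted and not sold.
promoted_not_sold = sorted(promotion_coverage - sales_fact_keys)
print(promoted_not_sold)
```

The sales fact table alone could never answer this, because it contains no row for an event that did not happen.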
Facts from multiple fact tables are conformed when the technical
definitions of the facts are equivalent. Conformed facts are allowed
to have the same name in separate tables and can be combined and
compared mathematically. If facts do not conform, then the different
interpretations must be given different names.
A transactional snapshot fact table represents a point in time in the
life of business events. A row exists in the fact table for a given
dimension only if a transaction event occurred. A transactional fact
table holds data at the most detailed level, which causes it to have a
great number of dimensions associated with it. Once a transaction has
been posted, its row need not be revisited.
Transactional Fact Table
Txn No
Item ID (FK)
Bill No (DD)
Count of Item
Cost
Profit
Selling Price
Store ID (FK)
Date Key(FK)
Tax Amount
Discount
Promotional Code (FK)
A periodic snapshot fact table represents a predefined interval or
period. Unlike the transaction fact table, with the periodic snapshot we
take a picture (hence the snapshot terminology) of the activity at the
end of a day, week, or month, then another picture at the end of the
next period, and so on. The periodic snapshots are stacked consecutively
into the fact table. Daily snapshots and monthly snapshots are common. A
separate record is placed in a periodic snapshot fact table each period
regardless of whether any activity has taken place in the underlying
transactions.
Periodic Fact Table
Item ID (FK)
Store ID (FK)
Date Key(FK)
Cost ($)
Last Selling Price($)
Quantity on hand
Quantity Sold
An accumulating snapshot fact table represents business activities
over a time period. Accumulating snapshots almost always have multiple
date stamps, representing the predictable major events or phases that
take place during the course of a lifetime. Often there is an additional
date column that indicates when the snapshot row was last updated. Since
many of these dates are not known when the fact row is first loaded, we
must use surrogate date keys to handle undefined dates. The fact table
is revisited and updated as activity occurs. A record is placed in an
accumulating snapshot fact table just once, when the item that it
represents is first created; after that it is only updated with new date
values.
Accumulating Snapshot
Fact Table
Txn No
Order Date Key (FK)
Backlog Date Key (FK)
Release to Manufacturing
Date Key (FK)
Finished Inventory
Placement Date Key (FK)
Requested Ship Date Key
(FK)
Scheduled Ship Date Key
(FK)
Actual Ship Date Key (FK)
Arrival Date Key (FK)
Invoice Date Key (FK)
……..
A bridge table is a table with a multipart key that captures a
many-to-many relationship which cannot be accommodated by the natural
granularity of a single fact table or single dimension table. It serves
as a bridge between the fact table and the dimension table in order to
allow many-valued dimensions or ragged hierarchies. It is sometimes
referred to as a helper or associative table. When using a bridge table,
the facts in the fact table are multiplied by the bridge table's
weighting factor to appropriately allocate the facts to the multivalued
dimension. The result is called a weighted report; if the weighting
factor is left out, the result is an impact report.
[Figure: the Health Care Billing Line Item Fact joins to the Diagnosis
Group Dim (Diagnosis Group (PK)), which joins through the Diagnosis Group
Bridge (Diagnosis Group Key (FK), Diagnosis Key (FK), Weighting Factor)
to the Diagnosis Dimension.]
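The difference between a weighted report and an impact report can be sketched with a tiny bridge table. The billing amounts, diagnosis keys, and weights below are hypothetical, the helper `report` is illustrative, and the weights within each diagnosis group are assumed to sum to 1.

```python
# Hypothetical billing line items: (diagnosis_group_key, billed_amount).
billing_facts = [(1, 1000.0), (2, 400.0)]

# Bridge rows: (diagnosis_group_key, diagnosis_key, weighting_factor).
bridge = [(1, "D1", 0.5), (1, "D2", 0.5), (2, "D1", 0.25), (2, "D3", 0.75)]

def report(facts, bridge_rows, weighted=True):
    """Total billed amount per diagnosis, with or without the weighting factor."""
    totals = {}
    for group_key, amount in facts:
        for bridge_group, diagnosis, weight in bridge_rows:
            if bridge_group == group_key:
                share = amount * (weight if weighted else 1.0)
                totals[diagnosis] = totals.get(diagnosis, 0.0) + share
    return totals

weighted_report = report(billing_facts, bridge)
impact_report = report(billing_facts, bridge, weighted=False)
print(weighted_report)
print(impact_report)
```

The weighted totals add back up to the 1,400 billed, whereas the impact report counts each amount once per associated diagnosis and therefore overstates the grand total.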
Hierarchies are logical structures that use ordered levels as a means
of organizing data. A hierarchy can be used to define data
aggregation(lower levels aggregated & rolled up to higher levels).
For example, in a time dimension, a hierarchy might aggregate data
from the month level to the quarter level to the year level. A
hierarchy can also be used to define a navigational drill path and to
establish a family structure.
A level represents a position in a hierarchy.
Level relationships specify the top-to-bottom ordering of levels from
the root to the leaf nodes.
Each level is logically connected to the levels above (parent) and
below (children) it.
A dimension can be composed of more than one hierarchy. For
example, in the product dimension, there might be two hierarchies--
one for product categories and one for product suppliers.
Balanced
Un balanced
Ragged
Parent child relationship
[Figure: sample organization chart with the CEO at the top, the CIO and
COO below, then the Fin Head, IT Head, and HR Head, and Emp 1 to Emp 4 at
the bottom.]
In balanced (standard) hierarchies, the branches of the
hierarchy all descend to the same level, with each member's parent
being at the level immediately above the member. They are
consistent because each level represents the same type of
information, and each level is logically equivalent. A common
example of a balanced hierarchy is one that represents time, where
the depth of each level (year, quarter, and month) is consistent.
[Figure: balanced time hierarchy in which 2009 and 2010 each contain a
1st Quarter made up of Jan, Feb, and Mar.]
Unbalanced hierarchies include levels that have a consistent
parent-child relationship but are logically inconsistent. The
hierarchy branches can also have inconsistent depths. For example, an
organizational structure is unbalanced, with some branches in the
hierarchy having more levels than others. In an unbalanced
hierarchy, null values can appear on the lower levels of the
hierarchy.
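[Figure: unbalanced organization hierarchy under the CEO, with CIO, COO,
and CFO branches (IT Head, Admin 1, Admin 2, Fin 1, Fin 2, Emp 1 to
Emp 3) that descend to different depths.]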
In a ragged hierarchy, the logical parent member of at least one
member is not in the level immediately above the member. This can
cause branches of the hierarchy to descend to different levels. A
ragged hierarchy can represent a geographic hierarchy in which the
meaning of each level such as city or country is used consistently,
but the depth of the hierarchy varies.
[Figure: ragged geographic hierarchy with the branches North America >
US > CA > San Francisco and Europe > Greece > Athens.]
A parent-child hierarchy is a hierarchy with multiple levels that tracks
the relationships within the hierarchy. A single table or view is used
to represent the parent-child hierarchy. A view can be used to
flatten the structure in case this kind of hierarchy uses multiple
tables. The top level uses the parent key as the level key, whereas
the bottom level contains the child key. For example, in a hierarchy
that represents an organizational structure, you can have two levels:
Manager and Employee. The Manager level is the parent level, and
the Employee level is the child level.
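A parent-child hierarchy stored in a single table can be sketched as a mapping from each member to its parent. The member names are borrowed from the organization chart, but the reporting lines and the helper functions are assumptions made for the example.

```python
# Hypothetical (member -> parent) rows; None marks the top of the hierarchy.
parent_of = {
    "CEO": None,
    "CIO": "CEO",
    "COO": "CEO",
    "IT Head": "CIO",
    "Emp 1": "IT Head",
}

def path_to_root(member, parents=parent_of):
    """Walk the parent keys from a member up to the top of the hierarchy."""
    path = [member]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path

def level(member, parents=parent_of):
    """Depth of a member, with the root at level 0."""
    return len(path_to_root(member, parents)) - 1

def children(member, parents=parent_of):
    """Direct children of a member, i.e. rows whose parent key matches."""
    return sorted(child for child, parent in parents.items() if parent == member)

print(path_to_root("Emp 1"), level("Emp 1"), children("CEO"))
```

Levels are not stored anywhere in this layout; they are derived by following parent keys, which is why the same table can hold branches of any depth.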
Dimensional Modeling Concepts_Nishant.ppt

  • 1. A Beginners’ Guide By Nishant Gupta
  • 2.
  • 3. A data warehouse is a relational database that is a collection of large amount of data designed for reporting and analysis purpose. It keeps historical data from various source separating the analytical data from the transactional data enabling its enterprise wide consolidation . The main source of the data is cleaned, transformed and cataloged using extraction, transportation, transformation, and loading (ETL) solution for online analytical processing (OLAP), data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
  • 4. A data warehouse is a subject-oriented, integrated, time-variant and non- volatile collection of data in support of management's decision making process. - W.H.Inmon A data warehouse is a copy of transaction data specifically structured for query and analysis. - Ralph Kimball
  • 5. Online transaction processing, or OLTP, are those applications which caters to daily transactional needs of a business. In these kind of systems faster turnaround/response time is the key for data storing & retrieval point of view. Online analytical processing, or OLAP , are those systems which are used for decision making & analytics purpose. Here throughput is more critical than response time to generate results for ad hoc queries, business intelligence ,relational reporting ,data mining & forecasting/budgeting sake.
  • 6. OLTP OLAP Data Source Operational data Consolidated data from various OLTP sources Purpose of data Daily Business transactions For Decision support, planning, reporting & analysis Insert & Updates Short & Fast Periodic long running batch jobs Queries Simple queries that returns few records Often complex or ad hoc queries involving aggregations Processing Speed Very Fast Depends on volume of data Space Requirements Less Large Database Design Highly normalized De normalized Back Up & Recovery Regular back ups & proper recovery plans/policies In frequent back up policy
  • 7. Data Ware House Architecture OS OS Raw Data Summary Data Metadata Inventory Sales Purchasing Flat Files STG DB Analyti cs Report ing Mining Data Source Staging Ware House Data Marts Users
  • 8. Operational Source : These are basically those operational systems which are used for business transaction purpose . It resides outside the realms of datawarehosuing to provide performance efficiency & availability to cater to routine business operations/transactions. Data Staging Area: This is basically the places used both for data storage & ETL processes such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats),combining data from multiple sources, de duplicating data, and assigning warehouse keys. It should be off limit to business user & may consist normalized structure . Data Presentation Area(Mart): The data presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical applications. It consists of a dimensional model which is atomic & simple enabling performance- enhancing summary data, or aggregates .
  • 9. Data Access Tools: It is basically the variety of abilities provided to ends user to query or consume the data ware house data for analytical & decision making purpose. It could either be a simple ad hoc query or data mining tool or any reporting application. Metadata: It is not the actually the operational data but the configuration data necessary for the general functioning of our data warehouse. It may be system tables, partition settings, indexes, view definitions, numerator/denominator definitions, Inclusion /exclusion definitions and DBMS-level security privileges and grants ,business names and definitions for the Data Mart tables and columns as well as constraint filters, application template specifications, access and usage statistics, and other user documentation. Operational Data Store: An ODS is implemented to deliver adequate operational reporting, especially when neither the legacy nor on-line transaction processing (OLTP) systems provides them. Performance- enhancing aggregations, significant historical time series, and extensive descriptive attribution are specifically excluded from the ODS.
  • 10. Dimensional Modeling is a technique often associated with logical designing of a Data ware houses. It is a modeling method which primarily consists of Facts & Dimensions. It has following main features :  It is primarily query oriented  It is structured according to data usage than the business rules/needs  It is organized into base facts ,dimensions of those facts & look ups of those dimensions  It is based on identification of key grains of data and their characteristics It usually comprises snapshot, grouped or Summary data  It normally has less number of joins & its depth  It is more extensible in nature with new data being easily accommodated without changing the existing structure or query
  • 11. Dimensional Models can be organized into various schemas like :  Star Schema  Snowflake Schema  Constellation Star Schema : It is the simplest form of the Data ware house schema .In this one or more fact table is connected with multi dimensional tables resembling the star formation. The center of the star consists of a large fact table and the points of the star are the dimension tables. The primary key in each dimension table is related to a foreign key in the fact table. In other words, they all have the same level of granularity.
  • 12. Benefits: It provides a direct & simple mapping between the business entities and the schema design. It provides highly optimized query performance due to simple joins between one Fact & Dimensions It is widely supported by various BI tools Product Dim Time Dim Sales Fact Location Dim Customer Dim
  • 13. Snow Flake Schema is an extension of the Star schema where each point of star is further explodes into more points . In other words, the dimensional tables are further normalized(3rd Normal Form) into multiple related look up tables each representing a level in the dimensional hierarchy. The Snow flake schema derives its name due to its resemblance to the shape of a real Snowflake . The "snow flaking" effect only affects the dimension tables and not the fact tables. Pros:  It eliminates redundancy  It requires less disk space for storage  It represents the real world scenario in schema design  It aids in transactional reporting via Data warehouse Cons: It requires additional maintenance effort due to increase in number of look up tables It increase number of joins resulting in poor performance of data retrieval
  • 14. Product Dim Time Dim Sales Fact Location Dim Customer Dim Product Category Look Up Month Look Up Customer Type Look Up State Look Up
  • 15. Fact Constellation Schema as name suggested is a group of fact tables sharing multiple dimensions between each other representing shape similar to like a constellation of stars (i.e., star schemas).It is pretty complex in nature hence should be used for applications which are highly sophisticated & complicated in nature. Product Dim Time Dim Sales Fact Location Dim Customer Dim Shipping Fact Shipper Dim Time Dim
  • 16. As per Ralph Kimball the following mentioned four steps are the back bone of any dimensional design process : Select the business process to model Declare the grain of the business process Choose the dimensions that apply to each fact table row Identify the numeric facts that will populate each fact table row
  • 17. Firstly we need to select the business process for which the dimensional model needs to be designed. A business process may require more than one dimensional model. A business process is a set of related activities shared across line of services or business departments. To identify the business processes of a dimensional model, we collect the following metadata: Business requirements & processes Stakeholders Source systems Data quality related issues Business process related glossary Other business-related metadata
  • 18. Next step is to identify & declare the grain of the of the model. The grain of a table represents the most atomic level by which the tables may be defined. Preferably we should develop dimensional models for the most atomic information captured by a business process. Atomic data is the most detailed information collected; such data cannot be subdivided further. Atomic data is highly dimensional & gives us the capability to drill down to the lowest level of details. It really helps in slicing & dicing of the data. For example the grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store.
  • 19. Dimensions are basically used to describe the business entities of an enterprise often composed of one or more hierarchies, that categorizes data. It contains the textual descriptor of business used for filtering, grouping & labeling. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies. In other words, dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information. In any case it must contain one primary key used to uniquely identify each records in dimension table, e.g. : Product Dimension > Location Dimension Time Dimension > Customer Dimension
  • 20. There are various type of dimensions available, such as,  Conformed Dimension  Junk Dimension  Degenerated Dimension  Role playing Dimension  Slowly changing Dimension  Rapidly changing Dimension
  • 21. Conformed dimensions are those which are either identical or a perfect subset of the most granular ,detailed dimension. Conformed dimensions have :  Consistent dimension keys  Consistent attribute column names  Consistent attribute définitions  Consistent attribute values Dimension tables are not conformed if the attributes are labeled differently or contain different values. In case the dimensions like Product or Customer are deployed in on conformed manner then different Data Marts can’t be merged or used together.
  • 22. The various flavors of Conformed dimensions are :  Exactly same dimensions joined with every possible fact tables across data marts  Conformed dimensions at a rolled up level of granularity like maintaining weekly inventory snapshot along with daily snapshot. In another situation like sales & forecasting facts are maintained at atomic product level & brand level respectively. Roll-up dimensions conform to the base-level atomic dimension if they are a strict subset of that atomic dimension Product Dimension Product Key Prod Description Brand Description SKU Number Category Brand Dimensions Brand Key Description Category Conforms
  • 23.  Conformed dimensions subsets at the same granularity. Appliance Products Apparel Products Enterprise Product Dimension Drilling across (conforming)
  • 24. In many scenarios ,in the process of identifying dimensions out of transactional system tables comprise of many miscellaneous indicators and flags, each of which takes on a small range of discrete values that can’t b included in the dimensions resulting in very large fact tables. By creating an abstract dimension, these flags and indicators are removed from the fact table & placed into a dummy dimension. These dimensions are called Junk Dimensions. For e.g. , if we remove 10 two-value indicators, such as the cash versus credit payment type, from the order fact table and place them into a single dimension i.e. junk dimension with 1024 rows(2) with a single small surrogate key included in the fact table. Order Indicator Key Payment Type Payment Type Group Inbound /Outbound Commission Credit Indicator Order Type Indicator 1 Cash Cash I C Regular 2 Cash Cash I N Display 3 Cash Cash O C Regular
  • 25. A Degenerated dimension is a data dimension that is although dimensional in nature but stored in fact table It doesn’t have any separate dimension to join & one can use it to slice & dice the measures in fact table. For e.g., dimension key, such as a transaction number, invoice number, ticket number, or bill-of-lading number, that has no attributes but is used to provide a direct reference back to a transactional system without the overhead of maintaining a separate dimension table. Sales Fact table POS Transaction No (DD) Product ID (FK) Date Key(FK) Store ID (FK) Sales quantity Gross Profit Cost ($) Selling Price($) Promotion Key(FK) Product Dim Date Dim Store Dim Promotion Dim
  • 26. A role-playing in a data warehouse occurs when a single dimension simultaneously appears several times in the same fact table. The underlying dimension may exist as a single physical table, but each of the roles should be presented as a separately labeled view. For e.g., the Date dimension can be used for the ordered date, scheduled shipping date, shipment date, and invoice date in an order line fact or in a Insurance domain a Customer dimension can be used as nominee, proposer & beneficiary in a Policy Detail Fact . Order Transaction Fact Order Date Key(FK) Shipped Date(FK) Invoice Date (FK) Order No(DD) Order quantity Product Key(FK) Total Amount Discount Amount Net Order Amount Order Line No(DD) Date Dimension Product Dimension
  • 27. A characteristic of dimensions is that its data is relatively static—data may be added as new record, but data, as such changes infrequently . Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly over a period of time , rather than being time barred or scheduled. To track changes of these dimensions are more dependent on business needs & can be achieved through various ways as per the requirement. The technique/methodology to handle or manage SCDs is termed as Type 0 to Type 6. Types of SCD’s Type 0 : It is an approach in which the SCD is maintained in the same form as it is created & the changes to the existing records are ignored. It is a passive approach of tracking the dimension value changes .
  • 28. Type 1: Overwrite the Value With the type 1, we overwrite the old attribute value in the dimension row, replacing it with the current value. In so doing, the attribute always reflects the most recent assignment. This is most appropriate for rectifying the certain type of data errors like misspelling of the name ,address etc or no value in keeping old description. Example : Supplier state changes from CA to NY. Old New Pros : Easy to maintain & fast. Cons: No history can be kept & the re calculation or loading of the aggregate fact table based on state. Supplier_Key Supplier_Code Supplier Name Supplier State 123 XYZ Max Trading CA Supplier_Key Supplier_Code Supplier Name Supplier State 123 XYZ Max Trading NY
  • 29. Type 2: Add a dimension row The Type 2 method tracks historical data by creating multiple records for a given business key in the dimensional tables with separate surrogate keys with effective date time and/or different version numbers or . Using Type 2, we can keep the entire history of a records because a new record is inserted each time a change is made. The type 2 response is the primary technique for accurately tracking slowly changing dimension attributes. It is extremely powerful because the new dimension row automatically partitions history in the fact table. Example : Supplier state changes from CA to NY Supplier_ Key Supplier_C ode Supplier Name Supplier State Version 123 XYZ Max Trading CA 0 123 XYZ Max Trading NY 1 Type 2 : Versioning
  • 30. Pros : History can be kept & tracking of unlimited dimension changes It perfectly segments fact table history because pre change fact rows use the pre change surrogate key. Cons: It contributes in rapid growth of dimension tables Database operations like joins are expensive Supplier_K ey Supplier_C ode Supplier Name Supplier State Start Date End Date 123 XYZ Max Trading CA 01/11/2009 07/22/2010 123 XYZ Max Trading NY 07/23/2010 Type 2: Effective Date Stamp Method
  • 31. Type 3 : Add a Dimension Column Type 3 solutions track changes horizontally in the dimension table by adding new fields to contain the old data. Often only the original and current values are retained and intermediate values are discarded but at times we also retain the previous & current values only. This kind of type is used where we want to compare the current data values with original or previous data values in one go for the purpose like sales force reorganizations etc. Pros: Avoidance of multiple dimension records for single entity. Cons: Limited history tracking & more complex queries to access the old values. Supplier_Key Supplier_Code Supplier Name Original Supplier State Current Supplier State 123 XYZ Max Trading CA NY
  • 32. Predictable Changes with Multiple Version Overlays : In some cases we need to retain the history for 4-5 years or times where the number is known to us or we can predict the change for a given attribute .It is an extension to Type 3 where we can keep on adding the column for a changing attribute as & when it changes while keeping the latest data value in the current value column. For ,e.g., Sales rep district got revised every year or two then we can design the Sales Rep Dim table (as shown in figure). If his District changes later then we need to only add another column with label District for 2010 with value from current district & overwrite the current district with new value. Sales Representative Dimension Name Key Address Current Sales District District 2009 District 2008 …….
  • 33. Unpredictable Changes with Single-Version Overlay: It basically combines the approach of all type 1,2 & 3 into single type . This make sense when one has to preserve historical accuracy surrounding unpredictable attribute changes while supporting the ability to report historical data according to the current values. This is only possible by clubbing all the 3 types together as shown below : In this new dimension row for Supplier, the current state will be identical to the historical state. For all previous instances of that Supplier dimension rows, the current state attribute will be overwritten to reflect the current structure. Supplier _Key Supplier _Code Supplier Name Historical S - upplier State Current Supplier State Current Flag Start Date End Date 123 XYZ Max Trading CA IL N 124 XYZ Max Trading NY IL N 125 XYZ Max Trading IL IL Y
  • 34. A rapidly changing dimension is a dimension if one or more of its attributes changes frequently in many rows. For a rapidly changing dimension, the dimension table can grow very large from the application of numerous Type 2 changes. For e.g. , lets take a case of Customer dimension with 10K records with 10 change per customer per year will result in 500k records in 5 years which is acceptable but consider any financial or insurance organization where not only changes but also the customer base is huge that could result in addition of multi million record over the period of time resulting in rapidly changing monster dimension problem like browsing performance & change tracking challenges.
  • 35. The solution is to break off frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension & track them as band. A separate dimension of variable valued demographic attributes, such as age, gender, number of children, and income level, presuming that these columns get used extensively. There would be one row in this minidimension for each unique combination of age, gender, number of children, and income level encountered in the data, not one row per customer (refer figure below).Business needs will determine which continuously variable attributes are suitable for converting to bands. Demographic Key Age Gender Income Level 1 20-24 M 0-$20000 2 20-24 M $20000-$24999 3 20-24 F 0-$20000 4 25-29 M 0-$20000 5 25-29 F 0-$20000
  • 36. Every time we build a fact table row, we include two foreign keys related to the customer: the regular customer dimension key and the minidimension demographics key. As shown in Figure below, the demographics key should be part of the fact table’s set of foreign keys in order to provide efficient access to the fact table through the demographics attributes. Customer Dim Customer Key (PK) Customer ID (NK) Customer Name Address DOB ……. Age Gender Income No of Children Customer Dim Customer Key (PK) Customer ID (NK) Customer Name Address ………. Cust Demo Dim C Demo Key (PK) C Age Band Gender C Income Band becomes Fact Table Customer Key (FK) C Demo Key (FK) More Foreign Keys….. Facts…….
  • 37. The minidimension terminology refers to when the demographics key is part of the fact table composite key; if the demographics key is a foreign key in the customer dimension, we refer to it as an outrigger which has to be a Type 1 attribute The best approach for efficiently browsing and tracking changes of key attributes in really huge dimensions is to break off one or more minidimensions from the dimension table, each consisting of small clumps of attributes that have been administered to have a limited number of values To store exact values instead of the bands or ranges sometimes one need to create fact less schema that focuses on attribute changes. In this case the dimension & minidimension are connected via a dummy fact table which only consists the keys from various dimension table but no numeric measurement values
  • 38. A fact table is the primary table in a dimensional model where the numerical performance measurements of the business are stored. It basically consists of two types of columns : one those contain measurements & the other which are foreign key to dimensional tables. This list of dimensions defines the grain of the fact table and tells us what the scope of the measurement is. A row in a fact table corresponds to a measurement. A measurement is a row in a fact table. All the measurements in a fact table must be at the same grain. Also Fact tables express the many-to-many relationships between dimensions in dimensional models. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non additive, and semi additive measures.
  • 39. On the basis of measure : Additive Semi additive Non additive Fact less fact or Junction Fact On the basis of measurement events: Transactional snapshots Periodic snapshots Accumulating snapshots
  • 40. An Additive facts are the measurements that can be summed up through all of the dimensions in the fact table. The most useful facts in a fact table are numeric and additive. In the figure, three of the facts, sales quantity, sales dollar amount, and cost dollar amount, are beautifully additive across all the dimensions. We can slice and dice the fact table with impunity, and every sum of these three facts is valid and correct. Retail Sales Transactions Fact Date Key(FK) Product Key (FK) Store Key(FK) Txn No Sales Amount($) Sales Quantity Cost Amount($) Gross Profit($) Gross Margin(%)
  • 41. A Semi Additive facts are the measurements that can be added across few of the dimensions but not against all in the fact table. In the figure, Bill amount is a semi additive fact because it can be added up for a date or Store key to arrive at the total sales amount for the day or for a given store but adding up against Product key doesn’t make any sense. Retail Sales Transactions Fact Date Key(FK) Product Key (FK) Store Key(FK) Txn No Sales Amount($) Sales Quantity Cost Amount($) Gross Profit($) Gross Margin(%) Bill Amount($)
  • 42. Non-additive facts are measurements that cannot be added across any of the dimensions in the fact table. Percentages and ratios, such as gross margin, are non-additive. The numerator and denominator should be stored in the fact table. The ratio can then be calculated in a data access tool for any slice of the fact table, remembering to calculate the ratio of the sums, not the sum of the ratios. In the figure, Unit Price and Gross Margin are non-additive facts because they cannot be summarized along any dimension. Figure: Retail Sales Transactions Fact, with columns Date Key (FK), Product Key (FK), Store Key (FK), Txn No, Sales Amount ($), Gross Profit ($), Gross Margin (%), Unit Price ($).
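The "ratio of the sums, not the sum of the ratios" rule can be shown with a minimal sketch. The two rows and the helper `gross_margin` are hypothetical; gross margin is assumed here to be gross profit divided by sales amount.

```python
# Hypothetical fact rows (illustrative values only):
# (store_key, sales_amount, gross_profit)
rows = [
    ("S1", 100.0, 50.0),  # row-level margin 50%
    ("S1", 900.0, 90.0),  # row-level margin 10%
]

def gross_margin(rows):
    """Ratio of the sums: total profit divided by total sales for the slice."""
    total_sales = sum(row[1] for row in rows)
    total_profit = sum(row[2] for row in rows)
    return total_profit / total_sales

# Correct: 140 / 1000, i.e. a 14% margin for the slice.
ratio_of_sums = gross_margin(rows)

# Incorrect: adding row-level margins gives 60%, which is not a margin of anything.
sum_of_ratios = sum(row[2] / row[1] for row in rows)

# Averaging the row-level margins (30%) is also wrong, because it ignores row size.
average_of_ratios = sum_of_ratios / len(rows)
```

Storing the numerator and denominator, as the slide recommends, is what allows the correct figure to be recomputed for any slice.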
  • 43. A factless fact table has no measurement metrics; it merely captures the many-to-many relationships between the dimension keys involved. It is most often used to represent events or to provide coverage information that does not appear in other fact tables. For example, determining which products were on promotion but did not sell requires a separate promotion coverage fact table. We would load one row in the fact table for each product on promotion in a store each day (or week, since many retail promotions last a week), regardless of whether the product sold or not. This table is a factless fact table because it contains no measurements. Figure: Promotion Coverage Fact, with columns Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK); related dimensions: Product Dim, Time Dim, Store Dim, Promotion Dim.
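The "on promotion but did not sell" question above amounts to a set difference between the coverage table and the sales fact table. The sketch below assumes both tables share the same date, product, and store keys; all rows are hypothetical.

```python
# Hypothetical promotion coverage rows: (date_key, product_key, store_key, promotion_key).
# One row per product on promotion per store per day, whether or not it sold.
promotion_coverage = {
    (20240101, "P1", "S1", "PROMO1"),
    (20240101, "P2", "S1", "PROMO1"),
    (20240101, "P3", "S1", "PROMO1"),
}

# Hypothetical sales fact keys: (date_key, product_key, store_key).
# A row exists only if the product actually sold.
sales = {
    (20240101, "P1", "S1"),
    (20240101, "P4", "S1"),
}

# Drop the promotion key so both sets are keyed the same way.
promoted = {(date, product, store) for date, product, store, _ in promotion_coverage}

# Products that were promoted but have no matching sales row.
promoted_but_unsold = sorted(promoted - sales)
# → [(20240101, 'P2', 'S1'), (20240101, 'P3', 'S1')]
```

Without the coverage table, the unsold promoted products would be invisible, because the sales fact table only holds rows for events that happened.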
  • 44. Facts from multiple fact tables are conformed when the technical definitions of the facts are equivalent. Conformed facts are allowed to have the same name in separate tables and can be combined and compared mathematically. If facts do not conform, then the different interpretations must be given different names.
  • 45. A transactional snapshot fact table represents a point in time in the life of business events. A row exists in the fact table for a given dimension only if a transaction event occurred. A transactional fact table holds data at the most detailed level, which is why it tends to have a large number of dimensions associated with it. Once a transaction has been posted, it does not need to be revisited. Figure: Transactional Fact Table, with columns Txn No, Item ID (FK), Bill No (DD), Count of Item, Cost, Profit, Selling Price, Store ID (FK), Date Key (FK), Tax Amount, Discount, Promotional Code (FK).
  • 46. A periodic snapshot fact table represents a predefined interval or period. Unlike the transaction fact table, with the periodic snapshot we take a picture (hence the snapshot terminology) of the activity at the end of a day, week, or month, then another picture at the end of the next period, and so on. The periodic snapshots are stacked consecutively into the fact table. Daily snapshots and monthly snapshots are common. A separate record is placed in a periodic snapshot fact table each period, regardless of whether any activity has taken place in the underlying transactions. Figure: Periodic Fact Table, with columns Item ID (FK), Store ID (FK), Date Key (FK), Cost ($), Last Selling Price ($), Quantity on Hand, Quantity Sold.
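As a rough sketch of how periodic snapshots stack up, the function below emits one row per day for a single item and store, even on days with no sales. The function name, column names, and quantities are assumptions made for illustration.

```python
def build_daily_snapshots(opening_on_hand, sold_by_day, days):
    """Return one snapshot row per day, including days with no activity."""
    snapshots = []
    on_hand = opening_on_hand
    for day in days:
        sold = sold_by_day.get(day, 0)  # no transactions means zero sold
        on_hand -= sold
        snapshots.append(
            {"date_key": day, "quantity_sold": sold, "quantity_on_hand": on_hand}
        )
    return snapshots

# Hypothetical inputs: nothing was sold on the second day.
days = [20240101, 20240102, 20240103]
sold_by_day = {20240101: 3, 20240103: 2}

snapshots = build_daily_snapshots(10, sold_by_day, days)
```

The row for the second day still appears, with zero sold and an unchanged quantity on hand, which is the behaviour that distinguishes a periodic snapshot from a transactional fact table.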
  • 47. An accumulating snapshot fact table represents business activities over a period of time. Accumulating snapshots almost always have multiple date stamps, representing the predictable major events or phases that take place during the lifetime of the item being tracked. Often there is an additional date column that indicates when the snapshot row was last updated. Since many of these dates are not known when the fact row is first loaded, we must use surrogate date keys to handle undefined dates. The fact table is revisited and updated as activity occurs. A record is placed in an accumulating snapshot fact table just once, when the item it represents is first created; after that, the row is only updated with new date values. Figure: Accumulating Snapshot Fact Table, with columns Txn No, Order Date Key (FK), Backlog Date Key (FK), Release to Manufacturing Date Key (FK), Finished Inventory Placement Date Key (FK), Requested Ship Date Key (FK), Scheduled Ship Date Key (FK), Actual Ship Date Key (FK), Arrival Date Key (FK), Invoice Date Key (FK), ……..
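The insert-once, update-later behaviour can be sketched as follows. The constant `UNKNOWN_DATE_KEY`, the reduced set of milestone columns, and both helper functions are hypothetical; a real table would carry every date key in the figure.

```python
# Hypothetical surrogate key that stands for "date not yet known".
UNKNOWN_DATE_KEY = 0

def new_order_row(txn_no, order_date_key):
    """Create the single row for an order; later milestones start as unknown."""
    return {
        "txn_no": txn_no,
        "order_date_key": order_date_key,
        "actual_ship_date_key": UNKNOWN_DATE_KEY,
        "invoice_date_key": UNKNOWN_DATE_KEY,
        "last_updated_date_key": order_date_key,
    }

def record_milestone(row, milestone, date_key):
    """Update the existing row in place when a milestone occurs."""
    column = f"{milestone}_date_key"
    if column not in row:
        raise KeyError(f"unknown milestone: {milestone}")
    row[column] = date_key
    row["last_updated_date_key"] = date_key
    return row

# The row is inserted once, then revisited as the order progresses.
order = new_order_row(txn_no=1001, order_date_key=20240101)
record_milestone(order, "actual_ship", 20240105)
```

After the update, the invoice date still points at the surrogate "unknown" key, so joins to the date dimension keep working for milestones that have not happened yet.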
  • 48. A bridge table is a table with a multipart key that captures a many-to-many relationship which cannot be accommodated by the natural granularity of a single fact table or a single dimension table. It serves as a bridge between the fact table and the dimension table in order to allow many-valued dimensions or ragged hierarchies. It is sometimes referred to as a helper or associative table. When using a bridge table, the facts in the fact table are multiplied by the bridge table's weighting factor to allocate the facts appropriately to the multivalued dimension. A report that applies the weighting factor is called a weighted report; if the weighting factor is left out, the result is an impact report. Figure: Diagnosis Group Bridge, with columns Diagnosis Group Key (FK), Diagnosis Key (FK), Weighted Factor; related tables: Diagnosis Group Dim (Diagnosis Group (PK)), Health Care Billing Line Item Fact, Diagnosis Dimension.
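The difference between a weighted report and an impact report can be sketched with one billing line item and a three-row bridge. The keys, amounts, weights, and the function `report_by_diagnosis` are hypothetical, and the weights within a diagnosis group are assumed to sum to 1.

```python
# Hypothetical fact row: one billing line item linked to a diagnosis group.
billing_facts = [
    {"line_item": 1, "diagnosis_group_key": "G1", "billed_amount": 1000.0},
]

# Hypothetical bridge rows: group G1 contains three diagnoses.
bridge = [
    {"diagnosis_group_key": "G1", "diagnosis_key": "D1", "weighting_factor": 0.5},
    {"diagnosis_group_key": "G1", "diagnosis_key": "D2", "weighting_factor": 0.3},
    {"diagnosis_group_key": "G1", "diagnosis_key": "D3", "weighting_factor": 0.2},
]

def report_by_diagnosis(facts, bridge, weighted):
    """Total billed amount per diagnosis, with or without the weighting factor."""
    totals = {}
    for fact in facts:
        for link in bridge:
            if link["diagnosis_group_key"] != fact["diagnosis_group_key"]:
                continue
            factor = link["weighting_factor"] if weighted else 1.0
            key = link["diagnosis_key"]
            totals[key] = totals.get(key, 0.0) + fact["billed_amount"] * factor
    return totals

# Weighted report: the 1000 is allocated across diagnoses and still totals 1000.
weighted_report = report_by_diagnosis(billing_facts, bridge, weighted=True)

# Impact report: each diagnosis shows the full 1000, so the column totals 3000.
impact_report = report_by_diagnosis(billing_facts, bridge, weighted=False)
```

The weighted report keeps totals consistent with the fact table, while the impact report answers "how much billing involved this diagnosis at all" at the cost of double counting.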
  • 49. Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation, with lower levels aggregated and rolled up to higher levels. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. A level represents a position in a hierarchy. Level relationships specify the top-to-bottom ordering of levels, from the root down to the leaf nodes. Each level is logically connected to the levels above it (parent) and below it (children). A dimension can be composed of more than one hierarchy. For example, in the product dimension there might be two hierarchies: one for product categories and one for product suppliers.
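The month-to-quarter-to-year roll-up mentioned above can be sketched as two aggregation steps. The sales figures and function names are hypothetical, and calendar quarters are assumed.

```python
# Hypothetical monthly sales keyed by (year, month).
monthly_sales = {
    (2009, 1): 100.0,
    (2009, 2): 120.0,
    (2009, 3): 80.0,
    (2009, 4): 50.0,
    (2010, 1): 200.0,
}

def quarter_of(month):
    """Map a month number (1-12) to its calendar quarter (1-4)."""
    return (month - 1) // 3 + 1

def roll_up_to_quarter(monthly):
    """Aggregate the month level up to the quarter level."""
    quarterly = {}
    for (year, month), amount in monthly.items():
        key = (year, quarter_of(month))
        quarterly[key] = quarterly.get(key, 0.0) + amount
    return quarterly

def roll_up_to_year(quarterly):
    """Aggregate the quarter level up to the year level."""
    yearly = {}
    for (year, _quarter), amount in quarterly.items():
        yearly[year] = yearly.get(year, 0.0) + amount
    return yearly

quarterly_sales = roll_up_to_quarter(monthly_sales)
yearly_sales = roll_up_to_year(quarterly_sales)
```

Reading the same functions in reverse order gives the drill path: from year, down to quarter, down to month.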
  • 50. Balanced Un balanced Ragged Parent child relationship CEO CIO COO Fin Head IT Head Emp 2 HR Head Emp 1 Emp 3 Emp 4
  • 51. In balanced (standard) hierarchies, the branches of the hierarchy all descend to the same level, with each member's parent being at the level immediately above the member. They are consistent because each level represents the same type of information, and each level is logically equivalent. A common example of a balanced hierarchy is one that represents time, where the depth of each level (year, quarter, and month) is consistent. Figure: 2009 > 1st Quarter > Jan, Feb, Mar; 2010 > 1st Quarter > Jan, Feb, Mar.
  • 52. Unbalanced hierarchies include levels that have a consistent parent-child relationship but are logically inconsistent. The hierarchy branches can also have inconsistent depths. For example, an organizational structure is unbalanced when some branches in the hierarchy have more levels than others. In an unbalanced hierarchy, null values can appear on the lower levels of the hierarchy. Figure: organization chart with nodes CEO, CIO, COO, CFO, IT Head, Admin 1, Admin 2, Fin 1, Fin 2, Emp 1, Emp 2, Emp 3.
  • 53. In a ragged hierarchy, the logical parent of at least one member is not in the level immediately above that member. This can cause branches of the hierarchy to descend to different levels. A ragged hierarchy can represent a geographic hierarchy in which the meaning of each level, such as city or country, is used consistently but the depth of the hierarchy varies. Figure: North America > US > CA > San Francisco; Europe > Greece > Athens.
  • 54. A parent-child hierarchy is a hierarchy with multiple levels that tracks the relationships within the hierarchy. A single table or view is used to represent the parent-child hierarchy. If the hierarchy is spread across multiple tables, a view can be used to flatten the structure. The top level uses the parent key as the level key, whereas the bottom level contains the child key. For example, in a hierarchy that represents an organizational structure, you can have two levels: Manager and Employee. The Manager level is the parent level, and the Employee level is the child level.
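A minimal sketch of a single-table parent-child hierarchy, and of flattening it into a path per member, is shown below. The reporting lines in `parent_of` and the helper `path_to_root` are assumptions made for illustration, not the structure of any particular organization.

```python
# Hypothetical parent-child table: each member maps to its parent (None = root).
parent_of = {
    "CEO": None,
    "CIO": "CEO",
    "COO": "CEO",
    "IT Head": "CIO",
    "Emp 1": "IT Head",
}

def path_to_root(member, parent_of):
    """Flatten one member into its full chain of ancestors, root first."""
    path = []
    seen = set()
    while member is not None:
        if member in seen:
            raise ValueError(f"cycle detected at {member}")
        seen.add(member)
        path.append(member)
        member = parent_of[member]
    return list(reversed(path))

# One flattened row per member, similar to what a flattening view would expose.
flattened = {member: path_to_root(member, parent_of) for member in parent_of}
# flattened["Emp 1"] → ['CEO', 'CIO', 'IT Head', 'Emp 1']
```

Because each row stores only a child and its parent, the table handles any depth; the flattening step is where differing branch depths become visible as paths of different lengths.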