IT301
Advance Database System
Data Warehouse and the
Star Schema
Finally, we are talking about something not invented by IBM!
The inventor is unknown. The approach was popularized by Ralph Kimball and his company, Red Brick Systems, which sold the Red Brick Warehouse product.
History
• The first product was Red Brick Warehouse, a standalone system for data warehousing
• Oracle and Sybase figured out the algorithm. Oracle built it into its DBMS; Sybase made a separate software product.
• IBM later acquired Red Brick (through its purchase of Informix)
Agenda
• Definition
• Why data warehouse
• Product History
• Processing Star queries
• Data warehouse in the enterprise
• Data warehouse design
• Relevance of normalization
• Star schema
• Processing the star schema
Definition
Data warehouse: a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.
The point is that it is not used for transaction processing; to its users it is read-only. The data can come from heterogeneous sources, and all of it can be queried in one database.
Why Data Warehouse
• A data warehouse collects information from many sources and converts the differing data into a uniform format that follows the requirements of data analytics. It enforces common standards and data quality, so different departments can generate consistent results.
Data Warehouse vs. OLTP
                   OLTP                             DW
Purpose            Automate day-to-day operations   Analysis
Structure          RDBMS                            RDBMS
Data Model         Normalized                       Dimensional
Access             SQL                              SQL and business analysis programs
Data               Data that runs the business      Current and historical information
Condition of data  Changing, incomplete             Historical, complete, descriptive
Red Brick
• Introduced the first data warehouse product; they sold a hardware product with a star schema database
• You loaded the Red Brick Warehouse from OLTP data and then queried it for analysis
• It featured new optimizations for star schemas and was very fast
Enter Sybase
• Sybase learned the optimization and developed its own product
• The Sybase product was a stand-alone software data warehouse
• It could not do general-purpose database work; it was just a data warehouse
• Sybase appears to have copied the Red Brick idea, without selling hardware
Enter Oracle
• Oracle later copied the same optimization
• It added a bitmap index to its database product, along with the star schema optimization
• Its product could now do data warehousing as well as general-purpose database work
Status Today
• Oracle dominates the field today
• IBM eventually bought Red Brick, so it still offers some sort of Red Brick product
• Sybase still offers its OLTP product, now as an offering of SAP
PROCESSING STAR QUERIES
So what is this algorithm that is so copied?
Star Schema
• A data warehouse relies on the star schema
• The data is not normalized
• The DW is loaded from a normalized database
• A fact table is surrounded by multiple dimension tables
• The fact table holds all measures for the subject area; each row carries foreign keys to the dimensions that describe its measures
A Sample OLTP Schema
[Diagram: OLTP schema with the tables orders, order items, products, and customers]
Transformed to a Star Schema
[Diagram: star schema with the fact table sales at the center and the dimension tables products, customers, channels, and times around it. Column labels shown: Time, Product, Customer, Channel; and Time, Hour, Day of Week, Month, Year, Season.]
Star Schema
[Diagram: a fact table at the center, linked to the dimension tables Customer, Item, Supplier, Time, and Location]
Processing Star Queries
• Build a bitmap index on each foreign key column of the fact table
• The index is a two-dimensional array of bits: one column for each row being indexed, one row for each distinct value of the indexed column
• Bitmap indexes are typically much smaller than b-tree indexes, which can be larger than the data itself
Bitmap Index Example
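The figure for this slide did not survive extraction, so here is a minimal sketch of the idea in Python rather than the slide's own example. It assumes a six-row fact table with two foreign key columns; the column values and the helper names build_bitmap_index and rows_of are made up for illustration, and a Python integer stands in for each bitmap. A real DBMS stores compressed bitmaps on disk, which this sketch does not attempt.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)

def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical foreign key columns of a six-row fact table.
store_key = [1, 2, 1, 3, 2, 1]
product_key = [10, 10, 20, 10, 20, 10]

store_index = build_bitmap_index(store_key)
product_index = build_bitmap_index(product_key)

# Rows for store 1 AND product 10: a single bitwise AND of two bitmaps.
matches = store_index[1] & product_index[10]
print(rows_of(matches))  # [0, 5]
```

Each distinct key costs one bit per fact row, so combining conditions on several foreign keys reduces to cheap bitwise operations before any fact row is read.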
Query Processing
• The typical query joins the fact table to the dimension tables on the fact table's foreign keys
• This is processed in two phases:
  1. From the fact table, retrieve all rows that are part of the result, using bitmap indexes
  2. Join the result of step 1 to the dimension tables
Example Query
Find sales and profits from the grocery
departments of stores in the West and
Southwest districts over the last three quarters
Example Query
SELECT
    store.sales_district,
    time.fiscal_period,
    SUM(sales.dollar_sales) AS revenue,
    SUM(sales.dollar_sales) - SUM(sales.dollar_cost) AS income
FROM
    sales, store, time, product
WHERE
    -- join the fact table to each dimension table
    sales.store_key = store.store_key AND
    sales.time_key = time.time_key AND
    sales.product_key = product.product_key AND
    -- filters on dimension attributes
    time.fiscal_period IN ('3Q95', '4Q95', '1Q96') AND
    product.department = 'Grocery' AND
    store.sales_district IN ('West', 'Southwest')
GROUP BY
    store.sales_district, time.fiscal_period;
Phase 1
Finding the rows in the SALES table (using bitmap indexes):
SELECT ... FROM sales
WHERE
    store_key IN (SELECT store_key FROM store WHERE
        sales_district IN ('West', 'Southwest')) AND
    time_key IN (SELECT time_key FROM time WHERE
        fiscal_period IN ('3Q95', '4Q95', '1Q96')) AND
    product_key IN (SELECT product_key FROM product WHERE
        department = 'Grocery');
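To make the rewrite concrete, the sketch below runs both forms against a tiny in-memory SQLite database from Python. Everything about the data is an assumption made for illustration: the rows, the column types, and the use of the West and Southwest districts from the wording of the question. SQLite has no bitmap indexes, so this only shows that the Phase 1 subquery form selects exactly the fact rows the full join needs; it says nothing about speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store   (store_key INTEGER PRIMARY KEY, sales_district TEXT);
CREATE TABLE time    (time_key INTEGER PRIMARY KEY, fiscal_period TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE sales   (store_key INTEGER, time_key INTEGER, product_key INTEGER,
                      dollar_sales REAL, dollar_cost REAL);
INSERT INTO store VALUES (1, 'West'), (2, 'Southwest'), (3, 'East');
INSERT INTO time VALUES (1, '3Q95'), (2, '4Q95'), (3, '2Q96');
INSERT INTO product VALUES (1, 'Grocery'), (2, 'Hardware');
INSERT INTO sales VALUES
  (1, 1, 1, 100, 60),
  (1, 2, 1, 50, 30),
  (2, 1, 1, 80, 50),
  (3, 1, 1, 70, 40),  -- wrong district
  (1, 1, 2, 90, 70),  -- wrong department
  (2, 3, 1, 40, 20);  -- wrong fiscal period
""")

# Phase 1: pick the qualifying fact rows using only their foreign keys.
phase1 = """
SELECT rowid FROM sales
WHERE store_key IN (SELECT store_key FROM store
                    WHERE sales_district IN ('West', 'Southwest'))
  AND time_key IN (SELECT time_key FROM time
                   WHERE fiscal_period IN ('3Q95', '4Q95', '1Q96'))
  AND product_key IN (SELECT product_key FROM product
                      WHERE department = 'Grocery')
ORDER BY rowid
"""
fact_rows = [r[0] for r in conn.execute(phase1)]

# The full star query, for comparison.
full_query = """
SELECT store.sales_district, time.fiscal_period,
       SUM(sales.dollar_sales) AS revenue,
       SUM(sales.dollar_sales) - SUM(sales.dollar_cost) AS income
FROM sales, store, time, product
WHERE sales.store_key = store.store_key
  AND sales.time_key = time.time_key
  AND sales.product_key = product.product_key
  AND time.fiscal_period IN ('3Q95', '4Q95', '1Q96')
  AND product.department = 'Grocery'
  AND store.sales_district IN ('West', 'Southwest')
GROUP BY store.sales_district, time.fiscal_period
ORDER BY store.sales_district, time.fiscal_period
"""
result = conn.execute(full_query).fetchall()

print(fact_rows)
print(result)
```

Only the first three fact rows survive Phase 1, and those are the rows that feed the three groups in the full query's result.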
Phase 2
Now the rows retrieved from the fact table are joined to the dimension tables. For dimension tables of small cardinality, a full-table scan may be used. For large cardinality, a hash join could be used.
The Star Transformation
• Use bitmap indexes to retrieve all relevant rows from the fact table, based on foreign key values
  – This happens very fast
• Join this result set to the dimension tables
  – If there are many values, a hash join may be used
  – If there are fewer values, a b-tree driven join may be used
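The two phases can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not how any particular DBMS implements the star transformation: the tables, keys, and helper names (bitmap_index, matching_bitmap) are hypothetical, Python integers stand in for bitmaps, and a dictionary lookup stands in for the hash join.

```python
# Hypothetical dimension tables, keyed by generated key.
store = {1: "West", 2: "Southwest", 3: "East"}
product = {10: "Grocery", 20: "Hardware"}

# Hypothetical fact table rows: (store_key, product_key, dollar_sales).
sales = [(1, 10, 100), (3, 10, 70), (2, 10, 80), (1, 20, 90), (2, 20, 40)]

def bitmap_index(rows, col):
    """Map each key in column `col` to a bitmap of the rows that contain it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << i)
    return index

def matching_bitmap(index, dimension, wanted):
    """OR together the bitmaps of every dimension key whose attribute is wanted."""
    bitmap = 0
    for key, attribute in dimension.items():
        if attribute in wanted:
            bitmap |= index.get(key, 0)
    return bitmap

store_idx = bitmap_index(sales, 0)
product_idx = bitmap_index(sales, 1)

# Phase 1: AND the per-dimension bitmaps to find the qualifying fact rows.
hits = (matching_bitmap(store_idx, store, {"West", "Southwest"})
        & matching_bitmap(product_idx, product, {"Grocery"}))
fact_rows = [sales[i] for i in range(len(sales)) if (hits >> i) & 1]

# Phase 2: join the reduced fact rows back to the dimensions.
result = [(store[s], product[p], amount) for s, p, amount in fact_rows]
print(result)  # [('West', 'Grocery', 100), ('Southwest', 'Grocery', 80)]
```

The fact table is touched only after the bitmaps have been combined, which is why the first phase can stay fast even when the fact table is very large.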
How DW Fits into the Enterprise
[Diagram. Labels: OLTP1, OLTP2, OLTP3; Extract, Transform and Load; Data Warehouse; Integration (twice); four Data Marts; Application A, Application B, Application C; seven Users]
Data Warehouse Database Design
• A conventional (normalized) database design for a data warehouse would lead to joins on large amounts of data that would run slowly
• The star schema allows fast processing of very large quantities of data in the data warehouse
• It also allows a very compact representation of events that occur many times
A Sample OLTP Schema
[Diagram: OLTP schema with the tables orders, order items, products, and customers]
Transformed to a Star Schema
[Diagram: star schema with the fact table sales at the center and the dimension tables products, customers, channels, and times around it]
Star Schema
[Diagram: a fact table at the center, linked to the dimension tables Customer, Item, Supplier, Time, and Location]
Fact Table
• The fact table contains the actual measurements or metrics of a business process for a specific event. These are called facts and are usually numbers.
• Each fact is identified by foreign keys to other tables, called “dimension” tables
• These foreign keys are usually generated keys, in order to save fact table space
• If you are building a DW of monthly sales in dollars, your fact table will contain monthly sales, one row per month for each combination of the other dimensions
• If you are building a DW of retail sales, the fact table might have one row for each item sold
Fact Table Design
A fact table may contain one or more facts. Usually you create one fact table per business process. For example, if you want to analyze both sales numbers and advertising spending, those are two separate business processes, so you create two separate fact tables: one for sales data and one for advertising cost data. On the other hand, if you want to track sales tax in addition to the sales number, you simply add one more fact column, called Tax, to the Sales fact table.
Dimension Table
• Dimension tables have a small number of rows (compared to fact tables) but a large number of columns
• For each value at the lowest level of granularity of a fact in the fact table, a dimension table has one row that gives all the categories for that value
• The natural key of a dimension table often spans most or all of its columns (the table is "all key"), so a generated key is used to keep the fact table's reference to the dimension table small
Time Dimension Schema
Column Name    Type
Dim_Id         INTEGER (4)
Month          SMALL INTEGER (2)
Month_Name     VARCHAR (3)
Quarter        SMALL INTEGER (4)
Quarter_Name   VARCHAR (2)
Year           SMALL INTEGER (2)
Time Dimension Data
TM_Dim_Id  TM_Month  TM_Month_Name  TM_Quarter  TM_Quarter_Name  TM_Year
1001       1         Jan            1           Q1               2003
1002       2         Feb            1           Q1               2003
1003       3         Mar            1           Q1               2003
1004       4         Apr            2           Q2               2003
1005       5         May            2           Q2               2003
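A time dimension like this one is usually generated rather than typed in. The sketch below is one way to do it in Python, assuming one row per month and generated keys that start at 1001 as in the sample data; the function name time_dimension_rows and the first_id parameter are hypothetical.

```python
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def time_dimension_rows(year, first_id=1001):
    """Yield (dim_id, month, month_name, quarter, quarter_name, year) for each month."""
    for month in range(1, 13):
        quarter = (month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, and so on
        yield (first_id + month - 1, month, MONTH_NAMES[month - 1],
               quarter, f"Q{quarter}", year)

rows = list(time_dimension_rows(2003))
for row in rows[:5]:
    print(row)
```

The first five rows reproduce the sample data above, and each category column (month name, quarter, quarter name, year) is derived once here so that queries never have to recompute it.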
Location Dimension Schema
Column Name    Type
Dim_Id         INTEGER (4)
Loc_Code       VARCHAR (4)
Name           VARCHAR (50)
State_Name     VARCHAR (20)
Country_Name   VARCHAR (20)
Location Dimension Data
Dim_Id  Loc_Code  Name           State_Name        Country_Name
1001    IL01      Chicago Loop   Illinois          USA
1002    IL02      Arlington Hts  Illinois          USA
1003    NY01      Brooklyn       New York          USA
1004    TO01      Toronto        Ontario           Canada
1005    MX01      Mexico City    Distrito Federal  Mexico
Product Data Schema
Column Name    Type
Dim_Id         INTEGER (4)
SKU            VARCHAR (10)
Name           VARCHAR (30)
Category       VARCHAR (30)
Product Data
Dim_Id  SKU       Name               Category
1001    DOVE6K    Dove Soap 6Pk      Sanitary
1002    MLK66F#   Skim Milk 1 Gal    Dairy
1003    SMKSAL55  Smoked Salmon 6oz  Meat
Categories in Dimension Tables
• Categories may be hierarchical, non-hierarchical, or a mix of both
• Categories provide canned values that can be given to users for queries
Granularity (Grain) of the Fact Table
The level of detail of the fact table is known as the grain of the fact table. In this example the grain of the fact table is the monthly sales number per location per product.
Note about Granularity
• There may be multiple star schemas at different levels of granularity, especially for very large data warehouses
• The first could be the finest: say, each transaction, such as a sale
• The next could be an aggregation, like the previous example
• There could be more levels of aggregation
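Moving from a finer grain to a coarser one is a plain aggregation over the dimension keys that remain. The Python sketch below rolls transaction-grain facts up to the monthly grain per location per product; the transactions, their amounts, and the function name roll_up_to_month are invented for illustration.

```python
from collections import defaultdict

# Hypothetical transaction-grain facts: (year, month, location_id, product_id, amount).
transactions = [
    (2003, 1, 1003, 1001, 200),
    (2003, 1, 1003, 1001, 150),
    (2003, 1, 1001, 1002, 75),
    (2003, 2, 1003, 1001, 300),
]

def roll_up_to_month(facts):
    """Aggregate transaction-grain facts to one row per month, location, and product."""
    totals = defaultdict(int)
    for year, month, location_id, product_id, amount in facts:
        totals[(year, month, location_id, product_id)] += amount
    return dict(totals)

monthly = roll_up_to_month(transactions)
for key, total in sorted(monthly.items()):
    print(key, total)
```

Four transaction rows collapse into three monthly rows here. A further level of aggregation would simply drop another component from the grouping key.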
Design Approach
1. Identify the business process
   Determine which business process your data warehouse represents. This process will be the source of your metrics or measurements.
2. Identify the grain
   Determine what one row of the fact table means. In the previous example you decided that the grain is 'monthly sales per location per product'. It might instead be daily sales, or each sale could even be one row.
3. Identify the dimensions
   Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much as possible and conform to your grain.
4. Identify the facts
   Identify your measurements (or metrics or facts). The facts should be numeric and should conform to the grain defined in step 2.
Monthly Sales Fact Table Schema
Field Name     Type
TM_Dim_Id      INTEGER (4)
PR_Dim_Id      INTEGER (4)
LOC_Dim_Id     INTEGER (4)
Sales          INTEGER (4)
Monthly Sales Fact Table Data
TM_Dim_Id  PR_Dim_Id  LOC_Dim_Id  Sales
1001       1001       1003        435677
1002       1002       1001        451121
1003       1001       1003        98765
1001       1004       1001        65432
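The pieces of this example can be assembled into a small working star schema. The sketch below loads the sample rows into an in-memory SQLite database from Python and runs two star queries. The table names time_dim, location_dim, product_dim, and monthly_sales are assumptions (the slides do not name the tables), the SQLite column types only approximate the ones listed, and only the dimension rows that the fact rows reference are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (TM_Dim_Id INTEGER PRIMARY KEY, TM_Month INTEGER,
                       TM_Month_Name TEXT, TM_Quarter INTEGER,
                       TM_Quarter_Name TEXT, TM_Year INTEGER);
CREATE TABLE location_dim (Dim_Id INTEGER PRIMARY KEY, Loc_Code TEXT, Name TEXT,
                           State_Name TEXT, Country_Name TEXT);
CREATE TABLE product_dim (Dim_Id INTEGER PRIMARY KEY, SKU TEXT, Name TEXT,
                          Category TEXT);
CREATE TABLE monthly_sales (TM_Dim_Id INTEGER, PR_Dim_Id INTEGER,
                            LOC_Dim_Id INTEGER, Sales INTEGER);

INSERT INTO time_dim VALUES
  (1001, 1, 'Jan', 1, 'Q1', 2003),
  (1002, 2, 'Feb', 1, 'Q1', 2003),
  (1003, 3, 'Mar', 1, 'Q1', 2003);
INSERT INTO location_dim VALUES
  (1001, 'IL01', 'Chicago Loop', 'Illinois', 'USA'),
  (1003, 'NY01', 'Brooklyn', 'New York', 'USA');
INSERT INTO product_dim VALUES
  (1001, 'DOVE6K', 'Dove Soap 6Pk', 'Sanitary'),
  (1002, 'MLK66F#', 'Skim Milk 1 Gal', 'Dairy');
INSERT INTO monthly_sales VALUES
  (1001, 1001, 1003, 435677),
  (1002, 1002, 1001, 451121),
  (1003, 1001, 1003, 98765),
  (1001, 1004, 1001, 65432);
""")

# Sales by quarter and state: the fact table joined to two dimensions.
by_state = conn.execute("""
SELECT t.TM_Quarter_Name, l.State_Name, SUM(f.Sales)
FROM monthly_sales f
JOIN time_dim t ON f.TM_Dim_Id = t.TM_Dim_Id
JOIN location_dim l ON f.LOC_Dim_Id = l.Dim_Id
GROUP BY t.TM_Quarter_Name, l.State_Name
ORDER BY l.State_Name
""").fetchall()

# Sales by product category. Product 1004 has no row in the sample
# product data, so the inner join leaves that fact row out.
by_category = conn.execute("""
SELECT p.Category, SUM(f.Sales)
FROM monthly_sales f
JOIN product_dim p ON f.PR_Dim_Id = p.Dim_Id
GROUP BY p.Category
ORDER BY p.Category
""").fetchall()

print(by_state)
print(by_category)
```

The second query shows why every foreign key in the fact table needs a matching dimension row: the sale recorded against product 1004 silently disappears from the category totals.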
Data Mart
• A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Examples: finance has its data mart, marketing has its own, sales has its own, and so on.
• Each department generally runs its own data mart. Ownership of the data mart allows each department to bypass the control that might coordinate the data found in the different departments.
• Each department's data mart is specific to its own needs. Typically, the database design for a data mart is built around a star-join structure designed for that department.
• The data mart contains only a modicum of historical information and is granular only to the point that suits the needs of the department.
• The data mart may also include data from outside the organization, such as normative salary data purchased by an HR department.
About the Data Mart
• The structure of the data in the data mart may or may not be compatible with the structure of data in the data warehouse.
• The amount of historical data found in the data mart is different from the history of the data found in the warehouse. Data warehouses contain robust amounts of history, while data marts usually contain modest amounts of history.
• The subject areas found in the data mart are only faintly related to the subject areas found in the data warehouse.
• The relationships found in the data mart may not be those relationships that are found in the data warehouse.
• The types of queries satisfied in the data mart are quite different from those queries found in the data warehouse.
Walmart’s Data Warehouse
• Half a petabyte in capacity (0.5 x 10^15 bytes)
• World’s largest DW
• Tracks 100 million customers buying billions of products every week
• Every sale from every store is transmitted to Bentonville every night
• Walmart has more than 18,000 retail stores, employs 2.2 million, serves 245 million customers every week
Typical Questions
• How much orange juice did we sell last year, last month, last week in store X?
• What internal factors (position in store, advertising campaigns...) influence orange juice sales?
• How much orange juice are we going to sell next week, next month, next year?
