Data Warehousing and Business Intelligence is one of the hottest skills today, and is the cornerstone for reporting, data science, and analytics. This course teaches the fundamentals with examples plus a project to fully illustrate the concepts.
1. -
1
Data Warehousing
Case study: creating a
data warehouse for a supermarket
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology, Islamabad
Pakistan
anwarchaudary@gmail.com
2. -
2
Case Study- Introduction
Introduction: About 15 years ago, information technology gave the world a
gift: e-commerce. Since then, businesses large and small have used it
to improve their outreach, customer count, sales, profit, and every other
aspect they can. But this was not sufficient. As data grew from MB to GB to PB,
these businesses felt a need to store the data efficiently and to use it to
improve various aspects of the business. One such domain is retail, where
customers and products are the key aspects. Which product is needed by what
type of customer, and when, are the key questions of a retail business. If they
are answered well, they can take the business to new heights. A data
warehouse plays an important role in answering them: it helps analyze the
key factors that improve the sales of retail stores. To know what a customer
(by customer location) buys and in which season, we need to look
over the whole data. So first we need to collect all the historical data in
one place in a standard format. This is done by building a data
warehouse. Many products help with this, such as Teradata, Netezza,
Oracle, and Hadoop. Once the warehouse is built, we can use the
dataset in many ways to answer endless queries. In this project I have
simulated the preparation of a real-world data warehouse and the answering of
business queries.
3. -
3
Data Sources:
Data is the basic requirement of any data warehouse. Data
for this data warehouse is collected from three different
datasets. The first is from a global supermarket store,
from which I took the data of five stores in different
locations in the USA for the year 2012. The second is the revenue
collected by each store in each month. The third
specifies which month falls in which season in the USA.
The three datasets are easily combined, as each of
them has the same month_id, which is
used in the SQL query as the lookup key to populate data in the fact table.
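The month_id lookup described above can be sketched as follows. This is only a minimal illustration, not the project's actual query: it uses Python's built-in sqlite3 module in place of SQL Server, and the table names (sales, revenue, season), their columns, and every value in them are made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server staging area.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales   (state TEXT, month_id INTEGER, total_sale REAL);
CREATE TABLE revenue (state TEXT, month_id INTEGER, revenue REAL);
CREATE TABLE season  (month_id INTEGER, season_name TEXT);

-- Hypothetical sample rows, not taken from the real datasets.
INSERT INTO sales   VALUES ('Texas', 1, 120.0), ('Texas', 7, 300.0);
INSERT INTO revenue VALUES ('Texas', 1, 1000.0), ('Texas', 7, 2500.0);
INSERT INTO season  VALUES (1, 'Winter'), (7, 'Summer');
""")

# month_id is the shared key that lets the three datasets be combined.
rows = conn.execute("""
SELECT s.state, s.month_id, se.season_name, s.total_sale, r.revenue
FROM sales s
JOIN revenue r ON r.state = s.state AND r.month_id = s.month_id
JOIN season se ON se.month_id = s.month_id
ORDER BY s.month_id
""").fetchall()

for row in rows:
    print(row)
```

Each output row carries the sale, the revenue, and the season for one state and month, which is the shape the fact table needs.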
4. -
4
Data Sources:
Data source 1- This dataset was fetched from www.Kaggle.com . Kaggle
is a repository of thousands of datasets. This dataset contains data of
supermarkets across the whole globe. The data used in this data warehouse is from five
different states of the USA: New York, New Jersey, New Hampshire, Utah, and
Texas. Link to the dataset:
https://kaggle2.blob.core.windows.net/datasets/1048/1903/global_superstore_20
16.xlsx.zip?sv=201 5-12-
11&sr=b&sig=V6MbJAh5QVwQC8wLLiPrsC8dKochxZ354VLclEnFuWM%3D&s
e=201704-07T08%3A21%3A15Z&sp=r
Data Source 2- This is a dummy dataset generated with
Mockaroo. It contains the revenue of each month for each state.
https://www.mockaroo.com/
Data Source 3- This is an unstructured dataset which I scraped from the site:
https://www.englishclub.com/vocabulary/time-months-of-year.htm This data was
loaded into Excel, where it looks as shown in Fig.3.1; it was then cleaned
and made structured as shown in Fig.3. This dataset holds the seasons of the USA.
7. -
7
Data Warehouse Design and
Architecture:
To analyze the retail stores in different
states of the USA, such as how much revenue is
generated and how many products are sold in which month
and in which season, Kimball’s approach is used to
build this Data Warehouse.
Design Tools for this Data Warehouse:-
SQL Server Management Studio
SQL Server Integration Services
SQL Server Analysis Services
8. -
8
Data Warehouse Design and
Architecture:
I have followed Kimball’s architecture, which consists of the
following steps:-
Identification of the Business Process:- We need to
define the main processes of the business, such as acquiring customers,
acquiring products, and then the sales process. We also need to
understand at what level sales data is summarized: whether it
is at the daily, weekly, or monthly level. This step helps in
determining the entities and their relationships as per the business
requirements. Later on, these entities become the dimensions
of the business. The most important entities are Customer,
Product, Location, and Time.
9. -
9
Data Warehouse Design and
Architecture:
• Defining the Grain:- The grain is the depth at which we need to
store the data for these dimensions. It defines the granularity of
the system. In this project we are going to store sales of each
product at the month level.
• Defining the Dimensions:- Once the entities and grain are
decided, we can decide the dimensions. This dataset contains
five dimensions.
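To make the idea of a month-level grain concrete, the small sketch below rolls individual order lines up to one row per product per month. The order lines, product ids, and quantities are hypothetical and only illustrate the roll-up; they are not taken from the project's data.

```python
from collections import defaultdict

# Hypothetical order lines: (product_id, month_id, quantity).
order_lines = [
    ("P1", 1, 2),
    ("P1", 1, 3),
    ("P2", 1, 1),
    ("P1", 2, 4),
]

# Month-level grain: one total per (product, month), however many
# individual orders happened inside that month.
monthly_quantity = defaultdict(int)
for product_id, month_id, quantity in order_lines:
    monthly_quantity[(product_id, month_id)] += quantity

monthly_rows = sorted(monthly_quantity.items())
for (product_id, month_id), quantity in monthly_rows:
    print(product_id, month_id, quantity)
```

A finer grain (for example, daily) would keep more rows and allow more detailed questions, at the cost of a larger fact table.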
10. -
10
Data Warehouse Design and
Architecture:
• Deciding the fact of the Data Warehouse:- The fact table
defines the measurable data we are going to store for the
dimensions. It is the pivot of the star schema: it contains all
the primary keys of the dimensions (as foreign keys) and the
measurable quantities used to answer business queries.
This fact table is designed so that it helps
in identifying who our regular customers are, how to
improve the retail business given that product sales vary
by season, how much revenue is generated in
which state, and last but not least, which is the
highest-selling product.
11. -
11
Advantage of Kimball’s Model:
The Kimball model takes a slightly different approach to building a data
warehouse: it follows a bottom-up approach, which helps in
merging small datasets.
• Performance of the Kimball model is better
• More focus is on dimensions, which play an important role in
analysis
• Focus of this approach is on the process of building the DW
• Less time-consuming to create the data warehouse
12. -
12
Overview of building the data warehouse to
carry out business intelligence queries:-
• ETL is done in an SSIS (SQL Server Integration Services) package:
the three datasets are in Excel sheets, which are
extracted into the staging tables.
• From the staging tables, data is populated into the dimension
tables.
• With the help of the lookup tool (join), data is populated
into the fact table.
• The cube is deployed in SSAS (SQL Server Analysis Services)
• Business queries are carried out in Power BI
14. -
14
Star Schema
Star Schema: A star schema looks like a star in which the fact table acts as the
pivot, residing in the center, while multiple dimensions are attached to
the fact table in a star-like form through foreign keys. A simple
star schema usually has one fact table and multiple dimensions, but a
complex star schema can consist of more than one fact table. Generally,
fact tables are in 3NF.
Fact Table: A fact table consists of two types of column:
(i) Measure columns
(ii) Foreign key columns
Measure columns hold numeric values that can be measured or
counted, while foreign key columns hold the values that act as primary
keys in the dimension tables. Measure columns can be used with
or without aggregation when analyzing a business query.
15. -
15
Star Schema
• Dimension Table: A dimension table consists of textual and
descriptive values. Each dimension table has its own
primary key, a unique column that identifies the other column
values in the row. The surrogate key columns that appear as foreign key
columns in the fact table are simply the primary key
columns of the dimension tables.
16. -
16
Advantage of Star Schema
The star schema has various merits which prove its efficiency as
well as its suitability for building a data warehouse.
• Easy to build an ETL process for
• Complexity is low, as queries join tables through direct relationships
• Reduces the headache of normalizing, as data in
dimension tables is stored in denormalized form
• It is very efficient for carrying out metric analysis
• Each dimension table is directly connected to the fact table
• Navigation of data is fast because of the direct connection between
fact and dimension tables.
17. -
17
Design of Data Warehouse: Dim_Table:
For this retail data warehouse, five dimensions and one fact table have been
created.
Dim_Customer: The customer dimension consists of customer name, customer id, and
customer key. Customer key is the primary key in this dimension. It is generated
when I create the dimension, by declaring the column as
[Customer_Key] INT IDENTITY(1,1) PRIMARY KEY. Now the question is why I generated this
when I already had customer_id. A primary key must be unique, so no value may be
repeated; but wherever a customer is repeated, their id repeats too, and that would
not make the column unique. So, to remove this redundancy, Customer_key is
auto-generated as the primary key of this dimension. Customer_name contains the
name of the customer and customer_id contains the id of the customer. With this
dimension we can analyze which customers are our regulars.
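One way to picture the auto-generated Customer_Key is the sketch below. It uses SQLite through Python's sqlite3 module, where INTEGER PRIMARY KEY AUTOINCREMENT plays the role that IDENTITY(1,1) plays in SQL Server. The columns of Main_Stage and all the sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Staging rows: the same customer_id repeats, once per purchase.
CREATE TABLE Main_Stage (customer_id TEXT, customer_name TEXT, product_name TEXT);

-- AUTOINCREMENT is SQLite's counterpart to IDENTITY(1,1) in SQL Server.
CREATE TABLE Dim_Customer (
    Customer_Key  INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id   TEXT,
    customer_name TEXT
);

-- Hypothetical sample data.
INSERT INTO Main_Stage VALUES
    ('C-100', 'Customer A', 'Pen'),
    ('C-100', 'Customer A', 'Paper'),
    ('C-200', 'Customer B', 'Pen');

-- Each distinct customer gets the next surrogate key automatically.
INSERT INTO Dim_Customer (customer_id, customer_name)
SELECT DISTINCT customer_id, customer_name
FROM Main_Stage
ORDER BY customer_id;
""")

dim_rows = conn.execute(
    "SELECT Customer_Key, customer_id, customer_name "
    "FROM Dim_Customer ORDER BY Customer_Key"
).fetchall()
for row in dim_rows:
    print(row)
```

The key values are assigned by the database, so the dimension keeps a unique key that does not depend on how customer_id behaves in the source data.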
18. -
18
Design of Data Warehouse: Dim_Table:
Dim_Product: The product dimension has product_key as the
primary key. Product_id contains the id of each product.
Product_name contains the name of the product sold. With the help
of this dimension we can analyze which is the highest-selling
product and which customer buys what product.
19. -
19
Design of Data Warehouse: Dim_Table:
Dim_Location: The location dimension contains Location_Key as
primary key. State_id is the id of the state. State_name contains
the name of the state where the store is located. Region name contains the
region of the country. This dimension is helpful in analyzing
which state or region has the highest number of customers and which
state has the highest sales. It will also help in analyzing the
revenue earned in each state or region.
20. -
20
Design of Data Warehouse: Dim_Table:
Dim_Source: This dimension is fetched from the unstructured
dataset. It contains Season_key as primary key. Se_month_id
is the id of a particular month. This dimension will help in
analyzing which month shows the highest sales and which
product is the highest-selling in each season.
21. -
21
Design of Data Warehouse: Dim_Table:
Dim_Month: This dimension contains Month_Key as primary
key. S_month_id contains the id of a particular month.
Month_name contains the name of the month. This dimension can be used
to analyze the highest sales in a state by month, or
which is the highest-selling product in a month.
22. -
22
Design of Data Warehouse: Fact_Table:
For our retail superstore we have created one fact table, which
is connected to each dimension table through a foreign key
relationship. It has three measure columns.
• product_quantity- It contains the quantity of product sold.
• total_sale- It contains the sale amount per customer
visit.
• revenue- It contains the amount of revenue generated in
the store per month.
23. -
23
Star Schema of Project:
The dimension tables and the fact table are connected together in a
star schema, as shown in Fig 12.
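The tables described on the previous slides could be declared roughly as follows. This is a sketch in SQLite (through Python's sqlite3), not the project's SQL Server script: the table and key names follow the slides, while the column types, the season_name column in Dim_Source, and the rejected sample row are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Dim_Customer (Customer_Key INTEGER PRIMARY KEY, customer_id TEXT, customer_name TEXT);
CREATE TABLE Dim_Product  (Product_Key  INTEGER PRIMARY KEY, product_id TEXT, product_name TEXT);
CREATE TABLE Dim_Location (Location_Key INTEGER PRIMARY KEY, state_id INTEGER, state_name TEXT, region_name TEXT);
CREATE TABLE Dim_Source   (Season_Key   INTEGER PRIMARY KEY, se_month_id INTEGER, season_name TEXT);
CREATE TABLE Dim_Month    (Month_Key    INTEGER PRIMARY KEY, s_month_id INTEGER, month_name TEXT);

-- The fact table sits in the center: one foreign key per dimension plus the measures.
CREATE TABLE Retail_Fact (
    Customer_Key     INTEGER NOT NULL REFERENCES Dim_Customer(Customer_Key),
    Product_Key      INTEGER NOT NULL REFERENCES Dim_Product(Product_Key),
    Location_Key     INTEGER NOT NULL REFERENCES Dim_Location(Location_Key),
    Season_Key       INTEGER NOT NULL REFERENCES Dim_Source(Season_Key),
    Month_Key        INTEGER NOT NULL REFERENCES Dim_Month(Month_Key),
    product_quantity INTEGER,
    total_sale       REAL,
    revenue          REAL
);
""")

# List the dimension tables the fact table points at.
fk_targets = sorted(row[2] for row in conn.execute("PRAGMA foreign_key_list(Retail_Fact)"))
print(fk_targets)

# A fact row whose keys match no dimension row is rejected.
try:
    conn.execute("INSERT INTO Retail_Fact VALUES (99, 99, 99, 99, 99, 1, 10.0, 100.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.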
24. -
24
Extract Transform Load(ETL) process
To build a data warehouse, the
important thing is to extract the data, then
transform it in the staging area,
and lastly load it into the destination. This is
known as the ETL process. To carry out the ETL
process, the SSIS toolbox is used. In the ETL
process, data from the external source is
extracted into the staging database. The next
step is the transformation stage.
The loading stage is the end of the ETL
process, in which data is loaded into the fact table.
At the end of the ETL process, data is populated
in the fact table as well as in the dimension tables, as
shown in Fig.6.
25. -
25
Extraction:
Data is extracted from the external source in this phase. For this
project, Excel sheets are the external source; otherwise it could
be any database or OLTP server. This extraction loads the
data into the staging database, which is the OLE DB
destination, as shown in Fig 14.
26. -
26
Extraction:
All the data is extracted into the database from these Excel
files. We can also see that the data that arrives in the staging phase
is stored in the database as:
dbo.Main_Stage
dbo.season_stage
dbo.state_stage etc., as shown in Fig.
27. -
27
Extraction:
A TRUNCATE query is run in the staging phase so that
repeated runs do not load duplicate data, as shown in
Fig 16.
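The effect of truncating before each load can be sketched as below. SQLite has no TRUNCATE TABLE statement, so this illustration uses DELETE FROM as a stand-in; the load_season_stage function, the columns of season_stage, and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE season_stage (month_id INTEGER, season_name TEXT)")


def load_season_stage(conn, rows):
    """Empty the staging table, then reload it from the extracted rows."""
    with conn:  # commit on success, roll back on error
        # Stand-in for TRUNCATE TABLE, which SQLite does not have.
        conn.execute("DELETE FROM season_stage")
        conn.executemany("INSERT INTO season_stage VALUES (?, ?)", rows)


# Hypothetical extracted rows.
extracted = [(1, "Winter"), (4, "Spring"), (7, "Summer"), (10, "Autumn")]

load_season_stage(conn, extracted)
load_season_stage(conn, extracted)  # a second run of the package

row_count = conn.execute("SELECT COUNT(*) FROM season_stage").fetchone()[0]
print(row_count)
```

Because the table is emptied at the start of every run, running the load twice leaves the same rows as running it once.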
28. -
28
Transformation:
After the data is extracted from Excel into the staging database,
the next step is transformation. For transformation I
have used the lookup tool (join) and SQL queries, as shown in
Fig.19.2, for loading the data from dimension tables.
29. -
29
Transformation:
We have five dimension tables in our
database and one fact table.
dbo.Dim_Customer
dbo.Dim_Location
dbo.Dim_Month
dbo.Dim_Product
dbo.dim_Source
dbo.Retail_Fact
These dimensions are shown in
Fig.17. Dimensions are one of the
important factors in analyzing data.
Column mappings must not be mismatched, as a mismatch
will terminate the ETL flow.
31. -
31
Loading:
After populating the dimension tables, the next step is to populate the fact
table. The fact table contains all the primary keys of the dimension
tables and some measures which are used for analysis
with some aggregation rule. The lookup tool (joins) is
used to populate the dimension keys and measures in the fact
table.
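The lookup step can be pictured as a join from the staging rows to the dimension tables, swapping each business id for the matching dimension key. The sketch below uses SQLite through Python's sqlite3 rather than the SSIS Lookup tool, keeps only two of the five dimensions for brevity, and uses invented columns and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Customer (Customer_Key INTEGER PRIMARY KEY, customer_id TEXT);
CREATE TABLE Dim_Month    (Month_Key INTEGER PRIMARY KEY, s_month_id INTEGER);
CREATE TABLE Main_Stage   (customer_id TEXT, month_id INTEGER,
                           product_quantity INTEGER, total_sale REAL);
CREATE TABLE Retail_Fact  (Customer_Key INTEGER, Month_Key INTEGER,
                           product_quantity INTEGER, total_sale REAL);

-- Hypothetical dimension and staging rows.
INSERT INTO Dim_Customer VALUES (1, 'C-100'), (2, 'C-200');
INSERT INTO Dim_Month    VALUES (10, 1), (11, 2);
INSERT INTO Main_Stage   VALUES ('C-100', 1, 3, 30.0),
                                ('C-200', 2, 1, 12.5),
                                ('C-999', 1, 2, 20.0);

-- Lookup: replace each business id with the key of the matching dimension row.
INSERT INTO Retail_Fact
SELECT c.Customer_Key, m.Month_Key, s.product_quantity, s.total_sale
FROM Main_Stage s
JOIN Dim_Customer c ON c.customer_id = s.customer_id
JOIN Dim_Month    m ON m.s_month_id  = s.month_id;
""")

fact_rows = conn.execute("SELECT * FROM Retail_Fact ORDER BY Customer_Key").fetchall()
print(fact_rows)

# Staging rows with no matching customer never reach the fact table.
unmatched = conn.execute("""
SELECT s.customer_id
FROM Main_Stage s
LEFT JOIN Dim_Customer c ON c.customer_id = s.customer_id
WHERE c.Customer_Key IS NULL
""").fetchall()
print(unmatched)
```

In this sketch an inner join silently drops the unmatched staging row, so it is worth checking for such rows separately before trusting the fact table totals.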
33. -
33
Deploying the CUBE:
This is the phase that builds a multidimensional representation of
the data with the help of a cube in SSAS, which is then used to
analyze the data on the basis of the measures present
in the fact table and the descriptive, textual data present in the
dimension tables.
Here, the project cube is successfully deployed, as shown in
Fig.20 & Fig.21. After deploying the cube, the phase of analysis
and reporting starts, where business intelligence queries are
carried out.
35. -
35
Business Analytics
Tool Used for Business Query-: Power BI
Power BI is used to carry out the analysis of this
data warehouse. For analysis, the cube is imported into
Power BI. With the help of the descriptive, textual, and
measurable quantities, business queries have been
carried out.
36. -
36
Business Analytics
The following business queries can be analyzed with the help of
our database:
Case Study:1
Do seasons (summer, spring, winter, autumn) in 3
different regions of the USA affect the retail store business in
terms of revenue collection?
Case Study:2
Sales generated in different states on the basis of seasons
Case Study:3
Analytical targeting of customers
Case Study:4
Seasons affecting the revenue of states
37. -
37
Case Study:1
Do seasons (summer, spring, winter, autumn) in 3
different regions of the USA affect the retail store business
in terms of revenue collection?
This query touches all three datasets. To answer
it we take revenue, season name, and region
name. The graph below shows how much revenue is generated
in which region and in which season.
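In the project this question is answered visually in Power BI, but the aggregation behind such a chart is a sum of revenue grouped by region and season. A minimal sketch of that grouping follows, again using SQLite through Python's sqlite3; the reduced table layouts, region labels, and revenue figures are hypothetical and are not the project's results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Location (Location_Key INTEGER PRIMARY KEY, state_name TEXT, region_name TEXT);
CREATE TABLE Dim_Source   (Season_Key INTEGER PRIMARY KEY, season_name TEXT);
CREATE TABLE Retail_Fact  (Location_Key INTEGER, Season_Key INTEGER, revenue REAL);

-- Hypothetical rows; the numbers are invented for the illustration.
INSERT INTO Dim_Location VALUES (1, 'New York', 'East'), (3, 'Utah', 'West');
INSERT INTO Dim_Source   VALUES (1, 'Summer'), (2, 'Winter');
INSERT INTO Retail_Fact  VALUES (1, 1, 500.0), (1, 1, 250.0), (1, 2, 400.0), (3, 1, 100.0);
""")

# Revenue per region and season: the numbers a clustered bar chart would plot.
revenue_by_region_season = conn.execute("""
SELECT l.region_name, s.season_name, SUM(f.revenue) AS revenue
FROM Retail_Fact f
JOIN Dim_Location l ON l.Location_Key = f.Location_Key
JOIN Dim_Source   s ON s.Season_Key   = f.Season_Key
GROUP BY l.region_name, s.season_name
ORDER BY l.region_name, s.season_name
""").fetchall()

for row in revenue_by_region_season:
    print(row)
```

Swapping region_name for state_name in the SELECT and GROUP BY gives the state-level view used in the later case studies.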
39. -
39
Case Study:1 - Analysis
Analysis:
From the clustered bar chart we can see
that the highest revenue is generated in the summer season, followed
by autumn, then winter, and that spring is responsible for the least
revenue in each region of the USA. The graph also shows that in all
seasons the store earns most of its revenue from the Eastern US,
while the Western region comes last. This graph gives a quick
insight to the marketing and sales teams: they need to work on
the Western region to increase sales, and find the reason for spring
being so slow.
40. -
40
Case Study:2
Sales generated in different states on the basis of seasons
This query draws on all three datasets. To answer
it, total sale, state, and season are used. Below,
the pie chart in Fig.23 represents the sales of different states in
different seasons.
42. -
42
Case Study: 2 - Analysis
Analysis:
This pie chart is used to analyze the sales of stores in different
states in different seasons. Fig.23 shows that sales in
Texas in the summer season are the highest, followed by New York.
The pie chart also shows that New York had the highest sales in the autumn
season, followed by Texas. So New York and Texas are
the biggest buyers in any season, while the rest of the states are slow in
all seasons. So it seems that state is a very important factor in terms
of sales. We need to understand the needs of the Western US
states that our store is not able to cater to. Either we need to
change the products or add some offers, or maybe the store
manager is not very efficient. Season and state are very
important factors in the US. A product which is suitable for New
York in winter might not be suitable for Utah during the same
time. This kind of variation needs to be considered while planning store
products.
43. -
43
Case Study:3
Analytical Targeting of customers
To answer the above query we need to check which
customer buys the maximum number of products in which
season. Product quantity, customer name, and season are
used for targeting specific customers.
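The check described above amounts to totalling product quantity per customer and season, then picking the largest total in each season. A small sketch under assumed data follows; the flattened purchases table, the customer names, and the quantities are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flattened view of fact rows with customer and season names (hypothetical).
CREATE TABLE purchases (customer_name TEXT, season_name TEXT, product_quantity INTEGER);
INSERT INTO purchases VALUES
    ('Customer A', 'Summer', 5),
    ('Customer A', 'Summer', 4),
    ('Customer B', 'Summer', 7),
    ('Customer B', 'Winter', 2),
    ('Customer A', 'Winter', 1);
""")

# Total quantity per customer within each season.
totals = conn.execute("""
SELECT season_name, customer_name, SUM(product_quantity) AS quantity
FROM purchases
GROUP BY season_name, customer_name
""").fetchall()

# Keep the customer with the largest total in each season
# (on a tie, the first one encountered is kept).
top_by_season = {}
for season, customer, quantity in totals:
    best = top_by_season.get(season)
    if best is None or quantity > best[1]:
        top_by_season[season] = (customer, quantity)

print(top_by_season)
```

The resulting customers per season are the ones a targeted offer would be aimed at.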
45. -
45
Case Study: 3 - Analysis
Analysis:
The donut chart in Fig.24 represents the customers who buy the
maximum number of products in the four different seasons. The figure
shows which customer bought what quantity of product in
which season. From the business point of view, we can
target those specific customers and provide some more offers to
improve our sales.
46. -
46
Case Study:4
Seasons affecting the revenue of States
This query also touches all three datasets. To analyze the
above query we used season, revenue, and state to check the
amount of revenue generated from each state in every
season.
48. -
48
Case Study: 4 - Analysis
Analysis:
The graphical representation above, Fig.25, shows how much
revenue is collected in each state in each season. New York
has generated the highest amount of revenue in each
season, while New Hampshire has generated the least. From a
business perspective, revenue generation in New York and Texas
is significantly high.
49. -
49
Conclusion:
This data warehouse can help show how we can target
specific customers in each region of the country. New York
and Texas have the highest sales and highest revenue generation,
while New Hampshire is significantly lower than each of the
other states, so we need to improve sales in New Hampshire, Utah, and
New Jersey. Seasons also play an important role in retail
business, as sales in the summer season are the highest of all.
With the help of this data warehouse we can also examine
which product is sold in which month, so that we can give some
extra offers on that particular product.