Introduction to Data Warehousing


            By
                 MAHESH.AMPOLU
Necessity is the mother of invention




            Why Data Warehouse?
Scenario

Unilever is a company with branches at UK,
India, America and Japan. The Sales Manager
wants quarterly sales report. Each branch has a
separate operational system.
Scenario 1 : Unilever company.
   India




    UK
            Sales per item type per branch    Sales
                   for first quarter.        Manager

  America




   Japan
Solution :      Unilever company.

 Extract sales information from each database.
 Store the information in a common repository at a
  single site.
Solution : Unilever company.
 India


                                         Report
  UK
                          Query &                  Sales
              Data      Analysis tools            Manager
            Warehouse

America




 Japan
Scenario :

Hindustan Unilever is a small,new company.
President of the company wants his company should
grow. He needs information so that he can make
correct decisions.
Solution :
 Improve  the quality of data before loading it
  into the warehouse.
 Perform data cleaning and transformation
  before loading the data.
 Use query analysis tools to support adhoc
  queries.
Solution
                                                  Expansio
                                                     n

                             sales

   Data      Query and Analysis             President
 Warehouse          tool

                                     time

                                                Improvemen
                                                     t
What is Data Warehouse??
Inmons’s definition :
 A data warehouse is
       -subject-oriented,
       -integrated,
       -time-variant,
       -nonvolatile
collection of data in support of management’s
decision making process.
Subject-oriented
 Data  warehouse is organized around subjects such
  as sales,product,customer.
 It focuses on modeling and analysis of data for
  decision makers.
 Excludes data not useful in decision support
  process.
Integration
 Data Warehouse is constructed by integrating
  multiple heterogeneous sources.
 Data Preprocessing are applied to ensure
  consistency.
                   RDBMS



                                            Data
                    Legacy                Warehouse
                    System



                   Flat File            Data Processing
                                        Data Transformation
Integration
 In   terms of data.
  – encoding structures.

  – Measurement of
       attributes.

  – physical attribute.
        of data            remarks




  – naming conventions.

  – Data type format
Time-variant
 Provides  information from historical perspective
  e.g. past 5-10 years
 Every key structure contains either implicitly or
  explicitly an element of time
Nonvolatile
 Data  once recorded cannot be updated.
 Data warehouse requires two operations in data
  accessing
   – Initial loading of data
   – Access of data




     load


                            access
Data Warehousing Architecture
Data Warehouse Architecture
 Data  Warehouse server
  – almost always a relational DBMS,rarely flat
     files
 OLAP servers
  – to support and operate on multi-dimensional
     data structures
 Clients
  – Query and reporting tools
  – Analysis tools
  – Data mining tools
OLTP vs OLAP
Data Warehouse Schema
 Star Schema
 Fact Constellation Schema
 Snowflake Schema
Star Schema
A star schema consists of at least one
 fact table and a number of dimension
 tables.

 Star
     Schema is highly recommended
 schema for SSAS cubes.
Star Schema
Store Dimension           Fact Table                   Time Dimension
 Store Key                 Store Key                  Period Key
 Store Name                Product Key                Year
 City                      Period Key                 Quarter
 State                     Units                      Month
 Region                    Price


                           Product Key
                           Product Desc

                         Product
                         Dimension

Benefits: Easy to understand, easy to define hierarchies, reduces
no. of physical
joins.
SnowFlake Schema
 Variant of star schema model.
 A single,large and central fact table and one
  or more tables for each dimension.
 Dimension tables are normalized i.e. split
  dimension table data into additional tables
SnowFlake Schema
Store Dimension             Fact Table                   Time Dimension
                             Store Key                   Period Key
Store Key
                             Product Key                 Year
Store Name
                             Period Key                  Quarter
City Key
                             Units                       Month
                             Price
 City Dimension
City Key
                             Product Key
City
                             Product Desc
State
Region                    Product
                          Dimension
Drawbacks: Time consuming joins,report generation slow
Fact Constellation
 Multiple fact tables share dimension tables.
 This schema is viewed as collection of stars
  hence called galaxy schema or fact
  constellation.
 Sophisticated application requires such
  schema.
Fact Constellation
    Sales                              Shipping
    Fact Table     Product
                   Dimension           Fact Table
Store Key
                                     Shipper Key
Product Key        Product Key       Store Key
Period Key         Product Desc      Product Key
Units
                                     Period Key
Price
                                     Units
                                     Price
                   Store Dimension

                   Store Key
                   Store Name
                   City
                   State
                   Region
Fact Constellation
    Sales                              Shipping
    Fact Table     Product
                   Dimension           Fact Table
Store Key
                                     Shipper Key
Product Key        Product Key       Store Key
Period Key         Product Desc      Product Key
Units
                                     Period Key
Price
                                     Units
                                     Price
                   Store Dimension

                   Store Key
                   Store Name
                   City
                   State
                   Region
Building Data Warehouse
 Data Selection
 Data Preprocessing
  – Fill missing values
  – Remove inconsistency
 Data Transformation & Integration
 Data Loading
 Data in warehouse is stored in form of fact tables
 and dimension tables.
Case Study
  Unilever is a new company which produces
  soaps,paste and baverages products with
  production unit located at NA.
 There products are sold in North,North West and
  Western region of India.
 They have sales units at India, America , UK and
  Japan.
 The President of the company wants sales
  information.
Sales Information

Report: The number of units sold.

113


Report: The number of units sold over time and date


 January       February       March         April
 14            41             33            25
Sales Information
Report : The number of items sold for each product with
time

             Jan Feb Mar Apr
Soaps                 6    17

Paste        6   16   6    8




                                        Time
bread        8   25   21

                                                   Product
Sales Information
Report: The number of items sold in each country for each
product with time
                     Jan   Feb Mar   Apr
India    Soaps                  3    10              City

         Paste       3     16   6
         bread       4     16   6




                                              Time
UK       soaps                  3    7

         paste       3               8                      Product
         bread       4     9    15
Sales Information
Report: The number of items sold and income in each region for
each product with time.
                     Jan        Feb          Mar          Apr
                     Rs     U   Rs      U    Rs      U    Rs      U

 India   Soaps                               7.44    3    24.80   10

         Paste       7.95   3   42.40   16   15.90   6

         bread       7.32   4   29.98   16   10.98   6

 UK      Soaps                               7.44    3    17.36   7

         paste       7.95   3                             21.20   8

         bread       7.32   4   16.47   9    27.45   15
Sales Measures & Dimensions

 Measure – Units sold, Amount.
 Dimensions – Product,Time,Region.
Sales Data Warehouse Model
Fact Table
Country      Product   Month      Units   Rupees
India        Soaps     January    3       7.95
India        Paste     January    4       7.32
UK           Soaps     January    3       7.95
UK           Paste     January    4       7.32
India        Bread     February   16      42.40
Sales Data Warehouse Model

  City_ID Prod_ID   Month      Units   Rupees
  1       589       1/1/1998   3       7.95
  1       1218      1/1/1998   4       7.32
  2       589       1/1/1998   3       7.95
  2       1218      1/1/1998   4       7.32
  1       589       2/1/1998   16      42.40
Sales Data Warehouse Model
Product Dimension Tables
  Prod_ID     Product_Name        Product_Category_ID
  589         Soaps               1
  590         Paste               1
  288         Bread               2


  Product_Category_Id Product_Category
  1                   domestic
  2                   food
Sales Data Warehouse Model
Region Dimension Table

City_ID       City       Region      Country

1             India      West        India
2             UK         NorthWest   India
Sales Data Warehouse Model
   Time




                                 Product
          Sales Fact   Product
                                 Category




  Region
Data Warehousing includes
 Build Data Warehouse
 Online analysis processing(OLAP).
 Presentation.

            Cleaning ,Selection &
            Integration

RDBMS                                                  Presentation




Flat File
                                                                      Client
                             Warehouse & OLAP server
Thank You

Data warehousing

  • 1.
    Introduction to DataWarehousing By MAHESH.AMPOLU
  • 2.
    Necessity is themother of invention Why Data Warehouse?
  • 3.
    Scenario Unilever is acompany with branches at UK, India, America and Japan. The Sales Manager wants quarterly sales report. Each branch has a separate operational system.
  • 4.
    Scenario 1 :Unilever company. India UK Sales per item type per branch Sales for first quarter. Manager America Japan
  • 5.
    Solution : Unilever company.  Extract sales information from each database.  Store the information in a common repository at a single site.
  • 6.
    Solution : Unilevercompany. India Report UK Query & Sales Data Analysis tools Manager Warehouse America Japan
  • 7.
    Scenario : Hindustan Unileveris a small,new company. President of the company wants his company should grow. He needs information so that he can make correct decisions.
  • 8.
    Solution :  Improve the quality of data before loading it into the warehouse.  Perform data cleaning and transformation before loading the data.  Use query analysis tools to support adhoc queries.
  • 9.
    Solution Expansio n sales Data Query and Analysis President Warehouse tool time Improvemen t
  • 10.
    What is DataWarehouse??
  • 11.
    Inmons’s definition : A data warehouse is -subject-oriented, -integrated, -time-variant, -nonvolatile collection of data in support of management’s decision making process.
  • 12.
    Subject-oriented  Data warehouse is organized around subjects such as sales,product,customer.  It focuses on modeling and analysis of data for decision makers.  Excludes data not useful in decision support process.
  • 13.
    Integration  Data Warehouseis constructed by integrating multiple heterogeneous sources.  Data Preprocessing are applied to ensure consistency. RDBMS Data Legacy Warehouse System Flat File Data Processing Data Transformation
  • 14.
    Integration  In terms of data. – encoding structures. – Measurement of attributes. – physical attribute. of data remarks – naming conventions. – Data type format
  • 15.
    Time-variant  Provides information from historical perspective e.g. past 5-10 years  Every key structure contains either implicitly or explicitly an element of time
  • 16.
    Nonvolatile  Data once recorded cannot be updated.  Data warehouse requires two operations in data accessing – Initial loading of data – Access of data load access
  • 17.
  • 18.
    Data Warehouse Architecture Data Warehouse server – almost always a relational DBMS,rarely flat files  OLAP servers – to support and operate on multi-dimensional data structures  Clients – Query and reporting tools – Analysis tools – Data mining tools
  • 19.
  • 20.
    Data Warehouse Schema Star Schema  Fact Constellation Schema  Snowflake Schema
  • 21.
    Star Schema A starschema consists of at least one fact table and a number of dimension tables.  Star Schema is highly recommended schema for SSAS cubes.
  • 22.
    Star Schema Store Dimension Fact Table Time Dimension Store Key Store Key Period Key Store Name Product Key Year City Period Key Quarter State Units Month Region Price Product Key Product Desc Product Dimension Benefits: Easy to understand, easy to define hierarchies, reduces no. of physical joins.
  • 24.
    SnowFlake Schema  Variantof star schema model.  A single,large and central fact table and one or more tables for each dimension.  Dimension tables are normalized i.e. split dimension table data into additional tables
  • 25.
    SnowFlake Schema Store Dimension Fact Table Time Dimension Store Key Period Key Store Key Product Key Year Store Name Period Key Quarter City Key Units Month Price City Dimension City Key Product Key City Product Desc State Region Product Dimension Drawbacks: Time consuming joins,report generation slow
  • 26.
    Fact Constellation  Multiplefact tables share dimension tables.  This schema is viewed as collection of stars hence called galaxy schema or fact constellation.  Sophisticated application requires such schema.
  • 27.
    Fact Constellation Sales Shipping Fact Table Product Dimension Fact Table Store Key Shipper Key Product Key Product Key Store Key Period Key Product Desc Product Key Units Period Key Price Units Price Store Dimension Store Key Store Name City State Region
  • 28.
    Fact Constellation Sales Shipping Fact Table Product Dimension Fact Table Store Key Shipper Key Product Key Product Key Store Key Period Key Product Desc Product Key Units Period Key Price Units Price Store Dimension Store Key Store Name City State Region
  • 29.
    Building Data Warehouse Data Selection  Data Preprocessing – Fill missing values – Remove inconsistency  Data Transformation & Integration  Data Loading Data in warehouse is stored in form of fact tables and dimension tables.
  • 30.
    Case Study  Unilever is a new company which produces soaps,paste and baverages products with production unit located at NA.  There products are sold in North,North West and Western region of India.  They have sales units at India, America , UK and Japan.  The President of the company wants sales information.
  • 31.
    Sales Information Report: Thenumber of units sold. 113 Report: The number of units sold over time and date January February March April 14 41 33 25
  • 32.
    Sales Information Report :The number of items sold for each product with time Jan Feb Mar Apr Soaps 6 17 Paste 6 16 6 8 Time bread 8 25 21 Product
  • 33.
    Sales Information Report: Thenumber of items sold in each country for each product with time Jan Feb Mar Apr India Soaps 3 10 City Paste 3 16 6 bread 4 16 6 Time UK soaps 3 7 paste 3 8 Product bread 4 9 15
  • 34.
    Sales Information Report: Thenumber of items sold and income in each region for each product with time. Jan Feb Mar Apr Rs U Rs U Rs U Rs U India Soaps 7.44 3 24.80 10 Paste 7.95 3 42.40 16 15.90 6 bread 7.32 4 29.98 16 10.98 6 UK Soaps 7.44 3 17.36 7 paste 7.95 3 21.20 8 bread 7.32 4 16.47 9 27.45 15
  • 35.
    Sales Measures &Dimensions  Measure – Units sold, Amount.  Dimensions – Product,Time,Region.
  • 36.
    Sales Data WarehouseModel Fact Table Country Product Month Units Rupees India Soaps January 3 7.95 India Paste January 4 7.32 UK Soaps January 3 7.95 UK Paste January 4 7.32 India Bread February 16 42.40
  • 37.
    Sales Data WarehouseModel City_ID Prod_ID Month Units Rupees 1 589 1/1/1998 3 7.95 1 1218 1/1/1998 4 7.32 2 589 1/1/1998 3 7.95 2 1218 1/1/1998 4 7.32 1 589 2/1/1998 16 42.40
  • 38.
    Sales Data WarehouseModel Product Dimension Tables Prod_ID Product_Name Product_Category_ID 589 Soaps 1 590 Paste 1 288 Bread 2 Product_Category_Id Product_Category 1 domestic 2 food
  • 39.
    Sales Data WarehouseModel Region Dimension Table City_ID City Region Country 1 India West India 2 UK NorthWest India
  • 40.
    Sales Data WarehouseModel Time Product Sales Fact Product Category Region
  • 41.
    Data Warehousing includes Build Data Warehouse  Online analysis processing(OLAP).  Presentation. Cleaning ,Selection & Integration RDBMS Presentation Flat File Client Warehouse & OLAP server
  • 42.

Editor's Notes

  • #3 we invent something only if there is a need for that thing….today we are going to see what data warehousing is…data warehouse is evolved to satisfy some needs….we will see some of these need now
  • #15 When we need to extract data from various sources, some may be manually maintained on paper, some on different legacy systems and integrating the data is a laborious work. Many systems provide some DTS systems to convert data in appropriate format and provide necessary transformations
  • #21 We need subject oriented and multidimensional data amodel fro data warehouse which facilitates online analysis