Intro to Data Warehouse
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software Engineering
Capital University of Sciences & Technology, Islamabad Pakistan
anwarchaudary@gmail.com
2
What is a Data Warehouse?
A complete repository of historical
corporate data extracted from
transaction systems that is
available for ad-hoc access by
knowledge workers.
3
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers
4
What is a Data Warehouse?
Transaction System
 Management Information System (MIS)
 The source could even be typed sheets (NOT a transaction system)
Ad-Hoc access
 No fixed access pattern.
 Queries not known in advance.
 Difficult to write SQL in advance.
Knowledge workers
 Typically NOT IT literate (Executives, Analysts, Managers).
 NOT clerical workers.
 Decision makers.
5
Another View of a DWH
Subject-Oriented
Integrated
Time-Variant
Non-Volatile
6
What is a Data Warehouse?
It is a blend of many technologies, the basic
concept being:
 Take all data from different operational systems.
 If necessary, add relevant data from industry.
 Transform all data and bring into a uniform format.
 Integrate all data as a single entity.
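The four steps above can be sketched in a few lines. This is only an illustrative sketch: the two source systems (`billing_rows`, `crm_rows`), their field names, the date formats and the gender codes are all assumptions made up for the example, not part of the lecture.

```python
from datetime import datetime

# Hypothetical extracts from two operational systems (illustrative data only).
billing_rows = [{"cust_id": "17", "gender": "M", "joined": "03/01/2020"}]  # DD/MM/YYYY
crm_rows = [{"customer": 42, "sex": "female", "since": "2019-11-30"}]      # ISO dates

# Assumed mapping from each source's gender coding to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def to_uniform(cust_id, gender, date_text, date_format, source):
    """Transform one source record into the single warehouse format."""
    return {
        "cust_id": int(cust_id),
        "gender": GENDER_CODES[gender.lower()],
        "joined": datetime.strptime(date_text, date_format).date().isoformat(),
        "source": source,
    }


def integrate(billing, crm):
    """Take data from both systems and integrate it as a single entity."""
    unified = [
        to_uniform(r["cust_id"], r["gender"], r["joined"], "%d/%m/%Y", "billing")
        for r in billing
    ]
    unified += [
        to_uniform(r["customer"], r["sex"], r["since"], "%Y-%m-%d", "crm")
        for r in crm
    ]
    return sorted(unified, key=lambda r: r["cust_id"])


warehouse_rows = integrate(billing_rows, crm_rows)
for row in warehouse_rows:
    print(row)
```

After the transform, both sources share one key type, one gender coding and one date format, which is what makes the later integration step a simple merge.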
7
What is a Data Warehouse? (Cont…)
It is a blend of many technologies, the basic
concept being:
 Store data in a format supporting easy access for decision support.
 Create performance-enhancing indices.
 Implement performance-enhancing joins.
 Run ad-hoc queries with low selectivity.
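One possible reading of the points above, sketched with Python's built-in `sqlite3` module. It assumes that a "performance-enhancing join" means a join computed once at load time (a pre-joined table). The tables `customer`, `sales` and `sales_by_customer`, the index name and all data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'north'), (2, 'south'), (3, 'north');
INSERT INTO sales VALUES (10, 1, 100.0), (11, 2, 40.0), (12, 3, 60.0), (13, 1, 25.0);

-- Pre-join once at load time so ad-hoc queries do not repeat the join.
CREATE TABLE sales_by_customer AS
    SELECT s.sale_id, s.amount, c.cust_id, c.region
    FROM sales s JOIN customer c ON c.cust_id = s.cust_id;

-- Non-unique index on a column analysts are expected to filter on.
CREATE INDEX idx_sales_region ON sales_by_customer (region);
""")

# Ad-hoc question answered from the pre-joined table.
north_total = conn.execute(
    "SELECT SUM(amount) FROM sales_by_customer WHERE region = 'north'"
).fetchone()[0]

# The same question answered by joining at query time, for comparison.
joined_total = conn.execute(
    "SELECT SUM(s.amount) FROM sales s JOIN customer c ON c.cust_id = s.cust_id "
    "WHERE c.region = 'north'"
).fetchone()[0]

print(north_total, joined_total)
```

Both queries give the same answer. The pre-joined table trades extra storage and load-time work for simpler, cheaper queries later.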
8
How is it Different?
 Fundamentally different
[Figure: the traditional reporting cycle. Business user needs info → user requests IT people → IT people do system analysis and design → IT people create reports → IT people send reports to business user → business user may get answers → answers result in more questions.]
9
How is it Different?
 Different patterns of hardware utilization
[Figure: hardware utilization (0% to 100%), Operational vs. DWH. Analogy: Bus Service vs. Train.]
10
How is it Different?
 Combines operational and historical data.
 A DWH keeps historical data. Why?
 In the context of a bank: why did the customer leave?
 What were the events that led to his/her leaving? Why?
 Customer retention.
11
How much history?
 Depends on:
 Industry.
 Cost of storing historical data.
 Economic value of historical data.
12
How much history?
 Industries and history
 Telecom: call volume is far higher than bank transaction volume, so 18 months of history is kept.
 Retailers are interested in analyzing yearly seasonal patterns: 65 weeks.
 Insurance companies do actuarial analysis, using historical data to predict risk: 7 years.
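The retention periods above can be turned into cut-off dates for deciding which records stay in the warehouse. A minimal sketch: the `RETENTION` table simply restates the three figures from the slide, while the function names and the day-of-month clamping rule are assumptions made for illustration.

```python
from datetime import date, timedelta

# Retention periods as quoted on the slide, keyed by (hypothetical) industry labels.
RETENTION = {
    "telecom": ("months", 18),
    "retail": ("weeks", 65),
    "insurance": ("years", 7),
}


def months_back(day, months):
    """Step back whole calendar months; the day is clamped to 28 so the
    result is a valid date in every month (a simplification)."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, min(day.day, 28))


def history_cutoff(industry, today):
    """Return the oldest date whose data would still be kept."""
    unit, amount = RETENTION[industry]
    if unit == "weeks":
        return today - timedelta(weeks=amount)
    if unit == "years":
        return months_back(today, 12 * amount)
    return months_back(today, amount)


today = date(2024, 6, 15)
for industry in RETENTION:
    print(industry, history_cutoff(industry, today))
```

Note that 65 weeks is slightly more than one year, which lets a retailer compare a full season against the same season a year earlier.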
13
How is it Different?
 Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
 Decision makers typically don’t work 24 hrs a day and 7
days a week. An ATM system does.
 Once decision makers start using the DWH and reaping the benefits, they start liking it…
 They use the DWH more and more often, until they want it available
100% of the time.
14
How is it Different?
 Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
 For businesses across the globe, 50% of the world may be
sleeping at any one time, but the businesses are up 100%
of the time.

15
How is it Different?
 Does not follow the traditional development
model
Classical SDLC
 Requirements gathering
 Analysis
 Design
 Programming
 Testing
 Integration
 Implementation
[Figure: Requirements → Program]
16
How is it Different?
 Does not follow the traditional development
model
DWH SDLC (CLDS)
 Implement warehouse
 Integrate data
 Test for bias
 Program w.r.t. data
 Design DSS system
 Analyze results
 Understand requirements
[Figure: DWH → Program → Requirements]
17
Data Warehouse Vs. OLTP
OLTP (On Line Transaction Processing)
Select tx_date, balance from tx_table
Where account_ID = 23876;
18
Data Warehouse Vs. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between 30 and 40 and
Education = 'graduate' and
customer_table.CustID = tx_table.Customer_ID;
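To see the contrast between the two queries in numbers, they can be run on a toy database with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the column types, the choice of `account_ID` as primary key and every data row are invented for illustration. Only the table names, the column names and the two queries follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (
    CustID INTEGER PRIMARY KEY, age INTEGER, sal INTEGER,
    gender TEXT, Education TEXT);
CREATE TABLE tx_table (
    account_ID INTEGER PRIMARY KEY, Customer_ID INTEGER,
    tx_date TEXT, balance REAL);
""")

# Illustrative rows only.
conn.executemany("INSERT INTO customer_table VALUES (?, ?, ?, ?, ?)", [
    (1, 35, 50000, "F", "graduate"),
    (2, 32, 42000, "M", "graduate"),
    (3, 38, 61000, "F", "graduate"),
    (4, 45, 70000, "M", "graduate"),
    (5, 33, 30000, "M", "matric"),
    (6, 25, 28000, "F", "graduate"),
])
conn.executemany("INSERT INTO tx_table VALUES (?, ?, ?, ?)", [
    (23876, 1, "2024-01-05", 1200.0),
    (23877, 2, "2024-01-06", 560.5),
    (23878, 3, "2024-01-07", 9800.0),
    (23879, 4, "2024-01-07", 150.0),
    (23880, 5, "2024-01-08", 720.0),
    (23881, 6, "2024-01-09", 45.0),
])

# OLTP style: look up one account through its primary key.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ?", (23876,)
).fetchall()

# DWH style: join two tables and filter on non-key attributes.
dwh_rows = conn.execute("""
    SELECT balance, age, sal, gender
    FROM customer_table, tx_table
    WHERE age BETWEEN 30 AND 40
      AND Education = 'graduate'
      AND customer_table.CustID = tx_table.Customer_ID
""").fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM tx_table").fetchone()[0]
print("OLTP query returned", len(oltp_rows), "of", total_rows, "rows")
print("DWH query returned", len(dwh_rows), "of", total_rows, "rows")
```

Even on six rows the pattern shows: the key lookup touches one row of one table, while the analytical query joins both tables and returns a much larger share of the data.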
19
Data Warehouse Vs. OLTP
OLTP | DWH
Primary key used | Primary key NOT used
No concept of Primary Index | Primary index used
Few rows returned | Many rows returned
May use a single table | Uses multiple tables
High selectivity of query | Low selectivity of query
Indexing on primary key (unique) | Indexing on primary index (non-unique)
20
Data Warehouse Vs. OLTP
Aspect | Data Warehouse | OLTP
Scope | Application-neutral; single source of "truth"; evolves over time; how to improve the business | Application-specific; multiple databases with repetition; off-the-shelf application; runs the business
Data perspective | Historical, detailed data; some summary; lightly denormalized | Operational data; no summary; fully normalized
Queries | Hardly uses PK; number of results returned in thousands | Based on PK; number of results returned in hundreds
Time factor | Minutes to hours; typical availability 6x12 | Sub-seconds to seconds; typical availability 24x7
OLTP: OnLine Transaction Processing (MIS or Database System)
21
Comparison of Response Times
 On-line analytical processing (OLAP) queries must
be executed in a small number of seconds.
 Often requires denormalization and/or sampling.
 Complex query scripts and large list selections can
generally be executed in a small number of
minutes.
 Sophisticated clustering algorithms (e.g., data
mining) can generally be executed in a small
number of hours (even for hundreds of thousands
of customers).
22
Putting the pieces together
[Figure: data warehouse architecture.
Tier 0, data sources: operational databases, semistructured sources, www data, archived data.
Extract, Transform, Load (ETL) moves data from the sources into Tier 1.
Tier 1, Data Warehouse Server: data warehouse, data marts, metadata.
Tier 2, OLAP servers: MOLAP, ROLAP.
Tier 3, clients: query/reporting, analysis and data mining tools, used by IT users and business users.]

Intro to Data warehousing lecture 02
