The document provides an introduction to data warehouses. It defines a data warehouse as a complete repository of historical corporate data extracted from transaction systems and made available for ad-hoc querying by knowledge workers. It discusses how data warehouses differ from transaction systems in integrating data from multiple sources, storing historical data, and supporting analysis rather than transactions. The document also compares characteristics of data warehousing to online transaction processing.
Intro to Data Warehouse Architecture
1. Intro to Data Warehouse
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software Engineering
Capital University of Sciences & Technology, Islamabad Pakistan
anwarchaudary@gmail.com
2. 2
What is a Data Warehouse?
A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.
3. 3
What is a Data Warehouse?
Complete repository
History
Transaction System
Ad-Hoc access
Knowledge workers
4. 4
What is a Data Warehouse?
Transaction System
Management Information System (MIS)
Could be typed sheets (NOT transaction system)
Ad-Hoc access
Does not have a certain access pattern.
Queries not known in advance.
Difficult to write SQL in advance.
Knowledge workers
Typically NOT IT literate (Executives, Analysts, Managers).
NOT clerical workers.
Decision makers.
5. 5
Another View of a DWH
Subject-oriented
Integrated
Time-variant
Non-volatile
6. 6
What is a Data Warehouse ?
It is a blend of many technologies, the basic
concept being:
Take all data from different operational systems.
If necessary, add relevant data from industry.
Transform all data and bring into a uniform format.
Integrate all data as a single entity.
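The extract, transform, and integrate steps above can be sketched as follows. This is a minimal illustration, not a production ETL job: the two source systems (`billing`, `branch`), their field names, and the sample rows are all hypothetical and are assumed only to show how differently formatted operational data is brought into one uniform format.

```python
from datetime import date, datetime

# Hypothetical extracts from two operational systems (assumed formats).
BILLING_ROWS = [{"cust": "17", "amount": "120.50", "dt": "2023-01-15"}]
BRANCH_ROWS = [{"customer_id": 17, "amt_cents": 9900, "date": "15/02/2023"}]


def from_billing(row):
    """Transform a billing row (strings, ISO dates) to the uniform format."""
    return {
        "customer_id": int(row["cust"]),
        "amount": float(row["amount"]),
        "tx_date": date.fromisoformat(row["dt"]),
        "source": "billing",
    }


def from_branch(row):
    """Transform a branch row (cents, dd/mm/yyyy dates) to the uniform format."""
    return {
        "customer_id": row["customer_id"],
        "amount": row["amt_cents"] / 100,
        "tx_date": datetime.strptime(row["date"], "%d/%m/%Y").date(),
        "source": "branch",
    }


def integrate(billing_rows, branch_rows):
    """Integrate all sources into a single collection ordered by date."""
    unified = [from_billing(r) for r in billing_rows]
    unified += [from_branch(r) for r in branch_rows]
    return sorted(unified, key=lambda r: r["tx_date"])


warehouse = integrate(BILLING_ROWS, BRANCH_ROWS)
```

After `integrate` runs, every row has the same keys and types regardless of which system it came from, which is the "single entity" the slide refers to.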
7. 7
What is a Data Warehouse ? (Cont…)
It is a blend of many technologies, the basic
concept being:
Store data in a format supporting easy access for
decision support.
Create performance enhancing indices.
Implement performance enhancement joins.
Run ad-hoc queries with low selectivity.
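As a rough illustration of the last points, the sketch below creates a non-unique index and runs an ad-hoc aggregate query over a range of rows. It uses Python's standard `sqlite3` module with an in-memory database; the `customer` table, its columns, and the sample rows are assumptions made for the example, not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "cust_id INTEGER PRIMARY KEY, age INTEGER, region TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, 34, "north", 100.0),
        (2, 37, "north", 300.0),
        (3, 45, "south", 500.0),
        (4, 31, "south", 200.0),
    ],
)

# A performance-enhancing, non-unique index on a column used for analysis.
conn.execute("CREATE INDEX idx_customer_age ON customer (age)")

# An ad-hoc query: it touches a range of rows rather than a single key.
summary = conn.execute(
    """
    SELECT region, COUNT(*), AVG(balance)
    FROM customer
    WHERE age BETWEEN 30 AND 40
    GROUP BY region
    ORDER BY region
    """
).fetchall()
```

Whether the index actually speeds up such a query depends on the data volume and the database engine; on a four-row table it makes no practical difference.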
8. 8
How is it Different?
Fundamentally different
[Figure: reporting cycle. Business user needs info, then user requests IT people, then IT people do system analysis and design, then IT people create reports, then IT people send reports to business user, then business user may get answers, then answers result in more questions.]
9. 9
How is it Different?
Different patterns of hardware utilization
[Figure: hardware utilization, 0% to 100%, for Operational vs. DWH systems. Analogy: Bus Service vs. Train.]
10. 10
How is it Different?
Combines operational and historical data.
A DWH keeps historical data. Why?
In the context of a bank, we want to know why a customer left.
What were the events that led to his/her leaving? Why?
Customer retention.
11. 11
How much history?
Depends on:
Industry.
Cost of storing historical data.
Economic value of historical data.
12. 12
How much history?
Industries and history
Telecom: call volumes are far larger than bank transaction volumes; 18 months of history.
Retailers: interested in analyzing yearly seasonal patterns; 65 weeks of history.
Insurance companies: want to do actuarial analysis, using historical data to predict risk; 7 years of history.
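The retention periods above can be turned into a simple cutoff calculation. The sketch below is illustrative only: the `RETENTION_DAYS` table and function names are hypothetical, and months and years are approximated in days (18 months as 548 days, 7 years as 2555 days), which is an assumption rather than something stated in the slides.

```python
from datetime import date, timedelta

# Retention per industry, in days. The month and year figures are
# approximations assumed for this sketch.
RETENTION_DAYS = {
    "telecom": 548,      # about 18 months
    "retail": 65 * 7,    # 65 weeks
    "insurance": 7 * 365,  # about 7 years
}


def cutoff_date(industry, today):
    """Return the oldest date still kept in the warehouse."""
    return today - timedelta(days=RETENTION_DAYS[industry])


def is_within_history(record_date, industry, today):
    """True if a record is recent enough to be retained."""
    return record_date >= cutoff_date(industry, today)


today = date(2024, 1, 1)
retail_cutoff = cutoff_date("retail", today)
```

With these assumed values, a record from mid-2020 would be dropped by a retailer but kept by an insurer.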
13. 13
How is it Different?
Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
Decision makers typically don't work 24 hours a day, 7 days a week. An ATM system does.
Once decision makers start using the DWH and reaping the benefits, they start liking it.
They use the DWH more and more often, until they want it available 100% of the time.
14. 14
How is it Different?
Starts with a 6x12 availability requirement ...
but 7x24 usually becomes the goal.
For business across the globe, 50% of the world may be
sleeping at any one time, but the businesses are up 100%
of the time.
15. 15
How is it Different?
Does not follow the traditional development model
Classical SDLC
Requirements gathering
Analysis
Design
Programming
Testing
Integration
Implementation
[Figure labels: Requirements, Program]
16. 16
How is it Different?
Does not follow the traditional development model
DWH SDLC (CLDS)
Implement warehouse
Integrate data
Test for bias
Program w.r.t data
Design DSS system
Analyze results
Understand requirement
[Figure labels: Requirements, Program, DWH]
17. 17
Data Warehouse Vs. OLTP
OLTP (On Line Transaction Processing)
Select tx_date, balance from tx_table
Where account_ID = 23876;
18. 18
Data Warehouse Vs. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between 30 and 40 and
education = 'graduate' and
customer_table.CustID =
tx_table.Customer_ID;
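To make the contrast concrete, the sketch below runs an OLTP-style lookup and a DWH-style query against a small in-memory database using Python's standard `sqlite3` module. The table and column names follow the slides where they are given; the `tx_id` column, the column types, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_table (
        CustID INTEGER PRIMARY KEY, age INTEGER, sal REAL,
        gender TEXT, education TEXT);
    CREATE TABLE tx_table (
        tx_id INTEGER PRIMARY KEY, Customer_ID INTEGER,
        account_ID INTEGER, tx_date TEXT, balance REAL);
    """
)
conn.executemany(
    "INSERT INTO customer_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, 35, 50000.0, "F", "graduate"),
        (2, 52, 70000.0, "M", "graduate"),
        (3, 31, 42000.0, "M", "graduate"),
        (4, 38, 30000.0, "F", "matric"),
    ],
)
conn.executemany(
    "INSERT INTO tx_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 23876, "2023-01-05", 1200.0),
        (2, 1, 23876, "2023-02-05", 900.0),
        (3, 2, 11111, "2023-01-09", 5000.0),
        (4, 3, 22222, "2023-01-11", 300.0),
        (5, 4, 33333, "2023-01-12", 750.0),
    ],
)

# OLTP style: look up one account by its identifier; few rows come back.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table "
    "WHERE account_ID = ? ORDER BY tx_date",
    (23876,),
).fetchall()

# DWH style: join two tables and filter on non-key attributes.
dwh_rows = conn.execute(
    """
    SELECT balance, age, sal, gender
    FROM customer_table
    JOIN tx_table ON customer_table.CustID = tx_table.Customer_ID
    WHERE age BETWEEN 30 AND 40 AND education = 'graduate'
    ORDER BY tx_table.tx_id
    """
).fetchall()
```

The OLTP query identifies its rows by a key value, while the DWH query describes a population (graduates aged 30 to 40) and needs both tables to answer.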
19. 19
Data Warehouse Vs. OLTP
OLTP | DWH
Primary key used | Primary key NOT used
No concept of primary index | Primary index used
Few rows returned | Many rows returned
May use a single table | Uses multiple tables
High selectivity of query | Low selectivity of query
Indexing on primary key (unique) | Indexing on primary index (non-unique)
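The selectivity contrast can be illustrated by measuring what fraction of a table each kind of query returns. In the sketch below, the slide's "high selectivity" is read as "a small fraction of rows returned" and "low selectivity" as "a large fraction returned"; that reading, the synthetic 1,000-row table, and the helper name `fraction_returned` are assumptions for illustration.

```python
# Synthetic table: 1,000 accounts with ages cycling through 20..69.
rows = [{"account_id": i, "age": 20 + i % 50} for i in range(1000)]


def fraction_returned(matching, total):
    """Fraction of the table's rows that a query returns."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matching / total


# OLTP style: lookup on a unique key returns a single row.
pk_matches = [r for r in rows if r["account_id"] == 42]
pk_fraction = fraction_returned(len(pk_matches), len(rows))

# DWH style: range predicate on a non-unique attribute returns many rows.
range_matches = [r for r in rows if 30 <= r["age"] <= 40]
range_fraction = fraction_returned(len(range_matches), len(rows))
```

Here the key lookup touches one row in a thousand, while the age range returns more than a fifth of the table, which is why a non-unique index and multi-row access paths matter more in a DWH.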
20. 20
Data Warehouse Vs. OLTP
Aspect | Data Warehouse | OLTP
Scope | Application-neutral; single source of "truth"; evolves over time; how to improve business | Application specific; multiple databases with repetition; off-the-shelf application; runs the business
Data perspective | Historical, detailed data; some summary; lightly denormalized | Operational data; no summary; fully normalized
Queries | Hardly uses PK; number of results returned in thousands | Based on PK; number of results returned in hundreds
Time factor | Minutes to hours; typical availability 6x12 | Sub-seconds to seconds; typical availability 24x7
OLTP: OnLine Transaction Processing (MIS or Database System)
21. 21
Comparison of Response Times
On-line analytical processing (OLAP) queries must
be executed in a small number of seconds.
Often requires denormalization and/or sampling.
Complex query scripts and large list selections can
generally be executed in a small number of
minutes.
Sophisticated clustering algorithms (e.g., data
mining) can generally be executed in a small
number of hours (even for hundreds of thousands
of customers).
22. 22
Putting the pieces together
[Figure: data warehouse architecture in tiers.]
Tier 0, data sources: operational databases, semistructured sources, www data, archived data.
Extract, Transform, Load (ETL) moves data from the sources into the warehouse.
Tier 1, data warehouse server: data warehouse, metadata, data marts.
Tier 2, OLAP servers: MOLAP, ROLAP.
Tier 3, clients: query/reporting, analysis, and data mining tools.
Users: IT users and business users.