Ch1 data-warehousing

Overview
• What is a data warehouse?
• Why a data warehouse?
• Data reconciliation – ETL process
• Data warehouse architectures
• Star schema – dimensional modeling
• Data analysis


What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from
the organization’s operational database
– Supports information processing by providing a solid platform of
consolidated, historical data for analysis.

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
– The process of constructing and using data warehouses


Data Warehouse: Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing
• Provides a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process

Data Warehouse: Integrated
• Constructed by integrating multiple,
heterogeneous data sources
– relational databases, flat files, on-line transaction
records

• Data cleaning and data integration techniques
are applied.
– Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among
different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is
converted.
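
The conversion step can be sketched in Python. This is a minimal illustration of the hotel price example, not something taken from the slides: the record layout, the exchange rate, the tax rate, and the function name `to_warehouse_price` are all hypothetical.

```python
# Hypothetical rates, chosen only for illustration.
EUR_TO_USD = 1.10
TAX_RATE = 0.10

def to_warehouse_price(record):
    """Convert a source hotel-price record to one convention:
    USD, tax included, breakfast flag always present."""
    price = record["price"]
    if record["currency"] == "EUR":
        price = price * EUR_TO_USD
    if not record["tax_included"]:
        price = price * (1 + TAX_RATE)
    return {
        "hotel": record["hotel"],
        "price_usd_incl_tax": round(price, 2),
        "breakfast_included": record.get("breakfast", False),
    }

# Two sources that describe the price differently.
source_a = {"hotel": "H1", "price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast": True}
source_b = {"hotel": "H2", "price": 100.0, "currency": "EUR",
            "tax_included": True}

unified = [to_warehouse_price(r) for r in (source_a, source_b)]
```

After conversion both records use the same currency, tax treatment, and field names, so they can be compared directly.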

Data Warehouse: Time-Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)

• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not
contain a “time element”

Data Warehouse: Nonvolatile
• A physically separate store of data transformed
from the operational environment
• Operational update of data does not occur in the
data warehouse environment
– Does not require transaction processing, recovery,
and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data

Trends in Organisations that encourage
the need for data warehousing
• No single system of record
• Multiple systems are not synchronized
• Organisations want to analyse the activities in
a balanced way
• Customer relationship management
• Supplier relationship management


Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from different databases)
• Separation of operational and informational systems
and data (for improved performance)


Operational & Informational System
The need to separate operational and informational
systems is based on three primary factors:
• A data warehouse centralizes data that are scattered
throughout disparate operational systems and makes them
available for decision support applications
• A properly designed data warehouse adds value to data
by improving their quality
• A separate data warehouse eliminates much of the contention
for resources that results when informational applications are
confounded with operational processing

Data Reconciliation
• Typical operational data is:
– Transient – not historical
– Not normalised (perhaps due to denormalisation for
performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors

• After ETL (Extract, Transform, Load), data
should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalised – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – data should be current enough to assist decision-making
– Quality controlled – accurate with full integrity


The ETL Process / Data Reconciliation: Main Steps
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index


• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract
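
The difference between the two extract types can be sketched as follows. This is only an illustration: the row layout is hypothetical, and detecting changes through a `modified` timestamp is an assumption (the slides do not say how changes are identified).

```python
from datetime import datetime

# Hypothetical source table: each row carries a last-modified timestamp.
source_rows = [
    {"id": 1, "amount": 50, "modified": datetime(2024, 1, 1)},
    {"id": 2, "amount": 75, "modified": datetime(2024, 1, 5)},
    {"id": 3, "amount": 20, "modified": datetime(2024, 1, 9)},
]

def static_extract(rows):
    """Snapshot: copy every row as it exists at this point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract_time):
    """Capture only the rows changed after the previous extract."""
    return [dict(r) for r in rows if r["modified"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 1, 4))
```

Here `snapshot` holds all three rows, while `changes` holds only the rows modified after the previous extract time.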

• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
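
A few of these cleansing tasks (reformatting, erroneous dates, missing data, duplicates, error logging) might look like this in Python. The field names `name` and `signup` and the function `scrub` are hypothetical, and the rules are deliberately simple.

```python
from datetime import datetime

def scrub(rows):
    """Return (clean_rows, error_log) after basic cleansing."""
    clean, errors, seen = [], [], set()
    for row in rows:
        # Reformat: trim whitespace and normalize capitalization.
        name = row.get("name", "").strip().title()
        if not name:
            errors.append((row, "missing name"))
            continue
        try:
            # Reject dates that are missing or cannot exist.
            date = datetime.strptime(row["signup"], "%Y-%m-%d").date()
        except (KeyError, ValueError):
            errors.append((row, "erroneous date"))
            continue
        key = (name, date)
        if key in seen:
            errors.append((row, "duplicate"))
            continue
        seen.add(key)
        clean.append({"name": name, "signup": date})
    return clean, errors

raw = [
    {"name": " alice smith ", "signup": "2024-01-15"},
    {"name": "Alice Smith", "signup": "2024-01-15"},  # duplicate once reformatted
    {"name": "Bob Jones", "signup": "2024-02-30"},    # impossible date
    {"name": "", "signup": "2024-03-01"},             # missing data
]
clean_rows, error_log = scrub(raw)
```

Rejected rows are logged with a reason rather than silently dropped, which corresponds to the error detection/logging item in the list.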

Record-level:
• Selection – data partitioning
• Joining – data combining
• Aggregation – data summarization

Field-level:
• single-field – from one field to one field
• multi-field – from many fields to one, or one field to many
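
Each kind of transformation can be shown on a small in-memory data set. This is a sketch with hypothetical sales and store records, not a real ETL tool.

```python
from collections import defaultdict

sales = [
    {"store_id": 1, "product": "pen", "qty": 3, "unit_price": 2.0},
    {"store_id": 1, "product": "ink", "qty": 1, "unit_price": 5.0},
    {"store_id": 2, "product": "pen", "qty": 4, "unit_price": 2.0},
]
stores = {1: {"city": "Leeds", "country": "UK"},
          2: {"city": "Lyon", "country": "FR"}}

# Record-level selection: keep only one partition of the data.
pen_sales = [s for s in sales if s["product"] == "pen"]

# Record-level joining: combine each sale with its store record.
joined = [{**s, **stores[s["store_id"]]} for s in sales]

# Record-level aggregation: summarize quantity per store.
qty_by_store = defaultdict(int)
for s in sales:
    qty_by_store[s["store_id"]] += s["qty"]

# Field-level, single-field: one field maps to one field.
product_codes = [s["product"].upper() for s in sales]

# Field-level, multi-field (many to one): qty and unit_price become revenue.
revenue = [s["qty"] * s["unit_price"] for s in sales]

# Field-level, multi-field (one to many): split a combined location string.
city, country = "Leeds, UK".split(", ")
```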

• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to data warehouse
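
The two load modes can be contrasted with a dictionary standing in for the target table. The function names are hypothetical, and overwriting by key in update mode is a simplification made for brevity.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target contents and rewrite them in bulk."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})

def update_load(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    for row in changed_rows:
        target[row["id"]] = row

warehouse = {}
# Initial bulk load of the full source.
refresh_load(warehouse, [{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])
# Later, only the changed and new rows are sent.
update_load(warehouse, [{"id": 2, "qty": 7}, {"id": 3, "qty": 1}])
```

Refresh mode always moves the full data set, while update mode moves only the changed rows, so it depends on an extract step that can identify those changes.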

Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational
Data Store
• Logical Data Mart and @ctive
Warehouse


Generic two-level architecture
[Figure: architecture diagram with E, T, L stages]
• One company-wide warehouse
• Periodic extraction → data is not completely current in warehouse

Independent data mart
[Figure: architecture diagram with E, T, L stages]
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

Dependent data mart with operational data store
[Figure: architecture diagram with E, T, L stages]
• ODS provides option for obtaining current data
• Single ETL for enterprise data warehouse (EDW)
• Dependent data marts loaded from EDW

Logical data mart and @ctive warehouse
[Figure: architecture diagram with E, T, L stages]
• ODS and data warehouse are one and the same
• Near real-time ETL for @ctive data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse → easier to create new data marts
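
A logical data mart can be approximated with an ordinary SQL view. The sketch below uses Python's built-in `sqlite3` module; the `sales` table, the `north_mart` view, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "pen", 10.0), ("south", "pen", 4.0), ("north", "ink", 6.0)],
)

# The "data mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW north_mart AS "
    "SELECT product, amount FROM sales WHERE region = 'north'"
)

# A later load into the warehouse is visible through the view immediately.
conn.execute("INSERT INTO sales VALUES ('north', 'pad', 3.0)")
mart_rows = conn.execute(
    "SELECT product, amount FROM north_mart ORDER BY product"
).fetchall()
```

Creating another mart is just one more `CREATE VIEW` statement, with no separate ETL process or database.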

Data Characteristics
Status vs. Event Data
• Event – a database action (create/update/delete) that results from a transaction
[Figure: Status → Event → Status]
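
The relationship between status and event data can be sketched as a function that takes a status and an event and returns the resulting status. The dictionary layout of the status and the event is hypothetical.

```python
def apply_event(status, event):
    """Return the new status produced by applying one event."""
    new_status = dict(status)  # leave the earlier status untouched
    action, key = event["action"], event["key"]
    if action in ("create", "update"):
        new_status[key] = event["value"]
    elif action == "delete":
        new_status.pop(key, None)
    return new_status

before = {"acct-1": 100}
event = {"action": "update", "key": "acct-1", "value": 80}
after = apply_event(before, event)
```

Here `before` and `after` are status data, and `event` is the record of the action that led from one to the other.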

Data Characteristics
Transient vs. Periodic Data
• Transient data: changes to existing records are written over previous records, thus destroying the previous data content
• Periodic data: data are never physically altered or deleted once they have been added to the store
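
The contrast can be made concrete with two small in-memory stores. This is an illustrative sketch: the account key, values, and dates are hypothetical.

```python
from datetime import date

transient = {}  # key -> current value only
periodic = []   # append-only list of (key, value, effective_date)

def write_transient(key, value):
    # Overwrites in place: the previous value is lost.
    transient[key] = value

def write_periodic(key, value, effective):
    # Appends a new row: earlier rows are never altered or deleted.
    periodic.append((key, value, effective))

for value, day in [(100, date(2024, 1, 1)), (80, date(2024, 2, 1))]:
    write_transient("acct-1", value)
    write_periodic("acct-1", value, day)

history = [v for k, v, _ in periodic if k == "acct-1"]
```

The transient store can report only the latest balance, while the periodic store still holds every value along with the date it took effect.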

Star schema
• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• 1:N relationship between dimension tables and fact tables
• Dimension tables are denormalized to maximize performance

Star Schema: Simple database design in which dimensional data are separated from fact data. Excellent for queries, but bad for online transaction processing

Star schema example
Fact table provides statistics for sales broken
down by product, period and store dimensions
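
Since the original figure is not reproduced here, the following is a minimal stand-in built with Python's `sqlite3` module. The table names, column names, and sample rows are assumptions chosen to match the product, period, and store dimensions named above; they are not taken from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO product VALUES (1, 'pen'), (2, 'ink');
INSERT INTO period  VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO store   VALUES (1, 'Leeds'), (2, 'Lyon');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 20.0), (1, 2, 1, 5, 10.0),
    (2, 1, 2, 2, 10.0),  (1, 1, 2, 4, 8.0);
""")

# Typical star-schema query: join the fact table to dimensions, then aggregate.
rows = conn.execute("""
    SELECT p.description, s.city, SUM(f.dollars_sold)
    FROM sales f
    JOIN product p ON p.product_key = f.product_key
    JOIN store   s ON s.store_key   = f.store_key
    GROUP BY p.description, s.city
    ORDER BY p.description, s.city
""").fetchall()
```

Each dimension row can be referenced by many fact rows, which is the 1:N relationship between dimension and fact tables.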


On-Line Analytical Processing (OLAP)
• The use of a set of graphical tools that
provides users with multidimensional views of
their data and allows them to analyze the
data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation

• Multidimensional OLAP (MOLAP)
– Cube structure

• OLAP Operations
– Cube slicing – come up with 2-D view of data
– Drill-down – going from summary to more
detailed views
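
Slicing and drill-down can be sketched on a tiny cube held in a Python dictionary. The dimensions (product, quarter, store), the cell values, and the helper names `slice_cube` and `roll_up` are hypothetical; real OLAP tools provide these operations interactively.

```python
from collections import defaultdict

# Cube cells keyed by (product, quarter, store); values are units sold.
cube = {
    ("pen", "Q1", "Leeds"): 10, ("pen", "Q2", "Leeds"): 5,
    ("pen", "Q1", "Lyon"): 4,   ("ink", "Q1", "Lyon"): 2,
}

def slice_cube(cube, product):
    """Fix one dimension value, leaving a 2-D (quarter, store) view."""
    return {(q, s): v for (p, q, s), v in cube.items() if p == product}

def roll_up(cube, dims):
    """Summarize the cube over the listed dimension positions."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[d] for d in dims)] += value
    return dict(totals)

pen_view = slice_cube(cube, "pen")
summary = roll_up(cube, dims=(0,))    # totals by product
detail = roll_up(cube, dims=(0, 1))   # drill down: product by quarter
```

Going from `summary` to `detail` is a drill-down: the same totals are broken out by one more dimension.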

Data Warehouse vs. Operational
DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.

• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making

• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP
                      OLTP                          OLAP
users                 clerk, IT professional        knowledge worker
function              day to day operations         decision support
DB design             application-oriented          subject-oriented
data                  current, up-to-date,          historical, summarized,
                      detailed, flat relational,    multidimensional,
                      isolated                      integrated, consolidated
usage                 repetitive                    ad-hoc
access                read/write,                   lots of scans
                      index/hash on prim. key
unit of work          short, simple transaction     complex query
# records accessed    tens                          millions
# users               thousands                     hundreds
DB size               100MB-GB                      100GB-TB
metric                transaction throughput        query throughput, response

Example: Drill-down
[Figures: summary report; drill-down with color added]

Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools

