Design and Maintenance of DataWharehouses
ABIS 2002 – Timos Sellis 1
Design and Maintenance of Data Warehouses
Timos Sellis
National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr/
Many thanks to P. Vassiliadis and A. Tsois
EDBT Summer School - Cargese 2002 2
Outline
What’s and Why’s for DW’s
DW architecture
DW Schema
Back End of the DW
Front End of the DW
DW Servers
Metadata Repository
Conclusions
OLTP
On-line transaction processing (OLTP) is the
traditional way of using a database
Legacy systems: relational, hierarchical, network
databases / COBOL applications / …
Short transactions (read/update few records) with
ACID properties
Normally, only the last version of data stored in the
database
DSS & OLAP
Decision support systems - help the executive,
manager, analyst make faster and better decisions.
What were the sales volumes by region and product
category for the last year?
Will a 10% discount increase sales volumes sufficiently?
On-line analytical processing (OLAP) is an
element of decision support systems (DSS)
OLTP vs. OLAP
            OLTP                       OLAP
User        Clerk                      Manager
Function    Day to day operations      Decision support
Access      Read/write                 Mostly read
Data        detailed, up-to-date,      summarised, historical,
            flat relational            multidimensional
Db Size     100MB - 1GB                100GB - 1TB
Source: Chaudhuri & Dayal, VLDB’96
Data Warehouse
A decision support database that is maintained
separately from the organization’s operational
database.
• S. Chaudhuri, U. Dayal, VLDB’96 tutorial
A data warehouse is a subject-oriented,
integrated, time-varying, non-volatile collection of
data that is used primarily in organizational
decision making.
• W.H. Inmon, Building the Data Warehouse, 1992
Reasons for Building Data Warehouses
Semantic Reconciliation
Dispersed data sources within the same organization
Different encoding of the same entities
DW encompasses the full volume of these data
under a single, reconciled schema
Keeps the history of these data, too
Reasons for Building Data Warehouses
Performance
OLAP applications need different organization of
data
Complex OLAP queries would degrade OLTP
performance
Availability
Separation increases availability
Possibly the only way to query the dispersed data
sources
Reasons for Building Data Warehouses
Data Quality
The validity of source data is not guaranteed (data can be
missing, inconsistent, out of date, violating business and
database rules…)
Error rates reach at least 10% in most data stores
This can waste 25-40% of resources
DW acts as a data cleaning buffer
…. and the market is there!
The Market
Estimated sales in millions of dollars [ShTy98] (* estimates are from [Pend00]).
                                             1998    1999    2000    2001    2002  CAGR (%)
RDBMS sales for DW                          900.0  1110.0  1390.0  1750.0  2200.0      25.0
Data Marts                                   92.4   125.0   172.0   243.0   355.0      40.0
ETL tools                                   101.0   125.0   150.0   180.0   210.0      20.1
Data Quality                                 48.0    55.0    64.5    76.0    90.0      17.0
Metadata Management                          35.0    40.0    46.0    53.0    60.0      14.4
OLAP (including implementation services)*    2000    2500    3000    3600    4000      18.9
Data Warehouse Architecture
A Simple View
[Figure: components shown are three Sources, Integration, the Warehouse with its Metadata, Query & Analysis, and two Clients]
Data Warehouse Architecture
[Figure: detailed architecture. Components: Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools. Roles: Administrators, Designer, End User. "Quality Issues" are marked at four points of the data flow.]
Two / Three Tier Architecture
Warehouse database server
almost always relational (RDBMS)
Data Marts / OLAP server
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Clients
Query and reporting tools
Analysis tools / Data mining tools
Data Warehouse Architecture
Enterprise warehouse: collects all information about
subjects
requires extensive business modeling
may take years to design and build
Data Marts: Departmental subsets that focus on
selected subjects
Virtual warehouse: views over operational dbs
How to build the DW
Top – down
Single integrated enterprise model
Reduce all sources (and clients, if necessary) to the central
model
− Time consuming; labor intensive; slow to produce results
− Increases the risk of the DW project due to late delivery of
results
+ Provides a consistent, global view of the enterprise data
How to build the DW
Bottom – up
Build smaller data marts first
Progressively combine pairwise
− Fails to provide a global view of the enterprise data
− Possibly increases the risk since a complete
integration might prove impossible late in the project
+ Early delivery of results
+ Less time consuming, less labor intensive
Data Warehouse Back-End
[Figure: the detailed data warehouse architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the Quality Issues markers)]
Design: Global-As-View Integration
Preintegration. What schemata to integrate and in
which order
Schema Comparison. To determine the correlations
among concepts of different schemata and to detect
possible naming, semantic, structural, … conflicts
Schema Conforming. Conflict resolution for
heterogeneous schemata
Schema Merging and Restructuring. Production of a
single conformed schema
Design: Local-As-View Integration
Works the other way around.
Main deliverable is a central conceptual model,
produced by interactively examining user needs
and existing schemata
All source and client schemata are expressed in
terms of the central data warehouse schema and
not the other way around.
DW = Materialized Views?
[Figure: Simple View of a DW. On the Sources side, S1_PARTSUPP and S2_PARTSUPP are combined by a union (U) into DW.PARTSUPP on the DW side. Two views are defined over it: V1 = Aggregate1 (PKEY, DAY, MIN(COST)) and V2 = Aggregate2 (PKEY, MONTH, AVG(COST)); a TIME table (DW.PARTSUPP.DATE, DAY) is also involved.]
DW ≠ Materialized Views !
[Figure: the same scenario drawn with its full data flow across Sources, DSA and DW. S1_PARTSUPP and S2_PARTSUPP are transferred (FTP1, FTP2) into DS.PS_NEW1 and DS.PS_NEW2 and compared with DS.PS_OLD1 and DS.PS_OLD2 (DIFF1 on DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY; DIFF2 on DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY), giving DS.PS1 and DS.PS2. Activities shown include NotNULL, Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1 and SK2 (DS.PS1.PKEY / DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY), $2€ (COST), A2EDate (DATE), AddDate (DATE=SYSDATE) and CheckQTY (QTY>0), each with a "Log rejected" output. A union (U) loads DW.PARTSUPP, from which V1 = Aggregate1 (PKEY, DAY, MIN(COST)) and V2 = Aggregate2 (PKEY, MONTH, AVG(COST)) are derived, with the TIME table (DW.PARTSUPP.DATE, DAY).]
Operational Processes
Data extraction, transform & load
Originally treated as the ‘refreshment’ problem
Requires transforming, cleaning and integrating data from
different sources.
Build/refresh derived data and views
Service queries
Monitor the warehouse
The Refreshment Problem
Propagate updates on source data to the
warehouse
Issues:
when to refresh
on every update
periodically
refresh policy set by administrator
how to refresh
Refreshment Techniques
Full extract from base tables
Incremental techniques
detect changes on base tables
snapshots
transaction shipping
active rules
logical correctness
transactional correctness
Currently, in practice we use ETL tools/scripts (see
next)…
Data Extraction
Can take a snapshot or differentials
(new/deleted/updated) of source data
Transfer, encryption, compression are also
involved
Time window and source system overhead
involved
In general, faced with the requirement of minimal
changes to existing configuration of sources
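The idea of extracting differentials instead of a full snapshot can be sketched as follows. This is only an illustration, assuming each snapshot fits in memory as a dictionary keyed by primary key; the function name `snapshot_diff` and the sample rows are hypothetical and do not come from the slides.

```python
def snapshot_diff(old, new):
    """Compare two snapshots (dicts keyed by primary key).

    Returns three dicts: rows that are new, rows that were deleted,
    and rows whose content changed (with their new content).
    """
    inserted = {k: v for k, v in new.items() if k not in old}
    deleted = {k: v for k, v in old.items() if k not in new}
    updated = {k: v for k, v in new.items() if k in old and old[k] != v}
    return inserted, deleted, updated


# Hypothetical sample data: yesterday's and today's extract of one source.
old_snapshot = {
    101: {"cost": 10.0, "qty": 5},
    102: {"cost": 7.5, "qty": 2},
    103: {"cost": 3.0, "qty": 9},
}
new_snapshot = {
    101: {"cost": 10.0, "qty": 5},   # unchanged
    102: {"cost": 8.0, "qty": 2},    # updated
    104: {"cost": 1.2, "qty": 4},    # new
}

inserted, deleted, updated = snapshot_diff(old_snapshot, new_snapshot)
print(sorted(inserted), sorted(deleted), sorted(updated))
```

Only the three small result sets need to be shipped to the warehouse, which is what keeps the load window and the overhead on the source system low.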
Data Transformation
Schema Reconciliation: conflicts at the schema
level (different attributes for the same
information)
Value Identification & Reconciliation: different
(same) id’s for same (different) objects (use
surrogate keys)
Data Cleaning
Offending Data: duplicates, integrity/business
rules/format violations …
Incompleteness: missing data
Renicing: esp. addresses
Data Loading
This final stage may still require additional
preprocessing:
sorting, summarizing, performing computations
Issues:
huge volumes of data to be loaded
small time window
when to build indexes and summary tables
restart after failure with no loss of data integrity
Loading Techniques
Cannot use the SQL language interface to update or
append data:
it works one record at a time
too slow since it uses random disk I/O
can cause the rollback segment or log file to overflow
Use batch load utility
sort input records on a clustering key
sequential I/O 100 times faster than random I/O
build index at the same time
use parallelism to accelerate load operations
Incremental Loading
Use incremental loads during refresh to reduce data
volume (e.g. Redbrick)
insert only updated tuples
incremental load conflicts with queries
break into sequence of shorter transactions
coordinate this sequence of transactions: must
ensure consistency between base and derived
tables and indices.
Data Warehouse Front-End
[Figure: the detailed data warehouse architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the Quality Issues markers)]
Front End Tools
Ad hoc query and reporting
Example: MS Excel, ProReports
OLAP: ‘Multidimensional spreadsheet’
pivot tables, drill down, roll up, slice, dice
Data Mining
Basic ideas for OLAP
Several numeric measures that are analyzed
sales, budget, revenue, inventory
Dimensions
contexts in which a measure appears
Example: store, product, date information associated
with a sale.
each context is a dimension and the measure is a
point in a multi-dimensional world
Basic ideas for OLAP
Nature of Analysis
aggregation (total sales, percent-to-total)
comparison (budget vs. expense)
ranking (top 10)
access to detailed and aggregate data
complex criteria specification
visualization
Basic ideas for OLAP
Attributes
information associated with a dimension
example: owner of store, county in which the store is
located
Attribute Hierarchies
Attributes of a dimension are often related in a
hierarchical way
example: street → city → country
Multidimensional Data
Dimensions: Product, Region, Date
Hierarchical summarization paths:
[Figure: a cube with axes Month, Region and Product; each cell holds a Sales volume]
Product: Industry → Category → Product
Region: Country → Region → City → Office
Date: Year → Quarter → Month → Day, and Week → Day
Operations
Roll up: summarize data
Drill down: go from higher level summary to
lower level summary or detailed data
Slice and dice: select and project
Pivot: re-orient cube
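The operations above can be sketched on a tiny in-memory cube. This is a minimal illustration, not how an OLAP server is implemented: the cube is a plain dictionary keyed by (product, quarter, store), the function names `roll_up` and `dice` are hypothetical, and the figures are the sales volumes used in the example slides of this deck.

```python
from collections import defaultdict

DIMS = ("product", "quarter", "store")
PRODUCTS = ("Electronics", "Toys", "Clothing", "Cosmetics")

# (quarter, store) -> sales volumes in PRODUCTS order
COLUMNS = {
    ("Q1", "Store1"): (5.2, 1.9, 2.3, 1.1),
    ("Q2", "Store1"): (8.9, 0.75, 4.6, 1.5),
    ("Q1", "Store2"): (5.6, 1.4, 2.6, 1.1),
    ("Q2", "Store2"): (7.2, 0.4, 4.6, 0.5),
}
cube = {
    (product, quarter, store): volume
    for (quarter, store), volumes in COLUMNS.items()
    for product, volume in zip(PRODUCTS, volumes)
}


def roll_up(cells, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(float)
    for key, value in cells.items():
        totals[tuple(key[p] for p in positions)] += value
    return {key: round(total, 2) for key, total in totals.items()}


def dice(cells, **allowed):
    """Keep only the cells whose coordinates are in the allowed value sets."""
    return {
        key: value
        for key, value in cells.items()
        if all(key[DIMS.index(name)] in values for name, values in allowed.items())
    }


# Roll up: quarters are summarized into the year.
yearly = roll_up(cube, keep=("product", "store"))
# Slice and dice: select Store1 and two products.
sliced = dice(cube, store={"Store1"}, product={"Electronics", "Toys"})
print(yearly[("Electronics", "Store1")], len(sliced))
```

Drill down is the inverse of `roll_up` and needs the more detailed data to be available; pivot only changes which dimensions are shown as rows and columns, so it does not change the stored cells.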
Roll up
Sales volume per quarter:
Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
Electronics   $5,2        $8,9        $5,6        $7,2
Toys          $1,9        $0,75       $1,4        $0,4
Clothing      $2,3        $4,6        $2,6        $4,6
Cosmetics     $1,1        $1,5        $1,1        $0,5
Sales volume rolled up to Year 1996:
Products      Store1      Store2
Electronics   $14,1       $12,8
Toys          $2,65       $1,8
Clothing      $6,9        $7,2
Cosmetics     $2,6        $1,6
Drill down
Sales volume per product category:
Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
Electronics   $5,2        $8,9        $5,6        $7,2
Toys          $1,9        $0,75       $1,4        $0,4
Clothing      $2,3        $4,6        $2,6        $4,6
Cosmetics     $1,1        $1,5        $1,1        $0,5
Sales volume after drilling down into Electronics:
Electronics   Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
VCR           $1,4        $2,4        $1,4        $2,4
Camcorder     $0,6        $3,3        $0,6        $1,3
TV            $2,0        $2,2        $2,4        $2,5
CD player     $1,2        $1,0        $1,2        $1,0
Pivot
Sales volume, stores on the outer axis and quarters on the inner axis:
Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
Electronics   $5,2        $8,9        $5,6        $7,2
Toys          $1,9        $0,75       $1,4        $0,4
Clothing      $2,3        $4,6        $2,6        $4,6
Cosmetics     $1,1        $1,5        $1,1        $0,5
Sales volume after the pivot, quarters on the outer axis and stores on the inner axis:
Products      Q1 Store1   Q1 Store2   Q2 Store1   Q2 Store2
Electronics   $5,2        $5,6        $8,9        $7,2
Toys          $1,9        $1,4        $0,75       $0,4
Clothing      $2,3        $2,6        $4,6        $4,6
Cosmetics     $1,1        $1,1        $1,5        $0,5
Slice and Dice
Sales volume, full table:
Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
Electronics   $5,2        $8,9        $5,6        $7,2
Toys          $1,9        $0,75       $1,4        $0,4
Clothing      $2,3        $4,6        $2,6        $4,6
Cosmetics     $1,1        $1,5        $1,1        $0,5
Sales volume after selecting Store1 and the products Electronics and Toys:
Products      Store1 Q1   Store1 Q2
Electronics   $5,2        $8,9
Toys          $1,9        $0,75
Data Warehouse Server
[Figure: the detailed data warehouse architecture diagram again (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools, with the Administrator, Designer and End User roles and the Quality Issues markers)]
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Database Servers
Relational and Specialized Relational DBMS
Relational OLAP (ROLAP) DBMS
Multidimensional OLAP (MOLAP) DBMS
Relational DBMS
Features that support DSS
Specialized Indexing techniques
Specialized Join and Scan Methods
Data Partitioning and use of Parallelism
Complex Query Processing
Intelligent Processing of Aggregates
Extensions to SQL and their processing
ROLAP Servers
Exploits services of a relational engine effectively
Key functionality
needs aggregation navigation logic
ability to generate multi statement SQL
optimize for each individual database backend
Additional services
cost-based query governor
design tool for DSS schema
performance analysis tool
Database Schemata for DW & ROLAP
Star Schema
Snowflake Schema
Fact Constellation
Aggregated data
Star Schema
A star schema consists of one central fact table and
several denormalized dimension tables.
The measures of interest for OLAP are stored in the
fact table (e.g. Dollar Amount, Units in the table
SALES).
For each dimension of the multidimensional model
there exists a dimension table (e.g. Geography,
Product, Time, Account) with all the levels of
aggregation and the extra properties of these levels.
Star Schema
SALES (fact table): Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Geography: Geography Code, Region Code, Region Manager, State Code, City Code, .....
Product: Product Code, Product Name, Brand Code, Brand Name, Prod. Line Code, Prod. Line Name
Time: Time Code, Quarter Code, Quarter Name, Month Code, Month Name, Date
Account: Account Code, KeyAccount Code, KeyAccountName, Account Name, Account Type, Account Market
Source: Stanford Technology Group, Inc., 1996
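A star schema and a typical query over it can be tried out with Python's built-in `sqlite3` module. The sketch below is a reduced version of the SALES schema (two dimensions, a few columns each); the table contents are invented sample rows, not data from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables: every hierarchy level is a column.
CREATE TABLE geography (
    geography_code INTEGER PRIMARY KEY,
    region_code TEXT, state_code TEXT, city_code TEXT);
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY,
    product_name TEXT, brand_name TEXT);
-- Central fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    geography_code INTEGER REFERENCES geography(geography_code),
    product_code INTEGER REFERENCES product(product_code),
    dollar_amount REAL, units INTEGER);

INSERT INTO geography VALUES (1, 'North', 'NY', 'NYC'), (2, 'South', 'TX', 'Austin');
INSERT INTO product VALUES (10, 'Widget', 'Acme'), (11, 'Gadget', 'Acme'),
                           (12, 'Doohickey', 'Zenith');
INSERT INTO sales VALUES (1, 10, 100.0, 2), (1, 11, 50.0, 1), (2, 10, 70.0, 3),
                         (2, 12, 30.0, 1), (1, 12, 20.0, 4);
""")

# A star query: join the fact table with its dimensions, then aggregate
# the measures at the (region, brand) level.
rows = conn.execute("""
    SELECT g.region_code, p.brand_name, SUM(s.dollar_amount), SUM(s.units)
    FROM sales s
    JOIN geography g ON s.geography_code = g.geography_code
    JOIN product p ON s.product_code = p.product_code
    GROUP BY g.region_code, p.brand_name
    ORDER BY g.region_code, p.brand_name
""").fetchall()

for row in rows:
    print(row)
```

Because the dimension tables are denormalized, aggregating at any level of a hierarchy needs only one join per dimension.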
Snowflake Schema
The normalized version of the star schema
Explicit treatment of dimension hierarchies (each
level has its own table)
Easier to maintain, slower in query answering
Snowflake Schema
SALES (fact table): Postal Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Time: Time Code, Quarter Code, Month Code
Quarter: Quarter Code, QuarterName
Month: Month Code, Month Name
Account: Account Code, KeyAccount Code
Account attributes: Account Code, AccountName
KeyAccount: KeyAcc Code, KeyAcc Name
Geography: Postal Code, Region Code, State Code, City Code
Region: Region Code, Region Mgr
State: State Code, State Name
City: City Code, City Name
Product: Product Code, Prod Line Code, Brand Code
Product: Product Code, ProductName
Brand: Brand Code, Brand Name
ProdLine: ProdLineCode, ProdLineName
Source: Stanford Technology Group, Inc., 1996
Fact Constellation
Multiple fact tables that share many dimension
tables
Example: projected expense and the actual
expense may share dimensional tables
Aggregated Tables
In addition to base fact and dimension tables,
a data warehouse keeps aggregated (summary)
data for efficiency.
Two approaches
store as separate summary fact and dimension
tables
add to the existing base tables
Aggregated Tables
Sales table
RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450
• Separate sum-table
City-dimension sum table
City    Amount
Athens  $350
N.Y.    $750
Rome    $365
• Extend existing base tables
Extended Sales table
RID  City    Amount  Level
1    Athens  $100    NULL
2    N.Y.    $300    NULL
3    Rome    $120    NULL
4    Athens  $250    NULL
5    Rome    $180    NULL
6    Rome    $65     NULL
7    N.Y.    $450    NULL
8    Athens  $350    City
9    N.Y.    $750    City
10   Rome    $365    City
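Both ways of storing the summary data can be reproduced from the Sales table of this example. The sketch below is an illustration in plain Python; the helper name `city_totals` is hypothetical.

```python
from collections import defaultdict

# (RID, City, Amount): the Sales table of the example
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]


def city_totals(rows):
    """Aggregate the amount per city, returned in alphabetical city order."""
    totals = defaultdict(int)
    for _rid, city, amount in rows:
        totals[city] += amount
    return dict(sorted(totals.items()))


# Approach 1: a separate summary table at the city level.
sum_table = city_totals(sales)

# Approach 2: append the summary rows to the base table and mark them
# with the aggregation level (None plays the role of NULL for base rows).
next_rid = max(rid for rid, _city, _amount in sales) + 1
extended = [(rid, city, amount, None) for rid, city, amount in sales]
extended += [
    (next_rid + i, city, total, "City")
    for i, (city, total) in enumerate(sum_table.items())
]

print(sum_table)
print(extended[-3:])
```

With the second approach every query on the base data has to filter on the level column, otherwise detail rows and summary rows would be counted together.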
MOLAP Servers
The storage model is an n-dimensional array
Very fast in computations and OLAP operations
Normally they require pre-computation of the
available cubes
Compression of data to save storage space
Currently: 98% of the market for client tools
SISYPHUS: A Chunk-Based Storage
Manager for OLAP Cubes
PhD work of Nikos Karayannidis
National Technical University of Athens
(NTUA)
ERATOSTHENES project
ERATOSTHENES is a specialized database management system for OLAP cubes which is under development.
In the context of ERATOSTHENES, a prototype storage manager for OLAP cubes, called SISYPHUS, has been developed.
[Figure: three engines: Presentation Engine, Processing Engine, Storage Engine (SISYPHUS)]
Why does OLAP pose new requirements for
storage management?
Small response time: good physical clustering +
efficient access paths
Multidimensionality: md-storage structures,
address by location
Hierarchies: access paths, clustering
Sparseness: not random but according to
hierarchies.
Architecture: levels of abstraction in
SISYPHUS
[Figure: levels of abstraction. Components: SSM (record-oriented storage management), File Manager (bucket-oriented file management), Logging/Recovery, Buffer Manager (buffer management), Access Manager (chunk-oriented file management), Cube Access Methods, OLAP Processing. Interfaces between the levels: record-oriented access, bucket-oriented access, chunk-oriented access, cell-oriented access.]
Dimension data encoding
[Figure: the LOCATION dimension with levels Country, Region and City. Each member has an order-code within its level: CountryA = 0; RegionA = 0, RegionB = 1; CityA to CityD = 0 to 3. A member-code lists the order-codes along the hierarchy path, e.g. 0.1.2.]
A chunk-oriented file system: the
hierarchically chunked cube
Use the bucket file
system.
Chunking Method:
partition the data space
by forming a hierarchy of
chunks that is based on
the dimension
hierarchies.
[Figure sequence: hierarchical chunking of a two-dimensional cube. LOCATION has the levels continent, country, region, city, with grain-level members [0..18]; PRODUCT has the levels category, type, item plus a pseudo level, with grain-level members [0..5].
D = 0: one chunk, labelled (0,0), covering LOCATION [0..18] and PRODUCT [0..5].
D = 1: LOCATION is split into [0..5], [6..10], [11..18]; PRODUCT into [0..3], [4..5].
D = 2: LOCATION is split into [0..2], [3..5], [6..10], [11..14], [15..18]; PRODUCT into [0..1], [2..3], [4..5].
D = 3 (Max Depth): LOCATION is split into [0], [1..2], [3], [4..5], [6..7], [8..9], [10], [11], [12..14], [15..16], [17..18]; PRODUCT into [0..1], [2..3], [4..5].]
Chunk Identifiers (chunk-ids)
Chunk addressing.
Unique identifier of chunk within cube + depicts
hierarchy path of chunk.
Interleave the member-codes of the pivot-level
members that define a chunk (at any depth).
e.g. D = 2, LOCATION: 2.3, PRODUCT: 1.2
chunk-id: 2|1.3|2
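The interleaving of member-codes can be sketched in a few lines. This assumes the textual form suggested by the example, with `.` separating hierarchy levels and `|` separating dimensions within a level; the function name `chunk_id` is hypothetical and is not the SISYPHUS API.

```python
def chunk_id(*member_codes):
    """Interleave dot-separated member codes, one per dimension, level by level."""
    paths = [code.split(".") for code in member_codes]
    depth = len(paths[0])
    if any(len(path) != depth for path in paths):
        raise ValueError("all member codes must have the same depth")
    # zip(*paths) groups the order-codes of all dimensions at the same depth.
    return ".".join("|".join(level) for level in zip(*paths))


# LOCATION member-code 2.3 and PRODUCT member-code 1.2 at depth D = 2
print(chunk_id("2.3", "1.2"))
```

Every prefix of such an identifier is the identifier of an ancestor chunk, which is why the chunk-id also depicts the hierarchy path of the chunk.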
Accessing the chunks of a cube
Need some chunk directory.
Idea: use intermediate depth chunks as directory
chunks that will guide us to the data chunks
(Dmax + 1)
Create a chunk-tree.
[Figure: the chunk-tree of the example cube over LOCATION and PRODUCT. The Root Chunk leads through the chunks at D = 1 and D = 2 to the data chunks at the grain level, D = 3 (Max Depth). Every chunk is labelled with its chunk-id, and P marks the pseudo level.]
Bucket Organization
3 parts: bucket header, directory chunk vector,
data chunk vector.
Main idea: try to store in the same bucket
whole families (i.e. sub-trees of chunks)!
A) A single sub-tree
B) Many sub-trees that form a bucket region
C) A single tree of directory chunks (root bucket)
D) A single data chunk
Chunk organization
Implementation data structure: multidimensional arrays:
Offer data address by-location, native to cubes.
Enable chunk id exploitation.
We don’t have to store the chunk ids.
Are FAST!
Compression schemes:
Data chunks: allocate only non-empty cells, maintain bitmap.
Directory chunks: full cell allocation but no allocation for
empty sub-trees.
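The compression scheme for data chunks (allocate only non-empty cells, maintain a bitmap) can be illustrated as follows. This is a simplified sketch that treats a chunk as a flat list in array order, with `None` for empty cells; the function names are hypothetical and the real on-disk layout of SISYPHUS is not described in the slides.

```python
def compress_chunk(cells):
    """Split a chunk into a presence bitmap and the list of non-empty values."""
    bitmap = [cell is not None for cell in cells]
    values = [cell for cell in cells if cell is not None]
    return bitmap, values


def read_cell(bitmap, values, offset):
    """Return the cell at `offset` (array order), or None if it is empty."""
    if not bitmap[offset]:
        return None
    # The position inside `values` is the number of non-empty cells
    # that precede `offset`.
    return values[sum(bitmap[:offset])]


chunk = [None, 4.0, None, None, 7.5, 1.0]   # a sparse chunk with 6 cells
bitmap, values = compress_chunk(chunk)
print(values, read_cell(bitmap, values, 4), read_cell(bitmap, values, 0))
```

Cells are still addressed by location: the offset is computed from the cell coordinates, and the bitmap only translates it into a position in the compacted value list.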
Summary
Storage management in OLAP
SISYPHUS storage manager for OLAP
Chunk-oriented file system:
Natively multidimensional and supports hierarchies.
Clusters data hierarchically.
It is space conservative.
Adopts a location-based rather than a content-based data addressing
scheme.
Also: data-access interface can be used for defining
access paths and OLAP operations.
Future Work
Experimental tests.
Design/Implementation of algorithms for typical
OLAP operations.
Other research issues:
Finding optimal bucket regions
Updating interface for common OLAP updating
operations.
Efficient file organization for dimension data
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Why specialized indexing
Join-intensive queries
Almost all queries demand joins of the fact table with some
dimensions
Very large tables
traditional indexes become too large to be efficient
Complex queries
selections based on complex criteria
Read-intensive workload
BitMap Indexes
An alternative representation of RID-list
Advantageous for low-cardinality domains
Represent each row of a table by a bit and the
table as a bit vector
There is a distinct bit vector Bv for each value v
for the domain.
The j-th bit in the vector Bv is set if the j-th row of
the table has the value v for the column
BitMap Indexes
Example: The attribute sex has values M and F.
A table of 100 million people needs 2 lists of 100
million bits
Comparison, join and aggregation operations are
reduced to bit arithmetic with dramatic
improvement in processing time
Significant reduction in space and I/O (30:1)
BitMap Indexes
Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H
Region Index
RID  N  S  E  W
1    1  0  0  0
2    0  1  0  0
3    0  0  0  1
4    0  0  0  1
5    0  1  0  0
6    0  0  0  1
7    1  0  0  0
Rating Index
RID  H  M  L
1    1  0  0
2    0  1  0
3    0  0  1
4    1  0  0
5    0  0  1
6    0  0  1
7    1  0  0
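A bitmap index over this base table can be sketched with Python integers used as bit vectors (bit j is set when row j has the value). The function names are hypothetical; the point is only that a conjunctive selection becomes a single bitwise AND.

```python
# (Cust, Region, Rating): the base table of the example
rows = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H"),
]


def build_bitmap_index(table, column):
    """Return {value: bit vector}; bit `rid` is set if row `rid` has the value."""
    index = {}
    for rid, row in enumerate(table):
        index[row[column]] = index.get(row[column], 0) | (1 << rid)
    return index


def matching_customers(bitmap):
    """Translate a result bit vector back into customer ids."""
    return [rows[rid][0] for rid in range(len(rows)) if (bitmap >> rid) & 1]


region_index = build_bitmap_index(rows, 1)
rating_index = build_bitmap_index(rows, 2)

# Region = 'W' AND Rating = 'L' is evaluated with one bitwise AND;
# counting the answers is a population count on the result.
west_and_low = region_index["W"] & rating_index["L"]
print(matching_customers(west_and_low), bin(west_and_low).count("1"))
```

Note that row ids start at 0 here, while the slide numbers them from 1.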
BitMap Indexes
Works poorly for high cardinality domains since
the number of vectors increases
However, often good performance via
compression since sparsity also increases
Products that support bitmaps: Model 204,
TargetIndex (Redbrick), IQ (Sybase), Oracle
Join Indexes
Traditional indexes map the value in a column to a list
of rows with that value
Join indexes maintain relationships between the primary
key and the foreign keys
Thus, join indexes relate the values of the dimensions
of a star schema to rows in the fact table.
Join indexes may span multiple dimensions
Join Indexes
Join index for a single dimension:
Consider a schema with a Sales fact table and two
dimensions city and product
If there is a join index on city, then for each distinct city the
index maintains a list of RIDs of the tuples recording sales in
that city
Example: The node Athens in the index points to the list of
RIDs in the fact table corresponding to transactions (sales) in
Athens.
Join indexes can span multiple dimensions
the node (Athens, oranges) points to transactions that took
place in Athens and correspond to purchases of
oranges
Join Indexes
Sales table
RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450
City table
City    Country  Population
Athens  Greece   3.507.000
Rome    Italy    3.033.000
N.Y.    USA      17.953.000
Index on City-Sales
City    RIDs
Athens  1, 4
Rome    3, 5, 6
N.Y.    2, 7
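The join index of this example can be built and used as follows. The sketch keeps the fact table in a dictionary keyed by RID; the function name `build_join_index` is hypothetical.

```python
# RID -> (City, Amount): the Sales fact table of the example
sales = {
    1: ("Athens", 100), 2: ("N.Y.", 300), 3: ("Rome", 120), 4: ("Athens", 250),
    5: ("Rome", 180), 6: ("Rome", 65), 7: ("N.Y.", 450),
}
# City -> Country: part of the City dimension table
cities = {"Athens": "Greece", "Rome": "Italy", "N.Y.": "USA"}


def build_join_index(fact):
    """Map each dimension value to the list of fact-table RIDs that join with it."""
    index = {}
    for rid, (city, _amount) in fact.items():
        index.setdefault(city, []).append(rid)
    return index


city_sales_index = build_join_index(sales)

# "Total sales in Italy": restrict the dimension first, then fetch only
# the fact rows listed in the join index, without scanning the fact table.
italian_cities = [city for city, country in cities.items() if country == "Italy"]
total_italy = sum(
    sales[rid][1] for city in italian_cities for rid in city_sales_index[city]
)
print(city_sales_index, total_italy)
```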
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Specialized Join Methods
Traditional systems limit themselves to binary
joins
results in many intermediate tables
For a query over many dimensions, the
optimization time can be substantial
Specialized Join Methods
StarJoin Algorithm (Redbrick)
use join indexes to identify regions of cartesian
product that are of interest
Intelligent Scan (Redbrick)
take advantage of the “read-only” environment
Parallel Join Methods
Complex Query Processing
Extensible optimization frameworks (e.g.
Starburst [IBM Almaden])
Estimation of Statistics (histograms, sampling)
Some of the ideas useful for DSS:
interleaving GroupBy and Join
Merging Views
Propagating selection through views
Optimizing nested subqueries
Example of Optimizing Nested
Subqueries
Find all employees younger than 35 who earn more
than the average of their department
Alternatives:
Iterate over each employee: (1) find the department of the employee (2)
compute average salary in the department (3) check if the employee’s
salary is above the average
Compute the average salary of each department. For each employee,
check if his/her salary is above the corresponding average salary
Find the set of all departments where at least one of the employees is
younger than 35. Compute the average salary of only those departments. Then
check each employee against these averages, as in the previous alternative.
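The first and the third alternative can be compared on a small example. The employee data below is invented for illustration, and the function names are hypothetical; both functions return the same answer, but the second computes each average once and only for departments that can contribute to the result.

```python
# (name, department, age, salary): invented sample data
employees = [
    ("Ann", "Sales", 29, 60), ("Bob", "Sales", 45, 40), ("Cid", "Sales", 33, 45),
    ("Dee", "IT", 52, 90), ("Eve", "IT", 58, 70),
    ("Fay", "HR", 31, 50), ("Gus", "HR", 30, 30),
]


def naive(emps):
    """Alternative 1: recompute the department average for every candidate."""
    result = []
    for name, dept, age, salary in emps:
        if age < 35:
            salaries = [s for _, d, _, s in emps if d == dept]
            if salary > sum(salaries) / len(salaries):
                result.append(name)
    return result


def decorrelated(emps):
    """Alternative 3: average only the departments that have a young employee."""
    young_depts = {dept for _, dept, age, _ in emps if age < 35}
    totals = {}
    for _, dept, _, salary in emps:
        if dept in young_depts:
            total, count = totals.get(dept, (0, 0))
            totals[dept] = (total + salary, count + 1)
    averages = {dept: total / count for dept, (total, count) in totals.items()}
    return [
        name for name, dept, age, salary in emps
        if age < 35 and salary > averages[dept]
    ]


print(naive(employees), decorrelated(employees))
```

In this data the IT department has no employee younger than 35, so the third alternative never computes its average.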
Rollup and Cube operators
[Gray et al. 1996] Rollup operator for nested
aggregations
rollup product, store, city
group by product, store, city
group by store, city
group by city
Cube operator for all possible combinations
group by product,store,city cube
group by each subset of {product, store, city}, independently of the
order of columns in the statement
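The set of groupings produced by the cube operator can be enumerated directly: one group-by per subset of the listed columns. The sketch below is an illustration in plain Python with invented sample rows and hypothetical function names; a real engine shares work between the groupings instead of scanning the data once per subset.

```python
from collections import defaultdict
from itertools import combinations


def cube_groupings(columns):
    """All subsets of the grouping columns, from the most to the least detailed."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]


def cube(rows, columns, measure):
    """Compute SUM(measure) for every grouping of the cube."""
    result = {}
    for grouping in cube_groupings(columns):
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[col] for col in grouping)] += row[measure]
        result[grouping] = dict(totals)
    return result


# Invented sample rows
rows = [
    {"product": "apples", "store": "S1", "city": "Athens", "sales": 10},
    {"product": "oranges", "store": "S1", "city": "Athens", "sales": 5},
    {"product": "apples", "store": "S2", "city": "Rome", "sales": 7},
]
columns = ("product", "store", "city")
aggregates = cube(rows, columns, "sales")

print(len(cube_groupings(columns)))      # number of group-bys for 3 columns
print(aggregates[("city",)])             # group by city
print(aggregates[()])                    # the grand total
```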
The CUBE operator
Jim Gray, Adam Bosworth, Andrew Layman (Microsoft), Hamid Pirahesh (IBM)
[Figure: The Data Cube and The Sub-Space Aggregates. Four panels of increasing dimensionality: Aggregate (a single Sum); Group By (with total), by Color (RED, WHITE, BLUE) plus Sum; Cross Tab, by Make (Chevy, Ford) and by Color plus Sum; and the data cube over Make (CHEVY, FORD), Year (1990 to 1993) and Color, with its aggregates By Make, By Year, By Color, By Make & Year, By Make & Color, By Color & Year, and Sum.]
Processing Star Queries on
Hierarchically-Clustered Fact Tables
Nikos Karayannidis (1), Aris Tsois (1), Timos Sellis (1), Roland Pieringer (2), Volker Markl (4), Frank Ramsak (3), Robert Fenk (3), Klaus Elhardt (2), Rudolf Bayer (5)
(1) I.C.C.S. - N.T.U.Athens
(2) TransAction Software GmbH
(3) FORWISS
(4) IBM Almaden Research Center
(5) T.U.München
Key Points
Star queries are ubiquitous in DW and OLAP
New trend: Hierarchically clustered star-schemata
New processing framework
New optimization challenges
Implemented in TransBase HyperCube
Tested with real-world application (up to 40
speed-up)
EDITH
EDITH - the European Development on Indexing
Techniques for Databases with Multidimensional
Hierarchies
Information Society Technologies Programme
(IST) - grant No. IST-1999-20722.
http://edith.in.tum.de
Motivation – Problem statement
Not just report! What about ad hoc queries?
OLAP requires efficient processing of ad-hoc
star queries
Major bottleneck: processing of the star-join
Cartesian product, bitmap indexes, …
NOT enough:
Efficiency requires good physical clustering
of data
Hierarchical Clustering
A new trend:
hierarchical clustering of fact table data through
path-based surrogate keys
Exploitation of multidimensional indexes
Star join transforms to multidimensional range query
The overall processing framework of star queries
changes radically
EDBT Summer School - Cargese 2002 94
Contributions
Present a novel processing framework for star
queries over hierarchically clustered data
Discuss optimizations
Realization of our technology in a real system
Evaluation on a real-world application has
shown significant speed-ups.
EDBT Summer School - Cargese 2002 95
Hierarchical Surrogate Keys
Apply hierarchical encoding on each dimension
table
System-assigned h-surrogate key:
e.g., oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”)
Implementation based on underlying physical
data structure
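One way to picture a path-based surrogate key is as the per-level ordinals of a hierarchy path packed into a single integer, so that every prefix of the path maps to one contiguous key interval. The sketch below assumes fixed bit widths per level and ordinals assigned in insertion order; `HierarchyEncoder`, `LEVEL_BITS`, and the bit layout are hypothetical and are not taken from the slides or from TransBase.

```python
# Hypothetical fixed bit widths per hierarchy level (country, city, store).
LEVEL_BITS = (4, 6, 8)


class HierarchyEncoder:
    """Assigns per-level ordinals and packs them into one integer key."""

    def __init__(self, level_bits=LEVEL_BITS):
        self.level_bits = level_bits
        # children[parent_path] maps a member name to its ordinal among siblings
        self.children = {}

    def _ordinal(self, parent_path, member):
        siblings = self.children.setdefault(parent_path, {})
        if member not in siblings:
            siblings[member] = len(siblings)
        return siblings[member]

    def encode(self, path):
        """Pack a full path such as ("Greece", "Athens", "Store5") into one key."""
        hsk = 0
        for depth, (member, bits) in enumerate(zip(path, self.level_bits)):
            ordinal = self._ordinal(tuple(path[:depth]), member)
            if ordinal >= 1 << bits:
                raise ValueError("too many members at level %d" % depth)
            hsk = (hsk << bits) | ordinal
        return hsk

    def prefix_range(self, prefix):
        """Return the inclusive (lo, hi) interval of all keys below a known path prefix."""
        hsk = 0
        for depth, (member, bits) in enumerate(zip(prefix, self.level_bits)):
            hsk = (hsk << bits) | self.children[tuple(prefix[:depth])][member]
        free_bits = sum(self.level_bits[len(prefix):])
        lo = hsk << free_bits
        return lo, lo + (1 << free_bits) - 1


enc = HierarchyEncoder()
store5 = enc.encode(("Greece", "Athens", "Store5"))
store7 = enc.encode(("Greece", "Athens", "Store7"))
patras = enc.encode(("Greece", "Patras", "Store1"))
munich = enc.encode(("Germany", "Munich", "Store2"))
athens_lo, athens_hi = enc.prefix_range(("Greece", "Athens"))
greece_lo, greece_hi = enc.prefix_range(("Greece",))
```

Under this encoding a restriction such as "all stores in Athens" becomes a single interval test on the key, which is what allows a fact table clustered on the key to answer it with a range scan.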
EDBT Summer School - Cargese 2002 96
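Database Schema
[Figure: star schema. Fact table FT with measures m1, m2 and dimension keys d1, d2, …, dN, extended with the h-surrogates hsk1, hsk2, …, hskN. Dimension tables D1 (h1 as key; h2, h3, f1, f2; hsk1), D2 (h1 as key; h2, h3, h4; hsk2), …, DN (h1 as key; h2, f1, f2, f3; hskN). The hi are the hierarchy-level attributes of a dimension and the fi its remaining attributes.]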
EDBT Summer School - Cargese 2002 97
Star Queries
SELECT {Di.hj}, {Di.fj}, {aggr(…) AS AMj}
FROM FT, D1, …, DN
WHERE FT.d1 = D1.h1 AND …      -- star-join conditions
  LOCPRED({D1}) AND …          -- dimension restrictions
  MPRED({FT.mi})               -- measure restrictions
GROUP BY {Di.hj}, {Di.fj}, {FT.mj}
HAVING <having clause>
ORDER BY <ordering fields>
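A concrete instance of the star query template may make its three kinds of conditions easier to see. The sketch below uses Python's built-in `sqlite3` module with a made-up two-dimension schema; the table names, column names, and data are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical star schema: one fact table and two dimension tables.
    CREATE TABLE location (store TEXT PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE date_dim (day TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE sales (store TEXT, day TEXT, amount REAL);

    INSERT INTO location VALUES
        ('Store5', 'Athens', 'Greece'),
        ('Store7', 'Athens', 'Greece'),
        ('Store9', 'Munich', 'Germany');
    INSERT INTO date_dim VALUES
        ('2002-01-10', '2002-01', 2002),
        ('2002-01-20', '2002-01', 2002),
        ('2002-02-05', '2002-02', 2002);
    INSERT INTO sales VALUES
        ('Store5', '2002-01-10', 100.0),
        ('Store5', '2002-01-20', 50.0),
        ('Store7', '2002-02-05', 70.0),
        ('Store9', '2002-01-10', 500.0),
        ('Store5', '2002-02-05', 5.0);
""")

STAR_QUERY = """
    SELECT l.city, d.month, SUM(s.amount) AS total_amount
    FROM sales AS s, location AS l, date_dim AS d
    WHERE s.store = l.store AND s.day = d.day   -- star-join conditions
      AND l.country = 'Greece'                  -- dimension restriction
      AND s.amount >= 10                        -- measure restriction
    GROUP BY l.city, d.month
    ORDER BY d.month
"""
result = conn.execute(STAR_QUERY).fetchall()
```

The dimension restriction is placed on a higher hierarchy level (country) than the join key (store), which is the typical shape of an ad hoc star query.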
EDBT Summer School - Cargese 2002 98
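The Abstract Processing Plan
[Figure: abstract processing plan for a star query, in two phases. H-surrogate processing: Create_Range operators over the restricted dimension tables (D1, …, Di) produce h-surrogate ranges. Main execution phase: MD Range Access on FT using those ranges, followed by Residual Joins with dimension tables (Dj, …, Dn), then Group-Select and Order_By.]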
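The two phases of the plan can be mimicked on in-memory data. This is a toy sketch only: the multidimensional index access is replaced by a filtered scan, and the tables, the surrogate values, and the function names (`create_range`, `md_range_access`, `run_plan`) are hypothetical stand-ins for the operators named on the slide.

```python
from collections import defaultdict

# Hypothetical dimension tables: h-surrogate -> attributes. Surrogates are
# assumed to be assigned so that members of one parent are contiguous.
LOCATION = {
    0: {"store": "Store5", "city": "Athens", "country": "Greece"},
    1: {"store": "Store7", "city": "Athens", "country": "Greece"},
    2: {"store": "Store1", "city": "Patras", "country": "Greece"},
    3: {"store": "Store9", "city": "Munich", "country": "Germany"},
}
DATE = {
    0: {"day": "2002-01-10", "month": "2002-01"},
    1: {"day": "2002-01-20", "month": "2002-01"},
    2: {"day": "2002-02-05", "month": "2002-02"},
}
# Fact tuples: (hsk_location, hsk_date, amount)
FACTS = [(0, 0, 100.0), (0, 1, 50.0), (1, 2, 70.0), (2, 0, 20.0), (3, 0, 500.0)]


def create_range(dimension, predicate):
    """H-surrogate processing: turn a dimension restriction into key intervals."""
    keys = sorted(k for k, attrs in dimension.items() if predicate(attrs))
    ranges = []
    for k in keys:
        if ranges and ranges[-1][1] == k - 1:
            ranges[-1][1] = k      # extend the current interval
        else:
            ranges.append([k, k])  # start a new interval
    return [tuple(r) for r in ranges]


def md_range_access(facts, loc_ranges, date_ranges):
    """Stand-in for a multidimensional index lookup: a filtered scan."""
    def inside(value, ranges):
        return any(lo <= value <= hi for lo, hi in ranges)
    return [f for f in facts if inside(f[0], loc_ranges) and inside(f[1], date_ranges)]


def run_plan(loc_pred, date_pred, group_attrs):
    loc_ranges = create_range(LOCATION, loc_pred)
    date_ranges = create_range(DATE, date_pred)
    totals = defaultdict(float)
    for hsk_loc, hsk_date, amount in md_range_access(FACTS, loc_ranges, date_ranges):
        joined = {**LOCATION[hsk_loc], **DATE[hsk_date]}         # residual joins
        totals[tuple(joined[a] for a in group_attrs)] += amount  # group-select
    return sorted(totals.items())                                # order by


greece_ranges = create_range(LOCATION, lambda a: a["country"] == "Greece")
answer = run_plan(
    lambda a: a["country"] == "Greece",
    lambda d: d["month"] == "2002-01",
    ("city",),
)
```

The dimension tables are touched twice: first to translate restrictions into key ranges, and later (the residual join) only to fetch the attributes needed for grouping and output.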
EDBT Summer School - Cargese 2002 99
Optimization Issues
Optimizing h-surrogate processing
Single tuple retrieval for hierarchical prefix path
restrictions
Exploit composite index on (hm, hm-1,…, h1, hski)
Pre-grouping transformation
Reduces tuples for residual join and speeds up
grouping
Heuristic algorithm based on query syntax
EDBT Summer School - Cargese 2002 100
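Pre-grouping Transformation
[Figure: two plans for a query grouped by month, store. Original plan: MD Range Access on F, Residual Joins with Date and Location, then Group Select by month, store. Transformed plan: MD Range Access on F, an additional Group Select by hsk1, hsk2, then Residual Joins with Date and Location, then Group Select by month, store.]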
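The effect of the transformation can be checked on a toy example: because each h-surrogate determines its month or store, summing by (hsk1, hsk2) first and joining afterwards gives the same totals while joining fewer tuples. The data and function names below are illustrative assumptions, and the sketch covers only a decomposable aggregate (SUM).

```python
from collections import defaultdict

# Hypothetical dimensions keyed by h-surrogate; each key determines its ancestors.
DATE = {0: "2002-01", 1: "2002-01", 2: "2002-02"}  # hsk1 -> month
LOCATION = {0: "Store5", 1: "Store7"}              # hsk2 -> store
# Fact tuples as returned by the range access: (hsk1, hsk2, amount)
FACTS = [(0, 0, 10.0), (0, 0, 5.0), (1, 0, 7.0), (2, 1, 3.0), (2, 1, 4.0), (0, 0, 1.0)]


def group_after_join(facts):
    """Original plan: join every fact tuple, then group by (month, store)."""
    totals = defaultdict(float)
    joins = 0
    for hsk1, hsk2, amount in facts:
        month, store = DATE[hsk1], LOCATION[hsk2]
        joins += 1
        totals[(month, store)] += amount
    return dict(totals), joins


def group_before_join(facts):
    """Transformed plan: pre-group by (hsk1, hsk2), join the groups, group again."""
    partial = defaultdict(float)
    for hsk1, hsk2, amount in facts:
        partial[(hsk1, hsk2)] += amount
    totals = defaultdict(float)
    joins = 0
    for (hsk1, hsk2), subtotal in partial.items():
        month, store = DATE[hsk1], LOCATION[hsk2]
        joins += 1
        totals[(month, store)] += subtotal
    return dict(totals), joins


late_totals, late_joins = group_after_join(FACTS)
early_totals, early_joins = group_before_join(FACTS)
```

In this toy run both plans return the same totals, but the pre-grouped plan performs one dimension lookup per distinct (hsk1, hsk2) pair instead of one per fact tuple.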
EDBT Summer School - Cargese 2002 101
Performance Evaluation
Greek electronic retailer data:
3 dimensions (1.4M, 27K, and 2.5K tuples)
Fact table: 15.5M tuples (1.5GB)
220 ad hoc star queries from a real application
Compare 3 plans: STAR, AEP and OPT
FT selectivity range: 0.0% to 5.0% of FT
Result:
AEP vs. STAR: average speed-up of 20
OPT vs. STAR: average speed-up of 40
EDBT Summer School - Cargese 2002 102
Summary
Efficient star query processing is a must in DW and OLAP
New trend: hierarchically clustered star schemata
Presented a novel processing framework for star
queries over hierarchically clustered data
Discussed optimization issues
Fully implemented our technology in TransBase
Evaluation with a real-world application has shown
significant speed-ups
EDBT Summer School - Cargese 2002 103
Future Work
Extensive experimental evaluation
Investigate applicability of our processing
framework to other areas
Further optimization issues: reducing the number
of produced h-surrogate ranges
EDBT Summer School - Cargese 2002 104
Metadata Repository
[Figure: data warehouse architecture with the Metadata Repository at its center. Components: Sources, DSA, DW, Data Marts, Reporting / OLAP tools. Roles: Administrator (shown twice), Designer, End User. "Quality Issues" is annotated at four points of the architecture.]
EDBT Summer School - Cargese 2002 105
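The Lack of Conceptual Support
[Figure: data flow from Information Source through Wrapper/Loader to the Data Warehouse, and through Aggregation/Customization to the Multidim. Data Mart, with steps numbered (1) to (5). Above the flow: OLTP and Operational Department on the source side, Enterprise in the middle, OLAP and Analyst on the mart side, connected by an "Observation" link marked "?". Quality annotations: Source Quality, DW Quality, Mart Quality.]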
EDBT Summer School - Cargese 2002 106
Conceptual-Logical-Physical
[Figure: the warehouse flow drawn in three perspectives. Conceptual Perspective: Operational Department Model, Enterprise Model, Client Model, with OLTP and OLAP and an "Observation" link marked "?". Logical Perspective: Source Schema, DW Schema, Client Schema, connected by Wrapper and Aggregation/Customization. Physical Perspective: Source DataStore, DW DataStore, Client DataStore, connected by Transportation Agents.]
EDBT Summer School - Cargese 2002 107
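The DWQ Approach
[Figure: the DWQ framework as a grid of levels (Client Level, DW Level, Source Level) against perspectives (Conceptual, Logical, Physical), drawn at three abstraction layers linked by "in" relationships: Meta Model Level, Models/Meta Data Level, and Real World. Alongside the grid: a process dimension (Process Meta Model, Process Model, Processes, with a "uses" link) and a quality dimension (Quality Metamodel, Quality Model, Quality Measurements).]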
EDBT Summer School - Cargese 2002 108
DWQ Repository
The DWQ approach for managing data warehouse
quality is organized around an extended, semantically
rich metadata repository (prototypically implemented
using ConceptBase), which controls all relevant
metadata
We have developed meta models for DW architecture,
quality, processes and evolution
Metadata can be provided and queried by external
tools; via active rules, external tools can even be
activated
[Jarke et al., CAiSE98]
EDBT Summer School - Cargese 2002 109
DWQ Metadata Framework
[Figure: DWQ metadata framework in four columns: meta level, conceptual level, logical level, physical level. Meta level: MetaModel, Interface, Schema, Store. Conceptual level: Enterprise Model, Client Model_1 … Client Model_n, Source Model_1 … Source Model_m. Logical level: Client Schema_1 … Client Schema_n, DW schema, Source Schema_1 … Source Schema_m, with Mediators and Wrappers. Physical level: DW store and Source Data Store_1 … Source Data Store_n, fed from the Sources. Legend: conceptual/logical mapping, physical/logical mapping, conceptual link, logical link, data flow.]
EDBT Summer School - Cargese 2002 110
Quality Model:
An Adapted GQM Approach
[Figure: adapted Goal-Question-Metric model. Stakeholders (DW Designers, Decision Maker, DW Administrator) establish a Quality Goal; a goal is evaluated by Quality Queries; Quality Factors, defined on DW Objects, Processes and Data and obtained through Measurement Processes, provide evidence for the queries. The model is held as Metadata for DW Architecture, Quality and Processes.]
[Jarke et al., IS99]
EDBT Summer School - Cargese 2002 111
Quality Factors by Perspective
Conceptual Perspective
• Completeness
• Redundancy
• Consistency
• Correctness
• Traceability
of Concepts and Models
Logical Perspective
• Usefulness of schemas
• Correctness of mappings
• Interpretability of schemas
Physical Perspective
• Efficiency
• Interpretability of schemas
• Timeliness of stored data
• Maintainability/Usability of software components
EDBT Summer School - Cargese 2002 112
Towards Quality-Oriented DW Design
[Figure: quality-oriented design cycle driven by a Quality Goal, in four phases.]
1. Design
Define Object Types
Define Quality Factor Types
Define Object Instances & Properties
Define Metrics & Agents
2. Evaluation
Compute! Acquire values for quality factors (current status)
3. Analysis & Improvement
Produce a scenario for a goal
Analytically derive "functions"
Empirically derive "functions"
Produce expected/acceptable values
Feed values to quality scenario and play!
Discover/Refine new/old "functions"
Negotiate!
Take actions!
Decompose complex objects and iterate
4. Re-evaluation & evolution
[Vassiliadis et al., IS00]
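The evaluation step of this cycle (acquire the current value of each quality factor and compare it with its expected/acceptable values) can be sketched as a small data structure. This is only an illustration under assumed names: `QualityFactor`, `evaluate`, the two example factors, and the observation fields are hypothetical and are not part of the DWQ metamodel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class QualityFactor:
    """One measurable property of a warehouse object (all names are hypothetical)."""
    name: str
    target_object: str
    metric: Callable[[dict], float]  # the "agent" that computes the value
    acceptable: Tuple[float, float]  # expected/acceptable interval, inclusive


def evaluate(factors, observations: Dict[str, dict]):
    """Compute each factor's current value and flag whether it is acceptable."""
    report = {}
    for f in factors:
        value = f.metric(observations[f.target_object])
        lo, hi = f.acceptable
        report[f.name] = (value, lo <= value <= hi)
    return report


FACTORS = [
    QualityFactor("timeliness_hours", "sales_fact",
                  lambda obs: obs["hours_since_refresh"], (0.0, 24.0)),
    QualityFactor("completeness_ratio", "sales_fact",
                  lambda obs: 1.0 - obs["null_rows"] / obs["total_rows"], (0.95, 1.0)),
]
OBSERVATIONS = {
    "sales_fact": {"hours_since_refresh": 30.0, "null_rows": 20, "total_rows": 1000},
}
report = evaluate(FACTORS, OBSERVATIONS)
```

Factors that fall outside their interval are the natural input to the analysis and improvement phase.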
EDBT Summer School - Cargese 2002 113
DWQ Methodology : Summary
[Figure: DWQ methodology overview around the Metadata Repository. Elements: sources S1, S2, S3, …, Sn (each with relations R1, R2, R3) receiving OLTP updates; the Enterprise Model; clients C1, C2, …, Cm issuing user queries; Materialized Views (R1, R2, R3); conjunctive queries linking these elements; Refreshment; Rewriting of Aggregate Queries. Numbered steps:]
1. Conceptual Enterprise Model
2. Conceptual Source Models
3. Conceptual Client Modeling
4. Translate aggregates into OLAP operations
5. Design Optimization
6. Data Reconciliation
EDBT Summer School - Cargese 2002 114
Key Formal Results on Quality
Impacts
conceptual: description logic theory and tools for
complete reasoning about the relationships between
source, enterprise, and client models
conceptual/logical: containment, satisfiability, and
rewriting of queries over views with & without
aggregates
logical/physical: incremental cost-based optimization of
view materializations
physical: detailed impact analysis of replication and
refreshment policies
EDBT Summer School - Cargese 2002 115
ConceptBase User Interface
EDBT Summer School - Cargese 2002 116
DW Quality Example
EDBT Summer School - Cargese 2002 117
Metadata Standards
Metadata Coalition
MetaData Interchange Specification (MDIS)
Open Information Model (OIM)
OMG (latest development)
Common Warehouse Metamodel (CWM)
Microsoft Repository
EDBT Summer School - Cargese 2002 118
Summary
OLAP - Multidimensional data
Drill down, Roll Up, Pivot, Slice and Dice
Data warehouse architecture
Warehouse operational process
Loading - Cleaning - Serving (ROLAP/MOLAP)
Refreshing
Warehouse server requirements
Star-Snowflake schemes
Specialized indexes: BitMap - Join Indexes
EDBT Summer School - Cargese 2002 119
Research issues
Data cleaning
focus on schema inconsistencies
Data warehouse design
summary tables, indexing
Query Processing
use summary data, statistics mgt, dynamic optimization
Warehouse Management
resource management, runaway queries
incremental refresh techniques
EDBT Summer School - Cargese 2002 120
References
W. H. Inmon: Building the Data Warehouse (2nd Edition),
John Wiley, 1996.
R. Kimball: The Data Warehouse Toolkit, John Wiley,
1996.
H. Garcia-Molina, Data Warehousing Overview, class
notes, Stanford University.
S. Chaudhuri & U. Dayal: Data Warehousing and OLAP
for Decision Support - VLDB’96 tutorial
Oracle, IBM, Redbrick, Sybase, Informix, Tandem,
Teradata, HP, … web sites.
The DWQ project: http://www.dbnet.ece.ntua.gr/~dwq/

Assignment_4_ArianaBusciglio Marvel(1).docx
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 

DWH Concepts

  • 1. Design and Maintenance of Data Warehouses (ABIS 2002 – Timos Sellis; EDBT Summer School - Cargese 2002). Timos Sellis, National Technical University of Athens, KDBS Laboratory, http://www.dbnet.ece.ntua.gr/. Many thanks to P. Vassiliadis and A. Tsois. Slide 2: Outline. What’s and Why’s for DW’s; DW architecture; DW Schema; Back End of the DW; Front End of the DW; DW Servers; Metadata Repository; Conclusions.
  • 2. Slide 3: OLTP. On-line transaction processing (OLTP) is the traditional way of using a database. Legacy systems: relational, hierarchical, network databases / COBOL applications / … Short transactions (read/update few records) with ACID properties. Normally, only the last version of the data is stored in the database. Slide 4: DSS & OLAP. Decision support systems (DSS) help the executive, manager, or analyst make faster and better decisions. Example questions: What were the sales volumes by region and product category for the last year? Will a 10% discount increase sales volumes sufficiently? On-line analytical processing (OLAP) is an element of decision support systems.
  • 3. Slide 5: OLTP vs. OLAP (Chaudhuri & Dayal @VLDB’96).
       Aspect     OLTP                                     OLAP
       User       Clerk                                    Manager
       Function   Day-to-day operations                    Decision support
       Access     Read/write                               Mostly read
       Data       Detailed, up-to-date, flat relational    Summarised, historical, multidimensional
       DB size    100MB - 1GB                              100GB - 1TB
     Slide 6: Data Warehouse. "A decision support database that is maintained separately from the organization’s operational database." (S. Chaudhuri, U. Dayal, VLDB’96 tutorial). "A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making." (W.H. Inmon, Building the Data Warehouse, 1992).
  • 4. Slide 7: Reasons for Building Data Warehouses. Semantic reconciliation: data sources are dispersed within the same organization; the same entities are encoded differently; the DW encompasses the full volume of these data under a single, reconciled schema; it keeps the history of these data, too. Slide 8: Reasons for Building Data Warehouses. Performance: OLAP applications need a different organization of data; complex OLAP queries would degrade OLTP performance. Availability: separation increases availability; the DW is possibly the only way to query the dispersed data sources.
  • 5. Slide 9: Reasons for Building Data Warehouses. Data quality: the validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules…). Errors affect at least 10% of the data in most data stores, which can waste 25-40% of resources. The DW acts as a data cleaning buffer. … and the market is there! Slide 10: The Market. Estimated sales in millions of dollars [ShTy98] (*estimates are from [Pend00]).
       Segment                                      1998     1999     2000     2001     2002   CAGR (%)
       RDBMS sales for DW                          900.0   1110.0   1390.0   1750.0   2200.0     25.0
       Data Marts                                   92.4    125.0    172.0    243.0    355.0     40.0
       ETL tools                                   101.0    125.0    150.0    180.0    210.0     20.1
       Data Quality                                 48.0     55.0     64.5     76.0     90.0     17.0
       Metadata Management                          35.0     40.0     46.0     53.0     60.0     14.4
       OLAP (including implementation services)*    2000     2500     3000     3600     4000     18.9
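The CAGR column of the market table can be checked with the usual compound-growth formula. The sketch below is only an illustration: the function name `cagr` is an assumption, and it takes the 1998 and 2002 figures from the table as the endpoints of a four-year period.

```python
def cagr(first: float, last: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (last / first) ** (1.0 / years) - 1.0

# 1998 and 2002 figures from the market table (millions of dollars).
market = {
    "RDBMS sales for DW": (900.0, 2200.0),
    "Data Marts": (92.4, 355.0),
    "ETL tools": (101.0, 210.0),
    "Data Quality": (48.0, 90.0),
    "Metadata Management": (35.0, 60.0),
    "OLAP": (2000.0, 4000.0),
}

# Growth rate in percent, rounded to one decimal as in the table.
rates = {name: round(100 * cagr(v1998, v2002, 4), 1)
         for name, (v1998, v2002) in market.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Under these assumptions the computed rates agree with the CAGR column to within rounding.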
  • 6. Slide 11: Data Warehouse Architecture: A Simple View. [Diagram labels: Client, Client, Query & Analysis, Warehouse, Metadata, Integration, Source, Source, Source.] Slide 12: Data Warehouse Architecture. [Diagram labels: Sources, Administrator, DSA, Administrator, DW, Designer, Data Marts, Metadata Repository, End User, Reporting / OLAP tools; "Quality Issues" is marked at four points.]
  • 7. Slide 13: Two / Three Tier Architecture. Warehouse database server: almost always relational (RDBMS). Data Marts / OLAP server: Relational OLAP (ROLAP) or Multidimensional OLAP (MOLAP). Clients: query and reporting tools; analysis tools / data mining tools. Slide 14: Data Warehouse Architecture. Enterprise warehouse: collects all information about subjects; requires extensive business modeling; may take years to design and build. Data Marts: departmental subsets that focus on selected subjects. Virtual warehouse: views over operational databases.
  • 8. Slide 15: How to build the DW: Top-down. Single integrated enterprise model. Reduce all sources (and clients, if necessary) to the central model. (−) Time consuming; labor intensive; slow to produce results. (−) Increases the risk of the DW project due to late delivery of results. (+) Provides a consistent, global view of the enterprise data. Slide 16: How to build the DW: Bottom-up. Build smaller data marts first. Progressively combine pairwise. (−) Fails to provide a global view of the enterprise data. (−) Possibly increases the risk, since a complete integration might prove impossible late in the project. (+) Early delivery of results. (+) Less time consuming, less labor intensive.
  • 9. Slide 17: Data Warehouse Back-End. [Diagram labels: Sources, Administrator, DSA, Administrator, DW, Designer, Data Marts, Metadata Repository, End User, Reporting / OLAP tools; "Quality Issues" is marked at four points.] Slide 18: Design: Global-As-View Integration. Preintegration: which schemata to integrate and in which order. Schema Comparison: determine the correlations among concepts of different schemata and detect possible naming, semantic, structural, … conflicts. Schema Conforming: conflict resolution for heterogeneous schemata. Schema Merging and Restructuring: production of a single conformed schema.
  • 10. Slide 19: Design: Local-As-View Integration. Works the other way around. The main deliverable is a central conceptual model, produced by interactively examining user needs and existing schemata. All source and client schemata are expressed in terms of the central data warehouse schema and not the other way around. Slide 20: DW = Materialized Views? [Diagram, "Simple View of a DW", labels: Sources S1_PARTSUPP and S2_PARTSUPP; union (U); DW.PARTSUPP; Aggregate1 (PKEY, DAY; MIN(COST)); Aggregate2 (PKEY, MONTH; AVG(COST)); views V1, V2; TIME (DW.PARTSUPP.DATE, DAY).]
  • 11. Slide 21: DW ≠ Materialized Views! [Diagram of the flow from the Sources through the DSA to the DW. Labels: S1_PARTSUPP, S2_PARTSUPP; FTP1, FTP2; DS.PS_NEW1, DS.PS_OLD1, DIFF1 (DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY); DS.PS_NEW2, DS.PS_OLD2, DIFF2 (DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY); DS.PS1, DS.PS2; NotNULL; Add_SPK1 (SUPPKEY=1); Add_SPK2 (SUPPKEY=2); SK1 (DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY); SK2 (DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY); $2€ (COST); A2EDate (DATE); AddDate (DATE=SYSDATE); CheckQTY (QTY>0); union (U); "Log rejected" at five points; DW.PARTSUPP; Aggregate1 (PKEY, DAY; MIN(COST)); Aggregate2 (PKEY, MONTH; AVG(COST)); views V1, V2; TIME (DW.PARTSUPP.DATE, DAY).] Slide 22: Operational Processes. Data extraction, transformation & load: originally treated as the ‘refreshment’ problem; requires transforming, cleaning, and integrating data from different sources. Build/refresh derived data and views. Service queries. Monitor the warehouse.
  • 12. Slide 23: The Refreshment Problem. Propagate updates on source data to the warehouse. Issues: when to refresh (on every update; periodically; refresh policy set by administrator) and how to refresh. Slide 24: Refreshment Techniques. Full extract from base tables. Incremental techniques: detect changes on base tables; snapshots; transaction shipping; active rules; logical correctness; transactional correctness. Currently, in practice we use ETL tools/scripts (see next)…
  • 13. Slide 25: Data Extraction. Can take a snapshot or differentials (new/deleted/updated) of source data. Transfer, encryption, and compression are also involved. Time window and source system overhead are involved. In general, faced with the requirement of minimal changes to the existing configuration of the sources. Slide 26: Data Transformation. Schema Reconciliation: conflicts at the schema level (different attributes for the same information). Value Identification & Reconciliation: different (same) id’s for same (different) objects (use surrogate keys).
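A differential extract (new/deleted/updated) can be computed by comparing the previous snapshot of a source table with the current one. The following is a minimal sketch under stated assumptions: both snapshots fit in memory as dictionaries keyed by primary key, and the names `snapshot_diff`, `ps_old`, `ps_new` and the sample rows are hypothetical.

```python
def snapshot_diff(old: dict, new: dict):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, deleted, updated) as dictionaries of rows.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated

# Hypothetical snapshots of a part-supplier source table, keyed by PKEY.
ps_old = {101: {"cost": 10.0, "qty": 5}, 102: {"cost": 7.5, "qty": 0}}
ps_new = {101: {"cost": 11.0, "qty": 5}, 103: {"cost": 3.2, "qty": 9}}

inserted, deleted, updated = snapshot_diff(ps_old, ps_new)
print("inserted:", inserted)
print("deleted:", deleted)
print("updated:", updated)
```

Only the three differential sets need to be shipped to the warehouse, which is what keeps the time window and the overhead on the source system small.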
  • 14. Slide 27: Data Cleaning. Offending data: duplicates, integrity/business rules/format violations … Incompleteness: missing data. Renicing: esp. addresses. Slide 28: Data Loading. This final stage may still require additional preprocessing: sorting, summarizing, performing computations. Issues: huge volumes of data to be loaded; small time window; when to build indexes and summary tables; restart after failure with no loss of data integrity.
  • 15. Slide 29: Loading Techniques. Cannot use the SQL language interface to update or append data: record-at-a-time processing is too slow since it uses random disk I/O, and it can cause the rollback segment or log file to overflow. Use a batch load utility: sort input records on a clustering key; sequential I/O is 100 times faster than random I/O; build the index at the same time; use parallelism to accelerate load operations. Slide 30: Incremental Loading. Use incremental loads during refresh to reduce data volume (e.g. Redbrick): insert only updated tuples. An incremental load conflicts with queries: break it into a sequence of shorter transactions and coordinate this sequence of transactions, which must ensure consistency between base and derived tables and indices.
  • 16. Slide 31: Data Warehouse Front-End. [Diagram labels: Sources, Administrator, DSA, Administrator, DW, Designer, Data Marts, Metadata Repository, End User, Reporting / OLAP tools; "Quality Issues" is marked at four points.] Slide 32: Front End Tools. Ad hoc query and reporting (example: MS Excel, ProReports). OLAP: ‘multidimensional spreadsheet’ with pivot tables, drill down, roll up, slice, dice. Data Mining.
  • 17. Slide 33: Basic ideas for OLAP. Several numeric measures are analyzed: sales, budget, revenue, inventory. Dimensions are the contexts in which a measure appears. Example: store, product, and date information associated with a sale; each context is a dimension and the measure is a point in a multi-dimensional world. Slide 34: Basic ideas for OLAP. Nature of analysis: aggregation (total sales, percent-to-total); comparison (budget vs. expense); ranking (top 10); access to detailed and aggregate data; complex criteria specification; visualization.
  • 18. Slide 35: Basic ideas for OLAP. Attributes: information associated with a dimension (example: owner of store, county in which the store is located). Attribute hierarchies: attributes of a dimension are often related in a hierarchical way (example: street, city, country). Slide 36: Multidimensional Data. Dimensions: Product, Region, Date. [Figure: cube of Sales volume with axes Month, Region, Product.] Hierarchical summarization paths: Industry, Category, Product; Country, Region, City, Office; Year, Quarter, Month, Week, Day.
  • 19. Slide 37: Operations. Roll up: summarize data. Drill down: go from higher level summary to lower level summary or detailed data. Slice and dice: select and project. Pivot: re-orient cube. Slide 38: Roll up.
       Sales volume by quarter:
       Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
       Electronics   $5,2        $8,9        $5,6        $7,2
       Toys          $1,9        $0,75       $1,4        $0,4
       Clothing      $2,3        $4,6        $2,6        $4,6
       Cosmetics     $1,1        $1,5        $1,1        $0,5
       Sales volume rolled up to Year 1996:
       Products      Store1      Store2
       Electronics   $14,1       $12,8
       Toys          $2,65       $1,8
       Clothing      $6,9        $7,2
       Cosmetics     $2,6        $1,6
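The roll up on slide 38 sums the quarter level away while keeping the store and product dimensions. A small sketch of that operation, using the slide's figures written with decimal points; the names `QUARTERLY` and `roll_up_to_year` are hypothetical, and `Decimal` is used only to avoid floating-point noise in the sums.

```python
from decimal import Decimal

PRODUCTS = ["Electronics", "Toys", "Clothing", "Cosmetics"]

# (store, quarter) -> sales volume per product, in the order of PRODUCTS.
QUARTERLY = {
    ("Store1", "Q1"): ["5.2", "1.9", "2.3", "1.1"],
    ("Store1", "Q2"): ["8.9", "0.75", "4.6", "1.5"],
    ("Store2", "Q1"): ["5.6", "1.4", "2.6", "1.1"],
    ("Store2", "Q2"): ["7.2", "0.4", "4.6", "0.5"],
}

def roll_up_to_year(quarterly):
    """Sum over the quarter level, keeping the store and product dimensions."""
    yearly = {}
    for (store, _quarter), values in quarterly.items():
        for product, value in zip(PRODUCTS, values):
            key = (store, product)
            yearly[key] = yearly.get(key, Decimal("0")) + Decimal(value)
    return yearly

yearly = roll_up_to_year(QUARTERLY)
for (store, product), total in sorted(yearly.items()):
    print(store, product, total)
```

Drill down is the inverse navigation: it needs the finer-grained data to be available, since the yearly totals alone cannot be split back into quarters.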
  • 20. Slide 39: Drill down. Starting from the sales volume by quarter, store, and product category (Electronics, Toys, Clothing, Cosmetics), the Electronics category is drilled down to its products:
       Electronics   Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
       VCR           $1,4        $2,4        $1,4        $2,4
       Camcorder     $0,6        $3,3        $0,6        $1,3
       TV            $2,0        $2,2        $2,4        $2,5
       CD player     $1,2        $1,0        $1,2        $1,0
     Slide 40: Pivot. The same sales volumes are re-oriented: the table grouped by quarter (Q1, Q2) with the stores as the second axis becomes a table grouped by store (Store1, Store2) with the quarters as the second axis.
       Products      Store1 Q1   Store1 Q2   Store2 Q1   Store2 Q2
       Electronics   $5,2        $8,9        $5,6        $7,2
       Toys          $1,9        $0,75       $1,4        $0,4
       Clothing      $2,3        $4,6        $2,6        $4,6
       Cosmetics     $1,1        $1,5        $1,1        $0,5
  • 21. Slide 41: Slice and Dice. From the sales volume by quarter, store, and product category, select Store1 and the products Electronics and Toys:
       Products      Store1 Q1   Store1 Q2
       Electronics   $5,2        $8,9
       Toys          $1,9        $0,75
     Slide 42: Data Warehouse Server. [Diagram labels: Sources, Administrator, DSA, Administrator, DW, Designer, Data Marts, Metadata Repository, End User, Reporting / OLAP tools; "Quality Issues" is marked at four points.]
  • 22. Slide 43: Data Warehouse Servers - Outline. Server Technology: ROLAP & MOLAP; Indexing Techniques; Query Processing and Optimization. Slide 44: Database Servers. Relational and Specialized Relational DBMS; Relational OLAP (ROLAP) DBMS; Multidimensional OLAP (MOLAP) DBMS.
  • 23. Slide 45: Relational DBMS. Features that support DSS: specialized indexing techniques; specialized join and scan methods; data partitioning and use of parallelism; complex query processing; intelligent processing of aggregates; extensions to SQL and their processing. Slide 46: ROLAP Servers. Exploit the services of a relational engine effectively. Key functionality: needs aggregation navigation logic; ability to generate multi-statement SQL; optimization for each individual database backend. Additional services: cost-based query governor; design tool for DSS schema; performance analysis tool.
  • 24. Slide 47: Database Schemata for DW & ROLAP. Star Schema; Snowflake Schema; Fact Constellation; Aggregated data. Slide 48: Star Schema. A star schema consists of one central fact table and several denormalized dimension tables. The measures of interest for OLAP are stored in the fact table (e.g. Dollar Amount, Units in the table SALES). For each dimension of the multidimensional model there exists a dimension table (e.g. Geography, Product, Time, Account) with all the levels of aggregation and the extra properties of these levels.
  • 25. Slide 49: Star Schema (Stanford Technology Group, Inc., 1996). Fact table SALES (Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units). Dimension tables: Geography (Geography Code, Region Code, Region Manager, State Code, City Code, .....); Product (Product Code, Product Name, Brand Code, Brand Name, Prod. Line Code, Prod. Line Name); Time (Time Code, Quarter Code, Quarter Name, Month Code, Month Name, Date); Account (Account Code, KeyAccount Code, KeyAccountName, Account Name, Account Type, Account Market). Slide 50: Snowflake Schema. The normalized version of the star schema. Explicit treatment of dimension hierarchies (each level has its own table). Easier to maintain, slower in query answering.
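A star schema of the kind on slide 49 can be tried out with Python's built-in `sqlite3` module. This is a reduced sketch, not the schema from the slide: it keeps only two dimensions, the table and column names are simplified, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables: every hierarchy level is a column.
CREATE TABLE geography (geography_code INTEGER PRIMARY KEY, region_code TEXT, city_code TEXT);
CREATE TABLE time_dim (time_code INTEGER PRIMARY KEY, quarter_name TEXT, month_name TEXT);
-- Central fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE sales (
    geography_code INTEGER REFERENCES geography(geography_code),
    time_code INTEGER REFERENCES time_dim(time_code),
    dollar_amount REAL,
    units INTEGER
);
INSERT INTO geography VALUES (1, 'North', 'A'), (2, 'South', 'B');
INSERT INTO time_dim VALUES (1, 'Q1', 'Jan'), (2, 'Q2', 'Apr');
INSERT INTO sales VALUES (1, 1, 100.0, 10), (1, 2, 50.0, 5), (2, 1, 70.0, 7);
""")

# A star query: join the fact table with its dimensions, then aggregate.
rows = conn.execute("""
    SELECT g.region_code, t.quarter_name, SUM(s.dollar_amount), SUM(s.units)
    FROM sales AS s
    JOIN geography AS g ON g.geography_code = s.geography_code
    JOIN time_dim AS t ON t.time_code = s.time_code
    GROUP BY g.region_code, t.quarter_name
    ORDER BY g.region_code, t.quarter_name
""").fetchall()

for row in rows:
    print(row)
```

In a snowflake variant the `region_code` column would move to its own Region table, adding one more join to the same query.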
  • 26. Slide 51: Snowflake Schema (Stanford Technology Group, Inc., 1996). Fact table SALES (Postal Code, Time Code, Account Code, Product Code, Dollar Amount, Units). Time dimension: Time (Time Code, Quarter Code, Month Code); Quarter (Quarter Code, QuarterName); Month (Month Code, Month Name). Account dimension: Account (Account Code, KeyAccount Code); Account attributes (Account Code, AccountName); KeyAccount (KeyAcc Code, KeyAcc Name). Geography dimension: Geography (Postal Code, Region Code, State Code, City Code); Region (Region Code, Region Mgr); State (State Code, State Name); City (City Code, City Name). Product dimension: Product (Product Code, Prod Line Code, Brand Code); Product (Product Code, ProductName); Brand (Brand Code, Brand Name); ProdLine (ProdLineCode, ProdLineName). Slide 52: Fact Constellation. Multiple fact tables that share many dimension tables. Example: projected expense and the actual expense may share dimensional tables.
  • 27. Slide 53: Aggregated Tables. In addition to base fact and dimension tables, the data warehouse keeps aggregated (summary) data for efficiency. Two approaches: store as separate summary fact and dimension tables, or add to the existing base tables. Slide 54: Aggregated Tables.
       Sales table:
       RID   City     Amount
       1     Athens   $100
       2     N.Y.     $300
       3     Rome     $120
       4     Athens   $250
       5     Rome     $180
       6     Rome     $65
       7     N.Y.     $450
       Approach 1, separate sum-table (City-dimension sum table):
       City     Amount
       Athens   $350
       N.Y.     $750
       Rome     $365
       Approach 2, extend existing base tables (Extended Sales table):
       RID   City     Amount   Level
       1     Athens   $100     NULL
       2     N.Y.     $300     NULL
       3     Rome     $120     NULL
       4     Athens   $250     NULL
       5     Rome     $180     NULL
       6     Rome     $65      NULL
       7     N.Y.     $450     NULL
       8     Athens   $350     City
       9     N.Y.     $750     City
       10    Rome     $365     City
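Both approaches on slide 54 can be reproduced in a few lines. The sketch below uses the Sales rows from the slide; the function name `city_totals` and the tuple layout are assumptions made for illustration.

```python
# Base Sales table from the slide: (RID, City, Amount).
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]

def city_totals(rows):
    """Aggregate the Amount measure to the City level."""
    totals = {}
    for _rid, city, amount in rows:
        totals[city] = totals.get(city, 0) + amount
    return totals

# Approach 1: keep the aggregates in a separate summary table.
summary = city_totals(sales)

# Approach 2: append the aggregates to the base table, marked by a Level column
# (None plays the role of NULL for the detail rows).
extended = [(rid, city, amount, None) for rid, city, amount in sales]
next_rid = len(sales) + 1
for offset, (city, total) in enumerate(sorted(summary.items())):
    extended.append((next_rid + offset, city, total, "City"))

print(summary)
for row in extended:
    print(row)
```

With the second approach every query on the detail data has to filter on the Level column, otherwise the aggregate rows would be counted twice.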
  • 28. Slide 55: MOLAP Servers. The storage model is an n-dimensional array. Very fast in computations and OLAP operations. Normally they require pre-computation of the available cubes. Compression of data to save storage space. Currently: 98% of the market for client tools. Slide 56: SISYPHUS: A Chunk-Based Storage Manager for OLAP Cubes. PhD work of Nikos Karayannidis, National Technical University of Athens (NTUA).
  • 29. Slide 57: ERATOSTHENES project. ERATOSTHENES is a specialized database management system for OLAP cubes, which is under development. In the context of ERATOSTHENES, a prototype storage manager for OLAP cubes, called SISYPHUS, has been developed. [Diagram: Storage Engine (SISYPHUS), Processing Engine, Presentation Engine.] Slide 58: Why does OLAP pose new requirements to storage management? Small response time: good physical clustering + efficient access paths. Multidimensionality: md-storage structures, address by location. Hierarchies: access paths, clustering. Sparseness: not random but according to hierarchies.
  • 30. Slide 59: Architecture: levels of abstraction in SISYPHUS. [Diagram labels: SSM (record-oriented storage management); File Manager (bucket-oriented file management); Logging/Recovery; Buffer Manager (buffer management); Access Manager (chunk-oriented file management); Cube Access Methods; OLAP Processing. Interfaces: record-oriented access, bucket-oriented access, chunk-oriented access, cell-oriented access.] Slide 60: Dimension data encoding. [Diagram: LOCATION dimension with levels City, Region, Country; members CityA, CityB, CityC, CityD with order-codes 0, 1, 2, 3; RegionA, RegionB with order-codes 0, 1; CountryA with order-code 0; example member-code 0.1.2.]
  • 31. Slide 61: A chunk-oriented file system: the hierarchically chunked cube. Use the bucket file system. Chunking method: partition the data space by forming a hierarchy of chunks that is based on the dimension hierarchies. [Diagram; level labels as extracted: continent, city, region, country; item type, category, item, Pseudo; order-code ranges [0..18], [0..10], [0..4], [0..2] and [0..5], [0..2], [0..2], [0..1].] Slide 62: D = 0. [Diagram: the root chunk (0,0) spans [0..18] (LOCATION) by [0..5] (PRODUCT).]
  • 32. Slide 63: D = 1. [Diagram: the cube is split into the ranges [0..5], [6..10], [11..18] on one dimension and [0..3], [4..5] on the other.] Slide 64: D = 2. [Diagram: the ranges are refined to [0..2], [3..5], [6..10], [11..14], [15..18] on one dimension and [0..1], [2..3], [4..5] on the other.]
  • 33. Slide 65: D = 3 (Max Depth). [Diagram: the ranges are refined to [0], [1..2], [3], [4..5], [6..7], [8..9], [10], [11], [12..14], [15..16], [17..18] on one dimension and [0..1], [2..3], [4..5] on the other.] Slide 66: Chunk Identifiers (chunk-ids). Chunk addressing: a unique identifier of the chunk within the cube, which also depicts the hierarchy path of the chunk. Interleave the member-codes of the pivot-level members that define a chunk (at any depth). E.g. at D = 2, LOCATION: 2.3 and PRODUCT: 1.2 give the chunk-id 2|1.3|2.
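The interleaving of member-codes on slide 66 can be sketched as follows. This is an interpretation rather than the SISYPHUS implementation: the function name `chunk_id` is hypothetical, and the use of "|" between dimensions and "." between depths is read off the slide's example (LOCATION 2.3 and PRODUCT 1.2), which is partly garbled in the transcript.

```python
def chunk_id(member_codes):
    """Interleave per-dimension member-codes level by level.

    member_codes: one dotted member-code per dimension, all of the same depth,
    e.g. ["2.3", "1.2"] for LOCATION and PRODUCT at depth D = 2.
    """
    paths = [code.split(".") for code in member_codes]
    depth = len(paths[0])
    if any(len(path) != depth for path in paths):
        raise ValueError("all dimensions must be given to the same depth")
    # zip(*paths) groups the codes of all dimensions that belong to one depth.
    return ".".join("|".join(level) for level in zip(*paths))

example = chunk_id(["2.3", "1.2"])  # LOCATION, PRODUCT
print(example)
```

Because each prefix of such an identifier names an ancestor chunk, the identifier doubles as the hierarchy path of the chunk.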
  • 34. Slide 67: Accessing the chunks of a cube. Need some chunk directory. Idea: use intermediate depth chunks as directory chunks that will guide us to the data chunks (Dmax + 1). Create a chunk-tree. Slide 68: [Diagram: chunk-tree over LOCATION and PRODUCT, from the root chunk through the directory chunks at D = 1 and D = 2 to D = 3 (Max Depth) and the grain level (data chunks); chunk labels include 00, 00.00, 00.01, 00.10, 00.11, 00.00.0P, 00.00.1P, 00.01.0P, 00.01.1P, 00.10.2P, 00.10.3P, 00.11.2P, 00.11.3P, where P marks the pseudo level.]
  • 35. Slide 69: Bucket Organization. 3 parts: bucket header, directory chunk vector, data chunk vector. Main idea: try to store whole families (i.e. sub-trees of chunks) in the same bucket! A bucket holds: A) a single sub-tree; B) many sub-trees that form a bucket region; C) a single tree of directory chunks (root bucket); or D) a single data chunk. Slide 70: Chunk organization. Implementation data structure: multidimensional arrays. They offer data addressing by location, native to cubes; they enable chunk-id exploitation, so we don’t have to store the chunk-ids; and they are FAST! Compression schemes: for data chunks, allocate only non-empty cells and maintain a bitmap; for directory chunks, full cell allocation but no allocation for empty sub-trees.
  • 36. Slide 71: Summary. Storage management in OLAP. SISYPHUS storage manager for OLAP. Chunk-oriented file system: natively multidimensional and supports hierarchies; clusters data hierarchically; is space conservative; adopts a location-based rather than a content-based data addressing scheme. Also: the data-access interface can be used for defining access paths and OLAP operations. Slide 72: Future Work. Experimental tests. Design/implementation of algorithms for typical OLAP operations. Other research issues: finding optimal bucket regions; an updating interface for common OLAP updating operations; efficient file organization for dimension data.
  • 37. Slide 73: Data Warehouse Servers - Outline. Server Technology: ROLAP & MOLAP; Indexing Techniques; Query Processing and Optimization. Slide 74: Why specialized indexing. Join-intensive queries: almost all queries demand joins of the fact table with some dimensions. Very large tables: traditional indexes become too large to be efficient. Complex queries: selections based on complex criteria. Read-intensive workload.
  • 38. Slide 75: BitMap Indexes. An alternative representation of a RID-list. Advantageous for low-cardinality domains. Represent each row of a table by a bit and the table as a bit vector. There is a distinct bit vector Bv for each value v in the domain. The j-th bit in the vector Bv is set if the j-th row of the table has the value v for the column. Slide 76: BitMap Indexes. Example: the attribute sex has values M and F; a table of 100 million people needs 2 lists of 100 million bits. Comparison, join, and aggregation operations are reduced to bit arithmetic, with dramatic improvement in processing time. Significant reduction in space and I/O (30:1).
  • 39. Slide 77: BitMap Indexes.
       Base table              Region index    Rating index
       RID  Cust Region Rating   N  S  E  W      H  M  L
       1    C1   N      H        1  0  0  0      1  0  0
       2    C2   S      M        0  1  0  0      0  1  0
       3    C3   W      L        0  0  0  1      0  0  1
       4    C4   W      H        0  0  0  1      1  0  0
       5    C5   S      L        0  1  0  0      0  0  1
       6    C6   W      L        0  0  0  1      0  0  1
       7    C7   N      H        1  0  0  0      1  0  0
     Slide 78: BitMap Indexes. Works poorly for high-cardinality domains, since the number of vectors increases. However, compression often gives good performance, since the sparsity of the vectors also increases. Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle.
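The bitmap index of slide 77 can be sketched with Python integers as bit vectors. This is a toy illustration, not how a DBMS stores bitmaps: the helper names `build_bitmap_index` and `rids` are hypothetical, and bit j-1 of each integer stands for the row with RID j.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit j is set when row j holds it."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rids(bitmap):
    """Decode a bit vector into the list of 1-based row ids."""
    return [row + 1 for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

# Region and Rating columns of customers C1..C7 from the slide.
region = ["N", "S", "W", "W", "S", "W", "N"]
rating = ["H", "M", "L", "H", "L", "L", "H"]

region_index = build_bitmap_index(region)
rating_index = build_bitmap_index(rating)

# "Region = W AND Rating = L" becomes a single bitwise AND of two vectors.
west_and_low = region_index["W"] & rating_index["L"]
matching_rids = rids(west_and_low)
print(matching_rids)
```

Counting the matches needs no access to the base table either: it is the number of set bits in the combined vector.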
  • 40. Slide 79: Join Indexes. Traditional indexes map the value in a column to a list of rows with that value. Join indexes maintain relationships between the primary key and the foreign keys. Thus, join indexes relate the values of the dimensions of a star schema to rows in the fact table. Join indexes may span multiple dimensions. Slide 80: Join Indexes. Join index for a single dimension: consider a schema with a Sales fact table and two dimensions, city and product. If there is a join index on city, then for each distinct city the index maintains a list of RIDs of the tuples recording a sale in that city. Example: the node Athens in the index points to the list of RIDs in the fact table corresponding to transactions (sales) in Athens. Join indexes can span multiple dimensions: the node (Athens, oranges) points to the transactions that took place in Athens and correspond to purchases of oranges.
  • 41. Slide 81: Join Indexes.
       Sales table:
       RID   City     Amount
       1     Athens   $100
       2     N.Y.     $300
       3     Rome     $120
       4     Athens   $250
       5     Rome     $180
       6     Rome     $65
       7     N.Y.     $450
       City table:
       City     Country   Population
       Athens   Greece    3.507.000
       Rome     Italy     3.033.000
       N.Y.     USA       17.953.000
       Index on City-Sales:
       City     RIDs
       Athens   1, 4
       Rome     3, 5, 6
       N.Y.     2, 7
     Slide 82: Data Warehouse Servers - Outline. Server Technology: ROLAP & MOLAP; Indexing Techniques; Query Processing and Optimization.
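The City-Sales join index of slide 81 maps each dimension value to the RIDs of the matching fact rows. A minimal in-memory sketch using the slide's data; the names `build_join_index` and `by_rid` are hypothetical.

```python
# Sales fact rows from the slide: (RID, City, Amount).
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]

def build_join_index(fact_rows):
    """For each city value, collect the RIDs of the fact rows that reference it."""
    index = {}
    for rid, city, _amount in fact_rows:
        index.setdefault(city, []).append(rid)
    return index

join_index = build_join_index(sales)

# Answering "total sales in Rome" touches only the fact rows listed in the index.
by_rid = {rid: amount for rid, _city, amount in sales}
rome_total = sum(by_rid[rid] for rid in join_index["Rome"])

print(join_index)
print(rome_total)
```

A multi-dimensional join index would use a tuple such as (city, product) as the key instead of the city alone.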
  • 42. Slide 83: Specialized Join Methods. Traditional systems limit themselves to binary joins, which results in many intermediate tables. For a query over many dimensions, the optimization time can be substantial. Slide 84: Specialized Join Methods. StarJoin Algorithm (Redbrick): use join indexes to identify the regions of the cartesian product that are of interest. Intelligent Scan (Redbrick): take advantage of the “read-only” environment. Parallel Join Methods.
  • 43. Slide 85: Complex Query Processing. Extensible optimization frameworks (e.g. Starburst [IBM Almaden]). Estimation of statistics (histograms, sampling). Some of the ideas useful for DSS: interleaving GroupBy and Join; merging views; propagating selection through views; optimizing nested subqueries. Slide 86: Example of Optimizing Nested Subqueries. Find all employees younger than 35 who earn more than the average of their department. Alternatives: (a) Iterate over each employee: (1) find the department of the employee, (2) compute the average salary in the department, (3) check if the employee’s salary is above the average. (b) Compute the average salary of each department; for each employee, check if his/her salary is above the corresponding average salary. (c) Find the set of all departments where at least one of the employees is younger than 35; compute the average salary of only those departments; repeat the previous step.
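The alternatives of slide 86 can be compared on a small in-memory table. The sketch below is illustrative only: the employee data and the function names are invented, and plain Python loops stand in for the query plans an optimizer would choose between.

```python
from collections import defaultdict

# Hypothetical employees: (name, age, department, salary).
employees = [
    ("Ann", 30, "A", 120), ("Bob", 50, "A", 80), ("Cy", 28, "A", 90),
    ("Dee", 40, "B", 100), ("Eli", 45, "B", 60),
    ("Fay", 33, "C", 70), ("Gus", 34, "C", 50),
]

def average(values):
    return sum(values) / len(values)

def strategy_iterate(emps):
    """Recompute the department average for every qualifying employee."""
    result = []
    for name, age, dept, salary in emps:
        if age < 35:
            dept_avg = average([s for _, _, d, s in emps if d == dept])
            if salary > dept_avg:
                result.append(name)
    return result

def strategy_precompute(emps, only_relevant=False):
    """Compute each department average once, optionally only for departments
    that have at least one employee younger than 35."""
    relevant = {d for _, age, d, _ in emps if age < 35}
    salaries = defaultdict(list)
    for _, _, dept, salary in emps:
        if not only_relevant or dept in relevant:
            salaries[dept].append(salary)
    averages = {dept: average(values) for dept, values in salaries.items()}
    return [n for n, age, d, s in emps if age < 35 and s > averages[d]]

answers = [
    strategy_iterate(employees),
    strategy_precompute(employees),
    strategy_precompute(employees, only_relevant=True),
]
print(answers)
```

All three produce the same answer; they differ in how many averages are computed, which is what makes the choice an optimization problem.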
  • 44. Rollup and Cube operators [Gray et al. 1996]
      Rollup operator for nested aggregations
        rollup product, store, city
          group by product, store, city
          group by store, city
          group by city
      Cube operator for all possible combinations
        group by product, store, city cube
          group by each subset of {product, store, city}, independently of the order of columns in the statement

    The CUBE operator
      Jim Gray, Adam Bosworth, Andrew Layman (Microsoft); Hamid Pirahesh (IBM)
      [Figure: The Data Cube and The Sub-Space Aggregates. Panels: Aggregate (Sum); Group By (with total), by Color; Cross Tab, by Make and by Color; Data Cube, by Make, Color and Year, with sub-aggregates By Make, By Color, By Year, By Make & Color, By Make & Year, By Color & Year. Values shown: Chevy, Ford; red, white, blue; 1990 to 1993.]
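The groupings produced by the two operators can be enumerated directly. The sketch below follows the slide's listing for rollup, which drops the leading column at each step; note that SQL's `GROUP BY ROLLUP (a, b, c)` instead drops trailing columns and also adds the grand total. The helper names and the sample rows are hypothetical, and `aggregate` only handles SUM.

```python
from itertools import combinations


def rollup_groupings(columns):
    """Nested groupings as listed on the slide: drop the leading column each step."""
    return [tuple(columns[i:]) for i in range(len(columns))]


def cube_groupings(columns):
    """Every subset of the columns, including the empty (grand total) grouping."""
    return [combo
            for size in range(len(columns), -1, -1)
            for combo in combinations(columns, size)]


def aggregate(rows, group_by, measure):
    """SUM(measure) for each distinct combination of the group_by columns."""
    result = {}
    for row in rows:
        key = tuple(row[c] for c in group_by)
        result[key] = result.get(key, 0) + row[measure]
    return result


# Hypothetical sample rows
rows = [
    {"product": "tv", "store": "S1", "city": "Athens", "amount": 100},
    {"product": "tv", "store": "S2", "city": "Rome", "amount": 120},
    {"product": "radio", "store": "S1", "city": "Athens", "amount": 250},
]

columns = ["product", "store", "city"]
for grouping in cube_groupings(columns):
    print(grouping, aggregate(rows, grouping, "amount"))
```

For three columns the cube yields 2^3 = 8 groupings, while the rollup yields only 3, which is why the cube is the more expensive operator to materialize.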
  • 45. Processing Star Queries on Hierarchically-Clustered Fact Tables
      Nikos Karayannidis (1), Aris Tsois (1), Timos Sellis (1), Roland Pieringer (2), Volker Markl (4), Frank Ramsak (3), Robert Fenk (3), Klaus Elhardt (2), Rudolf Bayer (5)
      (1) I.C.C.S. - N.T.U. Athens, (2) TransAction Software GmbH, (3) FORWISS, (4) IBM Almaden Research Center, (5) T.U. München

    Key Points
      Star queries are ubiquitous in DW and OLAP
      New trend: hierarchically clustered star-schemata
      New processing framework
      New optimization challenges
      Implemented in TransBase HyperCube
      Tested with real-world application (up to 40 speed-up)
  • 46. EDITH
      EDITH - the European Development on Indexing Techniques for Databases with Multidimensional Hierarchies
      Information Society Technologies Programme (IST) - grant No. IST-1999-20722
      http://edith.in.tum.de

    Motivation - Problem statement
      Not just report! What about ad hoc queries?
      OLAP requires efficient processing of ad hoc star queries
      Major bottleneck: processing of the star-join
      Cartesian product, bitmap indexes, … are NOT enough: efficiency requires good physical clustering of data
  • 47. Hierarchical Clustering
      A new trend: hierarchical clustering of fact table data through path-based surrogate keys
      Exploitation of multidimensional indexes
      Star join transforms to a multidimensional range query
      The overall processing framework of star queries changes radically

    Contributions
      Present a novel processing framework for star queries over hierarchically clustered data
      Discuss optimizations
      Realization of our technology in a real system
      Evaluation on a real-world application has shown significant speed-ups
  • 48. Hierarchical Surrogate Keys
      Apply hierarchical encoding on each dimension table
      System-assigned h-surrogate key: e.g., oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”)
      Implementation based on underlying physical data structure

    Database Schema
      [Figure: star schema. Fact table FT with h-surrogate columns hsk1, hsk2, …, hskN, dimension references d1, d2, …, dN, and measures m1, m2. Dimension tables D1 (h1, h2, h3, f1, f2, hsk1), D2 (h1, h2, h3, h4, hsk2), …, DN (h1, h2, f1, f2, f3, hskN), where h1 is set apart from the other hierarchy attributes.]
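One way to picture a path-based surrogate key such as oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”) is as a tuple of per-level ordinal codes. The sketch below is an assumption made for illustration: the slides do not specify how ordinals are assigned or how the key is physically encoded, and the names `assign_hsk` and `prefix_range` are hypothetical. It shows the property that matters for clustering: all members below one hierarchy prefix occupy a contiguous key range.

```python
def assign_hsk(paths):
    """Give every hierarchy path (top level first) a tuple of per-level ordinals.

    Ordinals are assigned per parent, in order of first appearance
    (an illustrative choice, not the scheme of any particular system).
    """
    children = {}  # parent prefix (tuple of names) -> {child name: ordinal}
    hsk = {}
    for path in paths:
        key = []
        for depth, name in enumerate(path):
            siblings = children.setdefault(tuple(path[:depth]), {})
            key.append(siblings.setdefault(name, len(siblings)))
        hsk[tuple(path)] = tuple(key)
    return hsk


def prefix_range(hsk, prefix):
    """Lowest and highest h-surrogate among the members below one prefix."""
    keys = sorted(k for path, k in hsk.items()
                  if path[:len(prefix)] == tuple(prefix))
    return keys[0], keys[-1]


# Hypothetical location hierarchy: country / city / store
paths = [
    ("Greece", "Athens", "Store5"),
    ("Greece", "Athens", "Store7"),
    ("Greece", "Patras", "Store2"),
    ("Italy", "Rome", "Store1"),
]
hsk = assign_hsk(paths)
low, high = prefix_range(hsk, ("Greece",))
print(hsk[("Greece", "Athens", "Store5")], low, high)
```

With such keys, a restriction like country = "Greece" becomes a single range on the h-surrogate, which is what lets a star join turn into a multidimensional range query.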
  • 49. Star Queries
      SELECT {Di.hj}, {Di.fj}, {aggr(…) AS AMj}
      FROM FT, D1, …, DN
      WHERE FT.d1 = D1.h1 AND …        (star-join conditions)
        LOCPRED({D1}) AND …            (dimension restrictions)
        MPRED({FT.mi})                 (measure restrictions)
      GROUP BY {Di.hj}, {Di.fj}, {FT.mj}
      HAVING <having clause>
      ORDER BY <ordering fields>

    The Abstract Processing Plan
      [Figure: processing plan in two phases. h-surrogate processing: Create_Range operators over dimension tables. Main execution phase: MD Range Access on FT, Residual Joins with dimension tables, Group-Select, Order_By.]
  • 50. Optimization Issues
      Optimizing h-surrogate processing
        Single tuple retrieval for hierarchical prefix path restrictions
        Exploit composite index on (hm, hm-1, …, h1, hski)
      Pregrouping transformation
        Reduces tuples for residual join and speeds up grouping
        Heuristic algorithm based on query syntax

    Pre-grouping Transformation
      [Figure: two plans over fact table F. Both contain MD Range Access on F, a Residual Join with Date, a Residual Join with Location, and Group Select by month, store. The transformed plan adds a Group Select by hsk1, hsk2.]
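The effect of the pre-grouping transformation can be sketched on a few rows. The example below is a simplification under stated assumptions: the data is invented, the h-surrogates are opaque strings rather than hierarchical keys, the residual joins are dictionary lookups, and the aggregate is SUM, which can safely be computed in two stages. It is not the heuristic algorithm mentioned on the slide, only the rewrite it decides whether to apply.

```python
from collections import Counter

# Hypothetical fact tuples after the range access: (hsk_date, hsk_location, amount)
fact = [
    ("d1", "s1", 10), ("d1", "s1", 5), ("d2", "s1", 7),
    ("d3", "s2", 4), ("d3", "s2", 6),
]
# Hypothetical dimension tables, keyed by h-surrogate
date_dim = {"d1": "2002-01", "d2": "2002-01", "d3": "2002-02"}  # hsk -> month
location_dim = {"s1": "Store5", "s2": "Store7"}                 # hsk -> store


def group_by_month_store(joined):
    """Final Group Select by month, store with SUM(amount)."""
    result = Counter()
    for month, store, amount in joined:
        result[(month, store)] += amount
    return dict(result)


def plan_without_pregrouping(fact_rows):
    """Join every fact tuple with both dimensions, then group."""
    joined = [(date_dim[d], location_dim[l], amt) for d, l, amt in fact_rows]
    return group_by_month_store(joined), len(joined)


def plan_with_pregrouping(fact_rows):
    """Group by the h-surrogates first, so fewer tuples reach the joins."""
    pre = Counter()
    for d, l, amt in fact_rows:
        pre[(d, l)] += amt
    joined = [(date_dim[d], location_dim[l], amt) for (d, l), amt in pre.items()]
    return group_by_month_store(joined), len(joined)


print(plan_without_pregrouping(fact))
print(plan_with_pregrouping(fact))
```

Both plans return the same groups, but the pre-grouped plan sends 3 tuples instead of 5 through the residual joins.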
  • 51. Performance Evaluation
      Greek electronic retailer data: 3 dims (1.4M, 27K, 2.5K tuples)
      Fact table: 15.5M tuples (1.5GB)
      220 ad hoc star queries from a real application
      Compare 3 plans: STAR, AEP and OPT
      FT selectivity range: 0.0% to 5.0% of FT
      Result:
        AEP vs. STAR: 20 avg. speed-up
        OPT vs. STAR: 40 avg. speed-up

    Summary
      Efficient star query processing is a must in DW and OLAP
      New trend: hierarchically clustered star-schemata
      Presented a novel processing framework for star queries over hierarchically clustered data
      Discussed optimization issues
      Fully implemented our technology in TransBase
      Evaluation with a real-world application has shown significant speed-ups
  • 52. Future Work
      Extensive experimental evaluation
      Investigate applicability of our processing framework to other areas
      Further optimization issues: reducing the number of produced h-surrogate ranges

    Metadata Repository
      [Figure: data warehouse architecture. Components: Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools. Roles: Administrator, Designer, End User. "Quality Issues" is annotated at four points.]
  • 53. The Lack of Conceptual Support
      [Figure: Information Source, Wrapper/Loader, Data Warehouse, Aggregation/Customization, Multidim. Data Mart, with Source Quality, DW Quality and Mart Quality. Above them: Observation, OLTP, OLAP, Analyst, Operational Department, Enterprise, and an open question mark. Elements are numbered (1) to (5).]

    Conceptual-Logical-Physical
      [Figure: three perspectives. Conceptual Perspective: Operational Department Model, Enterprise Model, Client Model, linked by Observation, OLTP, OLAP and an open question mark. Logical Perspective: Source Schema, DW Schema, Client Schema, with Wrapper and Aggregation/Customization. Physical Perspective: Source DataStore, DW DataStore, Client DataStore, connected by Transportation Agents.]
  • 54. The DWQ Approach
      [Figure: levels (Client Level, DW Level, Source Level) against perspectives (Conceptual, Logical, Physical), shown at the Meta Model Level, the Models/Meta Data Level, and in the Real World. Alongside: Process Meta Model, Process Model, Processes; Quality Metamodel, Quality Model, Quality Measurements. Edge labels: in, uses.]

    DWQ Repository
      The DWQ approach for managing data warehouse quality is organized around an extended, semantically rich metadata repository (prototypically implemented using ConceptBase), which controls all relevant metadata
      We have developed meta models for DW architecture, quality, processes and evolution
      Metadata can be provided and queried by external tools; via active rules, external tools could even be activated
      [Jarke et al., CAiSE98]
  • 55. DWQ Metadata Framework
      [Figure: four levels. Meta level: MetaModel, Interface, Schema Store. Conceptual level: Client Model_1 … Client Model_n, Enterprise Model, Source Model_1 … Source Model_m. Logical level: Client, DW and Source schemas (Schema_1 … Schema_n, Schema_1 … Schema_m), Mediators. Physical level: Data Store_1 … Data Store_n, Wrappers, Sources. Legend: conceptual/logical mapping, physical/logical mapping, conceptual link, logical link, data flow.]

    Quality Model: An Adapted GQM Approach
      [Figure: nodes DW Designers, Decision Maker, DW Administrator, Quality Goal, Quality Query, Quality Factor, Measurement Processes, DW Objects, Processes and Data, Metadata for DW Architecture, Quality and Processes. Edge labels: establish, evaluated by, evidence for, defined on.]
      [Jarke et al., IS99]
  • 56. Quality Factors by Perspective
      Conceptual Perspective
        • Completeness
        • Redundancy
        • Consistency
        • Correctness
        • Traceability of Concepts and Models
      Logical Perspective
        • Usefulness of schemas
        • Correctness of mappings
        • Interpretability of schemas
      Physical Perspective
        • Efficiency
        • Interpretability of schemas
        • Timeliness of stored data
        • Maintainability/Usability of software components

    Towards Quality-Oriented DW Design
      [Figure: cycle around a Quality Goal with four phases: 1. Design, 2. Evaluation, 3. Analysis & Improvement, 4. Re-evaluation & evolution. Step labels: Define Quality Factor Types; Define Object Types; Define Object Instances & Properties; Define Metrics & Agents; Compute!; Acquire values for quality factors (current status); Feed values to quality scenario and play!; Discover/Refine new/old "functions"; Empirically derive "functions"; Analytically derive "functions"; Produce a scenario for a goal; Produce expected/acceptable values; Negotiate!; Take actions!; Decompose complex objects and iterate.]
      [Vassiliadis et al., IS00]
  • 57. DWQ Methodology: Summary
      [Figure: clients C1, C2, …, Cm issue user queries; sources S1, S2, S3, …, Sn receive OLTP updates; relations R1, R2, R3 appear in the Enterprise Model, the Materialized Views and each source, connected by conjunctive queries; a Metadata Repository sits alongside. Numbered steps: 1. Conceptual Enterprise Model, 2. Conceptual Source Models, 3. Conceptual Client Modeling, 4. Translate aggregates into OLAP operations, 5. Design Optimization, 6. Data Reconciliation. Other labels: Rewriting of Aggregate Queries, Refreshment.]

    Key Formal Results on Quality Impacts
      conceptual: description logic theory and tools for complete reasoning about the relationships between source, enterprise, and client models
      conceptual/logical: containment, satisfiability, and rewriting of queries over views with & without aggregates
      logical/physical: incremental cost-based optimization of view materializations
      physical: detailed impact analysis of replication and refreshment policies
  • 58. ConceptBase User Interface

    DW Quality Example
  • 59. Metadata Standards
      Metadata Coalition
        MetaData Interchange Specification (MDIS)
        Open Information Model (OIM)
      OMG (latest development)
        Common Warehouse Metamodel (CWM)
      Microsoft Repository

    Summary
      OLAP - Multidimensional data
        Drill down, Roll Up, Pivot, Slice and Dice
      Data warehouse architecture
      Warehouse operational process
        Loading - Cleaning - Serving (ROLAP/MOLAP)
        Refreshing
      Warehouse server requirements
        Star-Snowflake schemes
        Specialized indexes: BitMap - Join Indexes
  • 60. Research issues
      Data cleaning
        focus on schema inconsistencies
      Data warehouse design
        summary tables, indexing
      Query Processing
        use summary data, statistics mgt, dynamic optimization
      Warehouse Management
        resource management, runaway queries
        incremental refresh techniques

    References
      W. H. Inmon: Building the Data Warehouse (2nd Edition), John Wiley, 1996.
      R. Kimball: The Data Warehouse Toolkit, John Wiley, 1996.
      H. Garcia-Molina: Data Warehousing Overview, class notes, Stanford University.
      S. Chaudhuri & U. Dayal: Data Warehousing and OLAP for Decision Support, VLDB’96 tutorial.
      Oracle, IBM, Redbrick, Sybase, Informix, Tandem, Teradata, HP, … web sites.
      The DWQ project: http://www.dbnet.ece.ntua.gr/~dwq/