1. DBM630: Data Mining and
Data Warehousing
MS.IT. Rangsit University
Semester 2/2011
Lecture 2&3
Data Warehouse and OLAP Technology
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
1
2. Outline
Part I: Basic Knowledge about Data Warehousing
Dimensional Modeling
Data Cube
Architecture
Part II: OLAP and Cube Computations
OLAP, Data Cube and Data Analysis
Cube Computations
Demo PivotTable (bring laptop with MS Excel, if you have)
2 Data Warehousing and Data Mining by Kritsada Sriphaew
3. What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately
from the organization’s operational DB
Support information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.” (definition by W. H. Inmon)
Data warehousing:
Process of constructing and using data warehouses
3 Data Warehousing and Data Mining by Kritsada Sriphaew
4. Four Properties of Data Warehouses
subject-oriented จัดเก็บเป็นเรื่องๆ
integrated รวบรวมอยู่ในรูปแบบเดียวกัน
time-variant มีข้อมูลตามมิติเวลา
non-volatile มีเสถียรภาพ ข้อมูลไม่สูญหาย
4 Data Warehousing and Data Mining by Kritsada Sriphaew
5. Subject-Oriented
Subject-Oriented Property
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.
OPERATIONAL DB DATA WAREHOUSE
• Loans • Customer
• Savings • Vendor
• Bank card • Product
• Trust • Activity
An application orientation A subject orientation
5 Data Warehousing and Data Mining by Kritsada Sriphaew
6. Integrated
Integrated Property
Constructed by integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
e.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
6 Data Warehousing and Data Mining by Kritsada Sriphaew
7. Time Variant/Non-Volatile
Time Variant Property
The time horizon for the data warehouse is significantly longer than that of
operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10
years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time element”.
Non-Volatile Property
A physically separated store of data transformed from the operational
environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of
data.
7 Data Warehousing and Data Mining by Kritsada Sriphaew
8. Operational DBMS vs. Data Warehouse
(OLTP vs. OLAP)
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,
registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: create/select/insert/delete/update vs. read-only but complex queries
8 Data Warehousing and Data Mining by Kritsada Sriphaew
9. OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized,
isolated multidimensional
integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on primary key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands tens
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
9 Data Warehousing and Data Mining by Kritsada Sriphaew
10. Heterogeneous DBMS vs. Data Warehouse
Traditional heterogeneous DB integration
Build wrappers/mediators on top of heterogeneous DBs
Query-driven approach
When a query is posed to a client site, a meta-dictionary is used
to translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into a
global answer set
Complex information filtering, compete for resources
Data warehouse (DW)
Another database for high-performance analysis
Update-driven approach
Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query/analysis
10 Data Warehousing and Data Mining by Kritsada Sriphaew
11. Heterogeneous DBMS vs. Data Warehouse
OLTP Query-driven Approach
Sales
Wrapper/
Purchasing
Mediators
Production
Heterogeneous Operational Database
OLTP Update-driven Approach
Sales
Data
Purchasing Warehouse Query
Production OLAP
Heterogeneous Operational Database
11 Data Warehousing and Data Mining by Kritsada Sriphaew
12. Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
Missing data: Decision support (OLAP) requires historical data which
operational DBs (OLTP) do not typically maintain
Data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
12 Data Warehousing and Data Mining by Kritsada Sriphaew
13. * A multidimensional data model
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
Fact table contains measures (such as units_sold, dollars_sold) and keys to each of
the related dimension tables
Dimension tables, such as location (branch, country, continent), or time (day, week,
month, quarter, year)
Date Branch Item Buyer Units sold Dollars sold Branch Country Continent
1/1/2008 London VCD First Company 20 5000 London UK Europe
1/1/2008 Bangkok TV First Company 30 9000 Glasgow UK Europe
10/1/2008 London Ham First Company 20 1000 Berlin Germany Europe
4/2/2008 London Milk First Company 80 1600 Bangkok Thailand Asia
15/2/2008 Bangkok VCD Best Company 30 7500 Phuket Thailand Asia
2/5/2008 Bangkok Orange Best Company 20 500 Tokyo Japan Asia
13 Fact Table Dimension Table
Data Warehousing and Data Mining by Kritsada Sriphaew
14. Three Concepts in Data Cubes
Three concepts in data cubes are
(1) Multidimension (2) Hierarchy Location Dimension Table (3 levels)
(3) Measure Fact Table Branch Country Continent
4 dimensions 2 measures London UK Europe
Glasgow UK Europe
Date Branch Item Buyer Units sold Dollars sold
Berlin Germany Europe
1/1/2008 London VCD First Company 20 5000
Bangkok Thailand Asia
1/1/2008 Bangkok TV First Company 30 9000
Phuket Thailand Asia
10/1/2008 London Ham First Company 20 1000
Tokyo Japan Asia
4/2/2008 London Milk First Company 80 1600
Item Subcategory Category
15/2/2008 Bangkok VCD Best Company 30 7500
VCD Electric Non-Food
2/5/2008 Bangkok Orange Best Company 20 500
TV Electric Non-Food
Buyer Buyer Group
Shirt Clothes Non-Food
First Company Group 1
Customer Dimension Table Ham Process food Food
Second Company Group 1 (2 levels) Milk Fresh food Food
Third Company Group 1
Orange Fresh food Food
Best Company Group 2
Good Company Group 2 Product Dimension Table (3 levels)
14 Data Warehousing and Data Mining by Kritsada Sriphaew
15. Cuboids in Data Cubes
Cuboid concept is formed by the number of
dimensions.
The top most 0-D cuboid, which holds the highest-
level of summarization, is called an apex cuboid.
In data warehousing literature, an n-D base cube
where n is the total number of dimensions is called a
base cuboid.
The lattice of cuboids forms a data cube.
15 Data Warehousing and Data Mining by Kritsada Sriphaew
17. Data Cube: A Lattice of Cuboids (Dimension)
all
0-D (apex) cuboid
time product location customer
1-D cuboids
time, item time, location item, location location, customer
2-D cuboids
time, customer item, customer
time, item, location time, location, customer
3-D cuboids
time, item, customer item, location, customer
4-D(base) cuboid
time, item, location, customer
17 Data Warehousing and Data Mining by Kritsada Sriphaew
18. Designing Data Warehouses
Conceptual Modeling (Star schema, Snowflake, Fact
constellations), Concept Hierarchy
Physical Modeling (Data sources, Data storage, OLAP
engine)
18 Data Warehousing and Data Mining by Kritsada Sriphaew
19. Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected
to a set of dimension tables
Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake
Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
19 Data Warehousing and Data Mining by Kritsada Sriphaew
20. Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold province_or_street
country
avg_sales
Measures
20 Data Warehousing and Data Mining by Kritsada Sriphaew
21. Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country
21 Data Warehousing and Data Mining by Kritsada Sriphaew
22. Example of Fact Constellation
time Shipping Fact Table
item
time_key time_key
day Sales Fact Table item_key
day_of_the_week item_name item_key
month time_key brand
quarter type shipper_key
year item_key supplier_type
from_location
branch_key to_location
branch location_key location
branch_key dollars_cost
units_sold location_key
branch_name street units_shipped
branch_type dollars_sold city
province_or_street shipper
avg_sales country
Measures shipper_key
shipper_name
location_key
shipper_type
22 Data Warehousing and Data Mining by Kritsada Sriphaew
23. Measures: Three Categories
Distributive: if the result derived by applying the function to n
aggregate values is the same as that derived by applying the function
on all the data without partitioning.
e.g., count(), sum(), min(), max().
Algebraic: if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is obtained
by applying a distributive aggregate function.
e.g., avg(), standard_deviation().
Holistic: if there is no constant bound on the storage size needed to
describe a sub-aggregate.
e.g., median(), mode(), rank().
23 Data Warehousing and Data Mining by Kritsada Sriphaew
24. A Concept Hierarchy: Dimension (location)
all all
region Europe ... North_America
country Germany ... Spain Canada ... Mexico
city Frankfurt ... Vancouver ... Toronto
office L. Chan ... M. Wind
24 Data Warehousing and Data Mining by Kritsada Sriphaew
25. A Concept Hierarchy: Dimension
(Distributive - count(), sum(), min(), max())
all all
sum = 2100
sum = 1400 sum = 700
region Europe North_America
sum = 900 sum = 500 sum = 600 sum = 100
country Germany Spain Canada Mexico
city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto
400 200 300 400 100 200 400
Mexico
100
25 Data Warehousing and Data Mining by Kritsada Sriphaew
26. A Concept Hierarchy: Dimension
(Algebraic - avg(), standard_deviation())
all all
avg = 262.5
count=8
avg = 280 count=5 avg = 233.33
region Europe North_America
count=3
avg = 300 avg = 250 avg = 300 avg = 100
country Germany Spain Canada Mexico
count=3 count=2 count=1
count=2
city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto
400 200 300 400 100 200 400
Mexico
100
26 Data Warehousing and Data Mining by Kritsada Sriphaew
27. A Concept Hierarchy: Dimension
(Holistic - median(), mode(), rank())
all all
median = 250
median = 300 median = 200
region Europe North_America
median = 300 median = 250 median = 300
country Germany Spain Canada Mexico
median = 100
city Frankfurt Berlin Aechen Segovia Madrid Vancouver Toronto
400 200 300 400 100 200 400
Mexico
100
27 Data Warehousing and Data Mining by Kritsada Sriphaew
28. Multidimensional Data
Sales volume as a function of product, month, and
region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region Year
Product
Category Country Quarter
Product City Month Week
Office Day
Month
28 Data Warehousing and Data Mining by Kritsada Sriphaew
29. A Sample Data Cube
Total annual sales
Time of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR
Country
sum
Canada
Mexico
sum
29 Data Warehousing and Data Mining by Kritsada Sriphaew
30. Browsing a Data Cube
Visualization
OLAP capabilities
Interactive manipulation
30 Data Warehousing and Data Mining by Kritsada Sriphaew
31. Typical OLAP Operations
Drill up (roll up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of drill up
from higher level summary to lower level summary or detailed data, or introducing new
dimensions
Slice and dice:
project and select
Slice (select on 1 dim.), dice (select on 2 or more dim.)
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables
(using SQL)
31 Data Warehousing and Data Mining by Kritsada Sriphaew
32. A Star-Net Query Model
Each circle is called
Shipping Method Customer Orders a footprint
All All
Shipping type Customer Sex
group Contracts
Shipping type All
Order
Male/Female
Time Annually Product line All
All Quarterly Daily Item Product group Product
City Salesperson
Country District
Region Promotion
type Division
All All
All
Location
Promotion Organization
32 Data Warehousing and Data Mining by Kritsada Sriphaew
33. An Example
Three concepts in data cubes are * A multi-dimensional data model
(1) Multidimension (2) Hierarchy Location Dimension Table (3 levels)
(3) Measure Fact Table Branch Country Continent
4 dimensions 2 measures London UK Europe
Glasgow UK Europe
Date Branch Item Buyer Units sold Dollars sold
Berlin Germany Europe
1/1/2008 London VCD First Company 20 5000 Dimension Table
Bangkok Thailand Asia
1/1/2008 Bangkok TV First Company 30 9000
Phuket Thailand Asia
10/1/2008 London Ham First Company 20 1000
Tokyo Japan Asia
4/2/2008 London Milk First Company 80 1600
Item Subcategory Category
15/2/2008 Bangkok VCD Best Company 30 7500
VCD Electric Non-Food
2/5/2008 Bangkok Orange Best Company 20 500
TV Electric Non-Food
Buyer Buyer Group
Shirt Clothes Non-Food
First Company Group 1
Customer Dimension Table Ham Process food Food
Second Company Group 1 (2 levels) Milk Fresh food Food
Third Company Group 1
Orange Fresh food Food
Best Company Group 2
Good Company Group 2 Product Dimension Table (3 levels)
33 Data Warehousing and Data Mining by Kritsada Sriphaew
34. OLAP Operations (Drill down vs. Drill up)
All customers All customers
Thailand Japan Total Thailand Japan Total
Time Time
Food NonFood Food NonFood
Drill down Food NonFood Food NonFood
2006 2400 2200 11000 5000 20600 Roll down Q1 1500 500 2000 1000 5000
2007 4500 3200 12000 6000 25700 Q2 900 600 1500 1000 4000
2008 5600 2900 10000 5500 24000 Q3 1200 800 4000 2000 8000
Total 12500 8300 33000 16500 70300 Drill up Q4 2000 1000 2500 1500 7000
Location Roll up Total 2008 5600 2900 10000 5500 24000
all Drill down Drill up
Roll down Roll up
continent
All customers
country
Thailand Japan Total
Time branch Product Time
all year month subcategory all Food NonFood Food NonFood
quarter day item category January 700 200 800 500 2200
buyer February 500 100 700 200 1500
buyer group
March 300 200 500 300 1300
all Total Q1 1500 500 2000 1000 5000
Customer
34 Data Warehousing and Data Mining by Kritsada Sriphaew
35. OLAP Operations (Slice and Dice)
All customers Buyer Group 1
Thailand Japan Total Thailand Japan Total
Time Time
Food NonFood Food NonFood
Food NonFood Food NonFood
2006 2400 2200 11000 5000 20600 Slice
2006 1100 1500 7000 2000 11600
2007 4500 3200 12000 6000 25700 2007 1200 1200 8000 3000 13400
2008 5600 2900 10000 5500 24000 2008 3000 1200 5000 1800 11000
Total 12500 8300 33000 16500 70300 Total 5300 3900 20000 6800 36000
Location
all
January Buyer Group 1
continent
Dice
Thailand Japan Total
country
Time
Food NonFood Food NonFood
Time branch 2006 200 200 800 300 1500
all year month subcategory all Product
2007 100 100 900 400 1500
quarter day item category
2008 300 200 400 100 1000
buyer
buyer group Total 600 500 2100 800 4000
all
Customer
35 Data Warehousing and Data Mining by Kritsada Sriphaew
36. OLAP Operations (Pivot)
All customers
Thailand Japan Total
Time
Food NonFood Food NonFood
2006 2400 2200 11000 5000 20600
2007 4500 3200 12000 6000 25700 Pivot
2008 5600 2900 10000 5500 24000
Total 12500 8300 33000 16500 70300 All customers
Location Thailand Japan Total
Time
all 2006 2007 2008 2006 2007 2008
continent Food 2400 4500 5600 11000 12000 10000 45500
country NonFood 2200 3200 2900 5000 6000 5500 24800
Total 4600 7700 8500 16000 18000 15500 70300
Time branch
all year month subcategory all
quarter day item category Product
buyer
buyer group
all
Customer
36 Data Warehousing and Data Mining by Kritsada Sriphaew
37. OLAP Operations (Drill Across)
All customers All customers
Thailand Japan Total Thailand Japan Total
Time Time
Food NonFood Food NonFood
Food NonFood Food NonFood
2006 2400 2200 11000 5000 20600 2006 1500 1500 6000 3000 12000
2007 4500 3200 12000 6000 25700 Drill Across 2007 3500 2500 8000 4000 18000
2008 5600 2900 10000 5500 24000 2008 4000 1500 5000 3000 13500
Total 12500 8300 33000 16500 70300 Total 9000 5500 19000 10000 43500
Sales Purchase
Product
Product
Drill Across
Location
Location
37 Data Warehousing and Data Mining by Kritsada Sriphaew
38. OLAP Operations (Drill Through)
All customers
Thailand Japan Total
Time
Food NonFood Food NonFood
2006 2400 2200 11000 5000 20600
2007 4500 3200 12000 6000 25700
2008 5600 2900 10000 5500 24000 In order to see the detail at the relational
Total 12500 8300 33000 16500 70300 table, we can perform ‘drill through’.
Drill Through
Sales
Date Branch Item Buyer Units sold Dollars sold
1/1/2008 Phuket VCD First Company 10 250
1/1/2008 Bangkok TV First Company 30 900
Product
10/1/2008 Phuket TV First Company 10 300
4/2/2008 Phuket Stereo First Company 40 200
15/2/2008 Bangkok VCD Best Company 30 750
2/5/2008 Bangkok Computer Best Company 20 600
Location
38 Data Warehousing and Data Mining by Kritsada Sriphaew
39. An Example of OLAP tools
39 Data Warehousing and Data Mining by Kritsada Sriphaew
40. An Example of OLAP tools
Organization
(continent level)
Measures Time
(year level)
Dimension
2 Dimensions: (1) Time
(2) Organization
40 Data Warehousing and Data Mining by Kritsada Sriphaew
41. An Example of OLAP tools
41 Data Warehousing and Data Mining by Kritsada Sriphaew
42. An Example of OLAP tools
42 Data Warehousing and Data Mining by Kritsada Sriphaew
43. Designing Data Warehouses
Conceptual Modeling (Star schema, Snowflake, Fact
constellations), Concept Hierarchy
Physical Modeling (Data sources, Data storage, OLAP
engine)
43 Data Warehousing and Data Mining by Kritsada Sriphaew
44. Multi-Tiered Architecture
Bottom Tier Middle Tier Top Tier
Monitor
& OLAP Server
other Metadata
sources Integrator
Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Warehouse Data
mining
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools
44 Data Warehousing and Data Mining by Kritsada Sriphaew
45. Three Physical Data Warehouse Models
Enterprise Warehouse
collects all of the information about subjects spanning the entire
organization
Data Mart
a subset of corporate-wide data that is of value to a specific groups
of users. Its scope is confined to specific, selected groups, such as
marketing data mart
Independent vs. dependent (directly from warehouse)
data mart
Virtual Warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized
Easy to build but require excess capacity on operational DB server.
45 Data Warehousing and Data Mining by Kritsada Sriphaew
46. OLAP Server Architectures
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing pieces
Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools/services
greater scalability, we may keep summary in relational DB
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine(sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
specialized support for SQL queries over star/snowflake schemas in read-
only environment.
46 Data Warehousing and Data Mining by Kritsada Sriphaew
47. Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids in an n-dimensional cube with L levels?
n
T ( Li 1)
i 1
Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
47 Data Warehousing and Data Mining by Kritsada Sriphaew
48. Cube: A Lattice of Cuboids (Dimension)
all
0-D (apex) cuboid
time product location customer
1-D cuboids
time, item time, location item, location location, customer
2-D cuboids
time, customer item, customer
time, item, location time, location, customer
3-D cuboids
time, item, customer item, location, customer
4-D(base) cuboid
time, item, location, customer
48 Data Warehousing and Data Mining by Kritsada Sriphaew
49. Multi-way Array Aggregation for Cube
Computation
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order which minimizes
the # of times to visit each cell, and reduces memory access and storage cost.
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
b3 B13 14 15 16 60 What is the best traversing order
44
28 56 to do multi-way aggregation?
b2 9
B 40
24 52
b1 5 36
The size of the dimensions A, B,
20 and C is 6000, 1000, and 100000.
b0 1 2 3 4
a0 a1 a2 a3
A
49 Data Warehousing and Data Mining by Kritsada Sriphaew
50. Multi-way Array Aggregation for Cube
Computation
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B 28 56
b2 9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
50 Data Warehousing and Data Mining by Kritsada Sriphaew
51. Multi-way Array Aggregation for Cube
Computation
C c3 61
c2 45
62 63 64
46 47 48
c1 29 30 31 32
c0
B13 14 15 16 60
b3 44
B 28 56
b2 9
40
24 52
b1 5
36
20
b0 1 2 3 4
a0 a1 a2 a3
A
51 Data Warehousing and Data Mining by Kritsada Sriphaew
52. Multi-Way Array Aggregation for Cube
Computation (Cont.)
Method: the planes should be sorted and computed
according to their size in ascending order.
Idea: keep the smallest plane in the main memory, fetch and
compute only one chunk at a time for the largest plane
Limitation of the method: computing well only for a small
number of dimensions
If there are a large number of dimensions, “bottom-up computation”
and iceberg cube computation methods can be explored. Iceberg
cubes store only cube partitions where the aggregate value is above
some min. support.
52 Data Warehousing and Data Mining by Kritsada Sriphaew
53. Multi-Way Array Aggregation
(An Example)
Suppose that a base cuboid has three dimensions A, B, C with the following
numbers of cells: |A| = 6,000, |B| = 1,000, |C| = 50,000. Suppose that the
dimensions A, B and C are partitioned into 6, 5 and 1000 portions for chunking,
respectively.
If each cube cell stores one measure with 4 bytes, what is the total size of the
computed cube if the cube is very dense? That is, calculate the size of a base
cuboid.
The number of cells in the computed cube is
6 x 103 x 103 x 5 x 104 = 3x1011 cells.
The total size of the computed cube is
4 x 3 x 1011 = 1.2x1012 bytes.
State the order for computing the chunks in the cube that requires the least
amount of space, and compute the total amount of main memory space required
for computing the 2-D planes.
53 Data Warehousing and Data Mining by Kritsada Sriphaew
54. Multi-Way Array Aggregation for Cube
Computation (Cont.)
C
One chunk includes
A: 6000/6 = 1000 A
B: 1000/5 = 200
C: 50000/1000 = 50
The space we need for B
computing the cube is
(3) One chunk of plane AC:
(1) All elements of plane AB:
= 1000x50 cells
= 6000x1000 cells
= 50000 cells
= 6000000 cells
= 200000 bytes
= 24000000 bytes
(4) One cube of ABC:
(2) One row of plane BC:
= 1000x200x50 cells
= 1000x50 cells
= 10000000 cells
= 50000 cells
= 40000000 bytes
= 200000 bytes
Total memory for
keeping the result
= 64400000 bytes
= 6.44x107 bytes.
54 Data Warehousing and Data Mining by Kritsada Sriphaew
55. Efficient Processing OLAP Queries
Determine which operations should be performed on
the available cuboids:
transform drill, roll, etc. into corresponding SQL and/or
OLAP operations, e.g, dice = selection + projection
Determine to which materialized cuboid(s) the
relevant operations should be applied.
Exploring indexing structures and compressed vs.
dense array structures in MOLAP
55 Data Warehousing and Data Mining by Kritsada Sriphaew
56. Metadata Repository
Meta data is the data defining warehouse objects. It
has the following kinds
The algorithms used for summarization
The mapping from operational environment to the data
warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging
policies
56 Data Warehousing and Data Mining by Kritsada Sriphaew
57. Data Warehouse Back-End Tools and Utilities
Extraction:
get data from multiple, heterogeneous, and external sources
Transformation:
convert data from legacy or host format to warehouse format
Load:
sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
Cleaning:
detect errors in the data and rectify them when possible
Refresh:
propagate the updates from the data sources to the warehouse
57 Data Warehousing and Data Mining by Kritsada Sriphaew
58. From OLAP to On Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions,
algorithms, and tasks.
58 Data Warehousing and Data Mining by Kritsada Sriphaew
59. An OLAM Architecture (Online Analytical Mining)
Mining query Mining result Layer4
User Interface
User GUI API
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Data Cube API
Layer2
MDDB
MDDB
Meta Data
Filtering&Integration Database API Filtering
Layer1
Data cleaning Data
Data Repository
Databases Warehouse
Data integration
59 Data Warehousing and Data Mining by Kritsada Sriphaew
60. Major Applications of Data Warehouses
Applications
Online Analytical Processing (OLAP)
Data Mining
Customer Relationship Management (CRM)
60 Data Warehousing and Data Mining by Kritsada Sriphaew
61. Some commercial tools for
Data Warehouse
Microsoft Excel’s PivotTable
Crystal Reports’ Business Objects
IBM Cognos
Microsoft SQL Server Analysis Services
Oracle Express
Microsoft Proclarity
Etc.
61 Data Warehousing and Data Mining by Kritsada Sriphaew