Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

# 1
19th International Conference on Data Management
COMAD’2013 @ Ahmedabad, India
20th
December
2013
Multidimensional Database Design via
Schema Transformation
Turning TPC-H into the TPC-H*d Multidimensional
Benchmark
Alfredo Cuzzocrea∗
and Rim Moussa‡
*cuzzocrea@si.deis.unical.it ICAR-CNR & University of Calabria, Italy
‡rim.moussa@esti.rnu.tn LaTICE, University of Tunis, Tunisia

Context
# 2
20th
of Dec.
2013
COMAD@Ahmedabad.India
Data
Warehouse
Systems
Multidimensional
Databases
OLAP Technologies
Visual Analytics: BI Dashboards,
OLAP cubes, pivots tables,

Motivations & Goals
# 3
20th
of Dec.
2013
• Motivations
– BI analysts require Visual OLAP technologies
– According to market watchers, such as Pringle & Company
and Gartner, the market for BI platforms will remain one
of the fastest growing software markets in most regions,
• Goals: propose, implement & test
– Framework for MDB design based on business
requirements,
– Appropriate benchmark for OLAP servers benchmarking.

Motivations & Goals
# 4
20th
of Dec.
2013
• Motivations
– BI analysts require Visual OLAP technologies
– According to market watchers, such as Pringle & Company
and Gartner, the market for BI platforms will remain one
of the fastest growing software markets in most regions,
• Goals: propose, implement & test
– Framework for MDB design based on business
requirements,
– Appropriate benchmark for OLAP servers.

Outline
• Rules for MDB Design
• Turning TPC-H Benchmark into a TPC-H*d: a
Multidimensional Benchmark
– Initial MDB Schema
– Performances Results
• Derived data based optimizations
– Workload Taxonomy
– Performance Results
• Related Work
• Conclusion
# 5
20th
of Dec.
2013

Problem
• Given,
– A relational warehouse schema
– A Workload composed of business queries
W = {Q1, Q2, …, Qn}
where Qi is a parameterized query
• Multidimensional DB Schema
– How to define OLAP cubes?
• Will there be a single cube or multiple cubes? Are there any rules
for definition of virtual cubes?
– Which optimizations are suitable for performance tuning?
• Derived data calculus & refresh?
• Data partitioning and parallel cube building?
# 6# 6
20th
of Dec.
2013

Idea
Map each business query to an OLAP cube
» Obtain an initial MDB schema composed of OLAP
cubes
Run Optimizations
» Derived Data: Derived attributes, Aggregate tables,
» Virtual Cubes
# 7# 7
20th
of Dec.
2013

SQL Stmt Template
SELECT t1.col_i,t2.col_j,…,tn.col_k
aggregate_function(column) as measure_1, …,
aggregate_function(expression) as measure_m
FROM table_1 t1, table_2 t2, …, table_n tn
WHERE ti.col_x operator $query_parameter$
AND ti.col_y = tj.col_z
AND …
GROUP BY t1.col_i, t2.col_j, …, tn.col_k;
# 8# 8
20th
of Dec.
2013
=, < , <=, >=, !=
min, max, sum, avg, count, count-distinct …

Rules for OLAP Cube Design
--Measures’ Definition
• Measures feature aggregate functions, such as min,
max,count,count-distinct,sum,avg, …
• Simple Measure
– Defined over a single attribute,
– Exple: SUM(l_extendedprice),
• Measure expressions
– Defined over multiple attributes,
– Exple: SUM(l_extendedprice*(1 - l_discount))
• Computed Members
– Defined over measures or measure expressions,
– Exple: M1=SUM(l_extendedprice),
M2=COUNT(l_orderkey), CM = M1 / M2
# 9# 9
20th
of Dec.
2013

--Fact Table Definition (1)
• All attributes involved in measures and measure expressions
belong to the fact table!
– Exple: Q10 of TPC-H benchmark,
# 10# 10# 10
20th
of Dec.
2013

• Case measurable attributes belong to different tables The
fact table is the join of tables, to which measurable attributes
belong!
– Exple 1: Q9 of TPC-H benchmark, where l_extendedprice,
l_discount and l_qty belong to lineitem, and
ps_supplycost belongs to partsupp.
– The fact table is the join of lineitem and partsupp tables.
Select attributes needed for join with dimension tables (namely,
l_partkey, l_orderkey, l_suppkey), and
measurable attributes (namely l_extendedprice,
l_discount ,l_quantity, ps_supplycost).
# 11
--Fact Table Definition over multiple tables (2)
# 11# 11# 11
20th
of Dec.
2013

# 12
Q9 SQL statement
# 12# 12# 12# 12# 12
20th
of Dec.
2013

# 13
# 13# 13# 13# 13
20th
of Dec.
2013

# 14
--Fact Table Definition & filters' processing (6)
# 14# 14# 14# 14
20th
of Dec.
2013
• Filters' Processing: the fact table is filtered using non-
multidimensional predicates,
– Extract all filters involving the fact table from the WHERE
clause, such as
• (attr_i operator attr_j), where both attr_i and attr_j
belong to the fact table,
• (attr_k operator $fixed value$), such that attr_k
belongs to the fact table,
• [not] exists (select … from … where attr_k …),
such that attr_k belongs to the fact table,
• attr_k [not] in (list of fixed values), such that
attr_k belongs to the fact table,
– Exple 1: Q10 of TPC-H benchmark

# 15
# 15# 15# 15# 15
20th
of Dec.
2013
• Filters' Processing: the fact table is filtered using non-
multidimensional predicates,
– Extract all filters involving the fact table from the WHERE
clause, such as
• (attr_i operator attr_j), where both attr_i and attr_j
belong to the fact table,
• (attr_k operator $fixed value$), such that attr_k
belongs to the fact table,
• [not] exists (select … from … where attr_k …),
such that attr_k belongs to the fact table,
• attr_k [not] in (list of fixed values), such that
attr_k belongs to the fact table,

# 16
# 16# 16# 16# 16
20th
of Dec.
2013
Q10 SQL statement

# 17
# 17# 17# 17# 17
20th
of Dec.
2013
Q16 SQL statement

# 18
# 18# 18# 18# 18
20th
of Dec.
2013
Q21 SQL statement

# 19
--Dimensions' Definition (1)
# 19# 19# 19# 19
20th
of Dec.
2013
First, consider all attributes in the SELECT, WHERE and
GROUP BY clauses,
– discard measurable attributes, which figure out in measures,
– discard attributes which figure out in the WHERE clause, and
are used for joining tables or filtering the fact table,
– Compose time dimension along well known time hierarchies,
• Year, quarter, month
– Compose geography dimension along well known locations'
hierarchies,
• Region, nation, city, district

# 20
# 20# 20# 20# 20
20th
of Dec.
2013
– Exple: Q10 of TPC-H benchmark, all highlighted attributes are
considered for dimensions' mount!
• Time dimension o_orderdate requires order_year and
order_quarter levels

# 21
# 21# 21# 21# 21
20th
of Dec.
2013
Second, find out hierarchical relations, i.e., one-to-many
relationships, and re-organize attributes along hierarchies to
form dimensions’ hierarchies,
– Example: Q10 of TPC-H benchmark
• Each customer can be related to at most one nation, but a
nation may be related to many customers,
customer_dim: n_name (customer_nation) > c_custkey,
c_name, c_acctbal, c_address, c_phone, c_comment,
• order_dim: order_year > order_quarter

# 22
# 22# 22# 22# 22
20th
of Dec.
2013
Third, distinguish levels from properties. Properties are in
functional dependency with levels,
– Example: Q10 of TPC-H benchmark
• For customer_dim, c_custkey is the level, and all of
c_name, c_acctbal, c_address, c_phone, c_comment
attributes are properties of c_custkey level.

# 23
--Dimensions' Definition & Filters' processing (5)
# 23# 23# 23# 23
20th
of Dec.
2013
• Filters Processing: not all tuples in the dimension table should
be considered, so we have to extract filters defined over
dimension tables from the WHERE clause not useful for
multidimensional design,
– Exple 1: Q12 of TPC-H Benchmark. The OLAP cube C12 counts
the nber of urgent and high priorities orders (hig line count), and
the nber of not urgent and not high priorities orders (low line
count) by line_ship_mode, line_receipt_year over orders facts,
and considering only lines such that commit_date < receipt_date
and ship_date < commit_date.

# 24
--Dimensions' Definition & Filters' processing (6)
# 24# 24# 24# 24
20th
of Dec.
2013
Q12 SQL statement

# 25
TPC-H*d
--Summary
# 25# 25# 25# 25
20th
of Dec.
2013
●
Multi-dimensional design of TPC-H benchmark
– Minimal changes to TPC-H relational DB schema
– Each SQL statement is mapped into an OLAP cube
●
TPC-H*d Workload
– 23 MDX statements for OLAP cubes' run
– 23 MDX statements for OLAP queries' run

# 26
TPC-H*d
--Screenshots of C10 and Q10 Pivot Tables
# 26# 26# 26# 26
20th
of Dec.
2013

# 27
Performance Results
--Software and Hardware Technologies
# 27# 27# 27# 27
20th
of Dec.
2013
French Grid Platform G5K
• Sophia site
• Suno nodes, 32 GB of memory, each CPU is Intel Xeon E5520,
• 2.27 GHz, with 2 CPUs per node and 4 cores per CPU
Relational DBMS
Mysql 5.1
Jpivot OLAP client
Servlet container
Mondrian ROLAP Server
Mondrian-3.5.0

# 28
Performance Results
--TPC-H*d for SF=10
# 28# 28# 28# 28
20th
of Dec.
2013
●
Over 22 business queries: 14 perform as Q1, 4 perform as
Q10, 2 perform as Q11, 2 perform as Q9
●
The system under test was unable to build big cubes related
to business queries: Q3, Q9, Q10, Q13, Q18 and Q20, either
for memory leaks or systems constraints (max crossjoin size:
2,147,483,647),
Query
workloadd
Cube-Query workload
cube query
Q1 2,147.33 2,777.49 0.29
Q10 7,100.24 n/a -
Q11 2,558.21 3,020.27 1,604.1
Q9 n/a n/a n/a

# 29
Optimizations based on Derived Data
--Aggregate Tables (1)
# 29# 29# 29# 29
20th
of Dec.
2013
• An aggregate table (a.k.a. Materialized view) summarizes
large number of detail rows into information that has a
coarser granularity, and so fewer rows.
– Allows faster query processing,
– Requires refresh: incremental refresh or a total rebuild.

# 30
--Derived Attributes (1)
# 30# 30# 30# 30
20th
of Dec.
2013
• Alter warehouse schema and calculate attributes,
• It should allow gain in performance, CPU and I/O, (some joins
are no more processed),
• Choose attributes which are not stale after data refresh or
refresh cost is not important,
• Exple of Q10
– Add o_sumlostrevenue for each order,
– This avoids join of LineItem and Orders relations. It saves CPU and I/O.

# 31
--Derived Attributes vs. OLAP Cube design (1)
# 31# 31# 31# 31
20th
of Dec.
2013
CUSTOMER
ORDERS
LINEITEM
TIME
NATION
CUSTOMER
ORDERS
TIME
NATION
C10, with o_lost_revenue
derived attribute
Original C10

# 32
Workload Taxonomy
# 32# 32# 32# 32
20th
of Dec.
2013
• 2 variables
– Cube dimensionality: size of the cross of dimensions’ sizes
– Cube density : ratio of the size of the materialized view and the size
of the cross of dimensions’ sizes
• Exple 1: Q4 of TPC-H benchmark
– Cube dimensionality: order_years (7) × nbr_quarters (4) ×
order_priorities (5) × 1 measure (count orders) = 140
• Cube size is SF independent
– Cube density: for SF=0.96 (more than 96% of the cells are not empty)
• dense cube

# 33
Workload Taxonomy (2)
# 33# 33# 33# 33
20th
of Dec.
2013
• Exple 2: Q10 of TPC-H benchmark,
– order_year (7) × nbr_quarters (4) × line_return_flag (3) × customer (SF ×
150,000: selected by 25 nation) × 1 measure = 12,600,000 × SF
• SF dependent
– with restriction return_flag=R (returned parts), order_year (7) ×
nbr_quarters (4) × line_return_flag (1) × customer (SF × 150,000: selected
by 25 nation) × 1 measure = 4,200,000 × SF
– Cube density: for SF=0.12
• sparse cube

# 34
Workload Taxonomy
--Recommendations
# 34# 34# 34# 34
20th
of Dec.
2013
Features TPC-H Business Questions (OLAP Cube)
Medium/high dimensionality
•Dense cube
•Result is not % TPC-H Scale
Factor independent
Build Aggregate Tables
Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q12, Q13,
Q14, Q16, Q19, Q22
13 business questions
•High dimensionality
•Sparse cube, few results,
lots of empty cells
Build Aggregate Tables
Q15, Q18
•High dimensionality
•Result % of Scale Factor
Add Derived Attributes
Q2, Q9, Q10, Q11, Q17, Q20, Q21

# 35
Performnce Results
--TPC-H*d for SF=10 with derived data (ms)
# 35# 35# 35# 35
20th
of Dec.
2013
Query
workload
Cube-Query workload
cube query
1.10 1.39 0.21
2.28 27.38 0.78
329.01 n/a -
2738.46 2723.85 1585.21
n/a n/a n/a
Q1
Q21
Q10
Q11
Q9
●
Response times of business queries of both workloads, for which
aggregate tables were built were improved.
●
The impact of derived attributes is mitigated. Performance results
show good improvements for Q10 and Q21, and small impact on
Q11 (saved operations are not complex).
Query
workload
Cube-Query workload
cube query
2,147.33 2,777.49 0.29
578.09 855.46 0.15
7,100.24 n/a -
2,558.21 3,020.27 1,604.1
n/a n/a n/a

# 36
Related Work
# 36# 36# 36# 36
20th
of Dec.
2013
• Methods for MDB design
– Not generalized, not tested, no empirical tests conducted by Niemi et al.
2011, Romero et al. 2006, Nair et al. 2007
• Variants of TPC-H benchmark
– Star Schema Benchmark (SSB) by O'Neil et al. 2009
• Turning the schema of TPC-H benchmark into a star schema(one fact
table and many dimensions)
• Almost half TPC-H benchmark,
• SQL workload, No OLAP cubes
– MS Analysis Services Test by La Brie et al. 2002
• Turning Q4 into an OLAP cube (MDX stmt)
• Performance measurements

# 37
Conclusion
# 37# 37# 37# 37
20th
of Dec.
2013
• We provided a framework for MDB design
– Tested for the TPC-H benchmark,
– TPC-H is the most prominent DSS benchmark
• Thorough experimentations
– Two workloads types :
• query-workload
• cube-then-query workload,
– Open source ROLAP server
– Different warehouse volumes SF=1,10

# 38
Future Work
# 38# 38# 38# 38
20th
of Dec.
2013
• Test framework with TPC-DS workload
– 7 data marts
– A hundred of business queries
• Investigate other optimization methods for OLAP over Big Data
– Approximate query processing through data synopsis calculus,

# 39
COMAD’2013@Ahmedabad.India
20th
of December, 2013
Thank you for your
Attention
Q & A
Multidimensional Database Design via Schema
Transformation
Turning TPC-H into the TPC-H*d Multidimensional Benchmark
Alfredo Cuzzocrea & Rim Moussa

# 40
TPC-H*d
--DB Schema
# 40# 40# 40# 40
20th
of Dec.
2013

# 41
--Q10 -Derived Attribute
# 41# 41# 41# 41
20th
of Dec.
2013

# 42
Performance Results
--Performance of Derived Data Calculus (ms)
# 42# 42# 42# 42
20th
of Dec.
2013
Single DB Backend
ps_isminimum
(PartSupp, Supplier, Nation,
Region are replicated )
862.4
ps_excess_YYYY
(PartSupp, Time are replicated
and LineItem is fragmented
into 4 fragments)
18,195.48
l_profit
(LineItem is fragmented into 4
fragments)
4,377.51
agg_c1 343.91
agg_c15 10,904.00

Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Multidimensional DB design, revolving TPC-H benchmark into OLAP bench

Similar to Multidimensional DB design, revolving TPC-H benchmark into OLAP bench (20)

More from Rim Moussa

More from Rim Moussa (14)

Recently uploaded

Recently uploaded (20)

Multidimensional DB design, revolving TPC-H benchmark into OLAP bench