Compsac 2018

24th July 2018, The 42nd IEEE COMPSAC @Tokyo
Towards Lambda-based Near Real-time OLAP over Big Data
Alfredo Cuzzocrea, University of Trieste and ICAR-CNR, Italy
Rim Moussa, LaTICE lab. Univ. of Tunis & Univ. of Carthage, Tunisia
The 42nd IEEE International Conference on Computers, Software and Applications @ Tokyo, Japan
24th of July, 2018

Context
↬ Data Warehouse Systems in the Big Data Era
●Variety: different forms of data to integrate
●Volume: processing big scale of historical data
●Velocity: data in motion, the DW must be refreshed!
●Veracity: processing uncertain data
●Value: decision making at the right time, based on all data

Outline
●Part I: Data Warehouse Systems
»DWS Architectures
»Data Summaries
●Part II: Big Data Summaries Refresh
»When and how to refresh?
»New Framework for Effective and Efficient Near-Real-Time OLAP over Big Data
●Lambda processing
●Factorized Streams' Processing
●Performance Evaluation
●Part III: Related Work
●Part IV: Conclusions & Future Work
»Conclusions
»FW1: CDC Data
»FW2: Data Synopsis

Part I: Data Warehouse Systems
↬ DWS Typical Architecture
↬ Data Summaries

DWS Typical Architecture

Performance Tuning
●Data Fragmentation
»Parallel I/O
»Parallel processing
●OLAP Indexes
●Data Summaries
»Materialized Views (a.k.a. Aggregate Tables)
»Derived Attributes

TPC-H Benchmark
●TPC Benchmarks
»The Transaction Processing Performance Council (TPC) was founded in 1988 to define benchmarks
»Examples of benchmarks relevant for benchmarking decision support systems: TPC-H, TPC-DS and TPC-DI
»Common characteristics of TPC benchmarks
●Synthetic data
●Scale factor allowing generation of different data volumes, from 1GB to 1PB
●TPC-H Benchmark
»Workload
●22 ad-hoc SQL statements (star queries, nested queries, …)
●Refresh functions

TPC-H Benchmark
--Relational Schema

Materialized View Example
Q12: Shipping Modes and Order Priority Query
●Q12 determines whether selecting less expensive modes of
shipping is negatively affecting the critical priority orders by
causing more parts to be received by customers after the
committed date
●The query counts, by ship mode, for lineitems actually received by
customers in a given year, the number of lineitems belonging to
orders for which the l_receiptdate exceeds the l_commitdate for
two different specified ship modes. Only lineitems that were
actually shipped before the l_commitdate are considered. The
late lineitems are partitioned into two groups, those with priority
URGENT or HIGH, and those with a priority other than URGENT or
HIGH.

Q12: SQL Statement

MV-Q12 SQL Statement
Equi-join
Scan of LineItem table
l_shipmode
l_receipt_year
Measure high_line_count
Measure low_line_count
|mv_q12| = #receipt-years × #ship-modes

Q12 -rewritten
Scan of mv_q12
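
The MV-Q12 idea can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the authors' SQL: the materialized view is emulated as a plain table, the tables are cut down to the columns Q12 needs, and the sample rows are invented. Only the names mv_q12, l_shipmode, l_receipt_year, high_line_count and low_line_count come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_orderpriority TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_shipmode TEXT,
                       l_shipdate TEXT, l_commitdate TEXT, l_receiptdate TEXT);
INSERT INTO orders VALUES (1, '1-URGENT'), (2, '3-MEDIUM'), (3, '2-HIGH');
INSERT INTO lineitem VALUES
  (1, 'MAIL', '1994-01-05', '1994-01-10', '1994-01-15'),
  (2, 'MAIL', '1994-02-01', '1994-02-10', '1994-02-20'),
  (3, 'SHIP', '1994-03-01', '1994-03-05', '1994-03-09'),
  (2, 'SHIP', '1995-03-01', '1995-03-05', '1995-03-09'),
  (1, 'MAIL', '1994-05-01', '1994-05-20', '1994-05-10');

-- Emulated materialized view: one row per (ship mode, receipt year)
CREATE TABLE mv_q12 AS
SELECT l_shipmode,
       CAST(strftime('%Y', l_receiptdate) AS INTEGER) AS l_receipt_year,
       SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS high_line_count,
       SUM(CASE WHEN o_orderpriority NOT IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS low_line_count
FROM orders JOIN lineitem ON o_orderkey = l_orderkey
WHERE l_commitdate < l_receiptdate AND l_shipdate < l_commitdate
GROUP BY l_shipmode, strftime('%Y', l_receiptdate);
""")

# Q12 over the base tables: equi-join plus scan of lineitem
Q12_BASE = """
SELECT l_shipmode,
       SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END),
       SUM(CASE WHEN o_orderpriority NOT IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END)
FROM orders JOIN lineitem ON o_orderkey = l_orderkey
WHERE l_shipmode IN ('MAIL', 'SHIP')
  AND l_commitdate < l_receiptdate AND l_shipdate < l_commitdate
  AND l_receiptdate >= '1994-01-01' AND l_receiptdate < '1995-01-01'
GROUP BY l_shipmode ORDER BY l_shipmode
"""

# Q12 rewritten: a scan of the small mv_q12 table only
Q12_REWRITTEN = """
SELECT l_shipmode, high_line_count, low_line_count
FROM mv_q12
WHERE l_shipmode IN ('MAIL', 'SHIP') AND l_receipt_year = 1994
ORDER BY l_shipmode
"""

base_rows = conn.execute(Q12_BASE).fetchall()
mv_rows = conn.execute(Q12_REWRITTEN).fetchall()
mv_size = conn.execute("SELECT COUNT(*) FROM mv_q12").fetchone()[0]
```

Both queries return the same rows, while the rewritten one touches a table whose size is bounded by the number of receipt years times the number of ship modes, independently of the size of lineitem.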

Derived Attribute Example
Q10: Returned Item Reporting Query
●Q10 identifies customers who might be having problems with the
parts that are shipped to them. The Returned Item Reporting
Query finds the top 20 customers, in terms of their effect on lost
revenue for a given quarter, who have returned parts. The
query considers only parts that were ordered in the specified
quarter.
●The query lists the customer's name, address, nation, phone
number, account balance, comment information and revenue lost.
The customers are listed in descending order of
lost revenue. Revenue lost is defined as
sum(l_extendedprice*(1-l_discount)) for all qualifying lineitems.

Q10: Resultset

Q10: SQL Statement
3 Equi-joins
1 invariable filter

Q10 rewritten
We propose adding an immutable attribute o_sum_lost_revenue to each order.
The query complexity is then reduced.
2 Equi-joins
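
A minimal sketch of the derived-attribute rewrite, using Python's sqlite3 module. It assumes that the invariable filter is l_returnflag = 'R' (as in the TPC-H definition of Q10) and that o_sum_lost_revenue stores, per order, the sum of l_extendedprice*(1-l_discount) over returned lineitems. The tables are reduced to a few columns and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nation (n_nationkey INTEGER PRIMARY KEY, n_name TEXT);
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT,
                       c_nationkey INTEGER);
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER,
                     o_orderdate TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                       l_discount REAL, l_returnflag TEXT);
INSERT INTO nation VALUES (1, 'ITALY'), (2, 'TUNISIA');
INSERT INTO customer VALUES (1, 'Customer#1', 1), (2, 'Customer#2', 2);
INSERT INTO orders VALUES (10, 1, '1993-10-05'), (11, 2, '1993-11-20'),
                          (12, 1, '1994-06-01');
INSERT INTO lineitem VALUES (10, 100.0, 0.5, 'R'), (10, 200.0, 0.25, 'R'),
                            (11, 100.0, 0.0, 'R'), (11, 400.0, 0.0, 'N'),
                            (12, 100.0, 0.0, 'R');

-- Derived attribute: lost revenue of each order, computed once
ALTER TABLE orders ADD COLUMN o_sum_lost_revenue REAL DEFAULT 0;
UPDATE orders SET o_sum_lost_revenue = (
    SELECT COALESCE(SUM(l_extendedprice * (1 - l_discount)), 0)
    FROM lineitem
    WHERE l_orderkey = orders.o_orderkey AND l_returnflag = 'R');
""")

# Original shape of Q10: 3 equi-joins and the filter on l_returnflag
Q10_BASE = """
SELECT c_name, n_name, SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer
JOIN orders ON c_custkey = o_custkey
JOIN lineitem ON l_orderkey = o_orderkey
JOIN nation ON c_nationkey = n_nationkey
WHERE o_orderdate >= '1993-10-01' AND o_orderdate < '1994-01-01'
  AND l_returnflag = 'R'
GROUP BY c_custkey, c_name, n_name
ORDER BY revenue DESC LIMIT 20
"""

# Rewritten Q10: lineitem is no longer joined, 2 equi-joins remain.
# Assumption: an order with returned items has a strictly positive sum.
Q10_REWRITTEN = """
SELECT c_name, n_name, SUM(o_sum_lost_revenue) AS revenue
FROM customer
JOIN orders ON c_custkey = o_custkey
JOIN nation ON c_nationkey = n_nationkey
WHERE o_orderdate >= '1993-10-01' AND o_orderdate < '1994-01-01'
  AND o_sum_lost_revenue > 0
GROUP BY c_custkey, c_name, n_name
ORDER BY revenue DESC LIMIT 20
"""

base_rows = conn.execute(Q10_BASE).fetchall()
rewritten_rows = conn.execute(Q10_REWRITTEN).fetchall()
```

On this toy data both statements rank the same customers with the same lost revenue; the rewritten one simply avoids the join with the largest table.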

Part II: Big Data Summaries Refresh
↬ Problem Statement
↬ Refresh Strategies (When?)
↬ Refresh Operations (How?)
↬ DW Maintenance Transaction
↬ A New Framework for Big Data Summaries

Problem Statement
Given,
●A relational data warehouse schema
●An OLAP workload
●Refresh streams triggering the DWS Maintenance Transaction
●Calculated attributes and materialized views for boosting performance of OLAP queries
»How to efficiently process refresh streams?
»When and how are data summaries refreshed?
»How to operate during a maintenance transaction?

MVs Storage Cost for TPC-H
Almost 1GB of materialized views and good query performance,
whatever the TPC-H scale factor, because MVs have fixed sizes

Derived Attributes Storage Cost for TPC-H (SF=10 ~ 11GB)
The cost is linear in the TPC-H table sizes,
and consequently in the TPC-H scale factor

●Refresh Strategies (When?)
●Eager refresh
»Derived attributes and materialized views are refreshed within the maintenance transaction. Hence, the data warehouse is coherent, at the expense of costly maintenance.
●Lazy refresh
»The refresh of calculated attributes and materialized views is delayed and is not part of the maintenance transaction. Thus, the data warehouse may be incoherent, in exchange for better performance.
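
The difference between the two strategies can be sketched in a few lines of Python. The Warehouse class, its attribute names and the per-ship-mode total used as data summary are all hypothetical; they only illustrate where the summary refresh sits relative to the maintenance transaction.

```python
class Warehouse:
    """Toy warehouse: a fact list plus one summary (total per ship mode)."""

    def __init__(self):
        self.facts = []     # (ship_mode, amount) rows
        self.summary = {}   # ship_mode -> total amount
        self.pending = []   # fresh rows not yet reflected in the summary

    def maintain(self, fresh_rows, eager):
        """Maintenance transaction: integrate fresh rows into the facts."""
        self.facts.extend(fresh_rows)
        if eager:
            # Eager: the summary is refreshed inside the transaction
            self._refresh(fresh_rows)
        else:
            # Lazy: the refresh is postponed, the transaction ends sooner
            self.pending.extend(fresh_rows)

    def _refresh(self, rows):
        for ship_mode, amount in rows:
            self.summary[ship_mode] = self.summary.get(ship_mode, 0) + amount

    def refresh_pending(self):
        """Deferred refresh, run outside the maintenance transaction."""
        self._refresh(self.pending)
        self.pending = []

    def is_coherent(self):
        return not self.pending


fresh = [("MAIL", 10), ("SHIP", 5)]

eager_dw = Warehouse()
eager_dw.maintain(fresh, eager=True)

lazy_dw = Warehouse()
lazy_dw.maintain(fresh, eager=False)
lazy_coherent_before = lazy_dw.is_coherent()
lazy_dw.refresh_pending()
```

After the eager transaction the summary already reflects the fresh rows; after the lazy one it does not until refresh_pending() is called.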

●Refresh Processing (How?)
●Incremental processing: an incremental refresh first executes a sophisticated merge of the old snapshot with a new snapshot built over fresh data and, if needed, relations in the warehouse; it then integrates the fresh data into the data warehouse.
●Full reprocessing: a full reprocessing integrates fresh data into the data warehouse, then recomputes data summaries.
●Hybrid processing: some parts require full reprocessing, while others can be incrementally refreshed.
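
As a rough illustration, the sketch below contrasts incremental processing and full reprocessing for a SUM summary, which can be maintained from fresh inserts alone. The function names and the (ship_mode, amount) row layout are assumptions made for this example.

```python
def build_snapshot(rows):
    """Aggregate (ship_mode, amount) rows into a summary snapshot."""
    snapshot = {}
    for ship_mode, amount in rows:
        snapshot[ship_mode] = snapshot.get(ship_mode, 0) + amount
    return snapshot


def incremental_refresh(facts, old_snapshot, fresh_rows):
    """Merge the old snapshot with a snapshot built over fresh data only,
    then integrate the fresh data into the warehouse."""
    merged = dict(old_snapshot)
    for ship_mode, total in build_snapshot(fresh_rows).items():
        merged[ship_mode] = merged.get(ship_mode, 0) + total
    facts.extend(fresh_rows)
    return merged


def full_reprocessing(facts, fresh_rows):
    """Integrate the fresh data, then recompute the summary from scratch."""
    facts.extend(fresh_rows)
    return build_snapshot(facts)


warehouse = [("MAIL", 10), ("SHIP", 5)]
old = build_snapshot(warehouse)
fresh = [("MAIL", 3), ("RAIL", 7)]

incremental = incremental_refresh(list(warehouse), old, fresh)
recomputed = full_reprocessing(list(warehouse), fresh)
```

Both paths yield the same summary, but the incremental one reads only the fresh rows and the old snapshot. Summaries that cannot be derived from a delta alone are the parts that would still need full reprocessing in the hybrid case.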

Data Warehouse Refresh
8-step process for handling a Maintenance Transaction
① Copy fresh data to the staging area
»The staging area is an intermediate storage area used for data processing during the data integration process.
Transformations include: cleaning, de-duplication, data format conversion, derivation of new calculated values from existing data, filtering, joining, splitting, and so forth.
③ Preparing transformed fresh data
»Prepare the insertion of transformed fresh data, usually by disabling reference constraints and entity constraints, thus letting indexes accelerate data warehouse insertion performance.
④ Inserting fresh data into the data warehouse
»In some cases, it is necessary to merge fresh and stale data, indicate the time of the last data update, or maintain multiple data versions in order to handle Change Data Capture (CDC) suitably.

Data Warehouse Refresh
8-step process
Re-enable reference constraints, entity constraints and other kinds of constraints over inserted data.
⑤ Validating inserted data
»Validate inserted data and process the different alerts (e.g., constraint violations). Alerts may need human intervention.
⑦ Refreshing indexes
⑧ Refreshing data summaries
»Refresh auxiliary structures, such as materialized views, over inserted data.

●Headlines of our Contribution
Ultimate Goals
① Improve query performance
② Improve query accuracy w.r.t. fresh data
③ Ensure that the DWS is operational during the
Maintenance Transaction

Headlines of our Contribution
--How To?
① Perform delta computations for calculating delta views (inspired by the well-known Lambda architecture)
② Factorization of stream processing for fast computation of delta views
③ Postponement of the data warehouse maintenance transaction to an opportune time (still based on a cost-aware analysis).

TPC-H Benchmark
--Types of Tests
●Power Test: a single-user environment in which queries and update functions run one at a time.
●Throughput Test: measures the ability of the system to process concurrent queries and update functions in a multi-user environment.
User #1> Query Set #1 …
User #2> Query Set #2 …
...
User #i> Query Set #i …
Refresh #1, Refresh #2, ..., Refresh #j
time
< Query Set > < Refresh #1 > < Query Set > < Refresh #2 > …
time

Lambda Architecture by Nathan Marz

Inserts' Refresh Stream Analysis

Deletes' Refresh Stream Analysis

Q10 processing RF1 refresh stream
Q10@ batch layer, i.e. DWS
Q10@ batch layer and speed layer
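
A hedged Python sketch of how Q10 could be answered from both layers: the batch layer holds the lost revenue per customer already computed over the DWS, and the speed layer computes a delta over an RF1 batch only. The function names, the dictionary row layout and the assumption that RF1 lineitems reference only the newly inserted orders are illustrative choices, not taken from the slides.

```python
from collections import Counter


def q10_delta(new_orders, new_lineitems, quarter_start, quarter_end):
    """Speed layer: lost revenue per customer over an RF1 batch only."""
    # New orders placed in the requested quarter (ISO dates compare as text)
    in_quarter = {o["o_orderkey"]: o["o_custkey"] for o in new_orders
                  if quarter_start <= o["o_orderdate"] < quarter_end}
    delta = Counter()
    for li in new_lineitems:
        cust = in_quarter.get(li["l_orderkey"])
        if cust is not None and li["l_returnflag"] == "R":
            delta[cust] += li["l_extendedprice"] * (1 - li["l_discount"])
    return delta


def q10_serving(batch_view, delta, k=20):
    """Merge the batch view with the delta and return the top-k customers."""
    merged = Counter(batch_view)
    merged.update(delta)  # adds the delta amounts to the batch amounts
    return merged.most_common(k)


# Batch layer result: customer key -> lost revenue over the DWS
batch_view = {1: 200.0, 2: 100.0}

# One RF1 batch: a new order with one returned lineitem
rf1_orders = [{"o_orderkey": 20, "o_custkey": 2, "o_orderdate": "1993-12-01"}]
rf1_lineitems = [{"l_orderkey": 20, "l_extendedprice": 300.0,
                  "l_discount": 0.5, "l_returnflag": "R"}]

delta = q10_delta(rf1_orders, rf1_lineitems, "1993-10-01", "1994-01-01")
top_customers = q10_serving(batch_view, delta)
```

The batch view is never recomputed here; only the small RF1 batch is scanned, and the ranking changes once the delta is merged in.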

Q19 SQL Statement

MV-Q19
Q19@ speed layer: Delta-RF1-MV-Q19
Q19@ serving layer
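
The three names on this slide can be read as: a batch view (MV-Q19), a delta computed over the RF1 stream (Delta-RF1-MV-Q19), and a serving-layer value that combines them. The Python sketch below follows that reading. The predicate is a deliberately simplified, single-branch stand-in for the Q19 predicate (the benchmark query has several branches), and all function names and sample values are hypothetical.

```python
def q19_matches(lineitem, part):
    """Simplified, single-branch stand-in for the Q19 predicate."""
    return (part["p_brand"] == "Brand#12"
            and part["p_container"] in ("SM CASE", "SM BOX", "SM PACK", "SM PKG")
            and 1 <= lineitem["l_quantity"] <= 11
            and lineitem["l_shipmode"] in ("AIR", "AIR REG")
            and lineitem["l_shipinstruct"] == "DELIVER IN PERSON")


def delta_rf1_mv_q19(new_lineitems, parts):
    """Speed layer: discounted revenue of the qualifying new lineitems.
    The new lineitems are joined with the static part table by key lookup."""
    return sum(li["l_extendedprice"] * (1 - li["l_discount"])
               for li in new_lineitems
               if q19_matches(li, parts[li["l_partkey"]]))


def q19_serving(mv_q19, delta):
    """Serving layer: batch view plus the delta from the speed layer."""
    return mv_q19 + delta


parts = {1: {"p_brand": "Brand#12", "p_container": "SM BOX"},
         2: {"p_brand": "Brand#23", "p_container": "MED BAG"}}
rf1_lineitems = [
    {"l_partkey": 1, "l_quantity": 5, "l_extendedprice": 100.0,
     "l_discount": 0.5, "l_shipmode": "AIR",
     "l_shipinstruct": "DELIVER IN PERSON"},
    {"l_partkey": 2, "l_quantity": 5, "l_extendedprice": 100.0,
     "l_discount": 0.0, "l_shipmode": "AIR",
     "l_shipinstruct": "DELIVER IN PERSON"},
]

mv_q19 = 1000.0  # revenue already materialized in the batch layer
delta = delta_rf1_mv_q19(rf1_lineitems, parts)
serving_value = q19_serving(mv_q19, delta)
```

Because the Q19 measure is a plain sum, adding the delta to the materialized value is enough to serve a fresh answer without touching the batch layer.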

TPC-H Workload Analysis
Each query performs a set of relational ops

Non-optimized stream processing
[Figure: non-optimized stream processing plan over the LineItem<i> and Orders<i> streams and the relations R and R', with joins (⋈) and selections p, q, r, s]
Each query performs a set of relational ops on R (and R') @batch layer, as well as on new streams of LineItem @speed layer and new streams of Orders @speed layer.

Optimized Stream Processing
[Figure: optimized stream processing plan over the LineItem<i> and Orders<i> streams and the relations R and R', with joins (⋈) and combined selections (s ∧ q, r ∧ p)]
For instance, new streams of LineItem and new streams of Orders are joined, … Then, each query operates on the result set of this join.
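
The factorization described above can be sketched as follows: the join of the two refresh streams is computed once, and every query derives its delta from that shared result. The function names and the two toy delta queries are assumptions made for illustration; they are far simpler than the actual TPC-H queries.

```python
def join_streams(new_orders, new_lineitems):
    """Join the Orders and LineItem refresh streams once, on the order key."""
    orders_by_key = {o["o_orderkey"]: o for o in new_orders}
    return [{**orders_by_key[li["l_orderkey"]], **li}
            for li in new_lineitems if li["l_orderkey"] in orders_by_key]


def q10_delta(rows):
    """Toy delta for Q10: lost revenue over returned lineitems."""
    return sum(r["l_extendedprice"] * (1 - r["l_discount"])
               for r in rows if r["l_returnflag"] == "R")


def q12_delta(rows):
    """Toy delta for Q12: number of lineitems received after the commit date."""
    return sum(1 for r in rows if r["l_commitdate"] < r["l_receiptdate"])


def process_refresh(new_orders, new_lineitems, queries):
    """Factorized processing: one shared join, then one pass per query."""
    joined = join_streams(new_orders, new_lineitems)  # computed only once
    return {name: query(joined) for name, query in queries.items()}


rf1_orders = [{"o_orderkey": 100, "o_orderpriority": "1-URGENT"}]
rf1_lineitems = [
    {"l_orderkey": 100, "l_extendedprice": 100.0, "l_discount": 0.5,
     "l_returnflag": "R", "l_commitdate": "1994-01-10",
     "l_receiptdate": "1994-01-15"},
    {"l_orderkey": 100, "l_extendedprice": 80.0, "l_discount": 0.0,
     "l_returnflag": "N", "l_commitdate": "1994-01-10",
     "l_receiptdate": "1994-01-08"},
]

deltas = process_refresh(rf1_orders, rf1_lineitems,
                         {"q10": q10_delta, "q12": q12_delta})
```

In the non-optimized plan each query would repeat the join on its own; here the cost of the join is paid once per refresh batch, whatever the number of queries.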

Performance Results
●Environment
»3 nodes with MonetDB, a relational column-oriented DBMS
»TPC-H SF=10
»Each node has 16GB of RAM
●Q10: query performance enhanced using a derived attribute
»Elapsed time without derived attributes: 12.69s
»Elapsed time to process Q10 using the derived attribute o_sum_lost_revenue: 7.11s
»Elapsed time to process a refresh stream of type RF1: 0.02s
»Elapsed time to process Q10 from the serving layer, after a refresh of type RF1: 11.57s

Performance Results
●Q12: query performance enhanced using a materialized view
»Elapsed time without MVs: 0.79s
»Elapsed time to build MV-12: 3.52s
»Elapsed time to process Q12 using MV-12: 0.023s
»Elapsed time to refresh MV-12 after RF1: 0.01s
»Elapsed time to refresh SV-12 (service view) after RF1: 0.002s
»Elapsed time to process Q12 using SV-12: 0.009s
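
To put the reported timings in perspective, the short Python computation below derives speedup ratios and a break-even point from them. The elapsed times are the ones listed on the slides; the ratios and the break-even count are derived here and were not reported by the authors.

```python
import math


def speedup(before_s, after_s):
    """Ratio of two elapsed times, in seconds."""
    return before_s / after_s


# Q10: without vs. with the derived attribute
q10_speedup = speedup(12.69, 7.11)      # about 1.78x

# Q12: base query vs. MV-12, and base query vs. SV-12
q12_mv_speedup = speedup(0.79, 0.023)   # about 34x
q12_sv_speedup = speedup(0.79, 0.009)   # about 88x

# Number of Q12 executions after which building MV-12 (3.52s) pays off
saving_per_query = 0.79 - 0.023
break_even_queries = math.ceil(3.52 / saving_per_query)
```

Under these figures, the one-off cost of building MV-12 is recovered after five executions of Q12.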

Part III: Related Work

Related Work
●CR-OLAP by Dehne and Zaboli @CCGRID'2012
»Real-time OLAP system based on a distributed index structure for OLAP: the distributed PDCR tree.
»Summary data maintenance is not investigated.
●R-Store by Li et al. @ICDE'2014
»A scalable distributed system for supporting real-time analytics, which periodically materializes real-time data into a data cube.
»R-Store uses HBase for data storage and MapReduce for query processing, and implements MVCC (Multi-Version Concurrency Control) to support real-time OLAP.
●Ferreira, Cuzzocrea et al. @DaWaK'2014
»Propose a Rewrite/Merge approach for real-time data warehousing.

Related Work
●Mesa (Google) @VLDB'2014
»Mesa: a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business.
»Mesa satisfies near real-time data ingestion and queryability requirements. It supports continuous updates, which should be available for querying consistently across different views within minutes.
●Pinot (LinkedIn) 2014
»Real-time distributed column-oriented OLAP datastore. It implements bitmaps and inverted indexes.
»It is suited for analytical use cases on immutable append-only data with exclusively selection, aggregation, filtering, group by, order by, distinct queries on fact data (no complex joins).

Related Work
●Druid @SIGMOD'2014
»open source distributed column-oriented data store
designed for real-time exploratory analytics on large data
sets of events.

Part IV: Conclusions & Future Work

Conclusions
●Our proposed framework proved effective and efficient for near-real-time OLAP

Future Work
●Extend the framework to Change Data Capture (CDC) data
»TPC-H refresh functions are simple
●RF1: inserts of new orders and new lineitems
●RF2: deletes of orders
»CDC: data already exists in the DW and its value might have changed
●Investigate the application of the framework
»on SQL-on-Hadoop systems (Impala, SparkSQL...)
»to other data synopses, like histograms and sketches
●Application of the framework to the TPC-DS benchmark
»The TPC-DS schema comprises 7 data marts, and the sizes of multiple tables depend on the scale factor
»The TPC-DS workload comprises about a hundred queries

Thank you for your Attention
Q & A
Towards Lambda-based Near Real-time OLAP
over Big Data
Alfredo Cuzzocrea and Rim Moussa
24th
of July, 2018
The 42nd
IEEE International Conference on Computers, Software and
Applications @ Tokyo, Japan
