24th
July 2018 The 42nd
IEEE COMPSAC @Tokyo 1
Towards Lambda-based Near Real-time OLAP over Big Data
Alfredo Cuzzocrea, University of Trieste and ICAR-CNR, Italy
Rim Moussa, LaTICE lab. Univ. of Tunis & Univ. of Carthage, Tunisia
The 42nd
IEEE International Conference on Computers,
Software and Applications @ Tokyo, Japan
24th
of July, 2018
Context
↬ Data Warehouse Systems in the Big Data Era
●Variety: different forms of data to integrate
●Volume: processing historical data at big scale
●Velocity: data in motion. Must refresh DW!
●Veracity: processing uncertain data
●Value: decision making at the right time, based on all data
[Figure: the five V's surrounding the Data Warehouse System]
Outline
●Part I: Data Warehouse Systems
»DWS Architectures
»Data Summaries
●Part II: Big Data Summaries Refresh
»When and how to refresh?
»New Framework for Effective and Efficient near real-time OLAP
over Big Data
●Lambda processing
●Factorized Streams' Processing
●Performance Evaluation
●Part III: Related Work
●Part IV: Conclusions & Future Work
»Conclusions
»FW1: CDC Data
»FW2: Data Synopsis
Part I: Data Warehouse Systems
↬ DWS Typical Architecture
↬ Data Summaries
DWS Typical Architecture
Performance Tuning
●Data Fragmentation
»Parallel I/O
»Parallel processing
●OLAP Indexes
●Data Summaries
»Materialized Views (a.k.a. Aggregate Tables)
»Derived Attributes
TPC-H Benchmark
●TPC Benchmarks
»The Transaction Processing Performance Council (TPC) was founded in 1988
to define benchmarks
»Examples of benchmarks relevant for benchmarking decision
support systems: TPC-H, TPC-DS and TPC-DI
»Common characteristics of TPC benchmarks
●Synthetic data
●Scale factor allowing generation of different volumes 1GB to 1PB
●TPC-H Benchmark
»Workload
●22 ad-hoc SQL statements (star queries, nested queries, …)
●Refresh functions
TPC-H Benchmark
--Relational Schema
Materialized View Example
Q12: Shipping Modes and Order Priority Query
●Q12 determines whether selecting less expensive modes of
shipping is negatively affecting the critical priority orders by
causing more parts to be received by customers after the
committed date
●The query counts, by ship mode, for lineitems actually received by
customers in a given year, the number of lineitems belonging to
orders for which the l_receiptdate exceeds the l_commitdate for
two different specified ship modes. Only lineitems that were
actually shipped before the l_commitdate are considered. The
late lineitems are partitioned into two groups, those with priority
URGENT or HIGH, and those with a priority other than URGENT or
HIGH.
Materialized View Example
Q12: SQL Statement
Materialized View Example
MV-Q12 SQL Statement
Equi-join
Scan of LineItem table
l_shipmode
l_receipt_year
Measure high_line_count
Measure low_line_count
|mv_q12| = #receipt-years × #ship-modes
Materialized View Example
Q12 -rewritten
Scan of mv_q12
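The materialized-view idea can be sketched as follows. This is a minimal illustration, not the authors' MonetDB code: it uses an in-memory SQLite database, a handful of made-up sample rows, and a plain aggregate table standing in for the materialized view (SQLite has no native materialized views). Table and column names follow TPC-H and the slide annotations; the helper q12 and the sample data are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_orderpriority TEXT);
CREATE TABLE lineitem (
    l_orderkey INTEGER, l_shipmode TEXT,
    l_shipdate TEXT, l_commitdate TEXT, l_receiptdate TEXT);
INSERT INTO orders VALUES (1, '1-URGENT'), (2, '3-MEDIUM'), (3, '2-HIGH');
INSERT INTO lineitem VALUES
    (1, 'MAIL', '1994-01-05', '1994-01-10', '1994-01-20'),  -- late, urgent
    (2, 'MAIL', '1994-02-01', '1994-02-10', '1994-02-15'),  -- late, medium
    (3, 'SHIP', '1994-03-01', '1994-03-05', '1994-03-09'),  -- late, high
    (3, 'SHIP', '1995-03-01', '1995-03-05', '1995-03-04');  -- on time, filtered out

-- Aggregate table standing in for the materialized view:
-- one equi-join and one scan of lineitem, paid once at build time.
CREATE TABLE mv_q12 AS
SELECT l_shipmode,
       CAST(strftime('%Y', l_receiptdate) AS INTEGER) AS l_receipt_year,
       SUM(CASE WHEN o_orderpriority IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS high_line_count,
       SUM(CASE WHEN o_orderpriority NOT IN ('1-URGENT', '2-HIGH')
                THEN 1 ELSE 0 END) AS low_line_count
FROM orders JOIN lineitem ON o_orderkey = l_orderkey
WHERE l_commitdate < l_receiptdate AND l_shipdate < l_commitdate
GROUP BY l_shipmode, l_receipt_year;
""")


def q12(con, ship_modes, year):
    """Rewritten Q12: a scan of mv_q12 instead of a join over the base tables."""
    marks = ",".join("?" for _ in ship_modes)
    return con.execute(
        f"SELECT l_shipmode, high_line_count, low_line_count FROM mv_q12 "
        f"WHERE l_shipmode IN ({marks}) AND l_receipt_year = ? "
        f"ORDER BY l_shipmode",
        (*ship_modes, year),
    ).fetchall()


result = q12(con, ["MAIL", "SHIP"], 1994)
mv_rows = con.execute("SELECT COUNT(*) FROM mv_q12").fetchone()[0]
print(result, mv_rows)
```

The view holds at most one row per (receipt year, ship mode) pair, which is why its size does not grow with the number of lineitems.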
Derived Attribute Example
Q10: Returned Item Reporting Query
●Q10 identifies customers who might be having problems with the
parts that are shipped to them. The Returned Item Reporting
Query finds the top 20 customers, in terms of their effect on lost
revenue for a given quarter, who have returned parts. The
query considers only parts that were ordered in the specified
quarter.
●The query lists the customer's name, address, nation, phone
number, account balance, comment information and revenue lost.
The customers are listed in descending order of
lost revenue. Revenue lost is defined as
sum(l_extendedprice*(1-l_discount)) for all qualifying lineitems.
Derived Attribute Example
Q10: Resultset
Derived Attribute Example
Q10: SQL Statement
3 Equi-joins
1 invariant filter
Derived Attribute Example
Q10 rewritten
We propose adding an immutable attribute o_sum_lost_revenue
to each order.
The query complexity is then reduced:
2 equi-joins instead of 3.
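A minimal sketch of the derived-attribute rewrite, under stated assumptions: an in-memory SQLite database with made-up rows, and a simplified Q10 that omits the nation table and most output columns (so the join count drops from 2 to 1 here, rather than from 3 to 2). Only o_sum_lost_revenue comes from the slides; everything else is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER,
                     o_orderdate TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                       l_discount REAL, l_returnflag TEXT);
INSERT INTO customer VALUES (1, 'Customer#1'), (2, 'Customer#2');
INSERT INTO orders VALUES
    (10, 1, '1993-10-05'), (11, 2, '1993-11-20'), (12, 1, '1994-06-01');
INSERT INTO lineitem VALUES
    (10, 100.0, 0.5, 'R'), (10, 200.0, 0.0, 'N'),
    (11, 400.0, 0.25, 'R'), (12, 80.0, 0.0, 'R');

-- Derived attribute: lost revenue of the returned lineitems of each order.
ALTER TABLE orders ADD COLUMN o_sum_lost_revenue REAL DEFAULT 0;
UPDATE orders SET o_sum_lost_revenue = (
    SELECT COALESCE(SUM(l_extendedprice * (1 - l_discount)), 0)
    FROM lineitem
    WHERE l_orderkey = o_orderkey AND l_returnflag = 'R');
""")

# Simplified original query: joins lineitem and applies the fixed filter.
ORIGINAL = """
SELECT c_custkey, c_name, SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer JOIN orders ON c_custkey = o_custkey
              JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_orderdate >= ? AND o_orderdate < ? AND l_returnflag = 'R'
GROUP BY c_custkey, c_name ORDER BY revenue DESC LIMIT 20"""

# Rewritten query: the lineitem join is replaced by the derived attribute.
# The "> 0" test drops orders without returned lineitems.
REWRITTEN = """
SELECT c_custkey, c_name, SUM(o_sum_lost_revenue) AS revenue
FROM customer JOIN orders ON c_custkey = o_custkey
WHERE o_orderdate >= ? AND o_orderdate < ? AND o_sum_lost_revenue > 0
GROUP BY c_custkey, c_name ORDER BY revenue DESC LIMIT 20"""

quarter = ("1993-10-01", "1994-01-01")
original = con.execute(ORIGINAL, quarter).fetchall()
rewritten = con.execute(REWRITTEN, quarter).fetchall()
print(rewritten)
```

The attribute can be treated as immutable only as long as the lineitems of an existing order never change, which holds for insert-only refreshes of new orders.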
Part II: Big Data Summaries Refresh
↬ Problem Statement
↬ Refresh Strategies (When?)
↬ Refresh Operations (How?)
↬ DW Maintenance Transaction
↬ A New Framework for Big Data Summaries
Problem Statement
Given,
●A relational data warehouse schema
●An OLAP workload
●Refresh streams triggering the DWS Maintenance
Transaction
●Calculated attributes and materialized views for
boosting performance of OLAP queries
»How to process refresh streams efficiently?
»When and how are data summaries refreshed?
»How to operate during a maintenance transaction?
MVs Storage Cost for TPC-H
Almost 1GB of materialized views
and good query performance,
whatever the TPC-H scale factor,
because MVs have fixed sizes
Derived Attributes Storage Cost for TPC-H (SF=10, ~11GB)
The cost is linear in the TPC-H table sizes,
and consequently in the TPC-H scale factor
●Refresh Strategies (When?)
●Eager refresh:
»Derived attributes and materialized views are
refreshed within the maintenance transaction. Hence,
the data warehouse is coherent, at the expense of
costly maintenance.
●Lazy refresh:
»The refresh of calculated attributes and materialized
views is delayed and is not part of the maintenance
transaction. Thus, the data warehouse is temporarily
incoherent, in exchange for better performance.
●Refresh Processing (How?)
●Incremental processing: an incremental refresh first
executes a sophisticated merge of the old snapshot with a
new snapshot built over the fresh data (and, if needed,
relations in the warehouse), and second integrates the
fresh data in the data warehouse.
●Full reprocessing: a full reprocessing integrates fresh
data in the data warehouse, then recomputes data
summaries.
●Hybrid processing: some parts require full reprocessing,
while others can be incrementally refreshed.
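The difference between the two processing modes can be sketched with a toy count summary. This is an illustration under assumptions, not the framework's implementation: the summary is a per-(ship mode, receipt year) count, the refresh is insert-only, and all function names are hypothetical.

```python
from collections import Counter


def summarize(rows):
    """Data summary: number of lineitems per (ship mode, receipt year)."""
    return Counter((mode, year) for mode, year in rows)


def full_reprocessing(warehouse, fresh):
    """Integrate fresh data first, then recompute the summary from scratch."""
    warehouse = warehouse + fresh
    return warehouse, summarize(warehouse)


def incremental_refresh(warehouse, summary, fresh):
    """Merge the old snapshot with a snapshot built over fresh data only,
    then integrate the fresh data."""
    merged = summary + summarize(fresh)
    return warehouse + fresh, merged


warehouse = [("MAIL", 1994), ("MAIL", 1994), ("SHIP", 1995)]
fresh = [("MAIL", 1994), ("RAIL", 1995)]

_, full = full_reprocessing(warehouse, fresh)
_, incremental = incremental_refresh(warehouse, summarize(warehouse), fresh)
print(incremental == full)
```

The merge is a plain addition here because counts are distributive; summaries such as medians or distinct counts cannot be merged this way and fall under full or hybrid processing.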
Data Warehouse Refresh
8-step process for handling a Maintenance Transaction
① Copy fresh data to the staging area
Staging area is an intermediate storage area used for data
processing during the data integration process.
Transformations include: cleaning, de-duplication, data format
conversion, derivation of new calculated values from existing
data, filtering, joining, splitting, and so forth.
③ Preparing transformed fresh data
Prepare the insertion of transformed fresh data, usually by
disabling reference constraints and entity constraints, thus
making indexes able to accelerate data warehouse insertion
performance.
④ Inserting fresh data into the data warehouse
In some cases, it is necessary to merge fresh and stale data,
indicate the time of last data update or maintain multiple data
versions in order to handle Change Data Capture (CDC) suitably.
Data Warehouse Refresh
8-step process
⑤ Validating inserted data
Re-enable reference constraints, entity constraints and other
kinds of constraints over inserted data.
Validate inserted data and process the different alerts (e.g.,
constraint violations). Alerts may need human intervention.
⑦ Refreshing indexes
⑧ Refreshing data summaries
Refresh auxiliary structures, such as materialized views, over
inserted data.
●Headlines of our Contribution
Ultimate Goals
① Improve query performance
② Improve query accuracy w.r.t. fresh data
③ Ensure that the DWS is operational during the Maintenance Transaction
Headlines of our Contribution
--How To?
① Perform delta computations for calculating delta
views (inspired by the well-known Lambda architecture)
② Factorization of stream processing for fast
computation of delta views
③ Postponement of the data warehouse maintenance
transaction to an opportune time (still based on a
cost-aware analysis).
TPC-H Benchmark
--Types of Tests
●Power Test: a single-user environment in which queries and update
functions run one at a time.
●Throughput Test: measures the ability of the system to process
concurrent queries and update functions in a multi-user
environment.
[Figure: two timelines. In the first, users #1 to #i each run their own query set (Query Set #1 to Query Set #i) while refreshes #1 to #j are processed. In the second, query sets and refreshes alternate: < Query Set > < Refresh #1 > < Query Set > < Refresh #2 > ….]
Lambda Architecture by Nathan Marz
Inserts' Refresh Stream Analysis
Deletes' Refresh Stream Analysis
Q10 processing RF1 refresh stream
Q10@ batch layer, i.e., DWS
Q10@ batch layer and speed layer
Q19 SQL Statement
MV-Q19
Q19@ speed layer: Delta-RF1-MV-Q19
Q19@ serving layer
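The three layers named above can be sketched as follows. The slides do not show the definition of MV-Q19, so the layout used here (revenue per brand) is purely hypothetical, as is the pre-joined refresh table rf1_lineitem; the sketch only illustrates how a serving-layer view can merge a batch-layer view with a delta view computed over the RF1 stream, using an in-memory SQLite database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Batch layer: view precomputed over the warehouse (hypothetical layout).
CREATE TABLE mv_q19 (p_brand TEXT PRIMARY KEY, revenue REAL);
INSERT INTO mv_q19 VALUES ('Brand#12', 1000.0), ('Brand#23', 500.0);

-- Speed layer input: fresh RF1 lineitems not yet integrated in the warehouse
-- (assumed to be already joined with their part's brand).
CREATE TABLE rf1_lineitem (p_brand TEXT, l_extendedprice REAL, l_discount REAL);
INSERT INTO rf1_lineitem VALUES
    ('Brand#12', 200.0, 0.5), ('Brand#34', 40.0, 0.0), ('Brand#12', 100.0, 0.0);

-- Speed layer: delta view computed over the refresh stream only.
CREATE VIEW delta_rf1_mv_q19 AS
SELECT p_brand, SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM rf1_lineitem GROUP BY p_brand;

-- Serving layer: merge of the batch view and the delta view.
CREATE VIEW sv_q19 AS
SELECT p_brand, SUM(revenue) AS revenue
FROM (SELECT * FROM mv_q19 UNION ALL SELECT * FROM delta_rf1_mv_q19)
GROUP BY p_brand;
""")

serving = con.execute(
    "SELECT p_brand, revenue FROM sv_q19 ORDER BY p_brand").fetchall()
print(serving)
```

Queries read sv_q19, so they see the fresh rows even though the warehouse tables and mv_q19 have not been touched; the maintenance transaction can then be run later.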
TPC-H Workload Analysis
Each query performs a set of relational ops
Non-optimized stream processing
[Figure: per-query operator trees (joins ᐅᐊ, operators p, q, r, s) over the streams LineItem<i> and Orders<i> and the relations R and R']
Each query performs a set of relational ops on R (and R') @batch layer,
as well as on new streams of LineItem @speed layer
and new streams of Orders @speed layer
Optimized Stream Processing
[Figure: shared operator tree in which the streams LineItem<i> and Orders<i> are joined (ᐅᐊ) once, with operators s ∧ q and r ∧ p, before the joins with R and R']
For instance, new streams of LineItem and new streams of Orders
are joined, … Then, each query performs its relational ops
on the result set of this join
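The factorization can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the two queries are hypothetical and reduced to a filter plus an aggregate, the streams are small lists of dictionaries, and the function names are assumptions. It shows the join of the fresh Orders and LineItem batches being computed once and shared, instead of once per query.

```python
def join_streams(orders, lineitems):
    """Hash join of a batch of fresh Orders with a batch of fresh LineItem rows."""
    by_key = {o["o_orderkey"]: o for o in orders}
    return [{**by_key[l["l_orderkey"]], **l}
            for l in lineitems if l["l_orderkey"] in by_key]


# Two hypothetical queries, each reduced to a filter plus an aggregate.
QUERIES = {
    "late_lines": lambda rows: sum(
        1 for r in rows if r["l_receiptdate"] > r["l_commitdate"]),
    "urgent_revenue": lambda rows: sum(
        r["l_extendedprice"] for r in rows
        if r["o_orderpriority"] == "1-URGENT"),
}


def non_optimized(orders, lineitems):
    """Each query joins the fresh streams on its own: one join per query."""
    return {name: q(join_streams(orders, lineitems))
            for name, q in QUERIES.items()}


def optimized(orders, lineitems):
    """The join is factorized: computed once, then shared by all queries."""
    joined = join_streams(orders, lineitems)
    return {name: q(joined) for name, q in QUERIES.items()}


orders = [
    {"o_orderkey": 1, "o_orderpriority": "1-URGENT"},
    {"o_orderkey": 2, "o_orderpriority": "5-LOW"},
]
lineitems = [
    {"l_orderkey": 1, "l_extendedprice": 100.0,
     "l_commitdate": "1994-01-10", "l_receiptdate": "1994-01-20"},
    {"l_orderkey": 2, "l_extendedprice": 50.0,
     "l_commitdate": "1994-02-10", "l_receiptdate": "1994-02-05"},
    {"l_orderkey": 1, "l_extendedprice": 25.0,
     "l_commitdate": "1994-03-01", "l_receiptdate": "1994-03-01"},
]

shared = optimized(orders, lineitems)
separate = non_optimized(orders, lineitems)
print(shared)
```

Both variants return the same answers; the saving is in the number of joins executed per refresh batch, which grows with the number of queries in the non-optimized case.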
Performance Results
●Environment
»3 nodes running MonetDB, a relational column-oriented
DBMS
»TPC-H SF=10
»Each node has 16GB of RAM
●Q10 -query performance enhanced using a derived
attribute
»Elapsed time without derived attributes: 12.69s
»Elapsed time to process Q10 using the derived
attribute o_sum_lost_revenue: 7.11s
»Elapsed time to process an RF1 refresh stream:
0.02s
»Elapsed time to process Q10 from the serving layer,
after an RF1 refresh: 11.57s
Performance Results
●Q12 -query performance enhanced using a materialized
view
»Elapsed time without MVs: 0.79s
»Elapsed time to build MV-12: 3.52s
»Elapsed time to process Q12 using MV-12: 0.023s
»Elapsed time to refresh MV-12 after RF1: 0.01s
»Elapsed time to refresh SV-12 (service view) after RF1:
0.002s
»Elapsed time to process Q12 using SV-12: 0.009s
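The reported timings can be turned into speed-up factors with simple arithmetic. The sketch below only uses numbers quoted above; the break-even figure is a rough estimate that assumes the view is built once and ignores refresh costs.

```python
# Timings in seconds, as reported above.
q10_base, q10_derived = 12.69, 7.11
q12_base, q12_mv, q12_mv_build = 0.79, 0.023, 3.52

# Gain from the derived attribute on Q10.
q10_speedup = q10_base / q10_derived
# Gain from the materialized view on Q12.
q12_speedup = q12_base / q12_mv
# Number of Q12 executions after which building the view has paid for itself.
q12_break_even = q12_mv_build / (q12_base - q12_mv)

print(f"Q10 speed-up: {q10_speedup:.2f}x")
print(f"Q12 speed-up: {q12_speedup:.1f}x")
print(f"View build amortized after about {q12_break_even:.1f} runs of Q12")
```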
Part III: Related Work
Related Work
●CR-OLAP by Dehne and Zaboli @CCGRID'2012
»Real-time OLAP system based on a distributed index
structure for OLAP, the distributed PDCR tree.
»Summary data maintenance is not investigated.
●R-Store by Li et al. @ICDE'2014
»A Scalable Distributed System for Supporting Real-time
Analytics, which periodically materializes real-time data
into a data cube.
»R-Store uses HBase for data storage and MapReduce for
query processing, and implements MVCC (Multi-Version
Concurrency Control) to support real-time OLAP.
●Ferreira, Cuzzocrea et al. @DaWaK'2014
»Propose a Rewrite/Merge Approach for Real-Time Data
Warehousing.
Related Work
●Mesa (Google) @VLDB'2014
»Mesa: a highly scalable analytic data warehousing system
that stores critical measurement data related
to Google’s Internet advertising business.
»Mesa satisfies near real-time data ingestion and
queryability requirements. It supports continuous updates, which
should be available for querying consistently across
different views within minutes.
●Pinot (LinkedIn) 2014
»Real-time distributed column-oriented OLAP datastore.
It implements bitmap and inverted indexes.
»It is suited for analytical use cases on immutable
append-only data, with exclusively selection, aggregation, filtering,
group by, order by and distinct queries on fact data (no
complex joins).
24th
July 2018 The 42nd
IEEE COMPSAC @Tokyo 42
Related Work
●Druid @SIGMOD'2014
»open source distributed column-oriented data store
designed for real-time exploratory analytics on large data
sets of events.
Part IV: Conclusions & Future Work
Conclusions
●Our proposed framework proved effective and
efficient for Near-Real-Time OLAP
Future Work
●Extend the framework to Change Data Capture (CDC)
data
»TPC-H refresh functions are simple
●RF1: inserts of new orders and new lineitems
●RF2: deletes of orders
»CDC: data exists in the DW and its value might have
changed
●Investigate the application of the framework
»on SQL-on-Hadoop systems (Impala, SparkSQL...)
»to other data synopses like histograms and sketches
●Application of the framework to the TPC-DS benchmark
»The TPC-DS schema comprises 7 data marts, and multiple
table sizes are scale-factor dependent
»The TPC-DS workload comprises about a hundred queries
Thank you for your Attention
Q & A
