Optimization of Continuous Queries in Federated Database and Stream Processing Systems

Yuanzhen Ji (1), Zbigniew Jerzak (1), Anisoara Nica (1), Gregor Hackenbroich (1), Christof Fetzer (2)
(1) SAP SE, firstname.lastname@sap.com
(2) TU Dresden, christof.fetzer@tu-dresden.de
March 16, 2015 BTW 2015

Agenda
• Introduction
• Federated Continuous Query Execution
• Query Optimization Problem
• Our Optimization Solution
• Evaluation
• Conclusions

Introduction
• Problem: optimizing continuous queries (CQ) for federated execution over a native stream processing engine (SPE) and a column-oriented in-memory database (CIMDB).
– operators: select, join, project, aggregate
• Goal: maximize query throughput (amount of data processed per unit time)
[Figure: SPE and CIMDB; labels: data streams, query results, data flow]

Introduction
• Motivation:
– “No one size fits all” (Cyclops [LHB13], [Ji13])
– obtain the best of both worlds (SPE, CIMDB)
• Application Scenario:
– analyzing energy consumption data collected from smart plugs
installed in households (DEBS 2014 Grand Challenge)
• Main contributions:
– a static cost-based optimizer for federated systems
• extends established optimization techniques
• considers the feasibility property of CQ
– shows the potential of federated CQ execution over an SPE and a CIMDB
• throughput up to 8.5x that of pure SPE-based processing
• throughput up to 1.8x that of pure CIMDB-based processing

Federated Continuous Query Execution
• send relevant input data from SPE to CIMDB
• trigger re-evaluation of query pieces moved to CIMDB
• take results of query pieces executed in CIMDB back to SPE
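The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sqlite3` stands in for the CIMDB, and the class name `MigSketch`, the `readings` table, and its columns are hypothetical names that do not come from the slides. Window eviction is left out.

```python
import sqlite3


class MigSketch:
    """Hypothetical stand-in for a MIG operator; sqlite3 plays the role of the CIMDB."""

    def __init__(self, sql):
        self.db = sqlite3.connect(":memory:")
        # Assumed schema for smart-plug readings (illustrative only).
        self.db.execute("CREATE TABLE readings (plug_id INTEGER, watts REAL)")
        self.sql = sql

    def process(self, batch):
        # Step 1: send the relevant input data from the SPE side to the database.
        self.db.executemany("INSERT INTO readings VALUES (?, ?)", batch)
        # Step 2: trigger re-evaluation of the query piece placed in the database.
        rows = self.db.execute(self.sql).fetchall()
        # Step 3: hand the results back to the downstream SPE operators.
        return rows


mig = MigSketch(
    "SELECT plug_id, AVG(watts) FROM readings GROUP BY plug_id ORDER BY plug_id"
)
print(mig.process([(1, 10.0), (2, 30.0)]))  # [(1, 10.0), (2, 30.0)]
print(mig.process([(1, 20.0)]))             # [(1, 15.0), (2, 30.0)]
```

Each call to `process` re-runs the whole SQL query, which mirrors the "trigger re-evaluation" step; a real system would also have to remove tuples that fall out of the window.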
[Figure: federated execution over SPE and CIMDB; labels: data streams, query results, SQL query, MIG (two operators), data flow]

Query Optimization Problem
• Problem: determine the optimal execution
plan for a given CQ
– currently at deployment time
• Feasibility of continuous queries [AN04]:
– feasible execution plan: can keep up
with data arrival rate
– feasible query: has at least one feasible plan
• Feasibility-dependent optimization objective:
– feasible queries: find the feasible plan with least resource consumption
– infeasible queries: find the plan with maximal throughput
• State of the art: existing approaches consider either the feasibility of CQ but not the federation context, or the federation context but not the feasibility of CQ.

Optimization Solution
Cost Model – Operator Cost (1)
• Operator cost: CPU cost caused by tuples arriving from the data sources within unit time
For an operator O with k direct upstream operators:
– $l_i$: # tuples produced by the i-th upstream operator as a result of unit-time source arrivals
– $c_i$: time to process a single tuple from the i-th upstream operator
$u(O) = \sum_{i=1}^{k} l_i c_i$
• $u > 1$ -> bottleneck -> infeasible plan
• Example (k = 2): $l_1 = 300$, $l_2 = 200$, $c_1 = 0.001$, $c_2 = 0.002$, so $u(O) = l_1 c_1 + l_2 c_2 = 300 \cdot 0.001 + 200 \cdot 0.002 = 0.7$
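The operator cost formula can be checked with a few lines of Python. This is a minimal sketch: the function name `operator_utilization` and the list-of-pairs input format are illustrative assumptions, while the numbers are the ones from the slide's example.

```python
def operator_utilization(inputs):
    """Return u(O), the sum of l_i * c_i over the direct upstream operators.

    inputs: iterable of (l_i, c_i) pairs, where l_i is the number of tuples
    arriving from the i-th upstream operator per unit time and c_i is the
    time needed to process one such tuple.
    """
    return sum(l * c for l, c in inputs)


u = operator_utilization([(300, 0.001), (200, 0.002)])
print(f"u(O) = {u:.1f}")  # u(O) = 0.7
print("bottleneck" if u > 1 else "keeps up with the arrival rate")
```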

Cost Model – Operator Cost (2)
• A query piece executed in CIMDB and its corresponding MIG operator:
– treated as a composite operator and costed as a whole
– cost includes data transfer cost (in and out) and query execution cost
[Figure: a query piece in the CIMDB and its MIG operator in the SPE; labels: data streams, query results, SQL query, MIG, data flow]

Cost Model – Execution Plan Cost
• Execution plan cost: $C(P) = \langle C_b(P), C_u(P) \rangle$ (m: # operators in P)
– bottleneck cost: $C_b(P) = \max\{u(O_j) : j \in [1, m]\}$
– total utilization cost: $C_u(P) = \sum_{j=1}^{m} u(O_j)$
– P is infeasible if $C_b(P) > 1$
• Example (m = 4): $u(O_1) = 0.5$, $u(O_2) = 0.3$, $u(O_3) = 1.1$, $u(O_4) = 0.7$, so $C_b(P) = 1.1$ and $C_u(P) = 2.6$
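The two cost components can be computed directly from the operator utilizations. The sketch below uses the four utilizations from the slide's example; the function names `plan_cost` and `is_feasible` are illustrative assumptions.

```python
def plan_cost(utilizations):
    """Return (C_b, C_u): the bottleneck cost and the total utilization cost."""
    utilizations = list(utilizations)
    return max(utilizations), sum(utilizations)


def is_feasible(cost):
    """A plan is feasible if its bottleneck cost does not exceed 1."""
    bottleneck, _ = cost
    return bottleneck <= 1


cb, cu = plan_cost([0.5, 0.3, 1.1, 0.7])
print(f"C_b = {cb:.1f}, C_u = {cu:.1f}, feasible = {is_feasible((cb, cu))}")
# C_b = 1.1, C_u = 2.6, feasible = False
```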

Optimal Execution Plan
• An execution plan P of a CQ is an optimal plan iff, for any other plan P’ of the CQ, one of the following conditions is satisfied:
– Condition 1: P is feasible but P’ is infeasible
($C_b(P) \le 1 < C_b(P')$)
– Condition 2: both P and P’ are feasible, and P has no higher total utilization cost
($C_b(P) \le 1$, $C_b(P') \le 1$, and $C_u(P) \le C_u(P')$)
– Condition 3: both P and P’ are infeasible, and P has no higher bottleneck cost
($1 < C_b(P) \le C_b(P')$)
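The three inequality conditions translate into a small comparison function. This is a sketch, not the authors' implementation: plans are reduced to their (C_b, C_u) cost pairs, and the names `at_least_as_good` and `optimal_plan` as well as the example costs are made up for illustration.

```python
def at_least_as_good(p, q):
    """True if cost pair p = (Cb, Cu) satisfies one of the three conditions against q."""
    cb_p, cu_p = p
    cb_q, cu_q = q
    if cb_p <= 1 < cb_q:            # Condition 1: p feasible, q infeasible
        return True
    if cb_p <= 1 and cb_q <= 1:     # Condition 2: both feasible, compare Cu
        return cu_p <= cu_q
    if cb_p > 1:                    # Condition 3: 1 < Cb(p) <= Cb(q)
        return cb_p <= cb_q
    return False


def optimal_plan(costs):
    """Return the name of a plan that is at least as good as every other plan."""
    for name, cost in costs.items():
        others = (c for other, c in costs.items() if other != name)
        if all(at_least_as_good(cost, c) for c in others):
            return name
    return None


print(optimal_plan({"A": (0.8, 2.4), "B": (0.6, 2.9), "C": (1.3, 1.9)}))  # A
print(optimal_plan({"D": (1.5, 2.0), "E": (1.2, 3.5)}))                  # E
```

In the first call the feasible plan with the lowest total utilization wins; in the second, no plan is feasible, so the one with the smaller bottleneck cost wins.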

Two-Phase Optimization
• Large search space (# possible plans):
– many semantically equivalent logical plans
– a logical plan with n operators -> $2^n$ possible placement decisions
• Two-phase optimization:
– Phase One: determine the optimal logical plan (consider join ordering, etc.)
– Phase Two: determine the placement of each operator in the logical plan produced in Phase One.
• Bottom-up plan construction following the dynamic programming (DP) model
• The paper proves that DP is applicable to the feasibility-dependent optimization objective.
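To make the size of the Phase Two search space concrete, the sketch below enumerates every SPE/DB assignment for a small operator list. The generator name `placements` and the operator names are illustrative assumptions.

```python
from itertools import product


def placements(operators):
    """Yield every assignment of the (uniquely named) operators to SPE or DB."""
    for sites in product(("SPE", "DB"), repeat=len(operators)):
        yield dict(zip(operators, sites))


ops = ["SELECT", "WINDOW", "AGGR"]
print(sum(1 for _ in placements(ops)))  # 8, i.e. 2**3

# Exhaustive enumeration grows quickly with the number of operators.
for n in (6, 15, 24):
    print(n, 2 ** n)
# 6 64
# 15 32768
# 24 16777216
```

The optimization-time slide reports 64 enumerated plans without pruning for the case labeled (6) and "16+ million" for the case labeled (24), which agrees with 2^6 and 2^24 if the number in parentheses is the operator count (an assumption here).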

Pruning in Phase Two
• For each operator O in a logical plan, the optimal sub-plan up to O in which O is placed in the SPE can be built from the optimal sub-plans up to the direct upstream operators of O.
• For a large logical plan: divide it into smaller pieces, then optimize and compose them in post-order.
[Figure: two sub-plans $I_1$ and $I_2$ up to $O_2$, both with $O_2$ in the SPE; $I_1$ places $O_1$ in the SPE, $I_2$ places $O_1$ in the DB; $C(I_1) < C(I_2)$]
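The pruning idea can be illustrated with a small dynamic program over a linear chain of operators. This is a simplified sketch and not the optimizer from the paper: it assumes a chain instead of a general plan, keeps one best sub-plan per placement of the current operator, and charges every SPE/DB crossing as a separate operator with a fixed utilization `transfer_u` rather than costing a database query piece and its MIG operator as one composite operator. All names and numbers are hypothetical.

```python
SPE, DB = "SPE", "DB"


def better(a, b):
    """True if cost a = (Cb, Cu) is strictly preferable to cost b."""
    a_ok, b_ok = a[0] <= 1, b[0] <= 1
    if a_ok != b_ok:
        return a_ok                      # a feasible plan beats an infeasible one
    return a[1] < b[1] if a_ok else a[0] < b[0]


def extend(cost, utilizations):
    """Add operators with the given utilizations to a partial plan cost."""
    cb, cu = cost
    for u in utilizations:
        cb, cu = max(cb, u), cu + u
    return cb, cu


def place_chain(ops, transfer_u):
    """ops: per-operator {SPE: u, DB: u}, upstream first. Returns (cost, placement)."""
    best = {SPE: ((0.0, 0.0), ())}       # sources deliver their tuples to the SPE
    for op in ops:
        nxt = {}
        for site in (SPE, DB):
            for prev, (cost, placement) in best.items():
                hop = [transfer_u] if prev != site else []
                cand = (extend(cost, hop + [op[site]]), placement + (site,))
                # Pruning: keep only the best sub-plan per placement of this operator.
                if site not in nxt or better(cand[0], nxt[site][0]):
                    nxt[site] = cand
        best = nxt
    # Results go back to the SPE, so a plan that ends in the DB pays one more hop.
    finals = [(extend(c, [transfer_u] if s == DB else []), p)
              for s, (c, p) in best.items()]
    result = finals[0]
    for cand in finals[1:]:
        if better(cand[0], result[0]):
            result = cand
    return result


ops = [{SPE: 0.2, DB: 0.6}, {SPE: 1.4, DB: 0.3},
       {SPE: 0.9, DB: 0.2}, {SPE: 0.1, DB: 0.5}]
cost, placement = place_chain(ops, transfer_u=0.15)
print(placement)  # ('SPE', 'DB', 'DB', 'SPE')
```

With these made-up numbers the second operator is a bottleneck in the SPE (utilization 1.4), so the chosen plan moves the two middle operators to the database. Only two partial plans survive each step instead of all placements of the prefix.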

Evaluation
Setup
• Setup: HP Z620 workstation with 24 cores (1.2 GHz per core) and 96 GB RAM, running SUSE Linux.
• Data: real-world energy consumption data from smart plugs installed in
households (DEBS 2014 Grand Challenge).
• Tested queries:

Evaluation
Optimizer effectiveness (1)
• Examine 10 source stream data rates picked from the range [1,000, 40,000] (tuples/s)
• measure throughput of devised optimal query
[Figure: tested query. Operator labels: PROJECT, INNER JOIN, AGGR (avg), AGGR (cnt), SELECT (x2), WINDOW (5 min) (x2)]
[Figure: actual vs. requested throughput (thousand/s) for the plan "SELECT in SPE"]
[Figure: max. throughput comparison (thousand/s): SELECT in SPE 26.1; All in SPE 3.1; All in DB 18.7]

Evaluation
Optimizer effectiveness (2)
• Examine data rates ranging from 1,000 to 40,000 tuples/s, in increments of 1,000 tuples/s
• measure throughput of devised optimal query
[Figure: tested query. Operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
[Figure: actual vs. requested throughput (thousand/s) for plans P1 (SELECT in SPE) and P2 (SEL, JOIN, P in SPE)]
[Figure: max. throughput comparison (thousand/s): SELECT in SPE (P1) 18.1; SEL, JOIN, P in SPE (P2) 28.6; All in SPE 6.0; All in DB 18.0]

Evaluation
Influence of Feasibility Check
[Figure: tested query. Operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
[Figure: line chart with series: SELECT in SPE (with feasibility check); SEL, JOIN, P in SPE (with feasibility check); SEL in SPE (without feasibility check)]

Evaluation
Optimization Time
• Tested with join queries (2-way, 5-way, 8-way).
[Figure: tested query. Operator labels: PROJECT, INNER JOIN, AGGR (avg, max) (x2), SELECT (x2), WINDOW (5 min), WINDOW (1 min)]
[Figure: # enumerated plans in Phase Two (log scale) for 2-way (6), 5-way (15), and 8-way (24) joins, with and without pruning. Value labels as extracted: 11, 312, 8411, 64, 327168, 16+ million]
[Figure: optimization time in milliseconds (log scale) for 2-way (6), 5-way (15), and 8-way (24) joins, Phase One vs. Phase Two. Value labels as extracted: 0.9, 68.6, 100.5, 12.3, 908.6, 61335.3]

Conclusion
• Exploits the potential of federated execution of CQ over an SPE and a CIMDB.
• Presents a static optimizer which extends traditional optimization techniques to consider the feasibility of CQ.
• Evaluation shows promising results.
For the examined queries, the throughput of the devised federated plan is
– up to 8.5 times as high as that of a pure SPE-based plan
– up to 1.8 times as high as that of a pure CIMDB-based plan

References
[AN04] Ayad, A. M. & Naughton, J. F. Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams. SIGMOD, 2004
[FKC+09] Franklin, M. J.; Krishnamurthy, S.; Conway, N.; Li, A.; Russakovsky, A. & Thombre, N. Continuous Analytics: Rethinking query processing in a network-effect world. CIDR, 2009
[KS09] Kraemer, J. & Seeger, B. Semantics and implementation of continuous sliding window queries over data streams. ACM TODS, 2009
[BCD+10] Botan, I.; Cho, Y.; Derakhshan, R.; Dindar, N.; Gupta, A.; Haas, L. M.; Kim, K.; Lee, C.; Mundada, G.; Shan, M.-C.; Tatbul, N.; Yan, Y.; Yun, B. & Zhang, J. A demonstration of the MaxStream federated stream processing system. ICDE, 2010
[LMB+10] Liu, M.; Mihaylov, S. R.; Bao, Z.; Jacob, M.; Ives, Z. G.; Loo, B. T. & Guha, S. SmartCIS: integrating digital and physical environments. SIGMOD Record, 2010
[LIM+12] Liarou, E.; Idreos, S.; Manegold, S. & Kersten, M. MonetDB/DataCell: online analytics in a streaming column-store. PVLDB, 2012
[LHB13] Lim, H.; Han, Y. & Babu, S. How to Fit when No One Size Fits. CIDR, 2013
[Ji13] Ji, Y. Database support for processing complex aggregate queries over data streams. EDBT Workshops, 2013
[CDK+14] Çetintemel, U.; Du, J.; Kraska, T.; Madden, S.; Maier, D.; Meehan, J.; Pavlo, A.; Stonebraker, M.; Sutherland, E.; Tatbul, N.; Tufte, K.; Wang, H. & Zdonik, S. B. S-Store: A streaming NewSQL system for big velocity applications. PVLDB, 2014
[DLB+11] Daum, M.; Lauterwald, F.; Baumgärtel, P.; Pollner, N. & Meyer-Wegener, K. Efficient and Cost-aware Operator Placement in Heterogeneous Stream-processing Environments. DEBS, 2011

Query Optimization Problem
State-of-the-Art
Columns: CQ optimization | Federation context | Optimization granularity | Feasibility-dependent opt.
[VN02, AN04]: √ operator √
Traditional distributed, federated DBMS, e.g., [DH02, BCE+05]: √ operator
MaxStream [BCD+10]: √
Cyclops [LHB13]: √ √ query
ASPEN [LMB+10]: √ √ operator
Operator placement, e.g., [DLB+11]: √ √/X operator query

Semantics
• Adopt the abstract semantics defined in [ABW06], which is based on:
– Two data types:
• Stream (S): a possibly infinite bag of elements <s, t>, where s is a
tuple belonging to the schema of S and t is the timestamp of s.
• Time-varying Relation (R): a mapping from T to a finite but
unbounded bag of tuples belonging to the schema of R.
– Three classes of query operators:
• stream-to-relation (S2R) operators: produce one relation from one
stream (e.g., window operators)
• relation-to-relation (R2R) operators: produce one relation from
one or more relations.
• relation-to-stream (R2S) operators: produce one stream from one
relation.
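The three operator classes can be sketched with bags represented as `collections.Counter` objects. This is an illustration only: the slide does not name concrete operators, so a time-based sliding window (S2R), a selection (R2R), and the Istream operator known from CQL (R2S) are chosen here as assumptions, with integer timestamps and made-up smart-plug tuples.

```python
from collections import Counter

# A stream is a list of <s, t> elements: (tuple, timestamp).
stream = [(("plug1", 5), 1), (("plug2", 12), 2), (("plug1", 20), 3)]


def window(stream, t, size):
    """S2R: the relation at time t holds the tuples with timestamp in (t - size, t]."""
    return Counter(s for s, ts in stream if t - size < ts <= t)


def select(relation, predicate):
    """R2R: keep the tuples of a relation (a bag) that satisfy the predicate."""
    return Counter({s: n for s, n in relation.items() if predicate(s)})


def relation_at(t):
    """Time-varying relation: readings above 10 within a window of 2 time units."""
    return select(window(stream, t, 2), lambda s: s[1] > 10)


def istream(t):
    """R2S: emit <s, t> for every tuple that is in R(t) but not in R(t - 1)."""
    inserted = relation_at(t) - relation_at(t - 1)  # bag difference
    return [(s, t) for s in inserted.elements()]


for t in (1, 2, 3, 4):
    print(t, istream(t))
# 1 []
# 2 [(('plug2', 12), 2)]
# 3 [(('plug1', 20), 3)]
# 4 []
```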

Introduction
From DBMS to SPE
• Increasing interest in processing high-velocity data streams generated in real time using continuous queries (CQ).
-> Need a new processing paradigm
[Figure: DBMS (one-shot queries, stored data, query results) vs. SPE (continuous query, streaming data, query results)]

Introduction
From DBMS to SPE
• However, many applications require:
– persisting input streaming data/query results for on-demand analysis
– combining streaming data with static data during processing.
[Figure: SPE (continuous query) alongside DBMS (one-shot queries, stored data, query results); labels on the connection: store data, access stored data]

Introduction
Build SPE on Top of DBMS Kernel
• Exploit and merge technologies from both worlds in an integrated way.
– Truviso Continuous Analytics [FKC+09], HP Lab work [CH10], DataCell
[LIM+12], S-Store [CDK+14]
[Figure: combined SPE + DBMS; labels: one-shot queries, continuous query, stored data, query results, in-memory table, buffers in UDFs]
