Dynamically Optimizing Queries over 
Large Scale Data Platforms 
[Work done at IBM Almaden Research Center] 
Konstantinos Karanasos♯, Andrey Balmin§, Marcel Kutsch♣, 
Fatma Özcan*, Vuk Ercegovac◊, Chunyang Xia♦, Jesse Jackson♦ 
♯Microsoft *IBM Research §Platfora ♣Apple ◊Google ♦IBM 
Inria Saclay, November 26, 2014
The Big Data Landscape

[Slide figure: a map of Big Data platforms (Hadoop, Hive/Stinger, Jaql, Spark, Stratosphere, Hadapt, Polybase, Drill, Impala, Dryad, HAWQ) and languages (HiveQL, DryadLINQ, Pig, Spark SQL, Jaql, Stratosphere), spanning unstructured, semi-structured, nested, relational, and structured data as well as data streams]

Need for efficient Big Data management
Challenging due to the size and heterogeneity of data and the variety of applications
Query optimization is crucial
Query Optimization in Large Scale Data Platforms 
• Existing challenges 
• Exponential error propagation in joins (illustrated below)
• Correlations between predicates 
• “New” challenges 
• Prominent use of UDFs 
• Complex data types (arrays, maps, structs) 
• Poor statistics (do we own the data?) 
• Bad plans over Big Data may be disastrous
• Exploit cluster’s resources (parallel execution) 
Traditional static techniques are not sufficient.
We introduce dynamic techniques that are:
• at least as good as, and
• up to 2x (4x) better than,
the best hand-written left-deep Jaql (Hive) plans.
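To see why error propagation is so damaging, consider a standard back-of-the-envelope bound (a generic illustration, not a formula from this work). If every estimated quantity entering an n-way join plan (input cardinalities and join selectivities, m factors in total) is off by at most a factor $\varepsilon \ge 1$, the cardinality estimate at the root is the product of those factors and can therefore be off by up to $\varepsilon^{m}$:

\[
\frac{1}{\varepsilon^{m}} \;\le\; \frac{\widehat{|R_1 \bowtie \cdots \bowtie R_n|}}{|R_1 \bowtie \cdots \bowtie R_n|} \;\le\; \varepsilon^{m}
\]

For the 5-way join of Q9’ on the next slide, even a modest 2x error per factor compounds to more than an order of magnitude at the root, which is easily enough to flip the optimizer’s choice of join order and join algorithms.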
Example: TPC-H Q9’

SELECT <projection list> FROM (
    SELECT <projection list>
    FROM "PART", "SUPPLIER", "LINEITEM",
         "PARTSUPP", "ORDERS", "NATION"                             -- 5-way join
    WHERE <join conditions>
      AND "PART"."p_name" LIKE '%green%'
      AND "ORDERS"."o_orderdate" BETWEEN '1995-01-01' AND '1995-07-01'  -- correlated
      AND "ORDERS"."o_orderstatus" = 'P'                                -- predicates
      AND udf("PARTSUPP"."ps_partkey") < 0.001                      -- external UDFs
      AND <udf list>
) "PROFIT"
GROUP BY "PROFIT"."NATION", "PROFIT"."order_YEAR"
ORDER BY "PROFIT"."NATION" ASC, "PROFIT"."order_YEAR" DESC;
“SQL” Processing in Large Scale Platforms 
• Relational operators -> MapReduce jobs 
• Two join algorithms (sketched below):
• Repartition join (RJ) – one MR job (default)
• Memory join (MJ) – map-only job; the small input is loaded in memory
• Optimizations based on rewrite rules and hints 
• RJ -> MJ 
• Chain MJs (multiple joins in one map job) 
• Left-deep plans 
• This is the picture for Jaql (and Hive)
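To make the two algorithms concrete, here is a minimal Python sketch (an illustration with in-memory lists standing in for HDFS splits, not Jaql’s or Hive’s implementation): the repartition join shuffles both inputs on the join key and joins on the reduce side, while the memory join is map-only, broadcasting the small input and probing a hash table.

from collections import defaultdict

# Repartition join (RJ): one full MapReduce job.
# Map tags each record with its input; shuffle groups by join key;
# reduce joins the two groups per key.
def repartition_join(left, right, key):
    buckets = defaultdict(lambda: ([], []))          # simulated shuffle
    for rec in left:
        buckets[rec[key]][0].append(rec)             # map output, tag 0
    for rec in right:
        buckets[rec[key]][1].append(rec)             # map output, tag 1
    for l_recs, r_recs in buckets.values():          # reduce side
        for l in l_recs:
            for r in r_recs:
                yield {**l, **r}

# Memory join (MJ): map-only job. The small input is shipped to every
# mapper (e.g., via the distributed cache) and loaded into a hash table;
# the big input is then streamed through and probed.
def memory_join(big, small, key):
    table = defaultdict(list)
    for rec in small:                                # build (once per mapper)
        table[rec[key]].append(rec)
    for rec in big:                                  # probe, no shuffle
        for match in table[rec[key]]:
            yield {**rec, **match}

Chaining MJs then amounts to composing several probe phases inside one map function, which is why a chain of memory joins costs a single map-only job.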
Limitations 
• No selectivity estimation for predicates/UDFs 
• Conservative application of memory joins 
• No cost-based join enumeration 
• Rely on the order of relations in the FROM clause
• Left-deep plans 
• Often close to optimal for centralized settings 
• Not sufficient for distributed query processing
TPC-H Q9’: Execution Plans

[Slide figure: two join trees for Q9’ over p, s, l, ps, o, n. The best left-deep hand-written Jaql plan is a chain of repartition joins (RJ) with the UDFs udf(p), udf(o), udf(ps) on the inputs and udf(o,l) applied to a join result. The best relational optimizer plan is bushy and mixes RJs with memory joins (MJ).]
Dynamic Optimization 
• Key idea: alter execution plan at runtime 
• Studied in the relational setting 
• Both centralized and distributed 
• Basic concern: when to break the pipeline? 
• No emphasis on UDFs and correlated predicates 
• Increasingly used in large scale platforms (e.g., SCOPE, Shark, Hive)
Goal: dynamic optimization techniques for large scale data platforms (implemented in Jaql)
IBM BigInsights Jaql 
Dataflows over (conceptually) JSON data
Key differentiators 
• Functions: 
reusability + abstraction 
• Physical Transparency: 
precise control when needed 
• Data model: 
semi-structured based on JSON 
[Slide diagram: the Jaql stack: a flexible scripting language compiled onto a scalable MapReduce runtime, which sits on a fault-tolerant DFS; Jaql code runs inside the Map and Reduce phases of the generated jobs]
Jaql Script: Example

Task: group user reviews by place.

Pipeline: read -> transform -> group by -> write

read(hdfs("reviews"))
-> transform { pid: $.placeid, rev: sentAn($.review) }
-> group by p = ($.pid) as r into { pid: p, revs: r.rev }
-> write(hdfs("group-reviews"))

Sample output:
[
  { pid: 12, revs: [ 3*, 4*, … ] },
  { pid: 19, revs: [ 2*, 1*, … ] }
]
Jaql to MapReduce

Original script:

read(hdfs("reviews"))
-> transform { pid: $.placeid, rev: sentAn($.review) }
-> group by p = ($.pid) as r into { pid: p, revs: r.rev }
-> write(hdfs("group-reviews"))

The rewrite engine compiles it into a call to the mapReduce() function:

mapReduce(
  input: { type: hdfs, location: "reviews" },
  output: { type: hdfs, location: "group-reviews" },
  map: fn($mapIn) (
    $mapIn -> transform { pid: $.placeid, rev: sentAn($.review) }
           -> transform [ $.pid, $.rev ] ),  // emit [key, value] pairs
  reduce: fn($p, $r) ( [ { pid: $p, revs: $r } ] ) )

• Functions as parameters
• The rewritten script is valid Jaql!
Outline 
• Introduction 
• System Architecture 
• Pilot Runs 
• Adaptation of Execution Plans 
• Experiments 
• Conclusion
DynO Architecture

[Slide diagram: the DynO architecture and control loop (steps numbered 1-8 on the slide). A query enters the Jaql compiler, which launches pilot runs over the base data; their statistics are stored in the DFS-backed statistics store and fed, with the join query blocks, to the optimizer (join enumeration); the optimizer hands the current best plan to the Jaql runtime, which executes part of the plan as MapReduce jobs; newly collected statistics drive re-optimization of the remaining plan until the query result is returned.]
Pilot Runs 
• PilR algorithm (sketched below):
• Push-down selections/UDFs 
• Get leaf expressions (scans + local predicates) 
• Transform them to map-only jobs 
• Execute them over random splits of each relation 
• Until k tuples are output 
• Collect statistics during execution 
• Parallel execution of pilot runs (~4.5x speedup) 
• Approx. 3% overhead to the execution 
• Performance speedup of up to 2x (4x) for Jaql (Hive)
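A minimal Python sketch of one pilot run (an illustration with in-memory iterables standing in for HDFS splits; in DynO each pilot run is a map-only MapReduce job): execute a leaf expression, i.e., a scan plus its pushed-down predicates and UDFs, over randomly ordered splits until k tuples qualify, and report what was observed.

import random

def pilot_run(splits, predicates, k):
    """Run one leaf expression (scan + local predicates/UDFs) over
    randomly ordered splits until k tuples qualify. Returns
    (scanned, passed), from which the optimizer can derive the combined
    predicate/UDF selectivity as passed / scanned."""
    scanned = passed = 0
    for split in random.sample(splits, len(splits)):   # random split order
        for record in split:                           # map-only scan
            scanned += 1
            if all(p(record) for p in predicates):
                passed += 1
                if passed >= k:                        # early termination
                    return scanned, passed
    return scanned, passed

Because each relation’s pilot run is independent of the others, they can all be launched at once, which is where the ~4.5x speedup from parallel execution comes from.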
TPC-H Q9’: Impact of Pilot Runs

[Slide figure: the two plans from the previous slide shown next to the DynO plan. With the statistics from the pilot runs, DynO produces a bushy plan consisting almost entirely of memory joins (MJ), with udf(p), udf(o), udf(ps) pushed to the inputs and udf(o,l) applied to the join result.]

Up to 2x speedup (4x when applied to Hive)
Pilot Runs: Details 
• Collected statistics (see the sketch below):
• #tuples, min/max, #distinct values
• more can be added if the optimizer supports them
• Statistics reusability 
• Optimization for selective (and expensive) predicates 
• Shortcomings: 
• Non-local predicates 
• Non primary/foreign key joins 
• Join correlations 
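To make the collected statistics concrete, here is a minimal single-pass collector in Python (an illustration, not DynO’s implementation; DynO gathers such partial statistics inside the map tasks and merges them across the cluster):

class ColumnStats:
    """Single-pass per-column statistics: #tuples, min/max, #distinct.
    An exact set is fine for a sketch; a production collector would use
    a bounded-memory summary (e.g., a hash-based distinct-count sketch)."""

    def __init__(self):
        self.count = 0
        self.min = self.max = None
        self.distinct = set()

    def add(self, value):
        self.count += 1
        if self.min is None or value < self.min:
            self.min = value
        if self.max is None or value > self.max:
            self.max = value
        self.distinct.add(value)

    def summary(self):
        return {"tuples": self.count, "min": self.min,
                "max": self.max, "distinct": len(self.distinct)}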
Runtime adaptation of execution plans
Adaptation of Execution Plans 
• Cost-based optimizer 
• Based on Columbia (top-down) optimizer 
• Focuses on join enumeration 
• Accurate statistics from pilot runs and/or previous executions 
• Bushy plans (intra-query parallelization) 
• Online statistics collection 
• Re-optimization points (natural at MR job boundaries; loop sketched below)
• Execution strategies: choosing leaf jobs 
• Degree of parallelization, cost/uncertainty of jobs
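A minimal Python sketch of the adaptation loop (hypothetical interfaces standing in for DynO’s Columbia-style optimizer and the Jaql/MapReduce runtime): re-optimize the remaining query at every job boundary with the freshest statistics, execute only the chosen leaf job, and fold its materialized result and observed statistics back in.

def adaptive_execute(query, optimize, run_leaf_job, stats):
    """optimize(query, stats) -> plan (supports is_single_job(),
    pick_leaf_job(), root); run_leaf_job(job) -> (result, observed
    statistics). All interfaces here are assumptions for illustration."""
    while True:
        plan = optimize(query, stats)           # best plan under current stats
        if plan.is_single_job():                # nothing left to adapt
            result, _ = run_leaf_job(plan.root)
            return result
        job = plan.pick_leaf_job()              # choose next leaf job
        result, observed = run_leaf_job(job)
        stats.update(observed)                  # exact stats for this subresult
        query = query.substitute(job, result)   # replace subplan by its result

Picking which leaf job(s) to run next is the “execution strategies” decision above, trading degree of parallelization against the cost and uncertainty of each candidate job.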
TPC-H Q8’: Impact of Execution Plan Adaptation

[Slide figure: two join trees for Q8’ over p, s, l, o, c, n1, n2, r. The best left-deep hand-written Jaql plan chains repartition joins (RJ) with memory joins (MJ) for the small dimension tables. The best relational optimizer plan is bushy, mixes RJs and MJs, and applies udf(o,c) to an intermediate join result.]
TPC-H Q8’: Impact of Execution Plan Adaptation (cont.)

[Slide figure: the sequence of DynO plans for Q8’. As execution proceeds, materialized intermediate results t1, t2, t3 re-enter the remaining plan, whose join order and join algorithms (RJ vs. MJ) are revised at each re-optimization point using the statistics observed so far.]

Speedup up to 2x without any initial statistics (despite the added overhead)
Outline 
• Introduction 
• System Architecture 
• Pilot Runs 
• Adaptation of Execution Plans 
• Experiments 
• Conclusion
Experimental Setup 
• 15-node cluster, 10 GbE 
• Each machine: 
• 12 cores, 96 GB RAM (2 GB per MR slot), 12 × 2 TB disks
• 10 map / 8 reduce slots
• Hadoop 1.1.1 
• ZooKeeper for coordination (in statistics collection) 
• TPC-H data, SF = {100, 300, 1000}
• TPC-H queries (with additional UDFs)
Execution Times Comparison
• At least as good as the best left-deep hand-written plans 
• Benefits from bushy plans (Q2) 
• Benefits from pilot runs due to many UDFs (Q9’) 
• Benefits from re-optimization due to UDF on join result (Q8’) 
• Biggest benefit is brought by the pilot runs
Benefits of our Approach on Hive 
• Similar performance trends as with Jaql
• Bigger speedup (up to 4x) due to Hive’s implementation of broadcast joins (Hive 0.12 relies on the DistributedCache)
Overhead of Dynamic Optimization
• Pilot runs overhead: 2.5-6.5%
• Stats collection overhead: 0.1-2.8%
• Overall overhead: 7-10%
Conclusion 
• Pilot runs to account for UDFs 
• Dynamic adaptation of execution plans 
• Traditional optimizer for join ordering (bushy plans) 
• Online statistics collection (no need for initial statistics) 
• Execution strategies 
• Plans at least as good as the left-deep hand-written ones
• Up to 2x faster (4x for Hive) 
• Applicability to other systems (e.g., Hive)
Perspectives 
• Broader range of applications (e.g., ML) 
• Other runtimes (e.g., Tez) 
• Adaptive operators 
• Extend the optimizer to support grouping and ordering
Thank you!
