Query Compilation in Impala
 

Presentation Transcript

  • Query Compilation in Impala
    Alexander Behm | Software Engineer
    May 2014 @ Impala User Group
  • Flow of a SQL Query
    [Diagram: Client → SQL Text → Impala Frontend (Java): Compile Query → Executable Plan → Impala Backend (C++): Execute Query → Query Results → Client. The frontend's compile step is the focus of this talk.]
  • Query Compilation
    [Diagram: Client → SQL Text → Query Compiler → Executable Plan. Inside the Query Compiler: SQL Parsing → Parse Tree → Semantic Analysis → Parse Tree + Analyzer → Query Planning.]
  • Query Parsing
        SELECT c1, SUM(c2)
        FROM t1 JOIN t2 USING(id)
        WHERE c3 > 10
        GROUP BY c1
    [Parse tree diagram: SelectStmt with children SelectList (ColRef, AggExpr over ColRef), TableRefs (TableRef, TableRef with UsingClause over ColRef), WhereClause (BinaryPredicate over ColRef and IntLiteral), and GroupByClause (ColRef).]
    • Applies SQL grammar, reports syntax errors
    • Produces parse tree capturing syntactic structure of query
  • Semantic Analysis…
        SELECT c1, SUM(c2)
        FROM t1 JOIN t2 USING(id)
        WHERE c3 > 10
        GROUP BY c1
    • Precondition: Query is syntactically valid. Analysis operates on parse tree.
    • Consults table metadata
        • Do t1 and t2 exist? Does c1 exist in t1 or t2 (or both → error)? Does id exist in t1 and t2?
        • Does the user have privileges to SELECT from t1?
    • Checks type compatibility of expressions, adds implicit casts
        • c3 > 10 → c3 > cast(10 as bigint)
    • SQL rules (semantic, not syntactic)
        • Does c1 appear in the GROUP BY clause?
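The checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CATALOG` dictionary, the column types, the helper names, and the choice of `tinyint` as the literal's type are all hypothetical and do not reflect Impala's actual frontend classes or metadata API.

```python
# Hypothetical catalog for illustration: table -> {column: type}.
# This is not Impala's metadata API.
CATALOG = {
    "t1": {"id": "int", "c1": "string", "c3": "bigint"},
    "t2": {"id": "int", "c2": "double"},
}


class AnalysisError(Exception):
    """The query parsed correctly but is semantically invalid."""


def resolve_column(col, tables, catalog=CATALOG):
    """Find the single table that owns an unqualified column reference."""
    for table in tables:
        if table not in catalog:
            raise AnalysisError(f"unknown table: {table}")
    owners = [t for t in tables if col in catalog[t]]
    if not owners:
        raise AnalysisError(f"unknown column: {col}")
    if len(owners) > 1:
        raise AnalysisError(f"ambiguous column: {col}")
    return owners[0], catalog[owners[0]][col]


def cast_literal(literal, literal_type, target_type):
    """Wrap a literal in an implicit cast when the types differ."""
    if literal_type == target_type:
        return str(literal)
    return f"cast({literal} as {target_type})"


def check_group_by(plain_select_cols, group_by_cols):
    """Every non-aggregated select-list column must be grouped."""
    missing = [c for c in plain_select_cols if c not in group_by_cols]
    if missing:
        raise AnalysisError(f"not in GROUP BY: {', '.join(missing)}")


tables = ["t1", "t2"]
_, c3_type = resolve_column("c3", tables)
# Assumption: the literal 10 starts out as a tinyint.
predicate = f"c3 > {cast_literal(10, 'tinyint', c3_type)}"
check_group_by(["c1"], ["c1"])
```

With this toy catalog, an unqualified reference to `id` raises an "ambiguous column" error because both tables define it, which mirrors the "or both → error" case on the slide.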
  • … Semantic Analysis
    • Expression substitution for views
        • Resolve column references against base tables
        SELECT c1, SUM(c2)
        FROM (SELECT dept AS c1, revenue AS c2, month AS c3 FROM t1) AS v
        WHERE c3 > 10
        GROUP BY c1
        →
        SELECT dept, SUM(revenue)
        FROM t1
        WHERE month > 10
        GROUP BY dept
    • Preparation for Planning
        • Register state in analyzer for correct predicate assignment during planning
            • Register predicates (WHERE, HAVING, ON, USING, etc.)
            • Register outer-joined tables
            • Compute value-transfer graph and equivalence classes for predicate inference
            • (…)
    • Postcondition: Query is valid. An executable plan can be produced.
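The view substitution shown on this slide can be approximated with a small Python sketch. It rewrites identifiers in expression strings, which is an assumption made for brevity; a real analyzer would substitute on expression trees, and the `substitute` helper and `VIEW_ALIASES` mapping are hypothetical names.

```python
import re

# Alias -> base-table expression, taken from the inline view on the slide.
VIEW_ALIASES = {"c1": "dept", "c2": "revenue", "c3": "month"}


def substitute(expr, alias_map):
    """Replace each view column alias in expr with its base-table expression."""
    def replace_identifier(match):
        name = match.group(0)
        # Identifiers that are not view aliases (e.g. SUM) are left untouched.
        return alias_map.get(name, name)

    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", replace_identifier, expr)


select_list = [substitute(e, VIEW_ALIASES) for e in ["c1", "SUM(c2)"]]
where_clause = substitute("c3 > 10", VIEW_ALIASES)
group_by = substitute("c1", VIEW_ALIASES)
rewritten = (
    f"SELECT {', '.join(select_list)} FROM t1 "
    f"WHERE {where_clause} GROUP BY {group_by}"
)
```

After substitution every column reference points at the base table `t1`, so later planning steps no longer need to know that the view existed.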
  • Query Planning: Goals
    • Generate executable plan ("tree" of operators)
    • Maximize scan locality using HDFS DataNode (DN) block metadata
    • Minimize data movement
    • Full distribution of operators
    • Query operators
        • Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  • Query Planning: Overview
    [Diagram: Semantic Analysis → Parse Tree + Analyzer → Query Planner → Executable Plan. Inside the Query Planner: Walk Parse Tree → Single-node Plan → Parallelize & Fragment.]
  • Query Planning: Single-Node Plan
    • Four major functions:
        1. Translates the parse tree into a plan tree
        2. Assigns each predicate to the lowest plan node that can evaluate it
        3. Optimizes join order
        4. Prunes irrelevant columns
  • Parse Tree → Single-Node Plan Tree
        SELECT t1.dept, SUM(t2.revenue)
        FROM LargeHdfsTable t1
        JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
        JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
        WHERE t3.category = 'Online' AND t1.id > 10
        GROUP BY t1.dept
        HAVING COUNT(t2.revenue) > 10
        ORDER BY revenue LIMIT 10
    [Plan tree diagram, top to bottom: TopN → Agg → HashJoin (right input Scan: t3) → HashJoin (left input Scan: t1, right input Scan: t2).]
  • Predicate Assignment & Inference
        SELECT t1.dept, SUM(t2.revenue)
        FROM LargeHdfsTable t1
        JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
        JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
        WHERE t3.category = 'Online' AND t1.id > 10
        GROUP BY t1.dept
        HAVING COUNT(t2.revenue) > 10
        ORDER BY revenue LIMIT 10
    [Plan tree diagram (TopN, Agg, two HashJoins, Scan: t1, Scan: t2, Scan: t3) annotated with predicates: COUNT(t2.revenue) > 10; t1.id2 = t3.id; t1.id1 = t2.id; id1 > 10; category = 'Online'; id > 10 (inferred predicate).]
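One way to picture predicate inference is as propagation of constant filters across columns that equi-join predicates make equivalent. The Python sketch below uses a small union-find for that. It assumes the filter is on t1.id1, as the plan annotation suggests, and it treats all joins as inner joins; the function names are hypothetical, and Impala's value-transfer graph is more general than plain equivalence classes.

```python
def find(parent, col):
    """Return the representative of col's equivalence class (with path halving)."""
    while parent[col] != col:
        parent[col] = parent[parent[col]]
        col = parent[col]
    return col


def equivalence_classes(equi_joins):
    """Group columns that are connected through equality join predicates."""
    parent = {}
    for left, right in equi_joins:
        parent.setdefault(left, left)
        parent.setdefault(right, right)
        parent[find(parent, left)] = find(parent, right)
    classes = {}
    for col in parent:
        classes.setdefault(find(parent, col), set()).add(col)
    return list(classes.values())


def infer_predicates(equi_joins, filters):
    """Return new (column, op, constant) filters implied by the join equalities."""
    known = set(filters)
    inferred = []
    for cls in equivalence_classes(equi_joins):
        for col, op, const in filters:
            if col not in cls:
                continue
            for other in sorted(cls):
                candidate = (other, op, const)
                if candidate not in known:
                    known.add(candidate)
                    inferred.append(candidate)
    return inferred


joins = [("t1.id1", "t2.id"), ("t1.id2", "t3.id")]
filters = [("t1.id1", ">", 10), ("t3.category", "=", "Online")]
new_filters = infer_predicates(joins, filters)
```

Here the only inferred filter is on t2.id, which can then be assigned to the scan of t2 so that fewer rows reach the join.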
  • Join-Order Optimization
    • Inner joins are commutative and associative
        • Query results correct independent of execution order
    • Query execution costs vary dramatically!
        • Hash table sizes, network transfers, #hash lookups
    • Join-order optimization
        • Impala only considers left-deep join trees
            • (Right join input is a table, not another join)
        • Find cheapest valid join order
        • Relies heavily on table and column statistics
        • Limitation: Choice of join order independent of join strategy
  • Invalid Join Orders
        SELECT t1.dept, SUM(t2.revenue)
        FROM LargeHdfsTable t1
        JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
        JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
        WHERE t3.category = 'Online' AND t1.id > 10
        GROUP BY t1.dept
        HAVING COUNT(t2.revenue) > 10
        ORDER BY revenue LIMIT 10
    • No explicit or implicit predicate between t2 and t3, so any order that joins t2 and t3 directly (t2, t3, t1 or t3, t2, t1) is invalid
  • Join-Order Optimization
    [Diagram: the six left-deep join trees for three tables]
    • Order: t1, t2, t3: HashJoin(HashJoin(Scan: t1, Scan: t2), Scan: t3)
    • Order: t1, t3, t2: HashJoin(HashJoin(Scan: t1, Scan: t3), Scan: t2)
    • Order: t2, t1, t3: HashJoin(HashJoin(Scan: t2, Scan: t1), Scan: t3)
    • Order: t2, t3, t1: HashJoin(HashJoin(Scan: t2, Scan: t3), Scan: t1)
    • Order: t3, t1, t2: HashJoin(HashJoin(Scan: t3, Scan: t1), Scan: t2)
    • Order: t3, t2, t1: HashJoin(HashJoin(Scan: t3, Scan: t2), Scan: t1)
  • Join-Order Optimization
    • Impala's Implementation:
        1. Heuristic
            • Order tables descending by size
            • Best plan typically has largest table on the left (if valid)
        2. Plan enumeration & costing
            • Generate all possible join orders starting from a given left-most table (starting with largest one)
            • Ignore invalid join orders
            • Estimate intermediate result sizes (key!)
            • Choose plan that minimizes intermediate result sizes
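The two steps above can be sketched as a small enumeration in Python. Everything numeric here is an assumption for illustration: the row counts, the per-predicate selectivities, and the cardinality formula (left rows × right rows × selectivity) are stand-ins for whatever estimates the planner derives from table and column stats, and the function names are hypothetical.

```python
from itertools import permutations


def left_deep_cost(order, rows, join_sel):
    """Sum of estimated intermediate result sizes, or None if the order is invalid."""
    joined = {order[0]}
    cardinality = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Join predicates that connect the new table to an already joined table.
        sels = [s for pair, s in join_sel.items()
                if table in pair and (pair - {table}) <= joined]
        if not sels:
            return None  # no connecting predicate: this order is invalid
        cardinality *= rows[table]
        for s in sels:
            cardinality *= s
        cost += cardinality
        joined.add(table)
    return cost


def best_join_order(rows, join_sel):
    """Try left-most tables from largest to smallest; return the cheapest valid order."""
    for leftmost in sorted(rows, key=rows.get, reverse=True):
        rest = [t for t in rows if t != leftmost]
        candidates = []
        for perm in permutations(rest):
            order = [leftmost, *perm]
            cost = left_deep_cost(order, rows, join_sel)
            if cost is not None:
                candidates.append((cost, order))
        if candidates:
            return min(candidates)[1]
    return None


# Assumed statistics for the three-table example query.
rows = {"t1": 1_000_000, "t2": 100_000_000, "t3": 1_000}
join_sel = {
    frozenset({"t1", "t2"}): 1e-8,
    frozenset({"t1", "t3"}): 1e-3,
}
order = best_join_order(rows, join_sel)
```

With these assumed numbers the largest table, t2, goes on the left, and the order t2, t3, t1 is skipped because no predicate connects t2 and t3.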
  • Query Planning: Overview
    [Diagram: Semantic Analysis → Parse Tree + Analyzer → Query Planner → Executable Plan. Inside the Query Planner: Walk Parse Tree → Single-node Plan → Parallelize & Fragment.]
  • Query Planning: Distributed Plans
    • Distributed Aggregation
        • Pre-aggregation where data is first materialized
        • Merge-aggregation partitioned by grouping columns
        • Distinct aggregation: additional level of pre- and merge aggregation
    • Distributed Top-N
        • Initial Top-N where data is first materialized
        • Final Top-N at coordinator
    • Distributed Union
        • Pre-aggregation/top-n placed into plans of each union operand
        • Union-operand plans executed in parallel, merged via exchange
    • Above strategies are currently fixed in Impala
        • Independent of column/table stats
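The two-phase aggregation described above can be illustrated with a single-process Python sketch. It only simulates the data flow for a SUM: each "node" is a list of rows, and the exchange is a loop that routes keys by hash. The function names, the sample rows, and the partition count are assumptions for illustration.

```python
from collections import defaultdict


def pre_aggregate(local_rows):
    """Phase 1: each node sums its own rows per grouping key."""
    partial = defaultdict(int)
    for key, value in local_rows:
        partial[key] += value
    return dict(partial)


def merge_aggregate(partials, num_partitions):
    """Phase 2: route partial sums by hash of the grouping key, then merge."""
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            # All partial results for one key land in the same partition.
            partitions[hash(key) % num_partitions][key] += value
    result = {}
    for partition in partitions:
        result.update(partition)
    return result


node_a = [("books", 10), ("toys", 5), ("books", 1)]
node_b = [("books", 6)]
totals = merge_aggregate([pre_aggregate(node_a), pre_aggregate(node_b)], 4)
```

Pre-aggregating first means each node sends at most one row per distinct key across the network instead of every input row.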
  • Query Planning: Distributed Joins
    • Broadcast Join
        • Join is co-located with left input
        • Broadcast right input to all nodes executing join
        • Build hash table on right input, streaming probe from left input
        → Preferred for small right side (relative to left side)
    • Partitioned Join
        • Both tables hash-partitioned on join columns
        • Same build/probe procedure as above
        → Preferred for joins where both left and right side are large
    • Cost-based decision based on table/column stats
        • Minimize required network transfer
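The cost-based choice between the two strategies can be sketched as a comparison of estimated network transfer. The formulas below are a simplification assumed for illustration (broadcast sends the right input to every node; partitioning may send both inputs once) and are not taken from Impala's planner; the function name and the sample sizes are hypothetical.

```python
def choose_join_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the strategy with the smaller estimated network transfer."""
    # Broadcast: every node executing the join receives the full right input.
    broadcast_cost = right_bytes * num_nodes
    # Partitioned: both inputs are hash-partitioned and sent across the network.
    partitioned_cost = left_bytes + right_bytes
    return "BROADCAST" if broadcast_cost <= partitioned_cost else "PARTITIONED"


# A small dimension table joined to a large fact table on 10 nodes.
small_right = choose_join_strategy(10**12, 10 * 2**20, 10)
# Two inputs of similar size on 10 nodes.
similar_sizes = choose_join_strategy(10**9, 10**9, 10)
```

Under these assumptions a small right side favors broadcast, while two large inputs favor partitioning, which matches the preferences listed on the slide.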
  • Query Planning: Distributed Plans
    [Diagram: the single-node plan (TopN → Agg → HashJoin → HashJoin over Scan: t2, Scan: t1, Scan: t3) shown next to its distributed version. Operators in the distributed plan: Scan: t2, Scan: t1, Scan: t3, HashJoin, HashJoin, Pre-Agg, MergeAgg, TopN, TopN. Exchange labels: hash t2.id, hash t1.id1, Broadcast, hash t1.custid, Merge. Placement labels: at HDFS DN, at HBase RS, at coordinator.]
  • Explain Example: TPCDS Q42
        SELECT d.d_year, i.i_category_id, i.i_category,
               SUM(ss_ext_sales_price) AS total_sales
        FROM store_sales ss
        JOIN date_dim d ON (ss.ss_sold_date_sk = d.d_date_sk)
        JOIN item i ON (ss.ss_item_sk = i.i_item_sk)
        WHERE i.i_manager_id = 1 AND d.d_moy = 12 AND d.d_year = 1998
        GROUP BY d.d_year, i.i_category_id, i.i_category
        ORDER BY total_sales DESC, d_year, i_category_id, i_category
        LIMIT 100
  • Explain Example: TPCDS Q42
    +-----------------------------------------------------+
    | Explain String                                      |
    +-----------------------------------------------------+
    | Estimated Per-Host Requirements: Memory=0B VCores=0 |
    |                                                     |
    | 06:TOP-N [LIMIT=100]                                |
    | 05:AGGREGATE [FINALIZE]                             |
    | 04:HASH JOIN [INNER JOIN]                           |
    | |--02:SCAN HDFS [tpcds1000gb.item i]                |
    | 03:HASH JOIN [INNER JOIN]                           |
    | |--01:SCAN HDFS [tpcds1000gb.date_dim d]            |
    | 00:SCAN HDFS [tpcds1000gb.store_sales ss]           |
    +-----------------------------------------------------+
    set explain_level=0; set num_nodes=1;
  • Explain Example: TPCDS Q42
    +---------------------------------------------------------------------+
    | Explain String                                                      |
    +---------------------------------------------------------------------+
    | Estimated Per-Host Requirements: Memory=3.76GB VCores=3             |
    |                                                                     |
    | 12:TOP-N [LIMIT=100]                                                |
    | 11:EXCHANGE [PARTITION=UNPARTITIONED]                               |
    | 06:TOP-N [LIMIT=100]                                                |
    | 10:AGGREGATE [MERGE FINALIZE]                                       |
    | 09:EXCHANGE [PARTITION=HASH(d.d_year,i.i_category_id,i.i_category)] |
    | 05:AGGREGATE                                                        |
    | 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
    | |--08:EXCHANGE [BROADCAST]                                          |
    | |  02:SCAN HDFS [tpcds1000gb.item i]                                |
    | 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
    | |--07:EXCHANGE [BROADCAST]                                          |
    | |  01:SCAN HDFS [tpcds1000gb.date_dim d]                            |
    | 00:SCAN HDFS [tpcds1000gb.store_sales ss]                           |
    +---------------------------------------------------------------------+
    set explain_level=0; set num_nodes=0;
  • Explain Example: TPCDS Q42
    | …                                                            |
    | 03:HASH JOIN [INNER JOIN, BROADCAST]                         |
    | |  hash predicates: ss.ss_sold_date_sk = d.d_date_sk         |
    | |  hosts=10 per-host-mem=511B                                |
    | |  tuple-ids=0,1 row-size=40B cardinality=8251124389         |
    | |                                                            |
    | |--07:EXCHANGE [BROADCAST]                                   |
    | |  |  hosts=3 per-host-mem=0B                                |
    | |  |  tuple-ids=1 row-size=16B cardinality=29                |
    | |  |                                                         |
    | |  01:SCAN HDFS [tpcds1000gb.date_dim d, PARTITION=RANDOM]   |
    | |     partitions=1/1 size=9.77MB                             |
    | |     predicates: d.d_moy = 12, d.d_year = 1998              |
    | |     table stats: 73049 rows total                          |
    | |     column stats: all                                      |
    | |     hosts=3 per-host-mem=48.00MB                           |
    | |     tuple-ids=1 row-size=16B cardinality=29                |
    | |                                                            |
    | 00:SCAN HDFS [tpcds1000gb.store_sales ss, PARTITION=RANDOM] |
    |    partitions=1823/1823 size=1.10TB                          |
    |    table stats: 8251124389 rows total                        |
    |    column stats: all                                         |
    |    hosts=10 per-host-mem=3.75GB                              |
    |    tuple-ids=0 row-size=24B cardinality=8251124389           |
    +--------------------------------------------------------------+
    set explain_level=2; set num_nodes=0;
  • Conclusion
    • Cost-based choice of join order and strategy
        • Critical for performance
        • Relies on table and column stats
    • Other plan optimizations currently independent of stats
        • Likely to expand plan choices in the future
        • Likely to increase reliance on stats
    • Helpful Impala commands
        • compute stats
        • show table/column stats
        • explain query/insert stmt
        • set explain_level=[0-3]
        • set num_nodes=1 → show single-node plan
  • Try It Out!
    • Questions/comments?
    • Download: cloudera.com/impala
    • Email: impala-user@cloudera.org
    • Join: groups.cloudera.org