Query Compilation in Impala
Alexander Behm | Software Engineer
May 2014 @ Impala User Group
Flow of a SQL Query
[Diagram: the Client sends SQL Text to the Impala Frontend (Java), which compiles the query into an Executable Plan. The Impala Backend (C++) executes the plan and returns Query Results to the Client. The Compile Query step is the focus of this talk.]
Query Compilation
[Diagram: Client → SQL Text → Query Compiler → Executable Plan. Inside the Query Compiler: SQL Parsing → (Parse Tree) → Semantic Analysis → (Parse Tree + Analyzer) → Query Planning.]
Query Parsing
SELECT c1, SUM(c2)
FROM t1 JOIN t2 USING(id)
WHERE c3 > 10 GROUP BY c1
SelectStmt
  SelectList
    ColRef
    AggExpr
      ColRef
  TableRefs
    TableRef
    TableRef
      UsingClause
        ColRef
  WhereClause
    BinaryPredicate
      ColRef
      IntLiteral
  GroupByClause
    ColRef
• Applies SQL grammar, reports syntax errors
• Produces parse tree capturing syntactic structure of query
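The parse tree for the example query can be pictured as plain data. The sketch below is only an illustration: it borrows the node names from the slide (SelectStmt, ColRef, AggExpr, and so on), but the Python classes and their fields are hypothetical and are not Impala's frontend classes (the frontend is written in Java).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ColRef:
    name: str


@dataclass
class IntLiteral:
    value: int


@dataclass
class AggExpr:
    func: str
    arg: ColRef


@dataclass
class BinaryPredicate:
    op: str
    left: Union[ColRef, IntLiteral]
    right: Union[ColRef, IntLiteral]


@dataclass
class TableRef:
    table: str
    using: List[ColRef] = field(default_factory=list)  # USING(...) columns, if any


@dataclass
class SelectStmt:
    select_list: List[Union[ColRef, AggExpr]]
    table_refs: List[TableRef]
    where: Optional[BinaryPredicate] = None
    group_by: List[ColRef] = field(default_factory=list)


# Hand-built tree for:
#   SELECT c1, SUM(c2) FROM t1 JOIN t2 USING(id) WHERE c3 > 10 GROUP BY c1
parse_tree = SelectStmt(
    select_list=[ColRef("c1"), AggExpr("SUM", ColRef("c2"))],
    table_refs=[TableRef("t1"), TableRef("t2", using=[ColRef("id")])],
    where=BinaryPredicate(">", ColRef("c3"), IntLiteral(10)),
    group_by=[ColRef("c1")],
)
```

The tree records syntax only. Nothing here checks whether t1 or c1 exist; that is left to semantic analysis.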
Semantic Analysis…
• Precondition: Query is syntactically valid. Analysis operates on parse tree.
• Consults table metadata
• Do t1 and t2 exist? Does c1 exist in t1 or t2 (or both → error)? Does id exist in t1 and t2?
• Does the user have privileges to SELECT from t1?
• Checks type compatibility of expressions, adds implicit casts
• c3 > 10 → c3 > cast(10 as bigint)
• SQL rules (semantic, not syntactic)
• Does c1 appear in the GROUP BY clause?
SELECT c1, SUM(c2)
FROM t1 JOIN t2 USING(id)
WHERE c3 > 10 GROUP BY c1
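The implicit-cast step can be sketched as follows. This is a minimal illustration, assuming c3 is a BIGINT column (as the cast on the slide suggests) and that the literal 10 starts out as the narrowest integer type. The type list, the Expr class, and both function names are hypothetical simplifications, not Impala's analyzer.

```python
from dataclasses import dataclass

# Integer types ordered from narrowest to widest (simplified assumption).
INT_TYPES = ["TINYINT", "SMALLINT", "INT", "BIGINT"]


@dataclass
class Expr:
    sql: str
    type: str


def cast_to(expr: Expr, target: str) -> Expr:
    """Wrap expr in an explicit cast unless it already has the target type."""
    if expr.type == target:
        return expr
    return Expr(f"cast({expr.sql} as {target.lower()})", target)


def analyze_comparison(left: Expr, op: str, right: Expr) -> str:
    """Check that both operands are integer types and cast to the wider one."""
    for side in (left, right):
        if side.type not in INT_TYPES:
            raise TypeError(f"incompatible operand type: {side.type}")
    wider = max(left.type, right.type, key=INT_TYPES.index)
    return f"{cast_to(left, wider).sql} {op} {cast_to(right, wider).sql}"


# c3 is assumed to be a BIGINT column; the literal 10 fits in a TINYINT.
rewritten = analyze_comparison(Expr("c3", "BIGINT"), ">", Expr("10", "TINYINT"))
print(rewritten)  # c3 > cast(10 as bigint)
```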
… Semantic Analysis
• Expression substitution for views
• Resolve column references against base tables
• Preparation for Planning
• Register state in analyzer for correct predicate assignment during planning
• Register predicates (WHERE, HAVING, ON, USING, etc.)
• Register outer-joined tables
• Compute value-transfer graph and equivalence classes for predicate inference
• (…)
• Postcondition: Query is valid. An executable plan can be produced.
SELECT c1, SUM(c2)
FROM (SELECT dept AS c1, revenue AS c2,
month AS c3 FROM t1) AS v
WHERE c3 > 10 GROUP BY c1
SELECT dept, SUM(revenue)
FROM t1
WHERE month > 10
GROUP BY dept
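The two queries above show expression substitution: references to the inline view's columns are replaced by the base-table expressions they stand for. The sketch below mimics this on strings purely for illustration. A real analyzer would substitute on expression trees rather than text, and the function name and alias map are hypothetical.

```python
import re

# Alias -> base-table expression, taken from the inline view
#   (SELECT dept AS c1, revenue AS c2, month AS c3 FROM t1) AS v
view_aliases = {"c1": "dept", "c2": "revenue", "c3": "month"}


def substitute(expr: str, aliases: dict) -> str:
    """Replace every identifier that is a view alias with its base expression."""
    return re.sub(
        r"[A-Za-z_][A-Za-z_0-9]*",
        lambda m: aliases.get(m.group(0), m.group(0)),
        expr,
    )


print(substitute("c3 > 10", view_aliases))  # month > 10
print(substitute("SUM(c2)", view_aliases))  # SUM(revenue)
```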
Query Planning: Goals
• Generate executable plan (“tree” of operators)
• Maximize scan locality using HDFS DataNode (DN) block metadata
• Minimize data movement
• Full distribution of operators
• Query operators
• Scan, HashJoin, HashAggregation, Union, TopN, Exchange
Query Planning: Overview
[Diagram: Semantic Analysis → (Parse Tree + Analyzer) → Query Planner → Executable Plan. Inside the Query Planner: Walk Parse Tree → (Single-node Plan) → Parallelize & Fragment.]
Query Planning: Single-Node Plan
• Four major functions:
1. Parse Tree → Plan Tree
2. Assigns predicates to lowest plan node
3. Optimizes join order
4. Prunes irrelevant columns
Parse Tree → Single-Node Plan Tree
[Diagram: single-node plan tree with TopN at the root, Agg below it, and two HashJoins over Scan: t1, Scan: t2 and Scan: t3.]
SELECT t1.dept, SUM(t2.revenue)
FROM LargeHdfsTable t1
JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online' AND t1.id > 10
GROUP BY t1.dept
HAVING COUNT(t2.revenue) > 10
ORDER BY revenue LIMIT 10
SELECT t1.dept, SUM(t2.revenue)
FROM LargeHdfsTable t1
JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online' AND t1.id > 10
GROUP BY t1.dept
HAVING COUNT(t2.revenue) > 10
ORDER BY revenue LIMIT 10
Predicate Assignment & Inference
[Diagram: the single-node plan tree annotated with predicates. Agg: COUNT(t2.revenue) > 10. HashJoins: t1.id2 = t3.id and t1.id1 = t2.id. Scans: id1 > 10, category = 'Online', id > 10. One of the scan predicates is labeled "Inferred Predicate".]
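Predicate inference through equivalence classes can be sketched with a small union-find: columns linked by equality join predicates fall into one class, and a single-column predicate on one member can be copied to the others. This is a rough illustration only. It uses t1.id1 > 10 as the starting predicate for the sake of the example, ignores the restrictions that outer joins place on value transfer, and all class and function names are hypothetical.

```python
from collections import defaultdict


class EquivalenceClasses:
    """Union-find over column names, fed by equality join predicates."""

    def __init__(self):
        self.parent = {}

    def find(self, col: str) -> str:
        self.parent.setdefault(col, col)
        while self.parent[col] != col:
            self.parent[col] = self.parent[self.parent[col]]  # path halving
            col = self.parent[col]
        return col

    def union(self, a: str, b: str) -> None:
        self.parent[self.find(a)] = self.find(b)

    def members(self, col: str) -> set:
        root = self.find(col)
        return {c for c in list(self.parent) if self.find(c) == root}


def infer_predicates(equi_joins, column_predicates):
    """Copy each single-column predicate to all columns equivalent to it.

    column_predicates is a list of (column, rest) pairs, e.g. ("t1.id1", "> 10").
    Returns a dict: table name -> sorted list of predicates for that table's scan.
    """
    eq = EquivalenceClasses()
    for left, right in equi_joins:
        eq.union(left, right)
    per_table = defaultdict(set)
    for col, rest in column_predicates:
        for target in eq.members(col):
            table = target.split(".")[0]
            per_table[table].add(f"{target} {rest}")
    return {t: sorted(p) for t, p in per_table.items()}


scan_predicates = infer_predicates(
    equi_joins=[("t1.id1", "t2.id"), ("t1.id2", "t3.id")],
    column_predicates=[("t1.id1", "> 10"), ("t3.category", "= 'Online'")],
)
print(scan_predicates)
```

Here the scan of t2 receives t2.id > 10 even though the query text never states it, because t1.id1 and t2.id are in the same equivalence class.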
Join-Order Optimization
• Inner joins are commutative and associative
• Query results correct independent of execution order
• Query execution costs vary dramatically!
• Hash table sizes, network transfers, number of hash lookups
• Join-order optimization
• Impala only considers left-deep join trees
• (Right join input is a table, not another join)
• Find cheapest valid join order
• Relies heavily on table and column statistics
• Limitation: Choice of join order independent of join strategy
Invalid Join Orders
SELECT t1.dept, SUM(t2.revenue)
FROM LargeHdfsTable t1
JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online' AND t1.id > 10
GROUP BY t1.dept
HAVING COUNT(t2.revenue) > 10
ORDER BY revenue LIMIT 10
No explicit or implicit predicate between t2 and t3
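One plausible reading of "invalid" is that a left-deep order is rejected when some table has no join predicate connecting it to any table joined before it, since that step would be a cross product. The sketch below applies this rule to the example query. The function name and the set-of-pairs representation are illustrative assumptions.

```python
from itertools import permutations


def is_valid_order(order, join_predicates):
    """A left-deep order is valid if every table after the first is connected
    by a join predicate to at least one table already joined."""
    joined = {order[0]}
    for table in order[1:]:
        if not any(frozenset((table, other)) in join_predicates for other in joined):
            return False  # would require a cross product
        joined.add(table)
    return True


# Join predicates of the example query: t1-t2 and t1-t3, none between t2 and t3.
join_predicates = {frozenset(("t1", "t2")), frozenset(("t1", "t3"))}

all_orders = list(permutations(["t1", "t2", "t3"]))
valid = [o for o in all_orders if is_valid_order(o, join_predicates)]
invalid = [o for o in all_orders if not is_valid_order(o, join_predicates)]
print(invalid)  # [('t2', 't3', 't1'), ('t3', 't2', 't1')]
```

Under this rule, the two orders that begin by joining t2 with t3 are the ones rejected.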
Join-Order Optimization
[Diagram: six left-deep join trees (two HashJoins over three scans each), one per join order of t1, t2, t3.]
Order: t1, t2, t3
Order: t1, t3, t2
Order: t2, t1, t3
Order: t2, t3, t1
Order: t3, t1, t2
Order: t3, t2, t1
Join-Order Optimization
• Impala’s Implementation:
1. Heuristic
• Order tables descending by size
• Best plan typically has largest table on the left (if valid)
2. Plan enumeration & costing
• Generate all possible join orders starting from a given left-most table (starting with largest one)
• Ignore invalid join orders
• Estimate intermediate result sizes (key!)
• Choose plan that minimizes intermediate result sizes
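The two steps above can be sketched as a small enumeration. Everything numeric here is made up for illustration: the row counts, the join selectivities, and the simple cardinality formula (product of input sizes times selectivity) are assumptions, not Impala's estimator. Falling back to the next-largest table when the largest yields no valid plan is likewise one interpretation of the slide.

```python
from itertools import permutations

# Hypothetical row counts after applying scan predicates (not from the slides).
rows = {"t1": 1_000_000, "t2": 50_000_000, "t3": 100}
# Hypothetical join selectivities, keyed by the unordered pair of joined tables.
selectivity = {
    frozenset(("t1", "t2")): 1 / 1_000_000,
    frozenset(("t1", "t3")): 1 / 1_000,
}


def intermediate_sizes(order):
    """Estimated row count after each join of a left-deep order, or None if the
    order is invalid (some table has no join predicate with the tables before it)."""
    joined, card, sizes = [order[0]], rows[order[0]], []
    for table in order[1:]:
        sels = [selectivity[frozenset((table, t))]
                for t in joined if frozenset((table, t)) in selectivity]
        if not sels:
            return None
        card = card * rows[table]
        for s in sels:
            card *= s
        sizes.append(card)
        joined.append(table)
    return sizes


def best_order(tables):
    """Try left-most tables from largest to smallest; for the first one that
    admits a valid plan, return the order with the smallest total intermediate size."""
    for leftmost in sorted(tables, key=rows.get, reverse=True):
        rest = [t for t in tables if t != leftmost]
        candidates = []
        for perm in permutations(rest):
            order = (leftmost,) + perm
            sizes = intermediate_sizes(order)
            if sizes is not None:
                candidates.append((sum(sizes), order))
        if candidates:
            return min(candidates)[1]
    return None


print(best_order(["t1", "t2", "t3"]))  # ('t2', 't1', 't3')
```

With these made-up numbers, t2 is the largest table and only one order starting with it is valid, so the costing has nothing to choose between. Comparing `intermediate_sizes(("t1", "t2", "t3"))` with `intermediate_sizes(("t1", "t3", "t2"))` shows how joining the small, selective table first shrinks the intermediate results.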
Query Planning: Distributed Plans
• Distributed Aggregation
• Pre-aggregation where data is first materialized
• Merge-aggregation partitioned by grouping columns
• Distinct aggregation: additional level of pre- and merge aggregation
• Distributed Top-N
• Initial Top-N where data is first materialized
• Final Top-N at coordinator
• Distributed Union
• Pre-aggregation/Top-N placed into plans of each union operand
• Union-operand plans executed in parallel, merged via exchange
• Above strategies are currently fixed in Impala
• Independent of column/table stats
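The distributed aggregation and Top-N strategies can be simulated in a few lines. This is a toy, single-process sketch: each inner list stands for the rows scanned on one node, the "exchange" is a plain Python loop, and all function names are hypothetical. It only illustrates the pre-aggregate, hash-partition, merge-aggregate, initial Top-N, final Top-N sequence for a SUM grouped by one column.

```python
from collections import defaultdict
import heapq


def pre_aggregate(rows):
    """Per-node partial SUM, keyed by the grouping column."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def exchange_by_hash(partials, num_partitions):
    """Route each (key, partial sum) to the partition that owns the key."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, value in partial.items():
            partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def merge_aggregate(partition):
    """Combine the partial sums that arrived at one partition."""
    return pre_aggregate(partition)


def distributed_sum_top_n(node_inputs, num_partitions, n):
    partials = [pre_aggregate(rows) for rows in node_inputs]
    merged = [merge_aggregate(p) for p in exchange_by_hash(partials, num_partitions)]
    # Initial Top-N on each merge node, final Top-N at the coordinator.
    initial = [heapq.nlargest(n, m.items(), key=lambda kv: kv[1]) for m in merged]
    return heapq.nlargest(n, (kv for part in initial for kv in part), key=lambda kv: kv[1])


node_inputs = [[("a", 1), ("b", 5)], [("a", 2), ("c", 10)], [("b", 1)]]
print(distributed_sum_top_n(node_inputs, num_partitions=2, n=2))  # [('c', 10), ('b', 6)]
```

Because every key is routed to exactly one partition, each merge node sees all partial sums for its keys, and the result does not depend on how keys happen to hash.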
Query Planning: Distributed Joins
• Broadcast Join
• Join is co-located with left input
• Broadcast right input to all nodes executing join
• Build hash table on right input, streaming probe from left input
• → Preferred for small right side (relative to left side)
• Partitioned Join
• Both tables hash-partitioned on join columns
• Same build/probe procedure as above
• → Preferred for joins where both left and right side are large
• Cost-based decision based on table/column stats
• Minimize required network transfer
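The choice between the two strategies can be illustrated with a deliberately simple network-transfer model. The cost formulas below are an assumption made for this sketch (broadcast sends the right input to every node, partitioning sends both inputs across the network once) and should not be read as Impala's actual cost function. The function name and the byte sizes are hypothetical.

```python
def choose_join_strategy(left_bytes: int, right_bytes: int, num_nodes: int) -> str:
    """Pick the strategy with the smaller estimated network transfer.

    Assumed cost model (a simplification, not Impala's actual formula):
      broadcast   -> the right input is sent to every node running the join
      partitioned -> both inputs are sent across the network once
    """
    broadcast_cost = right_bytes * num_nodes
    partitioned_cost = left_bytes + right_bytes
    return "BROADCAST" if broadcast_cost <= partitioned_cost else "PARTITIONED"


# Small right side relative to the left side: broadcasting is cheaper.
print(choose_join_strategy(10**12, 10**6, num_nodes=10))      # BROADCAST
# Both sides large: partitioning both inputs is cheaper.
print(choose_join_strategy(10**12, 5 * 10**11, num_nodes=10))  # PARTITIONED
```

The input sizes would come from table and column stats, which is why missing stats can lead to a poor strategy choice.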
Query Planning: Distributed Plans
[Diagram: the single-node plan (TopN, Agg, two HashJoins over Scan: t1, Scan: t2 and Scan: t3) next to its distributed plan. The distributed plan contains the same scans and HashJoins, followed by Pre-Agg, MergeAgg, TopN and a final TopN. Exchange labels: Broadcast, Merge, hash t2.id, hash t1.id1, hash t1.custid. Placement labels: at HDFS DN, at HBase RS, at coordinator.]
Explain Example: TPCDS Q42
SELECT d.d_year, i.i_category_id, i.i_category, SUM(ss_ext_sales_price) AS total_sales
FROM store_sales ss
JOIN date_dim d
ON (ss.ss_sold_date_sk = d.d_date_sk)
JOIN item i
ON (ss.ss_item_sk = i.i_item_sk)
WHERE i.i_manager_id = 1 AND d.d_moy = 12 AND d.d_year = 1998
GROUP BY d.d_year, i.i_category_id, i.i_category
ORDER BY total_sales DESC, d_year, i_category_id, i_category
LIMIT 100
Explain Example: TPCDS Q42
+-----------------------------------------------------+
| Explain String |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
| |
| 06:TOP-N [LIMIT=100] |
| 05:AGGREGATE [FINALIZE] |
| 04:HASH JOIN [INNER JOIN] |
| |--02:SCAN HDFS [tpcds1000gb.item i] |
| 03:HASH JOIN [INNER JOIN] |
| |--01:SCAN HDFS [tpcds1000gb.date_dim d] |
| 00:SCAN HDFS [tpcds1000gb.store_sales ss] |
+-----------------------------------------------------+
set explain_level=0;
set num_nodes=1;
Explain Example: TPCDS Q42
+---------------------------------------------------------------------+
| Explain String |
+---------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=3.76GB VCores=3 |
| |
| 12:TOP-N [LIMIT=100] |
| 11:EXCHANGE [PARTITION=UNPARTITIONED] |
| 06:TOP-N [LIMIT=100] |
| 10:AGGREGATE [MERGE FINALIZE] |
| 09:EXCHANGE [PARTITION=HASH(d.d_year,i.i_category_id,i.i_category)] |
| 05:AGGREGATE |
| 04:HASH JOIN [INNER JOIN, BROADCAST] |
| |--08:EXCHANGE [BROADCAST] |
| | 02:SCAN HDFS [tpcds1000gb.item i] |
| 03:HASH JOIN [INNER JOIN, BROADCAST] |
| |--07:EXCHANGE [BROADCAST] |
| | 01:SCAN HDFS [tpcds1000gb.date_dim d] |
| 00:SCAN HDFS [tpcds1000gb.store_sales ss] |
+---------------------------------------------------------------------+
set explain_level=0;
set num_nodes=0;
Explain Example: TPCDS Q42
| …
| 03:HASH JOIN [INNER JOIN, BROADCAST] |
| | hash predicates: ss.ss_sold_date_sk = d.d_date_sk |
| | hosts=10 per-host-mem=511B |
| | tuple-ids=0,1 row-size=40B cardinality=8251124389 |
| | |
| |--07:EXCHANGE [BROADCAST] |
| | | hosts=3 per-host-mem=0B |
| | | tuple-ids=1 row-size=16B cardinality=29 |
| | | |
| | 01:SCAN HDFS [tpcds1000gb.date_dim d, PARTITION=RANDOM] |
| | partitions=1/1 size=9.77MB |
| | predicates: d.d_moy = 12, d.d_year = 1998 |
| | table stats: 73049 rows total |
| | column stats: all |
| | hosts=3 per-host-mem=48.00MB |
| | tuple-ids=1 row-size=16B cardinality=29 |
| | |
| 00:SCAN HDFS [tpcds1000gb.store_sales ss, PARTITION=RANDOM] |
| partitions=1823/1823 size=1.10TB |
| table stats: 8251124389 rows total |
| column stats: all |
| hosts=10 per-host-mem=3.75GB |
| tuple-ids=0 row-size=24B cardinality=8251124389 |
+--------------------------------------------------------------+
set explain_level=2;
set num_nodes=0;
Conclusion
• Cost-based choice of join order and strategy
• Critical for performance
• Relies on table and column stats
• Other plan optimizations currently independent of stats
• Likely to expand plan choices in the future
• Likely to increase reliance on stats
• Helpful Impala commands
• compute stats
• show table/column stats
• explain query/insert stmt
• set explain_level=[0-3]
• set num_nodes=1 → show single-node plan
Try It Out!
• Questions/comments?
• Download: cloudera.com/impala
• Email: impala-user@cloudera.org
• Join: groups.cloudera.org
Query Compilation in Impala

More Related Content

What's hot

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
Yue Chen
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Michael Rainey
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


Cloudera, Inc.
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
narsiman
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
Databricks
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
Yue Chen
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
Databricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 

What's hot (20)

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Cloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and AnalysisCloudera Impala Source Code Explanation and Analysis
Cloudera Impala Source Code Explanation and Analysis
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 

Similar to Query Compilation in Impala

Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Databricks
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
Databricks
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Databricks
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
Flink Forward
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
Michael Keane
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
Carl Lu
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
Paolo Platter
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
InfluxData
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
Jags Ramnarayan
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
confluent
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
MathewJohnSinoCruz
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
Amazon Web Services
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Gruter
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
Olav Sandstå
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
Prashant Gupta
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
Satoshi Nagayasu
 

Similar to Query Compilation in Impala (20)

Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Query Compilation in Impala

  • 1. Query Compilation in Impala
      Alexander Behm | Software Engineer
      May 2014 @ Impala User Group
  • 2. Flow of a SQL Query
      Diagram: Client → SQL Text → Compile Query [Impala Frontend (Java), focus of this talk] → Executable Plan → Execute Query [Impala Backend (C++)] → Query Results → Client
  • 3. Query Compilation
      Diagram: Client → SQL Text → Query Compiler [SQL Parsing → Parse Tree → Semantic Analysis → Parse Tree + Analyzer → Query Planning] → Executable Plan
  • 4. Query Parsing
      SELECT c1, SUM(c2)
      FROM t1 JOIN t2 USING(id)
      WHERE c3 > 10 GROUP BY c1
      Parse tree diagram, node labels: SelectStmt; SelectList, TableRefs, WhereClause, GroupByClause; ColRef, AggExpr, ColRef, BinaryPredicate, ColRef, IntLiteral, ColRef, TableRef, TableRef, UsingClause, ColRef
      • Applies SQL grammar, reports syntax errors
      • Produces parse tree capturing syntactic structure of query
  • 5. Semantic Analysis…
      • Precondition: Query is syntactically valid. Analysis operates on parse tree.
      • Consults table metadata
      • Do t1 and t2 exist? Does c1 exist in t1 or t2 (or both → error)? Does id exist in t1 and t2?
      • Does the user have privileges to SELECT from t1?
      • Checks type compatibility of expressions, adds implicit casts
      • c3 > 10 → c3 > cast(10 as bigint)
      • SQL rules (semantic, not syntactic)
      • Does c1 appear in the GROUP BY clause?
      SELECT c1, SUM(c2)
      FROM t1 JOIN t2 USING(id)
      WHERE c3 > 10 GROUP BY c1
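
The metadata and type checks on this slide can be sketched in a few lines. This is a minimal illustration, not Impala's analyzer: `CATALOG`, `AnalysisError`, `resolve_column`, and `add_implicit_cast` are hypothetical names, and the column types are assumed for the example.

```python
# Hypothetical catalog: table name -> {column name: type}. Types are assumed.
CATALOG = {
    "t1": {"id": "bigint", "c1": "string", "c3": "bigint"},
    "t2": {"id": "bigint", "c2": "double"},
}


class AnalysisError(Exception):
    """Raised when a query is syntactically valid but semantically invalid."""


def resolve_column(name, tables):
    """Return (table, type) for an unqualified column, or raise AnalysisError."""
    for table in tables:
        if table not in CATALOG:
            raise AnalysisError(f"unknown table: {table}")
    matches = [t for t in tables if name in CATALOG[t]]
    if not matches:
        raise AnalysisError(f"unknown column: {name}")
    if len(matches) > 1:
        raise AnalysisError(f"ambiguous column: {name} in {matches}")
    return matches[0], CATALOG[matches[0]][name]


def add_implicit_cast(column_type, literal, literal_type):
    """Render the literal, wrapped in a cast when the types differ."""
    if column_type == literal_type:
        return str(literal)
    return f"cast({literal} as {column_type})"


# c3 resolves to t1 only; the int literal 10 is cast to the column type.
table, col_type = resolve_column("c3", ["t1", "t2"])
predicate = f"c3 > {add_implicit_cast(col_type, 10, 'int')}"
print(predicate)
```

An unqualified reference to `id` would raise `AnalysisError` here, because the column exists in both tables.
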
  • 6. … Semantic Analysis
      • Expression substitution for views
      • Resolve column references against base tables
      • Preparation for Planning
      • Register state in analyzer for correct predicate assignment during planning
      • Register predicates (WHERE, HAVING, ON, USING, etc.)
      • Register outer-joined tables
      • Compute value-transfer graph and equivalence classes for predicate inference
      • (…)
      • Postcondition: Query is valid. An executable plan can be produced.
      Query over an inline view:
      SELECT c1, SUM(c2)
      FROM (SELECT dept AS c1, revenue AS c2,
            month AS c3 FROM t1) AS v
      WHERE c3 > 10 GROUP BY c1
      After substitution:
      SELECT dept, SUM(revenue) FROM t1 WHERE month > 10 GROUP BY dept
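
The view substitution on this slide can be reproduced with a small sketch. It rewrites strings with a regular expression purely for illustration; a real analyzer would substitute on expression trees. `VIEW_COLUMNS` and `substitute` are hypothetical names.

```python
import re

# Alias in the inline view v -> base column in t1, as on the slide.
VIEW_COLUMNS = {"c1": "dept", "c2": "revenue", "c3": "month"}


def substitute(expr, mapping):
    """Replace whole-word view column names with their base-table columns."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], expr)


select_list = [substitute(e, VIEW_COLUMNS) for e in ["c1", "SUM(c2)"]]
where = substitute("c3 > 10", VIEW_COLUMNS)
group_by = substitute("c1", VIEW_COLUMNS)
rewritten = (
    f"SELECT {', '.join(select_list)} FROM t1 "
    f"WHERE {where} GROUP BY {group_by}"
)
print(rewritten)
```
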
  • 7. Query Planning: Goals
      • Generate executable plan (“tree” of operators)
      • Maximize scan locality using HDFS DataNode (DN) block metadata
      • Minimize data movement
      • Full distribution of operators
      • Query operators
      • Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  • 8. Query Planning: Overview
      Diagram: Semantic Analysis → Parse Tree + Analyzer → Query Planner [Walk Parse Tree → Single-node Plan → Parallelize & Fragment] → Executable Plan
  • 9. Query Planning: Single-Node Plan
      • Four major functions:
      1. Parse Tree → Plan Tree
      2. Assigns predicates to lowest plan node
      3. Optimizes join order
      4. Prunes irrelevant columns
  • 10. Parse Tree → Single-Node Plan Tree
      Plan tree diagram, node labels: HashJoin, Scan: t1, Scan: t3, Scan: t2, HashJoin, TopN, Agg
      SELECT t1.dept, SUM(t2.revenue)
      FROM LargeHdfsTable t1
      JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
      JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
      WHERE t3.category = 'Online' AND t1.id > 10
      GROUP BY t1.dept
      HAVING COUNT(t2.revenue) > 10
      ORDER BY revenue LIMIT 10
  • 11. Predicate Assignment & Inference
      SELECT t1.dept, SUM(t2.revenue)
      FROM LargeHdfsTable t1
      JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
      JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
      WHERE t3.category = 'Online' AND t1.id > 10
      GROUP BY t1.dept
      HAVING COUNT(t2.revenue) > 10
      ORDER BY revenue LIMIT 10
      Plan tree diagram, node labels: HashJoin, Scan: t1, Scan: t3, Scan: t2, HashJoin, TopN, Agg
      Predicates annotated on the diagram: COUNT(t2.revenue) > 10; t1.id2 = t3.id; t1.id1 = t2.id; id1 > 10; category = 'Online'; id > 10 (inferred predicate)
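
One way to picture predicate inference is through equivalence classes of columns linked by equality join predicates. The sketch below is an assumption-laden illustration, not Impala's implementation: it handles inner joins only, the function names are hypothetical, and it assumes the filter is on the join column `t1.id1`, as annotated on the diagram.

```python
from collections import defaultdict


def equivalence_classes(equalities):
    """Group columns connected by equality predicates (simple union-find)."""
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for left, right in equalities:
        parent[find(left)] = find(right)

    classes = defaultdict(set)
    for col in list(parent):
        classes[find(col)].add(col)
    return list(classes.values())


def infer_predicates(equalities, filters):
    """Copy each single-column filter to every equivalent column."""
    inferred = []
    for cls in equivalence_classes(equalities):
        for col, op, value in filters:
            if col in cls:
                inferred += [(other, op, value) for other in sorted(cls - {col})]
    return inferred


joins = [("t1.id1", "t2.id"), ("t1.id2", "t3.id")]
filters = [("t1.id1", ">", 10)]
inferred = infer_predicates(joins, filters)
print(inferred)
```

The inferred filter on `t2.id` can then be assigned to the scan of t2, so fewer rows reach the join.
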
  • 12. Join-Order Optimization
      • Inner joins are commutative and associative
      • Query results correct independent of execution order
      • Query execution costs vary dramatically!
      • Hash table sizes, network transfers, #hash lookups
      • Join-order optimization
      • Impala only considers left-deep join trees
      • (Right join input is a table, not another join)
      • Find cheapest valid join order
      • Relies heavily on table and column statistics
      • Limitation: Choice of join order independent of join strategy
  • 13. Invalid Join Orders
      SELECT t1.dept, SUM(t2.revenue)
      FROM LargeHdfsTable t1
      JOIN HugeHdfsTable t2 ON (t1.id1 = t2.id)
      JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
      WHERE t3.category = 'Online' AND t1.id > 10
      GROUP BY t1.dept
      HAVING COUNT(t2.revenue) > 10
      ORDER BY revenue LIMIT 10
      No explicit or implicit predicate between t2 and t3
  • 14. Join-Order Optimization
      Diagram: six left-deep join trees (two HashJoin nodes over Scan: t1, Scan: t2, Scan: t3), one per order:
      Order: t1, t2, t3
      Order: t1, t3, t2
      Order: t2, t1, t3
      Order: t2, t3, t1
      Order: t3, t1, t2
      Order: t3, t2, t1
  • 15. Join-Order Optimization
      • Impala’s Implementation:
      1. Heuristic
      • Order tables descending by size
      • Best plan typically has largest table on the left (if valid)
      2. Plan enumeration & costing
      • Generate all possible join orders starting from a given left-most table (starting with largest one)
      • Ignore invalid join orders
      • Estimate intermediate result sizes (key!)
      • Choose plan that minimizes intermediate result sizes
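
The heuristic and enumeration steps above can be sketched as follows. Everything numeric here is assumed: the row counts in `ROWS`, the join selectivities, and the cardinality model (product of input sizes times selectivity) are illustrative, not Impala's estimates. Falling back to the next-largest table when no valid order exists is also an assumption of this sketch.

```python
from itertools import permutations

# Hypothetical row counts after scan predicates (assumed for illustration).
ROWS = {"t1": 1_000_000, "t2": 100_000_000, "t3": 1_000}

# Join predicates from the example query, with assumed join selectivities.
# There is no entry for (t2, t3): no predicate connects those tables.
SELECTIVITY = {
    frozenset({"t1", "t2"}): 1e-8,
    frozenset({"t1", "t3"}): 1e-4,
}


def cost(order):
    """Sum of estimated intermediate result sizes, or None if the order is invalid."""
    joined = {order[0]}
    rows = ROWS[order[0]]
    total = 0.0
    for table in order[1:]:
        pairs = [frozenset({table, other}) for other in joined]
        edges = [SELECTIVITY[p] for p in pairs if p in SELECTIVITY]
        if not edges:
            return None  # would need a cross product, e.g. t2 followed by t3
        rows = rows * ROWS[table]
        for selectivity in edges:
            rows *= selectivity
        total += rows
        joined.add(table)
    return total


def best_left_deep_order(tables):
    """Try left-most tables from largest to smallest; return the cheapest valid order."""
    for leftmost in sorted(tables, key=ROWS.get, reverse=True):
        rest = [t for t in tables if t != leftmost]
        costed = []
        for perm in permutations(rest):
            order = (leftmost, *perm)
            estimate = cost(order)
            if estimate is not None:
                costed.append((estimate, order))
        if costed:
            return min(costed)[1]
    return None


best = best_left_deep_order(["t1", "t2", "t3"])
print(best)
```

With these assumed numbers the order t2, t3, t1 is rejected as invalid, which matches the "Invalid Join Orders" slide.
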
  • 16. Query Planning: Overview
      Diagram: Semantic Analysis → Parse Tree + Analyzer → Query Planner [Walk Parse Tree → Single-node Plan → Parallelize & Fragment] → Executable Plan
  • 17. Query Planning: Distributed Plans
      • Distributed Aggregation
      • Pre-aggregation where data is first materialized
      • Merge-aggregation partitioned by grouping columns
      • Distinct aggregation: additional level of pre- and merge aggregation
      • Distributed Top-N
      • Initial Top-N where data is first materialized
      • Final Top-N at coordinator
      • Distributed Union
      • Pre-aggregation/top-n placed into plans of each union operand
      • Union-operand plans executed in parallel, merged via exchange
      • Above strategies are currently fixed in Impala
      • Independent of column/table stats
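
The two-phase aggregation described above can be illustrated with a small in-process simulation. This is only a sketch of the idea for a SUM grouped by one column: the node data, function names, and the use of Python's `hash` for partitioning are assumptions, and no real network exchange takes place.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Partial SUM per group, computed where the data is first materialized."""
    partial = defaultdict(float)
    for dept, revenue in rows:
        partial[dept] += revenue
    return dict(partial)


def hash_partition(partials, num_partitions):
    """Route each group to one merge node by hashing the grouping column."""
    buckets = [[] for _ in range(num_partitions)]
    for partial in partials:
        for dept, subtotal in partial.items():
            buckets[hash(dept) % num_partitions].append((dept, subtotal))
    return buckets


def merge_aggregate(bucket):
    """Final SUM per group; each group lands in exactly one bucket."""
    return pre_aggregate(bucket)


# Two hypothetical scan nodes, each holding part of the data.
node_a = [("toys", 10.0), ("books", 5.0), ("toys", 2.5)]
node_b = [("books", 1.0), ("toys", 4.0)]

partials = [pre_aggregate(node_a), pre_aggregate(node_b)]
result = {}
for bucket in hash_partition(partials, num_partitions=2):
    result.update(merge_aggregate(bucket))
print(result)
```

Pre-aggregation shrinks what has to cross the exchange: node_a sends two partial rows instead of three input rows.
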
  • 18. Query Planning: Distributed Joins
      • Broadcast Join
      • Join is co-located with left input
      • Broadcast right input to all nodes executing join
      • Build hash table on right input, streaming probe from left input
      • → Preferred for small right side (relative to left side)
      • Partitioned Join
      • Both tables hash-partitioned on join columns
      • Same build/probe procedure as above
      • → Preferred for joins where both left and right side are large
      • Cost-based decision based on table/column stats
      • Minimize required network transfer
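
A rough way to compare the two strategies is to estimate the bytes each one moves over the network. The cost model below is an assumed simplification, not Impala's planner code: a broadcast copies the right input to every node running the join, while a partitioned join reshuffles both inputs once. `choose_join_strategy` is a hypothetical name. The sample figures come from the date_dim join in the explain output of slide 23 (cardinality times row size, 10 hosts).

```python
def choose_join_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the strategy with the smaller estimated network transfer."""
    broadcast_cost = right_bytes * num_nodes     # right input copied to every node
    partitioned_cost = left_bytes + right_bytes  # both inputs reshuffled once
    if broadcast_cost <= partitioned_cost:
        return "BROADCAST", broadcast_cost
    return "PARTITIONED", partitioned_cost


# store_sales: 8251124389 rows * 24 B; filtered date_dim: 29 rows * 16 B.
left_bytes = 8251124389 * 24
right_bytes = 29 * 16
strategy, transfer = choose_join_strategy(left_bytes, right_bytes, num_nodes=10)
print(strategy, transfer)
```

Under this model a broadcast wins whenever the right side is much smaller than the left, which is consistent with the BROADCAST joins shown in the explain example.
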
  • 19. Query Planning: Distributed Plans
      Distributed plan diagram, node labels: HashJoin, Scan: t2, Scan: t3, Scan: t1, HashJoin, TopN, Pre-Agg, MergeAgg, TopN
      Edge labels: Broadcast, Merge, hash t2.id, hash t1.id1, hash t1.custid
      Fragment locations: at HDFS DN, at HBase RS, at coordinator
      Single-Node Plan diagram, node labels: HashJoin, Scan: t2, Scan: t3, Scan: t1, HashJoin, TopN, Agg
  • 20. Explain Example: TPCDS Q42
      SELECT d.d_year, i.i_category_id, i.i_category,
             SUM(ss_ext_sales_price) AS total_sales
      FROM store_sales ss
      JOIN date_dim d ON (ss.ss_sold_date_sk = d.d_date_sk)
      JOIN item i ON (ss.ss_item_sk = i.i_item_sk)
      WHERE i.i_manager_id = 1 AND d.d_moy = 12 AND d.d_year = 1998
      GROUP BY d.d_year, i.i_category_id, i.i_category
      ORDER BY total_sales DESC, d_year, i_category_id, i_category
      LIMIT 100
  • 21. Explain Example: TPCDS Q42
      +-----------------------------------------------------+
      | Explain String                                      |
      +-----------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=0B VCores=0 |
      |                                                     |
      | 06:TOP-N [LIMIT=100]                                |
      | 05:AGGREGATE [FINALIZE]                             |
      | 04:HASH JOIN [INNER JOIN]                           |
      | |--02:SCAN HDFS [tpcds1000gb.item i]                |
      | 03:HASH JOIN [INNER JOIN]                           |
      | |--01:SCAN HDFS [tpcds1000gb.date_dim d]            |
      | 00:SCAN HDFS [tpcds1000gb.store_sales ss]           |
      +-----------------------------------------------------+
      set explain_level=0; set num_nodes=1;
  • 22. Explain Example: TPCDS Q42
      +---------------------------------------------------------------------+
      | Explain String                                                      |
      +---------------------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=3.76GB VCores=3             |
      |                                                                     |
      | 12:TOP-N [LIMIT=100]                                                |
      | 11:EXCHANGE [PARTITION=UNPARTITIONED]                               |
      | 06:TOP-N [LIMIT=100]                                                |
      | 10:AGGREGATE [MERGE FINALIZE]                                       |
      | 09:EXCHANGE [PARTITION=HASH(d.d_year,i.i_category_id,i.i_category)] |
      | 05:AGGREGATE                                                        |
      | 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
      | |--08:EXCHANGE [BROADCAST]                                          |
      | |  02:SCAN HDFS [tpcds1000gb.item i]                                |
      | 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
      | |--07:EXCHANGE [BROADCAST]                                          |
      | |  01:SCAN HDFS [tpcds1000gb.date_dim d]                            |
      | 00:SCAN HDFS [tpcds1000gb.store_sales ss]                           |
      +---------------------------------------------------------------------+
      set explain_level=0; set num_nodes=0;
  • 23. Explain Example: TPCDS Q42
      | …                                                            |
      | 03:HASH JOIN [INNER JOIN, BROADCAST]                         |
      | |  hash predicates: ss.ss_sold_date_sk = d.d_date_sk         |
      | |  hosts=10 per-host-mem=511B                                |
      | |  tuple-ids=0,1 row-size=40B cardinality=8251124389         |
      | |                                                            |
      | |--07:EXCHANGE [BROADCAST]                                   |
      | |  |  hosts=3 per-host-mem=0B                                |
      | |  |  tuple-ids=1 row-size=16B cardinality=29                |
      | |  |                                                         |
      | |  01:SCAN HDFS [tpcds1000gb.date_dim d, PARTITION=RANDOM]   |
      | |     partitions=1/1 size=9.77MB                             |
      | |     predicates: d.d_moy = 12, d.d_year = 1998              |
      | |     table stats: 73049 rows total                          |
      | |     column stats: all                                      |
      | |     hosts=3 per-host-mem=48.00MB                           |
      | |     tuple-ids=1 row-size=16B cardinality=29                |
      | |                                                            |
      | 00:SCAN HDFS [tpcds1000gb.store_sales ss, PARTITION=RANDOM]  |
      |    partitions=1823/1823 size=1.10TB                          |
      |    table stats: 8251124389 rows total                        |
      |    column stats: all                                         |
      |    hosts=10 per-host-mem=3.75GB                              |
      |    tuple-ids=0 row-size=24B cardinality=8251124389           |
      +--------------------------------------------------------------+
      set explain_level=2; set num_nodes=0;
  • 24. Conclusion
      • Cost-based choice of join order and strategy
      • Critical for performance
      • Relies on table and column stats
      • Other plan optimizations currently independent of stats
      • Likely to expand plan choices in the future
      • Likely to increase reliance on stats
      • Helpful Impala commands
      • compute stats
      • show table/column stats
      • explain query/insert stmt
      • set explain_level=[0-3]
      • set num_nodes=1 → show single-node plan
  • 25. Try It Out!
      • Questions/comments?
      • Download: cloudera.com/impala
      • Email: impala-user@cloudera.org
      • Join: groups.cloudera.org