SlideShare a Scribd company logo
1 of 36
Download to read offline
Cost-based query optimization in 
Apache Hive 0.14 
Julian Hyde Julian Hyde 
Page ‹#› © Hortonworks Inc. 2014 
Hive User Group, NYC 
October 15th, 2014
About me 
Julian Hyde 
Architect at Hortonworks 
Open source: 
• Founder & lead, Apache Calcite (query optimization framework) 
• Founder & lead, Pentaho Mondrian (analysis engine) 
• Committer, Apache Drill 
• Contributor, Apache Hive 
• Contributor, Cascading Lingual (SQL interface to Cascading) 
Past: 
• SQLstream (streaming SQL) 
• Broadbase (data warehouse) 
• Oracle (SQL kernel development) 
Page ‹#› © Hortonworks Inc. 2014
Hadoop - A New Data Architecture for New Data 
APP 
LICA 
TIO 
NS 
DAT 
A 
SYS 
TEM 
Business Analytics 
Custom Applications 
Packaged Applications 
RDBMS EDW MPP 
REPOSITORIES 
SOU 
RCE 
S 
Existing Sources 
(CRM, ERP, Clickstream, Logs) 
Page ‹#› © Hortonworks Inc. 2014 
OLTP, ERP, CRM Systems 
Unstructured documents, emails 
Server logs 
Clickstream 
Sentiment, Web Data 
Sensor. Machine Data 
Geolocation 
New Data Requirements: 
! 
• Scale 
• Economics 
• Flexibility 
Traditional Data Architecture
Interactive SQL-IN-Hadoop Delivered 
Stinger Initiative – DELIVERED 
Next generation SQL based 
interactive query in Hadoop 
Speed 
Page ‹#› © Hortonworks Inc. 2014 
© Hortonworks Inc. 2013 
Improve Hive query performance has increased by 100X to allow for 
interactive query times (seconds) 
Scale 
The only SQL interface to Hadoop designed for queries that scale from TB 
to PB 
SQL 
Support broadest range of SQL semantics for analytic applications running 
against Hadoop 
SQL 
Business 
Analytics 
Custom 
Apps 
Window 
Functions 
Apache Hive 
Apache 
MapReduce 
Apache 
Tez 
Apache YARN 
!! 
1 ° ° ° 
° ° ° ° 
° ° ° ° 
!! 
Apache Hive Contribution… an Open Community at its finest 
1,672 
Jira Tickets Closed 
145 
Developers 
44 
Companies 
~390,000 
Lines Of Code Added… (2x) 
° 
° 
N 
HDFS 
(Hadoop Distributed File System) 
Stinger Project 
Stinger Phase 1: 
• Base Optimizations 
• SQL Types 
• SQL Analytic Functions 
• ORCFile Modern File Format 
Sting!er Phase 2: 
Delivered 
HDP 2.1 
• SQL Types 
• SQL Analytic Functions 
• Advanced Optimizations 
• Performance Boosts via YARN 
!S 
tinger Phase 3 
• Hive on Apache Tez 
• Query Service (always on) 
• Buffer Cache 
• Cost Based Optimizer (Calcite) 
13 
Months 
Governance 
& Integration 
Security 
Operations 
Data Access 
Data 
Management 
ORC File
Hive – Single tool for all SQL use cases 
Page ‹#› © Hortonworks Inc. 2014 
OLTP, ERP, CRM Systems 
Unstructured documents, emails 
Server logs 
Clickstream 
Sentiment, Web Data 
Sensor. Machine Data 
Geolocation 
Interactive 
Analytics 
Batch Reports / 
Deep Analytics 
Hive - SQL 
ETL / ELT
Incremental cutover to cost-based optimization 
Release Date Remarks 
Apache Hive 0.12 October 2013 • Rule-based Optimizations 
Page ‹#› © Hortonworks Inc. 2014 
• No join reordering 
• Main optimizations: predicate push- 
Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down 
conditions: 
… FROM t1, t2 WHERE … 
HDP 2.1 April 2014 Cost-based ordering of joins 
Apache Hive 0.14 soon Bushy joins, large joins, better 
operator coverage, better statistics, …
Apache Calcite 
(incubating) 
Apache 
Calcite 
Page ‹#› © Hortonworks Inc. 2014
Apache Calcite 
Apache incubator project since May, 2014 
Query planning framework 
• Relational algebra, rewrite rules, cost model 
• Extensible 
• Usable standalone (JDBC) or embedded 
Adoption 
• Lingual – SQL interface to Cascading 
• Apache Drill 
• Apache Hive 
Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory 
data 
Page ‹#› © Hortonworks Inc. 2014
Conventional DB architecture 
Page ‹#› © Hortonworks Inc. 2014
Calcite architecture 
Page ‹#› © Hortonworks Inc. 2014
Expression tree 
Splunk 
Table: splunk 
MySQL 
Page ‹#› © Hortonworks Inc. 2014 
Key: product_id 
join 
Key: product_name 
Agg: count 
group 
Condition: 
action = 
'purchase' 
filter 
Key: c DESC 
sort 
scan 
scan 
Table: products 
SELECT p.“product_name”, COUNT(*) AS c 
FROM “splunk”.”splunk” AS s 
JOIN “mysql”.”products” AS p 
ON s.”product_id” = p.”product_id” 
WHERE s.“action” = 'purchase' 
GROUP BY p.”product_name” 
ORDER BY c DESC
Expression tree 
(optimized) 
Splunk 
Table: splunk 
Page ‹#› © Hortonworks Inc. 2014 
Key: product_id 
join 
Key: product_name 
Agg: count 
group 
Condition: 
action = 
'purchase' 
filter 
Key: c DESC 
sort 
scan 
MySQL 
scan 
Table: products 
SELECT p.“product_name”, COUNT(*) AS c 
FROM “splunk”.”splunk” AS s 
JOIN “mysql”.”products” AS p 
ON s.”product_id” = p.”product_id” 
WHERE s.“action” = 'purchase' 
GROUP BY p.”product_name” 
ORDER BY c DESC
Calcite – APIs and SPIs 
Page ‹#› © Hortonworks Inc. 2014 
Cost, statistics 
RelOptCost 
RelOptCostFactory 
RelMetadataProvider 
• RelMdColumnUniquensss 
• RelMdDistinctRowCount 
• RelMdSelectivity 
SQL parser 
SqlNode 
SqlParser 
SqlValidator 
Transformation rules 
RelOptRule 
• MergeFilterRule 
• PushAggregateThroughUnionRule 
• 100+ more 
Global transformations 
• Unification (materialized view) 
• Column trimming 
• De-correlation 
Relational algebra 
RelNode (operator) 
• TableScan 
• Filter 
• Project 
• Union 
• Aggregate 
• … 
RelDataType (type) 
RexNode (expression) 
RelTrait (physical property) 
• RelConvention (calling-convention) 
• RelCollation (sortedness) 
• TBD (bucketedness/distribution) JDBC driver 
Metadata 
Schema 
Table 
Function 
• TableFunction 
• TableMacro 
Lattice
Just added to Calcite - materialized views & lattices 
Page ‹#› © Hortonworks Inc. 2014 
Query: SELECT x, SUM(y) FROM t GROUP BY x 
In-memory 
materialized 
queries 
Tables 
on disk 
http://hortonworks.com/blog/dmmq/
Calcite Lattice () 1 
(z, s) 43.4k (y, m) 60 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
(z, s, g, y, 
m) 912k 
(s, g, y, m) 6k 
(z, g, y, m) 
909k 
(z, s, y, m) 
831k 
raw 1m 
(z, s, g, m) 
644k 
(z, s, g, y) 
392k 
(z, s, g) 87k 
(g, y) 10 
(g, y, m) 120 
(g, m) 24
Now… back to Hive 
Page ‹#› © Hortonworks Inc. 2014 
Apache 
Calcite
CBO in Hive 
Why cost-based optimization? 
Ease of Use – Join Reordering 
View Chaining 
Ad hoc queries involving multiple views 
Enables BI Tools as front ends to Hive 
More efficient & maintainable query preparation process 
Laying the groundwork for deeper optimizations, e.g. materialized views 
Page ‹#› © Hortonworks Inc. 2014 
17
Query preparation – Hive 0.13 
SQL 
parser 
Page ‹#› © Hortonworks Inc. 2014 
Semantic 
analyzer 
Logical 
Optimizer 
Physical 
Optimizer 
Abstract 
Syntax 
Tree (AST) 
Hive SQL 
Annotated 
AST 
Plan 
Tez 
Tuned 
Plan
Query preparation – full CBO 
SQL 
parser 
Page ‹#› © Hortonworks Inc. 2014 
Semantic 
analyzer 
Annotated 
AST 
Translate 
to algebra 
Physical 
Optimizer 
Abstract 
Syntax 
Tree (AST) 
Hive SQL 
Tez 
Tuned 
Plan Calcite 
optimizer 
RelNode
Query preparation – Hive 0.14 
SQL 
parser 
Page ‹#› © Hortonworks Inc. 2014 
Semantic 
analyzer 
Logical 
Optimizer 
Physical 
Optimizer 
Hive SQL 
AST with optimized 
join-ordering 
Tez 
Tuned 
Plan 
Translate 
to algebra 
Calcite 
optimizer
Query Execution – The basics 
Page ‹#› © Hortonworks Inc. 2014 
© Hortonworks Inc. 2013 
21 
SELECT R1.x 
FROM R1 
JOIN R2 ON R1.x = R2.x 
JOIN R3 on R1.x = R3.x AND 
R2.x = R3.x 
WHERE R1.z > 10; 
π 
σ 
⋈ 
⋈ 
R1 R2 
R3 
TS [R1] 
TS [R2] 
RS 
RS 
Shuffle 
Join 
TS [R3] 
Map 
Join 
Filter FS
Calcite Planner Process 
Hive Op ! RelNode 
Hive Expr ! RexNode 
Logical Plan 
Physical traits: Table 
Part./Buckets; 
RedSink Ops 
removed 
Page ‹#› © Hortonworks Inc. 2014 
Hive 
Plan 
Planner 
RelNode 
RelNode Converter Graph 
RexNode Converter 
1. Plan Graph 
• Node for each node in 
Input Plan 
• Each node is a Set of 
alternate Sub Plans 
• Set further divided into 
Subsets: based on traits 
like sortedness 
2. Rules 
• Rule: specifies a Operator 
sub-graph to match and 
logic to generate equivalent 
‘better’ sub-graph. 
• We only have Join 
Reordering Rules. 
3. Cost Model 
• RelNodes have Cost (& 
Cumulative Cost) 
• We only use Cardinality 
for Cost. 
4. Metadata Providers 
- Used to Plugin Schema, 
Cost Formulas: Selectivity, 
NDV calculations etc. 
- We only added 
Selectivity and NDV 
formulas; Schema is 
only available at the 
Node level 
Rule Match Queue 
- Add Rule matches to Queue 
- Apply Rule match 
transformations to Plan Graph 
- Iterate for fixed iterations or until 
Cost doesn’t change. 
- Match importance based on 
Cost of RelNode and height. 
Best 
RelNode 
Graph 
AST Converter 
Revised 
AST
Star schema 
Page ‹#› © Hortonworks Inc. 2014 
Sales Time Inventory 
Product 
Customer 
Warehouse 
Key 
Fact table 
Dimension table 
Many-to-one relationship
Query combining two stars 
Page ‹#› © Hortonworks Inc. 2014 
Sales Time Inventory 
Product 
Customer 
Warehouse 
SELECT product.id, sum(sales.units), sum(inventory.on_hand) 
FROM sales ON … 
JOIN customer ON … 
JOIN time ON … 
JOIN product ON … 
JOIN inventory ON … 
JOIN warehouse ON … 
WHERE time.year = 2014 
AND time.quarter = ‘Q1’ 
AND product.color = ‘Red’ 
AND warehouse.state = ‘WA’ 
GROUP BY …
Left-deep tree 
Initial tree is “left-deep” 
No join node is the right child of its parent 
Join-ordering algorithm chooses what order 
to join tables – does not re-shape the tree 
Typical plan: 
• Start with largest table at bottom left 
• Join tables with more selective conditions 
first 
Page ‹#› © Hortonworks Inc. 2014 
Warehouse 
Product 
Time 
Sales Customer 
Inventory
Bushy tree 
No restrictions on where join 
nodes can occur 
“Bushes” consist of fact tables 
(Sales and Inventory) surrounded 
by many-to-one related dimension 
tables 
Dimension tables have a filtering 
effect 
This tree produces the same 
result as previous plan but is 
more efficient 
Page ‹#› © Hortonworks Inc. 2014 
Time 
Sales Customer 
Product 
Inventory Warehouse
Two approaches to join optimization 
Algorithm #1 “exhaustive search” 
Apply 3 transformation rules exhaustively: 
• SwapJoinRule: A join B ! B join A 
• PushJoinThroughJoinRule: (A join B) join C ! (A join C) join B 
• CommutativeJoinRule: (A join B) join C ! A join (B join C) 
Finds every possible plan, but not practical for more than ~8 joins 
Algorithm #2 “greedy” 
Build a graph iteratively 
Use heuristics to choose the “best” node to add next 
Applying them to Hive 
We can use both algorithms – and we do! 
Both are sensitive to bad statistics – e.g. poor estimation of intermediate result set sizes 
Page ‹#› © Hortonworks Inc. 2014
Statistics 
Feeding the beast 
CBO disabled if your tables don’t have statistics 
• No longer require statistics on all columns, just join columns 
Better optimizations need ever-better statistics… so, statistics are getting better 
Kinds of statistics 
Raw statistics on stored data: row counts, number-of-distinct-values (NDV) 
Statistics on intermediate operators, computed using selectivity estimates 
• Much improved selectivity estimates this release, based on NDVs 
• Planned improvements to raw statistics (e.g. histograms, unique keys, sort order) will help 
• Materialized views 
Run-time statistics 
• Example 1: 90% of the rows in this join have the same key ! use skew join 
• Example 2: Only 10 distinct values of GROUP BY key ! auto-reduce parallelism 
Page ‹#› © Hortonworks Inc. 2014
Stored statistics – recent improvements 
ANALYZE 
Clean up command syntax 
Faster computation 
Table vs partition statistics 
All statistics now stored per partition 
Statistics retrieval 
Faster retrieval 
Merge partition statistics 
Extrapolate for missing statistics 
Page ‹#› © Hortonworks Inc. 2014 
Extrapolation 
SQL: 
SELECT productId, COUNT(*) 
FROM Sales 
WHERE year = 2014 
GROUP BY productId 
Required statistic: NDV(productId) 
Statistics available 
2013 
Q1 
2013 
Q2 
2013 
Q3 
2013 
Q4 
2014 
Q1 
2014 
Q2 
2014 
Q3 
2014 
Q4 
Used in query 
Extrapolate 
{Q1, Q2, Q3} 
stats for Q4
Dynamic partition pruning 
Consider a query with a partitioned fact table, filters on the dimension 
table: 
SELECT … FROM Sales 
JOIN Time ON Sales.time_id = Time.time_id 
WHERE time.year = 2014 AND time.quarter IN (‘Q1’, ‘Q2’) 
At execute time, DAG figures out which 
partitions could possibly match, and cancels 
scans of the others 
2013 
Q1 
Page ‹#› © Hortonworks Inc. 2014 
2013 
Q2 
2013 
Q3 
2013 
Q4 
2014 
Q1 
2014 
Q2 
2014 
Q3 
2014 
Q4 
Time
Summary 
Join-ordering: (exhaustive & heuristic), scalability, bushy joins 
Statistics – faster, better, extrapolate if stats missing 
Very few operators that CBO can’t handle – TABLESAMPLE, SCRIPT, multi- 
INSERT 
Dynamic partition pruning 
Auto-reduce parallelism 
Page ‹#› © Hortonworks Inc. 2014 
31
Show me the numbers… 
Page ‹#› © Hortonworks Inc. 2014
TPC-DS (30TB) Q17 
Joins Store Sales, Store Returns and Catalog 
Sales fact tables. 
Each of the fact tables are independently 
restricted by time. 
Analysis at Item and Store grain, so these 
dimensions are also joined in. 
As specified Query starts by joining the 3 Fact 
tables. 
Page ‹#› © Hortonworks Inc. 2014 
SELECT i_item_id 
,i_item_desc 
,s_state 
,count(ss_quantity) as store_sales_quantitycount 
,…. 
FROM store_sales ss ,store_returns sr, catalog_sales cs, 
date_dim d1, date_dim d2, date_dim d3, store s, item I 
WHERE d1.d_quarter_name = '2000Q1’ 
AND d1.d_date_sk = ss.ss_sold_date_sk 
AND i.i_item_sk = ss.ss_item_sk AND … 
GROUP BY i_item_id ,i_item_desc, ,s_state 
ORDER BY i_item_id ,i_item_desc, s_state 
LIMIT 100; 
CBO Elapsed 
(s) 
Intemediate 
data (GB) 
Off 10,683 5,017 
On 1,284 275
TPC-DS (200G) queries 
Query Hive 14 
CBO off 
Page ‹#› © Hortonworks Inc. 2014 
Hive 14 
CBO on 
Gain 
Q15 84 44 91% 
Q22 123 99 24% 
Q29 1677 48 3,394% 
Q40 118 29 307% 
Q51 276 80 245% 
Q80 842 70 1,103% 
Q82 278 23 1,109% 
Q87 275 51 439% 
Q92 511 80 539% 
Q93 160 69 132% 
Q97 483 79 511%
Stinger.next 
• SQL compliance: interval data type, non-equi joins, set operators, more 
sub-queries 
• Transactions: COMMIT, savepoint, rollback) 
• LLAP 
• Materialized views 
• In-memory 
• Automatic or manual 
http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/ 
Page ‹#› © Hortonworks Inc. 2014
Thank you! 
Upcoming talk 
“SQL on everything — in memory” 
Thursday October 16th 
Strata & Hadoop World Conference, NYC 
@julianhyde 
http://hive.apache.org/ 
http://calcite.incubator.apache.org/ 
Page ‹#› © Hortonworks Inc. 2014

More Related Content

What's hot

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopJulian Hyde
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using OptiqJulian Hyde
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Julian Hyde
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineeringJulian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 

What's hot (20)

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Discardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With HadoopDiscardable In-Memory Materialized Queries With Hadoop
Discardable In-Memory Materialized Queries With Hadoop
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using Optiq
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.SQL Now! How Optiq brings the best of SQL to NoSQL data.
SQL Now! How Optiq brings the best of SQL to NoSQL data.
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 

Viewers also liked

Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Hao Chen
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?Julian Hyde
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management frameworkJulian Hyde
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Apache Eagle in Action
Apache Eagle in ActionApache Eagle in Action
Apache Eagle in ActionHao Chen
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 

Viewers also liked (16)

Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践Apache Eagle: eBay构建开源分布式实时预警引擎实践
Apache Eagle: eBay构建开源分布式实时预警引擎实践
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
What's new in Mondrian 4?
What's new in Mondrian 4?What's new in Mondrian 4?
What's new in Mondrian 4?
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
Apache Eagle in Action
Apache Eagle in ActionApache Eagle in Action
Apache Eagle in Action
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 

Similar to Cost-based query optimization in Apache Hive 0.14

Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsKognitio
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overviewRohit Jain
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problemsAbhishek Gupta
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_Spark
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_SparkSunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_Spark
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_SparkMopuru Babu
 
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_Spark
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_SparkSunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_Spark
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_SparkMopuru Babu
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 

Similar to Cost-based query optimization in Apache Hive 0.14 (20)

Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analyticsWeb Briefing: Unlock the power of Hadoop to enable interactive analytics
Web Briefing: Unlock the power of Hadoop to enable interactive analytics
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Hortonworks.bdb
Hortonworks.bdbHortonworks.bdb
Hortonworks.bdb
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Trafodion overview
Trafodion overviewTrafodion overview
Trafodion overview
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_Spark
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_SparkSunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_Spark
Sunshine consulting mopuru babu cv_java_j2_ee_spring_bigdata_scala_Spark
 
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_Spark
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_SparkSunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_Spark
Sunshine consulting Mopuru Babu CV_Java_J2ee_Spring_Bigdata_Scala_Spark
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 

More from Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databasesJulian Hyde
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 

More from Julian Hyde (17)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Cost-based query optimization in Apache Hive 0.14

  • 1. Cost-based query optimization in Apache Hive 0.14 Julian Hyde Julian Hyde Page ‹#› © Hortonworks Inc. 2014 Hive User Group, NYC October 15th, 2014
  • 2. About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Calcite (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development) Page ‹#› © Hortonworks Inc. 2014
  • 3. Hadoop - A New Data Architecture for New Data APP LICA TIO NS DAT A SYS TEM Business Analytics Custom Applications Packaged Applications RDBMS EDW MPP REPOSITORIES SOU RCE S Existing Sources (CRM, ERP, Clickstream, Logs) Page ‹#› © Hortonworks Inc. 2014 OLTP, ERP, CRM Systems Unstructured documents, emails Server logs Clickstream Sentiment, Web Data Sensor. Machine Data Geolocation New Data Requirements: ! • Scale • Economics • Flexibility Traditional Data Architecture
  • 4. Interactive SQL-IN-Hadoop Delivered Stinger Initiative – DELIVERED Next generation SQL based interactive query in Hadoop Speed Page ‹#› © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 Improve Hive query performance has increased by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB SQL Support broadest range of SQL semantics for analytic applications running against Hadoop SQL Business Analytics Custom Apps Window Functions Apache Hive Apache MapReduce Apache Tez Apache YARN !! 1 ° ° ° ° ° ° ° ° ° ° ° !! Apache Hive Contribution… an Open Community at its finest 1,672 Jira Tickets Closed 145 Developers 44 Companies ~390,000 Lines Of Code Added… (2x) ° ° N HDFS (Hadoop Distributed File System) Stinger Project Stinger Phase 1: • Base Optimizations • SQL Types • SQL Analytic Functions • ORCFile Modern File Format Sting!er Phase 2: Delivered HDP 2.1 • SQL Types • SQL Analytic Functions • Advanced Optimizations • Performance Boosts via YARN !S tinger Phase 3 • Hive on Apache Tez • Query Service (always on) • Buffer Cache • Cost Based Optimizer (Calcite) 13 Months Governance & Integration Security Operations Data Access Data Management ORC File
  • 5. Hive – Single tool for all SQL use cases Page ‹#› © Hortonworks Inc. 2014 OLTP, ERP, CRM Systems Unstructured documents, emails Server logs Clickstream Sentiment, Web Data Sensor. Machine Data Geolocation Interactive Analytics Batch Reports / Deep Analytics Hive - SQL ETL / ELT
  • 6. Incremental cutover to cost-based optimization Release Date Remarks Apache Hive 0.12 October 2013 • Rule-based Optimizations Page ‹#› © Hortonworks Inc. 2014 • No join reordering • Main optimizations: predicate push- Apache Hive 0.13 April 2014 “Old-style” JOIN & push-down conditions: … FROM t1, t2 WHERE … HDP 2.1 April 2014 Cost-based ordering of joins Apache Hive 0.14 soon Bushy joins, large joins, better operator coverage, better statistics, …
  • 7. Apache Calcite (incubating) Apache Calcite Page ‹#› © Hortonworks Inc. 2014
  • 8. Apache Calcite Apache incubator project since May, 2014 Query planning framework • Relational algebra, rewrite rules, cost model • Extensible • Usable standalone (JDBC) or embedded Adoption • Lingual – SQL interface to Cascading • Apache Drill • Apache Hive Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data Page ‹#› © Hortonworks Inc. 2014
  • 9. Conventional DB architecture Page ‹#› © Hortonworks Inc. 2014
  • 10. Calcite architecture Page ‹#› © Hortonworks Inc. 2014
  • 11. Expression tree Splunk Table: splunk MySQL Page ‹#› © Hortonworks Inc. 2014 Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan scan Table: products SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC
  • 12. Expression tree (optimized) Splunk Table: splunk Page ‹#› © Hortonworks Inc. 2014 Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan MySQL scan Table: products SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC
  • 13. Calcite – APIs and SPIs Page ‹#› © Hortonworks Inc. 2014 Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • MergeFilterRule • PushAggregateThroughUnionRule • 100+ more Global transformations • Unification (materialized view) • Column trimming • De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • TBD (bucketedness/distribution) JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice
  • 14. Just added to Calcite - materialized views & lattices Page ‹#› © Hortonworks Inc. 2014 Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk http://hortonworks.com/blog/dmmq/
  • 15. Calcite Lattice () 1 (z, s) 43.4k (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 (z, s, g, y, m) 912k (s, g, y, m) 6k (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (z, s, g) 87k (g, y) 10 (g, y, m) 120 (g, m) 24
  • 16. Now… back to Hive Page ‹#› © Hortonworks Inc. 2014 Apache Calcite
  • 17. CBO in Hive Why cost-based optimization? Ease of Use – Join Reordering View Chaining Ad hoc queries involving multiple views Enables BI Tools as front ends to Hive More efficient & maintainable query preparation process Laying the groundwork for deeper optimizations, e.g. materialized views Page ‹#› © Hortonworks Inc. 2014 17
  • 18. Query preparation – Hive 0.13 SQL parser Page ‹#› © Hortonworks Inc. 2014 Semantic analyzer Logical Optimizer Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Annotated AST Plan Tez Tuned Plan
  • 19. Query preparation – full CBO SQL parser Page ‹#› © Hortonworks Inc. 2014 Semantic analyzer Annotated AST Translate to algebra Physical Optimizer Abstract Syntax Tree (AST) Hive SQL Tez Tuned Plan Calcite optimizer RelNode
  • 20. Query preparation – Hive 0.14 SQL parser Page ‹#› © Hortonworks Inc. 2014 Semantic analyzer Logical Optimizer Physical Optimizer Hive SQL AST with optimized join-ordering Tez Tuned Plan Translate to algebra Calcite optimizer
  • 21. Query Execution – The basics Page ‹#› © Hortonworks Inc. 2014 © Hortonworks Inc. 2013 21 SELECT R1.x FROM R1 JOIN R2 ON R1.x = R2.x JOIN R3 on R1.x = R3.x AND R2.x = R3.x WHERE R1.z > 10; π σ ⋈ ⋈ R1 R2 R3 TS [R1] TS [R2] RS RS Shuffle Join TS [R3] Map Join Filter FS
  • 22. Calcite Planner Process Hive Op ! RelNode Hive Expr ! RexNode Logical Plan Physical traits: Table Part./Buckets; RedSink Ops removed Page ‹#› © Hortonworks Inc. 2014 Hive Plan Planner RelNode RelNode Converter Graph RexNode Converter 1. Plan Graph • Node for each node in Input Plan • Each node is a Set of alternate Sub Plans • Set further divided into Subsets: based on traits like sortedness 2. Rules • Rule: specifies a Operator sub-graph to match and logic to generate equivalent ‘better’ sub-graph. • We only have Join Reordering Rules. 3. Cost Model • RelNodes have Cost (& Cumulative Cost) • We only use Cardinality for Cost. 4. Metadata Providers - Used to Plugin Schema, Cost Formulas: Selectivity, NDV calculations etc. - We only added Selectivity and NDV formulas; Schema is only available at the Node level Rule Match Queue - Add Rule matches to Queue - Apply Rule match transformations to Plan Graph - Iterate for fixed iterations or until Cost doesn’t change. - Match importance based on Cost of RelNode and height. Best RelNode Graph AST Converter Revised AST
  • 23. Star schema Page ‹#› © Hortonworks Inc. 2014 Sales Time Inventory Product Customer Warehouse Key Fact table Dimension table Many-to-one relationship
  • 24. Query combining two stars Page ‹#› © Hortonworks Inc. 2014 Sales Time Inventory Product Customer Warehouse SELECT product.id, sum(sales.units), sum(inventory.on_hand) FROM sales ON … JOIN customer ON … JOIN time ON … JOIN product ON … JOIN inventory ON … JOIN warehouse ON … WHERE time.year = 2014 AND time.quarter = ‘Q1’ AND product.color = ‘Red’ AND warehouse.state = ‘WA’ GROUP BY …
  • 25. Left-deep tree Initial tree is “left-deep” No join node is the right child of its parent Join-ordering algorithm chooses what order to join tables – does not re-shape the tree Typical plan: • Start with largest table at bottom left • Join tables with more selective conditions first Page ‹#› © Hortonworks Inc. 2014 Warehouse Product Time Sales Customer Inventory
  • 26. Bushy tree No restrictions on where join nodes can occur “Bushes” consist of fact tables (Sales and Inventory) surrounded by many-to-one related dimension tables Dimension tables have a filtering effect This tree produces the same result as previous plan but is more efficient Page ‹#› © Hortonworks Inc. 2014 Time Sales Customer Product Inventory Warehouse
  • 27. Two approaches to join optimization Algorithm #1 “exhaustive search” Apply 3 transformation rules exhaustively: • SwapJoinRule: A join B ! B join A • PushJoinThroughJoinRule: (A join B) join C ! (A join C) join B • CommutativeJoinRule: (A join B) join C ! A join (B join C) Finds every possible plan, but not practical for more than ~8 joins Algorithm #2 “greedy” Build a graph iteratively Use heuristics to choose the “best” node to add next Applying them to Hive We can use both algorithms – and we do! Both are sensitive to bad statistics – e.g. poor estimation of intermediate result set sizes Page ‹#› © Hortonworks Inc. 2014
  • 28. Statistics Feeding the beast CBO disabled if your tables don’t have statistics • No longer require statistics on all columns, just join columns Better optimizations need ever-better statistics… so, statistics are getting better Kinds of statistics Raw statistics on stored data: row counts, number-of-distinct-values (NDV) Statistics on intermediate operators, computed using selectivity estimates • Much improved selectivity estimates this release, based on NDVs • Planned improvements to raw statistics (e.g. histograms, unique keys, sort order) will help • Materialized views Run-time statistics • Example 1: 90% of the rows in this join have the same key ! use skew join • Example 2: Only 10 distinct values of GROUP BY key ! auto-reduce parallelism Page ‹#› © Hortonworks Inc. 2014
  • 29. Stored statistics – recent improvements ANALYZE Clean up command syntax Faster computation Table vs partition statistics All statistics now stored per partition Statistics retrieval Faster retrieval Merge partition statistics Extrapolate for missing statistics Page ‹#› © Hortonworks Inc. 2014 Extrapolation SQL: SELECT productId, COUNT(*) FROM Sales WHERE year = 2014 GROUP BY productId Required statistic: NDV(productId) Statistics available 2013 Q1 2013 Q2 2013 Q3 2013 Q4 2014 Q1 2014 Q2 2014 Q3 2014 Q4 Used in query Extrapolate {Q1, Q2, Q3} stats for Q4
  • 30. Dynamic partition pruning Consider a query with a partitioned fact table, filters on the dimension table: SELECT … FROM Sales JOIN Time ON Sales.time_id = Time.time_id WHERE time.year = 2014 AND time.quarter IN (‘Q1’, ‘Q2’) At execute time, DAG figures out which partitions could possibly match, and cancels scans of the others 2013 Q1 Page ‹#› © Hortonworks Inc. 2014 2013 Q2 2013 Q3 2013 Q4 2014 Q1 2014 Q2 2014 Q3 2014 Q4 Time
  • 31. Summary Join-ordering: (exhaustive & heuristic), scalability, bushy joins Statistics – faster, better, extrapolate if stats missing Very few operators that CBO can’t handle – TABLESAMPLE, SCRIPT, multi- INSERT Dynamic partition pruning Auto-reduce parallelism Page ‹#› © Hortonworks Inc. 2014 31
  • 32. Show me the numbers… Page ‹#› © Hortonworks Inc. 2014
  • 33. TPC-DS (30TB) Q17 Joins Store Sales, Store Returns and Catalog Sales fact tables. Each of the fact tables are independently restricted by time. Analysis at Item and Store grain, so these dimensions are also joined in. As specified Query starts by joining the 3 Fact tables. Page ‹#› © Hortonworks Inc. 2014 SELECT i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as store_sales_quantitycount ,…. FROM store_sales ss ,store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item I WHERE d1.d_quarter_name = '2000Q1’ AND d1.d_date_sk = ss.ss_sold_date_sk AND i.i_item_sk = ss.ss_item_sk AND … GROUP BY i_item_id ,i_item_desc, ,s_state ORDER BY i_item_id ,i_item_desc, s_state LIMIT 100; CBO Elapsed (s) Intemediate data (GB) Off 10,683 5,017 On 1,284 275
  • 34. TPC-DS (200G) queries Query Hive 14 CBO off Page ‹#› © Hortonworks Inc. 2014 Hive 14 CBO on Gain Q15 84 44 91% Q22 123 99 24% Q29 1677 48 3,394% Q40 118 29 307% Q51 276 80 245% Q80 842 70 1,103% Q82 278 23 1,109% Q87 275 51 439% Q92 511 80 539% Q93 160 69 132% Q97 483 79 511%
  • 35. Stinger.next • SQL compliance: interval data type, non-equi joins, set operators, more sub-queries • Transactions: COMMIT, savepoint, rollback) • LLAP • Materialized views • In-memory • Automatic or manual http://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/ Page ‹#› © Hortonworks Inc. 2014
  • 36. Thank you! Upcoming talk “SQL on everything — in memory” Thursday October 16th Strata & Hadoop World Conference, NYC @julianhyde http://hive.apache.org/ http://calcite.incubator.apache.org/ Page ‹#› © Hortonworks Inc. 2014