An Overview on Optimization
in Apache Hive:
Past, Present, Future
Jesús Camacho Rodríguez
DataWorks Summit Europe
April 5, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
 Initial use case: batch processing
– Read-only data
– HiveQL (SQL-like query language)
– MapReduce
 Effort to take Hive beyond its batch processing roots
– Started in Apache Hive 0.10.0 (January 2013)
– Latest released version: Apache Hive 2.1.1 (December 2016)
 Extensive renovation to improve three different axes
– Latency: allow interactive and sub-second queries
– Scalability: from TB to PB of data
– SQL support: move from HiveQL to SQL standard
Apache Hive
 Multiple execution engines: Apache Tez and Apache Spark
 Partition pruning
 More efficient join execution algorithms
 Vectorized query execution
– Integration with columnar storage formats:
Apache ORC, Apache Parquet
 LLAP (Live Long and Process)
– Persistent daemons for low-latency queries
Important execution internals improvements
Apache Hive
 Determine the most efficient way to execute a given query
– Huge benefits for declarative query languages like SQL
– Especially if queries are generated by BI tools
 Challenging: trade-off between plan generation latency and optimality
 Helps exploit Apache Hive capabilities
– Join reordering and selection of the most efficient algorithms, static and dynamic partition pruning, column pruning, etc.
Motivation for query optimizer
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Query preparation
✗ Optimization difficult to perform over AST
– Missing opportunities
Before Apache Hive 0.14.0
SQL → SQL Parser → AST → Semantic Analyzer → Annotated AST → Logical Optimizer → Plan → Physical Optimizer → Tuned plan → Execution Engine
Query preparation
 Logical optimizer powered by Apache Calcite
– Each new version of Hive tightens this integration
• Improvements in the optimizer to be more effective and more efficient
• Shifting logic from Apache Hive to Apache Calcite: predicate pushdown, constant folding, etc.
After Apache Hive 0.14.0
SQL → SQL Parser → AST → Semantic Analyzer → Logical plan → Logical Optimizer → Optimized logical plan → Physical Optimizer → Physical plan → Execution Engine
Query optimization
Release, date, and features (not exhaustive)
Apache Hive 0.14.0 (Nov 2014)
• Cost-based ordering of joins
• Cost model improvements (expressions selectivity)
• Improvements to partition pruner
Apache Hive 1.1.0 (Mar 2015)
• Enable Apache Calcite optimization by default
• More rules (identity project removal, union merge)
Apache Hive 1.2.0 (May 2015)
• Improvements in predicate inference (leads to more pruning opportunities)
• More rules (predicate folding, transitive inference)
• Chained predicate pushdown, inference, propagation, and folding
Apache Hive 2.0.0 (*) (Feb 2016)
• Apache Calcite rewriting rules run even without stats (subset)
• Disable rules depending on input query profile
• Improvements in metadata caching
• More rules (sort and limit pushdown, aggregate pushdown)
Apache Hive 2.1.0 (Jun 2016)
• Enhancements to existing predicate pushdown, folding, and inference rules
• Disabling equivalent rules written in Apache Hive
Optimizer improvements since Apache Hive 0.14.0
(*) LLAP brings new challenges: queries might take longer to plan than to execute
Other improvements impacting query optimization
 Collection
– Faster computation
– All statistics stored per column and partition vs per table
– Automatic gathering
 Estimation
– Improved selectivity estimates, based on number of distinct values (NDVs), min, and max
 Retrieval
– Merge partition statistics
– Multiple ongoing efforts to improve metastore scalability by getting rid of ORM layer
• Use Apache HBase to store metadata
• Redesign underlying schema for metastore
– Yahoo! recorded improvements of up to 90% for some queries
Statistics
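To make the estimation bullet concrete, here is a minimal sketch of how NDVs, min, and max can drive selectivity estimates. It uses the textbook uniform-distribution formulas; `ColumnStats` and both function names are hypothetical and this is not Hive's actual estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Per-column statistics (hypothetical container, not a Hive class)."""
    ndv: int          # number of distinct values
    min_val: float
    max_val: float


def equality_selectivity(stats: ColumnStats) -> float:
    """Fraction of rows expected to match `col = constant`, assuming uniform values."""
    return 1.0 / stats.ndv if stats.ndv > 0 else 1.0


def less_than_selectivity(stats: ColumnStats, bound: float) -> float:
    """Fraction of rows expected to match `col < bound`, by linear interpolation."""
    if bound <= stats.min_val:
        return 0.0
    if bound > stats.max_val:
        return 1.0
    return (bound - stats.min_val) / (stats.max_val - stats.min_val)
```

Multiplying a table's row count by such a selectivity gives an estimated output cardinality for a Filter, which is the kind of number a cost-based optimizer needs.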
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
Logical optimization
 Top-level project since October 2015
 Query planning framework
– Relational algebra, transformation rules, cost model
– Extensible
– Usable standalone (JDBC) or embedded
 Wide adoption by the Apache community
– Apache Apex, Apache Drill, Apache Flink, Apache Hive, Apache Kylin, Apache Samza, etc.
Apache Calcite
Logical optimization
Apache Calcite – Architecture
Logical optimization
 Multi-phase optimization using Apache Calcite
– Both rule-based and cost-based phases
 More than 40 different rewriting rules
– Pushdown Filter predicates
– Pushdown Project expressions
– Infer new Filter predicates
– Infer new constant expressions
– Prune unnecessary operator columns
– Simplify expressions
– Pushdown Sort and Limit operators
– Merge operators
– Pull up constants
– …
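As an illustration of what a single rewriting rule does, the sketch below implements a "merge operators" style rule on a toy plan: two stacked Filter operators are collapsed into one. The `Scan` and `Filter` classes and `merge_filters` are hypothetical stand-ins, not Apache Calcite or Apache Hive classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Toy table scan operator."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Toy filter operator with a textual condition and one input."""
    condition: str
    child: object


def merge_filters(node):
    """Rewrite Filter(c1, Filter(c2, x)) into Filter('(c1) AND (c2)', x), bottom-up."""
    if isinstance(node, Filter):
        child = merge_filters(node.child)
        if isinstance(child, Filter):
            return Filter(f"({node.condition}) AND ({child.condition})", child.child)
        return Filter(node.condition, child)
    return node
```

A rule-based phase applies many such local transformations until none of them matches the plan any more.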
Logical optimization – Example (cost-based)
 Capable of generating bushy join operator trees
– Both inputs of a join operator can receive results from other join operators
– Well suited for parallel execution engines: increases parallelism degree, performance and cluster utilization
Join reordering
[Figure: a left-deep join tree and a bushy join tree, each built from four Join operators]
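The difference between the two tree shapes can be sketched with nested tuples. This only illustrates the shapes (it is not a cost-based join enumeration), and the table names are placeholders.

```python
def left_deep(tables):
    """Build a left-deep tree: each join takes the previous join as its left input."""
    tree = tables[0]
    for table in tables[1:]:
        tree = (tree, table)
    return tree


def bushy(tables):
    """Build a balanced bushy tree by joining the two halves of the input list."""
    if len(tables) == 1:
        return tables[0]
    mid = len(tables) // 2
    return (bushy(tables[:mid]), bushy(tables[mid:]))


def depth(tree):
    """Number of join operators on the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    return 1 + max(depth(left), depth(right))


tables = ["t1", "t2", "t3", "t4", "t5"]
print(depth(left_deep(tables)), depth(bushy(tables)))
```

Both trees contain four joins, but the bushy tree has a shorter longest path, so independent subtrees can run in parallel.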
Logical optimization – Example (cost-based)
 Combining two star schemas
– Fact tables:
• sales, inventory
– Dimension tables:
• customer, time, product, warehouse
Join reordering
[Figure: bushy join tree over sales, customer, time, product, inventory, and warehouse]
Dimension tables filter fact table records
Logical optimization – Example (rule-based)
 Predicate pushdown: push predicates as close to Scan operators as possible
 Predicate inference: infer new (possibly redundant) predicates from existing ones that help optimize the plan further
 Predicate propagation: propagate inferred values for expressions through the plan
 Predicate folding: fold constant expressions in predicates
The interaction of different rules makes it possible to generate more efficient plans
Logical optimization – Example (rule-based)
SELECT quantity FROM sales s
JOIN addr a ON s.c_id=a.c_id
WHERE (a.country='US' AND a.c_id=2452276)
OR (a.country='US' AND a.c_id=2452640)
Query logical plan
Project [quantity]
  Filter [(a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)]
    Join [s.c_id=a.c_id]
      Scan [sales]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Scan [sales]
    Filter [(a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Scan [sales]
    Filter [((a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Filter [s.c_id=2452276 OR s.c_id=2452640]
      Scan [sales]
    Filter [((a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Filter [s.c_id=2452276 OR s.c_id=2452640]
      Scan [sales]
    Filter [(('US'='US' AND a.c_id=2452276) OR ('US'='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Filter [s.c_id=2452276 OR s.c_id=2452640]
      Scan [sales]
    Filter [((true AND a.c_id=2452276) OR (true AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Filter [s.c_id=2452276 OR s.c_id=2452640]
      Scan [sales]
    Filter [(a.c_id=2452276 OR a.c_id=2452640) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
Logical optimization – Example (rule-based)
 Predicate pushdown
 Predicate inference
 Predicate propagation
 Predicate folding
Project [quantity]
  Join [s.c_id=a.c_id]
    Filter [s.c_id=2452276 OR s.c_id=2452640]
      Scan [sales]
    Filter [a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)]
      Scan [addr]
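The inference steps in this example can be sketched on a toy representation in which a disjunction is a list of sets of atomic predicate strings. Both helper functions are hypothetical and only mimic the effect of the rules; the constants are the ones from the example query.

```python
def factor_common_conjuncts(disjuncts):
    """Split (A AND B) OR (A AND C) into the common part {A} and the rest [{B}, {C}].

    Each disjunct is a set of atomic predicate strings that are ANDed together.
    """
    common = set.intersection(*disjuncts)
    residual = [d - common for d in disjuncts]
    if any(not r for r in residual):
        # One branch is just the common part, so the OR adds no further restriction.
        residual = []
    return common, residual


def transfer_across_join(predicate, from_column, to_column):
    """Restate a predicate on the other side of an equi-join (from_column = to_column)."""
    return predicate.replace(from_column, to_column)


where = [
    {"a.country='US'", "a.c_id=2452276"},
    {"a.country='US'", "a.c_id=2452640"},
]
common, residual = factor_common_conjuncts(where)
sales_side = [transfer_across_join(p, "a.c_id", "s.c_id") for r in residual for p in r]
```

Here `common` holds the inferred `a.country='US'`, `residual` holds the two `a.c_id` alternatives, and `sales_side` holds the same alternatives restated on `s.c_id` through the join condition.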
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
Physical optimization
Physical optimization
 Determine how the plan will be executed
– Algorithms to use for each of the operators
– Partitioning and sorting data between operators
 Finds additional optimization opportunities
– Removal of unnecessary stages
 Completely done in Apache Hive
– As we tighten the integration between Apache Hive and Apache Calcite, more decisions should be made in Apache Calcite
• Example: join algorithm selection
Physical optimization - Example
 Dynamic runtime filtering
– Runtime decision to cancel scan of records depending on subquery results
– Equivalent to semi-join reduction optimization
SELECT …
FROM sales JOIN time ON sales.time_id = time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2')
Physical query plan
Join [sales.time_id = time.time_id]
  Scan [sales] (scan only the subset of records with the given time_id)
  ReduceSink [key: time_id]
    Filter [year = 2014 AND quarter IN ('Q1', 'Q2')]
      Scan [time]
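The idea behind dynamic runtime filtering can be sketched in memory: evaluate the filtered dimension side first, collect its join keys, and use them to skip fact rows. The function name and the row-level, dictionary-based data are assumptions for illustration; a real engine applies this to partitions or storage blocks rather than Python lists.

```python
def dynamic_runtime_filter(fact_rows, dim_rows, dim_predicate, key):
    """Keep only the fact rows whose join key survives the dimension-side filter."""
    # Keys produced by the filtered dimension side (the `time` table in the example).
    keys = {row[key] for row in dim_rows if dim_predicate(row)}
    # Fact rows with any other key can never join, so they are skipped.
    return [row for row in fact_rows if row[key] in keys]


time_rows = [
    {"time_id": 1, "year": 2014, "quarter": "Q1"},
    {"time_id": 2, "year": 2014, "quarter": "Q3"},
    {"time_id": 3, "year": 2013, "quarter": "Q1"},
]
sales_rows = [
    {"time_id": 1, "amount": 10},
    {"time_id": 2, "amount": 20},
    {"time_id": 1, "amount": 30},
]
kept = dynamic_runtime_filter(
    sales_rows,
    time_rows,
    lambda t: t["year"] == 2014 and t["quarter"] in ("Q1", "Q2"),
    "time_id",
)
```

Only the `sales` rows with `time_id` 1 remain in `kept`, which mirrors the semi-join reduction described above.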
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
Physical optimization
Optimizer performance
Optimizer performance
Apache Hive 2.1.1 - TPC-DS (10 TB) queries
* Max execution time: 1000 s
Update on Hive performance within a few weeks: http://hortonworks.com/blog/
[Figure: execution time (s) with optimizer off vs. optimizer on, and gain (%), for queries q48, q13, q58, q21, q84, q66, q73, q20, q34, q25, q15]
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
Physical optimization
Optimizer performance
Road ahead
Work in progress and future work
 Cost model should reflect actual execution cost
 Takes into account CPU and IO cost for operators
 Dependent on underlying execution engine
– In contrast to cardinality-based model (default cost model)
– Only available for Apache Tez
 Requires ‘tuning’ parameters
– CPU atomic operation cost, local disk byte read/write cost, HDFS byte read/write cost, network byte transfer cost
 Allows join algorithm selection in Apache Calcite
– Sort-merge join, mapjoin, bucket mapjoin, sorted bucket mapjoin
Extended cost model
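A cost model of this kind can be sketched as a weighted sum of resource usage, with one tuning parameter per resource. Every number below (unit costs and per-algorithm usage) is made up for illustration, and the dictionary keys and function names are hypothetical rather than Hive configuration names.

```python
# Hypothetical tuning parameters: relative cost of one unit of each resource.
UNIT_COSTS = {
    "cpu_op": 1.0,
    "local_disk_byte": 4.0,
    "hdfs_byte": 10.0,
    "network_byte": 150.0,
}


def operator_cost(usage, unit_costs):
    """Weighted sum of the resources an operator is estimated to consume."""
    return sum(amount * unit_costs[resource] for resource, amount in usage.items())


def cheapest(candidates, unit_costs):
    """Name of the candidate algorithm with the lowest estimated cost."""
    return min(candidates, key=lambda name: operator_cost(candidates[name], unit_costs))


# Made-up resource estimates for two join algorithms on the same inputs.
candidates = {
    "mapjoin": {"cpu_op": 100, "network_byte": 1},
    "sort-merge join": {"cpu_op": 300, "network_byte": 10},
}
best = cheapest(candidates, UNIT_COSTS)
```

With these illustrative numbers the mapjoin is chosen; changing the unit costs changes the decision, which is why the parameters need tuning per cluster.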
Work in progress and future work
 A materialized view contains the results of a query
 Accelerate query execution by using automatic rewriting
– Automatically benefits from LLAP, since materialized view is represented as a table
– Possibility to use other systems to store views (federation)
 Powerful side-effects: accurate statistics for query plans
Materialized views support
[Figure: Query, LLAP, Disk]
Work in progress and future work
 Table cmv_table
id  name  unit_price  discount
1   'p1'  10.30       0.15
2   'p2'  3.14        0.35
3   'p3'  172.2       0.35
4   'p4'  978.76      0.35
 Syntax
CREATE MATERIALIZED VIEW cmv_mat_view ENABLE REWRITE
AS
SELECT id, discount
FROM cmv_table
WHERE discount > 0.2;
– ENABLE REWRITE: enable automatic rewriting using this materialized view
Materialized views support
Work in progress and future work
 Implemented as a set of rewriting rules
 Early pre-filtering of irrelevant views
 Currently supports Project, Filter, Join, Aggregate, Scan operators
SELECT id, discount
FROM cmv_table
WHERE discount > 0.3;
Materialized views support – Automatic rewriting
Materialized view cmv_mat_view plan
Project [id, discount]
  Filter [discount>0.2]
    Scan [cmv_table]
Query plan
Project [id, discount]
  Filter [discount>0.3]
    Scan [cmv_table]
Query results contained within materialized view results
Work in progress and future work
 Implemented as a set of rewriting rules
 Early pre-filtering of irrelevant views
 Currently supports Project, Filter, Join, Aggregate, Scan operators
SELECT id, discount
FROM cmv_table
WHERE discount > 0.3;
Materialized views support – Automatic rewriting
Materialized view cmv_mat_view plan
Project [id, discount]
  Filter [discount>0.2]
    Scan [cmv_table]
Query plan
Project [id, discount]
  Filter [discount>0.3]
    Scan [cmv_table]
Rewriting
Filter [discount>0.3]
  Scan [cmv_mat_view]
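The containment check behind this rewriting can be sketched for the single shape used in the example, a projection over `column > constant` on one table. `FilteredScan` and `rewrite_with_view` are hypothetical names; the actual rules work on full operator trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilteredScan:
    """SELECT columns FROM table WHERE filter_column > lower_bound (only shape handled)."""
    table: str
    columns: tuple
    filter_column: str
    lower_bound: float


def rewrite_with_view(query, view, view_name):
    """Return the query rewritten over the view, or None if containment cannot be shown."""
    same_source = query.table == view.table and query.filter_column == view.filter_column
    # The view must expose every column the query projects or filters on.
    columns_available = set(query.columns) | {query.filter_column} <= set(view.columns)
    # Rows with column > 0.3 are a subset of rows with column > 0.2.
    rows_contained = query.lower_bound >= view.lower_bound
    if not (same_source and columns_available and rows_contained):
        return None
    # Keep the query's own filter as a residual predicate on top of the view.
    return FilteredScan(view_name, query.columns, query.filter_column, query.lower_bound)


view = FilteredScan("cmv_table", ("id", "discount"), "discount", 0.2)
query = FilteredScan("cmv_table", ("id", "discount"), "discount", 0.3)
rewritten = rewrite_with_view(query, view, "cmv_mat_view")
```

The result scans `cmv_mat_view` and keeps the `discount > 0.3` filter, while a query asking for `discount > 0.1` is not rewritten because the view does not contain all of its rows.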
Work in progress and future work
 Allow partitioned materialized views
 Explore materialized view maintenance
– Incremental view update (ACID should help)
– Time window for valid automatic rewriting
Materialized views support – Upcoming features
Conclusion
 Improvements on effectiveness and efficiency over the last 3 years
– Contributions to Apache Hive and Apache Calcite
 Interesting road ahead for both communities
– Query decorrelation, materialized views, query factoring, plan caching, parameterized plans, and many more
Progress of Apache Hive optimizer
Acknowledgments
 Apache Hive and Apache Calcite communities
– Ashutosh Chauhan, Julian Hyde, Pengcheng Xiong, Vineet Garg, John Pullokkaran,
Harish Butani, Hari Sankar Sivarama Subramaniyan, Michael Mior, and many others
 Julian Hyde for sharing slides of previous optimizer talks
Thank You
http://hive.apache.org | @ApacheHive
http://calcite.apache.org | @ApacheCalcite

More Related Content

What's hot

LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
DataWorks Summit/Hadoop Summit
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
DataWorks Summit
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
 
How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDB
Hortonworks
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
DataWorks Summit
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
DataWorks Summit
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
DataWorks Summit
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
DataWorks Summit
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
DataWorks Summit/Hadoop Summit
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Hortonworks
 
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and HadoopAccelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
DataWorks Summit
 
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit/Hadoop Summit
 

What's hot (20)

LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against DisastersProtecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
 
How to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDBHow to Use Apache Zeppelin with HWX HDB
How to Use Apache Zeppelin with HWX HDB
 
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
 
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and TroubleshootingApache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
 
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and HadoopAccelerate Your Big Data Analytics Efforts with SAS and Hadoop
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
 
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 

Similar to An Overview on Optimization in Apache Hive: Past, Present Future

An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
DataWorks Summit
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
Yifeng Jiang
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
John Park
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
DataWorks Summit
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
VMware Tanzu
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
Ashish Narasimham
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
Aldrin Piri
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
Hortonworks
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA
 
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
Apache Atlas: Why Big Data Management Requires Hierarchical Taxonomies
DataWorks Summit/Hadoop Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018
Timothy Spann
 
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
The Enterprise and Connected Data, Trends in the Apache Hadoop Ecosystem by A...
Big Data Spain
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
alanfgates
 
Future of Apache Ambari
Future of Apache AmbariFuture of Apache Ambari
Future of Apache Ambari
Jayush Luniya
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
Hortonworks
 

Similar to An Overview on Optimization in Apache Hive: Past, Present Future (20)

An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
Cost-based query optimization in Apache Hive 0.14





An Overview on Optimization in Apache Hive: Past, Present, Future

  • 1. An Overview on Optimization in Apache Hive: Past, Present, Future. Jesús Camacho Rodríguez, DataWorks Summit Europe, April 5, 2017
  • 2. Apache Hive
      – Initial use case: batch processing
          • Read-only data
          • HiveQL (SQL-like query language)
          • MapReduce
      – Effort to take Hive beyond its batch processing roots
          • Started in Apache Hive 0.10.0 (January 2013)
          • Latest released version: Apache Hive 2.1.1 (December 2016)
      – Extensive renovation to improve three different axes
          • Latency: allow interactive and sub-second queries
          • Scalability: from TB to PB of data
          • SQL support: move from HiveQL to SQL standard
  • 3. Apache Hive: important execution internals improvements
      – Multiple execution engines: Apache Tez and Apache Spark
      – Partition pruning
      – More efficient join execution algorithms
      – Vectorized query execution
          • Integration with columnar storage formats: Apache ORC, Apache Parquet
      – LLAP (Live Long and Process)
          • Persistent daemons for low-latency queries
  • 4. Apache Hive: motivation for query optimizer
      – Determine the most efficient way to execute a given query
          • Huge benefits for declarative query languages like SQL
          • Especially if queries are generated by BI tools
      – Challenging: trade-off between plan generation latency and optimality
      – Helps exploit Apache Hive capabilities
          • Join reordering and selection of the most efficient algorithms, static and dynamic partition pruning, column pruning, etc.
  • 5. Agenda: An overview on optimization in Apache Hive
      – Introduction
      – Query optimizer overview
  • 6. Query preparation: before Apache Hive 0.14.0
      – ✗ Optimization difficult to perform over AST
          • Missing opportunities
      – Pipeline: SQL → SQL Parser → AST → Semantic Analyzer → Annotated AST → Logical Optimizer → Plan → Physical Optimizer → Tuned plan → Execution Engine
  • 7. Query preparation: after Apache Hive 0.14.0
      – Logical optimizer powered by Apache Calcite
          • Each new version of Hive tightens this integration
              – Improvements in the optimizer to be more effective and more efficient
              – Shifting logic from Apache Hive to Apache Calcite: predicate pushdown, constant folding, etc.
      – Pipeline: SQL → SQL Parser → AST → Semantic Analyzer → Logical plan → Logical Optimizer → Optimized logical plan → Physical Optimizer → Physical plan → Execution Engine
  • 8. Query optimization: optimizer improvements since Apache Hive 0.14.0 (release, date, features; not exhaustive)
      – Apache Hive 0.14.0 (Nov 2014)
          • Cost-based ordering of joins
          • Cost model improvements (expressions selectivity)
          • Improvements to partition pruner
      – Apache Hive 1.1.0 (Mar 2015)
          • Enable Apache Calcite optimization by default
          • More rules (identity project removal, union merge)
      – Apache Hive 1.2.0 (May 2015)
          • Improvements in predicate inference (leads to more pruning opportunities)
          • More rules (predicate folding, transitive inference)
          • Chained predicate pushdown, inference, propagation, and folding
      – Apache Hive 2.0.0 (Feb 2016) (*)
          • Apache Calcite rewriting rules run even without stats (subset)
          • Disable rules depending on input query profile
          • Improvements in metadata caching
          • More rules (sort and limit pushdown, aggregate pushdown)
      – Apache Hive 2.1.0 (Jun 2016)
          • Enhancements to existing predicate pushdown, folding, and inference rules
          • Disabling equivalent rules written in Apache Hive
      – (*) LLAP brings new challenges: queries might take longer to plan than to execute
  • 9. Other improvements impacting query optimization: statistics
      – Collection
          • Faster computation
          • All statistics stored per column and partition vs. per table
          • Automatic gathering
      – Estimation
          • Improved selectivity estimates, based on NDVs, min, max
      – Retrieval
          • Merge partition statistics
          • Multiple ongoing efforts to improve metastore scalability by getting rid of ORM layer
              – Use Apache HBase to store metadata
              – Redesign underlying schema for metastore
          • Yahoo! recorded improvements of up to 90% for some queries
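To make the estimation bullet concrete, here is a minimal sketch of how selectivity can be derived from NDV, min, and max column statistics. It uses the textbook uniform-distribution heuristics, not Hive's actual formulas; the `ColumnStats` class, the function names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column statistics, as a metastore might keep them."""
    ndv: int          # number of distinct values
    min_value: float
    max_value: float


def equality_selectivity(stats: ColumnStats) -> float:
    # Assumption: values are uniformly distributed, so `col = v`
    # keeps roughly one row out of every NDV rows.
    return 1.0 / max(stats.ndv, 1)


def less_than_selectivity(stats: ColumnStats, value: float) -> float:
    # Assumption: values are spread evenly between min and max, so
    # `col < v` keeps the fraction of the range that lies below v.
    width = stats.max_value - stats.min_value
    if width <= 0:
        return 1.0 if value > stats.min_value else 0.0
    fraction = (value - stats.min_value) / width
    return min(1.0, max(0.0, fraction))


def estimated_rows(row_count: int, selectivity: float) -> float:
    return row_count * selectivity


# Illustrative numbers only: a 1M-row table with a `year` column.
year_stats = ColumnStats(ndv=10, min_value=2005, max_value=2015)
rows_eq = estimated_rows(1_000_000, equality_selectivity(year_stats))
rows_lt = estimated_rows(1_000_000, less_than_selectivity(year_stats, 2010))
```

Estimates of this kind feed the cost model: the optimizer compares candidate plans by the row counts it expects each operator to produce.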
  • 10. Agenda: An overview on optimization in Apache Hive
      – Introduction
      – Query optimizer overview
      – Logical optimization
  • 11. Logical optimization: Apache Calcite
      – Top-level project since October 2015
      – Query planning framework
          • Relational algebra, transformation rules, cost model
          • Extensible
          • Usable standalone (JDBC) or embedded
      – Wide adoption by the Apache community
          • Apache Apex, Apache Drill, Apache Flink, Apache Hive, Apache Kylin, Apache Samza, etc.
  • 12. Logical optimization: Apache Calcite – Architecture
  • 13. Logical optimization: Apache Calcite – Architecture
  • 14. Logical optimization
      – Multi-phase optimization using Apache Calcite
          • Both rule-based and cost-based phases
      – More than 40 different rewriting rules
          • Push down Filter predicates
          • Push down Project expressions
          • Infer new Filter predicates
          • Infer new constant expressions
          • Prune unnecessary operator columns
          • Simplify expressions
          • Push down Sort and Limit operators
          • Merge operators
          • Pull up constants
          • …
  • 15. Logical optimization – Example (cost-based)
  • 16. Logical optimization – Example (cost-based): join reordering
      – Capable of generating bushy join operator trees
          • Both inputs of a join operator can receive results from other join operators
          • Well suited for parallel execution engines: increases parallelism degree, performance and cluster utilization
      – Diagram: left-deep join tree vs. bushy join tree
  • 17. Logical optimization – Example (cost-based): join reordering
      – Combining two star schemas
          • Fact tables: sales, inventory
          • Dimension tables: customer, time, product, warehouse
      – Diagram: join tree over inventory, warehouse, product, time, sales, customer
      – Dimension tables filter fact table records
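The difference between the two tree shapes can be sketched in a few lines. The example below is illustrative only: join trees are modelled as nested tuples, and the two concrete trees over the six tables named above are assumptions, since the exact shapes drawn on the slides are not recoverable from the transcript.

```python
def join_depth(tree):
    """Number of join levels on the longest path from the root to a table."""
    if isinstance(tree, str):      # a leaf is a table name
        return 0
    left, right = tree
    return 1 + max(join_depth(left), join_depth(right))


def is_bushy(tree):
    """True if at least one join receives join results on both inputs."""
    if isinstance(tree, str):
        return False
    left, right = tree
    if not isinstance(left, str) and not isinstance(right, str):
        return True
    return is_bushy(left) or is_bushy(right)


# Left-deep: every join takes one base table and the previous join's result.
left_deep = ((((("sales", "customer"), "time"), "product"),
              "inventory"), "warehouse")

# Bushy: each fact table is first joined with dimension tables, and the
# two intermediate results are combined by a final join.
bushy = ((("sales", "customer"), "time"),
         (("inventory", "warehouse"), "product"))

left_deep_depth = join_depth(left_deep)
bushy_depth = join_depth(bushy)
```

Both trees contain five joins, but the bushy one has a shorter critical path and its two subtrees are independent, which is why such plans suit parallel execution engines.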
  • 18. Logical optimization – Example (rule-based)
  • 19. Logical optimization – Example (rule-based)
      – Predicate pushdown: push predicates as close to Scan operators as possible
      – Predicate inference: infer new predicates from existing ones (possibly redundant) that help to optimize the plan further
      – Predicate propagation: propagate inferred values for expressions through the plan
      – Predicate folding: fold constant expressions in predicates
      – The interaction of different rules allows generating more efficient plans
  • 20. Logical optimization – Example (rule-based)
      – Query: SELECT quantity FROM sales s JOIN addr a ON s.c_id=a.c_id WHERE (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)
      – Query logical plan:
          • Project: quantity
          • Filter: (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)
          • Join: s.c_id=a.c_id
          • Scan: sales
          • Scan: addr
  • 21. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Scan: sales
          • Filter on addr: (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)
          • Scan: addr
  • 22. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Scan: sales
          • Filter on addr: ((a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
  • 23. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Filter on sales: s.c_id=2452276 OR s.c_id=2452640
          • Scan: sales
          • Filter on addr: ((a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
  • 24. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Filter on sales: s.c_id=2452276 OR s.c_id=2452640
          • Scan: sales
          • Filter on addr: (('US'='US' AND a.c_id=2452276) OR ('US'='US' AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
  • 25. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Filter on sales: s.c_id=2452276 OR s.c_id=2452640
          • Scan: sales
          • Filter on addr: ((true AND a.c_id=2452276) OR (true AND a.c_id=2452640)) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
  • 26. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Filter on sales: s.c_id=2452276 OR s.c_id=2452640
          • Scan: sales
          • Filter on addr: (a.c_id=2452276 OR a.c_id=2452640) AND a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
  • 27. Logical optimization – Example (rule-based)
      – Rules: predicate pushdown, predicate inference, predicate propagation, predicate folding
      – Plan:
          • Project: quantity
          • Join: s.c_id=a.c_id
          • Filter on sales: s.c_id=2452276 OR s.c_id=2452640
          • Scan: sales
          • Filter on addr: a.country='US' AND (a.c_id=2452276 OR a.c_id=2452640)
          • Scan: addr
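The rewrite traced in the rule-based example can be approximated with a small sketch. It is not how Apache Calcite implements these rules: predicates are restricted to equality atoms in disjunctive normal form, and the function names and the column mapping are hypothetical. It factors out conjuncts shared by every OR branch, then uses the join equality s.c_id=a.c_id to derive a filter for the other join input.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Atom = Tuple[str, object]        # (column, constant) meaning column = constant
Conjunction = FrozenSet[Atom]    # atoms combined with AND; empty means TRUE


def factor_common(dnf: List[Conjunction]) -> Tuple[Conjunction, List[Conjunction]]:
    """Rewrite (A AND B) OR (A AND C) as A AND (B OR C).

    Returns the atoms common to every branch and the residual disjunction.
    """
    common = frozenset.intersection(*dnf)
    residual: List[Conjunction] = []
    for conjunction in dnf:
        rest = conjunction - common
        if rest not in residual:           # drop duplicate branches
            residual.append(rest)
    return common, residual


def transfer_across_join(residual: List[Conjunction],
                         column_map: Dict[str, str]) -> Optional[List[Conjunction]]:
    """Restate a disjunction on the other join input via join equalities.

    The whole disjunction is transferable only if every column in every
    branch has an equal column on the other side; otherwise return None.
    """
    transferred = []
    for conjunction in residual:
        if any(column not in column_map for column, _ in conjunction):
            return None
        transferred.append(
            frozenset((column_map[column], value) for column, value in conjunction))
    return transferred


where_clause = [
    frozenset({("a.country", "US"), ("a.c_id", 2452276)}),
    frozenset({("a.country", "US"), ("a.c_id", 2452640)}),
]
common, residual = factor_common(where_clause)
sales_filter = transfer_across_join(residual, {"a.c_id": "s.c_id"})
```

Here `common` and `residual` together correspond to the final filter on addr, and `sales_filter` to the filter inferred for sales.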
  • 28. Agenda: An overview on optimization in Apache Hive. Introduction; Query optimizer overview; Logical optimization; Physical optimization.
  • 29. Physical optimization. Determines how the plan will be executed: the algorithms to use for each of the operators, and the partitioning and sorting of data between operators. Finds additional optimization opportunities, such as the removal of unnecessary stages. Completely done in Apache Hive. As we tighten the integration between Apache Hive and Apache Calcite, more decisions should be taken in Apache Calcite (example: join algorithm selection).
  • 30. Physical optimization – Example
  • 31. Physical optimization – Example. Dynamic runtime filtering: a runtime decision to cancel the scan of records depending on subquery results; equivalent to the semi-join reduction optimization. Query: SELECT … FROM sales JOIN time ON sales.time_id = time.time_id WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2')
  • 32. Physical optimization – Example. Dynamic runtime filtering, same query. Physical query plan operators: Join (sales.time_id = time.time_id); Scan sales; Scan time; Filter (year = 2014 AND quarter IN ('Q1', 'Q2')); ReduceSink (key: time_id).
  • 33. Physical optimization – Example. Dynamic runtime filtering, same query and physical query plan. Annotation on the plan: scan only the subset of records with the given time_id.
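The idea behind slides 31 to 33 can be sketched as follows. This is a simplified in-memory illustration of semi-join reduction, not how Hive executes it: the function name, the list-of-dicts tables, the `sale_id` column, and the assumption that `time_id` is unique in `time` are all made up for the example.

```python
def semi_join_reduce(sales, time, year, quarters):
    """Join sales with the filtered time table, skipping sales rows early."""
    # 1. Filter the (small) time table first.
    time_rows = [t for t in time
                 if t["year"] == year and t["quarter"] in quarters]
    # 2. Collect the join keys that survived the filter (the runtime filter).
    keys = {t["time_id"] for t in time_rows}
    # 3. Scan only the sales records whose key can still find a match.
    reduced_sales = [s for s in sales if s["time_id"] in keys]
    # 4. Join the reduced input (assumes time_id is unique in time).
    by_id = {t["time_id"]: t for t in time_rows}
    return [{**s, **by_id[s["time_id"]]} for s in reduced_sales]

time = [
    {"time_id": 1, "year": 2014, "quarter": "Q1"},
    {"time_id": 2, "year": 2014, "quarter": "Q3"},
    {"time_id": 3, "year": 2013, "quarter": "Q1"},
]
sales = [
    {"sale_id": 10, "time_id": 1},
    {"sale_id": 11, "time_id": 2},
    {"sale_id": 12, "time_id": 3},
    {"sale_id": 13, "time_id": 1},
]
result = semi_join_reduce(sales, time, 2014, {"Q1", "Q2"})
```

Only the sales rows with `time_id` 1 reach the join; the other two are discarded during the scan instead of being shuffled to the join and dropped there.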
  • 34. Agenda: An overview on optimization in Apache Hive. Introduction; Query optimizer overview; Logical optimization; Physical optimization; Optimizer performance.
  • 35. Optimizer performance: Apache Hive 2.1.1, TPC-DS (10 TB) queries. [Chart: execution time (s) with optimizer off vs. optimizer on, and gain (%), for queries q48, q13, q58, q21, q84, q66, q73, q20, q34, q25, q15.] * Max execution time: 1000 s. Update on Hive performance within a few weeks: http://hortonworks.com/blog/
  • 36. Agenda: An overview on optimization in Apache Hive. Introduction; Query optimizer overview; Logical optimization; Physical optimization; Optimizer performance; Road ahead.
  • 37. Work in progress and future work: Extended cost model. The cost model should reflect actual execution cost. It takes into account CPU and IO cost for operators. It is dependent on the underlying execution engine, in contrast to the cardinality-based model (the default cost model), and is only available for Apache Tez. It requires 'tuning' parameters: CPU atomic operation cost, local disk byte read/write cost, HDFS byte read/write cost, network byte transfer cost. It allows join algorithm selection in Apache Calcite: sort-merge join, mapjoin, bucket mapjoin, sorted bucket mapjoin.
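A minimal sketch of how such a cost model could weigh join algorithms is shown below. It is not Hive's cost model: the class names, the linear cost formula, and every number are assumptions for illustration. The only thing taken from the slide is the list of tuning parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostParams:
    """Tuning parameters named on the slide; the values are up to the user."""
    cpu_op: float            # cost of one atomic CPU operation
    local_disk_byte: float   # cost of one byte read/written on local disk
    hdfs_byte: float         # cost of one byte read/written on HDFS
    network_byte: float      # cost of one byte transferred over the network

@dataclass(frozen=True)
class Usage:
    """Hypothetical resource estimate for one candidate operator."""
    cpu_ops: float
    local_disk_bytes: float
    hdfs_bytes: float
    network_bytes: float

def operator_cost(usage, params):
    # Assumed linear combination of CPU and IO components.
    return (usage.cpu_ops * params.cpu_op
            + usage.local_disk_bytes * params.local_disk_byte
            + usage.hdfs_bytes * params.hdfs_byte
            + usage.network_bytes * params.network_byte)

def pick_cheapest(candidates, params):
    """Return the name of the candidate with the lowest estimated cost."""
    return min(candidates, key=lambda name: operator_cost(candidates[name], params))

params = CostParams(cpu_op=1.0, local_disk_byte=2.0, hdfs_byte=3.0, network_byte=10.0)
candidates = {
    # Made-up estimates: the mapjoin avoids shuffling the large input.
    "sort-merge join": Usage(1000.0, 0.0, 100.0, 500.0),
    "mapjoin": Usage(800.0, 0.0, 100.0, 50.0),
}
best = pick_cheapest(candidates, params)
```

With these made-up estimates the network term dominates, so the mapjoin candidate is selected; a cardinality-only model would not see that difference.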
  • 38. Work in progress and future work: Materialized views support. A materialized view contains the results of a query. Accelerates query execution by using automatic rewriting. Automatically benefits from LLAP, since a materialized view is represented as a table. Possibility to use other systems to store views (federation). Powerful side-effects: accurate statistics for query plans. [Diagram: Query, LLAP, Disk.]
  • 39. Work in progress and future work: Materialized views support. Table cmv_table (id, name, unit_price, discount): (1, 'p1', 10.30, 0.15); (2, 'p2', 3.14, 0.35); (3, 'p3', 172.2, 0.35); (4, 'p4', 978.76, 0.35). Syntax: CREATE MATERIALIZED VIEW cmv_mat_view ENABLE REWRITE AS SELECT id, discount FROM cmv_table WHERE discount > 0.2; The ENABLE REWRITE clause enables automatic rewriting using this materialized view.
  • 40. Work in progress and future work: Materialized views support – Automatic rewriting. Implemented as a set of rewriting rules. Early pre-filtering of irrelevant views. Currently supports Project, Filter, Join, Aggregate, Scan operators. Query: SELECT id, discount FROM cmv_table WHERE discount > 0.3; Materialized view cmv_mat_view plan: Project (id, discount) over Filter (discount>0.2) over Scan cmv_table. Query plan: Project (id, discount) over Filter (discount>0.3) over Scan cmv_table.
  • 41. Work in progress and future work: Materialized views support – Automatic rewriting. Same query and plans. Observation: the query results are contained within the materialized view results.
  • 42. Work in progress and future work: Materialized views support – Automatic rewriting. Same query and plans. Rewriting: Filter (discount>0.3) over Scan cmv_mat_view.
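The containment check behind slides 40 to 42 can be sketched for the very restricted case used in the example: a single table, a projection, and one `column > constant` filter. The `SimpleQuery` class and `rewrite` function are hypothetical and far simpler than the rewriting rules the slides refer to.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleQuery:
    """SELECT columns FROM table WHERE filter_column > lower_bound."""
    table: str
    columns: tuple
    filter_column: str
    lower_bound: float

def rewrite(query, view_name, view):
    """Return the query rewritten over the view, or None if not contained."""
    if query.table != view.table:
        return None                       # different source table
    if not set(query.columns) <= set(view.columns):
        return None                       # view lacks a projected column
    if query.filter_column != view.filter_column:
        return None                       # filters are not comparable here
    if query.filter_column not in view.columns:
        return None                       # cannot re-apply the filter on the view
    if query.lower_bound < view.lower_bound:
        return None                       # view is missing rows the query needs
    # Scan the view instead of the table and keep the query's stricter filter.
    return SimpleQuery(view_name, query.columns,
                       query.filter_column, query.lower_bound)

view = SimpleQuery("cmv_table", ("id", "discount"), "discount", 0.2)
query = SimpleQuery("cmv_table", ("id", "discount"), "discount", 0.3)
rewritten = rewrite(query, "cmv_mat_view", view)
```

Because every row with `discount > 0.3` also satisfies `discount > 0.2`, the query is answered by scanning `cmv_mat_view` and re-applying the stricter filter, as on slide 42.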
  • 43. Work in progress and future work: Materialized views support – Upcoming features. Allow partitioned materialized views. Explore materialized view maintenance: incremental view update (ACID should help), and a time window for valid automatic rewriting.
  • 44. Conclusion: Progress of Apache Hive optimizer. Improvements in effectiveness and efficiency over the last 3 years, with contributions to Apache Hive and Apache Calcite. Interesting road ahead for both communities: query decorrelation, materialized views, query factoring, plan caching, parameterized plans, and many more.
  • 45. Acknowledgments. Apache Hive and Apache Calcite communities: Ashutosh Chauhan, Julian Hyde, Pengcheng Xiong, Vineet Garg, John Pullokkaran, Harish Butani, Hari Sankar Sivarama Subramaniyan, Michael Mior, and many others. Julian Hyde for sharing slides of previous optimizer talks.
  • 46. Thank You http://hive.apache.org | @ApacheHive http://calcite.apache.org | @ApacheCalcite

Editor's Notes

  1. ORM (object relational mapping)
  2. ORM (object relational mapping). Combines exhaustive search with a greedy algorithm. Exhaustive search finds every possible plan using rewriting rules; it is not practical for a large number of joins. The greedy algorithm builds the plan iteratively, using a heuristic to choose the best join to add next.
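The greedy half of the strategy described in note 2 can be sketched as below. This is an illustration only, not the algorithm used by Apache Hive or Apache Calcite: the size estimate (product of row counts and pairwise selectivities) and the heuristic (always add the table that yields the smallest intermediate result) are assumptions, as are all table sizes and selectivities.

```python
from itertools import combinations

def greedy_join_order(rows, selectivity):
    """Build a join order iteratively; needs at least two tables.

    rows: table name -> estimated row count
    selectivity: frozenset({a, b}) -> join selectivity (missing = cross product)
    """
    def joined_size(size, joined, table):
        # Estimated size after joining `table` to the tables already joined.
        result = size * rows[table]
        for other in joined:
            result *= selectivity.get(frozenset((other, table)), 1.0)
        return result

    # Start with the pair that produces the smallest estimated result.
    first, second = min(combinations(sorted(rows), 2),
                        key=lambda p: joined_size(rows[p[0]], [p[0]], p[1]))
    order = [first, second]
    size = joined_size(rows[first], [first], second)
    remaining = set(rows) - {first, second}
    # Greedy step: add the table that keeps the intermediate result smallest.
    while remaining:
        nxt = min(sorted(remaining), key=lambda t: joined_size(size, order, t))
        size = joined_size(size, order, nxt)
        order.append(nxt)
        remaining.remove(nxt)
    return order, size

rows = {"sales": 1024, "time": 8, "addr": 64}
selectivity = {
    frozenset(("sales", "time")): 0.03125,
    frozenset(("sales", "addr")): 0.03125,
}
order, size = greedy_join_order(rows, selectivity)
```

Each greedy step costs time linear in the number of remaining tables, which is why it stays practical when an exhaustive enumeration of all join orders does not.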