SlideShare a Scribd company logo
An Overview on Optimization
in Apache Hive:
Past, Present, Future
Ashutosh Chauhan
Jesús Camacho Rodríguez
DataWorks Summit San Jose
June 4, 2017
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
 Initial use case: batch processing
– Read-only data
– HiveQL (SQL-like query language)
– MapReduce
 Effort to take Hive beyond its batch processing roots
– Started in Apache Hive 0.10.0 (January 2013)
– Latest released version: Apache Hive 2.1.1 (December 2016)
 Extensive renovation to improve three different axes
– Latency: allow interactive and sub-second queries
– Scalability: from TB to PB of data
– SQL support: move from HiveQL to SQL standard
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
 Multiple execution engines: Apache Tez and Apache Spark
 More efficient join execution algorithms
 Vectorized query execution
– Integration with columnar storage formats:
Apache ORC, Apache Parquet
 LLAP (Live Long and Process)
– Persistent deamons for low-latency queries
 Semi join reduction
Important execution internals improvements
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
 Determine the most efficient way to execute a given query
– Huge benefits for declarative query languages like SQL
– Especially if queries are generated by BI tools
 Challenging: trade-off between plan generation latency and optimality
 Helps exploiting Apache Hive capabilities
– Join reordering and most efficient algorithms selection,
static and dynamic partition pruning, column pruning, etc.
Motivation for query optimizer
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query preparation
✗ Optimization difficult to perform over AST
– Missing opportunities
Before Apache Hive 0.14.0
SQL Parser
Semantic
Analyzer
Logical
Optimizer
Physical
Optimizer
Execution
Engine
AST
Annotated
ASTSQL Plan
Tuned
plan
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query preparation
 Logical optimizer powered by Apache Calcite
– Each new version of Hive tightens this integration
• Improvements in the optimizer to be more effective and more efficient
• Shifting logic from Apache Hive to Apache Calcite: predicate pushdown, constant folding, etc.
After Apache Hive 0.14.0
SQL Parser
Semantic
Analyzer
Logical
Optimizer
Physical
Optimizer
Execution
Engine
Logical
plan
Physical
plan
Optimized
logical planASTSQL
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Query optimization
Release Date Features (not exhaustive)
Apache Hive 0.14.0 Nov 2014 • Cost-based ordering of joins
• Cost model improvements (expressions selectivity)
• Improvements to partition pruner
Apache Hive 1.1.0 Mar 2015 • Enable Apache Calcite optimization by default
• More rules (identity project removal, union merge)
Apache Hive 1.2.0 May 2015 • Improvements in predicate inference (leads to more pruning oportunities)
• More rules (predicate folding, transitive inference)
• Chained predicate pushdown, inference, propagation, and folding
Apache Hive 2.0.0 (*) Feb 2016 • Apache Calcite rewriting rules run even without stats (subset)
• Disable rules depending on input query profile
• Improvements in metadata caching
• More rules (sort and limit pushdown, aggregate pushdown)
Apache Hive 2.1.0 Jun 2016 • Enhancement to predicate pushdown, folding and inference existing rules
• Disabling equivalent rules written in Apache Hive
Optimizer improvements since Apache Hive 0.14.0
(*) Introduction of Llap
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rules aren’t only consideration
 Collection
– Faster computation
– All statistics stored per column and partition vs per table
– Automatic gathering
 Estimation
– Improved selectivity on estimates, based on NDVs, min, max
 Retrieval
– Merge partition statistics
– Multiple ongoing efforts to improve metastore scalability by getting rid of ORM layer
• Use Apache Hbase to store metadata
• Caching in metastore process
• Redesign underlying schema for metastore
– Yahoo! recorded improvements of up to 90% for some queries
Statistics
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization
Apache Calcite – Architecture
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization
Apache Calcite – Architecture
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization
 Multi-phase optimization using Apache Calcite
– Both rule-based and cost-based phases
 More than 40 different rewriting rules
– Pushdown Filter predicates
– Pushdown Project expressions
– Infer new Filter predicates
– Infer new constant expressions
– Prune unnecesary operator columns
– Simplify expressions
– Pushdown Sort and Limit operators
– Merge operators
– Pull up constants
– …
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (cost-based)
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (cost-based)
 Capable of generating bushy join operator trees
– Both inputs of a join operator can receive results from other join operators
– Well suited for parallel execution engines: increases parallelism degree, performance and cluster
utilization
Join reordering
Join
Join
Join
Join
Left-deep join tree
Join
Join
Join
Bushy join tree
Join
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (cost-based)
 Combining two star schemas
– Fact tables:
• sales, inventory
– Dimension tables:
• customer, time, product, warehouse
Join reordering
inventory warehouse
productJoin
Join
Join Join
Join time
sales customer
Dimension tables filter
fact tables records
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (rule-based)
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (rule-based)
 Predicate pushdown: push predicates as close to Scan operators as possible
 Predicate inference: infer new predicates from existant ones (maybe redundant) that
help to optimize the plan further
 Predicate propagation: propagate inferred values for expressions through the plan
 Predicate folding: fold constant expressions in predicates
Interaction of different rules allows to generate more efficient plans
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (rule-based)
SELECT quantity FROM sales s
JOIN addr a ON s.c_id=a.c_id
WHERE (a.country='US' AND a.c_id=2452276)
OR (a.country='US' AND a.c_id=2452640)
Join
Query logical plan
Project
Filter
sales addr
s.c_id=a.c_id
(a.country='US' AND a.c_id=2452276) OR
(a.country='US' AND a.c_id=2452640)
Scan Scan
quantity
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (rule-based)
Join
Project
sales addr
s.c_id=a.c_id
Scan Scan
quantity
 Predicate pushdown
 Predicate inference
 Predicate propagation
Filter (a.country='US' AND a.c_id=2452276) OR
(a.country='US' AND a.c_id=2452640)
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Logical optimization – Example (rule-based)
Join
Project
sales addr
s.c_id=a.c_id
Scan Scan
quantity
 Predicate pushdown
 Predicate Simplification
 Predicate propagation
Filter a.country='US’
AND (a.c_id=2452276 OR a.c_id=2452640)
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Project
Logical optimization – Example (rule-based)
Join
sales addr
s.c_id=a.c_id
Scan Scan
quantity
 Predicate pushdown
 Predicate inference
 Predicate propagation
Filter a.country='US’
AND (a.c_id=2452276 OR a.c_id=2452640)Filters.c_id=2452276 OR
s.c_id=2452276
SELECT quantity FROM sales s
JOIN addr a ON s.c_id=a.c_id
WHERE (a.country='US' AND a.c_id=2452276)
OR (a.country='US' AND a.c_id=2452640)
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Optimizer performance
Apache Hive 2.1.1 - TPCDS (10TB) queries
* Max execution time: 1000 s
Update on Hive performance within a few weeks: http://hortonworks.com/blog/
0
20
40
60
80
100
0
200
400
600
800
1000
1200
q48 q13 q58 q21 q84 q66 q73 q20 q34 q25 q15
Gain(%)
Executiontime(s)
Optimizer off Optimizer on
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
An overview on optimization in Apache Hive
Introduction - Past
Query optimizer overview – Past
Logical optimization - Past
Logical Rewrites - Present
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rules are not only for optimization
 Extended support for subqueries in filter (where/having) and select.
 Subqueries with
– conjunction/disjunction,
– as scalar subqueries
– in select.
 Leverages calcite to rewrite subqueries.
 Select * from src where src.key in (select key from src s1 where s1.key > '9')
– Will be rewritten into Left semi join between outer src and inner src s1.
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Rules are not only for optimization - II
 Extended support for
– Union Distinct
– Intersect
– Except
 Leverages calcite to rewrite into existing Hive operators.
 Select * from tb1 intersect select * from tb2;
 Select * from tb1 except select * from tb2;
 Select * from tb1 distinct select * from tb2;
 Will be rewritten using join, group by, union, UDTF etc.
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
An overview on optimization in Apache Hive
Introduction - Past
Query optimizer overview – Past
Logical optimization – Past
Logical Rewrites - Present
Logical Optimization - Present
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Top 10 users that have added more characters
from beginning of 2010 until the end of 2011
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Top 10 users that have added more characters
from beginning of 2010 until the end of 2011
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
Possible to express filters
on time dimension using
SQL standard functions
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Initially:
– Scan is executed in Druid (select query)
– Rest of the query is executed in Hive
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Rewriting
rule
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
select
Rewriting
rule
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
groupBy
Rewriting
rule
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Druid query recognition (powered by Apache Calcite)
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hive
Druid query
groupBy
Rewriting
rule
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
 Rewriting rules push computation into Druid
– Need to check that operator meets some
pre-conditions before pushing it to Druid
Druid query recognition (powered by Apache Calcite)
Apache Hive - SQL query
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
{
"queryType": "groupBy",
"dataSource": "users_index",
"granularity": "all",
"dimension": "user",
"aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
"limitSpec": {
"limit": 10,
"columns": [ {"dimension": "s", "direction": "descending" } ]
},
"intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Physical plan transformation
Apache Hive
Druid query
groupBy
Query logical plan
Druid Scan
Project
Aggregate
Sort Limit
Sink
Filter
Select
File SinkFile Sink
Table Scan
Query physical plan
Druid JSON query
Table Scan uses
Druid Input Format
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Union Remove Rewrite
select key, c, m from (
– (select key, count(key) as c, 1 as m from src group by key) s1
– union all
– (select key, count(key) as c, 4 as m from src group by key) s4
– ) where m = 1;
select key, c, 1 as m from (select key, count(key) as c from src group by key;
Union All
Filter(c=1)
TS0 TS1
Filter(c=1) Filter(c=2)
TS0
Filter(c=1)
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Count Distinct rewrite
 Select sum(c1), count(distinct c1) from t1;
 Select sum(s1), count(c1) from (select sum(c1) as s1, c1 from t1 group by c1) t2;
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Shared work optimizer
 Queries might scan same table more than once
– Correlated subqueries, self-joins, union queries, etc.
– Common pattern in OLAP workloads
 Scanning same table only once can reduce execution time remarkably
– Specially for I/O bounded queries
 In addition reuses similar selection predicates and other operators
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Left outer join
Shared scan optimizer
SELECT e1.employee_id, e1.first_name, e1.last_name,
e1.mgr_id, e2.first_name as mgr_first_name, e2.last_name as mgr_last_name
FROM employees e1 LEFT OUTER JOIN employees e2
ON e1.mgr_id = e2.employee_id;
Find all employees and their managers
Left outer join
employees
e1.mgr_id =
e2.employee_id
Scan employees
ReduceSink key:
employee_id
Scan
ReduceSinkkey:
mgr_id
employees
ReduceSink key:
employee_id
Scan
ReduceSinkkey:
mgr_id
e1.mgr_id =
e2.employee_id
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
An overview on optimization in Apache Hive
Introduction
Query optimizer overview
Logical optimization
Logical Rewrites - Present
Logical Optimization – Present
Road Ahead - Future
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Work in progress
 A materialized view contains the results of a query
 Accelerate query execution by using automatic rewriting
– Automatically benefits from LLAP, since materialized view is represented as a table
– Possibility to use other systems to store views (federation)
 Powerful side-effects: accurate statistics for query plans
Materialized views support
Query
LLAP
Disk
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Work in progress
 Table cmv_table
 Syntax
CREATE MATERIALIZED VIEW cmv_mat_view ENABLE REWRITE
AS
SELECT avg(unit_price), discount
FROM cmv_table GROUP BY discount
WHERE discount > 0.2;
Materialized views support
id name unit_price discount
1 'p1' 10.30 0.15
2 'p2' 3.14 0.35
3 'p3' 172.2 0.35
4 'p4' 978.76 0.35
Enable automatic rewriting
using this materialized view
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Work in progress
 Implemented as a set of rewriting rules
 Early pre-filtering of irrelevant views
 Currently supports Project, Filter, Join, Aggregate, Scan operators
SELECT avg(unit_price), discount
FROM cmv_table GROUP BY discount
WHERE discount > 0.3;
Materialized views support – Automatic rewriting
Rewrite using
Materialized view
Filter
Scancmv_mat_view
discount>0.2
Group By
Filter
Scancmv_table
discount>0.3
Avg(price)
Query plan
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Future
 Collecting column stats automatically
 Making better ndv estimates
 Performance of compiler
 Exploiting constraints for optimization
 Improving stats estimation and propagation in tree
 Improving cost model
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Conclusion
 Improvements on effectiveness and efficiency over the last 3 years
– Contributions to Apache Hive and Apache Calcite
 Interesting road ahead for both communities
Progress of Apache Hive optimizer
Thank You
http://hive.apache.org | @ApacheHive
http://calcite.apache.org | @ApacheCalcite
Slides contributed by:
Vineet Garg
Pengcheng Xiong
Jesus Camacho Rodriguez
Julian Hyde
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Future Work
 Cost model should reflect actual execution cost
 Takes into account CPU and IO cost for operators
 Dependent on underlying execution engine
– In contrast to cardinality-based model (default cost model)
– Only available for Apache Tez
 Requires ‘tuning’ parameters
– CPU atomic operation cost, local disk byte read/write cost, HDFS byte read/write cost, network
byte transfer cost
 Allows join algorithm selection in Apache Calcite
– Sort-merge join, mapjoin, bucket mapjoin, sorted bucket mapjoin
Extended cost model
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Physical optimization
 Determine how the plan will be executed
– Algorithms to use for each of the operators
– Partitioning and sorting data between operators
 Finds additional optimization opportunities
– Removal of unnecesary stages
 Completely done in Apache Hive
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Physical optimization – Example
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Physical optimization - Example
 Dynamic runtime filtering
– Runtime decision to cancel scan of records depending on subquery results
– Simplified semi-join reduction optimization
SELECT …
FROM sales JOIN time ON sales.time_id = time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’)
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Physical optimization - Example
 Dynamic runtime filtering
– Runtime decision to cancel scan of records depending on subquery results
– Equivalent to semi-join reduction optimization
SELECT …
FROM sales JOIN time ON sales.time_id = time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’)
Join
sales
sales.time_id = time.time_id
Physical query plan
Scan
Filter
time
year = 2014 AND
quarter IN ('Q1', 'Q2')
ReduceSinkScan key: time_id
54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Physical optimization - Example
 Dynamic runtime filtering
– Runtime decision to cancel scan of records depending on subquery results
– Equivalent to semi-join reduction optimization
SELECT …
FROM sales JOIN time ON sales.time_id = time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’)
Join
sales
sales.time_id = time.time_id
Physical query plan
Scan
Filter
time
year = 2014 AND
quarter IN ('Q1', 'Q2')
ReduceSinkScan
Scan only subset of records
with given time_id
key: time_id

More Related Content

What's hot

Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
DataWorks Summit/Hadoop Summit
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
DataWorks Summit
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
DataWorks Summit
 
Next Generation Execution for Apache Storm
Next Generation Execution for Apache StormNext Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
DataWorks Summit
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
DataWorks Summit/Hadoop Summit
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
DataWorks Summit
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
DataWorks Summit
 
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
 
Spark Security
Spark SecuritySpark Security
Spark Security
Yifeng Jiang
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
DataWorks Summit
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit/Hadoop Summit
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
DataWorks Summit
 
Creating the Internet of Your Things
Creating the Internet of Your ThingsCreating the Internet of Your Things
Creating the Internet of Your Things
DataWorks Summit/Hadoop Summit
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
DataWorks Summit
 
Connecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFiConnecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFi
DataWorks Summit
 

What's hot (20)

Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the EnterpriseEnabling Apache Zeppelin and Spark for Data Science in the Enterprise
Enabling Apache Zeppelin and Spark for Data Science in the Enterprise
 
The Future of Apache Ambari
The Future of Apache AmbariThe Future of Apache Ambari
The Future of Apache Ambari
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
 
Next Generation Execution for Apache Storm
Next Generation Execution for Apache StormNext Generation Execution for Apache Storm
Next Generation Execution for Apache Storm
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
An Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present FutureAn Overview on Optimization in Apache Hive: Past, Present Future
An Overview on Optimization in Apache Hive: Past, Present Future
 
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and ParquetBig Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
 
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
Real-Time Ingesting and Transforming Sensor Data and Social Data with NiFi an...
 
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 
Creating the Internet of Your Things
Creating the Internet of Your ThingsCreating the Internet of Your Things
Creating the Internet of Your Things
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
 
Connecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFiConnecting the Drops with Apache NiFi & Apache MiNiFi
Connecting the Drops with Apache NiFi & Apache MiNiFi
 

Similar to An Overview on Optimization in Apache Hive: Past, Present, Future

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde
 
Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache Ambari
Jayush Luniya
 
Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache Ambari
Hortonworks
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
alanfgates
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
Yifeng Jiang
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
VMware Tanzu
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
DataWorks Summit
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
Hortonworks
 
Future of Apache Ambari
Future of Apache AmbariFuture of Apache Ambari
Future of Apache Ambari
Jayush Luniya
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
alanfgates
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
John Park
 
Manage Add-on Services in Apache Ambari
Manage Add-on Services in Apache AmbariManage Add-on Services in Apache Ambari
Manage Add-on Services in Apache Ambari
Jayush Luniya
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
Ashish Narasimham
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
DataWorks Summit
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
Aldrin Piri
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
Alberto Romero
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
HortonworksJapan
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
Chris Nauroth
 

Similar to An Overview on Optimization in Apache Hive: Past, Present, Future (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache Ambari
 
Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache Ambari
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0Meet HBase 2.0 and Phoenix 5.0
Meet HBase 2.0 and Phoenix 5.0
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Future of Apache Ambari
Future of Apache AmbariFuture of Apache Ambari
Future of Apache Ambari
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
Manage Add-on Services in Apache Ambari
Manage Add-on Services in Apache AmbariManage Add-on Services in Apache Ambari
Manage Add-on Services in Apache Ambari
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 

Recently uploaded (20)

Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 

An Overview on Optimization in Apache Hive: Past, Present, Future

  • 1. An Overview on Optimization in Apache Hive: Past, Present, Future Ashutosh Chauhan Jesús Camacho Rodríguez DataWorks Summit San Jose June 4, 2017
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive  Initial use case: batch processing – Read-only data – HiveQL (SQL-like query language) – MapReduce  Effort to take Hive beyond its batch processing roots – Started in Apache Hive 0.10.0 (January 2013) – Latest released version: Apache Hive 2.1.1 (December 2016)  Extensive renovation to improve three different axes – Latency: allow interactive and sub-second queries – Scalability: from TB to PB of data – SQL support: move from HiveQL to SQL standard
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive  Multiple execution engines: Apache Tez and Apache Spark  More efficient join execution algorithms  Vectorized query execution – Integration with columnar storage formats: Apache ORC, Apache Parquet  LLAP (Live Long and Process) – Persistent deamons for low-latency queries  Semi join reduction Important execution internals improvements
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive  Determine the most efficient way to execute a given query – Huge benefits for declarative query languages like SQL – Especially if queries are generated by BI tools  Challenging: trade-off between plan generation latency and optimality  Helps exploiting Apache Hive capabilities – Join reordering and most efficient algorithms selection, static and dynamic partition pruning, column pruning, etc. Motivation for query optimizer
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda An overview on optimization in Apache Hive Introduction Query optimizer overview
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Query preparation ✗ Optimization difficult to perform over AST – Missing opportunities Before Apache Hive 0.14.0 SQL Parser Semantic Analyzer Logical Optimizer Physical Optimizer Execution Engine AST Annotated ASTSQL Plan Tuned plan
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Query preparation  Logical optimizer powered by Apache Calcite – Each new version of Hive tightens this integration • Improvements in the optimizer to be more effective and more efficient • Shifting logic from Apache Hive to Apache Calcite: predicate pushdown, constant folding, etc. After Apache Hive 0.14.0 SQL Parser Semantic Analyzer Logical Optimizer Physical Optimizer Execution Engine Logical plan Physical plan Optimized logical planASTSQL
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Query optimization Release Date Features (not exhaustive) Apache Hive 0.14.0 Nov 2014 • Cost-based ordering of joins • Cost model improvements (expressions selectivity) • Improvements to partition pruner Apache Hive 1.1.0 Mar 2015 • Enable Apache Calcite optimization by default • More rules (identity project removal, union merge) Apache Hive 1.2.0 May 2015 • Improvements in predicate inference (leads to more pruning oportunities) • More rules (predicate folding, transitive inference) • Chained predicate pushdown, inference, propagation, and folding Apache Hive 2.0.0 (*) Feb 2016 • Apache Calcite rewriting rules run even without stats (subset) • Disable rules depending on input query profile • Improvements in metadata caching • More rules (sort and limit pushdown, aggregate pushdown) Apache Hive 2.1.0 Jun 2016 • Enhancement to predicate pushdown, folding and inference existing rules • Disabling equivalent rules written in Apache Hive Optimizer improvements since Apache Hive 0.14.0 (*) Introduction of Llap
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rules aren’t only consideration  Collection – Faster computation – All statistics stored per column and partition vs per table – Automatic gathering  Estimation – Improved selectivity on estimates, based on NDVs, min, max  Retrieval – Merge partition statistics – Multiple ongoing efforts to improve metastore scalability by getting rid of ORM layer • Use Apache Hbase to store metadata • Caching in metastore process • Redesign underlying schema for metastore – Yahoo! recorded improvements of up to 90% for some queries Statistics
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda An overview on optimization in Apache Hive Introduction Query optimizer overview Logical optimization
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization Apache Calcite – Architecture
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization Apache Calcite – Architecture
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization  Multi-phase optimization using Apache Calcite – Both rule-based and cost-based phases  More than 40 different rewriting rules – Pushdown Filter predicates – Pushdown Project expressions – Infer new Filter predicates – Infer new constant expressions – Prune unnecesary operator columns – Simplify expressions – Pushdown Sort and Limit operators – Merge operators – Pull up constants – …
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (cost-based)
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (cost-based)  Capable of generating bushy join operator trees – Both inputs of a join operator can receive results from other join operators – Well suited for parallel execution engines: increases parallelism degree, performance and cluster utilization Join reordering Join Join Join Join Left-deep join tree Join Join Join Bushy join tree Join
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (cost-based)  Combining two star schemas – Fact tables: • sales, inventory – Dimension tables: • customer, time, product, warehouse Join reordering inventory warehouse productJoin Join Join Join Join time sales customer Dimension tables filter fact tables records
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (rule-based)
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (rule-based)  Predicate pushdown: push predicates as close to Scan operators as possible  Predicate inference: infer new predicates from existant ones (maybe redundant) that help to optimize the plan further  Predicate propagation: propagate inferred values for expressions through the plan  Predicate folding: fold constant expressions in predicates Interaction of different rules allows to generate more efficient plans
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (rule-based) SELECT quantity FROM sales s JOIN addr a ON s.c_id=a.c_id WHERE (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640) Join Query logical plan Project Filter sales addr s.c_id=a.c_id (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640) Scan Scan quantity
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (rule-based) Join Project sales addr s.c_id=a.c_id Scan Scan quantity  Predicate pushdown  Predicate inference  Predicate propagation Filter (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Logical optimization – Example (rule-based) Join Project sales addr s.c_id=a.c_id Scan Scan quantity  Predicate pushdown  Predicate Simplification  Predicate propagation Filter a.country='US’ AND (a.c_id=2452276 OR a.c_id=2452640)
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Project Logical optimization – Example (rule-based) Join sales addr s.c_id=a.c_id Scan Scan quantity  Predicate pushdown  Predicate inference  Predicate propagation Filter a.country='US’ AND (a.c_id=2452276 OR a.c_id=2452640)Filters.c_id=2452276 OR s.c_id=2452276 SELECT quantity FROM sales s JOIN addr a ON s.c_id=a.c_id WHERE (a.country='US' AND a.c_id=2452276) OR (a.country='US' AND a.c_id=2452640)
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Optimizer performance Apache Hive 2.1.1 - TPCDS (10TB) queries * Max execution time: 1000 s Update on Hive performance within a few weeks: http://hortonworks.com/blog/ 0 20 40 60 80 100 0 200 400 600 800 1000 1200 q48 q13 q58 q21 q84 q66 q73 q20 q34 q25 q15 Gain(%) Executiontime(s) Optimizer off Optimizer on
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda An overview on optimization in Apache Hive Introduction - Past Query optimizer overview – Past Logical optimization - Past Logical Rewrites - Present
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rules are not only for optimization  Extended support for subqueries in filter (where/having) and select.  Subqueries with – conjunction/disjunction, – as scalar subqueries – in select.  Leverages calcite to rewrite subqueries.  Select * from src where src.key in (select key from src s1 where s1.key > '9') – Will be rewritten into Left semi join between outer src and inner src s1.
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Rules are not only for optimization - II  Extended support for – Union Distinct – Intersect – Except  Leverages calcite to rewrite into existing Hive operators.  Select * from tb1 intersect select * from tb2;  Select * from tb1 except select * from tb2;  Select * from tb1 distinct select * from tb2;  Will be rewritten using join, group by, union, UDTF etc.
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda An overview on optimization in Apache Hive Introduction - Past Query optimizer overview – Past Logical optimization – Past Logical Rewrites - Present Logical Optimization - Present
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Top 10 users that have added more characters from beginning of 2010 until the end of 2011 Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Top 10 users that have added more characters from beginning of 2010 until the end of 2011 Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter Possible to express filters on time dimension using SQL standard functions
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Initially: – Scan is executed in Druid (select query) – Rest of the query is executed in Hive Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Rewriting rule Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query select Rewriting rule Druid query recognition (powered by Apache Calcite) SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query groupBy Rewriting rule SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Druid query recognition (powered by Apache Calcite) Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Hive Druid query groupBy Rewriting rule SELECT `user`, sum(`c_added`) AS s FROM druid_table_1 WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user` ORDER BY s DESC LIMIT 10;  Rewriting rules push computation into Druid – Need to check that operator meets some pre-conditions before pushing it to Druid Druid query recognition (powered by Apache Calcite) Apache Hive - SQL query Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved { "queryType": "groupBy", "dataSource": "users_index", "granularity": "all", "dimension": "user", "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ], "limitSpec": { "limit": 10, "columns": [ {"dimension": "s", "direction": "descending" } ] }, "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ] } Physical plan transformation Apache Hive Druid query groupBy Query logical plan Druid Scan Project Aggregate Sort Limit Sink Filter Select File SinkFile Sink Table Scan Query physical plan Druid JSON query Table Scan uses Druid Input Format
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Union Remove Rewrite select key, c, m from ( – (select key, count(key) as c, 1 as m from src group by key) s1 – union all – (select key, count(key) as c, 4 as m from src group by key) s4 – ) where m = 1; select key, c, 1 as m from (select key, count(key) as c from src group by key; Union All Filter(c=1) TS0 TS1 Filter(c=1) Filter(c=2) TS0 Filter(c=1)
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Count Distinct rewrite  Select sum(c1), count(distinct c1) from t1;  Select sum(s1), count(c1) from (select sum(c1) as s1, c1 from t1 group by c1) t2;
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Shared work optimizer  Queries might scan same table more than once – Correlated subqueries, self-joins, union queries, etc. – Common pattern in OLAP workloads  Scanning same table only once can reduce execution time remarkably – Specially for I/O bounded queries  In addition reuses similar selection predicates and other operators
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Left outer join Shared scan optimizer SELECT e1.employee_id, e1.first_name, e1.last_name, e1.mgr_id, e2.first_name as mgr_first_name, e2.last_name as mgr_last_name FROM employees e1 LEFT OUTER JOIN employees e2 ON e1.mgr_id = e2.employee_id; Find all employees and their managers Left outer join employees e1.mgr_id = e2.employee_id Scan employees ReduceSink key: employee_id Scan ReduceSinkkey: mgr_id employees ReduceSink key: employee_id Scan ReduceSinkkey: mgr_id e1.mgr_id = e2.employee_id
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda An overview on optimization in Apache Hive Introduction Query optimizer overview Logical optimization Logical Rewrites - Present Logical Optimization – Present Road Ahead - Future
  • 42. 42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Work in progress  A materialized view contains the results of a query  Accelerate query execution by using automatic rewriting – Automatically benefits from LLAP, since materialized view is represented as a table – Possibility to use other systems to store views (federation)  Powerful side-effects: accurate statistics for query plans Materialized views support Query LLAP Disk
  • 43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Work in progress  Table cmv_table  Syntax CREATE MATERIALIZED VIEW cmv_mat_view ENABLE REWRITE AS SELECT avg(unit_price), discount FROM cmv_table GROUP BY discount WHERE discount > 0.2; Materialized views support id name unit_price discount 1 'p1' 10.30 0.15 2 'p2' 3.14 0.35 3 'p3' 172.2 0.35 4 'p4' 978.76 0.35 Enable automatic rewriting using this materialized view
  • 44. 44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Work in progress  Implemented as a set of rewriting rules  Early pre-filtering of irrelevant views  Currently supports Project, Filter, Join, Aggregate, Scan operators SELECT avg(unit_price), discount FROM cmv_table GROUP BY discount WHERE discount > 0.3; Materialized views support – Automatic rewriting Rewrite using Materialized view Filter Scancmv_mat_view discount>0.2 Group By Filter Scancmv_table discount>0.3 Avg(price) Query plan
  • 45. 45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future  Collecting column stats automatically  Making better ndv estimates  Performance of compiler  Exploiting constraints for optimization  Improving stats estimation and propagation in tree  Improving cost model
  • 46. 46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Conclusion  Improvements on effectiveness and efficiency over the last 3 years – Contributions to Apache Hive and Apache Calcite  Interesting road ahead for both communities Progress of Apache Hive optimizer
  • 47. Thank You http://hive.apache.org | @ApacheHive http://calcite.apache.org | @ApacheCalcite Slides contributed by: Vineet Garg Pengcheng Xiong Jesus Camacho Rodriguez Julian Hyde
  • 48. 49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Future Work  Cost model should reflect actual execution cost  Takes into account CPU and IO cost for operators  Dependent on underlying execution engine – In contrast to cardinality-based model (default cost model) – Only available for Apache Tez  Requires ‘tuning’ parameters – CPU atomic operation cost, local disk byte read/write cost, HDFS byte read/write cost, network byte transfer cost  Allows join algorithm selection in Apache Calcite – Sort-merge join, mapjoin, bucket mapjoin, sorted bucket mapjoin Extended cost model
  • 49. 50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Physical optimization  Determine how the plan will be executed – Algorithms to use for each of the operators – Partitioning and sorting data between operators  Finds additional optimization opportunities – Removal of unnecesary stages  Completely done in Apache Hive
  • 50. 51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Physical optimization – Example
  • 51. 52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Physical optimization - Example  Dynamic runtime filtering – Runtime decision to cancel scan of records depending on subquery results – Simplified semi-join reduction optimization SELECT … FROM sales JOIN time ON sales.time_id = time.time_id WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’)
  • 52. 53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Physical optimization - Example  Dynamic runtime filtering – Runtime decision to cancel scan of records depending on subquery results – Equivalent to semi-join reduction optimization SELECT … FROM sales JOIN time ON sales.time_id = time.time_id WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’) Join sales sales.time_id = time.time_id Physical query plan Scan Filter time year = 2014 AND quarter IN ('Q1', 'Q2') ReduceSinkScan key: time_id
  • 53. 54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Physical optimization - Example  Dynamic runtime filtering – Runtime decision to cancel scan of records depending on subquery results – Equivalent to semi-join reduction optimization SELECT … FROM sales JOIN time ON sales.time_id = time.time_id WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’) Join sales sales.time_id = time.time_id Physical query plan Scan Filter time year = 2014 AND quarter IN ('Q1', 'Q2') ReduceSinkScan Scan only subset of records with given time_id key: time_id

Editor's Notes

  1. ORM (object relational mapping)
  2. ORM (object relational mapping) Combines exhaustive search with greedy algorithm Exhaustive search finds every possible plan using rewriting rules Not practical for large number of joins Greedy algorithm builds the plan iteratively Uses heuristic to choose best join to add next
  3. ORM (object relational mapping)
  4. ORM (object relational mapping)
  5. ORM (object relational mapping)