-
1.
Apache Hive 2.0:
SQL, Speed, Scale
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016
-
2.
Acknowledgements
The Apache Hive community for building all this awesome tech
Content of some of these slides is based on earlier presentations by Sergey Shelukhin
and Siddharth Seth
alias Hive='Apache Hive'
alias Hadoop='Apache Hadoop'
alias Spark='Apache Spark'
alias Tez='Apache Tez'
alias Parquet='Apache Parquet'
alias ORC='Apache ORC'
alias Omid='Apache Omid (incubating)'
alias Calcite='Apache Calcite'
-
3.
Apache Hive History
Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time
Around 2012 it became clear users wanted to do all data warehousing on Hadoop,
not just batch ETL
Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools
Not OLTP, very focused on backend analytics
-
4.
Hive 1.x and 2.x
New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
– Realizing the full potential of Hive as data warehouse on Hadoop requires more changes
Compromise: follow Hadoop’s example, split into stable and new feature lines
1.x
– Stable
– Backwards compatible
– Ongoing bug fixes
2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
-
5.
Hive 2.0 New Features Overview
1039 JIRAs resolved with 2.0 as fix version
– 666 bugs
– 140 improvements or new features
HPLSQL
LLAP
HBase Metastore
Hive-On-Spark Improvements
Cost Based Optimizer Improvements
Many, many new features and bug fixes I will not have time to cover
-
6.
Adding Procedural SQL: HPLSQL
Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
Aims to be compatible with all major dialects of procedural SQL to maximize re-use of
existing scripts
Currently external to Hive, communicates with Hive via JDBC.
– User runs command using hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures,
etc.
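For illustration, a minimal HPL/SQL sketch (the table, column, and procedure names are hypothetical) showing a variable declaration, SELECT INTO, and an IF branch, run through the external hplsql binary:

-- save as big_orders.sql and run with: hplsql -f big_orders.sql
CREATE PROCEDURE count_big_orders()
BEGIN
  DECLARE cnt INT DEFAULT 0;
  -- the query itself is sent to Hive over JDBC
  SELECT COUNT(*) INTO cnt FROM orders WHERE amount > 1000;
  IF cnt > 0 THEN
    PRINT 'Found ' || cnt || ' large orders';
  ELSE
    PRINT 'No large orders';
  END IF;
END;

CALL count_big_orders();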
-
7.
Sub-second Queries in Hive: LLAP (Live Long and Process)
Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on the node where the data is cached; otherwise it runs on another node
Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security
Working on interface to allow other data engines to read securely in parallel
Beta in 2.0
-
8.
Hive With LLAP Execution Options
Diagram: a Tez AM coordinates each query in three configurations. Tez only: map (M) and reduce (R) tasks run in Tez containers. LLAP + Tez: tasks (T) run inside the LLAP daemons while reducers (R) run in Tez containers. LLAP only: all tasks run inside the LLAP daemons.
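A sketch of choosing among these modes per session (the property names below are the Hive 2.0 LLAP settings; treat the exact values as an assumption to verify against your build):

-- run all operators inside the LLAP daemons when possible
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;

-- hybrid: map-side operators in LLAP, reducers in Tez containers
SET hive.llap.execution.mode=map;

-- plain Tez, no LLAP
SET hive.execution.mode=container;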
-
9.
LLAP Performance
Chart: LLAP vs Hive 1.x at 10TB scale; per-query times in seconds (0-50) for query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, and query98.
-
10.
LLAP Performance Continued
Chart: Hive / LLAP, Hive 1.2.1 Query Times; per-query times in seconds (0-500).
38 out of 61 queries ran 50% faster
25 out of 61 queries ran 70% faster
12 out of 61 queries ran 80% faster
1 query ran 90% faster
-
11.
LLAP Limitations
Currently in Beta
Read only, no write path yet
Does not work with ACID yet (see previous bullet)
User must decide whether query runs fully in LLAP, mixed mode, or not at all
– Should be handled by CBO
Currently only reads ORC files
Currently only integrates with Tez as an engine
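Since LLAP can only read ORC for now, a table intended for the cache would be stored as ORC; a minimal sketch with a hypothetical table:

CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP)
STORED AS ORC;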
-
12.
Speeding up Query Planning: HBase Metastore
Add option to use HBase to store Hive’s metadata
Why?
– Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata
acquisition
– ORM layer produces complex, slow schema (40+ tables)
– The need to work across 5 different databases limits performance optimizations and expands the test
matrix for developers
– Limits caching opportunities as we cannot store too much data in a single node RDBMS
– The need to limit number of concurrent connections forces all metadata operations to be done during
query planning
– HBase addresses each of these
Goal: cut metadata access time for query with thousands of partitions to 200 milliseconds
– Not there yet, currently at 1-1.5 seconds
Challenges
– HBase lacks transactions, addressing via Apache Omid (incubating)
Alpha in Hive 2.0
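As a sketch, turning the alpha on means pointing the metastore at the HBase-backed store implementation in hive-site.xml (the class below is the one shipped in Hive 2.0, but verify against your release):

<!-- hive-site.xml: swap the ORM-backed RawStore for the HBase one -->
<property>
  <name>hive.metastore.rawstore.impl</name>
  <value>org.apache.hadoop.hive.metastore.hbase.HBaseStore</value>
</property>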
-
13.
Improvements to Hive on Spark
Dynamic partition pruning
Make use of Spark persistence for self-joins, self-unions, and CTEs
Vectorized map-join and other map-join improvements
Parallel order by
Pre-warming of containers
Support for Spark 1.5
Many bug fixes
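A minimal sketch of exercising two of these from a session (hive.spark.dynamic.partition.pruning is the Spark-specific property this line of work added, and the table names are hypothetical; verify both against your release):

SET hive.execution.engine=spark;
SET hive.spark.dynamic.partition.pruning=true;
-- the filter on the dimension table can now prune partitions of the fact table
SELECT f.metric
FROM fact f JOIN dim d ON (f.part_key = d.key)
WHERE d.region = 'EU';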
-
14.
Cost Based Optimizer (CBO) Improvements
Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that
CBO on by default in 2.0 (it was not in 1.x)
Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries
Focus on improving stats collection and estimating stats more accurately between
operators in the plan
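The CBO is only as good as the statistics it is fed, so column stats would typically be collected explicitly; a minimal sketch with a hypothetical table:

SET hive.cbo.enable=true;  -- already the default in 2.0
ANALYZE TABLE sales COMPUTE STATISTICS;              -- basic table stats
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;  -- column stats used by the CBO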
-
15.
And Many, Many More
• SQL Standard Auth is the default authorization (actually works)
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability Improvements and bugfixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution (beta)
• Improvement to windowing functions, refactoring ORC before split, SIMD
optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez
session management, many more
-
16.
Hive 2.0 Incompatibilities
Java 7 & 8 supported, 6 no longer supported
Requires Hadoop 2.x, Hadoop 1.x no longer supported
MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed
Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default
We plan to remove Hive CLI in the future and replace with beeline CLI
– Why?
• Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
• It is cleaner to maintain one code path
– Does not require HiveServer2, can run HS2 embedded in beeline
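For example, beeline covers both deployment styles (standard connection-string forms; host and port are placeholders):

# remote HiveServer2
beeline -u jdbc:hive2://hs2-host:10000/default
# embedded mode: beeline starts HS2 in-process, no separate server required
beeline -u jdbc:hive2://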
-
17.
Thank You
Benchmark setup for the preceding LLAP performance results:
10 compute nodes, with 512GB RAM per node, running HDP 2.3
Scale 10K (10TB), interactive queries
Single query runs – via Hive CLI
Concurrency runs – via HiveServer2 and jmeter
Hive1: Hive 1.2 + Tez 0.7
Pre-warm and container reuse enabled
LLAP: Close to the 2.0 Hive branch, Tez close to the current master branch
Caching Enabled (as of November 2015)
Notes on the Hive-on-Spark improvements:
1. DPP: Implemented as two sequential jobs. The first processes the pruning part, saving the dynamic values on HDFS. The second uses these values to filter out unwanted partitions. Not fully tested yet.
2. Spark RDD persistence is used to store the temporary results of repeated subqueries to avoid re-computation. This is similar to a materialized view and happens automatically. It is especially useful for self-joins, self-unions, and CTEs (see the sketch after this list).
3. Vectorized map-join and an optimized hashtable for map-join. These are very similar to Tez.
4. Use the parallel order by provided by Spark to do global sorting without being limited to one reducer. Internally, Spark does the sampling.
5. Wait a few seconds after the SparkContext is created before submitting the job, to make sure enough executors have launched. SparkContext allows a job to be submitted right away, even while executors are still starting up. Reducer parallelism is partially determined by the number of executors available when the job is submitted. This is useful for short-lived sessions, such as those launched by Oozie.
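As an illustration of note 2, a hypothetical query shape that benefits: the CTE is referenced on both sides of the union, so its result is persisted and computed only once:

WITH recent AS (
  SELECT * FROM sales WHERE sale_date >= '2016-01-01'
)
SELECT 'high' AS bucket, COUNT(*) AS cnt FROM recent WHERE amount > 100
UNION ALL
SELECT 'low' AS bucket, COUNT(*) AS cnt FROM recent WHERE amount <= 100;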