Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
LLAP: long-lived execution in Hive
Sergey Shelukhin
LLAP: long-lived execution in Hive
• Stinger recap and even faster queries
• LLAP: overview
• Query fragment execution
• IO elevator and caching
• Performance
• Current status and future directions
• Query fragment API
Hive performance recap
• Stinger: An Open Roadmap to improve Apache Hive’s
performance 100x
• Delivered in 100% Apache Open Source
• Stinger.Next: Enterprise SQL at Hadoop Scale
• Launched in September 2014, phase 1 delivered in 2015
Timeline: Hive 0.10 (batch processing) → 100–150x query speedup from the vectorized SQL engine, Tez execution engine, ORC columnar format, and cost-based optimizer → Hive 0.14 (human interactive, ~5 seconds)
The road ahead to sub-second queries
• Startup costs are now a key bottleneck
• Example: JVM takes 100s of ms to start up
• Vectorized code can benefit from JIT optimization
• JIT optimizer needs (run)time to do its work
• Improved operator performance shifts the bottleneck to IO
• Reading data is serialized with data processing (no overlap)
• Reading from HDFS is relatively expensive
• Large machines provide opportunities for data sharing
• Both between parallel computation (sharing) and serial (caching)
LLAP: overview
What is LLAP?
• Hybrid execution with daemons in Hive
• Eliminates startup costs for tasks
• Allows the JIT optimizer to have time to optimize
• Multi-threaded execution of vectorized
operator pipelines
• Also allows sharing of metadata, map join tables, etc.
• Asynchronous IO elevator and caching
• Reduces IO cost and parallelizes IO and processing
• Can be spindle-aware; other IO optimizations
• Query fragment API
Diagram: a node runs a single LLAP process hosting a cache and multiple query fragments, reading data from HDFS.
What LLAP isn't
• Not a Hive execution engine (like Tez, MR, Spark…)
• Execution engines provide coordination and scheduling
• Some work (e.g. large shuffles) can still be scheduled in containers
• Not a storage layer
• Daemons are stateless and read (and cache) data from HDFS
• Does not supersede existing Hive
• Container-based execution still fully supported
Example execution: MR vs Tez vs Tez+LLAP
Diagram: the same query under three models.
• Map–Reduce: map tasks read HDFS, and intermediate results are written back to HDFS between stages.
• Tez: an optimized pipeline; intermediate results flow between tasks without HDFS round trips.
• Tez with LLAP: resident processes on the nodes serve fragments out of an in-memory columnar cache.
LLAP in your cluster
• LLAP daemons run on existing YARN
• Apache Slider is used for provisioning and recovery
• Easy to bring up, tear down, and share clusters
• Resource management via YARN delegation model (WIP)
• LLAP and containers dynamically balance resource usage (WIP)
Benefits unrelated to performance (WIP)
• Concurrent query execution and priority enforcement
• Access control, including column-level security
• ACID improvements
• Can be used externally via the API
• Will be usable e.g. by Spark, Pig, Cascading, …
Query fragment API
Query Fragment API - overview
• Hadoop RPC, protobuf are used to send fragments
• Fragments are "physical algebra": operators, metadata, input
sources and output channels
• Results are returned asynchronously via output channels
• Hive will produce fragments for LLAP as part of physical
optimization
• Other applications can compile their own physical algebra
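A hedged sketch of what a fragment submission might look like. The real LLAP API sends protobuf messages over Hadoop RPC; JSON stands in for the wire format here, and every field and class name below is illustrative, not the actual API.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shape of a fragment submission (illustrative names only).
@dataclass
class Fragment:
    query_id: str
    operators: list       # the "physical algebra": operator tree, serialized
    input_sources: list   # e.g. HDFS splits to read
    output_channel: str   # where results are streamed back asynchronously

    def to_wire(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")  # protobuf stand-in

frag = Fragment(
    query_id="q1",
    operators=["Scan(store_sales)", "Filter(ss_quantity > 10)"],
    input_sources=["hdfs://nn/warehouse/store_sales/part-00000"],
    output_channel="channel-42",
)
decoded = json.loads(frag.to_wire())
print(decoded["query_id"])  # q1
```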
Query Fragment API – algebra
• Operators: Scan, Filter, Group By, Hash/Merge join, etc.
• Operators may include statistics for local optimization
• Expressions: comparison, arithmetic, Hive built-in functions
• All Hive datatypes
• Complex types like map/list/etc. – WIP
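To make "physical algebra" concrete, here is a minimal sketch of a Scan → Filter → Group By pipeline over in-memory rows. The operator names match the list above; the implementation is purely illustrative, not how LLAP evaluates operators.

```python
# Illustrative operator tree: Scan -> Filter -> GroupBy.
def scan(rows):
    yield from rows                      # stand-in for reading a table

def filt(child, pred):
    return (r for r in child if pred(r))  # row-at-a-time filter

def group_by(child, key, agg_col):
    out = {}                             # sum(agg_col) grouped by key
    for r in child:
        out[r[key]] = out.get(r[key], 0) + r[agg_col]
    return out

rows = [
    {"store": "a", "qty": 5},
    {"store": "a", "qty": 20},
    {"store": "b", "qty": 30},
]
plan = group_by(filt(scan(rows), lambda r: r["qty"] > 10), "store", "qty")
print(plan)  # {'a': 20, 'b': 30}
```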
Query Fragment API – client API
• Encapsulates creation, submission of query fragments
• Also helps with IO from LLAP
• Getting vectorized record readers, batches, etc.
• Working with output channels (cancellation, availability of records,
failure)
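A sketch of the output-channel side of such a client API, assuming a push model with end-of-stream and cancellation. All class and method names here are hypothetical, not the real client API.

```python
import queue
import threading

# Hypothetical output channel: the daemon pushes record batches
# asynchronously; the client iterates until end-of-stream or cancels.
class OutputChannel:
    _EOS = object()  # end-of-stream sentinel

    def __init__(self):
        self._q = queue.Queue()
        self.cancelled = False

    def push(self, batch):   # daemon side
        self._q.put(batch)

    def close(self):         # daemon side: signal end of stream
        self._q.put(self._EOS)

    def cancel(self):        # client side
        self.cancelled = True

    def batches(self):       # client side: blocks until records arrive
        while not self.cancelled:
            item = self._q.get()
            if item is self._EOS:
                return
            yield item

ch = OutputChannel()
t = threading.Thread(target=lambda: (ch.push([1, 2]), ch.push([3]), ch.close()))
t.start()
got = list(ch.batches())
t.join()
print(got)  # [[1, 2], [3]]
```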
Query execution
LLAP: Query Execution
• Overview of query execution
• Scheduling
• Coordination via Tez
• What fragments run in LLAP vs containers
• Future work
Tez + LLAP – overview
• Hive on Tez already proven to perform well
• Tez being enhanced to allow it to coordinate work to external
systems (TEZ-2003)
• Pluggable Scheduling
• Pluggable communication – custom execution specifications, protocols
• DAG coordination remains unchanged
• Hive Operators / Tez Runtime components used for Processing
and data transfer
Deciding on where query components run
• Fragments can run in LLAP, regular containers, AM (as threads)
• Decision made by the Hive Client
• Configurable – all in LLAP, none in LLAP, intelligent mix
• Criteria for running in LLAP (in auto mode)
• No user code (or only blessed user code)
• Data source – HDFS
• ORC and vectorized execution (for now)
• Others can still run in LLAP in "all" mode, w/o IO elevator and cache
• Data size limitations (avoid heavy / long running processing within LLAP)
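The auto-mode criteria above can be sketched as a placement function. The threshold, flag names, and field names are invented for illustration; they are not actual Hive configuration.

```python
# Illustrative placement decision for a fragment (made-up field names).
def runs_in_llap(fragment, mode="auto", max_input_bytes=1 << 30):
    if mode == "none":
        return False
    if mode == "all":
        return True  # runs in LLAP even without IO elevator/cache benefits
    # mode == "auto": be selective, per the criteria above
    return (not fragment["has_user_code"]      # no (unblessed) user code
            and fragment["source"] == "hdfs"   # data source is HDFS
            and fragment["format"] == "orc"    # ORC + vectorized for now
            and fragment["vectorized"]
            and fragment["input_bytes"] <= max_input_bytes)  # size limit

f = {"has_user_code": False, "source": "hdfs", "format": "orc",
     "vectorized": True, "input_bytes": 10 * 1024 * 1024}
print(runs_in_llap(f))          # True
print(runs_in_llap(f, "none"))  # False
```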
So…
Diagram: the same DAG under three setups, each coordinated by a Tez AM: plain Tez (all map and reduce tasks in containers), Tez with LLAP in "auto" mode (a mix of fragments in LLAP daemons and in containers), and Tez with LLAP in "all" mode (all fragments in LLAP).
Scheduling for LLAP in Tez AM
• Greedy scheduling per query – assumes entire cluster available
• Schedule work to preferred location (HDFS locality)
• Multiple independent queries set the same preferred location if accessing the
same data (improves cache locality)
• LLAP Daemons schedule fragments independently – across
multiple queries
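One way to get the "same preferred location for the same data" behavior is deterministic placement, e.g. hashing a split's identity to a daemon, so independent queries reading the same data land on the same node and hit its cache. This is only a sketch of the idea, not the actual scheduler.

```python
import hashlib

# Deterministic split -> daemon mapping (illustrative).
def preferred_daemon(split_path, daemons):
    h = int(hashlib.md5(split_path.encode()).hexdigest(), 16)
    return daemons[h % len(daemons)]

daemons = ["node1", "node2", "node3"]
# Two independent queries reading the same split agree on placement:
a = preferred_daemon("/warehouse/store_sales/part-0", daemons)
b = preferred_daemon("/warehouse/store_sales/part-0", daemons)
print(a == b)  # True
```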
Queuing fragments
• LLAP daemon has a number of executors
(think containers)
• Wait queue with pluggable priority
• Geared towards low latency queries (default)
• Models estimated work left in query
• Sequencing within a query handled via topological
order
• Fragment start time factors into scheduling decision
Diagram: four executors running fragments (Q1 Reducer 2, two copies of Q1 Map 1, Q3 Map 19), with more Q1 and Q3 fragments waiting in the queue.
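The wait-queue policy can be sketched as a priority heap where fragments from queries with less estimated work left sort first (favoring low-latency queries), with fragment start time as a tiebreaker. The scoring below is invented for illustration.

```python
import heapq

# Illustrative priority: (estimated work left in the query, start time).
def priority(frag):
    return (frag["work_left"], frag["start_time"])

wait_queue = []
for frag in [
    {"name": "Q1 Reducer 2", "work_left": 1,  "start_time": 10},
    {"name": "Q3 Map 19",    "work_left": 90, "start_time": 5},
    {"name": "Q1 Map 1",     "work_left": 1,  "start_time": 8},
]:
    heapq.heappush(wait_queue, (priority(frag), frag["name"]))

# The nearly-done query Q1 runs before the long-running Q3:
order = [heapq.heappop(wait_queue)[1] for _ in range(len(wait_queue))]
print(order)  # ['Q1 Map 1', 'Q1 Reducer 2', 'Q3 Map 19']
```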
LLAP Scheduling – pipelining and preemption
• A fragment can run when inputs are not yet available (for pipelining)
• A fragment is "finishable" if all the source data is ready
• If the data is not ready, it may never free the executor
• Non-finishable fragments can be preempted
• Improves throughput, prevents deadlocks
Diagram: interactive-query maps fill the executors while a wide query's reduce sits with only 10 of its 100 mappers done; the non-finishable reduce is preempted so the interactive maps can run.
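The preemption rule can be sketched as: when every executor is busy and a finishable fragment is waiting, evict a running non-finishable fragment back to the queue. Names and structure below are illustrative, not LLAP's actual scheduler code.

```python
# Illustrative admission with preemption of non-finishable fragments.
def admit(running, waiting_frag, num_executors):
    """Return (new running list, preempted victim or None)."""
    if len(running) < num_executors:
        return running + [waiting_frag], None      # free slot available
    if waiting_frag["finishable"]:
        for i, frag in enumerate(running):
            if not frag["finishable"]:             # evict a stalled fragment
                return running[:i] + running[i + 1:] + [waiting_frag], frag
    return running, None                           # keep waiting in queue

running = [
    {"name": "wide reduce", "finishable": False},  # only 10/100 mappers done
    {"name": "interactive map 1", "finishable": True},
]
running, victim = admit(
    running, {"name": "interactive map 2", "finishable": True}, num_executors=2)
print(victim["name"])  # wide reduce
```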
IO elevator and other internals
LLAP: IO elevator and other internals
• Asynchronous IO and decompression
• Off-heap data caching
• File metadata caching
• Map join table sharing
• Better JIT usage thanks to persistent daemons
Asynchronous IO
• Currently, Hive IO and input
decoding is interleaved
with processing
• Remote HDFS reads are
expensive
• Even local disk might be
• Data decompression and
decoding is expensive
Asynchronous IO
• With IO elevator, reading,
decoding and processing are
parallel
• IO threads can be spindle
aware (WIP)
• Depending on workload, IO
and processing threads can
balance resource usage
(throttle IO, etc.) (WIP)
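The elevator idea in miniature: a dedicated IO thread reads and decodes batches into a bounded queue while the executor thread consumes them, so IO and processing overlap instead of alternating; the bounded queue also gives natural backpressure for throttling. This is a sketch of the pattern, not the actual implementation.

```python
import queue
import threading

# IO side: stand-in for read + decompress + decode into columnar batches.
def io_elevator(splits, out, batch_size=2):
    for split in splits:
        out.put([f"{split}:row{i}" for i in range(batch_size)])
    out.put(None)  # end of stream

# Processing side: stand-in for the vectorized operator pipeline.
def executor(inp, results):
    while (batch := inp.get()) is not None:
        results.extend(batch)

pipe = queue.Queue(maxsize=4)  # bounded: backpressure can throttle IO
results = []
t = threading.Thread(target=io_elevator, args=(["s1", "s2"], pipe))
t.start()
executor(pipe, results)
t.join()
print(len(results))  # 4
```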
Caching and off-heap data
• Decompressed data is cached off-heap
• Simplifies memory management, mitigates some GC problems
• Saves HDFS and decompression costs, esp. on dimension tables
• In the future, processing cached data directly will be possible, avoiding copies
• Replacement policy is pluggable
• Currently, simple local policies are used e.g. FIFO, LRFU
• Other policies possible (e.g. workflow-adaptable, or lazily
coordinated for better cache affinity)
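A hedged sketch of an LRFU ("least recently/frequently used") policy like the one mentioned above: each cached buffer carries a combined recency/frequency score that decays exponentially with time, and the lowest-scoring buffer is evicted. The decay parameter and structure are illustrative only.

```python
LAMBDA = 0.5  # 0 -> pure LFU; larger values behave more like LRU

class LrfuCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.score = {}   # key -> (combined score, last-touch time)
        self.clock = 0

    def _crf(self, key):
        # Current score: past score decayed by elapsed time.
        crf, last = self.score[key]
        return crf * (0.5 ** (LAMBDA * (self.clock - last)))

    def touch(self, key):
        self.clock += 1
        old = self._crf(key) if key in self.score else 0.0
        if key not in self.score and len(self.score) >= self.capacity:
            victim = min(self.score, key=self._crf)  # evict lowest score
            del self.score[victim]
        self.score[key] = (old + 1.0, self.clock)

cache = LrfuCache(capacity=2)
for key in ["a", "a", "a", "b", "c"]:  # "a" is hot, "b" was touched once
    cache.touch(key)
print(sorted(cache.score))  # ['a', 'c']: the cold 'b' was evicted
```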
Cache size vs operator memory requirement
• Cache space takes away from operator space
• Sort buffers, hash join tables, GBY buffers take space
• Tradeoff between HDFS reads and operator speed
• Depends on workflow, dataset size, etc.
• New vectorization changes in Hive will speed up operators and
allow for larger cache
Other benefits
• File metadata and indexes are cached
• Much faster PPD application for selective queries – no HDFS reads
• Same replacement as data cache (but higher priority)
• Map join hash tables, fragment plans are shared
• Multiple tasks do not all generate the table or deserialize the plans
• Better use of JIT optimizer
• Because the daemons are persistent, JIT has more time to kick in
• Especially good with vectorization!
Performance
Setup
• 13 physical machines (12 cores, 40 GB RAM each)
• Note – smaller cluster than previous Tez perf runs
• TPCDS 200, interactive queries
• Both – ORC, vectorized, Hadoop 2.8, queries via HS2 w/JMeter
• TEZ: Hive 1.2 + Tez 0.8 (snapshot)
• Pre-warm and container reuse enabled
• LLAP: Branch in pre-alpha stage + Tez 0.8 (snapshot)
• Bias towards executors – small cache
• Otherwise no tuning
Summary
• NOTE - in early stage – pre-alpha-release perf results
• Still, interactive queries are already 1.5-4 times faster
• First query result after launching CLI significantly improved
• In real life, LLAP daemons would also already be warm
• Parallel queries are already better
• Lots of work still ahead – epic locks in Kryo, Log4j, HDFS, HiveServer2;
better object sharing, better priority enforcement
• Should be much faster in short order
Query execution time
Chart: per-query execution time in seconds (0–35) for 14 TPC-DS queries (query55, query42, query52, query3, query12, query27, query26, query7, query19, query96, query43, query15, query82, query13), Hive 1.2.0 vs Hive (LLAP).
Parallel query execution
• 8 users, 4 parallel executors on HS2
• Tez: 50% of serial time; LLAP alpha: 41% of serial time
Chart: total execution time in seconds (0–300) for the 13 queries, serial vs parallel, Hive 1.2.0 vs Hive (LLAP).
Current status and future directions
Current status
• Putting the finishing touches on the CTP (alpha release)
• Watch Hortonworks blog, and Apache Hive mailing lists, for details!
• The basic features are functional
• Currently only on Tez; IO only on vectorized and ORC
• AKA the fastest Hive setup possible
• Lots of performance improvement not yet realized
• Lots of advanced features are WIP or planned
Work in progress
• Further performance improvement
• Concurrent query execution improvements
• Better vectorized operators (join, group by, …)
• Defining the API
Future work
• Security, including column level security
• Tighter integration with YARN, e.g. resource delegation
• Guaranteed capacities for better SLAs, maybe with a central scheduler
• Dynamic daemon sizing with off-heap storage
• ACID support
• Better (maybe centrally coordinated) locality and caching
• Temp tables, intermediate query results in LLAP
• Interleaving of fragment execution
• Past processing is not lost (unlike with preemption)
• A rogue / badly scheduled query will not hog the system
Questions?
Interested? Stop by the Hortonworks booth to learn more