Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production

© Cloudera, Inc. All rights reserved.
Faster Batch Processing with Hive-on-Spark
Santosh Kumar | Cloudera
Rui Li | Intel

Agenda
• What is Hive-on-Spark?
• Using Hive-on-Spark
• Performance Metrics
• Configuration & Tuning
• What’s Next?
• Q&A

Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
Fast Batch & Stream Processing
• In-Memory processing and caching

Spark Takes Advantage of Memory
• Resilient Distributed Datasets (RDD)
• In-memory data-structure partitioned across a set of machines
• Can fall back to disk when data-set does not fit in memory
• Created by parallel transformations on data in stable storage
• Provides fault-tolerance through concept of lineage

Introduction
• Enables Hive to use Spark as underlying execution engine
• Motivations
• Consolidation of Spark as execution engine
• Better performance
• Increased adoption of Hive (e.g. for Spark users)
• Community effort by Cloudera, IBM, Intel, MapR, and others

Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
[Figure: SQL engines by use case (Batch Processing; BI and SQL Analytics; Procedural Development); engines shown include Impala]

Current State of Hive-on-Spark (HoS)
• Fully supported production release in C5.7
• Functional parity with Hive-on-MapReduce (HoMR)
• Average 3x performance improvement vs HoMR
• Automatic configuration and optimizations via Cloudera Manager
• Strong early user base
• Early commitment for future collaboration from Intel and others

Design Principles
• Minimize impact on existing code path
• Minimizes functional and performance impact
• Minimizes maintenance
• Maximizes support for Hive features – current as well as future
• Spark invoked only at execution layer
• HoS produces a logical operator plan similar to HoMR's
• Logical plan runs on low-level Spark primitives
• Minimizes usage of advanced Spark primitives

Getting Started with Hive-on-Spark

Configuration
• Minimal configurations needed
• Via Cloudera Manager: Set “Spark on YARN Service” (internally sets spark.master=yarn-cluster)
• Set hive.execution.engine=spark per service or query
• Only yarn-cluster is supported
• Cloudera Manager auto-configures most configurations
• Configuration & Tuning Guide available on Docs
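As a minimal sketch of the per-query switch, the Python helper below prefixes a HiveQL statement with the session-level SET command. The helper name with_engine is hypothetical; the accepted values (mr, tez, spark) are the ones Hive defines for hive.execution.engine.

```python
# Values Hive accepts for hive.execution.engine.
VALID_ENGINES = ("mr", "tez", "spark")


def with_engine(query, engine="spark"):
    """Return the query prefixed with a SET that selects the execution engine.

    Hypothetical helper for illustration; it only builds text and does not
    talk to HiveServer2.
    """
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    # Normalize the trailing semicolon so exactly one is emitted.
    statement = query.strip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{statement};"


if __name__ == "__main__":
    print(with_engine("SELECT COUNT(*) FROM status_updates"))
```

The resulting two statements can be submitted in one session through any Hive client; switching back to MapReduce for a single query only requires passing engine="mr".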

Performance
Avg. ~3X faster than Hive-on-MapReduce
More Suitable
• Complex workloads w/ multiple MR stages, e.g. filter followed by JOIN followed by GROUP BY
• Disk-bound w/ multiple disk reads/writes
• Workloads requiring mins to hours for completion
Less Suitable
• Simple workloads, e.g. select *
• CPU-bound workloads, e.g. complex UDFs
• Workloads typically requiring <1 min

Query Execution: Background
Input
status_updates(userid int, status string, ds string)
profiles(userid int, school string, gender int)
Output
school_summary(school string, cnt int, ds string)
gender_summary(gender int, cnt int, ds string)
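The query text itself is not part of the extracted slides. The Python sketch below assumes one plausible shape for it: join status_updates to profiles on userid for a single ds, then count the joined rows once by school and once by gender. The sample rows and the summarize function are hypothetical and only illustrate how one join can feed both output tables.

```python
from collections import Counter

# Hypothetical sample rows that follow the input schemas above.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2016-04-01"),
    (2, "at work", "2016-04-01"),
    (3, "lunch", "2016-04-01"),
    (1, "bye", "2016-04-02"),
]
profiles = [  # (userid, school, gender)
    (1, "MIT", 0),
    (2, "MIT", 1),
    (3, "UCLA", 1),
]


def summarize(ds):
    """Join once for the given ds, then aggregate the joined rows twice."""
    by_user = {userid: (school, gender) for userid, school, gender in profiles}
    joined = [
        by_user[userid]
        for userid, _status, day in status_updates
        if day == ds and userid in by_user
    ]
    school_counts = Counter(school for school, _gender in joined)
    gender_counts = Counter(gender for _school, gender in joined)
    school_summary = sorted((s, c, ds) for s, c in school_counts.items())
    gender_summary = sorted((g, c, ds) for g, c in gender_counts.items())
    return school_summary, gender_summary


if __name__ == "__main__":
    schools, genders = summarize("2016-04-01")
    print(schools)
    print(genders)
```

In this toy version the joined rows stay in memory between the two aggregations, which is the property the following slides contrast with writing and re-reading intermediate results on disk.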

Query Execution: MapReduce
[Figure: MapReduce execution plan for the query]
FileSinkOperator (disk write) and TableScanOperator (disk read) are very costly

Query Execution: Hive-on-Spark
Costly Steps Removed
[Figure: Hive-on-Spark execution plan for the query]

Optimization for Resource Management: Long-Lived Executors (LLE)
• MR: Each query an independent YARN application
• Spark: Each SQL session is a long-lived YARN application
• First query of a session spawns a YARN app
• Subsequent queries re-use same YARN app as well as containers
• Session disconnect shuts down YARN app and releases container resources
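The session lifecycle described above can be modeled with a toy Python class. This is only a sketch of the bookkeeping, not Hive's implementation: HiveSession, run, and close are hypothetical names, and an integer stands in for a YARN application id.

```python
import itertools

# Stand-in for YARN handing out application ids.
_app_ids = itertools.count(1)


class HiveSession:
    """Toy model: one session owns at most one long-lived YARN application."""

    def __init__(self):
        self.yarn_app_id = None

    def run(self, query):
        # The first query of the session launches the application;
        # later queries reuse it (and, in the real system, its containers).
        if self.yarn_app_id is None:
            self.yarn_app_id = next(_app_ids)
        return self.yarn_app_id

    def close(self):
        # Disconnecting shuts the application down and frees its resources.
        self.yarn_app_id = None


if __name__ == "__main__":
    session = HiveSession()
    first = session.run("SELECT 1")
    second = session.run("SELECT 2")
    print(first == second)  # both queries ran in the same application
    session.close()
```

Two sessions opened side by side get different application ids in this model, which mirrors the one-application-per-session behavior listed above.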

Long-Lived Executors Details
• Hive User Session will submit Spark Application to YARN
• Spark YARN Application:
• Spark executors live in YARN containers
• YARN Application Master = RemoteDriver
• Submits Spark ‘jobs’, aka Hive queries, to Spark executors
• Connects back to HS2 to report job progress from Spark executors
[Figure: User1 and User2 each open a session (Session1, Session2) on HiveServer2; in the YARN cluster each session has its own AM (RemoteDriver1, RemoteDriver2) and containers (executors)]

Configuration and Tuning
Hive-on-Spark

Spark Configuration
• Size of executors
• Bigger and fewer executors
• Thread contention
• GC pressure
• Smaller and more executors
• Less memory efficient
• Bigger start-up overhead

Spark Configuration
• CPU
• Around 5-7 cores per executor
• Memory
• Leave 10% for OS cache
• Executor memory overhead
• Tune case by case
• Can be heavily used by Netty
• Usually 15% - 20%
• Around 3GB per core

Spark Configuration
• Serialization
• spark.serializer – kryo performs better and is REQUIRED by HoS
• spark.kryo.referenceTracking – disable to avoid a Java performance issue
• Shuffle
• spark.shuffle.compress
• spark.shuffle.spill.compress
• Trade CPU for I/O
• Increase number of reducers

Partitioning
• Number of mappers
• Inputformat
• mapreduce.input.fileinputformat.split.maxsize
• Number of reducers
• hive.exec.reducers.bytes.per.reducer
• mapreduce.job.reduces
• HoS tends to launch more reducers
• Merge small files
• hive.merge.sparkfiles

Hive Configuration
• General optimizations
• Enable vectorization
• Enable CBO
• Automatic map join conversion
• Map side aggregation
• Etc.

Hive Configuration
• Map join
• hive.auto.convert.join.noconditionaltask.size
• HoS doesn’t support conditional map join yet
• HoS uses raw data size as small table size – different from MR
• hive.stats.collect.rawdatasize
• Skew join
• Compile time – same as MR
• Runtime - HoS will split the original task at join

Resource Allocation
• Static allocation
• spark.executor.instances
• Won’t release until session is closed
• Recommended for benchmarking
• Dynamic allocation
• spark.dynamicAllocation.enabled
• spark.dynamicAllocation.initialExecutors
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.maxExecutors
• Number of executors per Spark application scales up and down
• Suited for multi-tenancy scenarios (multi-session)
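As a rough sketch, the two allocation modes can be expressed as property maps built in Python. The property names follow the Spark configuration documentation; the helper names, the chosen numbers, and the min <= initial <= max check are illustrative assumptions rather than recommendations.

```python
def static_allocation_settings(executors):
    """Fixed executor count, held until the session closes."""
    if executors < 1:
        raise ValueError("need at least one executor")
    return {
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.instances": str(executors),
    }


def dynamic_allocation_settings(min_executors, initial_executors, max_executors):
    """Executor count that scales between min and max with the workload."""
    # Sanity check assumed for this sketch: bounds must be ordered.
    if not 0 <= min_executors <= initial_executors <= max_executors:
        raise ValueError("expected 0 <= min <= initial <= max")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.initialExecutors": str(initial_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


if __name__ == "__main__":
    for key, value in dynamic_allocation_settings(1, 2, 10).items():
        print(f"SET {key}={value};")
```

Printing the map as SET statements is one way to apply it per session; the same key/value pairs could instead be set at the service level.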

Resource Allocation
• Pre-warm containers
• hive.prewarm.enabled
• spark.scheduler.maxRegisteredResourcesWaitingTime
• spark.scheduler.minRegisteredResourcesRatio
• Attempts to improve parallelism
• Adds considerable delay to the start-up job
• Not recommended for short-lived sessions

Configuration and Tuning Summary
• Number and size of executors are the most important determinants of performance
• Resolve query performance problems and failures by allocating more executors with more CPU and RAM
• spark.executor.instances, spark.executor.cores, spark.executor.memory, spark.yarn.executor.memoryOverhead
• Cloudera Manager takes care of most of the optimizations
• Most Hive config settings apply to HoS, but a few have different semantics
• See Config and Tuning Guide for details

Roadmap
• Additional Optimizations
• Dynamic Partition Pruning
• Vectorization support
• Cost-Based Optimizer
• Others – Caching RDDs across queries, Optimize self join/union etc.
• Supportability Enhancements
• Better support for debugging and logging
• More informative stage description in WebUI
• Others: Improve Hue integration, additional metrics specific to HoS etc.
• Rebase to Spark 2.0 and Parquet 1.8

More Information & Next Steps
Get Started
• Download C5.7: www.cloudera.com/downloads
Release Notes
• www.cloudera.com/documentation/enterprise/latest/topics/rg_release_notes.html
Training Classes
• university.cloudera.com

Questions?
