This document summarizes Richard Xu's presentation on tuning Yarn, Hive, and queries on a Hadoop cluster. The initial issue with the cluster was jobs taking hours to finish when they were supposed to take minutes. Initial tuning focused on cluster configuration best practices and increasing Yarn capacity. Further tuning involved limiting user capacity, increasing resources for application masters, and tuning memory settings for MapReduce and Tez. Specific Hive query issues addressed were full table scans, non-deterministic functions, join orders, and data type mismatches. Tools discussed for analysis included Tez visualization and Lipwig. Lessons learned emphasized a holistic tuning approach and understanding data structures and explain plans. LLAP (Live Long and Process) was presented as the near-future direction, providing in-memory columnar caching and multi-threaded execution.
1. Tune up Yarn and Hive
Richard Xu
Systems Architect at Hortonworks
rxu@hortonworks.com
Toronto Hadoop User Group Nov 27, 2015
2. Today’s Agenda
• Review a real battle
• Tuning cluster and Yarn
• Tuning Hive queries
• Yet more tuning needed
• LLAP: what we can expect in the near future
4. Cluster overview
• 14 nodes
• 2.46 TB of memory available for Yarn applications
Main use cases
• 200+ Hive queries kicked off by Oozie to aggregate
data quarter-hourly, hourly, daily, and weekly.
• HBase tables are loaded into memory as an in-memory
cache; Hive queries retrieve data from these HBase
tables via a Hive UDF
5. Initial complaints
• Cluster is slow
• Almost everybody's jobs hang for hours when they are
supposed to finish in a few minutes
• Hadoop does not work
7. Initial Approaches
• Ensure best-practice configurations are in place in all aspects: OS
(disable Transparent Huge Pages; disable swappiness, on datanodes
only), network (disable iptables), hard drives; we also found the
ulimit setting was too low
• Create 2 more Yarn capacity scheduler queues, batch (60%) and ad-hoc
(30%), in addition to the default queue (10%); see the sketch after this list
• Applied the default configurations suggested by hdp-configuration-utils.py
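In capacity-scheduler.xml terms, the queue split above would look roughly like this (a sketch only; the queue names here are illustrative, and a later slide shows a queue actually named prts-batch):

yarn.scheduler.capacity.root.queues=default,batch,ad-hoc
yarn.scheduler.capacity.root.default.capacity=10
yarn.scheduler.capacity.root.batch.capacity=60
yarn.scheduler.capacity.root.ad-hoc.capacity=30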
8. Issues after initial approach
The very first issue we encountered: one off-shore team member's bad query
used up all the resources of the cluster. Fine, we need to limit user capacity to avoid this:
1. Set user-limit-factor from the default value 1 to 0.1, to restrict any user from
using resources beyond 10% of the queue capacity.
2. Set minimum-user-limit-percent from 100 to 10, so that the queue can serve 10
users at the same time.
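As capacity-scheduler properties, that is (queue name illustrative, matching the sketch above):

yarn.scheduler.capacity.root.batch.user-limit-factor=0.1
yarn.scheduler.capacity.root.batch.minimum-user-limit-percent=10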
9. New issues right after applying the above changes
• Some users submitting 2 Oozie jobs at the same time get
stuck.
• The cluster is not running at full load/speed: we
observe pending applications while the cluster still
has resources
11. Why?
Reason: the Yarn capacity queue property
"Max Schedulable Applications Per User".
As we allow more concurrent users, the number of max
schedulable applications per user decreases!
Related source code:
public static int computeMaxActiveApplicationsPerUser(
    int maxActiveApplications, int userLimit, float userLimitFactor) {
  return Math.max(
      (int) Math.ceil(
          maxActiveApplications * (userLimit / 100.0f) * userLimitFactor),
      1);
}
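Plugging our new settings into this formula makes the problem obvious (a quick check, not from the slides): with userLimit = 10 and userLimitFactor = 0.1, each user may schedule at most ceil(maxActiveApplications * 0.10 * 0.1) applications, i.e. 1% of the queue's maximum rounded up. For typical values that is a single application per user, which is exactly why a user's second concurrent Oozie job got stuck.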
13. Solution
Increase yarn.scheduler.capacity.maximum-am-resource-percent to assign more
resources to ApplicationMasters:
yarn.scheduler.capacity.root.prts-batch.minimum-user-limit-percent=10
yarn.scheduler.capacity.root.prts-batch.user-limit-factor=0.5
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
14. Changes made from the original settings (some properties appear twice, reflecting successive rounds of tuning):

yarn.scheduler.capacity.root.prts-batch.user-limit-factor: 2 → 0.5
mapreduce.map.java.opts: 3g → 4g
mapreduce.reduce.java.opts: 3g → 4g
Default virtual memory for a job's map-task: 4g → 8g
Default virtual memory for a job's reduce-task: 4g → 16g
yarn.app.mapreduce.am.resource.mb: 4g → 16g
yarn.app.mapreduce.am.command-opts: 4g → 12g
mapreduce.reduce.java.opts: 4g → 12g
mapreduce.map.java.opts: 4g → 6g
yarn.scheduler.minimum-allocation-mb: 3g → 8g

Tez:
tez.am.resource.memory.mb: 4g → 8g
tez.task.resource.memory.mb: 4g → 8g
tez.am.java.opts: 6g → 4g

Hive:
hive.tez.container.size: 4g → 8g
hive.tez.java.opts: 2560 MB → 6144 MB
hive.auto.convert.join.noconditionaltask.size: 2.5 GB → 512 MB
15. Changes made from the original settings (continued):

Max containers per host: increased from 21 to 58

Suggested hardware change:
• I/O is the bottleneck to move beyond
• Remove the RAID pair on the OS disks and use the additional drive for HDFS
17. Starting point
• Ofer's blog post "5 Ways to Make Your Hive Queries Run
Faster" is great guidance:
• http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Then look at individual Hive queries,
especially those taking far longer than their peers.
18. Issue: full table scan
50 concurrent clients
Issue: the execution plan shows that a whole
table is loaded in: 9 GB, millions of rows.
The subquery below should be filtered first:
(select fddcell_key, date_key as date, hour_key, qhour_key,
OSSC_RC, MeContext, EUtranCellFDD, enodebfunction from
tf001_fddcell_qhourly tf0001) tf001
changed to:
left outer join tf001_fddcell_qhourly tf001
Result: the job that used to run for 2 hours now takes 25 minutes, and remember,
it can be improved further.
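The general pattern looks like this (a sketch only: the outer fact table, join key, and filter predicate are illustrative, not from the slides):

select f.*, tf001.qhour_key
from fact_table f
left outer join (
    select fddcell_key, date_key, hour_key, qhour_key
    from tf001_fddcell_qhourly
    where date_key = 20151127   -- push the filter into the subquery so only needed rows are read
) tf001
on f.fddcell_key = tf001.fddcell_key;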
19. Issue: unix_timestamp function
50 concurrent clients
Throughput = 1095 reqs/s
The unix_timestamp function is used to get the current day and hour in the where
clause for joining tables. unix_timestamp is a non-deterministic function: it is not
evaluated when the query is compiled, but at runtime, for each row. This disables
dynamic partition pruning, since the optimizer can't tell what the date and hour
are before each row is read. Full table scans for everyone! In Hive 1.2.0 and
beyond, the unix_timestamp function is deprecated, replaced by a
current_timestamp function that is deterministic. Since Sprint is on Hive 0.14, we
installed a UDF that implements the current_timestamp code.
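In query form the change is roughly this (a sketch; date_key as the partition column is an assumption):

-- Before: non-deterministic, evaluated per row, so no partition pruning
where date_key = to_date(from_unixtime(unix_timestamp()))
-- After: deterministic, resolved at compile time (current_timestamp via the UDF on Hive 0.14)
where date_key = to_date(current_timestamp)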
20. Issue: Hive query hanging with Tez and failing with MapReduce
Issue: a Hive query hangs on Tez and fails on MapReduce
(Generate Map Join Task Error)
Solution:
Disable hive.auto.convert.join
or hive.auto.convert.sortmerge.join, as shown below.
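In a Hive session, that is:

set hive.auto.convert.join=false;
-- or
set hive.auto.convert.sortmerge.join=false;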
21. Issue: datatype mismatch on join
When joining tables it is important to make sure that the data types
match on the join columns. There were a couple of joins that were trying to
join a bigint to an int. The inability to cast a bigint down to an int was causing
a table scan (I believe because of the join order). Performing an explicit
cast of the int to a bigint allowed the query to do a range scan instead.
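Illustratively (table and column names here are assumptions, not from the slides):

-- b.event_id is bigint, d.event_id is int: widen the int side explicitly
select b.event_id, d.label
from big_events b
join lookup_dim d
  on b.event_id = cast(d.event_id as bigint);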
22. Issue: join orders
For example, there were some places where joining a larger table to a smaller one
in a subquery, before joining it back to another large table, helped filter the large
table and improved the performance of the query.
1. Consider a JOIN as follows: SELECT * FROM A JOIN B JOIN C (assuming all on
the same ID fields).
2. Consider further that A is a very, very large table, while B and C are relatively
small tables.
3. Without CBO the plan may become (A JOIN B) -> T, then (T JOIN C) -> output.
This is of course not very efficient, since T is also very large (based on A) and
thus we have two very large joins. Alternatively you could do (B JOIN C) -> T, and
then (A JOIN T); a sketch of that rewrite follows this list. You would think CBO
would do this, but with HDP 2.1 it does not. Not sure if this is different in 2.2:
potentially yes, but not sure.
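The manual rewrite, sketched (the shared ID column name is assumed):

-- instead of: select * from A join B on A.id = B.id join C on B.id = C.id
select a.*
from A a
join (select b.id from B b join C c on b.id = c.id) t
  on a.id = t.id;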
24. Tuning Specifics
• Use at most 80% of RAM in Yarn: leave space for shuffling
• Send the Oozie launcher and the Hive query to different queues
• Tez unique features:
1. Dynamic partition pruning+tuning
hive.tez.dynamic.partition.pruning=true
hive.tez.dynamic.partition.pruning.max.data.size
hive.tez.dynamic.partition.pruning.max.event.size
2. Auto-reducer parallelism+tuning
hive.tez.auto.reducer.parallelism=true
hive.tez.min.partition.factor=0.01
3. Tunable slow-start
tez.shuffle-vertex-manager.min-src-fraction=0.99
tez.shuffle-vertex-manager.max-src-fraction=1.0
4. Min-held containers instead of prewarm
tez.am.session.min.held-containers=3
27. Lessons Learned
• Take a holistic approach to performance tuning
• You cannot tune the system around bad code
• Know your performance target before you begin
• Tuning Hive queries requires a deeper understanding
of the data structures than tuning relational databases does
• Developers may not know how to tune or even read
an explain plan
• It is NOT always bad developer code
• Get engaged with the customer developers EARLY
29. Tez with LLAP engine
LLAP is an optional daemon process running on multiple nodes that provides
the following:
• Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
• Multi-threaded execution including reads with predicate pushdown and hash joins
• High throughput IO using Async IO Elevator with dedicated thread and core per disk
• Granular column level security across applications
• YARN will provide workload management in LLAP by using delegation
[Diagram: LLAP process runs on multiple nodes, accelerating Tez tasks. Each node hosts an LLAP process with an in-memory columnar cache; the LLAP process runs read tasks for query fragments against HDFS, serving Hive queries across the nodes.]