The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark

Self-Serve Performance Tuning for Hadoop & Spark
The Fifth Elephant 2016
Akshay Rai
Engineer, Hadoop Development Team
LinkedIn Dr. Elephant
© 2016 LinkedIn Corporation. All Rights Reserved.

Hadoop @ LinkedIn c. 2008
● 1 cluster
● 20 nodes
● 10 users
● 10 workflows in production
● MapReduce, Pig

Hadoop @ LinkedIn c. 2016
● > 10 clusters
● > 10000 nodes
● > 1000 users
● Thousands of queries and flows in development
● Hundreds running in Production
● MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure
• Add extra machines to the cluster
• Hadoop is scalable but not that optimal!
• We cannot keep adding machines forever
• Tune given resources and minimize addition of new machines

Measuring performance
• Highlights hardware failures and poorly performing components
• Scope for environment upgrades.

Cluster Level Performance Tuning
Job Level Performance Tuning

How difficult is it to tune a Job?
• Production Gatekeeper - Let jobs go into production only after verifying they are tuned.
• Restriction! More questions on how to tune! More resources spent helping people.
Here’s what we tried in order to achieve job tuning!

Challenges in tuning a job
• Hadoop is designed to let users tune their jobs BUT!
• One cannot optimize if one doesn’t understand the internals of the framework
• Critical information is scattered
• Hadoop has a huge set of parameters; tuning some may impact others

You cannot tune what you do not know & you cannot
improve what you cannot measure

Training - Doesn’t Scale
• More people, more frequent sessions.
• Hadoop experience varies from person to person
• Framework-specific training: Pig, Hive, etc.

Expert Review - Also Doesn’t Work
• Again not scalable
• Cannot ensure job is performing optimally, no easy comparison.
• Different people, different perspectives, no consensus
• Error prone, one might overlook certain aspects.

Scaling Hadoop Infrastructure is HARD
Scaling User Productivity is much HARDER

What does Dr. Elephant do?
• Helps every user get the best performance from their jobs
• Analyses and compares historical executions
• Provides a platform for other performance-related tools

Mapper Skew Problem
• Varying size of splits can cause skewness in the Mapper Input

Solution to Mapper Skewness
• Each Mapper should process the same amount of data
• Combine the small chunks and feed them to a single Mapper
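
The idea of combining small chunks can be sketched in a few lines of Python. This is only an illustration of the packing step, not how Hadoop or Dr. Elephant implements it; the function name `combine_chunks` and the 128 MB target are assumptions made for the example. In Hadoop itself, `CombineFileInputFormat` serves a similar purpose.

```python
def combine_chunks(chunk_sizes_mb, target_split_mb):
    """Greedily pack consecutive chunks into splits of at most target_split_mb.

    A chunk larger than the target still gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for size in chunk_sizes_mb:
        # Start a new split when adding this chunk would exceed the target.
        if current and current_size + size > target_split_mb:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: seven chunks of very uneven size (in MB).
chunks = [10, 20, 30, 128, 5, 5, 50]
for split in combine_chunks(chunks, target_split_mb=128):
    print(sum(split), "MB from chunks", split)
```

With this input, seven uneven Mapper inputs become three splits of 60 MB, 128 MB and 60 MB, so each Mapper processes a more comparable amount of data.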

Mapper Memory Problem & Solution
• Requested Container Memory >> Task’s Consumed Memory
• Job requests 4 GB of container memory
• Job actually uses only 512 MB
• It waits longer to get 4 GB and then blocks 4 GB of resources!
• Request a lower container memory by setting
• mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
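
As a rough illustration of the arithmetic behind this advice, the sketch below derives a lower container request from the observed peak usage. The helper name `suggest_container_mb`, the 1.5x headroom and the 512 MB rounding step are assumptions chosen for the example, not values taken from Dr. Elephant or Hadoop.

```python
import math


def suggest_container_mb(peak_used_mb, headroom=1.5, step_mb=512):
    """Suggest a container size: peak usage plus headroom, rounded up to step_mb."""
    needed = peak_used_mb * headroom
    return int(math.ceil(needed / step_mb) * step_mb)


requested_mb = 4096   # what the job asks for (4 GB, as on the slide)
peak_used_mb = 512    # what the tasks actually consume

suggested_mb = suggest_container_mb(peak_used_mb)
print(f"requested {requested_mb} MB, suggested {suggested_mb} MB")
print(f"mapreduce.map.memory.mb={suggested_mb}")
```

Under these assumptions the job would request 1024 MB instead of 4096 MB, leaving the remaining memory available to other containers.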

How does a Rule work?
INPUT: Counters & Task Data
LOGIC: Some logic to compute a value
OUTPUT: Compare value against threshold levels
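
The three stages of a rule can be sketched as follows. This is a minimal illustration in Python, not Dr. Elephant's actual rule code; the function names, the severity labels and the threshold values are all assumptions made for the example.

```python
from statistics import mean

# Illustrative severity labels, ordered from best to worst.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def severity(value, thresholds):
    """OUTPUT: return the highest severity whose threshold the value reaches."""
    level = 0
    for i, limit in enumerate(thresholds, start=1):
        if value >= limit:
            level = i
    return SEVERITIES[level]


def memory_waste_rule(requested_mb, task_used_mb, thresholds=(1.5, 2, 4, 8)):
    """INPUT: requested container memory and per-task memory usage.

    LOGIC: ratio of requested memory to average consumed memory.
    """
    ratio = requested_mb / mean(task_used_mb)
    return ratio, severity(ratio, thresholds)


# A job that requests 4 GB while its tasks use about 512 MB each.
print(memory_waste_rule(4096, [512, 512, 512]))
```

Keeping the thresholds as a parameter mirrors the idea that threshold levels can be adjusted without changing the rule's logic.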

Adding a Custom Rule
1. Create a new Rule and test it.
2. Create a help page defining the rule, parameters to tune etc.
3. Add the details of the Rule in the HeuristicConf.xml file
   <heuristic>
     <applicationtype>Mapreduce</applicationtype>
     <heuristicname>Rule Name</heuristicname>
     <classname>path.to.rule.class</classname>
     <viewname>path.to.rule.help.page</viewname>
   </heuristic>
4. Run Dr. Elephant. It should now include the new rules.

What else can you customize?
● Rules and their threshold levels
● Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc.)
● Enable or disable Fetchers, and add new ones
● Extend to newer application types and job types

Automated Production Reviews | JIRA Bot
• Cluster for critical workloads
• Audit before deployment

Workflow monitoring and reports
• Monitor performance on each execution
• Compare behaviour across revisions
• Cost to Serve analysis

Open Source, April 2016
github.com / linkedin / dr-elephant
• Watchers: 60
• Stars: 262
• Forks: 109

Let’s collectively contribute!
• Pull Requests: 60+
• Contributors: 10+
• User Topics: 50+

Coming Soon
● Real time analysis of Jobs
● Analytics for Failed Jobs
● Visualizing Workflows through DAGs
● Support for Other schedulers and Frameworks

References
Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
Open Source GitHub Link: github.com/linkedin/dr-elephant
Mailing List & Gitter: dr-elephant-users, linkedin/dr-elephant
Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)

github.com / linkedin / dr-elephant
Thank You
Akshay Rai
https://in.linkedin.com/in/akshayrai09
