
Hadoop & Spark Performance tuning using Dr. Elephant


Dr. Elephant is a tool for the users of Hadoop to help them understand, analyze and tune their Hadoop/Spark applications easily, thus improving their productivity and the cluster’s efficiency. It analyzes the Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights on how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently.

Published in: Data & Analytics


  1. Dr. Elephant
     github.com/linkedin/dr-elephant
     Akshay Rai, Hadoop Dev Team
  2. Introduction
  3. Scaling Hadoop Infrastructure
  4. Scale and Optimize Hardware
     ● More users, more jobs, more resources
     ● Large investment in hardware
     ● Can’t keep upgrading and adding machines to solve the problem forever
     ● Some tuning is needed to get things running
  5. Users are more valuable than machines
     What do we do?
  6. Improve User Productivity
  7. User Productivity
     ● Freedom to experiment and run jobs on the cluster
     ● Build tools to help developers (Hadoop DSL, Resolvers for Pig/Hive)
       ○ Improve developer lifecycle
       ○ Also reduce unnecessary resource wastage
  8. The Tuning Problem
  9. How easy is it to tune a job?
     ● Problems are not obvious
     ● Critical information is scattered
     ● Inter-related settings
     ● Large parameter space
  10. Here’s what we learned!
  11. Expert Intervention
      ● Not enough support resources available
      ● Poor coverage
      ● Difficult to prioritize efforts
      ● Delays user development
      Random Suggestions
  12. Training is not at all easy
      ● Too many users
      ● Diverse backgrounds
      ● Scope is large and evolving
      ● Other responsibilities are more important
  13. Scaling Productivity is Hard!
  14. Dr. Elephant to the Rescue
  15. What does Dr. Elephant do?
      ● Automated performance monitoring and tuning tool
      ● Helps every user get the best performance from their jobs
      ● Highlights common mistakes
      ● Indicates best practices and tuning tips
      ● Provides a platform for other performance related tools
      ● Analyzes a hundred thousand jobs every day
  16. Architecture
  17. Dashboard
  18. Search
  19. Job Page
  20. MapReduce Report
  21. Failed Job
  22. Help Page
  23. Tuning Tips
  24. Awesome Features
  25. Simplified analysis of a flow’s historical executions
      ● Monitoring performance, resource usage and many others
      ● Comparing flows against previous executions
      ● Impact of tuning a specific parameter or changing a line of code
  26. Flow History
  27. Job History
  28. Heuristics
  29. How does a Heuristic work?
      ● Fetch Counters and Task Data
      ● Some logic to compute a value
      ● Compare value against threshold levels
  30. Heuristic Severity (Severity, Color, Description)
      ● CRITICAL: The job is in critical state and must be tuned
      ● SEVERE: There is scope for improvement
      ● MODERATE: There is scope for further improvement
      ● LOW: There is scope for a few minor improvements
      ● NONE: The job is safe. No tuning necessary
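
The three steps on the "How does a Heuristic work?" slide, together with the severity levels, can be sketched roughly as follows. This is a minimal illustration and not Dr. Elephant's actual code: the counter names (`cpu_ms`, `gc_ms`), the function names and the threshold values are all hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels named on the slide, ordered from best to worst."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_ascending(value, thresholds):
    """Map a value to a severity, given four ascending limits
    (LOW, MODERATE, SEVERE, CRITICAL). Hypothetical helper."""
    low, moderate, severe, critical = thresholds
    if value >= critical:
        return Severity.CRITICAL
    if value >= severe:
        return Severity.SEVERE
    if value >= moderate:
        return Severity.MODERATE
    if value >= low:
        return Severity.LOW
    return Severity.NONE


def gc_ratio_heuristic(counters, thresholds=(0.01, 0.02, 0.03, 0.04)):
    """Toy heuristic: share of CPU time spent in garbage collection.
    The threshold defaults are made up for illustration."""
    # Step 1: fetch counters (here, from a plain dict).
    cpu_ms = counters["cpu_ms"]
    gc_ms = counters["gc_ms"]
    # Step 2: some logic to compute a value.
    ratio = gc_ms / cpu_ms if cpu_ms else 0.0
    # Step 3: compare the value against threshold levels.
    return ratio, severity_ascending(ratio, thresholds)


if __name__ == "__main__":
    print(gc_ratio_heuristic({"cpu_ms": 1000, "gc_ms": 25}))
```

Keeping the threshold comparison in its own helper means each heuristic only has to supply a value and four limits.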
  31. Example | Mapper Data Skew
  32. Mapper Skew Problem
      ● Number of Mappers depends on the number of splits
      ● Varying size of splits can cause skewness in the Mapper Input
  33. Solution to Mapper Skewness
      ● Each Mapper should process the same amount of data
      ● Combine the small chunks and feed them to a single Mapper
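
As a rough illustration of the mapper skew problem and its solution, the sketch below measures skew from per-mapper input sizes and then packs small splits together. Both functions are assumptions made for illustration: splitting the sorted sizes into two halves is a simplification, not necessarily how Dr. Elephant groups mappers, and `combine_splits` only mimics the idea of combining small chunks for a single mapper.

```python
def skew_ratio(input_bytes):
    """Relative gap between the average input of the larger half
    and the smaller half of the mappers (0.0 means no skew)."""
    sizes = sorted(input_bytes)
    if len(sizes) < 2:
        return 0.0
    mid = len(sizes) // 2
    small, large = sizes[:mid], sizes[mid:]
    avg_small = sum(small) / len(small)
    avg_large = sum(large) / len(large)
    if avg_small == 0:
        return float("inf") if avg_large > 0 else 0.0
    return (avg_large - avg_small) / avg_small


def combine_splits(split_sizes, target_bytes):
    """Greedily pack consecutive splits into groups of at most
    target_bytes; a split larger than the target gets its own group."""
    groups, current, current_size = [], [], 0
    for size in split_sizes:
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(skew_ratio([10, 10, 90, 90]))
    print(combine_splits([10, 20, 30, 100, 5, 5], 64))
```

Each resulting group would be handed to one mapper, so the per-mapper input sizes end up closer to each other than the raw split sizes.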
  34. Example | Spark Executor Load Balance
  35. Diagram: a Spark Driver with Executors 1, 2 and 3, and an RDD with Partitions 1, 2 and 3
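
The slides do not spell out how executor load balance is measured, so the following is only one plausible sketch: it takes a per-executor metric (for example task time or input bytes) and reports how far the busiest and idlest executors are from the mean. The function name and the choice of metric are assumptions.

```python
def load_imbalance(per_executor):
    """Spread of a metric across executors, relative to its mean.
    0.0 means every executor carried the same load."""
    values = list(per_executor.values())
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return (max(values) - min(values)) / mean


if __name__ == "__main__":
    # Hypothetical task time in seconds for the three executors in the diagram.
    print(load_imbalance({"executor-1": 10, "executor-2": 20, "executor-3": 30}))
```

A value like this could then be compared against threshold levels in the same way as any other heuristic value.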
  36. Custom Heuristics
  37. Adding a New Heuristic
      1. Create a new heuristic and test it.
      2. Create a new view for the heuristic. For example, helpMapperSpill.scala.html
      3. Add the details of the heuristic in the HeuristicConf.xml file.
           <heuristic>
             <applicationtype>mapreduce</applicationtype>
             <heuristicname>Mapper GC</heuristicname>
             <classname>com.linkedin.dre.mapreduce.heuristics.MapperGC</classname>
             <viewname>views.html.help.mapreduce.helpGC</viewname>
           </heuristic>
      4. Run Dr. Elephant. It should now include the new heuristics.
  38. Configuring Heuristics/Threshold levels
        <heuristics>
          <heuristic>
            <applicationtype>mapreduce</applicationtype>
            <heuristicname>Mapper Data Skew</heuristicname>
            <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
            <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
            <params>
              <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
              <deviation_severity>2, 4, 8, 16</deviation_severity>
              <files_severity>1/8, 1/4, 1/2, 1</files_severity>
            </params>
          </heuristic>
        </heuristics>
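
To make the threshold configuration concrete, here is a small sketch that reads the XML from the slide and turns each comma-separated parameter into named limits. It assumes the four values correspond, in order, to the LOW, MODERATE, SEVERE and CRITICAL levels; the slide does not state that mapping. `load_thresholds` is a hypothetical helper, not part of Dr. Elephant.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# Configuration copied from the slide.
CONF = """
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
    <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
"""

# Assumed order of the four values in each parameter.
LEVELS = ("LOW", "MODERATE", "SEVERE", "CRITICAL")


def load_thresholds(xml_text):
    """Return {heuristic name: {param name: {level: limit}}}."""
    root = ET.fromstring(xml_text)
    result = {}
    for heuristic in root.findall("heuristic"):
        name = heuristic.findtext("heuristicname")
        params = heuristic.find("params")
        parsed = {}
        if params is not None:
            for param in params:
                # Fraction handles both "10" and "1/8" style values.
                limits = [float(Fraction(v.strip())) for v in param.text.split(",")]
                parsed[param.tag] = dict(zip(LEVELS, limits))
        result[name] = parsed
    return result


if __name__ == "__main__":
    for param, limits in load_thresholds(CONF)["Mapper Data Skew"].items():
        print(param, limits)
```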
  39. Elephagent
  40. Workflow monitoring and reports
      ● Performance characteristics change
        ○ Data Growth
        ○ Data distribution change
        ○ Hardware change
        ○ Incremental software change
      ● Monitor performance on each execution
      ● Compare behaviour across revisions
      ● Cost to Serve analysis
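
Monitoring each execution and comparing it with earlier ones could look something like the sketch below, which measures how far the latest run of a workflow drifts from the median of its previous runs. The metric (runtime in seconds) and the function name are assumptions for illustration; the slides do not describe how the comparison is actually done.

```python
from statistics import median


def regression_ratio(history, latest):
    """Relative change of the latest value against the median of
    previous executions; positive means the latest run got worse
    for a cost-like metric such as runtime."""
    if not history:
        return 0.0
    baseline = median(history)
    if baseline == 0:
        return 0.0
    return (latest - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical runtimes in seconds for three earlier executions.
    print(regression_ratio([100, 110, 90], 150))
```

Using the median rather than the previous run alone keeps a single unusual execution from skewing the baseline.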
  41. Production Reviews | JIRA Bot
      ● Separate cluster for critical workloads
      ● Audit before deployment
      ● Improved accuracy
      ● Faster turnaround
      ● Higher throughput
  42. Future Plans
  43. Upcoming
      ● Job Resource Usage and Wastage
      ● Job Wait time
      ● Real time analysis of a job
      ● Workflow DAG visualization
      ● Improved Spark heuristics
  44. References
      Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
      Open Source Github Link: github.com/linkedin/dr-elephant
      Mailing List: Dr-elephant-users
      Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA
  45. Thank You
  46. ©2014 LinkedIn Corporation. All Rights Reserved. © 2016
