Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytics with Akshay Rai


Is your job running slower than usual? Do you want to make sense of the thousands of Hadoop and Spark metrics? Do you want to monitor the performance of your flows, get alerts, and auto-tune them? These are common questions every Hadoop user asks, but no single solution addresses them all. We at LinkedIn faced many such issues and built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open source, is a performance monitoring and tuning tool for Hadoop and Spark. It aims to improve developer productivity and cluster efficiency by making it easier to tune jobs. Since being open sourced, it has been adopted by multiple organizations and followed with a lot of interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand its scope into a comprehensive monitoring, debugging, and tuning tool for Hadoop and Spark applications. We will talk about how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends. Open Source:

Published in: Data & Analytics

  1. Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytics #EUdev9 Akshay Rai, LinkedIn
  2. Overview • Why Dr. Elephant? • How does Dr. Elephant work? • Spark Integration Challenges • Spark Rules/Heuristics • Roadmap
  3. Grid Scale at LinkedIn • 2008 → 2018 • 1 cluster → 10+ clusters • 20 nodes → 1000s of nodes • 5 users → 1000s of active users • MapReduce → Pig, Hive, Spark, etc. • Few workflows → hundred-thousand workflows
  4. How much can we scale?
  5. Cluster and Job Tuning
  6. Challenges in Tuning • Too Many Frameworks • Too Many Different Knobs • Overwhelming Data • Scattered Data
  7. Developer Productivity vs. Cluster Efficiency
  8. Training Users to Tune • Train and educate Issues: • Newer Frameworks • More & More People • Varied Experience
  9. Expert Review • Experts review & approve jobs • Every job is reviewed Issues: • Not Scalable • Delayed Reviews • Manual Review Process
  10. Birth of Dr. Elephant • Identifies Common Issues • Gives Actionable Advice • Recommends Best Practices • Tracks Flow and Job Health
  11. @Dr. Elephant, please review • Ship it! • Automated Process
  12. (no text)
  13. How does it work? • Metrics Fetcher • History Server • Application Fetcher • Resource Manager • Run Rule 1 • Run Rule 2 • Run Rule 3 • Database • Dr. Elephant UI
  14. Spark Support in Dr. Elephant
  15. Spark Metrics Fetcher • Spark History Server
  16. Unstable Spark History Server • Reports delayed at peak load • Too many Dr. Elephant REST calls: SHS crashes
  17. Solution • Event Log Parser • Fix Spark History Server
  18. Event Log Parser • Replay HDFS Logs with Listeners Issues: • Large Log Files – More Resources – Backlog of jobs due to blocked threads • Parsing logic is not public
  19. Fix Spark History Server • Use SPARK-18085 fork • Uses LevelDB with improved caching • Some patches to add more metrics – JVM Used Memory – Executor disk space used
  20. Spark Rules / Heuristics
  21. Spark Memory
  22. JVM Used Memory Heuristic (Credits: Edwina Lu) • Reduce spark.executor.memory!
  23. Unified Memory Heuristic (Credits: Edwina Lu) • If JVM used memory is low, then reduce spark.executor.memory. • If JVM used memory is high, then reduce spark.memory.fraction.
  24. Spark Stage Heuristic
  25. Spark Configuration Heuristic
  26. More Heuristics • Skewness in the Memory Regions • Execution or Storage Spill • GC Time Heuristic
  27. Customization • Write and Configure New Rules • Easily Integrate with New Schedulers • More Fetchers • More Job Types
  28. Open Source • Active and Vibrant Community • Hangout Meetings • Meet-ups • Spark Discussions
  29. Roadmap • Specific Suggestions • Spark Debugging • More Spark Heuristics • Apache Incubation
  30. References • Engineering Blogs: – Open Source Blog: Click here – A year later after open source: Click here • Open Source GitHub Link: • Mailing List: dr-elephant-users • Hangout Meetings: – Agenda: Click here – Minutes: Click here • Hadoop Summit 2015 (Before Open Source): Click here • Fifth Elephant Conference 2016 (After Open Source): Click here
  31. Thank You ©2014 LinkedIn Corporation. All Rights Reserved.
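
The flow in the slides (a fetcher supplies metrics, each rule runs over them, and the findings are kept) can be sketched roughly as below, using the JVM Used Memory Heuristic as the example rule. This is a minimal illustration in Python, not Dr. Elephant's actual code: `AppMetrics`, `Finding`, `jvm_used_memory_rule`, `run_rules`, and the 50% threshold are hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AppMetrics:
    """Hypothetical container for what a fetcher might return for one Spark app."""
    app_id: str
    executor_memory_mb: int  # configured spark.executor.memory, in MB
    peak_jvm_used_mb: int    # highest JVM used memory seen on any executor, in MB


@dataclass
class Finding:
    """One piece of advice produced by a rule."""
    rule: str
    advice: str


def jvm_used_memory_rule(m: AppMetrics, low_ratio: float = 0.5) -> Optional[Finding]:
    """Flag apps whose peak JVM usage is well below the configured executor memory.

    low_ratio is an illustrative threshold, not a value taken from Dr. Elephant.
    """
    if m.executor_memory_mb <= 0:
        return None
    used_ratio = m.peak_jvm_used_mb / m.executor_memory_mb
    if used_ratio < low_ratio:
        return Finding(
            rule="JVM Used Memory",
            advice=(
                f"Peak JVM used memory is {used_ratio:.0%} of "
                "spark.executor.memory; consider reducing spark.executor.memory."
            ),
        )
    return None


Rule = Callable[[AppMetrics], Optional[Finding]]


def run_rules(m: AppMetrics, rules: List[Rule]) -> List[Finding]:
    """Run every rule against one app's metrics and collect the findings."""
    findings = []
    for rule in rules:
        finding = rule(m)
        if finding is not None:
            findings.append(finding)
    return findings
```

Keeping each rule as a plain function that takes metrics and returns an optional finding means a new heuristic can be added by appending it to the list passed to `run_rules`, which is one way to read the "Write and Configure New Rules" point on the Customization slide.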