
Leveraging Your Hadoop Cluster Better - Running Performant Code at Scale



Video and slides synchronized, mp3 and slide download available at URL

Michael Kopp explains how to run performant code at scale with Hadoop and how to analyze and optimize Hadoop jobs. Filmed at

Michael Kopp has over 12 years of experience as an architect and developer. He is a technology strategist in CompuwareAPM's center of excellence, where he focuses on the architecture and performance of cloud and big data environments. In this role he drives the dynaTrace Enterprise product strategy and works closely with key customers in implementing APM in these environments.

Published in: Technology, Business


  1. 1. Leveraging your Hadoop cluster better: running efficient code at scale. Michael Kopp, Technology Strategist
  2. 2. News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on! /optimize-hadoop-jobs
  3. 3. Presented at QCon New York Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Why do I give this talk?
  5. 5. Effectiveness vs. Efficiency • Effective: Adequate to accomplish a purpose; producing the intended or expected result 1) • Efficient: Performing or functioning in the best possible manner with the least waste of time and effort 1) …and resources 1)
  6. 6. An Efficient Hadoop Cluster • Is Effective → Gets the job done (in time) • Highly Utilized when Active (unused resources are wasted resources)
  7. 7. What is an efficient Hadoop Job? …efficiency is a measurable concept, quantitatively determined by the ratio of output to input… • same output in less time • less resource usage with same output and same time • more output with same resources in the same time Efficient jobs are effective without adding more hardware!
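The output-to-input ratio on this slide can be made concrete with a small calculation. The sketch below is only an illustration: the function name `efficiency`, the choice of slot-hours as the input measure, and all numbers are assumptions, not figures from the talk.

```python
def efficiency(output_units: float, slots: int, hours: float) -> float:
    """Output per slot-hour: a simple output-to-input ratio (assumed metric)."""
    if slots <= 0 or hours <= 0:
        raise ValueError("slots and hours must be positive")
    return output_units / (slots * hours)


# Hypothetical baseline: 1,000 output units on 100 slots in 10 hours.
baseline = efficiency(1000, slots=100, hours=10)
# Same output in less time.
faster = efficiency(1000, slots=100, hours=5)
# Same output and same time with fewer resources.
leaner = efficiency(1000, slots=50, hours=10)
# More output with the same resources in the same time.
bigger = efficiency(2000, slots=100, hours=10)
```

Each of the three variants doubles the ratio relative to the baseline, which is why the slide treats them as equivalent routes to a more efficient job.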
  8. 8. Efficiency – Using everything we have
  9. 9. Utilization and Distribution • CPU spikes but no real overall usage • Not fully utilized
  10. 10. Reasons why your Cluster is not utilized • Map and Reduce Slots • Data Distribution • Bottlenecks – Spill – Shuffle – Reduce – Code
  11. 11. Which Job(s) are dominating the cluster?
  12. 12. Which User? Which Pool?
  13. 13. Pushing the Boundaries – High Utilization • Figure out Spill and Shuffle Bottlenecks • Remove Idle Times, Wait Times, Sync Times • Hotspot Analysis Tools can pinpoint those Items quickly
  14. 14. Identify the Jobs
  15. 15. Job Bottlenecks – Mapping Phase • Mapper is waiting for Spill Thread • io.sort.spill.percent • io.sort.mb • Wait Time?
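To see why these two settings matter, it can help to estimate how many spill files a single map task writes. The following is a rough sketch, not Hadoop's actual accounting: the function name `estimated_spills` is hypothetical, the defaults of 100 MB and 0.80 are the values commonly documented for Hadoop 1.x, and the model assumes the whole sort buffer holds record data, ignoring the separate record-metadata area that can trigger spills earlier.

```python
import math


def estimated_spills(map_output_mb: float,
                     io_sort_mb: int = 100,
                     io_sort_spill_percent: float = 0.80) -> int:
    """Rough number of spill files written by one map task.

    Simplified model (assumption): a spill starts each time the buffer
    fills up to io_sort_mb * io_sort_spill_percent.
    """
    if map_output_mb <= 0:
        return 0
    threshold_mb = io_sort_mb * io_sort_spill_percent
    return math.ceil(map_output_mb / threshold_mb)


# Hypothetical mapper producing 400 MB of output.
spills_default = estimated_spills(400)                 # small buffer
spills_tuned = estimated_spills(400, io_sort_mb=512)   # larger buffer
```

Under this model a larger `io.sort.mb` cuts the number of spills, and with it the merge work at the end of the map task, at the cost of more heap per mapper.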
  16. 16. Job Bottleneck – Shuffle • Reducer is waiting for memory • mapred.job.shuffle.input.buffer.percent • mapred.reduce.parallel.copies • Wait Time?
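A back-of-the-envelope view of the two shuffle settings named here might look like the sketch below. Both helper functions are hypothetical, the defaults of 0.70 and 5 are the values commonly documented for Hadoop 1.x, and the "rounds" model assumes every fetch takes the same time, which real clusters will not honor.

```python
import math


def shuffle_buffer_mb(reducer_heap_mb: float,
                      shuffle_input_buffer_percent: float = 0.70) -> float:
    """Heap share a reducer may use to hold fetched map output in memory."""
    return reducer_heap_mb * shuffle_input_buffer_percent


def copy_rounds(map_outputs: int, parallel_copies: int = 5) -> int:
    """Crude model: fetches happen in rounds of `parallel_copies` at a time."""
    if parallel_copies <= 0:
        raise ValueError("parallel_copies must be positive")
    return math.ceil(map_outputs / parallel_copies)


# Hypothetical reducer with a 1,000 MB heap pulling from 1,000 map tasks.
buffer_mb = shuffle_buffer_mb(1000)
rounds_default = copy_rounds(1000)
rounds_tuned = copy_rounds(1000, parallel_copies=20)
```

The point of the sketch is only the direction of the effect: more buffer means fewer reducers stalled waiting for memory, and more parallel copies means fewer sequential fetch rounds.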
  17. 17. Cluster after simple “Fixes”
  18. 18. Jobs are now resource bound
  19. 19. Efficiency – Use what we have better
  20. 20. Performance Optimization 1. Identify Bounding Resource 2. Optimize and reduce its usage 3. Identify new Bounding Resource Hot Spot Analysis Tools are again the best way to go
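The first step of this loop can be sketched as picking the most heavily used resource from a set of utilization measurements. The function name `bounding_resource` and the sample values are assumptions for illustration; in practice the numbers would come from monitoring or a hotspot analysis tool.

```python
from typing import Dict


def bounding_resource(utilization: Dict[str, float]) -> str:
    """Return the resource with the highest utilization (0.0 to 1.0)."""
    if not utilization:
        raise ValueError("no utilization samples given")
    return max(utilization, key=utilization.get)


# Hypothetical measurements before and after optimizing CPU usage.
before = {"cpu": 0.95, "disk": 0.40, "network": 0.25}
after = {"cpu": 0.50, "disk": 0.70, "network": 0.30}

first_bound = bounding_resource(before)
next_bound = bounding_resource(after)
```

Once the first bounding resource is optimized, a different one takes its place, which is why the slide describes the process as a loop.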
  21. 21. Identify Hotspots – which Phase
  22. 22. Cluster Usage
  23. 23. Mapping Phase Hotspot in Outage Analyzer: 70% our own code!
  24. 24. CPU Hotspot in Mapping Phase • 10% of Mapping CPU • 20% of Mapping CPU
  25. 25. Hotspot Analysis of Reduce Phase Wow!
  26. 26. Three simple Hotspots
  27. 27. Before Fix: 6h 30 minutes…
  28. 28. …After Fix: 3.5 hours. Utilization went up!
  29. 29. Map Reduce Run Comparison • 10% of Mapping CPU • Reducers running • 3 Reducers running
  30. 30. Conclusion • Understand your bottleneck! • Understand the bounding resource • Small fixes can have huge yields…but require tools
  31. 31. What else did we find? • Short Mappers due to small files – High merge time due to large number of spills – Too much data shuffle → add Combiner but… • Tried Task reuse – Nearly no effect? – 5% less Map Time, but…?
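The link between small files and short mappers can be illustrated by estimating the number of map tasks for a given set of input files. This is a simplified sketch: the function name `estimated_map_tasks` is hypothetical, the 64 MB block size is the commonly documented Hadoop 1.x default, and the model assumes one split per block with at least one split per file, ignoring split slop and combining input formats.

```python
import math
from typing import Iterable


def estimated_map_tasks(file_sizes_mb: Iterable[float],
                        block_size_mb: int = 64) -> int:
    """Roughly one map task per block, and at least one per file."""
    return sum(max(1, math.ceil(size / block_size_mb))
               for size in file_sizes_mb)


# Hypothetical input: the same 1,000 MB stored two different ways.
many_small = estimated_map_tasks([1] * 1000)   # 1,000 files of 1 MB
one_large = estimated_map_tasks([1000])        # a single 1,000 MB file
```

With many small files, each mapper processes very little data, so task startup overhead dominates the actual work.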
  32. 32. Why did the reuse not help? • Map Phase over • 5 more reducers • shuffle
  33. 33. What’s next? • Bigger Files • Add Combiners to reduce shuffle
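How a combiner reduces shuffle volume can be shown with a small word-count simulation. This is plain Python rather than Hadoop code; the function names `map_words` and `combine` and the sample input are assumptions for illustration. A combiner like this is only valid when the reduce operation is associative and commutative, as summing is.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]


def map_words(lines: Iterable[str]) -> List[Pair]:
    """Mapper: emit (word, 1) for every word in the input split."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs: Iterable[Pair]) -> List[Pair]:
    """Combiner: sum the counts per key on the map side."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


# Hypothetical input split handled by a single mapper.
split = ["a b a", "b a"]
map_output = map_words(split)    # one pair per word
combined = combine(map_output)   # one pair per distinct word
```

Without the combiner, five pairs would be shuffled to the reducers; with it, only two are, and the final counts are unchanged.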
  34. 34. What about Hive or PIG? • Identify which stage is slow • Identify configuration issues • Identify HBase or UDF issues
  35. 35. HBase PIG Job lasting for 15 hours…
  36. 36. HBase major Hotspot… Wow! Roundtrip for every single Row
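The cost of one roundtrip per row can be estimated with simple arithmetic. The slides do not say what the actual fix was; the sketch below assumes rows are fetched in batches, as HBase's scanner caching (`Scan.setCaching`) allows. The function name `scan_round_trips` and the row counts are hypothetical.

```python
def scan_round_trips(rows: int, rows_per_fetch: int = 1) -> int:
    """Number of client-server roundtrips needed to read `rows` rows."""
    if rows < 0 or rows_per_fetch <= 0:
        raise ValueError("rows must be >= 0 and rows_per_fetch > 0")
    # Integer ceiling division avoids floating-point rounding.
    return (rows + rows_per_fetch - 1) // rows_per_fetch


# Hypothetical table scan of one million rows.
per_row = scan_round_trips(1_000_000)                        # one row per trip
batched = scan_round_trips(1_000_000, rows_per_fetch=1000)   # assumed batch size
```

When network latency dominates, cutting the number of roundtrips by a factor of a thousand can shorten a job dramatically, which is consistent with the scale of improvement the talk reports.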
  37. 37. Cluster Utilization after fix
  38. 38. Performance after Fix: 75 minutes!
  39. 39. Summary • Drive up utilization • Remove Blocks and Sync points • Optimize Big Hotspots
  40. 40. THANK YOU Michael Kopp, Technology Strategist @mikopp