Optimizing Hadoop Map Reduce job performance
Presentation Transcript

  • Optimizing Hadoop Map Reduce job performance
    Constantin Ciureanu, Software Architect (ciureanu@optymyze.com)
    Project “Hadoop Data Transformation”
  • Generic Goals - 1
    • Correctly split the input data to maximize mapper performance: the split size directly determines the number of mappers, so calculate it so that each map task runs for a reasonable amount of time
    • Correctly calculate the number of reducers
    • Skewed data distribution is a huge problem, placing unbalanced load on some mappers or reducers - picture a complex Tetris brick (L-shaped or cross)
    • Try to spill just once (read / process / write to disk)
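The split-size calculation mentioned above can be sketched with back-of-envelope arithmetic. The numbers below (per-mapper throughput, target task duration, input size) are illustrative assumptions, not Hadoop defaults:

```python
# Hypothetical sizing sketch: choose a split size so each map task runs for a
# target duration, given an assumed per-mapper processing rate. Hadoop launches
# roughly one map task per input split, so split size drives the mapper count.

def split_size_for_target(mapper_mb_per_sec, target_task_sec):
    """Split size in bytes aiming for ~target_task_sec per map task."""
    return mapper_mb_per_sec * 1024 * 1024 * target_task_sec

def num_mappers(total_input_bytes, split_size_bytes):
    """One map task per split (ceiling division)."""
    return -(-total_input_bytes // split_size_bytes)

total_input = 500 * 1024**3  # assume 500 GB of input
split = split_size_for_target(mapper_mb_per_sec=25, target_task_sec=120)
print(split // 1024**2, "MB per split ->", num_mappers(total_input, split), "mappers")
```

With these assumed numbers each split is 3000 MB, yielding 171 map tasks of roughly two minutes each, inside the 1-3 minute window recommended on the next slide.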
  • Generic Goals - 2
    • As a rule of thumb, no task should take more than 5 minutes to complete
    • That is a hard goal to reach, but at the very least every task should report its status before the 10-minute timeout passes, so it doesn't get killed and rerun - wasting important resources (in this case 10 minutes of processing time)
    • Aim to have map tasks running 1-3 minutes:
      – < 1 minute - resources are wasted just starting tasks
      – > 3 minutes - not enough parallelism, hard to share fleet resources
    • Bottleneck: by default no reducer can start until all mappers have finished (fix: use slow start for reducers, launching them after a given percentage of mappers have completed)
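The slow-start fix above is a single configuration property. A minimal mapred-site.xml sketch, using the Hadoop 1.x property name (in Hadoop 2 / YARN the equivalent is mapreduce.job.reduce.slowstart.completedmaps); the 0.80 threshold is an illustrative choice, not a recommendation from the deck:

```xml
<!-- Launch reducers once 80% of map tasks have completed, instead of
     waiting for all of them (default is 0.05, i.e. reducers start very
     early and may sit idle holding slots). -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```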
  • Performance tuning
    • Carefully tune and optimize the code that gets executed for each row (e.g. in mappers / reducers), because that specific code potentially runs millions of times
    • Adding more machines instantly increases the Hadoop fleet's power; adding more developers / optimizing code for performance also helps, but a bit less :)
    • Use enough memory for each task (e.g. mapred.reduce.child.java.opts = -Xmx500m; the default value of 200m is not enough when processing large data) - this parameter can be calculated on the fly from the input data size
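The heap setting named above goes into mapred-site.xml (or is passed per job with -D). A minimal sketch using the property and value from the slide:

```xml
<!-- Raise the reduce-task JVM heap from the 200m default to 500m.
     The map-side equivalent is mapred.map.child.java.opts. -->
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx500m</value>
</property>
```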
  • Data related optimizations - 1
    • Reduce the quantity of data read from and written to disk!
    • For example, shorter rows help: do not repeat long strings, use numeric constants / types instead!
    • Use compression for intermediate output (mapred.compress.map.output = true)
      – Advantage: less data to write to disk / transfer over the network
      – Disadvantage: higher CPU load
    • Use compression for output files (GZip / Deflate etc.)
    • Adding more machines always helps! Each new machine brings new processing power but also new local storage!
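The two compression tips above map to a handful of properties. A mapred-site.xml sketch using the Hadoop 1.x names; the codec choices (Snappy for the shuffle, Gzip for final output) are common assumptions, not prescribed by the deck:

```xml
<!-- Compress intermediate map output before it hits disk / the network.
     Snappy trades a little CPU for fast compression on the shuffle path. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

<!-- Compress the final job output files (GZip here, Deflate also works). -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```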
  • Data related optimizations - 2
    • Definitely get rid of the NAS and use local dedicated hard drives!
    • Use combiners (between mappers and reducers) to minimize the data that flows over the network - the Cascading framework has no such concept, but at least part of an Aggregator block may get executed, distributed, in the mapper phase
    • Whenever possible, use in-memory data structures for faster joins or lookups
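What a combiner buys you can be shown with a plain word-count simulation (illustrative Python, not Hadoop API code): the combiner pre-aggregates each mapper's output locally, so far fewer records cross the network in the shuffle.

```python
# Simulated map phase + combiner for word count. In real Hadoop the combiner
# is typically the reducer class applied to each mapper's local output.
from collections import Counter

def map_phase(lines):
    # Mapper: emit one (word, 1) pair per word occurrence.
    return [(w, 1) for line in lines for w in line.split()]

def combine(pairs):
    # Combiner: sum counts per key locally, shrinking shuffle volume.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

raw = map_phase(["a b a", "b b c"])
combined = combine(raw)
print(len(raw), "records shuffled without combiner,", len(combined), "with")
```

Here 6 raw records collapse to 3 combined ones; on real data with heavy key repetition the reduction (and the network saving) is far larger. This only works because summation is associative and commutative - a combiner must tolerate being applied zero or more times.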
  • Monitoring resources
    • Idle resources are the biggest problem!
    • Use monitoring software - even simple utilities help a lot:
      – Cloudera Manager
      – Ganglia
      – top
      – iostat -dmx 5
      – …
    • Monitor all resources (CPU, memory, disk I/O, network bandwidth and latency)
    • Performance tuning cycle: run / analyze / benchmark, identify the bottleneck (usually a different one than in the previous iteration :)), tune to improve resource usage (e.g. insufficient CPU usage) ... then run again. Also test for scalability
    • Use some sort of profiler (easiest case: add -Xprof to mapred.child.java.opts, then check the stdout logs)
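The -Xprof trick above is again just a JVM flag appended to the task options. A sketch of the configuration; note -Xprof is a HotSpot flag of that era (it was removed in later JDKs), and the heap value is carried over from the earlier slide:

```xml
<!-- Enable the JVM's flat profiler for every task JVM; per-method timing
     is written to the task's stdout log when the JVM exits. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx500m -Xprof</value>
</property>
```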
  • Conclusions & final words
    • With appropriate tuning, some MR jobs can finish as much as 2 times faster!
    • Optimize for future data too - plan the sizing of all resources (storage, CPU, memory, network bandwidth), because Big Data inherently grows continuously (both in accumulated size and in daily incoming data size)
    • Move gradually toward YARN (Hadoop 2.0)
  • Questions?
    • Interesting links:
      – http://download.microsoft.com/download/1/C/6/1C66D134-1FD5-4493-90BD-98F94A881626/Hadoop%20Job%20Optimization%20(Microsoft%20IT%20white%20paper).docx