Optimizing MapReduce Job performance

4 Comments
  • I believe slide 11 is wrong: the KVINDICES buffer holds the partition, key offset (KeyStart), and value offset (ValStart).

    The KVOFFSETS buffer holds indexes into the records in KVINDICES, in the form 0, 3, 6, 9 (because each 'record' in kvindices occupies 3 ints / 12 bytes). KVOFFSETS is also the buffer used to sort the records when the sortAndSpill method is called.

    All this at least in Hadoop v1.2.1.
  • nice
  • excellent
  • Hadoop job tuning

Optimizing MapReduce Job performance

  1. Optimizing MapReduce Job Performance, June 14, 2012. Todd Lipcon [@tlipcon]
  2. Introductions
     • Software Engineer at Cloudera since 2009
     • Committer and PMC member on HDFS, MapReduce, and HBase
     • Spend lots of time looking at full-stack performance
     • This talk is to help you develop faster jobs
       – If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com
  3. Aspects of Performance
     • Algorithmic performance
       – big-O, join strategies, data structures, asymptotes
     • Physical performance
       – Hardware (disks, CPUs, etc.)
     • Implementation performance
       – Efficiency of code, avoiding extra work
       – Make good use of the available physical performance
  4. Performance fundamentals
     • You can't tune what you don't understand
       – MR's strength as a framework is its black-box nature
       – To get optimal performance, you have to understand the internals
     • This presentation: understanding the black box
  5. Performance fundamentals (2)
     • You can't improve what you can't measure
       – Ganglia/Cacti/Cloudera Manager/etc. are a must
       – Top 4 metrics: CPU, memory, disk, network
       – MR job metrics: slot-seconds, CPU-seconds, task wall-clock time, and I/O
     • Before you start: run jobs, gather data
  6. Graphing bottlenecks (cluster graph annotations):
     • This job might be CPU-bound in the map phase; most jobs are not CPU-bound
     • Plenty of free RAM: perhaps we can make better use of it?
     • Fairly flat-topped network: a bottleneck?
  7. Performance tuning cycle (diagram): Run job → Identify bottleneck → Address bottleneck → repeat
     • Identify bottleneck: graphs, job counters, job logs, profiler results
     • Address bottleneck: tune configs, improve code, rethink algorithms
     In order to understand these metrics and make changes, you need to understand MR internals.
  8. MR from 10,000 feet (pipeline diagram): InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  9. MR from 10,000 feet (pipeline diagram, repeated): InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  10. Map-side sort/spill overview
      • Goal: when complete, the map task outputs one sorted file
      • What happens when you call OutputCollector.collect(K,V)? (diagram)
        1. The in-memory buffer (MapOutputBuffer) holds serialized, unsorted key-values
        2. The output buffer fills up; its contents are sorted, partitioned, and spilled to disk as an IFile
        3. The map task finishes; all IFiles are merged into a single IFile per task (map-side merge)
  11. Zooming further: MapOutputBuffer (Hadoop 1.0)
      The io.sort.mb buffer is split into metadata buffers (io.sort.record.percent * io.sort.mb) and a data buffer ((1 - io.sort.record.percent) * io.sort.mb):
      • kvoffsets: 4 bytes/record, one indirect-sort index per record
      • kvindices: 12 bytes/record, a (partition, key offset, value offset) triple per record
      • kvbuffer: R bytes/record, the raw serialized (key, value) pairs
      (As the first comment above notes, the slide's original labels for kvoffsets and kvindices were swapped; the layout here follows the Hadoop 1.2.1 code.)
  12. MapOutputBuffer spill behavior
      • Memory is limited: must spill
        – If either the kvbuffer or the metadata buffers fill up, "spill" to disk
        – In fact, we spill before they are full (in another thread): configure io.sort.spill.percent
      • Performance impact
        – If we spill more than once, we must re-read and re-write all data: 3x the I/O!
        – #1 goal for map task optimization: spill once!
  13. Spill counters on map tasks
      • Ratio of "Spilled Records" vs. "Map Output Records"
        – if unequal, you are doing more than one spill
      • FILE: Number of bytes read/written
        – gives a sense of the I/O amplification due to spilling
      (A small sketch of checking these counters follows below.)
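      A minimal sketch of comparing the two counters named above after a job completes. It uses the Hadoop 2.x TaskCounter enum and the new-API Job class; older releases expose the same counters under different class names, so treat the exact identifiers as assumptions.

          import org.apache.hadoop.mapreduce.Counters;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.TaskCounter;

          public class SpillCheck {
              // Report whether the job spilled more records than it emitted from the maps.
              // Note: the job-level SPILLED_RECORDS counter aggregates map and reduce spills;
              // for a precise per-map check, look at individual task counters in the web UI.
              public static void reportSpillRatio(Job job) throws Exception {
                  Counters counters = job.getCounters();
                  long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
                  long mapOut  = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
                  if (spilled > mapOut) {
                      System.out.printf("Spilled %d records vs %d map output records:"
                              + " some tasks are spilling more than once%n", spilled, mapOut);
                  }
              }
          }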
  14. Spill logs on map tasks ("record full = true" indicates that the metadata buffers filled up before the data buffer):
      2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
      2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
      2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
      2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
      2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
      2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
      2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
      2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
      2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
      2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
      Three spills total! Maybe we can do better?
  15. Tuning to reduce spills
      • Parameters:
        – io.sort.mb: total buffer space
        – io.sort.record.percent: proportion between the metadata buffers and key/value data
        – io.sort.spill.percent: threshold at which a spill is triggered
        – Total map output generated: can you use more compact serialization?
      • Optimal settings depend on your data and available RAM! (A configuration sketch follows below.)
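      A minimal sketch of setting these knobs on a job's Configuration, using the Hadoop 1.x property names from the slide. The numeric values are illustrative assumptions; derive yours from the job's counters as described on the next slides.

          import org.apache.hadoop.conf.Configuration;

          public class SpillTuning {
              // Apply map-side sort buffer settings before submitting the job.
              public static void applySpillSettings(Configuration conf) {
                  conf.setInt("io.sort.mb", 150);                 // total sort buffer in MB; must fit in the map heap
                  conf.setFloat("io.sort.record.percent", 0.14f); // share of the buffer reserved for per-record metadata
                  conf.setFloat("io.sort.spill.percent", 0.99f);  // fill fraction at which the background spill starts
              }
          }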
  16. Setting io.sort.record.percent
      • Common mistake: the metadata buffers fill up way before the kvdata buffer
      • Optimal setting:
        – io.sort.record.percent = 16 / (16 + R)
        – R = average record size: divide the "Map Output Bytes" counter by the "Map Output Records" counter
      • The default (0.05) is usually too low (optimal for ~300-byte records)
      • Hadoop 2.0: this is no longer necessary!
        – see MAPREDUCE-64 for the gory details
      (A small sketch of the formula follows below.)
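      The formula above, written out as a small helper. The counter values would come from a previous run of the job; the class and method names are illustrative.

          public class RecordPercent {
              // io.sort.record.percent = 16 / (16 + R), where R is the average map output record size.
              public static float recordPercent(long mapOutputBytes, long mapOutputRecords) {
                  double r = (double) mapOutputBytes / mapOutputRecords; // R in bytes per record
                  return (float) (16.0 / (16.0 + r));
              }

              public static void main(String[] args) {
                  // 100-byte records, as in the terasort example on the next slide: 16 / 116 ≈ 0.138
                  System.out.println(recordPercent(100, 1));
              }
          }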
  17. Tuning Example (terasort)
      • Map input size = output size
        – 128MB block = 1,342,177 records, each 100 bytes
        – metadata: 16 * 1,342,177 = 20.9MB
      • io.sort.mb
        – 128MB data + 20.9MB meta = 148.9MB
      • io.sort.record.percent
        – 16 / (16 + 100) = 0.138
      • io.sort.spill.percent = 1.0
  18. More tips on spill tuning
      • The biggest win is going from 2 spills to 1 spill
        – 3 spills is approximately the same speed as 2 spills (same I/O amplification)
      • Calculate whether it's even possible, given your heap size
        – io.sort.mb has to fit within your Java heap (plus whatever RAM your Mapper needs, plus ~30% for overhead)
      • Only bother if this is the bottleneck!
        – Look at the map task logs: if the merge step at the end takes a fraction of a second, it's not worth it!
        – Typically has the most impact on jobs with a big shuffle (sort/dedup)
  19. MR from 10,000 feet (pipeline diagram, repeated): InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  20. Reducer fetch tuning
      • Reducers fetch map output via HTTP
      • Tuning parameters:
        – Server side: tasktracker.http.threads
        – Client side: mapred.reduce.parallel.copies
      • It turns out this is not so interesting
        – follow the best practices from Hadoop: The Definitive Guide
  21. Improving fetch bottlenecks
      • Reduce intermediate data
        – Implement a Combiner: less data transfers faster
        – Enable intermediate compression: Snappy is easy to enable and trades a little CPU for less I/O and network (a sketch follows below)
      • Double-check for network issues
        – Frame errors, NICs auto-negotiated to 100mbit, etc.: one or two slow hosts can bottleneck a job
        – Tell-tale sign: all maps are done, and the reducers sit in the fetch stage for many minutes (look at the logs)
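      A minimal sketch of enabling Snappy compression for intermediate (map output) data, using the Hadoop 1.x property names. It assumes the native Snappy libraries are available on the cluster nodes.

          import org.apache.hadoop.conf.Configuration;

          public class IntermediateCompression {
              // Compress map output before it is shuffled to the reducers.
              public static void enableSnappyMapOutput(Configuration conf) {
                  conf.setBoolean("mapred.compress.map.output", true);
                  conf.set("mapred.map.output.compression.codec",
                           "org.apache.hadoop.io.compress.SnappyCodec");
              }
          }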
  22. MR from 10,000 feet (pipeline diagram, repeated): InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  23. Reducer merge (Hadoop 1.0) (diagram)
      Remote map outputs are fetched via HTTP. Does the segment fit in RAM?
      • Yes: fetch to RAM (RAMManager); RAM-to-disk merges run as data accumulates
      • No: fetch directly to local disk as an IFile
      1. Data accumulated in RAM is merged to disk
      2. If too many disk files accumulate, they are re-merged (disk-to-disk merges)
      3. Segments from RAM and disk are merged into a single iterator feeding the reducer code
  24. Reducer merge triggers
      • RAMManager
        – Total buffer size: mapred.job.shuffle.input.buffer.percent (default 0.70, a percentage of the reducer heap size)
      • Mem-to-disk merge triggers:
        – The RAMManager is mapred.job.shuffle.merge.percent full (default 0.66)
        – Or mapred.inmem.merge.threshold segments have accumulated (default 1000)
      • Disk-to-disk merge
        – io.sort.factor on-disk segments pile up (fairly rare)
  25. Final merge phase
      • MR assumes that the reducer code needs the full heap's worth of RAM
        – Spills all in-RAM segments before running user code, to free memory
      • This isn't true if your reducer is simple
        – e.g. sort, simple aggregation, etc. with no state
      • Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep reducer input data in RAM
  26. Reducer merge counters
      • FILE: number of bytes read/written
        – Ideally close to 0 if you can fit in RAM
      • Spilled records:
        – Ideally close to 0. If significantly more than reduce input records, the job is hitting a multi-pass merge, which is quite expensive
  27. Tuning reducer merge
      • Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep data in RAM if you don't keep any state in the reducer
      • Experiment with setting mapred.inmem.merge.threshold to 0 to avoid spills
      • Hadoop 2.0: experiment with mapreduce.reduce.merge.memtomem.enabled
      (A configuration sketch follows below.)
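      A minimal sketch of the reduce-side settings above, using the Hadoop 1.x property names. This is only appropriate when the reducer keeps little or no state of its own, since the retained map output competes with it for heap.

          import org.apache.hadoop.conf.Configuration;

          public class ReduceMergeTuning {
              // Keep reduce input in memory through the final merge instead of spilling it to disk.
              public static void keepReduceInputInRam(Configuration conf) {
                  conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.70f); // heap share for retained map output
                  conf.setInt("mapred.inmem.merge.threshold", 0);                 // disable the segment-count spill trigger
              }
          }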
  28. Rules of thumb for the number of maps/reduces
      • Aim for map tasks running 1-3 minutes each
        – Too small: wasted startup overhead, less efficient shuffle
        – Too big: not enough parallelism, harder to share the cluster
      • Reduce task count:
        – Large reduce phase: base it on the cluster slot count (a few GB per reducer)
        – Small reduce phase: fewer reducers will result in a more efficient shuffle phase
      (A sketch of setting the reducer count follows below.)
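      A small sketch of applying the reduce-count rule of thumb with the new-API Job class. The 0.95 multiplier (leaving headroom for failed or speculative tasks) is a common convention, not something stated on the slide.

          import org.apache.hadoop.mapreduce.Job;

          public class ReducerCount {
              // Size a large reduce phase from the cluster's reduce slot count.
              public static void setReducers(Job job, int clusterReduceSlots) {
                  int reducers = Math.max(1, (int) (clusterReduceSlots * 0.95));
                  job.setNumReduceTasks(reducers);
              }
          }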
  29. MR from 10,000 feet (pipeline diagram, repeated): InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  30. Tuning Java code for MR
      • Follow general Java best practices
        – String parsing and formatting is slow
        – Guard debug statements with isDebugEnabled()
        – StringBuffer.append vs. repeated string concatenation
      • For CPU-intensive jobs, build a test harness/benchmark outside MR
        – Then use your favorite profiler
      • Check for GC overhead: -XX:+PrintGCDetails -verbose:gc
      • Easiest profiler: add -Xprof to mapred.child.java.opts, then look at the stdout task log
      (A small sketch of the first two tips follows below.)
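      An illustrative hot-path helper applying two of the tips above, using the commons-logging API that Hadoop 1.x itself uses; StringBuilder is the unsynchronized equivalent of the StringBuffer mentioned on the slide. The class and method names are made up for the example.

          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          public class HotPathTips {
              private static final Log LOG = LogFactory.getLog(HotPathTips.class);

              static String describe(String key, long count) {
                  // Guard debug statements so the message string is never built when DEBUG is off.
                  if (LOG.isDebugEnabled()) {
                      LOG.debug("processing key=" + key + " count=" + count);
                  }
                  // Append to one builder instead of repeated string concatenation.
                  StringBuilder sb = new StringBuilder();
                  sb.append(key).append('\t').append(count);
                  return sb.toString();
              }
          }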
  31. Other tips for fast MR code
      • Use the most compact and efficient data formats
        – LongWritable is way faster than parsing text
        – BytesWritable instead of Text for SHA1 hashes/dedup
        – Avro/Thrift/Protobuf for complex data, not JSON!
      • Write a Combiner and a RawComparator
      • Enable intermediate compression (Snappy/LZO)
      (A minimal Combiner sketch follows below.)
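      A minimal Combiner sketch for a job whose map output is (Text, LongWritable) counts; the class name is illustrative. Summing on the map side shrinks the data that has to be shuffled and merged.

          import java.io.IOException;
          import org.apache.hadoop.io.LongWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Reducer;

          public class SumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
              private final LongWritable total = new LongWritable();

              @Override
              protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                      throws IOException, InterruptedException {
                  long sum = 0;
                  for (LongWritable v : values) {
                      sum += v.get();  // partial sum for this key on the map side
                  }
                  total.set(sum);
                  context.write(key, total);
              }
          }

      Wire it up with job.setCombinerClass(SumCombiner.class); for simple aggregations the same class often doubles as the reducer.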
  32. Summary
      • Understanding MR internals helps you understand configurations and tuning
      • Focus your tuning effort on actual bottlenecks, following a scientific approach
      • Don't forget that you can always just add nodes!
        – Spending 1 month of engineer time to make your job 20% faster is not worth it if you have a 10-node cluster!
      • We're working on simplifying this where we can, but deep understanding will always allow more efficient jobs
  33. Questions? @tlipcon / todd@cloudera.com
