Optimizing MapReduce Job Performance
Todd Lipcon [@tlipcon]
June 14, 2012
Introductions
• Software Engineer at Cloudera since 2009
• Committer and PMC member on HDFS, MapReduce, and HBase
• Spend lots of time looking at full-stack performance
• This talk is to help you develop faster jobs
  – If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com
©2011 Cloudera, Inc. All Rights Reserved.
Aspects of Performance
• Algorithmic performance
  – big-O, join strategies, data structures, asymptotes
• Physical performance
  – Hardware (disks, CPUs, etc.)
• Implementation performance
  – Efficiency of code, avoiding extra work
  – Make good use of the available physical performance
Performance fundamentals
• You can’t tune what you don’t understand
  – MR’s strength as a framework is its black-box nature
  – To get optimal performance, you have to understand the internals
• This presentation: understanding the black box
Performance fundamentals (2)
• You can’t improve what you can’t measure
  – Ganglia/Cacti/Cloudera Manager/etc. are a must
  – Top 4 metrics: CPU, memory, disk, network
  – MR job metrics: slot-seconds, CPU-seconds, task wall-clock times, and I/O
• Before you start: run jobs, gather data
Graphing bottlenecks
[Cluster utilization graphs, annotated:]
• This job might be CPU-bound in the map phase; most jobs are not CPU-bound
• Plenty of free RAM: perhaps we can make better use of it?
• Fairly flat-topped network graph: a bottleneck?
Performance tuning cycle
Run job → Identify bottleneck → Address bottleneck → (repeat)
  – Identify: graphs, job counters, job logs, profiler results
  – Address: tune configs, improve code, rethink algorithms
In order to understand these metrics and make changes, you need to understand MR internals.
MR from 10,000 feet
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
MR from 10,000 feet (focus: Sort/Spill)
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
Map-side sort/spill overview
• Goal: when complete, the map task outputs one sorted file
• What happens when you call OutputCollector.collect(K,V)?
  1. MapOutputBuffer: an in-memory buffer holds the serialized, unsorted key-value pairs
  2. The output buffer fills up: its contents are sorted, partitioned, and spilled to disk as an IFile
  3. The map task finishes: all IFiles are merged (the map-side merge) into a single IFile per task
Zooming further: MapOutputBuffer (Hadoop 1.0)
• kvoffsets: 4 bytes/record — one indirect-sort index per record
• kvindices: 12 bytes/record — (Partition, KOff, VOff) per record
  – Together, the metadata buffers occupy io.sort.record.percent * io.sort.mb
• kvbuffer: R bytes/record — the raw, serialized (Key, Value) pairs
  – Occupies (1 - io.sort.record.percent) * io.sort.mb
MapOutputBuffer spill behavior
• Memory is limited: must spill
  – If either the kvbuffer or the metadata buffers fill up, “spill” to disk
  – In fact, we spill (in another thread) before the buffer is full: configure io.sort.spill.percent
• Performance impact
  – If we spill more than once, we must re-read and re-write all data: 3x the I/O!
  – #1 goal for map task optimization: spill once!
Spill counters on map tasks
• Ratio of Spilled Records vs Map Output Records
  – If unequal, you are doing more than one spill
• FILE: Number of bytes read/written
  – Gives a sense of the I/O amplification due to spilling
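The check above can be sketched in a few lines. This is a standalone illustration, not Hadoop code: the counter values below are hypothetical, and in practice you would read them from the job’s counter page in the web UI.

```java
// Sketch: deciding whether a map task multi-spilled, from two job counters.
public class SpillCheck {
    // Spill factor: 1.0 means each record was spilled exactly once (the goal).
    static double spillFactor(long spilledRecords, long mapOutputRecords) {
        return (double) spilledRecords / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 1_342_177L;  // hypothetical counter value
        long spilledRecords   = 2_684_354L;  // hypothetical: exactly 2x
        double factor = spillFactor(spilledRecords, mapOutputRecords);
        System.out.println("spill factor = " + factor);
        if (factor > 1.0) {
            System.out.println("multi-spill: consider raising io.sort.mb "
                + "or fixing io.sort.record.percent");
        }
    }
}
```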
Spill logs on map tasks
2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
• “record full = true” indicates that the metadata buffers filled up before the data buffer
• 3 spills total! Maybe we can do better?
Tuning to reduce spills
• Parameters:
  – io.sort.mb: total buffer space
  – io.sort.record.percent: proportion between the metadata buffers and the key/value data
  – io.sort.spill.percent: threshold at which a spill is triggered
  – Total map output generated: can you use more compact serialization?
• Optimal settings depend on your data and available RAM!
Setting io.sort.record.percent
• Common mistake: the metadata buffers fill up way before the kvdata buffer
• Optimal setting:
  – io.sort.record.percent = 16 / (16 + R)
  – R = average record size: divide the “Map Output Bytes” counter by the “Map Output Records” counter
• The default (0.05) is usually too low (optimal only for ~300-byte records)
• Hadoop 2.0: this tuning is no longer necessary!
  – See MAPREDUCE-64 for the gory details
Tuning example (terasort)
• Map input size = output size
  – 128MB block = 1,342,177 records, each 100 bytes
  – Metadata: 16 * 1,342,177 = 20.9MB
• io.sort.mb
  – 128MB data + 20.9MB metadata = 148.9MB
• io.sort.record.percent
  – 16 / (16 + 100) = 0.138
• io.sort.spill.percent = 1.0
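The terasort arithmetic above generalizes: given the input split size and average record size, you can compute candidate settings directly. A minimal sketch (plain Java, not Hadoop code; the constant 16 is kvoffsets’ 4 bytes plus kvindices’ 12 bytes per record):

```java
// Compute candidate io.sort.mb and io.sort.record.percent so that
// a map task's entire output fits in one spill.
public class SortBufferSizing {
    static final int META_BYTES_PER_REC = 16; // kvoffsets (4) + kvindices (12)

    static double recordPercent(double avgRecordBytes) {
        return META_BYTES_PER_REC / (META_BYTES_PER_REC + avgRecordBytes);
    }

    // Total buffer (MB) needed to hold all data plus metadata in memory.
    static double sortMb(long splitBytes, double avgRecordBytes) {
        long records = (long) (splitBytes / avgRecordBytes);
        long metaBytes = records * META_BYTES_PER_REC;
        return (splitBytes + metaBytes) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long split = 128L * 1024 * 1024; // 128MB block
        double r = 100.0;                // terasort: 100-byte records
        System.out.printf("io.sort.mb ~= %.1f%n", sortMb(split, r));
        System.out.printf("io.sort.record.percent ~= %.3f%n", recordPercent(r));
    }
}
```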
More tips on spill tuning
• The biggest win is going from 2 spills to 1 spill
  – 3 spills is approximately the same speed as 2 spills (same I/O amplification)
• Calculate whether a single spill is even possible, given your heap size
  – io.sort.mb has to fit within your Java heap (plus whatever RAM your Mapper needs, plus ~30% for overhead)
• Only bother if this is the bottleneck!
  – Look at the map task logs: if the merge step at the end takes a fraction of a second, it’s not worth it!
  – Typically has the most impact on jobs with a big shuffle (sort/dedup)
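The heap-fit check above can be written down explicitly. A sketch under the slide’s own rule of thumb (~30% overhead); the heap and working-set numbers in main are hypothetical:

```java
// Sketch: can we afford a single spill? io.sort.mb must fit inside the
// task heap alongside the Mapper's own working set, plus ~30% overhead.
public class HeapFit {
    static boolean fitsInHeap(double sortMb, double mapperWorkingSetMb,
                              double heapMb) {
        return (sortMb + mapperWorkingSetMb) * 1.30 <= heapMb;
    }

    public static void main(String[] args) {
        // e.g. a 1024MB task heap, mapper itself needs ~100MB
        System.out.println(fitsInHeap(148.9, 100.0, 1024.0)); // fits
        System.out.println(fitsInHeap(600.0, 300.0, 1024.0)); // does not fit
    }
}
```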
MR from 10,000 feet (focus: Fetch)
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
Reducer fetch tuning
• Reducers fetch map output via HTTP
• Tuning parameters:
  – Server side: tasktracker.http.threads
  – Client side: mapred.reduce.parallel.copies
• Turns out this is not so interesting: follow the best practices from Hadoop: The Definitive Guide
Improving fetch bottlenecks
• Reduce the intermediate data
  – Implement a Combiner: less data transfers faster
  – Enable intermediate compression: Snappy is easy to enable; trades some CPU for less I/O and network traffic
• Double-check for network issues
  – Frame errors, NICs auto-negotiated down to 100Mbit, etc.: one or two slow hosts can bottleneck a whole job
  – Tell-tale sign: all maps are done, and reducers sit in the fetch stage for many minutes (look at the logs)
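The combine step is just pre-shuffle, per-key aggregation. A sketch of the core logic for a word-count-style job, with the Hadoop wiring stripped out: a real Combiner would extend the Reducer class and be registered with job.setCombinerClass(...); here the same idea is shown on plain Java maps.

```java
import java.util.Map;
import java.util.TreeMap;

// Collapse repeated (word, 1) pairs into (word, count) on the map side,
// so far less data crosses the network during the fetch phase.
public class CombineSketch {
    static Map<String, Long> combine(Iterable<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1L, Long::sum);
        }
        return counts;
    }
}
```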
MR from 10,000 feet (focus: Merge)
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
Reducer merge (Hadoop 1.0)
• Remote map outputs are fetched via HTTP
  – Fits in RAM? Yes: the RAMManager fetches it to RAM; No: fetch it directly to local disk
1. Data accumulated in RAM is merged to disk files (RAM-to-disk merges)
2. If too many disk files (IFiles) accumulate, they are re-merged (disk-to-disk merges)
3. Segments from RAM and disk are merged into a single iterator feeding the reducer code
Reducer merge triggers
• RAMManager
  – Total buffer size: mapred.job.shuffle.input.buffer.percent (default 0.70, a percentage of the reducer heap size)
• Mem-to-disk merge triggers:
  – The RAMManager is mapred.job.shuffle.merge.percent full (default 0.66)
  – Or mapred.inmem.merge.threshold segments have accumulated (default 1000)
• Disk-to-disk merge
  – io.sort.factor on-disk segments pile up (fairly rare)
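Putting the two percentages together shows where the mem-to-disk merge actually starts. A sketch for a hypothetical 1GB reducer heap, using the defaults quoted above:

```java
// Sketch: bytes of fetched map output at which the mem-to-disk
// merge kicks in, given the reducer heap and the two defaults above.
public class ShuffleBufferMath {
    static long triggerBytes(long heapBytes,
                             double inputBufferPercent,  // default 0.70
                             double mergePercent) {      // default 0.66
        long ramManager = (long) (heapBytes * inputBufferPercent);
        return (long) (ramManager * mergePercent);
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024 * 1024; // hypothetical -Xmx1g reducer
        long trigger = triggerBytes(heap, 0.70, 0.66);
        System.out.printf("merge to disk starts at ~%d MB of map output%n",
                          trigger / (1024 * 1024));
    }
}
```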
Final merge phase
• MR assumes that the reducer code needs the full heap’s worth of RAM
  – It spills all in-RAM segments before running user code, to free memory
• This isn’t true if your reducer is simple
  – e.g. sort, simple aggregation, etc. with no per-key state
• Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep the reducer input data in RAM
Reducer merge counters
• FILE: number of bytes read/written
  – Ideally close to 0 if you can fit in RAM
• Spilled Records:
  – Ideally close to 0. If significantly more than Reduce Input Records, the job is hitting a multi-pass merge, which is quite expensive
Tuning reducer merge
• Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep data in RAM, if your reducer keeps no state
• Experiment with setting mapred.inmem.merge.threshold to 0 to avoid spills
• Hadoop 2.0: experiment with mapreduce.reduce.merge.memtomem.enabled
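As they would appear in a job configuration file, the two Hadoop 1.0 settings above (values are the ones suggested on this slide; whether 0.70 is safe depends on your reducer keeping no significant state):

```xml
<!-- keep reduce input in RAM when the reducer is stateless -->
<property>
  <name>mapred.job.reduce.input.buffer.percent</name>
  <value>0.70</value>
</property>
<!-- experiment: disable the segment-count spill trigger -->
<property>
  <name>mapred.inmem.merge.threshold</name>
  <value>0</value>
</property>
```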
Rules of thumb for # maps/reduces
• Aim for map tasks running 1-3 minutes each
  – Too small: wasted startup overhead, less efficient shuffle
  – Too big: not enough parallelism, harder to share the cluster
• Reduce task count:
  – Large reduce phase: base it on the cluster slot count (a few GB per reducer)
  – Small reduce phase: fewer reducers result in a more efficient shuffle phase
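The “a few GB per reducer” rule for a large reduce phase can be made concrete. A sketch with hypothetical numbers (the 2GB target and 1TB shuffle size are illustrative, not prescribed values):

```java
// Sketch: pick a reducer count so each reducer handles roughly
// targetBytesPerReducer of shuffle data.
public class ReducerCount {
    static long reducersFor(long shuffleBytes, long targetBytesPerReducer) {
        return Math.max(1, (shuffleBytes + targetBytesPerReducer - 1)
                           / targetBytesPerReducer); // ceiling division
    }

    public static void main(String[] args) {
        long shuffle = 1L << 40;      // hypothetical: 1TB of map output
        long target  = 2L << 30;      // hypothetical: ~2GB per reducer
        System.out.println(reducersFor(shuffle, target) + " reducers");
    }
}
```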
MR from 10,000 feet (focus: user code)
InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
Tuning Java code for MR
• Follow general Java best practices
  – String parsing and formatting is slow
  – Guard debug statements with isDebugEnabled()
  – StringBuffer.append vs repeated string concatenation
• For CPU-intensive jobs, build a test harness/benchmark outside MR
  – Then use your favorite profiler
• Check for GC overhead: -XX:+PrintGCDetails -verbose:gc
• Easiest profiler: add -Xprof to mapred.child.java.opts
  – Then look at the stdout task log
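A micro-example of the string advice above. Each `+` concatenation inside a loop allocates and copies a brand-new String; a single builder (StringBuilder is the unsynchronized variant of the StringBuffer named on the slide) avoids that:

```java
// Build a tab-separated line once, instead of s = s + "\t" + field in a loop.
public class StringTips {
    static String joinTabs(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(fields[i]);
        }
        return sb.toString();
    }
}
```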
Other tips for fast MR code
• Use the most compact and efficient data formats
  – LongWritable is way faster than parsing text
  – BytesWritable instead of Text for SHA-1 hashes/dedup
  – Avro/Thrift/Protobuf for complex data, not JSON!
• Write a Combiner and a RawComparator
• Enable intermediate compression (Snappy/LZO)
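The point of a RawComparator is to order records during the sort without deserializing them. A sketch of the idea for 8-byte big-endian signed longs (the wire format LongWritable uses), in plain Java with no Hadoop dependency; a real implementation would implement Hadoop’s RawComparator interface and be registered on the job as the sort comparator:

```java
import java.nio.ByteBuffer;

// Compare two serialized big-endian signed longs byte-by-byte:
// the sign byte compares as signed, the remaining bytes as unsigned.
public class RawLongCompare {
    static int compare(byte[] a, int aOff, byte[] b, int bOff) {
        int cmp = Byte.compare(a[aOff], b[bOff]);
        for (int i = 1; cmp == 0 && i < 8; i++) {
            cmp = Integer.compare(a[aOff + i] & 0xff, b[bOff + i] & 0xff);
        }
        return cmp;
    }

    // Big-endian serialization, matching DataOutput.writeLong.
    static byte[] serialize(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }
}
```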
Summary
• Understanding MR internals helps you understand configurations and tuning
• Focus your tuning effort on actual bottlenecks, following a scientific approach
• Don’t forget that you can always just add nodes!
  – Spending 1 month of engineer time to make your job 20% faster is not worth it if you have a 10-node cluster!
• We’re working on simplifying this where we can, but deep understanding will always allow more efficient jobs