June 14, 2012

Optimizing MapReduce Job Performance
Todd Lipcon [@tlipcon]
Introductions

    • Software Engineer at Cloudera since 2009
    • Committer and PMC member on HDFS,
      MapReduce, and HBase
    • Spend lots of time looking at full stack
      performance

    • This talk is to help you develop faster jobs
      – If you want to hear about how we made Hadoop
        faster, see my Hadoop World 2011 talk on
        cloudera.com

2
Aspects of Performance

    • Algorithmic performance
      – big-O, join strategies, data
        structures, asymptotes
    • Physical performance
      – Hardware (disks, CPUs, etc)
    • Implementation performance
      – Efficiency of code, avoiding extra work
      – Make good use of available physical perf


3
Performance fundamentals

    • You can’t tune what you don’t
      understand
      – MR’s strength as a framework is its black-box
        nature
      – To get optimal performance, you have to
        understand the internals

    • This presentation: understanding the
      black box

4
Performance fundamentals (2)

    • You can’t improve what you can’t
      measure
      – Ganglia/Cacti/Cloudera Manager/etc. are a must
      – Top 4 metrics: CPU, Memory, Disk, Network
      – MR job metrics: slot-seconds, CPU-seconds,
        task wall-clocks, and I/O


    • Before you start: run jobs, gather data


5
Graphing bottlenecks
     [Figure: cluster monitoring graphs of CPU, memory, and network, with callouts:
      – Most jobs are not CPU-bound, but this job might be CPU-bound in the map phase
      – Plenty of free RAM – perhaps we can make better use of it?
      – Fairly flat-topped network graph – a bottleneck?]




6
Performance tuning cycle


     Run job  →  Identify bottleneck  →  Address bottleneck  →  (repeat)

       – Identify the bottleneck using: graphs, job counters, job logs, profiler results
       – Address the bottleneck by: tuning configs, improving code, rethinking algorithms

     In order to understand these metrics and make changes, you need to
     understand MR internals.




7
MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat




8
MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
     (next up: the map-side sort/spill)




9
Map-side sort/spill overview
  • Goal: when complete, map task outputs one sorted file
  • What happens when you call
    OutputCollector.collect()?

     1. As the map task calls .collect(K,V), the in-memory MapOutputBuffer
        accumulates serialized, unsorted key-value pairs.
     2. When the output buffer fills up, its contents are sorted, partitioned,
        and spilled to disk as an IFile.
     3. When the map task finishes, all spilled IFiles are merged into a single
        IFile per task (the map-side merge).
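     To make step 1 concrete, here is a minimal old-API mapper (a hypothetical
     example, not from the talk: it emits each input line with its byte length).
     Every out.collect() call lands in the MapOutputBuffer described above.

       import java.io.IOException;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.LongWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapred.MapReduceBase;
       import org.apache.hadoop.mapred.Mapper;
       import org.apache.hadoop.mapred.OutputCollector;
       import org.apache.hadoop.mapred.Reporter;

       public class LineLengthMapper extends MapReduceBase
           implements Mapper<LongWritable, Text, Text, IntWritable> {
         private final IntWritable len = new IntWritable();  // reused to avoid per-record allocation
         public void map(LongWritable offset, Text line,
                         OutputCollector<Text, IntWritable> out,
                         Reporter reporter) throws IOException {
           len.set(line.getLength());
           // Each collect() serializes the pair into the map-side sort buffer
           out.collect(line, len);
         }
       }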


10
Zooming further: MapOutputBuffer
   (Hadoop 1.0)




     Within io.sort.mb, the buffer space is split between per-record metadata
     and the raw record data:

       – kvoffsets: one indirect-sort index per record (4 bytes/rec)
       – kvindices: (Partition, KOff, VOff) per record (12 bytes/rec)
         Together, these metadata buffers occupy io.sort.record.percent * io.sort.mb
       – kvbuffer: the raw, serialized (Key, Val) pairs (R bytes/rec), occupying
         (1 - io.sort.record.percent) * io.sort.mb




  11
MapOutputBuffer spill behavior

 • Memory is limited: must spill
      – If either the kvbuffer or the metadata buffers
        fill up, the contents are “spilled” to disk
     – In fact, we spill before it’s full (in another
       thread): configure io.sort.spill.percent
 • Performance impact
      – If we spill more than once, we must re-read and
        re-write all data: 3x the I/O!
     – #1 goal for map task optimization: spill once!

12
Spill counters on map tasks

 • ratio of Spilled Records vs Map Output
   Records
     – if unequal, then you are doing more than one
       spill
 • FILE: Number of bytes read/written
     – get a sense of I/O amplification due to spilling
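  To check these counters programmatically after a run, a sketch using the old
  “mapred” API (the counter group string is the internal Hadoop 1.x name and may
  differ in other versions):

     import org.apache.hadoop.mapred.Counters;
     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.RunningJob;

     public class SpillCheck {
       public static void main(String[] args) throws Exception {
         JobConf conf = new JobConf(SpillCheck.class);
         // ... configure input/output paths, mapper, reducer as usual ...
         RunningJob rj = JobClient.runJob(conf);   // blocks until the job completes
         Counters c = rj.getCounters();
         long spilled = c.findCounter("org.apache.hadoop.mapred.Task$Counter",
                                      "SPILLED_RECORDS").getValue();
         long mapOut  = c.findCounter("org.apache.hadoop.mapred.Task$Counter",
                                      "MAP_OUTPUT_RECORDS").getValue();
         // Roughly equal => one spill per record; much larger => multi-pass spilling
         System.out.println("spill ratio = " + ((double) spilled / mapOut));
       }
     }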




13
Spill logs on map tasks
 The log below shows a map task that spilled three times (annotations inline):

  2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
      (“record full = true” indicates that the metadata buffers filled up before the data buffer)
  2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
  2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
  2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
  2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
  2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
  2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
  2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
  2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
  2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
      (3 spills total! maybe we can do better?)


14
Tuning to reduce spills

 • Parameters:
     – io.sort.mb: total buffer space
     – io.sort.record.percent: proportion between
       metadata buffers and key/value data
     – io.sort.spill.percent: threshold at which
       spill is triggered
     – Total map output generated: can you use
       more compact serialization?
 • Optimal settings depend on your data and
   available RAM!
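  These knobs are set in the driver; a sketch with purely illustrative values
  (MyJob is a placeholder driver class, and the right numbers depend on your
  record size and heap, as worked through on the following slides):

     import org.apache.hadoop.mapred.JobConf;

     JobConf conf = new JobConf(MyJob.class);          // hypothetical driver class
     conf.setInt("io.sort.mb", 256);                   // total map-side sort buffer, in MB
     conf.setFloat("io.sort.record.percent", 0.15f);   // fraction reserved for per-record metadata
     conf.setFloat("io.sort.spill.percent", 0.95f);    // fill level that triggers a background spill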

15
Setting io.sort.record.percent

 • Common mistake: metadata buffers fill up
   way before kvdata buffer
 • Optimal setting:
     – io.sort.record.percent = 16/(16 + R)
     – R = average record size: divide “Map Output
       Bytes” counter by “Map Output Records” counter
  • Default (0.05) is usually too low (optimal only for
    ~300-byte records)
 • Hadoop 2.0: this is no longer necessary!
     – see MAPREDUCE-64 for gory details
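  Applied mechanically, the formula looks like this in driver code (a sketch; the
  two counter values are hypothetical and should come from your previous run, and
  conf is the job's JobConf):

     // From the previous run's counters:
     long mapOutputBytes   = 55000000000L;   // "Map Output Bytes" (hypothetical)
     long mapOutputRecords =   500000000L;   // "Map Output Records" (hypothetical)
     double r = (double) mapOutputBytes / mapOutputRecords;   // average record size R
     float recordPercent = (float) (16.0 / (16.0 + r));       // 16 bytes of metadata per record
     conf.setFloat("io.sort.record.percent", recordPercent);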

16
Tuning Example (terasort)

 • Map input size = output size
     – 128MB block = 1,342,177 records, each 100
       bytes
     – metadata: 16 * 1342177 = 20.9MB
 • io.sort.mb
    – 128MB data + 20.9MB meta = 148.9MB
 • io.sort.record.percent
    – 16/(16+100)=0.138
 • io.sort.spill.percent = 1.0

17
More tips on spill tuning
 • Biggest win is going from 2 spills to 1 spill
      – 3 spills is approximately the same speed as 2 spills
        (same IO amplification)
 • Calculate if it’s even possible, given your heap
   size
     – io.sort.mb has to fit within your Java heap (plus
       whatever RAM your Mapper needs, plus ~30% for
       overhead)
 • Only bother if this is the bottleneck!
     – Look at map task logs: if the merge step at the end is
       taking a fraction of a second, not worth it!
     – Typically most impact on jobs with big shuffle
       (sort/dedup)


18
MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
     (next up: the reducer fetch)




19
Reducer fetch tuning

 • Reducers fetch map output via HTTP
 • Tuning parameters:
     – Server side: tasktracker.http.threads
     – Client side:
      mapred.reduce.parallel.copies
 • Turns out this is not so interesting
      – follow the best practices from Hadoop: The
        Definitive Guide (see the sketch below)
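  For reference, a sketch of where the two knobs live (values illustrative;
  conf is the job's JobConf):

     // Client side, per job: parallel fetch threads in each reducer
     conf.setInt("mapred.reduce.parallel.copies", 10);
     // Server side: tasktracker.http.threads is a cluster setting in mapred-site.xml
     // on each TaskTracker (not a per-job knob) and takes effect after a restart.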


20
Improving fetch bottlenecks

 • Reduce intermediate data
      – Implement a Combiner: less intermediate data transfers faster
     – Enable intermediate compression: Snappy is
       easy to enable; trades off some CPU for less
       IO/network
 • Double-check for network issues
     – Frame errors, NICs auto-negotiated to 100mbit,
       etc: one or two slow hosts can bottleneck a job
     – Tell-tale sign: all maps are done, and reducers sit
       in fetch stage for many minutes (look at logs)
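  Both fixes are one-liners in the driver (old-API sketch; MyReducer is a
  placeholder and is only safe as a Combiner if its operation is commutative
  and associative):

     import org.apache.hadoop.io.compress.SnappyCodec;

     conf.setCombinerClass(MyReducer.class);               // run a partial reduce on the map side
     conf.setCompressMapOutput(true);                      // compress intermediate (map output) data
     conf.setMapOutputCompressorClass(SnappyCodec.class);  // cheap CPU, decent ratio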


21
MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
     (next up: the reducer merge)




22
Reducer merge (Hadoop 1.0)

     Remote map outputs are fetched via HTTP. For each fetched segment, the
     reducer asks: does it fit in RAM (the RAMManager)? If yes, it is fetched to
     RAM; if no, it is fetched straight to disk.

     1. Data accumulated in RAM is merged to disk files (RAM-to-disk merges),
        producing IFiles on local disk.
     2. If too many disk files accumulate, they are re-merged (disk-to-disk merges).
     3. Finally, segments from RAM and disk are merged into a single iterator
        that feeds the reduce task's code.

 23
Reducer merge triggers
 • RAMManager
     – Total buffer size:
       mapred.job.shuffle.input.buffer.percent
       (default 0.70, percentage of reducer heapsize)
 • Mem-to-disk merge triggers:
     – RAMManager is
       mapred.job.shuffle.merge.percent % full
       (default 0.66)
     – Or mapred.inmem.merge.threshold segments
       accumulated (default 1000)
 • Disk-to-disk merge
     – io.sort.factor on-disk segments pile up (fairly rare)
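  In driver code the same knobs look like this; the values shown simply restate
  the defaults for orientation (conf is the job's JobConf):

     conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // share of reducer heap for fetched map output
     conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);        // RAM fill level that starts a mem-to-disk merge
     conf.setInt("mapred.inmem.merge.threshold", 1000);               // or merge after this many in-RAM segments
     conf.setInt("io.sort.factor", 10);                               // how many on-disk segments merge at once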



24
Final merge phase

 • MR assumes that reducer code needs the
   full heap worth of RAM
     – Spills all in-RAM segments before running
       user code to free memory
 • This isn’t true if your reducer is simple
      – e.g. sort, simple aggregation, etc. with no state
 • Configure
     mapred.job.reduce.input.buffer.percent to
     0.70 to keep reducer input data in RAM


25
Reducer merge counters

 • FILE: number of bytes read/written
     – Ideally close to 0 if you can fit in RAM
 • Spilled records:
      – Ideally close to 0. If significantly more than
        reduce input records, the job is hitting a
        multi-pass merge, which is quite expensive




26
Tuning reducer merge

 • Configure
     mapred.job.reduce.input.buffer.percent
   to 0.70 to keep data in RAM if you don’t
   have any state in reducer
 • Experiment with setting
   mapred.inmem.merge.threshold to 0 to
   avoid spills
 • Hadoop 2.0: experiment with
     mapreduce.reduce.merge.memtomem.enabled
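  As driver code, the two experiments above are (a sketch; only set the first if
  the reducer truly keeps no state across records):

     conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.70f); // keep reduce input segments in RAM
     conf.setInt("mapred.inmem.merge.threshold", 0);                 // disable the segment-count spill trigger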


27
Rules of thumb for # maps/reduces

 • Aim for map tasks running 1-3 minutes each
     – Too small: wasted startup overhead, less efficient
       shuffle
     – Too big: not enough parallelism, harder to share
       cluster
 • Reduce task count:
     – Large reduce phase: base on cluster slot count (a
       few GB per reducer)
     – Small reduce phase: fewer reducers will result in
       more efficient shuffle phase
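  A hypothetical sizing helper for the large-shuffle case, just to make the rule
  of thumb concrete (every number here is made up for illustration; conf is the
  job's JobConf):

     // In the driver:
     long shuffleBytes = 600L * 1024 * 1024 * 1024;    // estimated total map output: ~600 GB
     long bytesPerReducer = 4L * 1024 * 1024 * 1024;   // aim for a few GB per reducer
     int clusterReduceSlots = 200;                     // from your cluster's capacity
     int reducers = (int) Math.min(clusterReduceSlots,
         (shuffleBytes + bytesPerReducer - 1) / bytesPerReducer);
     conf.setNumReduceTasks(reducers);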


28
MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
     (next up: the map/reduce task code and data formats)




29
Tuning Java code for MR
 • Follow general Java best practices
     – String parsing and formatting is slow
     – Guard debug statements with isDebugEnabled()
     – StringBuffer.append vs repeated string concatenation
 • For CPU-intensive jobs, make a test
   harness/benchmark outside MR
     – Then use your favorite profiler
  • Check for GC overhead: -XX:+PrintGCDetails
    -verbose:gc
  • Easiest profiler: add -Xprof to
    mapred.child.java.opts, then look at the
    stdout task log
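  A small sketch of the first two tips inside a mapper, plus the profiling flags
  (commons-logging is what Hadoop itself uses; the field names are illustrative):

     import org.apache.commons.logging.Log;
     import org.apache.commons.logging.LogFactory;

     private static final Log LOG = LogFactory.getLog(MyMapper.class);

     // Inside map(): guard debug output so the string concatenation is skipped
     // entirely in production runs
     if (LOG.isDebugEnabled()) {
       LOG.debug("key=" + key + " value=" + value);
     }

     // Build output strings with one StringBuilder instead of repeated + concatenation
     StringBuilder sb = new StringBuilder();
     sb.append(userId).append('\t').append(clickCount);
     outValue.set(sb.toString());

  And the profiling/GC flags go into the task JVM options, e.g.:

     conf.set("mapred.child.java.opts",
              "-Xmx1g -Xprof -XX:+PrintGCDetails -verbose:gc");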

30
Other tips for fast MR code

 • Use the most compact and efficient data
   formats
     – LongWritable is way faster than parsing text
     – BytesWritable instead of Text for SHA1
       hashes/dedup
     – Avro/Thrift/Protobuf for complex data, not JSON!
 • Write a Combiner and RawComparator
 • Enable intermediate compression
   (Snappy/LZO)
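  A sketch of a RawComparator for BytesWritable keys holding a fixed-size hash,
  comparing the raw bytes without deserializing them (the key layout is an
  assumption for illustration, not something from the talk):

     import org.apache.hadoop.io.BytesWritable;
     import org.apache.hadoop.io.WritableComparator;

     public class HashKeyComparator extends WritableComparator {
       public HashKeyComparator() {
         super(BytesWritable.class);
       }
       @Override
       public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
         // BytesWritable serializes a 4-byte length prefix before the payload;
         // skip it and compare the raw hash bytes lexicographically.
         return compareBytes(b1, s1 + 4, l1 - 4, b2, s2 + 4, l2 - 4);
       }
     }

     // Register it in the driver (old API):
     conf.setOutputKeyComparatorClass(HashKeyComparator.class);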

31
Summary

 • Understanding MR internals helps understand
   configurations and tuning
 • Focus your tuning effort on things that are
   bottlenecks, following a scientific approach
 • Don’t forget that you can always just add nodes!
     – Spending 1 month of engineer time to make your job
       20% faster is not worth it if you have a 10 node
       cluster!
 • We’re working on simplifying this where we can,
   but deep understanding will always allow more
   efficient jobs


32
Questions?

    @tlipcon
todd@cloudera.com
