Shuffle sort 101
1
2
3
4
When the buffer fills past the io.sort.spill.percent threshold, a spill thread starts. The spill thread begins at the start of the buffer and starts spilling keys and values to disk. If the buffer fills up before the spill is complete, the mapper blocks until the spill finishes. The spill is complete when the buffer has been completely flushed. The mapper then continues to fill the buffer until another spill begins, and it loops like this until the mapper has emitted all of its K,V pairs.
A larger value for io.sort.mb means more K,V pairs fit in memory, so you experience fewer spills. Tuning io.sort.spill.percent can give the spill thread more time to finish before the buffer fills, so the mapper blocks less often.
5
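As a rough sketch of how these two map-side knobs are typically set per job. This assumes the classic MR1 property names io.sort.mb and io.sort.spill.percent (exact names vary by Hadoop version), and the values shown are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class SortBufferTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Give the map-side sort buffer 256 MB (the old default is 100 MB) so more
        // key/value pairs fit in memory and the mapper spills less often.
        conf.setInt("io.sort.mb", 256);
        // Start spilling at 80% full; the remaining headroom lets the mapper
        // keep writing while the spill thread drains the buffer to disk.
        conf.setFloat("io.sort.spill.percent", 0.80f);
        return conf;
    }
}
```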
Another threshold parameter is io.sort.record.percent. This fraction of the buffer is set aside for the accounting info that is required for each record. If the accounting area fills up, a spill begins even if the data area still has room. The amount of room required for accounting is a function of the number of records, not the record size. Therefore, a job that emits a large number of small records may need a larger accounting area to reduce spilling.
6
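A back-of-the-envelope way to see how the accounting area limits the record count. This assumes the commonly cited figure of 16 bytes of accounting data per record in pre-MAPREDUCE-64 Hadoop and the old default io.sort.record.percent of 0.05; treat both constants as illustrative:

```java
public class RecordAccountingEstimate {
    // Rough estimate of how many records the accounting area can track before
    // it forces a spill, independent of how big each record's data is.
    public static long maxRecordsBeforeSpill(int ioSortMb, float recordPercent) {
        long bufferBytes = (long) ioSortMb * 1024 * 1024;
        long accountingBytes = (long) (bufferBytes * recordPercent);
        final int BYTES_PER_RECORD = 16;   // assumed per-record accounting cost
        return accountingBytes / BYTES_PER_RECORD;
    }

    public static void main(String[] args) {
        // With io.sort.mb = 100 and io.sort.record.percent = 0.05,
        // roughly 100 MB * 0.05 / 16 B = 327,680 records fit before the
        // accounting area alone triggers a spill.
        System.out.println(maxRecordsBeforeSpill(100, 0.05f));
    }
}
```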
From MAPREDUCE-64.
The point here is that the buffer is actually a circular data structure with two parts: the key/value index and the data buffer. The key/value index is the “accounting info”. MAPREDUCE-64 essentially patches the framework so that io.sort.record.percent is auto-tuned instead of being set manually.
7
This is a diagram of a single spill. The result is a partitioned, possibly combined spill file sitting in one of the locations of mapred.local.dir on local disk.
This is a “hot path” in the code. Spills happen often, and there are insertion points for user/developer code: the partitioner, more importantly the combiner, and most importantly the key comparator and the value grouping comparator. If you don’t include a combiner, or you have an ineffective combiner, then you’re spilling more data through the entire cycle. If your comparators are inefficient, your whole sort process slows. A sketch of wiring these hooks into a job follows below.
8
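A minimal sketch of attaching these hooks with the mapreduce API. Reusing IntSumReducer as the combiner only works because summing is associative and commutative, and the sort comparator shown is already Text's default; it is set explicitly here only to mark where the hook lives:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SpillPathHooks {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "spill path hooks");           // MR1-era constructor
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setPartitionerClass(HashPartitioner.class);        // chooses the partition for each key
        job.setCombinerClass(IntSumReducer.class);              // shrinks data before every spill
        job.setSortComparatorClass(Text.Comparator.class);      // the key comparator on the hot path
        return job;
    }
}
```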
This illustrates how a tasktracker’s mapred.local.dir might look towards the end of a map task that is processing a large volume of data. Spill files are written to disk round-robin across the directories specified by mapred.local.dir. Each spill file is partitioned and sorted within the context of a single RAM-sized chunk of data.
Before those files can be served to the reducers, they have to be merged. But how do you merge files that are each already about as large as the buffer?
9
The good news is that it’s computationally very inexpensive to merge sorted sets to produce a final sorted set. However, it is very IO intensive.
This slide illustrates the spill/merge cycle required to merge the multiple spill files into a single output file ready to be served to the reducer. The example illustrates the relationship between io.sort.factor (2 here, for illustration) and the number of merges. The smaller io.sort.factor is, the more merge rounds and spills are required, the more disk IO you do, and the slower your job runs. The larger it is, the more memory is required, but the faster things go. A developer can tweak these settings per job, and it’s very important to do so, because they directly affect the IO characteristics (and thus performance) of your MapReduce job.
In real life, io.sort.factor defaults to 10, and this still leads to too many spills and merges when data really scales. You can increase io.sort.factor to 100 or more on large clusters or big data sets.
10
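A quick way to see why io.sort.factor matters: with S spill files and a merge factor F, the merge needs roughly ceil(log_F(S)) passes over the data. The real merge planner in Hadoop is slightly cleverer about sizing the first pass, so treat this small calculation as an approximation:

```java
public class MergePassEstimate {
    // Approximate number of merge passes needed to reduce `spills` files
    // down to one, merging at most `factor` files at a time.
    public static int mergePasses(int spills, int factor) {
        int passes = 0;
        while (spills > 1) {
            spills = (int) Math.ceil((double) spills / factor);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        System.out.println(mergePasses(50, 10));   // ~2 passes with the default factor of 10
        System.out.println(mergePasses(50, 100));  // 1 pass once the factor exceeds the spill count
    }
}
```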
In this crude illustration, we’ve increased io.sort.factor from 2 to 3. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills, the number of times the combiner is called, and saves one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!
11
Reducers obtain data from mappers via HTTP calls. Each HTTP connection has to be serviced by an HTTP thread. The number of HTTP threads running on a tasktracker dictates how many reducers can fetch from it in parallel. For illustration purposes here, we set the value to 1 and watch all the other reducers queue up. This slows things down.
12
Increasing the number of HTTP threads increases the amount of parallelism we can
achieve in the shuffle-sort phase, transferring data to the reducers.
13
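In MR1 the number of server threads on each tasktracker is controlled by tasktracker.http.threads. This is a cluster-level setting read at tasktracker startup rather than a per-job one; a sketch of raising it from its commonly documented default of 40:

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleServerTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // More HTTP threads per tasktracker means more reducers can pull map
        // output from it in parallel. Because the tasktracker reads this at
        // startup, it belongs in the cluster's mapred-site.xml, not a job config.
        conf.setInt("tasktracker.http.threads", 80);
        return conf;
    }
}
```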
14
Reducers obtain data from mappers via HTTP calls. Each HTTP connection has to be serviced by an HTTP thread. The number of HTTP threads running on a tasktracker dictates how many reducers can fetch from it in parallel. For illustration purposes here, we set the value to 1 and watch all the other reducers queue up. This slows things down.
15
The parallel copies configuration (mapred.reduce.parallel.copies) allows the reducer to retrieve map output from multiple mappers out in the cluster in parallel.
If the reducer experiences a connection failure fetching from a mapper, it retries, backing off exponentially in a loop until the value of mapred.reduce.copy.backoff is exceeded. Then we time out and that fetch is declared failed.
16
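A sketch of the two reduce-side fetch settings mentioned here, using the MR1 property names; the defaults noted in the comments are the commonly documented ones, and the values set are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceFetchTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Fetch map output from more mappers at once (the documented default is 5).
        conf.setInt("mapred.reduce.parallel.copies", 10);
        // Allow up to 300 seconds of exponential back-off against a flaky mapper
        // before the fetch is declared failed (300 is also the documented default).
        conf.setInt("mapred.reduce.copy.backoff", 300);
        return conf;
    }
}
```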
“That which is written must be read.”
In a process very similar to the one by which map output is spilled and merged to create the mapper’s final output file, the output from multiple mappers must be read, merged, and spilled to create the input for the reduce function. The final merged reduce input is not necessarily written to disk as a spill file; rather, it is fed to reduce() as its input.
This means that if you have a mistake or a misconfiguration that is slowing you down on the map side, the same configuration mistake slows you down again on the reduce side. When you don’t have combiners in the mix reducing the number of map outputs, the problem is compounded.
17
Suppose K is really a composite key that can be expanded into fields K1, K2, … Kn. For the mapper, we set the sort comparator to respect ALL parts of that key.
For the reducer, however, we set a “grouping comparator” which respects only a SUBSET of those fields. All keys that are equal by this subset are sent to the same call to reduce().
The result is that keys that are equal by the “grouping comparator” go to the same call to reduce() with their associated values, which have already been sorted by the more precise key. A minimal job setup is sketched below.
18
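A minimal sketch of wiring the two comparators into a job. It assumes the composite key is serialized as a tab-separated Text of the form "K1\tK2"; real jobs usually use a custom WritableComparable with raw-byte comparators for speed, so treat the classes below as illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortSetup {

    /** Orders by the FULL composite key, assumed to be a Text of the form "K1\tK2". */
    public static class FullKeyComparator extends WritableComparator {
        public FullKeyComparator() { super(Text.class, true); }
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return a.toString().compareTo(b.toString());
        }
    }

    /** Groups reduce() calls by K1 only; values arrive already sorted by the full key. */
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() { super(Text.class, true); }
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            String k1a = a.toString().split("\t", 2)[0];
            String k1b = b.toString().split("\t", 2)[0];
            return k1a.compareTo(k1b);
        }
    }

    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "secondary sort");
        job.setMapOutputKeyClass(Text.class);
        job.setSortComparatorClass(FullKeyComparator.class);                 // respects ALL key fields
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // respects only K1
        return job;
    }
}
```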
This slide illustrates the secondary sort process independently of the shuffle-sort. The sort comparator orders every key/value pair. The grouping comparator just determines equivalence in terms of which calls to reduce() get which data elements. The catch is that the grouping comparator has to respect the rules of the sort comparator: it can only be less restrictive. In other words, values whose keys appear equal to the grouping comparator will go to the same call to reduce(). The value grouping does not actually reorder any values.
19
In this crude illustration, we’ve increased io.sort.factor from 2 to 3. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills and saves one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!
20
The size of the reducer’s shuffle buffer is specified by mapred.job.shuffle.input.buffer.percent, as a percentage of the total heap allocated to the reduce task. When this buffer fills, map outputs spill to disk and have to be merged later. The spill begins when the mapred.job.shuffle.merge.percent threshold is reached; this is specified as a percentage of the input buffer size. You can increase this value to reduce the number of trips to disk in the reduce phase.
Another parameter to pay attention to is mapred.inmem.merge.threshold, which is expressed as a number of accumulated map outputs. When this count is reached, we spill to disk. If your mappers explode the data the way wordcount does, consider setting this value to zero.
21
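A sketch of the reduce-side knobs mentioned here, using the standard MR1 property names; the defaults noted in the comments are the usual documented ones, and the values set are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceBufferTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Fraction of the reduce task's heap used to buffer incoming map output
        // (documented default 0.70).
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        // Start the in-memory merge when the buffer is this full (default 0.66);
        // raising it means fewer, larger spills to disk.
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.80f);
        // Also merge after this many map outputs have accumulated (default 1000);
        // 0 disables the count-based trigger so only the size threshold applies.
        conf.setInt("mapred.inmem.merge.threshold", 0);
        return conf;
    }
}
```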
In addition to being a little funny, the point here is that while there are a lot of tunables to consider in Hadoop, you really only need to focus on a few at a time in order to get optimal performance out of any specific job.
Cluster administrators typically set default values for these tunables, but those defaults are really best guesses based on their understanding of Hadoop and of the jobs users will submit to the cluster. Any user can submit a job that cripples a cluster, so in the interests of themselves and the other users, it behooves developers to understand and override these configurations.
22
23
24
25
These numbers will grow with scale, but the ratios will remain roughly the same. Therefore, you should be able to tune your MapReduce job on small data sets before unleashing it on large data sets.
26
27
Start with a naïve implementation of wordcount with no combiner, and tune io.sort.mb and io.sort.factor down to very small values. Run with these settings on a very small data set. Then run again on a data set twice the size. Now tune io.sort.mb and/or io.sort.factor up. Also play with mapred.inmem.merge.threshold.
Now add a combiner.
Now tweak the wordcount mapper to keep a local in-memory hash of counts (sketched below). This consumes more memory in the mapper, but reduces the data going into combine() and also reduces the amount of data spilled.
On each run, note the counters. What works best for you?
28
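A sketch of the in-mapper combining variant described in the last step: the mapper holds a local HashMap of word counts and emits them in cleanup(), trading mapper memory for far fewer spilled records. Class and variable names here are illustrative:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningWordCount
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Accumulate counts locally instead of emitting (word, 1) for every token.
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one (word, localCount) pair per distinct word seen by this mapper,
        // so far less data hits the sort buffer, the combiner, and the spill files.
        Text outKey = new Text();
        IntWritable outValue = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            outKey.set(e.getKey());
            outValue.set(e.getValue());
            context.write(outKey, outValue);
        }
    }
}
```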
29
30
31
32
33
