Shuffle sort 101

5. When the buffer exceeds io.sort.spill.percent, a spill thread begins. The spill thread starts at the beginning of the buffer and spills keys and values to disk. If the buffer fills up before the spill is complete, the mapper blocks until the spill finishes. The spill is complete when the buffer has been completely flushed. The mapper then continues to fill the buffer until another spill begins, and it loops like this until the mapper has emitted all of its K,V pairs. A larger value for io.sort.mb means more K,V pairs can fit in memory, so you experience fewer spills. Changing io.sort.spill.percent can give the spill thread more time, so you experience fewer blocks.
6. Another threshold parameter is io.sort.record.percent. The buffer is divided by this fraction to leave room for the accounting info that is required for each record. If the accounting-info area fills up, a spill begins. The amount of room required by accounting info is a function of the number of records, not the record size. Therefore, a job with a large number of records might need more accounting room in order to reduce spills. A configuration sketch for these map-side buffer settings follows this note.
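A minimal configuration sketch, assuming the classic MRv1 property names discussed above; the values are illustrative starting points, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class MapSortBufferTuning {
    public static void apply(Configuration conf) {
        // In-memory sort buffer size in MB. Larger means fewer spills,
        // but the memory comes out of the map task's heap.
        conf.setInt("io.sort.mb", 256);

        // Fraction of the buffer that triggers the spill thread. Lowering it
        // gives the spill thread a head start so the mapper blocks less often.
        conf.setFloat("io.sort.spill.percent", 0.80f);

        // Fraction of io.sort.mb reserved for per-record accounting info.
        // Many small records need proportionally more accounting room.
        conf.setFloat("io.sort.record.percent", 0.05f);
    }
}
```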
7. From MAPREDUCE-64. The point here is that the buffer is actually a circular data structure with two parts: the key/value index and the buffer itself. The key/value index is the "accounting info". MAPREDUCE-64 basically patches the code so that io.sort.record.percent is auto-tuned instead of manually set.
8. This is a diagram of a single spill. The result is a partitioned, possibly combined spill file sitting in one of the locations of mapred.local.dir on local disk. This is a "hot path" in the code. Spills happen often, and there are insertion points for user/developer code: specifically the partitioner, but more importantly the combiner, and most importantly the key comparator and the value-grouping comparator. If you don't include a combiner, or you have an ineffective combiner, then you're spilling more data through the entire cycle. If your comparators are less than efficient, your whole sort process slows down. A sketch of where these hooks are registered on a job follows this note.
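A minimal sketch of registering those insertion points on a job. Stock Hadoop classes (HashPartitioner, IntSumReducer, Text.Comparator) are used only so the snippet stands alone; a real job would plug in its own implementations.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SpillPathHooks {
    public static void register(Job job) {
        // Decides which partition (reducer) each K,V pair lands in; consulted on every spill.
        job.setPartitionerClass(HashPartitioner.class);

        // Pre-aggregates map output during spills and merges, shrinking what is
        // written to disk and shuffled over the network.
        job.setCombinerClass(IntSumReducer.class);

        // Orders keys within each partition; called constantly on the spill path,
        // so it should be a cheap RawComparator over serialized bytes.
        job.setSortComparatorClass(Text.Comparator.class);

        // Decides which adjacent (already sorted) keys share one reduce() call.
        job.setGroupingComparatorClass(Text.Comparator.class);
    }
}
```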
9. This illustrates how a tasktracker's mapred.local.dir might look towards the end of a particular map task that is processing a large volume of data. Spill files are dumped to disk round-robin across the directories specified by mapred.local.dir. Each spill file is partitioned and sorted within the context of a single RAM-sized chunk of data. Before those files can be served to the reducers, they have to be merged. But how do you merge files that are already about as large as a buffer?
10. The good news is that it's computationally very inexpensive to merge sorted sets to produce a final sorted set. However, it is very IO-intensive. This slide illustrates the spill/merge cycle required to merge the multiple spill files into a single output file ready to be served to the reducer. This example illustrates the relationship between io.sort.factor (2, for illustration) and the number of merges. The smaller io.sort.factor is, the more merges and spills are required, the more disk IO you have, and the slower your job runs. The larger it is, the more memory is required, but the faster things go. A developer can tweak these settings per job, and it's very important to do so, because they directly affect the IO characteristics (and thus the performance) of your MapReduce job. In real life, io.sort.factor defaults to 10, and this still leads to too many spills and merges when data really scales. You can increase io.sort.factor to 100 or more on large clusters or big data sets. The back-of-the-envelope sketch after this note shows how the number of merge passes drops as io.sort.factor grows.
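A back-of-the-envelope model, not Hadoop's actual merge planner: if at most io.sort.factor spill files are merged at a time, the number of sequential merge rounds is roughly the ceiling of the log, base io.sort.factor, of the spill count, and every extra round rereads and rewrites the whole map output.

```java
public class MergePassEstimate {
    // Rough estimate of sequential merge rounds needed to collapse `spills`
    // sorted spill files when at most `ioSortFactor` streams merge at once.
    static int estimatePasses(int spills, int ioSortFactor) {
        int passes = 0;
        while (spills > 1) {
            spills = (int) Math.ceil((double) spills / ioSortFactor);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // 64 spill files: factor 2 -> 6 rounds, factor 10 -> 2, factor 100 -> 1.
        for (int factor : new int[] {2, 10, 100}) {
            System.out.println("io.sort.factor " + factor + ": "
                    + estimatePasses(64, factor) + " passes over the data");
        }
    }
}
```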
11. In this crude illustration, we've increased io.sort.factor from 2 to 3. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills, the number of times the combiner is called, and one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!
12. Reducers obtain data from mappers via HTTP calls. Each HTTP connection has to be serviced by an HTTP thread. The number of HTTP threads running on a tasktracker dictates the number of reducers it can serve in parallel. For illustration purposes here, we set the value to 1 and watch all the other reducers queue up. This slows things down.
13. Increasing the number of HTTP threads increases the amount of parallelism we can achieve in the shuffle-sort phase when transferring data to the reducers.
16. The parallel-copies configuration allows the reducer to retrieve map output from multiple mappers out in the cluster in parallel. If the reducer experiences a connection failure to a mapper, it tries again, backing off exponentially in a loop until the value of mapred.reduce.copy.backoff is exceeded. Then we time out and the fetch is declared failed. A sketch of these shuffle-transfer settings follows this note.
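A sketch of the shuffle-transfer knobs, again assuming MRv1 names and illustrative values. Note that tasktracker.http.threads is read by the tasktracker daemon at startup (normally set in mapred-site.xml), while the reducer-side settings can be overridden per job.

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTransferTuning {
    public static void apply(Configuration conf) {
        // Server side: HTTP threads per tasktracker serving map output.
        // A daemon-level setting, shown here only for completeness.
        conf.setInt("tasktracker.http.threads", 80);

        // Client side: how many map outputs one reducer fetches in parallel.
        conf.setInt("mapred.reduce.parallel.copies", 10);

        // How long (in seconds) a reducer keeps retrying a failing fetch, with
        // exponential backoff, before the fetch is declared failed.
        conf.setInt("mapred.reduce.copy.backoff", 300);
    }
}
```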
17. "That which is written must be read." In a process very similar to the one by which map output is spilled and merged to create a final output file for the mapper, the outputs from multiple mappers must be read, merged, and spilled to create the input for the reduce function. The final merged result is not written to disk in the form of a spill file, but is instead fed straight to reduce() as its input. This means that if you have a mistake or a misconfiguration that is slowing you down on the map side, the exact same mistake slows you down double on the reduce side. When you don't have combiners in the mix reducing the number of map outputs, this problem is compounded.
18. Suppose K is really a composite key that can be expanded into fields K1, K2, ... Kn. For the mapper, we set the SortComparator to respect ALL parts of that key. For the reducer, however, we supply a "grouping comparator" which respects only a SUBSET of those fields. All keys that are equal by this subset are sent to the same call to reduce(). The result is that keys that are equal by the grouping comparator go to the same call to reduce() with their associated values, which have already been sorted by the more precise key.
19. This slide illustrates the secondary-sort process independently of the shuffle-sort. The sort comparator orders every key/value pair. The grouping comparator just determines equivalence in terms of which calls to reduce() get which data elements. The catch here is that the grouping comparator has to respect the rules of the sort comparator: it can only be less restrictive. In other words, values that appear equal to the grouping comparator will go to the same call to reduce(). The value grouping does not actually reorder any values. A minimal secondary-sort sketch follows this note.
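A minimal secondary-sort sketch, assuming a hypothetical composite Text key of the form "K1#K2": the sort comparator orders on the full key, the grouping comparator looks only at the K1 part, and a partitioner that hashes only on K1 (not shown) is still needed so that all records for a given K1 reach the same reducer.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortSetup {

    /** Orders keys by the full composite key "K1#K2". */
    public static class FullKeyComparator extends WritableComparator {
        public FullKeyComparator() { super(Text.class, true); }
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return a.toString().compareTo(b.toString());
        }
    }

    /** Groups keys by the K1 field only; keys equal on K1 share one reduce() call. */
    public static class FirstFieldGroupingComparator extends WritableComparator {
        public FirstFieldGroupingComparator() { super(Text.class, true); }
        @SuppressWarnings("rawtypes")
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String k1a = a.toString().split("#", 2)[0];
            String k1b = b.toString().split("#", 2)[0];
            return k1a.compareTo(k1b);
        }
    }

    public static void wire(Job job) {
        job.setMapOutputKeyClass(Text.class);
        // Sort on the full composite key (K1, then K2).
        job.setSortComparatorClass(FullKeyComparator.class);
        // Group on K1 only, so each reduce() call sees all the K2 values for one
        // K1, already in sorted order.
        job.setGroupingComparatorClass(FirstFieldGroupingComparator.class);
    }
}
```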
20. In this crude illustration, we've increased io.sort.factor from 2 to 3. In this case, we cut the number of merges required to achieve the same result in half. This cuts down the number of spills and saves one full pass through the entire data set. As you can see, io.sort.factor is a very important parameter!
21. The size of the reducer's shuffle buffer is specified by mapred.job.shuffle.input.buffer.percent, as a percentage of the total heap allocated to the reduce task. When this buffer fills, the fetched map outputs spill to disk and have to be merged later. The merge begins when the mapred.job.shuffle.merge.percent threshold is reached; this is specified as a percentage of the input buffer size. You can increase this value to reduce the number of trips to disk in the reduce phase. Another parameter to pay attention to is mapred.inmem.merge.threshold, which is expressed as a number of accumulated map outputs. When this count is reached, we spill to disk. If your mappers explode the data the way wordcount does, consider setting this value to zero. A configuration sketch follows this note.
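A sketch of the reduce-side merge knobs under the MRv1 names used above; the values shown are illustrative (roughly the stock defaults), not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceSideMergeTuning {
    public static void apply(Configuration conf) {
        // Fraction of the reduce task's heap used to buffer fetched map output.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);

        // Fraction of that buffer which, once filled, triggers the in-memory
        // merge and spill to disk.
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);

        // Number of accumulated map outputs that also triggers a merge/spill.
        // Jobs whose mappers emit huge numbers of records may do better with 0,
        // which disables this count-based trigger.
        conf.setInt("mapred.inmem.merge.threshold", 1000);
    }
}
```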
22. In addition to being a little funny, the point here is that while there are a lot of tunables to consider in Hadoop, you really only need to focus on a few at a time in order to get optimum performance out of any specific job. Cluster administrators typically set default values for these tunables, but these are really best guesses based on their understanding of Hadoop and of the jobs users will be submitting to the cluster. Any user can submit a job that cripples a cluster, so in the interests of themselves and the other users, it behooves developers to understand and override these configurations.
26. These numbers will grow with scale, but the ratios will remain the same. Therefore, you should be able to tune your MapReduce job on small data sets before unleashing it on large data sets.
28. Start with a naïve implementation of wordcount with no combiner, and tune io.sort.mb and io.sort.factor down to very small values. Run with these settings on a very small data set. Then run again on a data set twice the size. Now tune io.sort.mb and/or io.sort.factor up. Also play with mapred.inmem.merge.threshold. Now add a combiner. Now tweak the wordcount mapper to keep a local in-memory hash of counts. This causes more memory consumption in the mapper, but reduces the data going into combine() and also reduces the amount of data spilled. On each run, note the counters. What works best for you? A sketch of the in-mapper hash variant follows this note.
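A sketch of the in-mapper hash variant described above; the class and field names are illustrative. Counts are aggregated in a per-task HashMap and emitted once in cleanup(), so far fewer records enter the sort buffer, the spills, and the shuffle, at the cost of mapper heap.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningWordCount
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Per-task word -> count hash; trades mapper memory for fewer emitted records.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each word once per task instead of once per occurrence.
        Text word = new Text();
        IntWritable count = new IntWritable();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            word.set(e.getKey());
            count.set(e.getValue());
            context.write(word, count);
        }
    }
}
```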
