Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Practices" by Basant Verma


Slide notes:
  • >90% of map tasks are data-local; 10X gain with the use of the Combiner.
  • IsolationRunner is intended to facilitate debugging by re-running a specific task, given the left-over task files for a (typically failed) past job. Currently it is limited to re-running map tasks. See mapreduce.task.files.preserve.failedtasks.

    1. Map-Reduce Programming & Best Practices
       Apache Hadoop India Summit 2011
       Basant Verma
       Yahoo! India R&D
       February 16, 2011
    2. Hadoop Components
       (diagram: Client 1, Client 2 → Processing Framework → DFS)
       • HDFS (Hadoop Distributed File System)
         • Modeled on GFS.
         • A reliable, high-bandwidth file system that can store terabytes and petabytes of data.
       • Map-Reduce
         • Based on the map/reduce metaphor from Lisp.
         • A distributed processing framework that processes the data stored on HDFS as key-value pairs.
    3. Word Count Dataflow (diagram)
    4. Word Count
       Word count in plain shell:

         $ cat ~/wikipedia.txt | sed -e 's/ /\n/g' | grep . | sort | uniq -c > ~/frequencies.txt
    5. MR for Word Count
       mapper (filename, file-contents):
           for each word in file-contents:
               emit (word, 1)

       reducer (word, values[]):
           sum = 0
           for each value in values:
               sum = sum + value
           emit (word, sum)
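The pseudocode above can be run end-to-end as a small in-memory simulation in plain Java. `WordCountSim` and its method names are illustrative only, not the Hadoop API; in a real job, the framework performs the grouping (shuffle) between the map and reduce phases across the cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the word-count map and reduce phases
// (illustrative sketch; a real job would use Hadoop's Mapper/Reducer).
public class WordCountSim {
    // map: emit (word, 1) for every word in the input
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : contents.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // shuffle + reduce: group by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {fox=2, jumps=1, lazy=1, over=1, quick=1, the=2}
        System.out.println(reduce(map("the quick fox jumps over the lazy fox")));
    }
}
```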
    6. MR Dataflow (diagram)

    7. MapReduce Pipeline (diagram)

    8. Pipeline Details (diagram)
    9. Available tweaks and optimizations
       • Input to maps
       • Map-only jobs
       • Combiner
       • Compression
       • Speculation
       • Fault tolerance
       • Buffer size
       • Parallelism (threads)
       • Partitioner
       • Reporter
       • DistributedCache
       • Task child environment settings
    10. Input to Maps
        • Maps should process a significant amount of data, to minimize the effect of per-task overhead.
        • Process multiple files per map for jobs with a very large number of small input files.
        • Process large chunks of data for large-scale processing.
        • Use as few maps as possible to process the data in parallel, without creating bad failure-recovery cases.
        • Unless the application's maps are heavily CPU-bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.
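To get a feel for the guidance above, the number of maps is roughly one per input split. The helper below is a hypothetical back-of-the-envelope calculation, not a Hadoop API; the real count also depends on file boundaries and the configured InputFormat.

```java
// Rough estimate of the number of map tasks a job will launch:
// one map per input split (hypothetical helper for sizing intuition).
public class MapCountEstimate {
    static long numMaps(long totalInputBytes, long splitSizeBytes) {
        // ceiling division: partial trailing splits still get a map
        return (totalInputBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneTb = 1024L * 1024 * 1024 * 1024;  // 1 TB of input
        long split = 128L * 1024 * 1024;          // 128 MB splits
        System.out.println(numMaps(oneTb, split) + " maps"); // 8192 maps
    }
}
```

At a 128 MB split size, even 1 TB of input needs only ~8K maps, which is why the 60,000-70,000 figure above is a generous ceiling.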
    11. Map-only Jobs
        • Run a map-only job once for generating data.
        • Run multiple jobs with different reduce implementations.
        • Map-only jobs write directly to HDFS.
    12. Combiner
        • Provides map-side aggregation of data, so that not every record emitted by the Mapper need be shipped to the reducers.
        • The reduce code can often be reused as the combiner. Example: word count!
        • Helps reduce network traffic for the shuffle, and results in lower disk-space usage.
        • However, it is important to ensure that the combiner really works, i.e. that it provides sufficient aggregation.
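A minimal sketch of the aggregation a combiner provides, assuming a word-count-style workload (plain Java, not the Hadoop Combiner API): duplicate keys within one map's output collapse to a single partial count before the shuffle.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of map-side aggregation: instead of shipping one
// (word, 1) record per occurrence, the combiner ships one record per
// distinct word with a partial count.
public class CombinerSketch {
    // Records the map task would emit without a combiner.
    static List<String> rawRecords(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Combined output: one record per distinct word.
    static Map<String, Integer> combine(List<String> records) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : records) partial.merge(w, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<String> raw = rawRecords("to be or not to be");
        Map<String, Integer> combined = combine(raw);
        // 6 raw records shrink to 4 shuffled records
        System.out.println(raw.size() + " -> " + combined.size());
    }
}
```

The "sufficient aggregation" caveat on the slide is visible here: the gain depends entirely on how many duplicate keys each map emits; with all-distinct keys, the combiner only adds overhead.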
    13. Compression
        • Map and reduce outputs can be compressed.
        • Compressing intermediate data helps reduce the amount of disk usage and network I/O.
        • Compression also helps reduce the total data size on the DFS.
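To see why compressing intermediate data pays off: word-count-style intermediate records are highly repetitive and compress extremely well. The sketch below uses the JDK's GZIP stream purely as a stand-in for whatever codec the cluster is configured with; it is not the Hadoop compression machinery.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Demonstrates the compression ratio achievable on repetitive
// key-value intermediate data (GZIP as an illustrative codec).
public class CompressionDemo {
    // n copies of a word-count-style intermediate record: "the<TAB>1\n"
    static byte[] repeatedRecords(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append("the\t1\n");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }

    public static void main(String[] args) {
        byte[] raw = repeatedRecords(10000);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```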
    14. Shuffle
        • Shuffle-phase performance depends on the crossbar between the map tasks and the reduce tasks, which must be minimized:
          • compress the intermediate output;
          • use a Combiner.
    15. Reduces
        • Configure an appropriate number of reduces:
          • too few hurt the nodes;
          • too many hurt the crossbar.
        • All reduces should complete in a single wave.
        • Each reduce should process at least 1-2 GB of data, and at most 5-10 GB, in most scenarios.
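A rough sizing helper for the guideline above. This is a hypothetical back-of-the-envelope calculation, not a Hadoop API; it aims at ~2 GB per reduce, the low end of the slide's range, and should be adjusted per cluster.

```java
// Back-of-the-envelope reduce-count sizing following the
// 1-2 GB (min) to 5-10 GB (max) per-reduce guideline.
public class ReduceSizing {
    static final long GB = 1024L * 1024 * 1024;

    // Aim for ~2 GB of shuffled data per reduce, rounded up.
    static long numReduces(long shuffleBytes) {
        long targetPerReduce = 2 * GB;
        return Math.max(1, (shuffleBytes + targetPerReduce - 1) / targetPerReduce);
    }

    public static void main(String[] args) {
        System.out.println(numReduces(100 * GB)); // 50 reduces for 100 GB
    }
}
```

The single-wave rule then caps this number: if the cluster only has, say, 40 reduce slots, 50 reduces would spill into a second wave and each reduce should be sized up instead.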
    16. Partitioner
        • Distribute data evenly across the reduces: an uneven distribution will hurt the whole job's runtime.
        • The default is the hash partitioner: hash(key) % num-reducers.
        • Why is a custom partitioner needed? Consider Sort vs. WordCount.
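The default scheme on the slide can be sketched as follows. The sign-bit mask is the one refinement over the literal `hash(key) % num-reducers` formula: it keeps negative `hashCode()` values from producing a negative partition index (Hadoop's own `HashPartitioner` does the same).

```java
// Default hash partitioning: same key always lands on the same reduce,
// and keys spread across reduces only as evenly as their hash codes do.
public class HashPartitionDemo {
    static int getPartition(Object key, int numReducers) {
        // mask off the sign bit so a negative hashCode still
        // yields a partition in [0, numReducers)
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reduce " + getPartition(key, 4));
        }
    }
}
```

This also shows why Sort needs a custom partitioner: hashing destroys key order, so a total-order output requires a range partitioner instead, while WordCount is perfectly happy with hashing.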
    17. Output
        • Output to a few large files, with each file spanning multiple HDFS blocks and appropriately compressed.
        • The number of output artifacts is linearly proportional to the number of configured reduces.
        • Compress outputs.
        • Use appropriate file formats for output; e.g. a compressed text file is not a great idea unless a splittable codec is used.
        • Consider using Hadoop ARchives (HAR) to reduce namespace usage.
        • Think of the consumers of your data-set.
    18. Speculation
        • Slow-running tasks can be speculated; slowness is determined by the expected time the task will take to complete.
        • Speculation kicks in only when there are no pending tasks.
        • The total number of tasks that can be speculated for a job is capped, to reduce wastage.
    19. Fault Tolerance
        • Data is stored as blocks on separate nodes; nodes are composed of cheap commodity hardware.
        • Tasks are independent of each other, so new tasks can be scheduled on new nodes.
        • The JobTracker tries a failed task 4 times (by default) before giving up.
        • A job can be configured to tolerate task failures up to N% of the total tasks.
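The retry behaviour described above can be modelled in a few lines. This is a simplified simulation for intuition, not JobTracker code; `failFirstN` stands in for a task that fails some number of times before succeeding.

```java
// Simplified model of per-task retries: an attempt is retried up to a
// maximum (4 by default in Hadoop) before the task is declared failed.
public class TaskRetrySketch {
    static final int MAX_ATTEMPTS = 4;

    // Returns the attempt number that succeeded, or -1 if all failed.
    static int runWithRetries(int failFirstN) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            boolean succeeded = attempt > failFirstN; // simulated task body
            if (succeeded) return attempt;
        }
        return -1; // permanently failed; counts toward the job's tolerated N%
    }

    public static void main(String[] args) {
        System.out.println(runWithRetries(2));  // succeeds on attempt 3
        System.out.println(runWithRetries(10)); // -1: gave up after 4 attempts
    }
}
```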
    20. Reporter
        • Used to report progress to the parent process.
        • Commonly used when tasks:
          • connect to a remote application such as a web service or database;
          • do some disk-intensive computation;
          • get blocked on some event.
        • One can also spawn a thread and make it report progress periodically.
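The "spawn a thread and make it report progress" pattern can be sketched with a scheduled executor. `runTask` and the heartbeat counter below are illustrative stand-ins for the task body and `Reporter.progress()`; the point is that the pings come from a separate thread while the main work blocks.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Background progress-reporting sketch: a side thread pings periodically
// while the main task body does blocking, otherwise-silent work.
public class ProgressReporterSketch {
    // Returns how many heartbeats were sent while the "work" ran.
    static int runTask(long workMillis, long reportEveryMillis) {
        AtomicInteger heartbeats = new AtomicInteger();
        ScheduledExecutorService reporter = Executors.newSingleThreadScheduledExecutor();
        // A real task would call Reporter.progress() here, well within
        // the framework's task-timeout interval.
        reporter.scheduleAtFixedRate(heartbeats::incrementAndGet,
                                     0, reportEveryMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(workMillis); // simulated blocking work (e.g. a DB call)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        reporter.shutdown();
        return heartbeats.get();
    }

    public static void main(String[] args) {
        System.out.println("heartbeats sent: " + runTask(300, 50));
    }
}
```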
    21. Distributed Cache
        • Efficient distribution of read-only files for applications.
        • Files are localized automatically once a task is scheduled on the slave node, and cleaned up once no task running on the slave needs them.
        • Designed for a small number of mid-size files.
        • Artifacts in the distributed cache should not require more I/O than the actual input to the application's tasks.
    22. A few tips for better performance
        • Increase the memory/buffer allocated to the tasks (io.sort.mb).
        • Increase the number of tasks that can be run in parallel.
        • Increase the number of threads that serve the map outputs.
        • Disable unnecessary logging.
        • Find the optimal value of the DFS block size.
        • Share the cluster between the DFS and MR for data locality.
        • Turn on speculation.
        • Run reducers in one wave, as extra waves can be really costly.
        • Make proper use of the DistributedCache.
    23. Anti-Patterns
        • Processing thousands of small files (sized less than 1 HDFS block, typically 128 MB), with one map processing a single small file.
        • Processing very large data-sets with a small HDFS block size (i.e. 128 MB), resulting in tens of thousands of maps.
        • Applications with a large number (thousands) of maps with a very small runtime (e.g. 5 s).
        • Straightforward aggregations without the use of a Combiner.
        • Applications with more than 60,000-70,000 maps.
        • Applications processing large data-sets with very few reduces (such as 1).
        • Applications using a single reduce for total order among the output records.
        • Pig scripts processing large data-sets without using the PARALLEL keyword.
    24. Anti-Patterns (cont.)
        • Applications processing data with a very large number of reduces, such that each reduce processes less than 1-2 GB of data.
        • Applications writing out multiple small output files from each reduce.
        • Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).
        • Applications using more than 25 counters per task.
        • Applications performing metadata operations (e.g. listStatus) on the file system from the map/reduce tasks.
        • Applications doing screen-scraping of the JobTracker web UI for the status of queues/jobs or, worse, the job history of completed jobs.
        • Workflows comprising hundreds of small jobs processing small amounts of data, with a very high job-submission rate.
    25. Debugging
        • Side-effect files: write to external files from the M/R code.
        • Web UI: shows stdout/stderr of the tasks.
        • IsolationRunner: re-run the task on the tracker where it failed; switch to the workspace of the task and run IsolationRunner.
        • Debug scripts: upload the script to the DFS, create a symlink, and pass the script in the conf file. One common use is to filter exceptions out of the logs/stderr/stdout.
        • LocalJobRunner: runs a MapReduce job on the local node; useful for faster debugging and proofs of concept.
    26. Task child environment settings
        The child task inherits the environment of the parent TaskTracker. The user can pass additional options to the child JVM via mapred.child.java.opts.

        An example showing multiple arguments and substitutions, which:
        • turns on JVM GC logging;
        • starts a passwordless JVM JMX agent so that it can connect with jconsole and get thread dumps;
        • sets the maximum heap size of the child JVM to 512 MB;
        • adds an additional path to the java.library.path of the child JVM.

        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc
                 -Xloggc:/tmp/@taskid@.gc
                 -Dcom.sun.management.jmxremote.authenticate=false
                 -Dcom.sun.management.jmxremote.ssl=false
          </value>
        </property>
    27. Checklist
        • Are your partitions uniform?
        • Can you combine records at the map side?
        • Are maps reading off a DFS block's worth of data?
        • Are you running a single reduce wave (unless the data size per reducer is too big)?
        • Have you tried compressing intermediate and final data?
        • Are your buffer sizes large enough to minimize spills, but small enough to stay clear of swapping?
        • Do you see unexplained "long tails"? (These can be mitigated via speculative execution.)
        • Are you keeping your cores busy (via slot configuration)?
        • Is at least one system resource being loaded?