MapReduce basics


Distributed processing issues, MR programming model
Sample MR job
How MR can be implemented
Pros and cons of MR, tips for better performance


  1. MapReduce basics
     Harisankar H, PhD student, DOS lab, Dept. of CSE, IIT Madras, 6-Feb-2013
  2. What is distributed processing?
     • Processing distributed across multiple machines/servers
  3. Why distributed processing?
     – Reduce execution time of large jobs
       • E.g., extracting URLs from terabytes of data
       • 1000 machines could finish the job up to 1000 times faster
     – Fault-tolerance
       • Other nodes take over the tasks if some of the nodes fail
       • Typically, with 10,000 servers, on average one fails per day
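The failure claim above is simple arithmetic. A sketch, assuming a per-server mean time between failures (MTBF) of about 10,000 days (roughly 27 years) — a figure chosen here only so the numbers match the slide, not a quoted statistic:

```java
// Back-of-envelope failure-rate arithmetic for a large cluster.
// The 10,000-day per-server MTBF is an illustrative assumption.
public class ClusterFailures {
    // Expected number of server failures per day across the whole cluster.
    static double expectedFailuresPerDay(int servers, double mtbfDays) {
        return servers / mtbfDays;
    }

    public static void main(String[] args) {
        // 10,000 servers, each failing on average once every 10,000 days:
        System.out.println(expectedFailuresPerDay(10_000, 10_000.0)); // 1.0
    }
}
```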
  4. Issues in distributed processing
     • Traditionally realized using special-purpose implementations
       – E.g., indexer, log processor
     • Implementation is really hard at the socket-programming level
       – Fault-tolerance: keep track of failures, reassign tasks
       – Hand-coded parallelization
       – Scheduling across heterogeneous nodes
       – Locality: minimize movement of data for computation
       – How to distribute the data?
     • Results in:
       – Complex, brittle, non-generic code
       – Reimplementation of common features like fault-tolerance and distribution
  5. Need for a generic abstraction for distributed processing
     App programmer | abstraction | systems developer
     • Separation of concerns: the app programmer expresses application logic;
       the systems developer handles performance, fault-tolerance, etc.
     • Tradeoff between genericity and performance
       – More generic usually means less performance
     • MapReduce is probably a sweet spot where you get both to some extent
  6. MapReduce abstraction (app programmer's view)
     • Model input and output as <key,value> pairs
     • Provide map() and reduce() functions which act on <k,v> pairs
     • Input: a set of <k,v> pairs
       – For each input pair: map(k1,v1) → list(k2,v2)
       – For each unique intermediate key: reduce(k2, combined list(v2)) → list(v3)
     The system takes care of distributing the tasks across thousands of
     machines, handling locality, fault-tolerance, etc.
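The contract above (map emits intermediate pairs, the system groups them by key, reduce folds each group) can be sketched as a single-machine simulation. The class and method names here are illustrative, not part of the Hadoop API:

```java
import java.util.*;
import java.util.function.BiFunction;

// In-memory sketch of the MapReduce contract: map -> group by key -> reduce.
public class SimpleMapReduce {
    public static <K1, V1, K2, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, V3> reduceFn) {
        // "Shuffle": collect every intermediate (k2, v2) into a list per key.
        Map<K2, List<V2>> grouped = new HashMap<>();
        for (Map.Entry<K1, V1> in : input.entrySet())
            for (Map.Entry<K2, V2> out : mapFn.apply(in.getKey(), in.getValue()))
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
        // Reduce: one call per unique intermediate key.
        Map<K2, V3> result = new HashMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet())
            result.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
        return result;
    }
}
```

In the real system the two loops run on thousands of machines and the grouping step is a network shuffle, but the data flow is exactly this.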
  7. Example: word count
     • Problem: count the number of occurrences of each unique word in a big
       collection of documents
     • Input <k,v> set: <document name, document contents>
       – Organize the files in this format
     • Output: <word, count>
       – Written to the output files
     • Next step: define the map() and reduce() functions
  8. Word count (pseudocode)

         map(String key, String value):
           // key: document name
           // value: document contents
           for each word w in value:
             EmitIntermediate(w, "1");

         reduce(String key, List values):
           // key: a word
           // values: a list of counts
           int result = 0;
           for each v in values:
             result += ParseInt(v);
           Emit(AsString(result));
  9. The same program in Java (Hadoop API)

         // word (a Text field) and one (an IntWritable holding 1) are
         // fields of the enclosing Mapper class.
         public void map(LongWritable key, Text value, Context context) throws … {
           String line = value.toString();
           StringTokenizer tokenizer = new StringTokenizer(line);
           while (tokenizer.hasMoreTokens()) {
             word.set(tokenizer.nextToken());
             context.write(word, one);
           }
         }

         public void reduce(Text key, Iterable<IntWritable> values,
                            Context context) throws … {
           int sum = 0;
           for (IntWritable val : values) {
             sum += val.get();
           }
           context.write(key, new IntWritable(sum));
         }
  10. Implementing the MapReduce abstraction
      App programmer | abstraction | systems developer
      • So far: the application programmer's view
      • We need a platform which implements the MapReduce abstraction
      • Hadoop is the popular open-source implementation of the MapReduce abstraction
      • Questions for the platform developer: how to
        – parallelize?
        – handle faults?
        – provide locality?
        – distribute the data?
  11. Basics of the platform implementation
      • Parallelize?
        – Each map() can be executed independently in parallel
        – After all maps have finished, all reduce() calls can run in parallel
      • Handle faults?
        – map() and reduce() have no internal state
          • Simply re-execute the task in case of a failure
      • Distribute the data?
        – Use a distributed file system (HDFS)
      • Provide locality?
        – Prefer to execute map() on the nodes holding the input <k,v> pairs
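Because map() and reduce() carry no internal state, the recovery story above is literally "run it again." A minimal sketch of that idea; the retry helper is hypothetical, not something from Hadoop:

```java
import java.util.concurrent.Callable;

// Stateless tasks make fault handling simple: re-execution is always safe,
// because a rerun produces the same output as an uninterrupted run.
public class RetryRunner {
    static <T> T runWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();   // a map or reduce task
            } catch (Exception e) {
                last = e;             // "worker failed": just reschedule it
            }
        }
        throw last;                   // give up after maxAttempts failures
    }
}
```

The real master does this at cluster scale, rescheduling the task on a different worker rather than the same thread, but the correctness argument is the same statelessness.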
  12. MapReduce implementation
      • Distributed file system (DFS) + MapReduce (MR) engine
        – Specifically, the MR engine uses a DFS
      • Distributed file system
        – Files are split into large chunks and stored in the DFS (e.g., HDFS)
        – Large chunks: typically 64 MB per block
        – Can have a master-slave architecture
          • The master assigns and manages replicated blocks on the slaves
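As a quick illustration of the chunking above: a file occupies ceil(size / blockSize) blocks, so a 200 MB file needs four 64 MB blocks, the last one only partially full. A sketch:

```java
// Number of DFS blocks needed for a file: ceiling division on byte counts.
public class BlockMath {
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        // (a + b - 1) / b is integer ceiling division for positive a, b.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(numBlocks(200 * mb, 64 * mb)); // 4
    }
}
```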
  13. MapReduce engine
      • Has a master-slave architecture
        – The master coordinates task execution across the workers
        – Workers execute the map() and reduce() functions
          • They read and write blocks to/from the DFS
        – The master keeps track of worker failures and reassigns tasks if necessary
          • Failure detection is usually done through timeouts
  14. [Diagram: MapReduce architecture; only the label "network" survived extraction]
  15. Some tips for designing MR jobs
      • Reduce network traffic between map and reduce
        – Model the map() and reduce() jobs appropriately
        – Use combine() functions
          • combine(<k, [v]>) → <k, [v]>
          • combine() executes after all map()s finish on each block
          • map() → [same node] combine() → [network] reduce()
      • Make map tasks of roughly equal expected execution times
      • Try to make reduce() tasks less skewed
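For word count, the combine() step above amounts to local pre-aggregation: summing the per-word 1s on the map node shrinks the network traffic from one pair per word occurrence to one pair per distinct word. Illustrative code, not the actual Hadoop Combiner API:

```java
import java.util.*;

// Local combine step for word count: merge map output on the same node
// before it is sent over the network to the reducers.
public class LocalCombiner {
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> local = new HashMap<>();
        for (Map.Entry<String, Integer> e : mapOutput)
            local.merge(e.getKey(), e.getValue(), Integer::sum); // sum the 1s locally
        return local;
    }
}
```

Combiners only apply when the reduce function is associative and commutative, as summation is here; in Hadoop the same reducer class is often reused as the combiner for exactly that reason.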
  16. Pros and cons of MapReduce
      • Advantages
        – Simple, easy-to-use distributed processing system
        – Reasonably generic
        – Exploits locality for performance
        – Simple and less buggy implementation
      • Issues
        – Not a magic bullet that fits all problems
          • Difficult to model iterative and recursive computations
            – E.g., k-means clustering, Generate-Map-Reduce
          • Difficult to model streaming computations
        – Centralized entities like the master become bottlenecks
        – Most real-world problems require large chains of MR jobs
  17. Summary
      • Today
        – Distributed processing issues, the MR programming model
        – A sample MR job
        – How MR can be implemented
        – Pros and cons of MR, tips for better performance
      • Tomorrow
        – Details specific to Hadoop
        – Downloading and setting up Hadoop on a cluster

      Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce:
      simplified data processing on large clusters. Commun. ACM 51, 1 (January
      2008), 107-113.
  18. Hadoop components
      • HDFS
        – Master: NameNode
        – Slave: DataNode
      • MapReduce engine
        – Master: JobTracker
        – Slave: TaskTracker